# C6-Heavy (257-768B) Visibility Analysis - Phase C6-H

**Date**: 2025-12-10
**Benchmark**: `./bench_mid_large_mt_hakmem 1 1000000 400 1` (1 thread, ws=400, iters=1M)
**Size Range**: 257-768B (Class 6: 512B allocations)
**Configuration**: C6_HEAVY_LEGACY_POOLV1 profile (C7_SAFE + C6_HOT=1)

---

## Executive Summary

### Performance Gap Analysis

- **HAKMEM**: 9.84M ops/s (baseline)
- **mimalloc**: 51.3M ops/s
- **Performance Gap**: **5.2x** (mimalloc is 421% faster)

This represents a **critical performance deficit** in the C6-heavy allocation path, where HAKMEM achieves only **19% of mimalloc's throughput**.

### Key Findings

1. **C6 does NOT use the Pool flatten path** - With `HAKMEM_TINY_C6_HOT=1`, allocations route through TinyHeap v1, bypassing pool flatten entirely
2. **Address lookup dominates CPU time** - `hak_super_lookup` (9.3%) + `mid_desc_lookup` (8.2%) + `classify_ptr` (5.8%) = **23.3% of cycles**
3. **Pool operations are expensive** - Despite not using flatten, pool alloc/free combined still consume ~15-20% of cycles
4. **Mid_desc cache provides modest gains** - +6.4% improvement (9.8M → 10.4M ops/s)

---

## Phase C6-H1: Baseline Metrics

### Test Configuration

```bash
export HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1
export HAKMEM_BENCH_MIN_SIZE=257
export HAKMEM_BENCH_MAX_SIZE=768
```

### Baseline Results

| Configuration | Throughput (ops/s) | vs mimalloc | Notes |
|---------------|-------------------|-------------|-------|
| **Baseline (C6_HOT=1, mid_desc_cache=1)** | 9,836,420 | 19.2% | Default profile |
| **C6_HOT=1, mid_desc_cache=0** | 9,805,954 | 19.1% | Without cache |
| **C6_HOT=1, mid_desc_cache=1** | 10,435,480 | 20.3% | With cache (+6.4% vs cache off) |
| **C6_HOT=0 (pure legacy pool)** | 9,938,473 | 19.4% | Pool path ~same as TinyHeap |
| **mimalloc baseline** | 51,297,877 | 100.0% | Reference |

### Key Observations

1. **Mid_desc cache effect**: +6.4% improvement, but far from closing the gap
2. **C6_HOT vs pool path**: Nearly identical performance (~9.8M-9.9M ops/s), suggesting the bottleneck is in common infrastructure (address lookup, classification)
3. **Size class routing**: 257-768B → Class 6 (512B) as expected

---

## Phase C6-H2: Pool Flatten and Cache Analysis

### Pool Flatten Test (ATTEMPTED)

**Finding**: The Pool v1 flatten path is **NOT USED** for C6 allocations with `HAKMEM_TINY_C6_HOT=1`.

```bash
# Test with flatten enabled
export HAKMEM_POOL_V1_FLATTEN_ENABLED=1
export HAKMEM_POOL_V1_FLATTEN_STATS=1
# Result: [POOL_V1_FLAT] alloc_tls_hit=0 alloc_fb=0 free_tls_hit=0 free_fb=0
```

**Root Cause** (a routing sketch follows this list):
- With `HAKMEM_TINY_C6_HOT=1`, class 6 routes to `TINY_ROUTE_HEAP` (TinyHeap v1)
- TinyHeap v1 uses its own allocation path via `tiny_heap_box.h`, not the pool flatten path
- Pool flatten optimizations (Phase 80-82) only apply to the **legacy pool path** (when C6_HOT=0)
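
The routing behavior can be pictured as below. This is an illustrative sketch, not the actual dispatch code: `TINY_ROUTE_HEAP` is named above, while `TINY_ROUTE_POOL`, `tiny_route_for_class`, and the flag plumbing are assumed names.

```c
/* Hypothetical sketch of the C6 routing described above.
 * TINY_ROUTE_HEAP appears in this report; everything else is assumed. */
typedef enum { TINY_ROUTE_POOL, TINY_ROUTE_HEAP } tiny_route_t;

static tiny_route_t tiny_route_for_class(int class_idx, int c6_hot) {
    if (class_idx == 6 && c6_hot) {
        /* HAKMEM_TINY_C6_HOT=1: TinyHeap v1 owns class 6, so the
         * pool flatten fast path never sees these allocations. */
        return TINY_ROUTE_HEAP;
    }
    /* C6_HOT=0: legacy pool path, where flatten (Phase 80-82) applies. */
    return TINY_ROUTE_POOL;
}
```

This also explains the all-zero `[POOL_V1_FLAT]` counters: the flatten instrumentation sits on the pool branch, which class 6 never takes under this profile.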

### Mid_Desc Cache Analysis

| Metric | Without Cache | With Cache | Delta |
|--------|--------------|------------|-------|
| Throughput | 9.81M ops/s | 10.44M ops/s | +6.4% |
| Expected self% reduction | mid_desc_lookup: 8.2% | ~6-7% (estimated) | ~1-2% |

**Conclusion**: The mid_desc cache provides a measurable but insufficient improvement. The 8.2% of CPU time in `mid_desc_lookup` is reduced, but the other lookup costs (`hak_super_lookup`, `classify_ptr`) remain.

---

## Phase C6-H3: CPU Hotspot Analysis

### Perf Stat Results

```
Benchmark: 9,911,926 ops/s (0.101s runtime)
Cycles: 398,766,361 cycles:u
Instructions: 1,054,643,524 instructions:u
IPC: 2.64
Page Faults: 7,131
Task Clock: 119.08 ms
```

**Analysis**:
- **IPC 2.64**: Reasonable instruction-level parallelism; the cost is instruction volume, not pipeline stalls
- **Cycles per operation**: 398,766,361 / 1,000,000 ≈ **399 cycles/op**
- **Instructions per operation**: 1,054,643,524 / 1,000,000 ≈ **1,055 instructions/op**

**Comparison estimate** (mimalloc at 51.3M ops/s):
- Estimated cycles/op for mimalloc: ~76 cycles/op (5.2x fewer)
- HAKMEM uses **5.2x more cycles** per allocation/free pair

### Perf Record Hotspots (Top 20 Functions)

| Function | Self % | Category | Description |
|----------|--------|----------|-------------|
| `hak_super_lookup` | 9.32% | Address Lookup | Superslab registry lookup (largest single cost) |
| `mid_desc_lookup` | 8.23% | Address Lookup | Mid-size descriptor lookup |
| `hak_pool_get_class_index` | 5.87% | Classification | Size→class mapping |
| `classify_ptr` | 5.76% | Classification | Pointer classification for free |
| `hak_pool_free_v1_impl` | 5.52% | Pool Free | Pool free implementation |
| `hak_pool_try_alloc_v1_impl` | 5.46% | Pool Alloc | Pool allocation implementation |
| `free` | 4.54% | Front Gate | glibc free wrapper |
| `worker_run` | 4.47% | Benchmark | Benchmark driver |
| `ss_map_lookup` | 4.35% | Address Lookup | Superslab map lookup |
| `super_reg_effective_mask` | 4.32% | Address Lookup | Registry mask computation |
| `mid_desc_hash` | 3.69% | Address Lookup | Hash computation for mid_desc |
| `mid_set_header` | 3.27% | Metadata | Header initialization |
| `mid_page_inuse_dec_and_maybe_dn` | 3.17% | Metadata | Page occupancy tracking |
| `mid_desc_init_once` | 2.71% | Initialization | Descriptor initialization |
| `malloc` | 2.60% | Front Gate | glibc malloc wrapper |
| `hak_free_at` | 2.53% | Front Gate | Internal free dispatcher |
| `hak_pool_mid_lookup_v1_impl` | 2.17% | Pool Lookup | Pool-specific descriptor lookup |
| `super_reg_effective_size` | 1.87% | Address Lookup | Registry size computation |
| `hak_pool_free_fast_v1_impl` | 1.77% | Pool Free | Fast path for pool free |
| `hak_pool_init` | 1.44% | Initialization | Pool initialization |

### Hotspot Category Breakdown

| Category | Combined Self % | Functions |
|----------|----------------|-----------|
| **Address Lookup & Classification** | **41.5%** | hak_super_lookup, mid_desc_lookup, classify_ptr, hak_pool_get_class_index, ss_map_lookup, super_reg_effective_mask, mid_desc_hash |
| **Pool Operations** | **14.9%** | hak_pool_try_alloc_v1_impl, hak_pool_free_v1_impl, hak_pool_free_fast_v1_impl, hak_pool_mid_lookup_v1_impl |
| **Metadata Management** | **9.2%** | mid_set_header, mid_page_inuse_dec_and_maybe_dn, mid_desc_init_once |
| **Front Gate** | **9.7%** | malloc, free, hak_free_at |
| **Benchmark Driver** | **4.5%** | worker_run |
| **Other** | **20.3%** | super_reg_effective_size, various helpers, initialization, etc. |

---

## Root Cause Analysis

### 1. Address Lookup Dominates (41.5% of CPU)

The single largest performance killer is the **address→metadata lookup infrastructure**:

- **hak_super_lookup** (9.3%): Superslab registry lookup to find which allocator owns a pointer
- **mid_desc_lookup** (8.2%): Hash-based descriptor lookup for mid-size allocations
- **ss_map_lookup** (4.3%): Secondary map lookup within a superslab
- **classify_ptr** (5.8%): Pointer classification during free
- **hak_pool_get_class_index** (5.9%): Size→class index computation

**Why this matters**: Every allocation AND free requires multiple lookups (sketched in code after this list):
- Alloc: size → class_idx → descriptor → block
- Free: ptr → superslab → descriptor → classification → free handler
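
The free-path chain above, written out as code. Only the function names and their order come from the hotspot table; all types, signatures, and glue below are assumptions:

```c
/* Illustrative chain of dependent lookups on the free path, assembled
 * from the hotspot-table names. Types and signatures are hypothetical
 * stand-ins; only the call sequence is taken from the profile. */
typedef struct super_entry super_entry_t;   /* hypothetical */
typedef struct mid_desc    mid_desc_t;      /* hypothetical */

extern super_entry_t *hak_super_lookup(const void *p);              /* 9.3% */
extern mid_desc_t    *mid_desc_lookup(const void *p);               /* 8.2% */
extern int            classify_ptr(const void *p,
                                   const super_entry_t *s);         /* 5.8% */
extern void           hak_pool_free_v1_impl(mid_desc_t *d,
                                            void *p, int kind);     /* 5.5% */

static void free_path_sketch(void *ptr) {
    /* Three dependent table lookups run before any memory is reclaimed. */
    super_entry_t *super = hak_super_lookup(ptr);
    mid_desc_t    *desc  = mid_desc_lookup(ptr);
    int            kind  = classify_ptr(ptr, super);
    hak_pool_free_v1_impl(desc, ptr, kind);
}
```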

**Comparison to mimalloc**: mimalloc likely uses (see the sketch below):
- Thread-local caching with minimal lookup
- Direct pointer arithmetic from block headers
- Segment-based organization reducing lookup depth
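
The pointer-arithmetic point deserves a sketch: mimalloc aligns segments to a power-of-two boundary, so ptr→segment metadata is a single mask rather than a table walk. The constant and struct layout below are illustrative assumptions, not mimalloc's actual definitions:

```c
#include <stdint.h>

/* Sketch of segment-style metadata lookup: if every segment is aligned
 * to SEGMENT_SIZE, the owning segment of any pointer is one AND away.
 * The 4 MiB size and the struct contents are assumptions. */
#define SEGMENT_SIZE ((uintptr_t)4 << 20)   /* assumed 4 MiB alignment */

typedef struct segment {
    uint32_t page_shift;    /* log2 of the page size within this segment */
    /* ... per-page metadata array would live here ... */
} segment_t;

static inline segment_t *segment_of(const void *p) {
    /* One mask replaces hak_super_lookup + ss_map_lookup style walks. */
    return (segment_t *)((uintptr_t)p & ~(SEGMENT_SIZE - 1));
}
```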

### 2. Pool Operations Still Expensive (14.9%)

Despite C6 routing through TinyHeap (not pool flatten), pool operations still consume significant cycles:
- `hak_pool_try_alloc_v1_impl` (5.5%)
- `hak_pool_free_v1_impl` (5.5%)

**Why**: TinyHeap v1 likely calls into pool infrastructure for:
- Page allocation from the mid/smallmid pool
- Descriptor management
- Cross-thread handling

### 3. Metadata Overhead (9.2%)

Mid-size allocations carry significant metadata overhead:
- Header initialization: `mid_set_header` (3.3%)
- Occupancy tracking: `mid_page_inuse_dec_and_maybe_dn` (3.2%)
- Descriptor init: `mid_desc_init_once` (2.7%)

### 4. Front Gate Overhead (9.7%)

The malloc/free wrappers add non-trivial cost:
- Route determination
- Cross-allocator checks (jemalloc, system)
- Lock depth checks
- Initialization checks

---

## Recommendations for Next Phase

### Priority 1: Address Lookup Reduction (Highest Impact)

**Target**: 41.5% → 20-25% of cycles

**Strategies** (strategy 2 is sketched after this list):
1. **TLS Descriptor Cache**: Extend mid_desc_cache to cache the full allocation context (class_idx + descriptor + page_info)
2. **Fast Path Header**: Embed class_idx in the allocation header for instant classification on free (similar to tiny allocations)
3. **Segment-Based Addressing**: Consider segment-style addressing (like mimalloc) where ptr→metadata is direct pointer arithmetic
4. **Superslab Lookup Bypass**: For C6-heavy workloads, skip the superslab lookup when the pointer is known to be mid-size
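
A minimal sketch of strategy 2 (Fast Path Header). The header layout, magic value, and helper names are assumptions; the point is that free-side classification becomes one load plus one compare:

```c
#include <stdint.h>
#include <stddef.h>

/* Sketch: stamp the class index into a small header at alloc time so
 * free() classifies with one load instead of registry lookups.
 * Layout and HDR_MAGIC are hypothetical. */
#define HDR_MAGIC 0xA11Cu

typedef struct block_hdr {
    uint16_t magic;      /* sanity tag to reject foreign pointers */
    uint8_t  class_idx;  /* e.g. 6 for the 512B class */
    uint8_t  flags;
} block_hdr_t;

static inline void *hdr_stamp(void *raw, uint8_t class_idx) {
    block_hdr_t *h = (block_hdr_t *)raw;
    h->magic = HDR_MAGIC;
    h->class_idx = class_idx;
    h->flags = 0;
    return (char *)raw + sizeof(block_hdr_t);   /* user pointer follows */
}

static inline int hdr_class_of(void *user, uint8_t *out_class) {
    block_hdr_t *h = (block_hdr_t *)((char *)user - sizeof(block_hdr_t));
    if (h->magic != HDR_MAGIC)
        return 0;                    /* unknown pointer: slow lookup path */
    *out_class = h->class_idx;       /* instant classification on free */
    return 1;
}
```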

**Expected Gain**: 10-15M ops/s (+100-150%)

### Priority 2: Pool Path Streamlining (Medium Impact)

**Target**: 14.9% → 8-10% of cycles

**Strategies** (strategy 2 is sketched after this list):
1. **Dedicated C6 Fast Path**: Create a specialized alloc/free path for class 6 that skips pool generality
2. **TLS Block Cache**: Implement a TLS-local block cache for C6 (bypassing pool ring buffer overhead)
3. **Inline Critical Helpers**: Force-inline `hak_pool_get_class_index` and other hot helpers
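
A minimal sketch of strategy 2 (TLS Block Cache), assuming hypothetical pool entry points; the common case becomes two TLS accesses with no registry lookups:

```c
#include <stddef.h>

/* Sketch of a TLS-local block cache for class 6 (all names assumed).
 * Push on free, pop on alloc; only over/underflow touches the shared
 * pool. */
#define C6_CACHE_CAP 32

static _Thread_local void *c6_cache[C6_CACHE_CAP];
static _Thread_local int   c6_top;               /* number of cached blocks */

/* Hypothetical slow-path hooks into the existing pool: */
extern void *hak_pool_try_alloc_v1(size_t size);
extern void  hak_pool_free_v1(void *p);

static void *c6_alloc(void) {
    if (c6_top > 0)
        return c6_cache[--c6_top];        /* fast path: TLS pop */
    return hak_pool_try_alloc_v1(512);    /* refill from shared pool */
}

static void c6_free(void *p) {
    if (c6_top < C6_CACHE_CAP) {
        c6_cache[c6_top++] = p;           /* fast path: TLS push */
        return;
    }
    hak_pool_free_v1(p);                  /* cache full: spill to pool */
}
```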

**Expected Gain**: 3-5M ops/s (+30-50%)

### Priority 3: Metadata Streamlining (Lower Impact)

**Target**: 9.2% → 5-6% of cycles

**Strategies** (strategy 2 is sketched after this list):
1. **Lazy Header Init**: Only initialize headers when necessary (debug mode, cross-thread)
2. **Batch Occupancy Updates**: Combine multiple inuse_dec calls
3. **Cached Descriptors**: Reduce descriptor initialization overhead
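
One possible shape for strategy 2 (Batch Occupancy Updates). The page type and flush hook are assumptions, and a flush at operation boundaries (omitted here) would also be required:

```c
#include <stdint.h>

/* Sketch: batch per-page in-use decrements in TLS and flush once,
 * instead of one mid_page_inuse_dec_and_maybe_dn call per free. */
typedef struct mid_page mid_page_t;                          /* hypothetical */
extern void mid_page_inuse_sub(mid_page_t *pg, uint32_t n);  /* assumed hook */

static _Thread_local mid_page_t *pending_page;
static _Thread_local uint32_t    pending_decs;

static void inuse_dec_batched(mid_page_t *pg) {
    if (pg == pending_page) {        /* same page as last free: just count */
        pending_decs++;
        return;
    }
    if (pending_page)                /* switched pages: flush the old batch */
        mid_page_inuse_sub(pending_page, pending_decs);
    pending_page = pg;
    pending_decs = 1;
}
```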

**Expected Gain**: 1-2M ops/s (+10-20%)

### Priority 4: Front Gate Thinning (Lower Impact)

**Target**: 9.7% → 6-7% of cycles

**Strategies** (strategy 2 is sketched after this list):
1. **Size-Based Fast Path**: For the mid-size range (257-768B), skip most gate checks
2. **Compile-Time Routing**: When the jemalloc/system allocators are not in use, eliminate the checks at build time
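
A sketch of strategy 2 (Compile-Time Routing). The build flag and entry points are hypothetical, not existing HAKMEM switches; the idea is to let the preprocessor remove the interop checks when a build cannot encounter foreign allocators:

```c
#include <stddef.h>

/* Hypothetical internal entry points: */
extern void *hak_alloc_fast(size_t size);     /* no gate checks         */
extern void *hak_alloc_checked(size_t size);  /* full front-gate checks */

void *malloc_gate_sketch(size_t size) {
#ifdef HAKMEM_NO_FOREIGN_ALLOCATORS           /* hypothetical build flag */
    /* No jemalloc/system interop possible in this build: go straight in. */
    return hak_alloc_fast(size);
#else
    /* Interop build: keep the route/lock-depth/init checks. */
    return hak_alloc_checked(size);
#endif
}
```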

**Expected Gain**: 1-2M ops/s (+10-20%)

---

## Comparison to Historical Baselines

| Phase | Configuration | Throughput | vs Current | Notes |
|-------|--------------|------------|------------|-------|
| **Phase 54** | C7_SAFE, mixed 16-1024B | 28.1M ops/s | 2.9x | Mixed workload |
| **Phase 80** | C6-heavy, flatten OFF | 23.1M ops/s | 2.4x | Legacy baseline |
| **Phase 81** | C6-heavy, flatten ON | 25.9M ops/s | 2.6x | +10% from flatten |
| **Phase 82** | C6-heavy, flatten ON | 26.7M ops/s | 2.7x | +13% from flatten |
| **Current (C6-H)** | C6-heavy, C6_HOT=1 | 9.8M ops/s | 1.0x | **REGRESSION** |

**CRITICAL FINDING**: The current baseline (9.8M ops/s) is **2.4-2.7x SLOWER** than the historical C6-heavy baselines (23-27M ops/s).

**Possible Causes**:
1. **Configuration difference**: Historical tests may have used a different profile (LEGACY vs C7_SAFE)
2. **Routing change**: C6_HOT=1 may be forcing a slower path through TinyHeap
3. **Build/compiler difference**: Flags or LTO settings may have changed
4. **Benchmark variance**: Different workload characteristics

**Action Required**: Replicate the historical Phase 80-82 configurations exactly to identify the regression point.

---

## Verification of Historical Configuration

For comparison, here is the exact configuration used in Phase 80-82 versus the current run.

**Phase 80-82 Configuration** (from CURRENT_TASK.md):
```bash
HAKMEM_BENCH_MIN_SIZE=257
HAKMEM_BENCH_MAX_SIZE=768
HAKMEM_TINY_HEAP_PROFILE=LEGACY   # ← Different!
HAKMEM_TINY_HOTHEAP_V2=0
HAKMEM_POOL_V2_ENABLED=0
HAKMEM_POOL_V1_FLATTEN_ENABLED=1
HAKMEM_POOL_V1_FLATTEN_STATS=1
```

**Current Configuration**:
```bash
HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1   # Sets TINY_HEAP_PROFILE=C7_SAFE
HAKMEM_TINY_C6_HOT=1                    # ← Adds TinyHeap routing
HAKMEM_POOL_V1_FLATTEN_ENABLED=0        # ← Flatten OFF by default
```

**Key Difference**: The historical tests used `TINY_HEAP_PROFILE=LEGACY`, which likely routes C6 through the pure pool path (no TinyHeap). The current `C6_HEAVY_LEGACY_POOLV1` profile sets `TINY_HEAP_PROFILE=C7_SAFE` + `TINY_C6_HOT=1`, routing C6 through TinyHeap.

---

## Action Items for Phase C6-H+1

1. **Replicate Historical Baseline** (URGENT)

   ```bash
   export HAKMEM_BENCH_MIN_SIZE=257
   export HAKMEM_BENCH_MAX_SIZE=768
   export HAKMEM_TINY_HEAP_PROFILE=LEGACY
   export HAKMEM_TINY_HOTHEAP_V2=0
   export HAKMEM_POOL_V2_ENABLED=0
   export HAKMEM_POOL_V1_FLATTEN_ENABLED=0
   # Expected: ~23M ops/s
   ```

2. **Test Flatten ON with Historical Config**

   ```bash
   # Same as above, but:
   export HAKMEM_POOL_V1_FLATTEN_ENABLED=1
   export HAKMEM_POOL_V1_FLATTEN_STATS=1
   # Expected: ~26M ops/s with active flatten stats
   ```

3. **Profile Comparison Matrix**
   - LEGACY vs C7_SAFE profile
   - C6_HOT=0 vs C6_HOT=1
   - Flatten OFF vs ON
   - Identify which combination yields the best performance

4. **Address Lookup Prototype**
   - Implement a TLS allocation context cache (class_idx + descriptor + page)
   - Measure the impact on lookup overhead (target: 41.5% → 25%)

5. **Update ENV_PROFILE_PRESETS.md**
   - Clarify that `C6_HEAVY_LEGACY_POOLV1` uses the C7_SAFE profile (not pure LEGACY)
   - Add a note about the C6_HOT routing implications
   - Document the performance differences between profile choices

---

## Success Criteria for Phase C6-H+1

- **Reproduce historical baseline**: Achieve 23-27M ops/s with the LEGACY profile
- **Understand routing impact**: Quantify the C6_HOT=0 vs C6_HOT=1 difference
- **Identify optimization path**: Choose between:
  - Optimizing the TinyHeap C6 path (if C6_HOT=1 is strategic)
  - Optimizing the pool flatten path (if LEGACY/C6_HOT=0 is preferred)
  - A hybrid approach with runtime selection

**Target**: Close to **30M ops/s** (half of the current gap to the 51.3M mimalloc baseline) by the end of the next phase.

---

## Appendix A: Full Perf Report Output

```
# Samples: 656 of event 'cycles:u'
# Event count (approx.): 409,174,521
#
# Overhead  Symbol
# ........  .....................................
     9.32%  [.] hak_super_lookup
     8.23%  [.] mid_desc_lookup
     5.87%  [.] hak_pool_get_class_index
     5.76%  [.] classify_ptr
     5.52%  [.] hak_pool_free_v1_impl
     5.46%  [.] hak_pool_try_alloc_v1_impl
     4.54%  [.] free
     4.47%  [.] worker_run
     4.35%  [.] ss_map_lookup
     4.32%  [.] super_reg_effective_mask
     3.69%  [.] mid_desc_hash
     3.27%  [.] mid_set_header
     3.17%  [.] mid_page_inuse_dec_and_maybe_dn
     2.71%  [.] mid_desc_init_once
     2.60%  [.] malloc
     2.53%  [.] hak_free_at
     2.17%  [.] hak_pool_mid_lookup_v1_impl
     1.87%  [.] super_reg_effective_size
     1.77%  [.] hak_pool_free_fast_v1_impl
     1.64%  [k] 0xffffffffae200ba0 (kernel)
     1.44%  [.] hak_pool_init
     1.42%  [.] hak_pool_is_poolable
     1.21%  [.] should_sample
     1.12%  [.] hak_pool_free
     1.11%  [.] hak_super_hash
     1.09%  [.] hak_pool_try_alloc
     0.95%  [.] mid_desc_lookup_cached
     0.93%  [.] hak_pool_v1_flatten_enabled
     0.76%  [.] hak_pool_v2_route
     0.57%  [.] ss_map_hash
     0.55%  [.] hak_in_wrapper
```

---

## Appendix B: Test Commands Summary

```bash
# Baseline
export HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1
export HAKMEM_BENCH_MIN_SIZE=257
export HAKMEM_BENCH_MAX_SIZE=768
./bench_mid_large_mt_hakmem 1 1000000 400 1
# Result: 9,836,420 ops/s

# Mimalloc comparison
./bench_mid_large_mt_mi 1 1000000 400 1
# Result: 51,297,877 ops/s (5.2x faster)

# Mid_desc cache OFF
export HAKMEM_MID_DESC_CACHE_ENABLED=0
./bench_mid_large_mt_hakmem 1 1000000 400 1
# Result: 9,805,954 ops/s

# Mid_desc cache ON
export HAKMEM_MID_DESC_CACHE_ENABLED=1
./bench_mid_large_mt_hakmem 1 1000000 400 1
# Result: 10,435,480 ops/s (+6.4%)

# Perf stat
perf stat -e cycles:u,instructions:u,task-clock,page-faults:u \
  ./bench_mid_large_mt_hakmem 1 1000000 400 1
# Result: 398M cycles, 1.05B instructions, IPC=2.64

# Perf record
perf record -F 5000 --call-graph dwarf -e cycles:u \
  -o perf.data.c6_flat ./bench_mid_large_mt_hakmem 1 1000000 400 1
perf report -i perf.data.c6_flat --stdio --no-children
```

---

**End of Report**