hakmem/docs/analysis/C6_HEAVY_VISIBILITY_ANALYSIS_PHASE_C6H.md

# C6-Heavy (257-768B) Visibility Analysis - Phase C6-H

**Date**: 2025-12-10
**Benchmark**: `./bench_mid_large_mt_hakmem 1 1000000 400 1` (1 thread, ws=400, iters=1M)
**Size Range**: 257-768B (Class 6: 512B allocations)
**Configuration**: C6_HEAVY_LEGACY_POOLV1 profile (C7_SAFE + C6_HOT=1)

---

## Executive Summary

### Performance Gap Analysis
- **HAKMEM**: 9.84M ops/s (baseline)
- **mimalloc**: 51.3M ops/s
- **Performance Gap**: **5.2x** (mimalloc is 421% faster)

This represents a **critical performance deficit** in the C6-heavy allocation path, where HAKMEM achieves only **19% of mimalloc's throughput**.

### Key Findings
1. **C6 does NOT use Pool flatten path** - With `HAKMEM_TINY_C6_HOT=1`, allocations route through TinyHeap v1, bypassing pool flatten entirely
2. **Address lookup dominates CPU time** - `hak_super_lookup` (9.3%) + `mid_desc_lookup` (8.2%) + `classify_ptr` (5.8%) = **23.3% of cycles**
3. **Pool operations are expensive** - Despite not using flatten, pool alloc/free combined still consume ~15% of cycles (14.8% in the category breakdown)
4. **Mid_desc cache provides modest gains** - +6.4% improvement (9.8M → 10.4M ops/s)

---

## Phase C6-H1: Baseline Metrics

### Test Configuration
```bash
export HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1
export HAKMEM_BENCH_MIN_SIZE=257
export HAKMEM_BENCH_MAX_SIZE=768
```

### Baseline Results

| Configuration | Throughput (ops/s) | vs mimalloc | Notes |
|---------------|-------------------|-------------|-------|
| **Baseline (C6_HOT=1, mid_desc_cache=1)** | 9,836,420 | 19.2% | Default profile |
| **C6_HOT=1, mid_desc_cache=0** | 9,805,954 | 19.1% | Without cache |
| **C6_HOT=1, mid_desc_cache=1** | 10,435,480 | 20.3% | With cache (+6.4%) |
| **C6_HOT=0 (pure legacy pool)** | 9,938,473 | 19.4% | Pool path ~same as TinyHeap |
| **mimalloc baseline** | 51,297,877 | 100.0% | Reference |

### Key Observations
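1. **Mid_desc cache effect**: +6.4% improvement (9.81M → 10.44M ops/s), but far from closing the gap. The two cache-enabled runs in the table (9.84M and 10.44M ops/s) differ by a similar margin (~6%), so the gain is within run-to-run variance and should be confirmed with repeated runs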
2. **C6_HOT vs pool path**: Nearly identical performance (~9.8M-9.9M ops/s), suggesting the bottleneck is in common infrastructure (address lookup, classification)
3. **Size class routing**: 257-768B → Class 6 (512B) as expected

---

## Phase C6-H2: Pool Flatten and Cache Analysis

### Pool Flatten Test (ATTEMPTED)

**Finding**: Pool v1 flatten path is **NOT USED** for C6 allocations with `HAKMEM_TINY_C6_HOT=1`.

```bash
# Test with flatten enabled
export HAKMEM_POOL_V1_FLATTEN_ENABLED=1
export HAKMEM_POOL_V1_FLATTEN_STATS=1
# Result: [POOL_V1_FLAT] alloc_tls_hit=0 alloc_fb=0 free_tls_hit=0 free_fb=0
```

**Root Cause**:
- With `HAKMEM_TINY_C6_HOT=1`, class 6 routes to `TINY_ROUTE_HEAP` (TinyHeap v1)
- TinyHeap v1 uses its own allocation path via `tiny_heap_box.h`, not the pool flatten path
- Pool flatten optimizations (Phase 80-82) only apply to **legacy pool path** (when C6_HOT=0)
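
The routing rule described above can be pictured with a small sketch. It is illustrative only: `TINY_ROUTE_HEAP` is the route named in this report, while `TINY_ROUTE_LEGACY`, `env_flag_enabled`, and `route_for_c6` are hypothetical names, and reading the flag with `getenv` is an assumption rather than HAKMEM's actual mechanism.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical route identifiers; only TINY_ROUTE_HEAP appears in this report. */
typedef enum { TINY_ROUTE_LEGACY = 0, TINY_ROUTE_HEAP = 1 } tiny_route_t;

/* Returns 1 when the environment variable is set to exactly "1", else 0. */
static int env_flag_enabled(const char *name) {
    const char *v = getenv(name);
    return v != NULL && strcmp(v, "1") == 0;
}

/* Assumed rule for class 6: TinyHeap when C6_HOT is on, legacy pool otherwise.
 * On TINY_ROUTE_HEAP the pool flatten fast path is never entered, which is
 * consistent with the all-zero [POOL_V1_FLAT] counters shown above. */
static tiny_route_t route_for_c6(int c6_hot) {
    return c6_hot ? TINY_ROUTE_HEAP : TINY_ROUTE_LEGACY;
}
```

Calling `route_for_c6(env_flag_enabled("HAKMEM_TINY_C6_HOT"))` selects the route for the current environment under these assumptions.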

### Mid_Desc Cache Analysis

| Metric | Without Cache | With Cache | Delta |
|--------|--------------|------------|-------|
| Throughput | 9.81M ops/s | 10.44M ops/s | +6.4% |
| `mid_desc_lookup` self % | 8.2% | ~6-7% (estimated) | ~1-2 points (estimated) |

**Conclusion**: Mid_desc cache provides measurable but insufficient improvement. The 8.2% CPU time in `mid_desc_lookup` is reduced, but other lookup costs (hak_super_lookup, classify_ptr) remain.

---

## Phase C6-H3: CPU Hotspot Analysis

### Perf Stat Results

```
Benchmark: 9,911,926 ops/s (0.101s runtime)
Cycles:      398,766,361 cycles:u
Instructions: 1,054,643,524 instructions:u
IPC:         2.64
Page Faults: 7,131
Task Clock:  119.08 ms
```

**Analysis**:
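- **IPC 2.64**: Instruction throughput is healthy, so the cost comes mainly from the number of instructions executed per operation rather than from pipeline stalls
- **Cycles per operation**: 398,766,361 / 1,000,000 ≈ **399 cycles/op** (whole-process user-space count, including startup)
- **Instructions per operation**: 1,054,643,524 / 1,000,000 ≈ **1,055 instructions/op**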

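**Comparison estimate** (mimalloc at 51.3M ops/s, derived from the throughput ratio rather than measured):
- Estimated cycles/op for mimalloc: ~76 cycles/op (398.8 cycles/op scaled by 9.84 / 51.3, assuming the same clock frequency)
- HAKMEM therefore spends roughly **5.2x more cycles** per operation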
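
The per-operation figures above follow from simple ratios. A minimal sketch of that arithmetic, with hypothetical helper names and the assumption that both allocators run at the same effective clock frequency:

```c
#include <stdint.h>

/* Average cost per benchmark operation from whole-run counter totals. */
static double per_op(uint64_t total, uint64_t ops) {
    return (double)total / (double)ops;
}

/* Scale a measured cycles/op by a throughput ratio to estimate another
 * allocator's cycles/op. This is an estimate, not a measurement: it assumes
 * both runs execute at the same clock frequency. */
static double scaled_cycles_per_op(double cycles_per_op,
                                   double ops_per_sec_fast,
                                   double ops_per_sec_slow) {
    return cycles_per_op * (ops_per_sec_slow / ops_per_sec_fast);
}
```

Feeding in the counters from the perf stat run and the two baseline throughputs reproduces the ~399 cycles/op and ~76 cycles/op figures.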

### Perf Record Hotspots (Top 20 User-Space Functions)

| Function | Self % | Category | Description |
|----------|--------|----------|-------------|
| `hak_super_lookup` | 9.32% | Address Lookup | Superslab registry lookup (largest single cost) |
| `mid_desc_lookup` | 8.23% | Address Lookup | Mid-size descriptor lookup |
| `hak_pool_get_class_index` | 5.87% | Classification | Size→class mapping |
| `classify_ptr` | 5.76% | Classification | Pointer classification for free |
| `hak_pool_free_v1_impl` | 5.52% | Pool Free | Pool free implementation |
| `hak_pool_try_alloc_v1_impl` | 5.46% | Pool Alloc | Pool allocation implementation |
| `free` | 4.54% | Front Gate | HAKMEM `free` wrapper (replaces the libc symbol) |
| `worker_run` | 4.47% | Benchmark | Benchmark driver |
| `ss_map_lookup` | 4.35% | Address Lookup | Superslab map lookup |
| `super_reg_effective_mask` | 4.32% | Address Lookup | Registry mask computation |
| `mid_desc_hash` | 3.69% | Address Lookup | Hash computation for mid_desc |
| `mid_set_header` | 3.27% | Metadata | Header initialization |
| `mid_page_inuse_dec_and_maybe_dn` | 3.17% | Metadata | Page occupancy tracking |
| `mid_desc_init_once` | 2.71% | Metadata | Descriptor initialization |
| `malloc` | 2.60% | Front Gate | HAKMEM `malloc` wrapper (replaces the libc symbol) |
| `hak_free_at` | 2.53% | Front Gate | Internal free dispatcher |
| `hak_pool_mid_lookup_v1_impl` | 2.17% | Pool Lookup | Pool-specific descriptor lookup |
| `super_reg_effective_size` | 1.87% | Address Lookup | Registry size computation |
| `hak_pool_free_fast_v1_impl` | 1.77% | Pool Free | Fast path for pool free |
| `hak_pool_init` | 1.44% | Initialization | Pool initialization |

### Hotspot Category Breakdown

| Category | Combined Self % | Functions |
|----------|----------------|-----------|
| **Address Lookup & Classification** | **41.5%** | hak_super_lookup, mid_desc_lookup, classify_ptr, hak_pool_get_class_index, ss_map_lookup, super_reg_effective_mask, mid_desc_hash (these seven sum to 41.5%); super_reg_effective_size and hak_pool_mid_lookup_v1_impl add a further 4.0% |
| **Pool Operations** | **14.8%** | hak_pool_try_alloc_v1_impl, hak_pool_free_v1_impl, hak_pool_free_fast_v1_impl (these three sum to 12.8%) |
| **Metadata Management** | **9.2%** | mid_set_header, mid_page_inuse_dec_and_maybe_dn, mid_desc_init_once |
| **Front Gate** | **9.7%** | malloc, free, hak_free_at |
| **Benchmark Driver** | **4.5%** | worker_run |
| **Other** | **20.3%** | Various helpers, initialization, etc. |

---

## Root Cause Analysis

### 1. Address Lookup Dominates (41.5% of CPU)

The single largest performance killer is **address→metadata lookup infrastructure**:

- **hak_super_lookup** (9.3%): Superslab registry lookup to find which allocator owns a pointer
- **mid_desc_lookup** (8.2%): Hash-based descriptor lookup for mid-size allocations
- **ss_map_lookup** (4.3%): Secondary map lookup within superslab
- **classify_ptr** (5.8%): Pointer classification during free
- **hak_pool_get_class_index** (5.9%): Size→class index computation

**Why this matters**: Every allocation AND free requires multiple lookups:
- Alloc: size → class_idx → descriptor → block
- Free: ptr → superslab → descriptor → classification → free handler

**Comparison to mimalloc**: mimalloc likely uses:
- Thread-local caching with minimal lookup
- Direct pointer arithmetic: masking a pointer down to its aligned segment to reach page metadata, without per-block headers
- Segment-based organization reducing lookup depth
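
To make the pointer-arithmetic idea concrete, here is a minimal sketch of segment-style addressing. It is not mimalloc's or HAKMEM's code: the 4 MiB segment size, the 64 KiB page size, and all type and function names are assumptions chosen for illustration. The only requirement is that every segment is allocated at an address aligned to its own power-of-two size.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sizes; both must be powers of two. */
#define SEG_SIZE      ((uintptr_t)1 << 22)        /* 4 MiB segment */
#define PAGE_SHIFT    16                          /* 64 KiB pages  */
#define PAGES_PER_SEG (SEG_SIZE >> PAGE_SHIFT)

/* Hypothetical per-page metadata. */
typedef struct { uint8_t class_idx; uint16_t inuse; } page_meta_t;

/* Metadata lives at the start of each SEG_SIZE-aligned segment. */
typedef struct { page_meta_t pages[PAGES_PER_SEG]; } segment_t;

/* ptr -> owning segment: a single mask, no hash table or registry. */
static segment_t *segment_of(const void *p) {
    return (segment_t *)((uintptr_t)p & ~(SEG_SIZE - 1));
}

/* ptr -> page metadata: offset within the segment selects the page slot. */
static page_meta_t *page_meta_of(const void *p) {
    segment_t *seg = segment_of(p);
    size_t idx = (size_t)(((uintptr_t)p & (SEG_SIZE - 1)) >> PAGE_SHIFT);
    return &seg->pages[idx];
}
```

With this layout, the free path reaches the size class with a mask, a shift, and one load, in contrast to the `ptr → superslab → descriptor → classification` chain listed above.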

### 2. Pool Operations Still Expensive (14.8%)

Despite C6 routing through TinyHeap (not pool flatten), pool operations still consume significant cycles:
- `hak_pool_try_alloc_v1_impl` (5.5%)
- `hak_pool_free_v1_impl` (5.5%)

**Why**: TinyHeap v1 likely calls into pool infrastructure for:
- Page allocation from mid/smallmid pool
- Descriptor management
- Cross-thread handling

### 3. Metadata Overhead (9.2%)

Mid-size allocations carry significant metadata overhead:
- Header initialization: `mid_set_header` (3.3%)
- Occupancy tracking: `mid_page_inuse_dec_and_maybe_dn` (3.2%)
- Descriptor init: `mid_desc_init_once` (2.7%)

### 4. Front Gate Overhead (9.7%)

The malloc/free wrappers add non-trivial cost:
- Route determination
- Cross-allocator checks (jemalloc, system)
- Lock depth checks
- Initialization checks

---

## Recommendations for Next Phase

### Priority 1: Address Lookup Reduction (Highest Impact)
**Target**: 41.5% → 20-25% of cycles

**Strategies**:
1. **TLS Descriptor Cache**: Extend mid_desc_cache to cache full allocation context (class_idx + descriptor + page_info)
2. **Fast Path Header**: Embed class_idx in allocation header for instant classification on free (similar to tiny allocations)
3. **Segment-Based Addressing**: Consider segment-style addressing (like mimalloc) where ptr→metadata is direct pointer arithmetic
4. **Superslab Lookup Bypass**: For C6-heavy workloads, skip superslab lookup when we know it's mid-size

**Expected Gain**: 10-15M ops/s (+100-150%)
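
A one-entry TLS context cache (strategy 1) could look roughly like the sketch below. Everything in it is hypothetical: `alloc_ctx_t`, `ctx_for_ptr`, the 64 KiB page granularity, and the `slow_lookup` stand-in, which merely counts calls in place of the real registry and descriptor lookups.

```c
#include <stdint.h>
#include <stddef.h>

#define CTX_PAGE_SHIFT 16  /* assumed 64 KiB page granularity */

/* Hypothetical allocation context: what the free path needs to know. */
typedef struct { uint8_t class_idx; void *desc; } alloc_ctx_t;

typedef struct { uintptr_t page_key; alloc_ctx_t ctx; int valid; } ctx_cache_t;

static _Thread_local ctx_cache_t tls_ctx_cache;   /* one entry per thread */

/* Stand-in for the expensive path (superslab registry + mid_desc hash). */
static unsigned long slow_lookups;
static alloc_ctx_t slow_lookup(const void *p) {
    (void)p;
    slow_lookups++;
    alloc_ctx_t c = { 6, NULL };
    return c;
}

/* Resolve ptr -> context, consulting the thread-local entry first. */
static alloc_ctx_t ctx_for_ptr(const void *p) {
    uintptr_t key = (uintptr_t)p >> CTX_PAGE_SHIFT;
    if (tls_ctx_cache.valid && tls_ctx_cache.page_key == key)
        return tls_ctx_cache.ctx;              /* hit: no global lookup */
    alloc_ctx_t c = slow_lookup(p);            /* miss: resolve and refill */
    tls_ctx_cache.page_key = key;
    tls_ctx_cache.ctx = c;
    tls_ctx_cache.valid = 1;
    return c;
}
```

A real implementation would also need to invalidate the entry when the cached page is released or reassigned to another class; that step is omitted here.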

### Priority 2: Pool Path Streamlining (Medium Impact)
**Target**: 14.8% → 8-10% of cycles

**Strategies**:
1. **Dedicated C6 Fast Path**: Create a specialized alloc/free path for class 6 that skips pool generality
2. **TLS Block Cache**: Implement TLS-local block cache for C6 (bypass pool ring buffer overhead)
3. **Inline Critical Helpers**: Force-inline `hak_pool_get_class_index` and other hot helpers

**Expected Gain**: 3-5M ops/s (+30-50%)
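
The TLS block cache (strategy 2) can be sketched as a small per-thread LIFO stack. The capacity of 64 and the names `c6_tls_pop` and `c6_tls_push` are assumptions for illustration; the sketch covers same-thread alloc/free only.

```c
#include <stddef.h>

#define C6_TLS_CAP 64   /* assumed capacity, to be tuned by measurement */

typedef struct { void *slots[C6_TLS_CAP]; size_t count; } c6_tls_cache_t;

static _Thread_local c6_tls_cache_t c6_tls;

/* Pop a cached block; NULL means the caller falls back to the pool path. */
static void *c6_tls_pop(void) {
    return c6_tls.count ? c6_tls.slots[--c6_tls.count] : NULL;
}

/* Push a freed block; returns 0 when full so the caller frees via the pool. */
static int c6_tls_push(void *blk) {
    if (c6_tls.count == C6_TLS_CAP) return 0;
    c6_tls.slots[c6_tls.count++] = blk;
    return 1;
}
```

LIFO order hands back the most recently freed block first, which tends to still be warm in cache. Blocks freed by a different thread than the one that allocated them would need a separate path.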

### Priority 3: Metadata Streamlining (Lower Impact)
**Target**: 9.2% → 5-6% of cycles

**Strategies**:
1. **Lazy Header Init**: Only initialize headers when necessary (debug mode, cross-thread)
2. **Batch Occupancy Updates**: Combine multiple inuse_dec calls
3. **Cached Descriptors**: Reduce descriptor initialization overhead

**Expected Gain**: 1-2M ops/s (+10-20%)
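
Batching occupancy updates (strategy 2) might be sketched as follows. The types and function names are hypothetical, and the sketch is single-threaded: it ignores the atomics a shared page counter would need.

```c
#include <stddef.h>

typedef struct { unsigned inuse; } mid_page_t;        /* hypothetical page record */
typedef struct { mid_page_t *page; unsigned pending; } dec_batch_t;

/* Apply the accumulated decrements to the page in a single update. */
static void dec_batch_flush(dec_batch_t *b) {
    if (b->page != NULL) b->page->inuse -= b->pending;
    b->page = NULL;
    b->pending = 0;
}

/* Record one free; page metadata is touched only when the page changes. */
static void dec_batch_add(dec_batch_t *b, mid_page_t *pg) {
    if (b->page != pg) {
        dec_batch_flush(b);
        b->page = pg;
    }
    b->pending++;
}
```

The trade-off is that a page's `inuse` count is stale until the batch is flushed, so detecting an empty page is delayed by up to one batch.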

### Priority 4: Front Gate Thinning (Lower Impact)
**Target**: 9.7% → 6-7% of cycles

**Strategies**:
1. **Size-Based Fast Path**: For mid-size range (257-768B), skip most gate checks
2. **Compile-Time Routing**: When jemalloc/system allocators are not used, eliminate checks

**Expected Gain**: 1-2M ops/s (+10-20%)
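
For the size-based fast path, the 257-768B range test can be done with a single unsigned comparison. The sketch below is an assumption about how such a gate could be written; `in_mid_range`, `gate_path_for`, and the enum values are hypothetical names.

```c
#include <stddef.h>

#define MID_MIN 257u
#define MID_MAX 768u

/* One unsigned compare covers both bounds: sizes below MID_MIN wrap around
 * to a very large value and fail the test. */
static inline int in_mid_range(size_t size) {
    return (size - MID_MIN) <= (size_t)(MID_MAX - MID_MIN);
}

typedef enum { GATE_FAST_MID, GATE_FULL_CHECKS } gate_path_t;

/* Choose the thin path for mid-size requests, the full gate otherwise. */
static inline gate_path_t gate_path_for(size_t size) {
    return in_mid_range(size) ? GATE_FAST_MID : GATE_FULL_CHECKS;
}
```

This only covers the allocation side, where the size is known up front; the free side still has to classify the pointer before it can skip any checks.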

---

## Comparison to Historical Baselines

| Phase | Configuration | Throughput | vs Current | Notes |
|-------|--------------|------------|------------|-------|
| **Phase 54** | C7_SAFE, mixed 16-1024B | 28.1M ops/s | 2.9x | Mixed workload |
| **Phase 80** | C6-heavy, flatten OFF | 23.1M ops/s | 2.4x | Legacy baseline |
| **Phase 81** | C6-heavy, flatten ON | 25.9M ops/s | 2.6x | +10% from flatten |
| **Phase 82** | C6-heavy, flatten ON | 26.7M ops/s | 2.7x | +13% from flatten |
| **Current (C6-H)** | C6-heavy, C6_HOT=1 | 9.8M ops/s | 1.0x | **REGRESSION** |

**CRITICAL FINDING**: Current baseline (9.8M ops/s) is **2.4-2.7x SLOWER** than historical C6-heavy baselines (23-27M ops/s).

**Possible Causes**:
1. **Configuration difference**: Historical tests may have used a different profile (LEGACY vs C7_SAFE)
2. **Routing change**: C6_HOT=1 may be forcing a slower path through TinyHeap
3. **Build/compiler difference**: Flags or LTO settings may have changed
4. **Benchmark variance**: Different workload characteristics

**Action Required**: Replicate historical Phase 80-82 configurations exactly to identify regression point.

---

## Verification of Historical Configuration

The configuration used in Phases 80-82 differs from the current one:

**Phase 80-82 Configuration** (from CURRENT_TASK.md):
```bash
HAKMEM_BENCH_MIN_SIZE=257
HAKMEM_BENCH_MAX_SIZE=768
HAKMEM_TINY_HEAP_PROFILE=LEGACY  # ← Different!
HAKMEM_TINY_HOTHEAP_V2=0
HAKMEM_POOL_V2_ENABLED=0
HAKMEM_POOL_V1_FLATTEN_ENABLED=1
HAKMEM_POOL_V1_FLATTEN_STATS=1
```

**Current Configuration**:
```bash
HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1  # Sets TINY_HEAP_PROFILE=C7_SAFE
HAKMEM_TINY_C6_HOT=1  # ← Adds TinyHeap routing
HAKMEM_POOL_V1_FLATTEN_ENABLED=0  # ← Flatten OFF by default
```

**Key Difference**: Historical tests used `TINY_HEAP_PROFILE=LEGACY`, which likely routes C6 through the pure pool path (no TinyHeap). The current `C6_HEAVY_LEGACY_POOLV1` profile sets `TINY_HEAP_PROFILE=C7_SAFE` + `TINY_C6_HOT=1`, routing C6 through TinyHeap.

---

## Action Items for Phase C6-H+1

1. **Replicate Historical Baseline** (URGENT)
   ```bash
   export HAKMEM_BENCH_MIN_SIZE=257
   export HAKMEM_BENCH_MAX_SIZE=768
   export HAKMEM_TINY_HEAP_PROFILE=LEGACY
   export HAKMEM_TINY_HOTHEAP_V2=0
   export HAKMEM_POOL_V2_ENABLED=0
   export HAKMEM_POOL_V1_FLATTEN_ENABLED=0
   # Expected: ~23M ops/s
   ```

2. **Test Flatten ON with Historical Config**
   ```bash
   # Same as above, but:
   export HAKMEM_POOL_V1_FLATTEN_ENABLED=1
   export HAKMEM_POOL_V1_FLATTEN_STATS=1
   # Expected: ~26M ops/s with active flatten stats
   ```

3. **Profile Comparison Matrix**
   - LEGACY vs C7_SAFE profile
   - C6_HOT=0 vs C6_HOT=1
   - Flatten OFF vs ON
   - Identify which combination yields best performance

4. **Address Lookup Prototype**
   - Implement TLS allocation context cache (class_idx + descriptor + page)
   - Measure impact on lookup overhead (target: 41.5% → 25%)

5. **Update ENV_PROFILE_PRESETS.md**
   - Clarify that `C6_HEAVY_LEGACY_POOLV1` uses C7_SAFE profile (not pure LEGACY)
   - Add note about C6_HOT routing implications
   - Document performance differences between profile choices

---

## Success Criteria for Phase C6-H+1

- **Reproduce historical baseline**: Achieve 23-27M ops/s with LEGACY profile
- **Understand routing impact**: Quantify C6_HOT=0 vs C6_HOT=1 difference
- **Identify optimization path**: Choose between:
  - Optimizing TinyHeap C6 path (if C6_HOT=1 is strategic)
  - Optimizing pool flatten path (if LEGACY/C6_HOT=0 is preferred)
  - Hybrid approach with runtime selection

**Target**: Approach **30M ops/s** (closing about half of the current gap to the 51.3M ops/s mimalloc baseline) by the end of the next phase.
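
As a rough sanity check on this target, the cycle shares reported earlier can be turned into throughput bounds. The sketch assumes a fixed clock frequency and that work outside the optimized component stays constant; the function names are hypothetical.

```c
/* Fraction of today's cycles/op that must be removed to reach a target
 * throughput, assuming the clock frequency does not change. */
static double required_cycle_cut(double current_mops, double target_mops) {
    return 1.0 - current_mops / target_mops;
}

/* Speedup when one component's share of total cycles falls from
 * share_before to share_after (each a fraction of its own run's total)
 * while the cycles spent on everything else stay constant. */
static double speedup_from_share(double share_before, double share_after) {
    return (1.0 - share_after) / (1.0 - share_before);
}

/* Projected throughput (M ops/s) after such a reduction. */
static double projected_mops(double baseline_mops,
                             double share_before, double share_after) {
    return baseline_mops * speedup_from_share(share_before, share_after);
}
```

Under these assumptions, reaching 30M ops/s from 9.84M requires removing about 67% of the current cycles per operation. Cutting the lookup share from 41.5% to 20-25% projects to roughly 12.6-13.5M ops/s, and removing it entirely to about 16.8M ops/s. The target therefore likely depends on combining several priorities, on knock-on effects beyond the lookup functions' own self time, or on recovering the historical LEGACY-profile throughput.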

---

## Appendix A: Full Perf Report Output

```
# Samples: 656  of event 'cycles:u'
# Event count (approx.): 409,174,521
#
# Overhead  Symbol
# ........  .....................................
     9.32%  [.] hak_super_lookup
     8.23%  [.] mid_desc_lookup
     5.87%  [.] hak_pool_get_class_index
     5.76%  [.] classify_ptr
     5.52%  [.] hak_pool_free_v1_impl
     5.46%  [.] hak_pool_try_alloc_v1_impl
     4.54%  [.] free
     4.47%  [.] worker_run
     4.35%  [.] ss_map_lookup
     4.32%  [.] super_reg_effective_mask
     3.69%  [.] mid_desc_hash
     3.27%  [.] mid_set_header
     3.17%  [.] mid_page_inuse_dec_and_maybe_dn
     2.71%  [.] mid_desc_init_once
     2.60%  [.] malloc
     2.53%  [.] hak_free_at
     2.17%  [.] hak_pool_mid_lookup_v1_impl
     1.87%  [.] super_reg_effective_size
     1.77%  [.] hak_pool_free_fast_v1_impl
     1.64%  [k] 0xffffffffae200ba0 (kernel)
     1.44%  [.] hak_pool_init
     1.42%  [.] hak_pool_is_poolable
     1.21%  [.] should_sample
     1.12%  [.] hak_pool_free
     1.11%  [.] hak_super_hash
     1.09%  [.] hak_pool_try_alloc
     0.95%  [.] mid_desc_lookup_cached
     0.93%  [.] hak_pool_v1_flatten_enabled
     0.76%  [.] hak_pool_v2_route
     0.57%  [.] ss_map_hash
     0.55%  [.] hak_in_wrapper
```

---

## Appendix B: Test Commands Summary

```bash
# Baseline
export HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1
export HAKMEM_BENCH_MIN_SIZE=257
export HAKMEM_BENCH_MAX_SIZE=768
./bench_mid_large_mt_hakmem 1 1000000 400 1
# Result: 9,836,420 ops/s

# Mimalloc comparison
./bench_mid_large_mt_mi 1 1000000 400 1
# Result: 51,297,877 ops/s (5.2x faster)

# Mid_desc cache OFF
export HAKMEM_MID_DESC_CACHE_ENABLED=0
./bench_mid_large_mt_hakmem 1 1000000 400 1
# Result: 9,805,954 ops/s

# Mid_desc cache ON
export HAKMEM_MID_DESC_CACHE_ENABLED=1
./bench_mid_large_mt_hakmem 1 1000000 400 1
# Result: 10,435,480 ops/s (+6.4%)

# Perf stat
perf stat -e cycles:u,instructions:u,task-clock,page-faults:u \
  ./bench_mid_large_mt_hakmem 1 1000000 400 1
# Result: 398M cycles, 1.05B instructions, IPC=2.64

# Perf record
perf record -F 5000 --call-graph dwarf -e cycles:u \
  -o perf.data.c6_flat ./bench_mid_large_mt_hakmem 1 1000000 400 1
perf report -i perf.data.c6_flat --stdio --no-children
```

---

**End of Report**