# Phase 17 v2 (FORCE_LIBC fix) + Phase 19-1b (FastLane Direct) — GO (+5.88%)

Commit ec87025da6 by Moe Charm (CI)
## Phase 17 v2: FORCE_LIBC Gap Validation Fix

**Critical bug fix**: the Phase 17 v1 measurement was broken

**Problem**: HAKMEM_FORCE_LIBC_ALLOC=1 was only consulted after the FastLane path, so the
same-binary A/B effectively compared "hakmem vs hakmem" (the earlier +0.39% result was a mismeasurement)

**Fix**: added an early bypass for g_force_libc_alloc==1 at core/box/hak_wrappers.inc.h:171 and :645,
going straight to __libc_malloc/__libc_free before any other wrapper logic (sketched below)
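
A minimal sketch of the bypass shape, assuming illustrative names for the surrounding wrapper paths; the real code at hak_wrappers.inc.h:171 and :645 will differ in detail:

```c
#include <stddef.h>

// Hypothetical sketch of the Phase 17 v2 early bypass.
extern void* __libc_malloc(size_t size);
extern void  __libc_free(void* ptr);
extern int   g_force_libc_alloc;        // assumed to be parsed from HAKMEM_FORCE_LIBC_ALLOC at startup

void* hakmem_malloc_impl(size_t size);  // stand-ins for the existing hakmem paths (assumed names)
void  hakmem_free_impl(void* ptr);

void* malloc(size_t size) {
    // Early bypass: decide before any FastLane/hakmem logic runs,
    // so FORCE_LIBC=1 really measures libc inside the same binary.
    if (g_force_libc_alloc == 1) return __libc_malloc(size);
    return hakmem_malloc_impl(size);
}

void free(void* ptr) {
    if (g_force_libc_alloc == 1) { __libc_free(ptr); return; }
    hakmem_free_impl(ptr);
}
```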

**Result**: correct same-binary A/B measurement
- hakmem (FORCE_LIBC=0): 48.99M ops/s
- libc (FORCE_LIBC=1): 79.72M ops/s (+62.7%)
- system binary: 88.06M ops/s (+10.5% vs libc)

**Gap breakdown**:
- Allocator difference: +62.7% (the main battleground)
- Layout penalty: +10.5% (secondary)
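
These percentages follow directly from the three throughput figures above:

$$
\frac{79.72}{48.99} - 1 \approx +62.7\%\ \text{(allocator difference)},\qquad
\frac{88.06}{79.72} - 1 \approx +10.5\%\ \text{(layout penalty)}
$$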

**Conclusion**: Case A confirmed (allocator dominant, NOT layout).
The Phase 17 v1 Case B verdict was wrong.

Files:
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_AB_TEST_RESULTS.md (v2)
- docs/analysis/PHASE17_FORCE_LIBC_GAP_VALIDATION_1_NEXT_INSTRUCTIONS.md (updated)

---

## Phase 19: FastLane Instruction Reduction Analysis

**Goal**: reduce the instruction gap vs libc (libc executes 35% fewer instructions and 56% fewer branches per op)

**perf stat analysis** (FORCE_LIBC=0 vs 1, 200M ops):
- hakmem: 209.09 instructions/op, 52.33 branches/op
- libc: 135.92 instructions/op, 22.93 branches/op
- Delta: +73.17 instructions/op (+53.8%), +29.40 branches/op (+128.2%)

**Hot path** (perf report):
- front_fastlane_try_free: 23.97% cycles
- malloc wrapper: 23.84% cycles
- free wrapper: 6.82% cycles
- **Wrapper overhead: ~55% of all cycles**

**Reduction candidates**:
- A: remove the wrapper layer (-17.5 inst/op, +10-15% expected)
- B: consolidate ENV snapshot checks (-10.0 inst/op, +5-8%)
- C: remove stats counters (-5.0 inst/op, +3-5%)
- D: inline header validation (-4.0 inst/op, +2-3%)
- E: route fast path (-3.5 inst/op, +2-3%)

Files:
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_1_DESIGN.md
- docs/analysis/PHASE19_FASTLANE_INSTRUCTION_REDUCTION_2_NEXT_INSTRUCTIONS.md

---

## Phase 19-1b: FastLane Direct — GO (+5.88%)

**Strategy**: bypass the wrapper layer and call the core allocator directly
- free() → free_tiny_fast() (not free_tiny_fast_hot)
- malloc() → malloc_tiny_fast()

**Why Phase 19-1 was a NO-GO (-3.81%)**:
1. __builtin_expect(fastlane_direct_enabled(), 0) backfired (made the A/B unfair)
2. free_tiny_fast_hot() was the wrong choice (free_tiny_fast() is the winning path)

**Phase 19-1b fixes**:
1. removed __builtin_expect()
2. call free_tiny_fast() directly

**Result** (Mixed, 10-run, 20M iters, ws=400):
- Baseline (FASTLANE_DIRECT=0): 49.17M ops/s
- Optimized (FASTLANE_DIRECT=1): 52.06M ops/s
- **Delta: +5.88%** (clears the +5% GO threshold)

**perf stat** (200M iters):
- Instructions/op: 199.90 → 169.45 (-30.45, -15.23%)
- Branches/op: 51.49 → 41.52 (-9.97, -19.36%)
- Cycles/op: 88.88 → 84.37 (-4.51, -5.07%)
- I-cache miss: 111K → 98K (-11.79%)

**Trade-offs** (acceptable):
- iTLB miss: +41.46% (front-end cost)
- dTLB miss: +29.15% (backend cost)
- Overall gain (+5.88%) outweighs costs

**Implementation**:
1. **ENV gate**: core/box/fastlane_direct_env_box.{h,c}
   - HAKMEM_FASTLANE_DIRECT=0/1 (default: 0, opt-in)
   - Single _Atomic global (solves the wrapper caching problem)

2. **Wrapper changes**: core/box/hak_wrappers.inc.h
   - malloc: direct call to malloc_tiny_fast() when FASTLANE_DIRECT=1
   - free: direct call to free_tiny_fast() when FASTLANE_DIRECT=1 (condensed sketch after this list)
   - Safety: the direct path is not used when !g_initialized; the fallback is kept

3. **Preset promotion**: core/bench_profile.h:88
   - bench_setenv_default("HAKMEM_FASTLANE_DIRECT", "1")
   - Comment: +5.88% proven on Mixed, 10-run

4. **cleanenv update**: scripts/run_mixed_10_cleanenv.sh:22
   - HAKMEM_FASTLANE_DIRECT=${HAKMEM_FASTLANE_DIRECT:-1}
   - promoted the same way as Phase 9/10
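
A condensed sketch of how items 1-2 fit together. Helper names and the free_tiny_fast() return convention (nonzero when it handled the pointer) are assumptions for illustration; the real code lives in core/box/fastlane_direct_env_box.{h,c} and hak_wrappers.inc.h:

```c
#include <stdatomic.h>
#include <stdlib.h>

// --- ENV gate (item 1): single _Atomic global, parsed once and cached ---
// -1 = not yet parsed, 0 = off (default), 1 = on. Parsing details are assumed.
static _Atomic int g_fastlane_direct = -1;

static inline int fastlane_direct_enabled(void) {
    int v = atomic_load_explicit(&g_fastlane_direct, memory_order_relaxed);
    if (v < 0) {
        const char* e = getenv("HAKMEM_FASTLANE_DIRECT");
        v = (e && e[0] == '1') ? 1 : 0;   // opt-in, default 0
        atomic_store_explicit(&g_fastlane_direct, v, memory_order_relaxed);
    }
    return v;
}

// --- Wrapper (item 2): direct call, with the existing fallback kept ---
extern int  g_initialized;            // existing init flag referenced in the safety note
int  free_tiny_fast(void* ptr);       // assumed: returns nonzero when it handled the pointer
void free_fastlane_path(void* ptr);   // hypothetical stand-in for the existing FastLane/cold path

void free(void* ptr) {
    if (!ptr) return;
    if (g_initialized && fastlane_direct_enabled() && free_tiny_fast(ptr)) {
        return;                       // handled by the core allocator directly
    }
    free_fastlane_path(ptr);          // unchanged fallback (also used when !g_initialized)
}
```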

**Verdict**: GO — adopted on the mainline; preset promotion complete

**Rollback**: set HAKMEM_FASTLANE_DIRECT=0 to return to the existing FastLane path

Files:
- core/box/fastlane_direct_env_box.{h,c} (new)
- core/box/hak_wrappers.inc.h (modified)
- core/bench_profile.h (preset promotion)
- scripts/run_mixed_10_cleanenv.sh (ENV default aligned)
- Makefile (new obj)
- docs/analysis/PHASE19_1B_FASTLANE_DIRECT_REVISED_AB_TEST_RESULTS.md

---

## Cumulative Performance

- Baseline (all optimizations OFF): ~40M ops/s (estimated)
- Current (Phase 19-1b): 52.06M ops/s
- **Cumulative gain: ~+30% from baseline**

Remaining gap to libc (79.72M):
- Current: 52.06M ops/s
- Target: 79.72M ops/s
- **Gap: +53.2%** (was +62.7% before Phase 19-1b)

Next: Phase 19-2 (ENV snapshot consolidation, +5-8% expected)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---

# Phase 19: FastLane Instruction Reduction - Design Document

## 0. Executive Summary

- **Goal**: Reduce the instruction/branch count gap between hakmem and libc to close the throughput gap
- **Current gap**: hakmem 44.88M ops/s vs libc 77.62M ops/s (+73.0% advantage for libc)
- **Target**: Reduce the instruction gap from +53.8% to <+25%, targeting a +15-25% throughput improvement
- **Success criteria**: Achieve 52-56M ops/s (from the current 44.88M ops/s)

### Key Findings

Per-operation overhead comparison (200M ops):

| Metric | hakmem | libc | Delta | Delta % |
|---|---|---|---|---|
| Instructions/op | 209.09 | 135.92 | +73.17 | +53.8% |
| Branches/op | 52.33 | 22.93 | +29.40 | +128.2% |
| Cycles/op | 96.48 | 54.69 | +41.79 | +76.4% |
| Branch-miss % | 2.22% | 2.87% | -0.65% | Better |

**Critical insight**: hakmem executes 73 extra instructions and 29 extra branches per operation vs libc. This massive overhead accounts for the entire throughput gap.
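
For reference, each per-op figure is the corresponding Appendix 10.1 total divided by the 200M operations measured; for hakmem:

$$
\frac{41{,}817{,}886{,}925\ \text{instructions}}{200\times 10^{6}\ \text{ops}} \approx 209.09,\qquad
\frac{19{,}296{,}118{,}430\ \text{cycles}}{200\times 10^{6}\ \text{ops}} \approx 96.48
$$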


## 1. Gap Analysis (Per-Operation Breakdown)

### 1.1 Instruction Gap: +73.17 instructions/op (+53.8%)

This excess comes from multiple layers of overhead:

- FastLane wrapper checks: ENV gates, class mask validation, size checks
- Policy snapshot overhead: TLS reads for routing decisions (3+ reads even with ENV snapshot)
- Route determination: static route table lookup vs direct path
- Multiple ENV gates: scattered throughout the hot path (DUALHOT, LEGACY_DIRECT, C7_ULTRA, etc.)
- Stats counters: atomic increments on the hot path (FREE_PATH_STAT_INC, ALLOC_GATE_STAT_INC, etc.)
- Header validation duplication: FastLane + free_tiny_fast both validate the header

### 1.2 Branch Gap: +29.40 branches/op (+128.2%)

Branching is 2.3x worse than the instruction gap:

- Cascading ENV checks: each layer adds 1-2 branches (g_initialized, class_mask, DUALHOT, C7_ULTRA, LEGACY_DIRECT)
- Route dispatch: static route check + route_kind switch
- Early-exit patterns: multiple if-checks for ULTRA/DUALHOT/LEGACY paths
- Stats gating: if (__builtin_expect(...)) patterns around counters

### 1.3 Why the Cycles/op Gap is Smaller Than Expected

Despite the +76.4% cycle gap, the CPU is achieving 2.17 IPC (hakmem) vs 2.49 IPC (libc). This suggests:

- Good CPU pipelining: the branch predictor is working well (2.22% miss rate)
- I-cache locality: code is reasonably compact despite the extra instructions
- But: we still pay for every extra branch in pipeline stalls
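
The IPC values are simply the per-op ratios restated, matching the "insn per cycle" figures perf stat reports in Appendix 10.1:

$$
\text{IPC}_{\text{hakmem}} = \frac{209.09}{96.48} \approx 2.17,\qquad
\text{IPC}_{\text{libc}} = \frac{135.92}{54.69} \approx 2.49
$$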

## 2. Hot Path Breakdown (perf report)

Top 10 hot functions (% of cycles):

| Function | % time | Category | Reduction target? |
|---|---|---|---|
| front_fastlane_try_free | 23.97% | Wrapper | YES (remove layer) |
| malloc | 23.84% | Wrapper | YES (remove layer) |
| main | 22.02% | Benchmark | (baseline) |
| free | 6.82% | Wrapper | YES (remove layer) |
| unified_cache_push | 4.44% | Core | Optimize later |
| tiny_header_finalize_alloc | 4.34% | Core | Optimize later |
| tiny_c7_ultra_alloc | 3.38% | Core | Optimize later |
| tiny_c7_ultra_free | 2.07% | Core | Optimize later |
| hakmem_env_snapshot_enabled | 1.22% | ENV | YES (eliminate checks) |
| hak_super_lookup | 0.98% | Core | Optimize later |

Critical observation: the top three user-space functions are all wrappers:

- front_fastlane_try_free (23.97%) + free (6.82%) = 30.79% on the free wrappers
- malloc (23.84%) on the alloc wrapper
- Combined wrapper overhead: 54.63% of all cycles

### 2.1 front_fastlane_try_free Annotated Breakdown

From perf annotate, the hot path has these expensive operations.

Header validation (lines 1c786-1c791, ~3% of samples):

```asm
movzbl -0x1(%rbp),%ebx          # Load header byte
mov    %ebx,%eax                # Copy to eax
and    $0xfffffff0,%eax         # Extract magic (0xA0)
cmp    $0xa0,%al                # Check magic
jne    ... (fallback)           # Branch on mismatch
```

ENV snapshot checks (lines 1c7ff-1c822, ~7% of samples):

```asm
cmpl   $0x1,0x628fa(%rip)       # g_hakmem_env_snapshot_ctor_mode (3.01%)
mov    0x628ef(%rip),%r15d      # g_hakmem_env_snapshot_gate (1.36%)
je     ...
cmp    $0xffffffff,%r15d
je     ... (init path)
test   %r15d,%r15d
jne    ... (snapshot path)
```

Class routing overhead (lines 1c7d1-1c7fb, ~3% of samples):

```asm
mov    0x6299c(%rip),%r15d      # g.5.lto_priv.0 (policy gate)
cmp    $0x1,%r15d
jne    ... (fallback)
movzbl 0x6298f(%rip),%eax       # g_mask.3.lto_priv.0
cmp    $0xff,%al
je     ... (all-classes path)
movzbl %al,%r9d
bt     %r13d,%r9d               # Bit test class mask
jae    ... (fallback)
```

Total overhead: ~15-20% of the cycles in front_fastlane_try_free are spent on:

- Header validation (repeated again in free_tiny_fast)
- ENV snapshot probing
- Policy/route checks

## 3. Reduction Candidates (Prioritized by ROI)

### Candidate A: Eliminate FastLane Wrapper Layer (Highest ROI)

**Problem**: front_fastlane_try_free + free wrappers consume 30.79% of cycles

**Root cause**: double header validation + ENV checks + class mask checks

**Proposal**: direct call to free_tiny_fast() from the free() wrapper

Implementation:

```c
// In the free() wrapper:
void free(void* ptr) {
    if (__builtin_expect(!ptr, 0)) return;

    // Phase 19-A: Direct call (no FastLane layer)
    if (free_tiny_fast(ptr)) {
        return;  // Handled
    }

    // Fallback to cold path
    free_cold(ptr);
}
```

Reduction estimate:

- Instructions: -15-20/op (eliminate the duplicate header read, ENV checks, class mask checks)
- Branches: -5-7/op (remove FastLane gate checks)
- Impact: ~10-15% throughput improvement (removes the ~30% wrapper overhead)

**Risk**: LOW (free_tiny_fast already has validation + routing logic)


### Candidate B: Consolidate ENV Snapshot Checks (High ROI)

**Problem**: the ENV snapshot is checked 3+ times per operation:

1. FastLane entry: g_initialized check
2. Route determination: hakmem_env_snapshot_enabled() check
3. Route-specific: tiny_c7_ultra_enabled_env() check
4. Legacy fallback: another ENV snapshot check

**Proposal**: single ENV snapshot read at entry, pass context down

Implementation:

```c
// Phase 19-B: ENV context struct
typedef struct {
    bool c7_ultra_enabled;
    bool dualhot_enabled;
    bool legacy_direct_enabled;
    SmallRouteKind route_kind[8];  // Pre-computed routes
} FastLaneCtx;

static __thread FastLaneCtx g_fastlane_ctx = {0};
static __thread int g_fastlane_ctx_init = 0;

static inline const FastLaneCtx* fastlane_ctx_get(void) {
    if (__builtin_expect(g_fastlane_ctx_init == 0, 0)) {
        // One-time init per thread
        const HakmemEnvSnapshot* env = hakmem_env_snapshot();
        g_fastlane_ctx.c7_ultra_enabled = env->tiny_c7_ultra_enabled;
        // ... populate other fields
        g_fastlane_ctx_init = 1;
    }
    return &g_fastlane_ctx;
}
```

Reduction estimate:

- Instructions: -8-12/op (eliminate redundant TLS reads)
- Branches: -3-5/op (single init check instead of multiple)
- Impact: ~5-8% throughput improvement

**Risk**: MEDIUM (must handle ENV changes at runtime; an invalidation hook is needed, sketched below)
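
One possible shape for that invalidation hook, as a sketch with hypothetical names: a global generation counter is bumped whenever the ENV snapshot changes, and each thread repopulates its FastLaneCtx when its cached generation falls behind.

```c
#include <stdatomic.h>

// Hypothetical invalidation scheme for the TLS FastLaneCtx above.
static _Atomic unsigned g_fastlane_env_gen = 1;   // bumped on every ENV snapshot change
static __thread unsigned g_fastlane_ctx_gen = 0;  // generation the TLS ctx was built from

// Called from the (assumed) env-snapshot change path.
static inline void fastlane_env_invalidate(void) {
    atomic_fetch_add_explicit(&g_fastlane_env_gen, 1, memory_order_relaxed);
}

// Replaces the plain g_fastlane_ctx_init check in fastlane_ctx_get().
static inline int fastlane_ctx_stale(void) {
    return g_fastlane_ctx_gen !=
           atomic_load_explicit(&g_fastlane_env_gen, memory_order_relaxed);
}
```

On a stale read, fastlane_ctx_get() would repopulate the struct and store the current generation before returning.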


### Candidate C: Remove Stats Counters from the Hot Path (Medium ROI)

**Problem**: stats counters on the hot path add atomic increments:

- FRONT_FASTLANE_STAT_INC(free_total) (every op)
- FREE_PATH_STAT_INC(total_calls) (every op)
- ALLOC_GATE_STAT_INC(total_calls) (every alloc)
- tiny_front_free_stat_inc(class_idx) (every free)

**Proposal**: make stats DEBUG-only or sample-based (1-in-N)

Implementation:

```c
// Phase 19-C: Sampling-based stats
#if !HAKMEM_BUILD_RELEASE
    static __thread uint32_t g_stat_counter = 0;
    if (__builtin_expect((++g_stat_counter & 0xFFF) == 0, 0)) {
        // Sample 1-in-4096 operations
        FRONT_FASTLANE_STAT_INC(free_total);
    }
#endif
```

Reduction estimate:

- Instructions: -4-6/op (remove atomic increments)
- Branches: -2-3/op (remove if (__builtin_expect(...)) checks)
- Impact: ~3-5% throughput improvement

**Risk**: LOW (stats are already compile-time optional)


### Candidate D: Inline Header Validation (Medium ROI)

**Problem**: header validation happens twice:

1. FastLane wrapper: *((uint8_t*)ptr - 1) (lines 179-191 in front_fastlane_box.h)
2. free_tiny_fast: the same check (lines 598-605 in malloc_tiny_fast.h)

**Proposal**: trust the FastLane validation, remove the duplicate check

Implementation:

```c
// Phase 19-D: Add a "trusted" variant
static inline int free_tiny_fast_trusted(void* ptr, int class_idx, void* base) {
    // Skip header validation (caller already validated)
    // Direct to route dispatch
    ...
}

// In FastLane:
uint8_t header = *((uint8_t*)ptr - 1);
int class_idx = header & 0x0F;
void* base = tiny_user_to_base_inline(ptr);
return free_tiny_fast_trusted(ptr, class_idx, base);
```

Reduction estimate:

- Instructions: -3-5/op (remove the duplicate header load + extract)
- Branches: -1-2/op (remove the duplicate magic check)
- Impact: ~2-3% throughput improvement

**Risk**: MEDIUM (must ensure all callers validate the header)


### Candidate E: Static Route Table Optimization (Lower ROI)

**Problem**: route determination uses TLS lookups + bit tests:

```c
if (tiny_static_route_ready_fast()) {
    route_kind = tiny_static_route_get_kind_fast(class_idx);
} else {
    route_kind = tiny_policy_hot_get_route(class_idx);
}
```

**Proposal**: pre-compute common routes at init, inline direct paths

Implementation:

```c
// Phase 19-E: Route fast path (C0-C3 LEGACY, C7 ULTRA)
static __thread uint8_t g_route_fastmap = 0;  // bit 0=C0 ... bit 7=C7, 1=LEGACY

static inline bool is_legacy_route_fast(int class_idx) {
    return (g_route_fastmap >> class_idx) & 1;
}
```

Reduction estimate:

- Instructions: -3-4/op (replace a function call with a bit test)
- Branches: -1-2/op (replace nested ifs with a single bit test)
- Impact: ~2-3% throughput improvement

**Risk**: LOW (the route table is already static)


## 4. Combined Impact Estimate

Assuming independent reductions, combined at 80% efficiency to account for overlap (conservative estimate):

| Candidate | Instructions/op | Branches/op | Throughput |
|---|---|---|---|
| Baseline | 209.09 | 52.33 | 44.88M ops/s |
| A: Remove FastLane layer | -17.5 | -6.0 | +12% |
| B: ENV snapshot consolidation | -10.0 | -4.0 | +6% |
| C: Stats removal (Release) | -5.0 | -2.5 | +4% |
| D: Inline header validation | -4.0 | -1.5 | +2% |
| E: Static route fast path | -3.5 | -1.5 | +2% |
| Combined (80% efficiency) | -32.0 | -12.4 | +21% |

Projected outcome:

- Instructions/op: 209.09 → 177.09 (vs libc 135.92; gap reduced from +53.8% to +30.3%)
- Branches/op: 52.33 → 39.93 (vs libc 22.93; gap reduced from +128.2% to +74.1%)
- Throughput: 44.88M → 54.3M ops/s (vs libc 77.62M; gap reduced from +73.0% to +43.0%)

Achievement vs goal: ✓ within the target range (+21% against the +15-25% goal)
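
For clarity, the Combined row applies the 80% efficiency factor to the column sums of candidates A-E, and the same discount to the summed throughput gains:

$$
0.8 \times (17.5 + 10.0 + 5.0 + 4.0 + 3.5) = 32.0\ \text{insn/op},\qquad
0.8 \times (6.0 + 4.0 + 2.5 + 1.5 + 1.5) = 12.4\ \text{branches/op},
$$

$$
0.8 \times (12 + 6 + 4 + 2 + 2)\% \approx +21\%\ \text{throughput}.
$$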


## 5. Implementation Plan

### Phase 19-1: Remove FastLane Wrapper Layer (A)

**Priority**: P0 (highest ROI). **Effort**: 2-3 hours. **Risk**: Low (free_tiny_fast is already complete)

Steps:

1. Modify the free() wrapper to directly call free_tiny_fast(ptr)
2. Modify the malloc() wrapper to directly call malloc_tiny_fast(size)
3. Measure: expect +10-15% throughput
4. Fallback: keep FastLane as a compile-time option

### Phase 19-2: ENV Snapshot Consolidation (B)

**Priority**: P1 (high ROI, moderate risk). **Effort**: 4-6 hours. **Risk**: Medium (ENV invalidation needed)

Steps:

1. Create the FastLaneCtx struct with pre-computed ENV state
2. Add a TLS cache with an invalidation hook
3. Replace scattered ENV checks with a single context read
4. Measure: expect +5-8% throughput on top of Phase 19-1
5. Fallback: ENV-gate the new path (HAKMEM_FASTLANE_ENV_CTX=1)

### Phase 19-3: Stats Removal (C) + Header Inline (D)

**Priority**: P2 (medium ROI, low risk). **Effort**: 2-3 hours. **Risk**: Low (already compile-time optional)

Steps:

1. Make stats sample-based (1-in-4096) in Release builds
2. Add a free_tiny_fast_trusted() variant (skip header validation)
3. Measure: expect +3-5% throughput on top of Phase 19-2
4. Fallback: compile-time flags for both features

### Phase 19-4: Static Route Fast Path (E)

**Priority**: P3 (lower ROI, polish). **Effort**: 2-3 hours. **Risk**: Low (the route table is static)

Steps:

1. Add a g_route_fastmap TLS cache
2. Replace function calls with bit tests
3. Measure: expect +2-3% throughput on top of Phase 19-3
4. Fallback: keep the existing path as a fallback

## 6. Box Theory Compliance

### Boundary Preservation

- L0 (ENV): keep existing ENV gates, add new ones for each optimization
- L1 (Hot inline): free_tiny_fast(), malloc_tiny_fast() remain unchanged
- L2 (Cold fallback): free_cold(), malloc_cold() remain unchanged
- L3 (Stats): make optional via #if guards

### Reversibility

- Each phase is ENV-gated (can be reverted at runtime)
- Compile-time fallback preserved (HAKMEM_BUILD_RELEASE controls stats)
- The FastLane layer can be kept as a compile-time option for A/B testing

### Incremental Rollout

- Phase 19-1: remove wrapper (default ON)
- Phase 19-2: ENV context (default OFF, opt-in for testing)
- Phase 19-3: stats/header (default ON in Release, OFF in Debug)
- Phase 19-4: route fast path (default ON)

## 7. Validation Checklist

After each phase:

- Run perf stat (compare instructions/branches/cycles per op)
- Run perf record + annotate (verify hot-path reduction)
- Run the benchmark suite (Mixed, C6-heavy, C7-heavy)
- Check correctness (Larson, multithreaded, stress tests)
- Measure RSS/memory overhead (should be unchanged)
- A/B test (ENV toggle to verify reversibility)

Success criteria:

- Throughput improvement matches the estimate (±20%)
- Instruction count reduction matches the estimate (±20%)
- Branch count reduction matches the estimate (±20%)
- No correctness regressions (all tests pass)
- No memory overhead increase (RSS unchanged)

## 8. Risk Assessment

### High-Risk Areas

1. ENV invalidation (Phase 19-2): runtime ENV changes could break the cached context
   - Mitigation: use invalidation hooks (the existing hakmem_env_snapshot infrastructure; see also the generation-counter sketch under Candidate B)
   - Fallback: revert to scattered ENV checks
2. Header validation trust (Phase 19-3D): skipping validation could miss corruption
   - Mitigation: keep validation in Debug builds plus extensive testing (a debug-only guard is sketched below)
   - Fallback: compile-time option to keep the duplicate checks
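
A minimal sketch of that debug-only guard, assuming the 0xA0 top-nibble magic seen in the annotated assembly of section 2.1 (the helper name is hypothetical):

```c
#include <assert.h>
#include <stdint.h>

// Debug builds re-validate the header that Release builds trust the caller for.
static inline void tiny_header_debug_check(const void* ptr) {
#ifndef NDEBUG
    uint8_t hdr = ((const uint8_t*)ptr)[-1];
    assert((hdr & 0xF0) == 0xA0 && "tiny header magic mismatch in trusted free path");
#else
    (void)ptr;
#endif
}
```

free_tiny_fast_trusted() would call this at entry, so the duplicate check costs nothing in Release builds.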

### Medium-Risk Areas

1. FastLane removal (Phase 19-1): could break gradual rollout (class_mask filtering)
   - Mitigation: keep class_mask filtering in the FastLane path only (the direct path always falls back safely)
   - Fallback: keep FastLane as a compile-time option

### Low-Risk Areas

1. Stats removal (Phase 19-3C): already compile-time optional
2. Route fast path (Phase 19-4): the route table is static, no runtime changes

## 9. Future Optimization Opportunities (Post-Phase 19)

After Phase 19 closes the wrapper gap, the next targets are:

1. Unified Cache optimization (4.44% of cycles):
   - Reduce cache-miss overhead (refill path)
   - Optimize the LIFO vs ring-buffer trade-off
2. Header finalization (4.34% of cycles):
   - Investigate always_inline for tiny_header_finalize_alloc()
   - Reduce metadata writes (defer to a batch update)
3. C7 ULTRA optimization (3.38% + 2.07% = 5.45% of cycles):
   - Investigate TLS cache locality
   - Reduce ULTRA push/pop overhead
4. Super lookup optimization (0.98% of cycles):
   - Already optimized in Phase 12 (mask-based)
   - Further reduction may require architectural changes

Estimated ceiling: with all optimizations, hakmem could approach ~65-70M ops/s (vs libc 77.62M).
Remaining gap: likely fundamental architectural differences (thread-local vs global allocator).


## 10. Appendix: Detailed perf Data

### 10.1 perf stat Results (200M ops)

hakmem (FORCE_LIBC=0):

```text
Performance counter stats for 'bench_random_mixed_hakmem ... HAKMEM_FORCE_LIBC_ALLOC=0':

    19,296,118,430  cycles
    41,817,886,925  instructions              #  2.17  insn per cycle
    10,466,190,806  branches
       232,592,257  branch-misses             #  2.22% of all branches
         1,660,073  cache-misses
           134,601  L1-icache-load-misses

       4.913685503 seconds time elapsed
```

Throughput: 44.88M ops/s

libc (FORCE_LIBC=1):

```text
Performance counter stats for 'bench_random_mixed_hakmem ... HAKMEM_FORCE_LIBC_ALLOC=1':

    10,937,550,228  cycles
    27,183,469,339  instructions              #  2.49  insn per cycle
     4,586,617,379  branches
       131,515,905  branch-misses             #  2.87% of all branches
           767,370  cache-misses
            64,102  L1-icache-load-misses

       2.835174452 seconds time elapsed
```

Throughput: 77.62M ops/s

### 10.2 Top 30 Hot Functions (perf report)

```text
    23.97%  front_fastlane_try_free.lto_priv.0
    23.84%  malloc
    22.02%  main
     6.82%  free
     4.44%  unified_cache_push.lto_priv.0
     4.34%  tiny_header_finalize_alloc.lto_priv.0
     3.38%  tiny_c7_ultra_alloc.constprop.0
     2.07%  tiny_c7_ultra_free
     1.22%  hakmem_env_snapshot_enabled.lto_priv.0
     0.98%  hak_super_lookup.part.0.lto_priv.4.lto_priv.0
     0.85%  hakmem_env_snapshot.lto_priv.0
     0.82%  hak_pool_free_v1_slow_impl
     0.59%  tiny_front_v3_snapshot_get.lto_priv.0
     0.30%  __memset_avx2_unaligned_erms (libc)
     0.30%  tiny_unified_lifo_enabled.lto_priv.0
     0.28%  hak_free_at.constprop.0
     0.24%  hak_pool_try_alloc.part.0
     0.24%  malloc_cold
     0.16%  hak_pool_try_alloc_v1_impl.part.0
     0.14%  free_cold.constprop.0
     0.13%  mid_inuse_dec_deferred
     0.12%  hak_pool_mid_lookup
     0.12%  do_user_addr_fault (kernel)
     0.11%  handle_pte_fault (kernel)
     0.11%  __mod_memcg_lruvec_state (kernel)
     0.10%  do_anonymous_page (kernel)
     0.09%  classify_ptr
     0.07%  tiny_get_max_size.lto_priv.0
     0.06%  __handle_mm_fault (kernel)
     0.06%  __alloc_pages (kernel)
```

## 11. Conclusion

Phase 19 has clear, actionable targets with high ROI:

1. Immediate action (Phase 19-1): remove the FastLane wrapper layer
   - Expected: +10-15% throughput
   - Risk: low
   - Effort: 2-3 hours
2. Follow-up (Phases 19-2 to 19-4): ENV consolidation + stats + route optimization
   - Expected: +6-11% additional throughput
   - Risk: medium (ENV invalidation)
   - Effort: 8-12 hours

Combined target: +21% throughput (44.88M → 54.3M ops/s).
Gap closure: reduce the instruction gap from +53.8% to +30.3% vs libc.

This positions hakmem for competitive performance while maintaining safety and Box Theory compliance.