hakmem/docs/analysis/PHASE1_EXECUTIVE_SUMMARY.md
Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
Phase 1 complete: environment variable cleanup + fprintf debug guards

ENV variables removed (BG/HotMag):
- core/hakmem_tiny_init.inc: removed HotMag ENV handling (~131 lines)
- core/hakmem_tiny_bg_spill.c: removed BG spill ENV handling
- core/tiny_refill.h: BG remote replaced with a fixed value
- core/hakmem_tiny_slow.inc: removed BG references

fprintf debug guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: lock stats (~18 fprintf calls)
- core/page_arena.c: init/shutdown/stats (~27 fprintf calls)
- core/hakmem.c: SIGSEGV init message

Documentation cleanup:
- deleted 328 markdown files (old reports and duplicate docs)

Performance check:
- Larson: 52.35M ops/s (previously 52.8M; stable)
- no functional impact from the ENV cleanup
- some debug output remains (to be handled in the next phase)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 14:45:26 +09:00


Phase 1 Quick Wins - Executive Summary

TL;DR: REFILL_COUNT optimization failed because we optimized the wrong thing. The real bottleneck is superslab_refill (28.56% CPU), not refill frequency.


The Numbers

| REFILL_COUNT | Throughput | L1d Miss Rate | Verdict |
|--------------|------------|---------------|---------|
| 32           | 4.19 M/s   | 12.88%        | OPTIMAL |
| 64           | 3.89 M/s   | 14.12%        | -7.2%   |
| 128          | 2.68 M/s   | 16.08%        | -36%    |

Root Causes

1. superslab_refill is the Bottleneck (28.56% CPU)

perf report (REFILL_COUNT=32):
  28.56%  superslab_refill  ← THIS IS THE PROBLEM
   3.10%  [kernel] (various)
   ...

Impact: Even eliminating ALL refill overhead would cut CPU time by at most 28.56%; that is the ceiling for any refill-side win. In reality, raising REFILL_COUNT made things worse.

2. Cache Pollution from Large Batches

REFILL_COUNT=32:  L1d miss rate = 12.88%
REFILL_COUNT=128: L1d miss rate = 16.08% (+25% worse!)

Why:

  • 128 blocks × 128 bytes = 16 KB
  • L1 cache = 32 KB total
  • Batch + working set > L1 capacity
  • Result: More cache misses, slower performance

3. Refill Frequency Already Low

Larson benchmark characteristics:

  • FIFO pattern with 1024 chunks per thread
  • High TLS freelist hit rate
  • Refills are rare, not frequent

Implication: Reducing refill frequency has minimal impact when refills are already uncommon.

4. memset is NOT in Hot Path

Search results:

memset found in:
  - hakmem_tiny_init.inc (one-time init)
  - hakmem_tiny_intel.inc (debug ring init)

Conclusion: memset removal would have ZERO impact on allocation performance.


Why Task Teacher's +31% Projection Failed

Expected:

REFILL 32→128: reduce calls by 4x → +31% speedup

Reality:

REFILL 32→128: -36% slowdown

Mistakes:

  1. Assumed refill is cheap (it's 28.56% of CPU)
  2. Assumed refills are frequent (they're rare in Larson)
  3. Ignored cache effects (L1d misses +25%)
  4. Used Larson-specific pattern (not generalizable)

Immediate Actions

DO THIS NOW

  1. Keep REFILL_COUNT=32 (optimal for Larson)
  2. Focus on superslab_refill optimization (28.56% CPU → biggest win)
  3. Profile superslab_refill internals:
    • Bitmap scanning
    • mmap syscalls
    • Metadata initialization

DO NOT DO THIS

  1. DO NOT increase REFILL_COUNT to 64+ (causes cache pollution)
  2. DO NOT optimize memset (not in hot path, waste of time)
  3. DO NOT trust Larson alone (need diverse benchmarks)

Next Steps (Priority Order)

🔥 P0: superslab_refill Deep Dive (This Week)

Hypothesis: 28.56% CPU in one function is unacceptable. Break it down:

superslab_refill() {
    // Profile each step:
    // 1. Bitmap scan to find a free slab   -> how much time?
    // 2. mmap() for a new SuperSlab        -> how much time?
    // 3. Metadata initialization           -> how much time?
    // 4. Slab carving / freelist setup     -> how much time?
}

Tools:

perf record -e cycles -g --call-graph=dwarf -- ./larson_hakmem ...
perf report --stdio -g --no-children | grep superslab

Expected outcome: Find sub-bottleneck, get 10-20% speedup by optimizing it.


🔥 P1: Cache-Aware Refill (Next Week)

Goal: Reduce L1d miss rate from 12.88% to <10%

Approach:

  1. Limit batch size to fit in L1 with working set

    • Current: REFILL_COUNT=32 (4KB for 128B class)
    • Test: REFILL_COUNT=16 (2KB)
    • Hypothesis: Smaller batches = fewer misses
  2. Prefetching

    • Prefetch next batch while using current batch
    • Reduces cache miss penalty
  3. Adaptive batch sizing

    • Small batches when working set is large
    • Large batches when working set is small

🔥 P2: Benchmark Diversity (Next 2 Weeks)

Problem: Larson is NOT representative

Larson characteristics:

  • FIFO allocation pattern
  • Fixed working set (1024 chunks)
  • Predictable sizes (8-128B)
  • High freelist hit rate

Need to test:

  1. Random allocation/free (not FIFO)
  2. Bursty allocations (malloc storms)
  3. Mixed lifetime (long-lived + short-lived)
  4. Variable sizes (less predictable)

Hypothesis: Other patterns may have different bottlenecks (refill frequency might matter more).


🔥 P3: Fast Path Simplification (Phase 6 Goal)

Long-term vision: Eliminate superslab_refill from hot path

Approach:

  1. Background refill thread

    • Keep freelists pre-filled
    • Allocation never waits for superslab_refill
  2. Lock-free slab exchange

    • Reduce atomic operations
    • Faster refill when needed
  3. System tcache study

    • Understand why System malloc is 3-4 instructions
    • Adopt proven patterns

Key Metrics to Track

Performance

  • Throughput: 4.19 M ops/s (Larson baseline)
  • superslab_refill CPU: 28.56% → target <10%
  • L1d miss rate: 12.88% → target <10%
  • IPC: 1.93 → maintain or improve

Health

  • Stability: Results should be consistent (±2%)
  • Memory usage: Monitor RSS growth
  • Fragmentation: Track over time

Data-Driven Checklist

Before ANY optimization:

  • Profile with perf record -g
  • Identify TOP bottleneck (>5% CPU)
  • Verify with perf stat (cache, branches, IPC)
  • Test with MULTIPLE benchmarks (not just Larson)
  • Document baseline metrics
  • A/B test changes (at least 3 runs each)
  • Verify improvements are statistically significant

Rule: If perf doesn't show it, don't optimize it.


Lessons Learned

  1. Profile first, optimize second

    • Task Teacher's intuition was wrong
    • Data revealed superslab_refill as real bottleneck
  2. Cache effects can reverse gains

    • More batching ≠ always faster
    • L1 cache is precious (32 KB)
  3. Benchmarks lie

    • Larson has special properties (FIFO, stable working set)
    • Real workloads may differ significantly
  4. Measure, don't guess

    • memset "optimization" would have been wasted effort
    • perf shows what actually matters

Final Recommendation

STOP optimizing refill frequency. START optimizing superslab_refill.

The data is clear: superslab_refill is 28.56% of CPU time. That's where the wins are.


Questions? See full report: PHASE1_REFILL_INVESTIGATION.md