Session Summary - 2025-10-26

TL;DR

Critical discovery: Every benchmark from Phase 1 through 4.1 was actually measuring glibc malloc. Measured correctly, hakmem turned out to be 8.8x slower than mimalloc.

Achievements:

  • Fixed the benchmark infrastructure (with safeguards so the same mistake cannot recur)
  • Accurate performance comparison: hakmem vs glibc vs mimalloc
  • Bottleneck analysis (estimated)
  • Created verification tooling and a checklist

Reality:

  • hakmem: 103 M ops/sec (9.7 ns/op)
  • mimalloc: 908 M ops/sec (1.1 ns/op) - 8.8x faster
  • glibc: 364 M ops/sec (2.7 ns/op) - 3.5x faster

📅 Timeline

Morning: Phase 4 Regression Analysis

  • Phase 4: 373-380 M ops/sec (-3.6% vs Phase 3)
  • Created PHASE4_REGRESSION_ANALYSIS.md
  • Created PHASE4_IMPROVEMENT_ROADMAP.md
  • ChatGPT Pro advice: Gating + Batching + pull-based refill

Afternoon: Phase 4 Improvements

  • Phase 4.1: Option A+B micro-optimization
    • Result: 381 M ops/sec (+1-2%)
  • Phase 4.2: High-water gating
    • Result: 103 M ops/sec (suspiciously low; explained below)

Evening: Critical Discovery 🚨

  • All benchmarks were measuring glibc malloc!
  • Root cause: Makefile implicit rule
  • bench_tiny.c didn't link hakmem

Night: Infrastructure Fix + Reality Check

  • Fixed Makefile (explicit targets)
  • Created verify_bench.sh
  • Created BENCHMARKING_CHECKLIST.md
  • True measurement: hakmem = 103 M ops/sec
  • Comparison: mimalloc = 908 M ops/sec (8.8x faster!)

🔍 Critical Discovery Details

The Mistake

What happened:

```
# Before (wrong)
make bench_tiny
→ gcc bench_tiny.c -o bench_tiny  # Makefile implicit rule!
→ No hakmem linkage
→ Using glibc malloc

# All reported numbers were glibc malloc:
Phase 3: 391 M ops/sec (glibc)
Phase 4: 373 M ops/sec (glibc)
Phase 4.1: 381 M ops/sec (glibc)
```

Root Cause:

  • Makefile had no explicit bench_tiny target
  • bench_tiny.c only calls malloc/free (no hakmem.h include)
  • System malloc was used by default
  • No errors → looked like it was working

Why performance "changed":

  • 361→391 M ops/sec (8.3% variation) was measurement noise
  • CPU load, cache state, system activity
  • No relation to hakmem code changes

The Fix

1. Explicit Makefile targets:

```makefile
TINY_BENCH_OBJS = hakmem.o hakmem_config.o ... (all hakmem objects)

bench_tiny: bench_tiny.o $(TINY_BENCH_OBJS)
	$(CC) -o $@ $^ $(LDFLAGS)
	@echo "✓ bench_tiny built with hakmem"
```

2. Verification script (verify_bench.sh):

```
$ ./verify_bench.sh ./bench_tiny
✅ hakmem symbols: 119
✅ Binary size: 156KB
✅ Verification PASSED
```

3. Checklist (BENCHMARKING_CHECKLIST.md):

  • Pre-benchmark verification steps
  • Post-benchmark validation
  • Prevents future mistakes

📊 True Performance Comparison

Benchmark Setup

Workload: bench_tiny

  • 16B allocations
  • 100 alloc → 100 free × 10M iterations
  • Total: 2B operations (1B alloc + 1B free); see the sketch below
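
A minimal sketch of this workload, for reference (illustrative; bench_tiny's actual source may differ):

```c
#include <stdlib.h>

/* Illustrative reconstruction of the bench_tiny workload (not the actual
 * source): 100 16-byte allocations, then 100 frees, repeated 10M times,
 * i.e. 1B mallocs + 1B frees = 2B operations total. */
#define BATCH 100
#define ITERS 10000000L

int main(void) {
    void *ptrs[BATCH];
    for (long i = 0; i < ITERS; i++) {
        for (int j = 0; j < BATCH; j++) ptrs[j] = malloc(16);
        for (int j = 0; j < BATCH; j++) free(ptrs[j]);
    }
    return 0;
}
```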

Environment:

  • Platform: WSL2 (Linux 5.15.167.4-microsoft-standard-WSL2)
  • CPU: (not specified, but modern x86-64)
  • Compiler: gcc -O3 -march=native -mtune=native

Results

| Allocator | Throughput | Latency | vs hakmem | Notes |
|-----------|------------|---------|-----------|-------|
| mimalloc | 908.31 M ops/sec | 1.1 ns/op | 8.8x faster | LD_PRELOAD=libmimalloc.so.2 |
| glibc | 364.00 M ops/sec | 2.7 ns/op | 3.5x faster | System default |
| hakmem | 103.35 M ops/sec | 9.7 ns/op | 1.0x (baseline) | Phase 4.2 |

Analysis

mimalloc dominance:

  • 908 M ops/sec is exceptionally fast
  • 1.1 ns/op ≈ 3-4 CPU cycles (assuming 3 GHz)
  • Near theoretical minimum for malloc/free
  • Result of 10+ years of optimization

hakmem gap:

  • 8.8x slower than mimalloc
  • 3.5x slower than glibc
  • 8.6 ns overhead vs mimalloc

Note: The earlier ANALYSIS_SUMMARY.md stated mimalloc = 14 ns/op, but that figure came from a different benchmark (likely with more complex allocation patterns).


🔬 Bottleneck Analysis

Estimated 8.6 ns Breakdown

Based on code review (perf unavailable in WSL2):

| Component | Estimated Cost | Details |
|-----------|----------------|---------|
| Registry lookup | 1-2 ns | Linear-probing hash (max 8 probes), lock-free |
| Bitmap operations | 1-2 ns | set_used/free + summary bitmap update |
| Stats (batched TLS) | 0.5 ns | TLS increment (Phase 3 optimization) |
| Mini-magazine | 0.5-1 ns | pop/push operations (Phase 2 addition) |
| Other overhead | 3-5 ns | Function calls, branches, cache misses |
| Total | 6.5-10.5 ns | Matches the measured 9.7 ns |

Key Findings

1. Registry lookup (O(1) but not free):

```c
static int g_use_registry = 1;  // Enabled by default
// registry_lookup(): lock-free but linear probing
```

  • O(1) average case
  • Lock-free (good!)
  • But: linear probing, up to 8 probes
  • Cost: ~1-2 ns per lookup (see the sketch below)
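
A minimal sketch of what a lock-free linear-probing lookup of this shape might look like (table size, hash, and names are assumptions, not hakmem's actual registry code):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define REG_SIZE  4096   /* power-of-two table size (assumed) */
#define MAX_PROBE 8

typedef struct { _Atomic uintptr_t key; void *slab; } reg_entry_t;
static reg_entry_t g_registry[REG_SIZE];

static void *registry_lookup(uintptr_t page) {
    size_t h = (page >> 12) & (REG_SIZE - 1);   /* page-granular hash */
    for (int i = 0; i < MAX_PROBE; i++) {
        size_t idx = (h + i) & (REG_SIZE - 1);
        uintptr_t k = atomic_load_explicit(&g_registry[idx].key,
                                           memory_order_acquire);
        if (k == page) return g_registry[idx].slab;  /* hit */
        if (k == 0)    return NULL;                  /* empty slot: miss */
    }
    return NULL;  /* probe limit reached */
}
```

Even a first-probe hit costs a shift, a mask, and two dependent loads, which is consistent with the 1-2 ns estimate above.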

2. Bitmap overhead:

```c
hak_tiny_set_used(slab, block_idx);
// Updates bitmap + summary bitmap
```

  • Two-tier bitmap for fast empty-word detection
  • Each set/free updates both levels (see the sketch below)
  • Cost: ~1-2 ns
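
A minimal sketch of how a two-tier bitmap update might work (the slab layout, word count, and names are assumptions for illustration):

```c
#include <stdint.h>

/* Hypothetical slab: 512 blocks tracked by eight 64-bit words, plus a
 * summary word whose bit w stays set while bitmap[w] has a free block. */
typedef struct {
    uint64_t bitmap[8];   /* 1 = block in use */
    uint64_t summary;     /* 1 = word still contains free blocks */
} slab_t;

static void set_used(slab_t *s, int idx) {
    int w = idx >> 6, b = idx & 63;
    s->bitmap[w] |= 1ULL << b;
    if (s->bitmap[w] == ~0ULL)       /* word just became full... */
        s->summary &= ~(1ULL << w);  /* ...so clear its summary bit */
}

static void set_free(slab_t *s, int idx) {
    int w = idx >> 6, b = idx & 63;
    s->bitmap[w] &= ~(1ULL << b);
    s->summary |= 1ULL << w;         /* word has a free block again */
}
```

The summary tier makes "find a word with a free block" a single bit-scan over one word, but it also means every set/free may touch both levels, which is where the extra cost comes from.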

3. Mini-magazine (Phase 2):

  • Added for "fast path" (1-2 ns vs 5-6 ns bitmap)
  • But: adds extra layer vs direct TLS cache
  • Cost: ~0.5-1 ns (see the sketch below)
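
A minimal sketch of such a fixed-capacity magazine (capacity and names are assumptions, not hakmem's actual code):

```c
#include <stddef.h>

/* Hypothetical mini-magazine: a small per-slab stack of recently freed
 * blocks, consulted before falling back to a bitmap scan. */
#define MAG_CAP 16

typedef struct { void *slots[MAG_CAP]; int top; } mini_mag_t;

static inline void *mag_pop(mini_mag_t *m) {
    return m->top > 0 ? m->slots[--m->top] : NULL;  /* NULL => bitmap path */
}

static inline int mag_push(mini_mag_t *m, void *p) {
    if (m->top == MAG_CAP) return 0;  /* full => caller uses bitmap path */
    m->slots[m->top++] = p;
    return 1;
}
```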

4. Stats (Phase 3 - optimized):

  • Batched TLS counters (0.5 ns)
  • Good! (vs 10-15 ns XOR RNG before)

5. Unaccounted overhead (~3-5 ns):

  • Function call overhead
  • Branch mispredictions
  • Cache misses
  • Data structure indirection

💡 Why is mimalloc so fast?

mimalloc Architecture (inferred)

1. Minimalist TLS cache:

  • Direct pointer to free list
  • Single pointer dereference
  • Near-zero overhead

2. Inline metadata:

  • Page metadata embedded in page
  • O(1) pointer→page calculation
  • No hash table needed (see the sketch below)
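
The usual trick is to align pages on a power-of-two boundary so the metadata address falls out of a single mask. A sketch of the inferred scheme (constants and names are illustrative, not mimalloc's actual code):

```c
#include <stdint.h>

#define PAGE_ALIGN (64 * 1024)  /* alignment granularity is an assumption */

typedef struct page_meta {
    void *free_list;            /* head of this page's free list */
    /* ... block size, owning thread, etc. ... */
} page_meta_t;

/* Any block pointer masked down to the page boundary yields the page
 * header that holds its metadata: no registry or hash table needed. */
static inline page_meta_t *ptr_to_page(void *p) {
    return (page_meta_t *)((uintptr_t)p & ~((uintptr_t)PAGE_ALIGN - 1));
}
```

Compared with hakmem's registry probe, this is a single AND instruction.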

3. Lock-free everywhere:

  • Thread-local pages
  • Atomic operations only for shared structures
  • No pthread mutexes in hot path

4. Cache-optimized:

  • Sequential layout
  • Prefetch-friendly
  • Minimal pointer chasing

5. Extreme simplicity:

  • Fewer layers
  • Fewer branches
  • Fewer memory accesses

hakmem's complexity:

  • TLS Magazine → Mini-magazine → Bitmap
  • 3 layers vs mimalloc's 1-2 layers
  • Each layer adds overhead (see the sketch below)
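
In outline, the hot path looks roughly like this (hypothetical structure with stub layers; the real function and layer names differ):

```c
#include <stdlib.h>

/* Stubs stand in for the real layers so the shape is visible. */
static void *tls_mag_pop(void)  { return NULL; }       /* layer 1 */
static void *mini_mag_pop(void) { return NULL; }       /* layer 2 */
static void *bitmap_alloc(void) { return malloc(16); } /* layer 3 */

/* Each miss falls through to the next, slower layer; mimalloc's hot
 * path stops after the equivalent of layer 1. */
static void *tiny_alloc_16(void) {
    void *p;
    if ((p = tls_mag_pop()) != NULL)  return p;
    if ((p = mini_mag_pop()) != NULL) return p;
    return bitmap_alloc();
}
```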

📈 Progress Summary

What Was Completed Today

Phase 1-3 (Previously, but measured wrong)

  • Phase 1: Mini-magazine infrastructure (measured glibc)
  • Phase 2: Hot path integration (measured glibc)
  • Phase 3: Stats batching (measured glibc)

Phase 4 (Attempted, but measured wrong)

  • Phase 4: TLS spill optimization (measured glibc)
  • Phase 4.1: Micro-optimization A+B (measured glibc)

Critical Fix (Actual Work)

  • Discovered benchmark infrastructure bug
  • Fixed Makefile (explicit targets)
  • Created verify_bench.sh (119 hakmem symbols detected)
  • Created BENCHMARKING_CHECKLIST.md (prevention)
  • Phase 4.2: High-water gating (first correct measurement!)

Analysis & Documentation

  • PHASE4_REGRESSION_ANALYSIS.md (16 KB)
  • PHASE4_IMPROVEMENT_ROADMAP.md (13 KB)
  • BENCHMARKING_CHECKLIST.md (comprehensive)
  • verify_bench.sh (automated verification)
  • Comparative benchmark (hakmem vs glibc vs mimalloc)
  • Bottleneck analysis (code review)

🎯 Reality Check

The Gap

```
mimalloc: 908 M ops/sec (1.1 ns/op)
hakmem:   103 M ops/sec (9.7 ns/op)
Gap:      8.8x
```

Is This Gap Closeable?

Honest Assessment:

Short-term (Quick Wins - weeks):

  • Target: 103 → 150 M ops/sec (+45%)
  • Method:
    • Reduce bitmap overhead
    • Optimize registry hash
    • Remove mini-magazine layer (if not helping)
    • Tune TLS Magazine capacity
  • Realistic: Achievable

Mid-term (Major Optimization - months):

  • Target: 150 → 250 M ops/sec (+140% vs the 103 M ops/sec baseline)
  • Method:
    • Redesign data structures
    • Inline metadata (like mimalloc)
    • Lock-free everywhere
    • Cache-optimized layout
  • Realistic: Difficult but possible

Long-term (Mimalloc-level - years):

  • Target: 250 → 700+ M ops/sec (+580% vs the 103 M ops/sec baseline)
  • Method:
    • Complete architectural overhaul
    • Learn from mimalloc source code
    • Remove all unnecessary layers
    • Extreme optimization
  • Realistic: Requires mimalloc-level expertise

Fundamental Challenge:

  • hakmem is a research allocator (visibility, experimentation)
  • mimalloc is a production allocator (speed, simplicity)
  • These goals are inherently in conflict

Research features that add overhead:

  • Bitmap (vs inline metadata)
  • Statistics (even batched TLS)
  • Multiple layers (TLS Mag → Mini-mag → Bitmap)
  • Flexibility (registry toggle, SuperSlab, etc.)

🚀 Next Steps

Option A: Research Focus (Pragmatic)

Goal: hakmem as a research platform, not a production allocator

Approach:

  1. Document current performance (103 M ops/sec baseline)
  2. Focus on research features:
    • Bitmap visibility for debugging
    • Statistics for analysis
    • Experimentation platform
  3. Optimize moderately:
    • Quick wins (103 → 150 M ops/sec)
    • Don't sacrifice research goals for speed
  4. Accept being 5-8x slower than mimalloc as a reasonable trade-off

Outcome: Honest, useful research tool


Option B: Pursue Performance (Challenging)

Goal: Close the gap to 2-3x slower than mimalloc

Phase 1: Quick Wins (Target: 150 M ops/sec)

  • Remove mini-magazine if not helping
  • Optimize registry hash (reduce collisions)
  • Tune TLS Magazine capacity
  • Inline hot functions
  • Profile with proper tools (perf, gprof)

Phase 2: Architectural (Target: 250 M ops/sec)

  • Inline metadata (like mimalloc)
  • Simplify data structures
  • Remove unnecessary layers
  • Cache-optimized layout

Phase 3: Extreme (Target: 400+ M ops/sec)

  • Study mimalloc source code deeply
  • Adopt mimalloc techniques
  • Sacrifice research features if needed

Outcome: High-performance allocator, less research-friendly


Option C: Hybrid Approach (Balanced)

Goal: Research + reasonable performance

Approach:

  1. Default mode: Research-friendly (bitmap, stats, visibility)
    • Performance: 100-150 M ops/sec
  2. Fast mode: Production-optimized (inline metadata, no stats)
    • Performance: 300-500 M ops/sec
    • Toggle: HAKMEM_FAST_MODE=1 (see the sketch below)

Outcome: Best of both worlds, more complexity
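
A rough illustration of how such a toggle might be wired up (HAKMEM_FAST_MODE does not exist yet; everything here is hypothetical):

```c
#include <stdlib.h>

static int g_fast_mode = -1;  /* -1 = not yet resolved */

/* Resolve the proposed HAKMEM_FAST_MODE environment toggle once. */
static inline int hak_fast_mode(void) {
    if (g_fast_mode < 0) {
        const char *e = getenv("HAKMEM_FAST_MODE");
        g_fast_mode = (e != NULL && e[0] == '1');
    }
    return g_fast_mode;
}

/* Hot paths would then skip research-mode bookkeeping, e.g.: */
#define HAK_STAT_INC(counter) \
    do { if (!hak_fast_mode()) (counter)++; } while (0)
```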


📚 Documentation Created Today

| File | Size | Purpose |
|------|------|---------|
| PHASE4_REGRESSION_ANALYSIS.md | 16 KB | Root cause analysis of the Phase 4 regression |
| PHASE4_IMPROVEMENT_ROADMAP.md | 13 KB | Phased improvement plan (4.1→4.2→4.3→4.4) |
| BENCHMARKING_CHECKLIST.md | 11 KB | Prevention guide (never repeat this mistake) |
| verify_bench.sh | 2 KB | Automated verification (119-symbol check) |
| bench_hakmem.sh | 2 KB | Benchmark script template |
| SESSION_SUMMARY_2025-10-26.md | (this file) | Comprehensive session summary |

Total: ~44 KB of documentation


🎓 Lessons Learned

Technical Lessons

  1. Makefile implicit rules are dangerous

    • Always define explicit targets
    • Especially for critical binaries
  2. "It works" ≠ "It's correct"

    • bench_tiny "worked" (no errors)
    • But it measured the wrong thing entirely
  3. Verification is essential

    • nm to check symbols
    • ldd to check libraries
    • size to check binary size
  4. Performance claims need causality

    • Code change → performance change?
    • Or just measurement noise?
  5. Benchmarking is hard

    • Same code
    • Same compiler flags
    • Same workload
    • Verify what you're measuring!

Process Lessons

  1. Document procedures

    • Checklists prevent mistakes
    • Future-you will thank you
  2. Automate verification

    • Scripts don't forget
    • verify_bench.sh (119 symbols)
  3. Question surprising results

    • "Phase 3 improved 1.3%" - Was it real?
    • If glibc improved, that'd be newsworthy!
  4. Be honest about limitations

    • hakmem is 8.8x slower
    • That's okay for research!
  5. Celebrate discoveries

    • Found a fundamental issue
    • Fixed it permanently
    • That's progress!

💭 Reflections

What Went Well

  1. Fixed fundamental infrastructure bug

    • Will never happen again
    • Proper tooling in place
  2. Honest performance comparison

    • Now know true baseline
    • Can set realistic goals
  3. Comprehensive documentation

    • 44 KB of useful docs
    • Future-proofing
  4. Learned from ChatGPT Pro

    • Gating strategy
    • Pull-based architecture
    • Batching techniques

What Could Be Better 🤔

  1. Should have verified linkage earlier

    • Wasted time on Phase 1-4.1
    • But: learned valuable lessons!
  2. Need better profiling tools

    • perf doesn't work in WSL2
    • Consider: gprof, valgrind, manual timing
  3. Architectural complexity

    • 3 layers (TLS Mag → Mini-mag → Bitmap)
    • Each adds overhead
    • Simplification opportunity?

🎯 Recommendation

For tomorrow (if continuing):

Immediate Actions

  1. Run verify_bench.sh before any benchmark

    ./verify_bench.sh ./bench_tiny
    
  2. Re-measure Phase 0 (baseline) with correct infrastructure

    • Before Phase 1
    • True starting point
  3. Profile with working tools

    • gprof (if available)
    • Manual timing (HAKMEM_DEBUG_TIMING=1)
    • Identify true bottlenecks
  4. Implement one Quick Win

    • Start with easiest improvement
    • Measure impact properly
    • Document result

Strategic Decision

Choose a path:

  • Path A: Research focus (accept 5-8x slower)
  • Path B: Performance pursuit (target 2-3x slower)
  • Path C: Hybrid (modes for different use cases)

Recommended: Path A (Research focus)

  • Honest about trade-offs
  • Useful research tool
  • Moderate optimization (Quick Wins)
  • Don't sacrifice research goals

📊 Final Numbers

Before Today (Wrong)

| Phase | Throughput | Note |
|-------|------------|------|
| Phase 2 | 361 M ops/sec | glibc malloc |
| Phase 3 | 391 M ops/sec | glibc malloc |
| Phase 4 | 373 M ops/sec | glibc malloc |
| Phase 4.1 | 381 M ops/sec | glibc malloc |

After Today (Correct)

| Allocator | Throughput | Latency | Verified |
|-----------|------------|---------|----------|
| hakmem (Phase 4.2) | 103 M ops/sec | 9.7 ns/op | 119 hakmem symbols |
| glibc malloc | 364 M ops/sec | 2.7 ns/op | No hakmem symbols |
| mimalloc | 908 M ops/sec | 1.1 ns/op | LD_PRELOAD |

Gap to close: 8.8x (mimalloc) or 3.5x (glibc)


🙏 Acknowledgments

  • ChatGPT Pro: Excellent architectural advice (Gating, Batching, Pull型)
  • User (にゃーん): Persistent questioning led to the discovery
  • verify_bench.sh: Will prevent future mistakes

📅 Session Stats

  • Duration: ~8 hours
  • Commits: 6
  • Files Created: 6
  • Lines of Code: ~500
  • Documentation: ~44 KB
  • Major Discovery: 1 (benchmark infrastructure bug)
  • Performance Gap Discovered: 8.8x vs mimalloc

Conclusion

Today was tough 😿 but extremely valuable

Bad News:

  • Phase 1-4.1 measurements were wrong
  • hakmem is 8.8x slower than mimalloc
  • Big gap to close

Good News:

  • Found fundamental infrastructure bug
  • Fixed it permanently (verify_bench.sh + checklist)
  • Know true baseline now (103 M ops/sec)
  • Can set realistic goals going forward
  • Comprehensive documentation for future

Moving Forward:

  • Be honest about performance
  • Focus on research value
  • Optimize pragmatically
  • Never make the same mistake again

Final Thought:

"It's better to know uncomfortable truth than comfortable lie."

hakmem runs at 103 M ops/sec. That is the reality. But as a research platform, that might be performance enough, nya.

Nyaan! Good work again today! 😸


Generated: 2025-10-26
Session: Phase 4 Optimization & Infrastructure Fix
Status: Documented & Lessons Learned