Files

Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)

Phase 1 完了：環境変数整理 + fprintf デバッグガード

ENV変数削除（BG/HotMag系）:
- core/hakmem_tiny_init.inc: HotMag ENV 削除 (~131 lines)
- core/hakmem_tiny_bg_spill.c: BG spill ENV 削除
- core/tiny_refill.h: BG remote 固定値化
- core/hakmem_tiny_slow.inc: BG refs 削除

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

ドキュメント整理:
- 328 markdown files 削除（旧レポート・重複docs）

性能確認:
- Larson: 52.35M ops/s (前回52.8M、安定動作✅)
- ENV整理による機能影響なし
- Debug出力は一部残存（次phase で対応）

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-26 14:45:26 +09:00

4.1 KiB

Raw Blame History

Phase 6: Learning-Based Tiny Allocator Results

📊 Phase 1: Ultra-Simple Fast Path (COMPLETED 2025-11-02)

🎯 Design Goal

Implement tcache-style ultra-simple fast path:

3-4 instruction fast path (pop from free list)
Simple mmap-based backend
Target: 70-80% of System malloc performance

✅ Implementation

Files:

core/hakmem_tiny_simple.h - Header with inline size-to-class
core/hakmem_tiny_simple.c - Implementation (200 lines)
bench_tiny_simple.c - Benchmark program

Fast Path (core/hakmem_tiny_simple.c:79-97):

void* hak_tiny_simple_alloc(size_t size) {
    int cls = hak_tiny_simple_size_to_class(size);  // Inline
    if (cls < 0) return NULL;

    void** head = &g_tls_tiny_cache[cls];
    void* ptr = *head;
    if (ptr) {
        *head = *(void**)ptr;  // 1-instruction pop!
        return ptr;
    }
    return hak_tiny_simple_alloc_slow(size, cls);
}

🚀 Benchmark Results

Test: bench_tiny_simple (64B LIFO)

Pattern: Sequential LIFO (alloc + free)
Size: 64B
Iterations: 10,000,000

Results:
- Throughput: 478.60 M ops/sec
- Cycles/op:  4.17 cycles
- Hit rate:   100.00%

Comparison:

Allocator	Throughput	Cycles/op	vs Phase 6-1
Phase 6-1 Simple	478.60 M/s	4.17	100% ✅
System glibc	174.69 M/s	~11.4	+174% 🏆
Current HAKMEM	54.56 M/s	~36.6	+777% 🚀

📈 Performance Analysis

Why so fast?

Ultra-simple fast path:
- Size-to-class: Inline if-chain (predictable branches)
- Cache lookup: Single array index (g_tls_tiny_cache[cls])
- Pop operation: Single pointer dereference
- Total: ~4 cycles for hot path
Perfect cache locality:
- TLS array fits in L1 cache (8 pointers = 64 bytes)
- Freed blocks immediately reused (hot in L1)
- 100% hit rate in LIFO pattern
No overhead:
- No magazine layers
- No HotMag checks
- No bitmap scans
- No refcount updates
- No branch mispredictions (linear code)

Comparison with System tcache:

System: ~11.4 cycles/op (174.69 M ops/sec)
Phase 6-1: 4.17 cycles/op (478.60 M ops/sec)
Difference: Phase 6-1 is 7.3 cycles faster per operation

Reasons Phase 6-1 beats System:

Simpler size-to-class (inline if-chain vs System's bin calculation)
Direct TLS array access (no tcache structure indirection)
Fewer security checks (System has hardening overhead)
Better compiler optimization (newer GCC, -O2)

🎯 Goals Status

Goal	Target	Achieved	Status
Beat current HAKMEM	>54 M/s	478.60 M/s	✅ +777%
System parity	~175 M/s	478.60 M/s	✅ +174%
Phase 1 target	70-80% of System (122-140 M/s)	478.60 M/s	✅ 274% of System!

📝 Next Steps

Phase 1 Comprehensive Testing:

Run bench_comprehensive with Phase 6-1
Test all 21 patterns (LIFO, FIFO, Random, Interleaved, etc.)
Test all sizes (8B, 16B, 32B, 64B, 128B, 256B, 512B, 1KB)
Measure memory efficiency (RSS usage)
Compare with baseline comprehensive results

Phase 2 Planning (if Phase 1 comprehensive results good):

Design learning layer (hotness tracking)
Implement dynamic capacity adjustment (16-256 slots)
Implement adaptive refill count (16-128 blocks)
Integration with existing HAKMEM infrastructure

💡 Key Insights

Simplicity wins: Ultra-simple design (200 lines) beats complex magazine system (8+ layers)
Cache is king: L1 cache locality + 100% hit rate = 4 cycles/op
HAKX pattern works for Tiny: "Simple Front + Smart Back" (from Mid-Large +171%) applies here too
Target crushed: 274% of System (vs 70-80% target) leaves room for learning layer overhead

🎉 Conclusion

Phase 6-1 Ultra-Simple Fast Path is a massive success:

✅ Implementation complete (200 lines, clean design)
✅ Beats System malloc by +174%
✅ Beats current HAKMEM by +777%
✅ 4.17 cycles/op (near-theoretical minimum)

This validates the "Simple Front + Smart Back" strategy and provides a solid foundation for Phase 2 learning layer.

4.1 KiB Raw Blame History