Phase 1 完了:環境変数整理 + fprintf デバッグガード ENV変数削除(BG/HotMag系): - core/hakmem_tiny_init.inc: HotMag ENV 削除 (~131 lines) - core/hakmem_tiny_bg_spill.c: BG spill ENV 削除 - core/tiny_refill.h: BG remote 固定値化 - core/hakmem_tiny_slow.inc: BG refs 削除 fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE): - core/hakmem_shared_pool.c: Lock stats (~18 fprintf) - core/page_arena.c: Init/Shutdown/Stats (~27 fprintf) - core/hakmem.c: SIGSEGV init message ドキュメント整理: - 328 markdown files 削除(旧レポート・重複docs) 性能確認: - Larson: 52.35M ops/s (前回52.8M、安定動作✅) - ENV整理による機能影響なし - Debug出力は一部残存(次phase で対応) 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
4.1 KiB
4.1 KiB
Phase 6: Learning-Based Tiny Allocator Results
📊 Phase 1: Ultra-Simple Fast Path (COMPLETED 2025-11-02)
🎯 Design Goal
Implement tcache-style ultra-simple fast path:
- 3-4 instruction fast path (pop from free list)
- Simple mmap-based backend
- Target: 70-80% of System malloc performance
✅ Implementation
Files:
core/hakmem_tiny_simple.h- Header with inline size-to-classcore/hakmem_tiny_simple.c- Implementation (200 lines)bench_tiny_simple.c- Benchmark program
Fast Path (core/hakmem_tiny_simple.c:79-97):
void* hak_tiny_simple_alloc(size_t size) {
int cls = hak_tiny_simple_size_to_class(size); // Inline
if (cls < 0) return NULL;
void** head = &g_tls_tiny_cache[cls];
void* ptr = *head;
if (ptr) {
*head = *(void**)ptr; // 1-instruction pop!
return ptr;
}
return hak_tiny_simple_alloc_slow(size, cls);
}
🚀 Benchmark Results
Test: bench_tiny_simple (64B LIFO)
Pattern: Sequential LIFO (alloc + free)
Size: 64B
Iterations: 10,000,000
Results:
- Throughput: 478.60 M ops/sec
- Cycles/op: 4.17 cycles
- Hit rate: 100.00%
Comparison:
| Allocator | Throughput | Cycles/op | vs Phase 6-1 |
|---|---|---|---|
| Phase 6-1 Simple | 478.60 M/s | 4.17 | 100% ✅ |
| System glibc | 174.69 M/s | ~11.4 | +174% 🏆 |
| Current HAKMEM | 54.56 M/s | ~36.6 | +777% 🚀 |
📈 Performance Analysis
Why so fast?
-
Ultra-simple fast path:
- Size-to-class: Inline if-chain (predictable branches)
- Cache lookup: Single array index (
g_tls_tiny_cache[cls]) - Pop operation: Single pointer dereference
- Total: ~4 cycles for hot path
-
Perfect cache locality:
- TLS array fits in L1 cache (8 pointers = 64 bytes)
- Freed blocks immediately reused (hot in L1)
- 100% hit rate in LIFO pattern
-
No overhead:
- No magazine layers
- No HotMag checks
- No bitmap scans
- No refcount updates
- No branch mispredictions (linear code)
Comparison with System tcache:
- System: ~11.4 cycles/op (174.69 M ops/sec)
- Phase 6-1: 4.17 cycles/op (478.60 M ops/sec)
- Difference: Phase 6-1 is 7.3 cycles faster per operation
Reasons Phase 6-1 beats System:
- Simpler size-to-class (inline if-chain vs System's bin calculation)
- Direct TLS array access (no tcache structure indirection)
- Fewer security checks (System has hardening overhead)
- Better compiler optimization (newer GCC, -O2)
🎯 Goals Status
| Goal | Target | Achieved | Status |
|---|---|---|---|
| Beat current HAKMEM | >54 M/s | 478.60 M/s | ✅ +777% |
| System parity | ~175 M/s | 478.60 M/s | ✅ +174% |
| Phase 1 target | 70-80% of System (122-140 M/s) | 478.60 M/s | ✅ 274% of System! |
📝 Next Steps
Phase 1 Comprehensive Testing:
- Run bench_comprehensive with Phase 6-1
- Test all 21 patterns (LIFO, FIFO, Random, Interleaved, etc.)
- Test all sizes (8B, 16B, 32B, 64B, 128B, 256B, 512B, 1KB)
- Measure memory efficiency (RSS usage)
- Compare with baseline comprehensive results
Phase 2 Planning (if Phase 1 comprehensive results good):
- Design learning layer (hotness tracking)
- Implement dynamic capacity adjustment (16-256 slots)
- Implement adaptive refill count (16-128 blocks)
- Integration with existing HAKMEM infrastructure
💡 Key Insights
- Simplicity wins: Ultra-simple design (200 lines) beats complex magazine system (8+ layers)
- Cache is king: L1 cache locality + 100% hit rate = 4 cycles/op
- HAKX pattern works for Tiny: "Simple Front + Smart Back" (from Mid-Large +171%) applies here too
- Target crushed: 274% of System (vs 70-80% target) leaves room for learning layer overhead
🎉 Conclusion
Phase 6-1 Ultra-Simple Fast Path is a massive success:
- ✅ Implementation complete (200 lines, clean design)
- ✅ Beats System malloc by +174%
- ✅ Beats current HAKMEM by +777%
- ✅ 4.17 cycles/op (near-theoretical minimum)
This validates the "Simple Front + Smart Back" strategy and provides a solid foundation for Phase 2 learning layer.