# Phase 6: Learning-Based Tiny Allocator Results
## 📊 Phase 1: Ultra-Simple Fast Path (COMPLETED 2025-11-02)
### 🎯 Design Goal

Implement a tcache-style ultra-simple fast path:

- 3-4 instruction fast path (pop from free list)
- Simple mmap-based backend
- Target: 70-80% of System malloc performance

### ✅ Implementation

**Files:**

- `core/hakmem_tiny_simple.h` - Header with inline size-to-class mapping
- `core/hakmem_tiny_simple.c` - Implementation (200 lines)
- `bench_tiny_simple.c` - Benchmark program

**Fast Path (core/hakmem_tiny_simple.c:79-97):**

```c
void* hak_tiny_simple_alloc(size_t size) {
    int cls = hak_tiny_simple_size_to_class(size);  // Inline
    if (cls < 0) return NULL;

    void** head = &g_tls_tiny_cache[cls];
    void* ptr = *head;
    if (ptr) {
        *head = *(void**)ptr;  // 1-instruction pop!
        return ptr;
    }
    return hak_tiny_simple_alloc_slow(size, cls);
}
```
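The fast path leans on the inline size-to-class mapping mentioned in the file list. The real mapping lives in `core/hakmem_tiny_simple.h` and is not shown here; the sketch below assumes eight power-of-two classes from 8B to 1KB (matching the 8-pointer TLS array and the size list in the test plan), so the exact boundaries are an assumption.

```c
#include <stddef.h>

// Hypothetical sketch of the inline if-chain size-to-class mapping.
// Assumes eight power-of-two classes, 8B..1KB; returns -1 out of range.
static inline int hak_tiny_simple_size_to_class(size_t size) {
    if (size == 0 || size > 1024) return -1;  // outside tiny range
    if (size <= 8)   return 0;
    if (size <= 16)  return 1;
    if (size <= 32)  return 2;
    if (size <= 64)  return 3;
    if (size <= 128) return 4;
    if (size <= 256) return 5;
    if (size <= 512) return 6;
    return 7;  // <= 1024
}
```

Because every comparison is against a compile-time constant, the branches are highly predictable, which is what keeps the hot path near 4 cycles.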
### 🚀 Benchmark Results

**Test: bench_tiny_simple (64B LIFO)**

```
Pattern: Sequential LIFO (alloc + free)
Size: 64B
Iterations: 10,000,000

Results:
- Throughput: 478.60 M ops/sec
- Cycles/op: 4.17 cycles
- Hit rate: 100.00%
```
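The cycles/op figure suggests a TSC-based harness; `bench_tiny_simple.c` itself is not shown in this document. A minimal sketch of such a measurement (x86-only via `__rdtsc`, and timing system `malloc` as a stand-in for the allocator under test — both are assumptions about the real harness):

```c
#include <stdlib.h>
#include <x86intrin.h>  // __rdtsc (x86/x86-64 only)

// Measure average cycles per alloc+free pair over `iters` iterations.
double measure_cycles_per_op(long iters) {
    unsigned long long start = __rdtsc();
    for (long i = 0; i < iters; i++) {
        void* p = malloc(64);  // 64B LIFO pattern: allocate...
        free(p);               // ...then free immediately
    }
    unsigned long long cycles = __rdtsc() - start;
    return (double)cycles / (double)iters;
}
```

A real harness would pin the thread to one core and discard warm-up iterations so the first refill does not skew the average.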

**Comparison:**

| Allocator | Throughput | Cycles/op | Phase 6-1 advantage |
|-----------|------------|-----------|---------------------|
| **Phase 6-1 Simple** | **478.60 M/s** | **4.17** | **100%** ✅ |
| System glibc | 174.69 M/s | ~11.4 | **+174%** 🏆 |
| Current HAKMEM | 54.56 M/s | ~36.6 | **+777%** 🚀 |

### 📈 Performance Analysis

**Why so fast?**

1. **Ultra-simple fast path:**
   - Size-to-class: inline if-chain (predictable branches)
   - Cache lookup: single array index (`g_tls_tiny_cache[cls]`)
   - Pop operation: single pointer dereference
   - Total: ~4 cycles for the hot path

2. **Perfect cache locality:**
   - TLS array fits in L1 cache (8 pointers = 64 bytes)
   - Freed blocks are immediately reused (hot in L1)
   - 100% hit rate in the LIFO pattern

3. **No overhead:**
   - No magazine layers
   - No HotMag checks
   - No bitmap scans
   - No refcount updates
   - No branch mispredictions (linear code)

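The immediate-reuse behavior above comes from the free side, which this document does not show: freeing is the mirror-image push onto the same per-class TLS list. A minimal sketch, assuming the same 8-entry TLS array as the alloc path (the function name here is hypothetical):

```c
#include <stddef.h>

static __thread void* g_tls_tiny_cache[8];  // assumed: same TLS array as the alloc path

// Hypothetical free path: push the block onto the per-class TLS free list.
// Two stores, no branches — the freed block becomes the next allocation's
// result while its line is still hot in L1.
void hak_tiny_simple_free_sketch(void* ptr, int cls) {
    *(void**)ptr = g_tls_tiny_cache[cls];  // link block to current head
    g_tls_tiny_cache[cls] = ptr;           // block becomes the new head
}
```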
**Comparison with System tcache:**

- System: ~11.4 cycles/op (174.69 M ops/sec)
- Phase 6-1: **4.17 cycles/op** (478.60 M ops/sec)
- Difference: Phase 6-1 is **~7.3 cycles faster per operation**

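These figures are mutually consistent under a roughly 2.0 GHz core clock (an inference from the numbers; the source does not state the clock rate):

```c
// Cross-check: cycles/op = clock_hz / throughput.
// Assuming ~2.0 GHz: 2.0e9 / 478.60e6 ≈ 4.18 (reported 4.17),
//                    2.0e9 / 174.69e6 ≈ 11.45 (reported ~11.4).
double cycles_per_op(double clock_hz, double ops_per_sec) {
    return clock_hz / ops_per_sec;
}
```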
Reasons Phase 6-1 beats System:

1. Simpler size-to-class mapping (inline if-chain vs. System's bin calculation)
2. Direct TLS array access (no tcache structure indirection)
3. Fewer security checks (System has hardening overhead)
4. Better compiler optimization (newer GCC, -O2)

### 🎯 Goals Status

| Goal | Target | Achieved | Status |
|------|--------|----------|--------|
| Beat current HAKMEM | >54 M/s | 478.60 M/s | ✅ **+777%** |
| System parity | ~175 M/s | 478.60 M/s | ✅ **+174%** |
| Phase 1 target | 70-80% of System (122-140 M/s) | 478.60 M/s | ✅ **274% of System!** |

### 📝 Next Steps

**Phase 1 Comprehensive Testing:**

- [ ] Run bench_comprehensive with Phase 6-1
- [ ] Test all 21 patterns (LIFO, FIFO, Random, Interleaved, etc.)
- [ ] Test all sizes (8B, 16B, 32B, 64B, 128B, 256B, 512B, 1KB)
- [ ] Measure memory efficiency (RSS usage)
- [ ] Compare with baseline comprehensive results

**Phase 2 Planning (if Phase 1 comprehensive results are good):**

- [ ] Design learning layer (hotness tracking)
- [ ] Implement dynamic capacity adjustment (16-256 slots)
- [ ] Implement adaptive refill count (16-128 blocks)
- [ ] Integrate with existing HAKMEM infrastructure

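Phase 2 is still at the planning stage, so as a sketch only, the dynamic capacity adjustment could take a shape like the following. The thresholds, the per-epoch allocation counter, and the function name are all assumptions; only the 16-256 slot range comes from the plan above.

```c
#include <assert.h>

#define CAP_MIN 16   // from the Phase 2 plan: 16-256 slots
#define CAP_MAX 256

// Hypothetical per-epoch policy: grow hot classes, shrink cold ones,
// clamped to the planned [CAP_MIN, CAP_MAX] range. Thresholds are
// illustrative placeholders, not tuned values.
int adjust_capacity(int cur, unsigned allocs_per_epoch) {
    int target = cur;
    if (allocs_per_epoch > 1000)     target = cur * 2;  // hot: grow cache
    else if (allocs_per_epoch < 100) target = cur / 2;  // cold: shrink cache
    if (target < CAP_MIN) target = CAP_MIN;
    if (target > CAP_MAX) target = CAP_MAX;
    return target;
}
```

Keeping the adjustment out of the fast path (run once per epoch, not per allocation) is what would preserve the 4-cycle hot path that Phase 1 established.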
---
## 💡 Key Insights

1. **Simplicity wins:** Ultra-simple design (200 lines) beats the complex magazine system (8+ layers)
2. **Cache is king:** L1 cache locality + 100% hit rate = ~4 cycles/op
3. **HAKX pattern works for Tiny:** "Simple Front + Smart Back" (from Mid-Large, +171%) applies here too
4. **Target crushed:** 274% of System (vs. the 70-80% target) leaves room for learning-layer overhead

## 🎉 Conclusion

Phase 6-1 Ultra-Simple Fast Path is a **massive success**:

- ✅ Implementation complete (200 lines, clean design)
- ✅ Beats System malloc by **+174%**
- ✅ Beats current HAKMEM by **+777%**
- ✅ **4.17 cycles/op** (near the theoretical minimum)

This validates the "Simple Front + Smart Back" strategy and provides a solid foundation for the Phase 2 learning layer.