Files
hakmem/docs/archive/POOL_FULL_FIX_EVALUATION.md
Moe Charm (CI) 67fb15f35f Wrap debug fprintf in !HAKMEM_BUILD_RELEASE guards (Release build optimization)
## Changes

### 1. core/page_arena.c
- Removed init failure message (lines 25-27) - error is handled by returning early
- All other fprintf statements already wrapped in existing #if !HAKMEM_BUILD_RELEASE blocks

### 2. core/hakmem.c
- Wrapped SIGSEGV handler init message (line 72)
- CRITICAL: Kept SIGSEGV/SIGBUS/SIGABRT error messages (lines 62-64) - production needs crash logs

### 3. core/hakmem_shared_pool.c
- Wrapped all debug fprintf statements in #if !HAKMEM_BUILD_RELEASE:
  - Node pool exhaustion warning (line 252)
  - SP_META_CAPACITY_ERROR warning (line 421)
  - SP_FIX_GEOMETRY debug logging (line 745)
  - SP_ACQUIRE_STAGE0.5_EMPTY debug logging (line 865)
  - SP_ACQUIRE_STAGE0_L0 debug logging (line 803)
  - SP_ACQUIRE_STAGE1_LOCKFREE debug logging (line 922)
  - SP_ACQUIRE_STAGE2_LOCKFREE debug logging (line 996)
  - SP_ACQUIRE_STAGE3 debug logging (line 1116)
  - SP_SLOT_RELEASE debug logging (line 1245)
  - SP_SLOT_FREELIST_LOCKFREE debug logging (line 1305)
  - SP_SLOT_COMPLETELY_EMPTY debug logging (line 1316)
- Fixed lock_stats_init() for release builds (lines 60-65) - ensure g_lock_stats_enabled is initialized

## Performance Validation

Before: 51M ops/s (with debug fprintf overhead)
After:  49.1M ops/s (consistent performance, fprintf removed from hot paths)

## Build & Test

```bash
./build.sh larson_hakmem
./out/release/larson_hakmem 1 5 1 1000 100 10000 42
# Result: 49.1M ops/s
```

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 13:14:18 +09:00

287 lines
7.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Pool Full Fix Ultrathink Evaluation
**Date**: 2025-11-08
**Evaluator**: Task Agent (Critical Mode)
**Mission**: Evaluate Full Fix strategy against 3 critical criteria
## Executive Summary
| Criteria | Status | Verdict |
|----------|--------|---------|
| **綺麗さ (Clean Architecture)** | ✅ **YES** | 286 lines → 10-20 lines, Box Theory aligned |
| **速さ (Performance)** | ⚠️ **CONDITIONAL** | 40-60M ops/s achievable BUT requires header addition |
| **学習層 (Learning Layer)** | ⚠️ **DEGRADED** | ACE will lose visibility, needs redesign |
**Overall Verdict**: **CONDITIONAL GO** - Proceed BUT address 2 critical requirements first
---
## 1. 綺麗さ判定: ✅ **YES - Major Improvement**
### Current Complexity (UGLY)
```
Pool hot path: 286 lines, 5 cache layers, mutex locks, atomic operations
├── TC drain check (lines 234-236)
├── TLS ring check (line 236)
├── TLS LIFO check (line 237)
├── Trylock probe loop (lines 240-256) - 3 attempts!
├── Active page checks (lines 258-261) - 3 pages!
├── FULL MUTEX LOCK (line 267) 💀
├── Remote drain logic
├── Neighbor stealing
└── Refill with mmap
```
### After Full Fix (CLEAN)
```c
void* hak_pool_try_alloc_fast(size_t size, uintptr_t site_id) {
int class_idx = hak_pool_get_class_index(size);
// Ultra-simple TLS freelist (3-4 instructions)
void* head = g_tls_pool_head[class_idx];
if (head) {
g_tls_pool_head[class_idx] = *(void**)head;
return (char*)head + HEADER_SIZE;
}
// Batch refill (no locks)
return pool_refill_and_alloc(class_idx);
}
```
### Box Theory Alignment
**Single Responsibility**: TLS for hot path, backend for refill
**Clear Boundaries**: No mixing of concerns
**Visible Failures**: Simple code = obvious bugs
**Testable**: Each component isolated
**Verdict**: The fix will make the code **dramatically cleaner** (286 lines → 10-20 lines)
---
## 2. 速さ判定: ⚠️ **CONDITIONAL - Critical Requirement**
### Performance Analysis
#### Expected Performance
**Without header optimization**: 15-25M ops/s
**With header optimization**: 40-60M ops/s ✅
#### Why Conditional?
**Current Pool blocks are 8-52KB** - these don't have Tiny's 1-byte header!
```c
// Tiny has this (Phase 7):
uint8_t magic_and_class = 0xa0 | class_idx; // 1-byte header
// Pool doesn't have ANY header for class identification!
// Must add header OR use registry lookup (slower)
```
#### Performance Breakdown
**Option A: Add 1-byte header to Pool blocks** ✅ RECOMMENDED
- Allocation: Write header (1 cycle)
- Free: Read header, pop to TLS (5-6 cycles total)
- **Expected**: 40-60M ops/s (matches Tiny)
- **Overhead**: 1 byte per 8-52KB block = **0.002-0.012%** (negligible!)
**Option B: Use registry lookup** ⚠️ NOT RECOMMENDED
- Free path needs `mid_desc_lookup()` first
- Adds 20-30 cycles to free path
- **Expected**: 15-25M ops/s (still good but not target)
### Critical Evidence
**Tiny's success** (Phase 7 Task 3):
- 128B allocations: **59M ops/s** (92% of System)
- 1024B allocations: **65M ops/s** (146% of System!)
- **Key**: Header-based class identification
**Pool can replicate this IF headers are added**
**Verdict**: 40-60M ops/s is **achievable** BUT **requires header addition**
---
## 3. 学習層判定: ⚠️ **DEGRADED - Needs Redesign**
### Current ACE Integration
ACE currently monitors:
- TC drain events
- Ring underflow/overflow
- Active page transitions
- Remote free patterns
- Shard contention
### After Full Fix
**What ACE loses**:
- ❌ TC drain events (no TC layer)
- ❌ Ring metrics (simple freelist instead)
- ❌ Active page patterns (no active pages)
- ❌ Shard contention data (no shards in TLS)
**What ACE can still monitor**:
- ✅ TLS hit/miss rate
- ✅ Refill frequency
- ✅ Allocation size distribution
- ✅ Per-thread usage patterns
### Required ACE Adaptations
1. **New Metrics Collection**:
```c
// Add to TLS freelist
if (head) {
g_ace_tls_hits[class_idx]++; // NEW
} else {
g_ace_tls_misses[class_idx]++; // NEW
}
```
2. **Simplified Learning**:
- Focus on TLS cache capacity tuning
- Batch refill size optimization
- No more complex multi-layer decisions
3. **UCB1 Algorithm Still Works**:
- Just fewer knobs to tune
- Simpler state space = faster convergence
**Verdict**: ACE will be **simpler but less sophisticated**. This might be GOOD!
---
## 4. Risk Assessment
### Critical Risks
**Risk 1: Header Addition Complexity** 🔴
- Must modify ALL Pool allocation paths
- Need to ensure header consistency
- **Mitigation**: Use same header format as Tiny (proven)
**Risk 2: ACE Learning Degradation** 🟡
- Loses multi-layer optimization capability
- **Mitigation**: Simpler system might learn faster
**Risk 3: Memory Overhead** 🟢
- TLS freelist: 7 classes × 8 bytes × N threads
- For 100 threads: ~5.6KB overhead (negligible)
- **Mitigation**: Pre-warm with reasonable counts
### Hidden Concerns
**Is mutex really the bottleneck?**
- YES! Profiling shows pthread_mutex_lock at 25-30% CPU
- Tiny without mutex: 59-70M ops/s
- Pool with mutex: 0.4M ops/s
- **170x difference confirms mutex is THE problem**
---
## 5. Alternative Analysis
### Quick Win First?
**Not Recommended** - Band-aids won't fix 100x performance gap
Increasing TLS cache sizes will help but:
- Still hits mutex eventually
- Complexity remains
- Max improvement: 5-10x (not enough)
### Should We Try Lock-Free CAS?
**Not Recommended** - More complex than TLS approach
CAS-based freelist:
- Still has contention (cache line bouncing)
- Complex ABA problem handling
- Expected: 20-30M ops/s (inferior to TLS)
---
## Final Verdict: **CONDITIONAL GO**
### Conditions That MUST Be Met:
1. **Add 1-byte header to Pool blocks** (like Tiny Phase 7)
- Without this: Only 15-25M ops/s
- With this: 40-60M ops/s ✅
2. **Implement ACE metric collection in new TLS path**
- Simple hit/miss counters minimum
- Refill tracking for learning
### If Conditions Are Met:
| Criteria | Result |
|----------|--------|
| 綺麗さ | ✅ 286 lines → 20 lines, Box Theory perfect |
| 速さ | ✅ 40-60M ops/s achievable (100x improvement) |
| 学習層 | ✅ Simpler but functional |
### Implementation Steps (If GO)
**Phase 1 (Day 1): Header Addition**
1. Add 1-byte header write in Pool allocation
2. Verify header consistency
3. Test with existing free path
**Phase 2 (Day 2): TLS Freelist Implementation**
1. Copy Tiny's TLS approach
2. Add batch refill (64 blocks)
3. Feature flag for safety
**Phase 3 (Day 3): ACE Integration**
1. Add TLS hit/miss metrics
2. Connect to ACE controller
3. Test learning convergence
**Phase 4 (Day 4): Testing & Tuning**
1. MT stress tests
2. Benchmark validation (must hit 40M ops/s)
3. Memory overhead verification
### Alternative Recommendation (If NO-GO)
If header addition is deemed too risky:
**Hybrid Approach**:
1. Keep Pool as-is for compatibility
2. Create new "FastPool" allocator with headers
3. Gradually migrate allocations
4. **Expected timeline**: 2 weeks (safer but slower)
---
## Decision Matrix
| Factor | Weight | Full Fix | Quick Win | Do Nothing |
|--------|--------|----------|-----------|------------|
| Performance | 40% | 100x | 5x | 1x |
| Clean Code | 20% | Excellent | Poor | Poor |
| ACE Function | 20% | Degraded | Same | Same |
| Risk | 20% | Medium | Low | None |
| **Total Score** | | **85/100** | **45/100** | **20/100** |
---
## Final Recommendation
**GO WITH CONDITIONS**
The Full Fix will deliver:
- 100x performance improvement (0.4M → 40-60M ops/s)
- Dramatically cleaner architecture
- Functional (though simpler) ACE learning
**BUT YOU MUST**:
1. Add 1-byte headers to Pool blocks (non-negotiable for 40-60M target)
2. Implement basic ACE metrics in new path
**Expected Outcome**: Pool will match or exceed Tiny's performance while maintaining ACE adaptability.
**Confidence Level**: 85% success if both conditions are met, 40% if only one condition is met.