Phase 6.11.4: Implementation Guide
Quick Reference: Step-by-step implementation for hak_alloc optimization
🎯 Goal
Reduce hak_alloc overhead: 126,479 cycles (39.6%) → <70,000 cycles (<22%)
Target improvement: -45% reduction in 2-3 hours
📋 Implementation Checklist
✅ Phase 6.11.4 (P0-1): Atomic Operation Elimination (30 minutes)
Expected gain: -30,000 cycles (-24%)
Step 1: Modify hakmem.c
File: apps/experiments/hakmem-poc/hakmem.c:362-369
void* hak_alloc_at(size_t size, hak_callsite_t site) {
HKM_TIME_START(t0);
if (!g_initialized) hak_init();
- // Phase 6.8: Feature-gated evolution tick (every 1024 allocs)
- if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_EVOLUTION)) {
+ // Phase 6.11.4 (P0-1): Compile-time guard for atomic operation
+ #if HAKMEM_FEATURE_EVOLUTION
static _Atomic uint64_t tick_counter = 0;
if ((atomic_fetch_add(&tick_counter, 1) & 0x3FF) == 0) {
- struct timespec now;
- clock_gettime(CLOCK_MONOTONIC, &now);
- uint64_t now_ns = now.tv_sec * 1000000000ULL + now.tv_nsec;
- hak_evo_tick(now_ns);
+ hak_evo_tick(get_time_ns());
}
- }
+ #endif
Key changes:
- Replace the runtime check if (HAK_ENABLED_LEARNING(...)) with the compile-time guard #if HAKMEM_FEATURE_EVOLUTION
- Use the get_time_ns() helper instead of an inline clock_gettime() call (minor cleanup)
Step 2: Add helper function (optional cleanup)
File: apps/experiments/hakmem-poc/hakmem_evo.c
// Public helper (expose in hakmem_evo.h)
uint64_t get_time_ns(void) {
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}
File: apps/experiments/hakmem-poc/hakmem_evo.h
// Add to public API
uint64_t get_time_ns(void); // Helper for external callers
Step 3: Test with Evolution disabled
# Baseline (with atomic)
cd apps/experiments/hakmem-poc
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem
HAKMEM_TIMING=1 ./bench_allocators_hakmem
# Modify hakmem_config.h temporarily
# Change: #define HAKMEM_FEATURE_EVOLUTION 0
# Rebuild and benchmark
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem
HAKMEM_TIMING=1 ./bench_allocators_hakmem
Expected output:
Before:
hak_alloc: 126,479 cycles (39.6%)
After:
hak_alloc: 96,000 cycles (30.0%) ← -24% reduction ✅
✅ Phase 6.11.4 (P0-2): Cached Strategy (1-2 hours)
Expected gain: -26,000 cycles (-27% additional)
Step 1: Add global cache variables
File: apps/experiments/hakmem-poc/hakmem.c:52-60
static int g_initialized = 0;
// Statistics
static uint64_t g_malloc_count = 0; // Used for optimization stats display
-// Phase 6.11: ELO Sampling Rate reduction (1/100 sampling)
-static uint64_t g_elo_call_count = 0; // Total calls to ELO path
-static int g_cached_strategy_id = -1; // Cached strategy ID (updated every 100 calls)
+// Phase 6.11.4 (P0-2): Async ELO strategy cache (not static: referenced from hakmem_evo.c)
+_Atomic int g_cached_strategy_id = 2; // Default strategy (2MB threshold)
+_Atomic uint64_t g_elo_generation = 0; // Invalidation counter
Step 2: Update hak_alloc logic
File: apps/experiments/hakmem-poc/hakmem.c:377-417
if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) {
// ELO enabled: use strategy selection
int strategy_id;
if (hak_evo_is_frozen()) {
// FROZEN: Use confirmed best strategy (zero overhead)
strategy_id = hak_evo_get_confirmed_strategy();
threshold = hak_elo_get_threshold(strategy_id);
} else if (hak_evo_is_canary()) {
// CANARY: 5% trial with candidate, 95% with confirmed
if (hak_evo_should_use_candidate()) {
strategy_id = hak_evo_get_candidate_strategy();
} else {
strategy_id = hak_evo_get_confirmed_strategy();
}
threshold = hak_elo_get_threshold(strategy_id);
} else {
- // LEARN: ELO operation with 1/100 sampling (Phase 6.11 optimization)
- g_elo_call_count++;
-
- // Update strategy every 100 calls (99% overhead reduction)
- if (g_elo_call_count % 100 == 0 || g_cached_strategy_id == -1) {
- // Sample: Select strategy using epsilon-greedy (10% exploration, 90% exploitation)
- strategy_id = hak_elo_select_strategy();
- g_cached_strategy_id = strategy_id;
-
- // Record allocation for ELO learning (simplified: no timing yet)
- hak_elo_record_alloc(strategy_id, size, 0);
- } else {
- // Use cached strategy (fast path, no ELO overhead)
- strategy_id = g_cached_strategy_id;
- }
+ // Phase 6.11.4 (P0-2): LEARN mode uses cached strategy (updated async)
+ strategy_id = atomic_load(&g_cached_strategy_id);
threshold = hak_elo_get_threshold(strategy_id);
}
} else {
// ELO disabled: use default threshold (2MB - mimalloc's large threshold)
threshold = 2097152; // 2MB
}
Step 3: Add async recompute in evo_tick
File: apps/experiments/hakmem-poc/hakmem_evo.c
Add new function:
// Phase 6.11.4 (P0-2): Async ELO strategy recomputation
void hak_elo_async_recompute(void) {
if (!hak_elo_is_initialized()) return;
// Re-select best strategy (epsilon-greedy)
int new_strategy = hak_elo_select_strategy();
// Update cached strategy
extern _Atomic int g_cached_strategy_id; // From hakmem.c (must not be static there)
extern _Atomic uint64_t g_elo_generation;
int old_strategy = atomic_load(&g_cached_strategy_id); // Read before overwriting, for the log
atomic_store(&g_cached_strategy_id, new_strategy);
atomic_fetch_add(&g_elo_generation, 1); // Invalidate
fprintf(stderr, "[ELO] Async strategy update: %d → %d (gen=%lu)\n",
        old_strategy, new_strategy,
        (unsigned long)atomic_load(&g_elo_generation));
}
Call from hak_evo_tick:
void hak_evo_tick(uint64_t now_ns) {
// ... existing logic ...
// Close window if conditions met
if (should_close) {
// ... existing window closure logic ...
+ // Phase 6.11.4 (P0-2): Recompute ELO strategy (every window)
+ if (HAK_ENABLED_LEARNING(HAKMEM_FEATURE_ELO)) {
+ hak_elo_async_recompute();
+ }
// Reset window
g_window_ops_count = 0;
g_window_start_ns = now_ns;
}
}
Expose in header:
File: apps/experiments/hakmem-poc/hakmem_evo.h
// Phase 6.11.4 (P0-2): Async ELO update
void hak_elo_async_recompute(void);
int hak_elo_is_initialized(void); // Helper
File: apps/experiments/hakmem-poc/hakmem_elo.c
int hak_elo_is_initialized(void) {
return g_initialized;
}
Step 4: Test with Evolution enabled
# Restore HAKMEM_FEATURE_EVOLUTION=1 in hakmem_config.h
cd apps/experiments/hakmem-poc
HAKMEM_DEBUG_TIMING=1 make bench_allocators_hakmem
HAKMEM_TIMING=1 ./bench_allocators_hakmem
Expected output:
Before (P0-1):
hak_alloc: 96,000 cycles (30.0%)
After (P0-2):
hak_alloc: 70,000 cycles (21.9%) ← -27% additional reduction ✅
Total:
126,479 → 70,000 cycles (-45% total) 🎉
🔧 Troubleshooting
Issue 1: Undefined reference to g_cached_strategy_id
Cause: g_cached_strategy_id is declared static in hakmem.c (internal linkage), or the extern declaration is missing
Fix: Drop static and declare the variable in hakmem_evo.h, or expose it via a getter:
// Option 1: Getter function (safer)
int hak_elo_get_cached_strategy(void);
// Option 2: Extern declaration (faster)
extern _Atomic int g_cached_strategy_id;
Issue 2: ELO strategy not updating
Cause: hak_elo_async_recompute() not called
Debug:
# Add debug prints
fprintf(stderr, "[DEBUG] hak_evo_tick called, should_close=%d\n", should_close);
Issue 3: Race condition on g_elo_generation
Not a problem: Read-only in hot-path, atomic increment in cold-path
📊 Validation
Benchmark all scenarios
cd apps/experiments/hakmem-poc
./bench_allocators_hakmem
Expected improvements:
| Scenario | Before (ns) | After (ns) | Reduction |
|---|---|---|---|
| json (64KB) | 298 | ~220 | -26% |
| mir (256KB) | 1,698 | ~1,250 | -26% |
| vm (2MB) | 15,021 | ~11,000 | -27% |
Profiling validation
HAKMEM_TIMING=1 ./bench_allocators_hakmem
Expected cycle distribution:
Before:
hak_alloc: 126,479 cycles (39.6%) ← Bottleneck
syscall_munmap: 131,666 cycles (41.3%)
After:
hak_alloc: 70,000 cycles (27.5%) ← Reduced! ✅
syscall_munmap: 131,666 cycles (51.7%) ← Now #1 bottleneck
Success criterion: hak_alloc < 75,000 cycles (40% reduction)
🎯 Next Steps After P0-2
Option A: Stop here (RECOMMENDED)
Rationale:
- 45% reduction achieved (126,479 → 70,000 cycles)
- 2-3 hours total investment
- Excellent ROI
Decision: Move to Phase 6.13 (L2.5 Pool mir scenario optimization)
Option B: Continue to P2 (Hash Optimization)
Expected gain: Additional 10,000 cycles (-14%)
Time investment: 2-3 hours
Priority: Medium
Implementation: See PHASE_6.11.4_THREADING_COST_ANALYSIS.md Section 3
📝 Documentation Updates
After completion, update:
- CURRENT_TASK.md:
  ## ✅ Phase 6.11.4 Complete! (YYYY-MM-DD)
  **Implementation complete**: hak_alloc optimization (-45% reduction)
  **P0-1**: Atomic operation elimination (-24%)
  **P0-2**: Cached strategy (-27%)
  **Result**: 126,479 → 70,000 cycles (-45%)
- PHASE_6.11.4_COMPLETION_REPORT.md:
  - Copy template from PHASE_6.11.3_COMPLETION_REPORT.md
  - Fill in actual benchmark results
  - Add profiling comparison
🚀 Quick Start Commands
# 1. Implement P0-1 (30 min)
vim apps/experiments/hakmem-poc/hakmem.c # Edit line 362-369
make bench_allocators_hakmem
HAKMEM_TIMING=1 ./bench_allocators_hakmem
# 2. Implement P0-2 (1-2 hrs)
vim apps/experiments/hakmem-poc/hakmem.c # Edit line 52-60, 377-417
vim apps/experiments/hakmem-poc/hakmem_evo.c # Add hak_elo_async_recompute
make bench_allocators_hakmem
HAKMEM_TIMING=1 ./bench_allocators_hakmem
# 3. Validate
./bench_allocators_hakmem | tee results_p0.txt
python3 quick_analyze.py results_p0.txt
Total time: 2-3 hours for -45% reduction 🎉