# HAKMEM Configuration Crisis - Executive Summary

**Date**: 2025-11-26
**Status**: 🔴 CRITICAL - Configuration complexity is hindering development
**Reading Time**: 10 minutes

---

## 🚨 The Crisis in Numbers

| Metric | Current | Target | Reduction |
|--------|---------|--------|-----------|
| **Runtime ENV variables** | 236 | 80 | **-66%** |
| **Build-time flags** | 59+ | ~40 | **-32%** |
| **Shell scripts** | 30 files (3000 LOC) | 8 entry points | **-73%** |
| **JSON presets** | 1 file, 3 presets | 4+ files, organized | Better structure |
| **Configuration guides** | 0 | 3+ comprehensive | ∞% improvement |
| **Deprecation tracking** | None | Automated timeline | Needed |

**Bottom Line**: HAKMEM has grown from a research allocator to a production system, but configuration management hasn't scaled. We're at the point where **even the original developers are losing track of features**.

---

## 📊 Quick Facts

### Environment Variables (236 total)

**By Category**:
```
TINY Allocator:        113 vars (48%)  🔴 BLOATED
Debug/Profiling:        31 vars (13%)
Learning Systems:       18 vars (8%)   🟡 6 independent systems
SuperSlab:              15 vars (6%)
Shared Pool:            12 vars (5%)
Mid-Large:              11 vars (5%)
Benchmarking:           10 vars (4%)
Others:                 26 vars (11%)
```

**By Status**:
```
Active & Used:          ~120 vars (51%)
Deprecated/Dead:         ~60 vars (25%)  🔴 REMOVE
Research/Experimental:   ~40 vars (17%)
Undocumented:            ~16 vars (7%)   🔴 UNCLEAR
```

### Build Flags (59+ total)

**By Category**:
```
Feature Toggles:        23 flags (39%)
Optimization:           15 flags (25%)
Debug/Instrumentation:  12 flags (20%)
Build Modes:             9 flags (15%)
```

### Shell Scripts (30 files)

**By Type**:
```
Benchmarking:   14 scripts (47%)  🟡 Overlapping
ENV Setup:       6 scripts (20%)  🔴 Duplicated
Build Helpers:   5 scripts (17%)
Utilities:       5 scripts (17%)
```

**Problem**: No clear entry points, duplicated logic across 30 files, zero coordination.

---

## 🔥 Top 5 Critical Issues

### 1. TINY Allocator Configuration Explosion (113 vars)

**The Problem**: The TINY allocator has evolved through multiple phases (v1 → v2 → ULTRA → SLIM → Unified), but **old configuration layers were never removed**. Result: 113 overlapping environment variables.

**Examples of Chaos**:
```bash
# Refill configuration (7 overlapping strategies!)
HAKMEM_TINY_REFILL_BATCH_SIZE=64
HAKMEM_TINY_P0_BATCH=32            # Same as above?
HAKMEM_TINY_SFC_REFILL=16          # SFC is deprecated!
HAKMEM_UNIFIED_REFILL_SIZE=64      # Unified path
HAKMEM_TINY_FAST_REFILL_COUNT=32   # Fast path
HAKMEM_TINY_ULTRA_REFILL=8         # Ultra path
HAKMEM_TINY_SLIM_REFILL_BATCH=16   # SLIM path

# Debug toggles (11 variants with overlapping names!)
HAKMEM_TINY_DEBUG=1
HAKMEM_DEBUG_TINY=1                # Same thing?
HAKMEM_TINY_VERBOSE=1
HAKMEM_TINY_DEBUG_VERBOSE=1        # Combined?
HAKMEM_TINY_LOG=1
# ... (6 more variants)
```

**Impact**:
- Developers don't know which variables to use
- Testing matrix is impossibly large (2^113 combinations even if every variable were a simple on/off toggle)
- Configuration bugs are common
- Onboarding new developers takes weeks

**Recommendation**: Consolidate to **~40 variables** organized by architectural layer:
- Core allocation: 15 vars
- TLS caching: 8 vars
- Refill/drain: 6 vars
- Debug: 5 vars
- Learning: 6 vars

---

### 2. Dead Code Still Has Active Config (60+ vars)

**The Problem**: Features have been replaced or deprecated, but their configuration variables are still active, causing confusion.

**Examples**:

**SFC (Single-Free-Cache) - REPLACED by Unified Cache**:
```bash
HAKMEM_TINY_SFC_ENABLE=1           # 🔴 Dead (replaced Nov 2025)
HAKMEM_TINY_SFC_CAP=128            # 🔴 Dead
HAKMEM_TINY_SFC_REFILL=16          # 🔴 Dead
HAKMEM_TINY_SFC_SPILL_THRESH=96    # 🔴 Dead
HAKMEM_TINY_SFC_BATCH_POP=8        # 🔴 Dead
HAKMEM_TINY_SFC_STATS=1            # 🔴 Dead
```
**Status**: Unified Cache replaced SFC in Phase 3d-B (2025-11-20), but the SFC vars are still parsed.

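
Until the parser drops these variables, a launcher script could at least flag them. The following is a minimal sketch, not existing HAKMEM code: the function name `warn_dead_sfc_vars` and the warning text are hypothetical, while the six variable names are the ones listed above. It relies on `printenv`, which is widely available but not guaranteed on every system.

```bash
# Hypothetical helper (assumption, not part of HAKMEM).
# The six names below are the SFC variables listed in this section.
SFC_VARS="HAKMEM_TINY_SFC_ENABLE HAKMEM_TINY_SFC_CAP HAKMEM_TINY_SFC_REFILL \
HAKMEM_TINY_SFC_SPILL_THRESH HAKMEM_TINY_SFC_BATCH_POP HAKMEM_TINY_SFC_STATS"

# Print one warning per exported SFC variable to stderr,
# then print the number of variables found to stdout.
warn_dead_sfc_vars() {
  found=0
  for name in $SFC_VARS; do
    # printenv exits 0 only if the variable is present in the environment.
    if printenv "$name" >/dev/null 2>&1; then
      echo "warning: $name is set, but SFC was replaced by Unified Cache" >&2
      found=$((found + 1))
    fi
  done
  echo "$found"
}
```

Called at the top of a benchmark script, this makes stale SFC settings visible without changing behavior, which fits the backward-compatible deprecation approach proposed later in this summary.
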
**PAGE_ARENA - Research artifact, never integrated**:
```bash
HAKMEM_PAGE_ARENA_ENABLE=1         # 🔴 Research-only
HAKMEM_PAGE_ARENA_SIZE_MB=16       # 🔴 Research-only
HAKMEM_PAGE_ARENA_GROWTH=2         # 🔴 Research-only
HAKMEM_PAGE_ARENA_MAX_MB=128       # 🔴 Research-only
HAKMEM_PAGE_ARENA_THP=1            # 🔴 Research-only
```
**Status**: Experimental code from 2024-09, never productionized, still has active config.

**Other Dead Features**:
- EXTERNAL_GUARD (3 vars) - Purpose unclear, no documentation
- MF2 (3 vars) - Undocumented, possibly abandoned
- OLD_REFILL (5 vars) - Replaced by P0 batch refill

**Impact**:
- Users waste time trying dead features
- CI tests dead code paths
- Codebase appears larger than it is

**Recommendation**: Remove dead code and deprecate variables with a 6-month timeline.

---

### 3. Learning System Chaos (6 independent systems)

**The Problem**: HAKMEM has 6 separate learning/adaptive systems with unclear interaction semantics.

**The 6 Systems**:
```bash
HAKMEM_LEARN=1                     # Global meta-learner?
HAKMEM_TINY_LEARN=1                # TINY-specific learning
HAKMEM_TINY_CAP_LEARN=1            # TLS capacity learning
HAKMEM_ADAPTIVE_SIZING=1           # Size class tuning
HAKMEM_THP_LEARN=1                 # Transparent Huge Pages
HAKMEM_WMAX_LEARN=1                # Workload max size learning
```

**Questions with No Answers**:
- Can these be enabled together? Do they conflict?
- Which learning system owns TLS cache sizing?
- What happens if TINY_LEARN=1 but LEARN=0?
- Is there a master learning coordinator?

**Additional Learning Vars** (12 more):
```bash
HAKMEM_LEARN_RATE=0.1
HAKMEM_LEARN_DECAY=0.95
HAKMEM_LEARN_MIN_SAMPLES=1000
HAKMEM_TINY_LEARN_WINDOW=10000
HAKMEM_ADAPTIVE_SIZING_INTERVAL_MS=5000
# ... (7 more tuning parameters)
```

**Impact**:
- Unpredictable behavior when multiple systems are enabled
- No documented interaction model
- Difficult to debug performance issues
- Unclear which system to tune

**Recommendation**: Consolidate to **2 learning systems** with clear boundaries and documented interaction semantics:
1. **Allocation Learning**: Size classes, TLS capacity, refill tuning
2. **Memory Learning**: THP, RSS optimization, SuperSlab lifecycle

---

### 4. Scripts Anarchy (30 files, 3000 LOC, zero hierarchy)

**The Problem**: Scripts have accumulated organically with no organization. Multiple scripts do the same thing with subtle differences.

**Examples**:

**Running Larson - 6 different ways**:
```bash
scripts/run_larson.sh              # Which one to use?
scripts/run_larson_1t.sh           # 1 thread variant
scripts/run_larson_8t.sh           # 8 thread variant
scripts/larson_benchmark.sh        # Different from run_larson.sh?
scripts/bench_larson_preset.sh     # Uses JSON presets
scripts/quick_larson.sh            # Quick test variant
```
**Which should I use?** → Unclear.

**Running Random Mixed - 3 different ways**:
```bash
scripts/run_random_mixed.sh        # Hardcoded params
scripts/bench_random_mixed_json.sh # Uses JSON preset
scripts/quick_random_mixed.sh      # Different defaults
```

**ENV Setup Duplication** (copy-pasted across 30 files):
```bash
# This block appears in 12+ scripts:
export HAKMEM_TINY_HEADER_CLASSIDX=1
export HAKMEM_TINY_AGGRESSIVE_INLINE=1
export HAKMEM_TINY_PREWARM_TLS=1
export HAKMEM_SS_EMPTY_REUSE=1
export HAKMEM_TINY_UNIFIED_CACHE=1
# ... (20 more vars duplicated everywhere)
```

**Impact**:
- New developers don't know where to start
- Bug fixes need to be applied to 6+ scripts
- Inconsistent behavior across scripts
- No single source of truth

**Recommendation**: Reorganize to **8 entry points**:
```
scripts/
├── bench/                  # Benchmarking entry points
│   ├── larson.sh           # Single Larson entry (flags for 1T/8T)
│   ├── random_mixed.sh     # Single Random Mixed entry
│   └── suite.sh            # Full benchmark suite
├── config/                 # Configuration presets
│   ├── production.env      # Production defaults
│   ├── debug.env           # Debug configuration
│   └── research.env        # Research/experimental
├── lib/                    # Shared utilities
│   ├── env_setup.sh        # Single source of ENV setup
│   └── validation.sh       # Config validation
└── README.md               # Scripts guide
```

---

### 5. Zero Configuration Documentation

**The Problem**: 236 environment variables, 59 build flags, 30 scripts → **ZERO master documentation**.

**What's Missing**:
- ❌ Master list of all ENV variables
- ❌ Categorization of variables by purpose
- ❌ Default values documentation
- ❌ Interaction semantics (which vars conflict?)
- ❌ Preset selection guide
- ❌ Deprecation timeline
- ❌ Scripts coordination guide
- ❌ Configuration examples for common use cases

**Current State**: Configuration knowledge exists only in:
1. Source code (scattered across 100+ files)
2. Git commit messages (hard to search)
3. Claude's memory (not accessible to others)
4. Tribal knowledge (not written down)

**Impact**:
- 2+ weeks onboarding time for new developers
- Configuration bugs in production
- Wasted time experimenting with dead features
- Duplicate questions ("Which Larson script should I use?")

**Recommendation**: Create **3 comprehensive guides**:
1. **CONFIGURATION.md** - Master reference (all vars categorized)
2. **PRESET_GUIDE.md** - How to choose presets
3. **SCRIPTS_GUIDE.md** - Scripts hierarchy and usage

---

## 🎯 Proposed Cleanup Strategy

### Phase 0: Immediate Wins (P0, 2 days effort, LOW risk)

**Goal**: Quick improvements that establish cleanup patterns.

**P0.1: Unify SuperSlab Variables** (5 vars → 3 vars)
- Remove: `HAKMEM_SS_EMPTY_REUSE`, `HAKMEM_SUPERSLAB_REUSE` (duplicates)
- Keep: `HAKMEM_SUPERSLAB_REUSE`, `HAKMEM_SUPERSLAB_LAZY`, `HAKMEM_SUPERSLAB_PREWARM`
- Effort: 1 hour (grep + replace + deprecation notice)

**P0.2: Create Master Preset Registry** (1 file → 4 files)
- `presets/production.json` - Recommended production config
- `presets/debug.json` - Full debugging enabled
- `presets/research.json` - Experimental features
- `presets/minimal.json` - Minimal feature set
- Effort: 2 hours (extract from current presets)

**P0.3: Clean Up build.sh Pinned Flags**
- Document all pinned flags in `BUILD_FLAGS.md`
- Remove obsolete flags (POOL_TLS_PHASE1=0, etc.)
- Effort: 2 hours

**P0.4: Consolidate Debug Variables** (11 vars → 4 vars)
- `HAKMEM_DEBUG_LEVEL` (0-3): 0=none, 1=errors, 2=info, 3=verbose
- `HAKMEM_DEBUG_TINY` (0/1): TINY allocator specific
- `HAKMEM_DEBUG_POOL` (0/1): Pool allocator specific
- `HAKMEM_DEBUG_MID` (0/1): Mid-Large allocator specific
- Effort: 3 hours (consolidate scattered debug toggles)

**P0.5: Create DEPRECATED.md**
- List all deprecated variables with sunset dates
- Add deprecation warnings to code (TLS-cached, lightweight)
- Effort: 1 hour

**Total Phase 0 Effort**: 2 days
**Risk**: LOW (backward compatible with deprecation warnings)

---

### Phase 1: Structural Improvements (P1, 3 days effort, MEDIUM risk)

**Goal**: Reorganize and document configuration system.

**P1.1: Reorganize Scripts Hierarchy**
- Move to `scripts/{bench,config,lib}/` structure
- Consolidate 6 Larson scripts → 1 with flags
- Create shared `lib/env_setup.sh`
- Effort: 1 day

**P1.2: Create CONFIGURATION.md**
- Master reference for all 236 variables
- Categorize by allocator/feature
- Document defaults and interactions
- Effort: 1 day

**P1.3: Create PRESET_GUIDE.md**
- When to use each preset
- How to customize presets
- Common configuration patterns
- Effort: 4 hours

**P1.4: Add Preset Versioning**
- `presets/v1/production.json` (semantic versioning)
- Migration guide for preset changes
- Effort: 2 hours

**P1.5: Add Configuration Validation**
- Runtime check for conflicting vars
- Warning for deprecated vars (console + log)
- Effort: 4 hours

**Total Phase 1 Effort**: 3 days
**Risk**: MEDIUM (scripts reorganization may break workflows)

---

### Phase 2: Deep Cleanup (P2, 4 days effort, MEDIUM risk)

**Goal**: Remove dead code and consolidate overlapping features.

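
Before deleting a dead feature, it helps to know where its variables are still referenced in source files and scripts. The sketch below is one possible audit helper under stated assumptions: `list_dead_var_refs` is a hypothetical name, not an existing HAKMEM script, and the prefixes `HAKMEM_TINY_SFC_` and `HAKMEM_PAGE_ARENA_` are taken from the examples in issue 2. Prefixes for the other dead features are not given in this summary and would need to be confirmed first.

```bash
# Hypothetical audit helper (assumption, not part of HAKMEM).
# Usage: list_dead_var_refs <variable-prefix> <directory>
# Prints each distinct variable name starting with <prefix> found under <directory>.
list_dead_var_refs() {
  prefix=$1
  dir=$2
  # -r: recurse, -h: omit file names, -o: print only the match, -E: extended regex.
  # grep exits 1 when nothing matches; treat that as an empty result.
  { grep -rhoE "${prefix}[A-Z0-9_]+" "$dir" || true; } | sort -u
}
```

For example, `list_dead_var_refs HAKMEM_TINY_SFC_ .` run from the repository root would list every SFC variable name that still appears anywhere in the tree. An empty result suggests the prefix is safe to drop from the parser.
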
**P2.1: Remove Dead Code**
- SFC (6 vars) → Remove
- PAGE_ARENA (5 vars) → Remove or document as research
- EXTERNAL_GUARD (3 vars) → Remove
- MF2 (3 vars) → Remove
- OLD_REFILL (5 vars) → Remove
- Effort: 1 day (with 6-month deprecation period)

**P2.2: Consolidate Learning Systems** (6 systems → 2 systems)
- Allocation Learning: size classes, TLS, refill
- Memory Learning: THP, RSS, SuperSlab lifecycle
- Document interaction semantics
- Effort: 2 days (complex refactoring)

**P2.3: Reorganize TINY Allocator Config** (113 vars → ~40 vars)
- Core allocation: 15 vars
- TLS caching: 8 vars
- Refill/drain: 6 vars
- Debug: 5 vars
- Learning: 6 vars
- Effort: 2 days (with 6-month migration)

**P2.4: Unify Profiling/Stats** (15 vars → 4 vars)
- `HAKMEM_PROFILE_LEVEL` (0-3)
- `HAKMEM_STATS_INTERVAL_MS`
- `HAKMEM_STATS_OUTPUT_FILE`
- `HAKMEM_TRACE_ALLOCATIONS` (0/1)
- Effort: 4 hours

**P2.5: Remove Benchmark-Specific Hacks**
- `HAKMEM_BENCH_FAST_MODE` - should be a preset, not ENV var
- `HAKMEM_TINY_ULTRA_SIMPLE` - merge into debug level
- Effort: 2 hours

**Total Phase 2 Effort**: 4 days
**Risk**: MEDIUM (requires careful migration planning)

---

## 📈 Success Metrics

### Quantitative

```
ENV Variables:      236 → 80  (-66%)
Build Flags:         59 → 40  (-32%)
Shell Scripts:       30 → 8   (-73%)
Undocumented Vars:   16 → 0   (-100%)
```

### Qualitative

- ✅ New developer onboarding: 2 weeks → 2 days
- ✅ Configuration bugs: Common → Rare
- ✅ Testing matrix: Intractable → Manageable
- ✅ Feature discovery: Trial-and-error → Documented

---

## 📅 Timeline

| Phase | Duration | Risk | Dependencies |
|-------|----------|------|--------------|
| **Phase 0** | 2 days | LOW | None |
| **Phase 1** | 3 days | MEDIUM | Phase 0 complete |
| **Phase 2** | 4 days | MEDIUM | Phase 1 complete |
| **Total** | **9 days** | Manageable | Incremental rollout |

**Deprecation Period**: 6 months (2025-11-26 → 2026-05-26)

---

## 🚀 Getting Started

**Immediate Next Steps**:
1. ✅ Read this summary (you're done!)
2. 📖 Review detailed analysis: `hakmem_config_analysis.txt`
3. 🛠️ Review concrete proposal: `hakmem_cleanup_proposal.txt`
4. 🎯 Start with P0.1 (SuperSlab unification) - lowest risk, sets pattern
5. 📝 Track progress in `CONFIG_CLEANUP_PROGRESS.md`

**Questions?**
- Technical details → `hakmem_config_analysis.txt`
- Implementation plan → `hakmem_cleanup_proposal.txt`
- Quick reference → This document

---

## 📚 Related Documents

- **hakmem_config_analysis.txt** (30-min read)
  - Complete inventory of 236 ENV variables
  - Detailed categorization and pain points
  - Scripts analysis and configuration drift examples
- **hakmem_cleanup_proposal.txt** (30-min read)
  - Concrete implementation roadmap
  - Step-by-step instructions for each phase
  - Risk mitigation strategies
- **CONFIGURATION.md** (to be created in P1.2)
  - Master reference for all configuration
  - Will become single source of truth

---

**Last Updated**: 2025-11-26
**Next Review**: After Phase 0 completion (est. 2025-11-28)