# HAKMEM Configuration Crisis - Executive Summary
**Date**: 2025-11-26
**Status**: 🔴 CRITICAL - Configuration complexity is hindering development
**Reading Time**: 10 minutes

---
## 🚨 The Crisis in Numbers
| Metric | Current | Target | Reduction |
|--------|---------|--------|-----------|
| **Runtime ENV variables** | 236 | 80 | **-66%** |
| **Build-time flags** | 59+ | ~40 | **-32%** |
| **Shell scripts** | 30 files (3000 LOC) | 8 entry points | **-73%** |
| **JSON presets** | 1 file, 3 presets | 4+ files, organized | Better structure |
| **Configuration guides** | 0 | 3+ comprehensive | New (none exist today) |
| **Deprecation tracking** | None | Automated timeline | Needed |
**Bottom Line**: HAKMEM has grown from a research allocator to a production system, but configuration management hasn't scaled. We're at the point where **even the original developers are losing track of features**.

---
## 📊 Quick Facts
### Environment Variables (236 total)
**By Category**:
```
TINY Allocator: 113 vars (48%) 🔴 BLOATED
Debug/Profiling: 31 vars (13%)
Learning Systems: 18 vars (8%) 🟡 6 independent systems
SuperSlab: 15 vars (6%)
Shared Pool: 12 vars (5%)
Mid-Large: 11 vars (5%)
Benchmarking: 10 vars (4%)
Others: 26 vars (11%)
```
**By Status**:
```
Active & Used: ~120 vars (51%)
Deprecated/Dead: ~60 vars (25%) 🔴 REMOVE
Research/Experimental: ~40 vars (17%)
Undocumented: ~16 vars (7%) 🔴 UNCLEAR
```
### Build Flags (59+ total)
**By Category**:
```
Feature Toggles: 23 flags (39%)
Optimization: 15 flags (25%)
Debug/Instrumentation: 12 flags (20%)
Build Modes: 9 flags (15%)
```
### Shell Scripts (30 files)
**By Type**:
```
Benchmarking: 14 scripts (47%) 🟡 Overlapping
ENV Setup: 6 scripts (20%) 🔴 Duplicated
Build Helpers: 5 scripts (17%)
Utilities: 5 scripts (17%)
```
**Problem**: No clear entry points, duplicated logic across 30 files, zero coordination.

---
## 🔥 Top 5 Critical Issues
### 1. TINY Allocator Configuration Explosion (113 vars)
**The Problem**: The TINY allocator has evolved through multiple phases (v1 → v2 → ULTRA → SLIM → Unified), but **old configuration layers were never removed**. The result is 113 overlapping environment variables.
**Examples of Chaos**:
```bash
# Refill configuration (7 overlapping strategies!)
HAKMEM_TINY_REFILL_BATCH_SIZE=64
HAKMEM_TINY_P0_BATCH=32 # Same as above?
HAKMEM_TINY_SFC_REFILL=16 # SFC is deprecated!
HAKMEM_UNIFIED_REFILL_SIZE=64 # Unified path
HAKMEM_TINY_FAST_REFILL_COUNT=32 # Fast path
HAKMEM_TINY_ULTRA_REFILL=8 # Ultra path
HAKMEM_TINY_SLIM_REFILL_BATCH=16 # SLIM path
# Debug toggles (11 variants with overlapping names!)
HAKMEM_TINY_DEBUG=1
HAKMEM_DEBUG_TINY=1 # Same thing?
HAKMEM_TINY_VERBOSE=1
HAKMEM_TINY_DEBUG_VERBOSE=1 # Combined?
HAKMEM_TINY_LOG=1
# ... (6 more variants)
```
**Impact**:
- Developers don't know which variables to use
- Testing matrix is intractable (at least 2^113 combinations, even if every variable were a simple on/off toggle)
- Configuration bugs are common
- Onboarding new developers takes weeks
**Recommendation**: Consolidate to **~40 variables** organized by architectural layer:
- Core allocation: 15 vars
- TLS caching: 8 vars
- Refill/drain: 6 vars
- Debug: 5 vars
- Learning: 6 vars
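
The overlap is easiest to see on a live environment. As a minimal sketch (the helper name `list_set_refill_vars` is hypothetical; the seven variable names are the ones quoted in the examples above), the following POSIX shell function prints every refill-related variable that is currently set. A run that prints more than one line is a candidate for review, since it is unclear which setting wins.

```bash
# Sketch: report which of the overlapping refill variables are set.
# The function body runs in a subshell "( ... )" so that the loop
# variables do not leak into the caller's shell.
list_set_refill_vars() (
    for name in \
        HAKMEM_TINY_REFILL_BATCH_SIZE HAKMEM_TINY_P0_BATCH \
        HAKMEM_TINY_SFC_REFILL HAKMEM_UNIFIED_REFILL_SIZE \
        HAKMEM_TINY_FAST_REFILL_COUNT HAKMEM_TINY_ULTRA_REFILL \
        HAKMEM_TINY_SLIM_REFILL_BATCH
    do
        # Indirect lookup via eval keeps this POSIX sh compatible.
        # The names come from the fixed list above, not from user input.
        eval "value=\${$name-}"
        if [ -n "$value" ]; then
            printf '%s=%s\n' "$name" "$value"
        fi
    done
)
```
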
---
### 2. Dead Code Still Has Active Config (60+ vars)
**The Problem**: Features have been replaced or deprecated, but their configuration variables are still active, causing confusion.
**Examples**:
**SFC (Single-Free-Cache) - REPLACED by Unified Cache**:
```bash
HAKMEM_TINY_SFC_ENABLE=1 # 🔴 Dead (replaced Nov 2025)
HAKMEM_TINY_SFC_CAP=128 # 🔴 Dead
HAKMEM_TINY_SFC_REFILL=16 # 🔴 Dead
HAKMEM_TINY_SFC_SPILL_THRESH=96 # 🔴 Dead
HAKMEM_TINY_SFC_BATCH_POP=8 # 🔴 Dead
HAKMEM_TINY_SFC_STATS=1 # 🔴 Dead
```
**Status**: Unified Cache replaced SFC in Phase 3d-B (2025-11-20), but SFC vars still parsed.
**PAGE_ARENA - Research artifact, never integrated**:
```bash
HAKMEM_PAGE_ARENA_ENABLE=1 # 🔴 Research-only
HAKMEM_PAGE_ARENA_SIZE_MB=16 # 🔴 Research-only
HAKMEM_PAGE_ARENA_GROWTH=2 # 🔴 Research-only
HAKMEM_PAGE_ARENA_MAX_MB=128 # 🔴 Research-only
HAKMEM_PAGE_ARENA_THP=1 # 🔴 Research-only
```
**Status**: Experimental code from 2024-09, never productionized, still has active config.
**Other Dead Features**:
- EXTERNAL_GUARD (3 vars) - Purpose unclear, no documentation
- MF2 (3 vars) - Undocumented, possibly abandoned
- OLD_REFILL (5 vars) - Replaced by P0 batch refill
**Impact**:
- Users waste time trying dead features
- CI tests dead code paths
- Codebase appears larger than it is
**Recommendation**: Remove dead code and deprecate variables with 6-month timeline.
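
Before removing anything, it helps to confirm where a dead variable is still read. The following is a minimal sketch, assuming the variables are read through literal `getenv("...")` calls (not confirmed in this document). The helper name `find_getenv_refs` is hypothetical, and the directory argument is whichever source tree is being audited. For example, `find_getenv_refs HAKMEM_TINY_SFC_ path/to/sources` would list remaining SFC readers.

```bash
# Sketch: list source lines that still read variables with a given prefix.
# Usage: find_getenv_refs PREFIX DIR
# Prints matches as "file:line:text"; returns non-zero when nothing matches,
# which is the desired end state for a fully removed feature.
find_getenv_refs() {
    grep -rn "getenv(\"$1" "$2"
}
```
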
---
### 3. Learning System Chaos (6 independent systems)
**The Problem**: HAKMEM has 6 separate learning/adaptive systems with unclear interaction semantics.
**The 6 Systems**:
```bash
HAKMEM_LEARN=1 # 1. Global meta-learner?
HAKMEM_TINY_LEARN=1 # 2. TINY-specific learning
HAKMEM_TINY_CAP_LEARN=1 # 3. TLS capacity learning
HAKMEM_ADAPTIVE_SIZING=1 # 4. Size class tuning
HAKMEM_THP_LEARN=1 # 5. Transparent Huge Pages
HAKMEM_WMAX_LEARN=1 # 6. Workload max size learning
```
**Questions with No Answers**:
- Can these be enabled together? Do they conflict?
- Which learning system owns TLS cache sizing?
- What happens if TINY_LEARN=1 but LEARN=0?
- Is there a master learning coordinator?
**Additional Learning Vars** (12 more):
```bash
HAKMEM_LEARN_RATE=0.1
HAKMEM_LEARN_DECAY=0.95
HAKMEM_LEARN_MIN_SAMPLES=1000
HAKMEM_TINY_LEARN_WINDOW=10000
HAKMEM_ADAPTIVE_SIZING_INTERVAL_MS=5000
# ... (7 more tuning parameters)
```
**Impact**:
- Unpredictable behavior when multiple systems enabled
- No documented interaction model
- Difficult to debug performance issues
- Unclear which system to tune
**Recommendation**: Consolidate to **2 learning systems**:
1. **Allocation Learning**: Size classes, TLS capacity, refill tuning
2. **Memory Learning**: THP, RSS optimization, SuperSlab lifecycle
With clear boundaries and documented interaction semantics.

---
### 4. Scripts Anarchy (30 files, 3000 LOC, zero hierarchy)
**The Problem**: Scripts have accumulated organically with no organization. Multiple scripts do the same thing with subtle differences.
**Examples**:
**Running Larson - 6 different ways**:
```bash
scripts/run_larson.sh # Which one to use?
scripts/run_larson_1t.sh # 1 thread variant
scripts/run_larson_8t.sh # 8 thread variant
scripts/larson_benchmark.sh # Different from run_larson.sh?
scripts/bench_larson_preset.sh # Uses JSON presets
scripts/quick_larson.sh # Quick test variant
```
**Which should I use?** → Unclear.
**Running Random Mixed - 3 different ways**:
```bash
scripts/run_random_mixed.sh # Hardcoded params
scripts/bench_random_mixed_json.sh # Uses JSON preset
scripts/quick_random_mixed.sh # Different defaults
```
**ENV Setup Duplication** (copy-pasted across 12+ scripts):
```bash
# This block appears in 12+ scripts:
export HAKMEM_TINY_HEADER_CLASSIDX=1
export HAKMEM_TINY_AGGRESSIVE_INLINE=1
export HAKMEM_TINY_PREWARM_TLS=1
export HAKMEM_SS_EMPTY_REUSE=1
export HAKMEM_TINY_UNIFIED_CACHE=1
# ... (20 more vars duplicated everywhere)
```
**Impact**:
- New developers don't know where to start
- Bug fixes need to be applied to 6+ scripts
- Inconsistent behavior across scripts
- No single source of truth
**Recommendation**: Reorganize to **8 entry points**:
```
scripts/
├── bench/ # Benchmarking entry points
│ ├── larson.sh # Single Larson entry (flags for 1T/8T)
│ ├── random_mixed.sh # Single Random Mixed entry
│ └── suite.sh # Full benchmark suite
├── config/ # Configuration presets
│ ├── production.env # Production defaults
│ ├── debug.env # Debug configuration
│ └── research.env # Research/experimental
├── lib/ # Shared utilities
│ ├── env_setup.sh # Single source of ENV setup
│ └── validation.sh # Config validation
└── README.md # Scripts guide
```
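
A shared `lib/env_setup.sh` could be as small as the sketch below. It covers only the five variables quoted in the duplicated block above. The function name `hak_env_setup` and the "keep the caller's value if already set" behavior are assumptions, not something this document specifies. A benchmark script would source the file and call the function once before launching the binary.

```bash
# scripts/lib/env_setup.sh (sketch)
# Intended use from another script:
#   . "$(dirname "$0")/../lib/env_setup.sh" && hak_env_setup
hak_env_setup() {
    # ${VAR:-1} applies the default only when VAR is unset or empty,
    # so a caller can still override any single value.
    export HAKMEM_TINY_HEADER_CLASSIDX="${HAKMEM_TINY_HEADER_CLASSIDX:-1}"
    export HAKMEM_TINY_AGGRESSIVE_INLINE="${HAKMEM_TINY_AGGRESSIVE_INLINE:-1}"
    export HAKMEM_TINY_PREWARM_TLS="${HAKMEM_TINY_PREWARM_TLS:-1}"
    export HAKMEM_SS_EMPTY_REUSE="${HAKMEM_SS_EMPTY_REUSE:-1}"
    export HAKMEM_TINY_UNIFIED_CACHE="${HAKMEM_TINY_UNIFIED_CACHE:-1}"
}
```
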
---
### 5. Zero Configuration Documentation
**The Problem**: 236 environment variables, 59 build flags, 30 scripts → **ZERO master documentation**.
**What's Missing**:
- ❌ Master list of all ENV variables
- ❌ Categorization of variables by purpose
- ❌ Default values documentation
- ❌ Interaction semantics (which vars conflict?)
- ❌ Preset selection guide
- ❌ Deprecation timeline
- ❌ Scripts coordination guide
- ❌ Configuration examples for common use cases
**Current State**: Configuration knowledge exists only in:
1. Source code (scattered across 100+ files)
2. Git commit messages (hard to search)
3. Claude's memory (not accessible to others)
4. Tribal knowledge (not written down)
**Impact**:
- 2+ weeks onboarding time for new developers
- Configuration bugs in production
- Wasted time experimenting with dead features
- Duplicate questions ("Which Larson script should I use?")
**Recommendation**: Create **3 comprehensive guides**:
1. **CONFIGURATION.md** - Master reference (all vars categorized)
2. **PRESET_GUIDE.md** - How to choose presets
3. **SCRIPTS_GUIDE.md** - Scripts hierarchy and usage
---
## 🎯 Proposed Cleanup Strategy
### Phase 0: Immediate Wins (P0, 2 days effort, LOW risk)
**Goal**: Quick improvements that establish cleanup patterns.
**P0.1: Unify SuperSlab Variables** (5 vars → 3 vars)
- Remove: `HAKMEM_SS_EMPTY_REUSE` and one further duplicate alias (not `HAKMEM_SUPERSLAB_REUSE`, which is kept; the second name to remove still needs to be confirmed)
- Keep: `HAKMEM_SUPERSLAB_REUSE`, `HAKMEM_SUPERSLAB_LAZY`, `HAKMEM_SUPERSLAB_PREWARM`
- Effort: 1 hour (grep + replace + deprecation notice)
**P0.2: Create Master Preset Registry** (1 file → 4 files)
- `presets/production.json` - Recommended production config
- `presets/debug.json` - Full debugging enabled
- `presets/research.json` - Experimental features
- `presets/minimal.json` - Minimal feature set
- Effort: 2 hours (extract from current presets)
**P0.3: Clean Up build.sh Pinned Flags**
- Document all pinned flags in `BUILD_FLAGS.md`
- Remove obsolete flags (POOL_TLS_PHASE1=0, etc.)
- Effort: 2 hours
**P0.4: Consolidate Debug Variables** (11 vars → 4 vars)
- `HAKMEM_DEBUG_LEVEL` (0-3): 0=none, 1=errors, 2=info, 3=verbose
- `HAKMEM_DEBUG_TINY` (0/1): TINY allocator specific
- `HAKMEM_DEBUG_POOL` (0/1): Pool allocator specific
- `HAKMEM_DEBUG_MID` (0/1): Mid-Large allocator specific
- Effort: 3 hours (consolidate scattered debug toggles)
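
On the script side, the single `HAKMEM_DEBUG_LEVEL` knob could be normalized as in the sketch below. The level meanings (0=none, 1=errors, 2=info, 3=verbose) come from the list above. Falling back to 0 for unset or invalid values, and the helper names `hak_debug_level` and `hak_debug_level_name`, are assumptions.

```bash
# Sketch: normalize HAKMEM_DEBUG_LEVEL to one of 0..3.
hak_debug_level() {
    case "${HAKMEM_DEBUG_LEVEL-}" in
        0|1|2|3) printf '%s\n' "$HAKMEM_DEBUG_LEVEL" ;;
        *)       printf '0\n' ;;   # assumption: anything else means "none"
    esac
}

# Map the normalized level to the label used in this document.
hak_debug_level_name() {
    case "$(hak_debug_level)" in
        0) echo none ;;
        1) echo errors ;;
        2) echo info ;;
        3) echo verbose ;;
    esac
}
```
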
**P0.5: Create DEPRECATED.md**
- List all deprecated variables with sunset dates
- Add deprecation warnings to code (TLS-cached, lightweight)
- Effort: 1 hour
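
A script-side counterpart to the in-code warnings might look like the sketch below. It is not the TLS-cached warning path inside the allocator, only a pre-flight check that benchmark or CI scripts could run. The variable names are the SFC and PAGE_ARENA entries listed as dead earlier in this document, and the sunset date is the end of the proposed 6-month deprecation period. The helper name `hak_warn_deprecated` is hypothetical.

```bash
# Sketch: warn on stderr about deprecated variables that are still set.
HAK_SUNSET_DATE="2026-05-26"

hak_warn_deprecated() (
    count=0
    for name in \
        HAKMEM_TINY_SFC_ENABLE HAKMEM_TINY_SFC_CAP HAKMEM_TINY_SFC_REFILL \
        HAKMEM_TINY_SFC_SPILL_THRESH HAKMEM_TINY_SFC_BATCH_POP \
        HAKMEM_TINY_SFC_STATS \
        HAKMEM_PAGE_ARENA_ENABLE HAKMEM_PAGE_ARENA_SIZE_MB \
        HAKMEM_PAGE_ARENA_GROWTH HAKMEM_PAGE_ARENA_MAX_MB \
        HAKMEM_PAGE_ARENA_THP
    do
        # ${VAR+yes} expands to "yes" only when VAR is set (even if empty).
        eval "is_set=\${$name+yes}"
        if [ "$is_set" = yes ]; then
            printf 'warning: %s is deprecated and scheduled for removal after %s\n' \
                "$name" "$HAK_SUNSET_DATE" >&2
            count=$((count + 1))
        fi
    done
    # Exit status: 0 when no deprecated variable is set, 1 otherwise.
    [ "$count" -eq 0 ]
)
```
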
**Total Phase 0 Effort**: 2 days
**Risk**: LOW (backward compatible with deprecation warnings)

---
### Phase 1: Structural Improvements (P1, 3 days effort, MEDIUM risk)
**Goal**: Reorganize and document configuration system.
**P1.1: Reorganize Scripts Hierarchy**
- Move to `scripts/{bench,config,lib}/` structure
- Consolidate 6 Larson scripts → 1 with flags
- Create shared `lib/env_setup.sh`
- Effort: 1 day
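
One way to fold the 1-thread and 8-thread wrappers into a single entry point is a `-t` flag, sketched below. This document does not say how the thread count is passed to the benchmark binary, so the sketch only resolves and validates the value and leaves the actual invocation out. The function name `larson_threads` and the default of 1 are assumptions.

```bash
# scripts/bench/larson.sh (sketch): resolve the thread count from "-t N".
# Runs in a subshell so "exit" only ends this function, not the caller.
larson_threads() (
    threads=1
    OPTIND=1
    while getopts "t:" opt; do
        case "$opt" in
            t) threads=$OPTARG ;;
            *) exit 2 ;;           # unknown flag
        esac
    done
    case "$threads" in
        ''|*[!0-9]*|0)
            echo "invalid thread count: $threads" >&2
            exit 2
            ;;
    esac
    printf '%s\n' "$threads"
)
```
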
**P1.2: Create CONFIGURATION.md**
- Master reference for all 236 variables
- Categorize by allocator/feature
- Document defaults and interactions
- Effort: 1 day
**P1.3: Create PRESET_GUIDE.md**
- When to use each preset
- How to customize presets
- Common configuration patterns
- Effort: 4 hours
**P1.4: Add Preset Versioning**
- `presets/v1/production.json` (semantic versioning)
- Migration guide for preset changes
- Effort: 2 hours
**P1.5: Add Configuration Validation**
- Runtime check for conflicting vars
- Warning for deprecated vars (console + log)
- Effort: 4 hours
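
As an illustration of a conflict check, the sketch below flags an old/new alias pair that is set to two different values. The example pair in the usage comment is the SuperSlab duplicate from P0.1; treating those two names as aliases of one setting is an assumption drawn from that item. The helper name `hak_check_alias` is hypothetical, and the arguments must be trusted variable names because they are passed through `eval`.

```bash
# Sketch: fail when an old alias and its replacement disagree.
# Usage: hak_check_alias HAKMEM_SS_EMPTY_REUSE HAKMEM_SUPERSLAB_REUSE
hak_check_alias() (
    old=$1
    new=$2
    eval "old_val=\${$old-}; new_val=\${$new-}"
    if [ -n "$old_val" ] && [ -n "$new_val" ] && [ "$old_val" != "$new_val" ]; then
        printf 'conflict: %s=%s but %s=%s\n' \
            "$old" "$old_val" "$new" "$new_val" >&2
        exit 1
    fi
)
```
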
**Total Phase 1 Effort**: 3 days
**Risk**: MEDIUM (scripts reorganization may break workflows)

---
### Phase 2: Deep Cleanup (P2, 4 days effort, MEDIUM risk)
**Goal**: Remove dead code and consolidate overlapping features.
**P2.1: Remove Dead Code**
- SFC (6 vars) → Remove
- PAGE_ARENA (5 vars) → Remove or document as research
- EXTERNAL_GUARD (3 vars) → Remove
- MF2 (3 vars) → Remove
- OLD_REFILL (5 vars) → Remove
- Effort: 1 day (with 6-month deprecation period)
**P2.2: Consolidate Learning Systems** (6 systems → 2 systems)
- Allocation Learning: size classes, TLS, refill
- Memory Learning: THP, RSS, SuperSlab lifecycle
- Document interaction semantics
- Effort: 2 days (complex refactoring)
**P2.3: Reorganize TINY Allocator Config** (113 vars → ~40 vars)
- Core allocation: 15 vars
- TLS caching: 8 vars
- Refill/drain: 6 vars
- Debug: 5 vars
- Learning: 6 vars
- Effort: 2 days (with 6-month migration)
**P2.4: Unify Profiling/Stats** (15 vars → 4 vars)
- `HAKMEM_PROFILE_LEVEL` (0-3)
- `HAKMEM_STATS_INTERVAL_MS`
- `HAKMEM_STATS_OUTPUT_FILE`
- `HAKMEM_TRACE_ALLOCATIONS` (0/1)
- Effort: 4 hours
**P2.5: Remove Benchmark-Specific Hacks**
- `HAKMEM_BENCH_FAST_MODE` - should be a preset, not ENV var
- `HAKMEM_TINY_ULTRA_SIMPLE` - merge into debug level
- Effort: 2 hours
**Total Phase 2 Effort**: 4 days (the itemized estimates above add up to 5 days plus 6 hours if done strictly in sequence)
**Risk**: MEDIUM (requires careful migration planning)

---
## 📈 Success Metrics
### Quantitative
```
ENV Variables: 236 → 80 (-66%)
Build Flags: 59 → 40 (-32%)
Shell Scripts: 30 → 8 (-73%)
Undocumented Vars: 16 → 0 (-100%)
```
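
The percentages above can be re-derived with integer shell arithmetic. The helper name `pct_reduction` is hypothetical; results are truncated toward zero, which matches the rounded figures in the table.

```bash
# Integer percentage reduction from BEFORE ($1) to AFTER ($2).
pct_reduction() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

pct_reduction 236 80   # ENV variables      -> prints 66
pct_reduction 59 40    # build flags        -> prints 32
pct_reduction 30 8     # shell scripts      -> prints 73
pct_reduction 16 0     # undocumented vars  -> prints 100
```

The same helper can be rerun as the counts change during cleanup to keep the progress figures honest.
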
### Qualitative
- ✅ New developer onboarding: 2 weeks → 2 days
- ✅ Configuration bugs: Common → Rare
- ✅ Testing matrix: Intractable → Manageable
- ✅ Feature discovery: Trial-and-error → Documented
---
## 📅 Timeline
| Phase | Duration | Risk | Dependencies |
|-------|----------|------|--------------|
| **Phase 0** | 2 days | LOW | None |
| **Phase 1** | 3 days | MEDIUM | Phase 0 complete |
| **Phase 2** | 4 days | MEDIUM | Phase 1 complete |
| **Total** | **9 days** | Manageable | Incremental rollout |
**Deprecation Period**: 6 months (2025-11-26 → 2026-05-26)

---
## 🚀 Getting Started
**Immediate Next Steps**:
1. ✅ Read this summary (you're done!)
2. 📖 Review detailed analysis: `hakmem_config_analysis.txt`
3. 🛠️ Review concrete proposal: `hakmem_cleanup_proposal.txt`
4. 🎯 Start with P0.1 (SuperSlab unification) - lowest risk, sets pattern
5. 📝 Track progress in `CONFIG_CLEANUP_PROGRESS.md`
**Questions?**
- Technical details → `hakmem_config_analysis.txt`
- Implementation plan → `hakmem_cleanup_proposal.txt`
- Quick reference → This document
---
## 📚 Related Documents
- **hakmem_config_analysis.txt** (30-min read)
- Complete inventory of 236 ENV variables
- Detailed categorization and pain points
- Scripts analysis and configuration drift examples
- **hakmem_cleanup_proposal.txt** (30-min read)
- Concrete implementation roadmap
- Step-by-step instructions for each phase
- Risk mitigation strategies
- **CONFIGURATION.md** (to be created in P1.2)
- Master reference for all configuration
- Will become single source of truth
---
**Last Updated**: 2025-11-26
**Next Review**: After Phase 0 completion (est. 2025-11-28)