hakmem/docs/analysis/HAKMEM_CONFIG_SUMMARY.md
Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
Phase 1 complete: environment variable cleanup + fprintf debug guards

ENV variables removed (BG/HotMag family):
- core/hakmem_tiny_init.inc: removed HotMag ENV (~131 lines)
- core/hakmem_tiny_bg_spill.c: removed BG spill ENV
- core/tiny_refill.h: BG remote pinned to fixed values
- core/hakmem_tiny_slow.inc: removed BG refs

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

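Documentation cleanup:
- Deleted 328 markdown files (old reports, duplicate docs)

Performance check:
- Larson: 52.35M ops/s (previously 52.8M; stable)
- No functional impact from the ENV cleanup
- Some debug output remains (to be handled in the next phase)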

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-26 14:45:26 +09:00


HAKMEM Configuration Crisis - Executive Summary

Date: 2025-11-26
Status: 🔴 CRITICAL - Configuration complexity is hindering development
Reading Time: 10 minutes


🚨 The Crisis in Numbers

| Metric | Current | Target | Reduction |
|---|---|---|---|
| Runtime ENV variables | 236 | 80 | -66% |
| Build-time flags | 59+ | ~40 | -32% |
| Shell scripts | 30 files (3000 LOC) | 8 entry points | -73% |
| JSON presets | 1 file, 3 presets | 4+ files, organized | Better structure |
| Configuration guides | 0 | 3+ comprehensive | ∞% improvement |
| Deprecation tracking | None | Automated timeline | Needed |

Bottom Line: HAKMEM has grown from a research allocator to a production system, but configuration management hasn't scaled. We're at the point where even the original developers are losing track of features.


📊 Quick Facts

Environment Variables (236 total)

By Category:

TINY Allocator:        113 vars (48%) 🔴 BLOATED
Debug/Profiling:        31 vars (13%)
Learning Systems:       18 vars (8%)  🟡 6 independent systems
SuperSlab:              15 vars (6%)
Shared Pool:            12 vars (5%)
Mid-Large:              11 vars (5%)
Benchmarking:           10 vars (4%)
Others:                 26 vars (11%)

By Status:

Active & Used:         ~120 vars (51%)
Deprecated/Dead:        ~60 vars (25%) 🔴 REMOVE
Research/Experimental:  ~40 vars (17%)
Undocumented:          ~16 vars (7%)  🔴 UNCLEAR
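
An inventory like the one above could be regenerated by scanning the source tree. The sketch below is only an illustration: it assumes variables are read through literal `getenv("HAKMEM_...")` calls and that sources live under a directory such as `core/`. The helper names `hak_list_env_vars` and `hak_count_by_prefix` are hypothetical, and any variable read through a wrapper macro would be missed.

```shell
#!/bin/sh
# Hypothetical helper: list every distinct HAKMEM_* name that appears in a
# literal getenv("...") call under the given directory (default: core).
hak_list_env_vars() {
    dir=${1:-core}
    grep -rhoE 'getenv\("HAKMEM_[A-Z0-9_]+"\)' "$dir" 2>/dev/null |
        sed -e 's/^getenv("//' -e 's/")$//' |
        sort -u
}

# Hypothetical helper: count names per first prefix component
# (for example HAKMEM_TINY_DEBUG is counted under TINY).
hak_count_by_prefix() {
    hak_list_env_vars "$1" |
        sed -e 's/^HAKMEM_//' -e 's/_.*$//' |
        sort | uniq -c | sort -rn
}
```

Running `hak_list_env_vars core | wc -l` would give the total, and `hak_count_by_prefix core` a rough per-subsystem breakdown to compare against the table.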

Build Flags (59+ total)

By Category:

Feature Toggles:        23 flags (39%)
Optimization:           15 flags (25%)
Debug/Instrumentation:  12 flags (20%)
Build Modes:             9 flags (15%)

Shell Scripts (30 files)

By Type:

Benchmarking:          14 scripts (47%) 🟡 Overlapping
ENV Setup:              6 scripts (20%) 🔴 Duplicated
Build Helpers:          5 scripts (17%)
Utilities:              5 scripts (17%)

Problem: No clear entry points, duplicated logic across 30 files, zero coordination.


🔥 Top 5 Critical Issues

1. TINY Allocator Configuration Explosion (113 vars)

The Problem: TINY allocator has evolved through multiple phases (v1 → v2 → ULTRA → SLIM → Unified), but old configuration layers were never removed. Result: 113 overlapping environment variables.

Examples of Chaos:

# Refill configuration (7 overlapping strategies!)
HAKMEM_TINY_REFILL_BATCH_SIZE=64
HAKMEM_TINY_P0_BATCH=32              # Same as above?
HAKMEM_TINY_SFC_REFILL=16            # SFC is deprecated!
HAKMEM_UNIFIED_REFILL_SIZE=64        # Unified path
HAKMEM_TINY_FAST_REFILL_COUNT=32     # Fast path
HAKMEM_TINY_ULTRA_REFILL=8           # Ultra path
HAKMEM_TINY_SLIM_REFILL_BATCH=16     # SLIM path

# Debug toggles (11 variants with overlapping names!)
HAKMEM_TINY_DEBUG=1
HAKMEM_DEBUG_TINY=1                  # Same thing?
HAKMEM_TINY_VERBOSE=1
HAKMEM_TINY_DEBUG_VERBOSE=1          # Combined?
HAKMEM_TINY_LOG=1
... (6 more variants)

Impact:

  • Developers don't know which variables to use
  • Testing matrix is intractable (at least 2^113 combinations, even if every variable were only a boolean)
  • Configuration bugs are common
  • Onboarding new developers takes weeks

Recommendation: Consolidate to ~40 variables organized by architectural layer:

  • Core allocation: 15 vars
  • TLS caching: 8 vars
  • Refill/drain: 6 vars
  • Debug: 5 vars
  • Learning: 6 vars

2. Dead Code Still Has Active Config (60+ vars)

The Problem: Features have been replaced or deprecated, but their configuration variables are still active, causing confusion.

Examples:

SFC (Single-Free-Cache) - REPLACED by Unified Cache:

HAKMEM_TINY_SFC_ENABLE=1         # 🔴 Dead (replaced 2025-11-20)
HAKMEM_TINY_SFC_CAP=128          # 🔴 Dead
HAKMEM_TINY_SFC_REFILL=16        # 🔴 Dead
HAKMEM_TINY_SFC_SPILL_THRESH=96  # 🔴 Dead
HAKMEM_TINY_SFC_BATCH_POP=8      # 🔴 Dead
HAKMEM_TINY_SFC_STATS=1          # 🔴 Dead

Status: Unified Cache replaced SFC in Phase 3d-B (2025-11-20), but the SFC variables are still parsed.

PAGE_ARENA - Research artifact, never integrated:

HAKMEM_PAGE_ARENA_ENABLE=1       # 🔴 Research-only
HAKMEM_PAGE_ARENA_SIZE_MB=16     # 🔴 Research-only
HAKMEM_PAGE_ARENA_GROWTH=2       # 🔴 Research-only
HAKMEM_PAGE_ARENA_MAX_MB=128     # 🔴 Research-only
HAKMEM_PAGE_ARENA_THP=1          # 🔴 Research-only

Status: Experimental code from 2024-09, never productionized, still has active config.

Other Dead Features:

  • EXTERNAL_GUARD (3 vars) - Purpose unclear, no documentation
  • MF2 (3 vars) - Undocumented, possibly abandoned
  • OLD_REFILL (5 vars) - Replaced by P0 batch refill

Impact:

  • Users waste time trying dead features
  • CI tests dead code paths
  • Codebase appears larger than it is

Recommendation: Remove the dead code and deprecate its variables on a 6-month timeline.
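
During the deprecation window, a launcher script could flag dead-feature variables before a run starts. This is a minimal sketch, assuming the two prefixes taken from the SFC and PAGE_ARENA examples above. The function name `hak_warn_dead_vars` and the warning text are hypothetical and not part of HAKMEM.

```shell
#!/bin/sh
# Prefixes of features this document marks as dead or research-only.
HAK_DEAD_PREFIXES="HAKMEM_TINY_SFC_ HAKMEM_PAGE_ARENA_"

# Print one warning per dead-feature variable found in the environment.
# Returns 0 if none are set, 1 otherwise.
hak_warn_dead_vars() {
    found=0
    for prefix in $HAK_DEAD_PREFIXES; do
        for name in $(env | sed -n "s/^\(${prefix}[A-Z0-9_]*\)=.*/\1/p"); do
            echo "warning: $name belongs to a deprecated feature" >&2
            found=1
        done
    done
    return $found
}
```

Because the function only reads the environment and prints to stderr, it can be called from any benchmark script without changing behavior.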


3. Learning System Chaos (6 independent systems)

The Problem: HAKMEM has 6 separate learning/adaptive systems with unclear interaction semantics.

The 6 Systems:

1. HAKMEM_LEARN=1                    # Global meta-learner?
2. HAKMEM_TINY_LEARN=1               # TINY-specific learning
3. HAKMEM_TINY_CAP_LEARN=1           # TLS capacity learning
4. HAKMEM_ADAPTIVE_SIZING=1          # Size class tuning
5. HAKMEM_THP_LEARN=1                # Transparent Huge Pages
6. HAKMEM_WMAX_LEARN=1               # Workload max size learning

Questions with No Answers:

  • Can these be enabled together? Do they conflict?
  • Which learning system owns TLS cache sizing?
  • What happens if TINY_LEARN=1 but LEARN=0?
  • Is there a master learning coordinator?

Additional Learning Vars (12 more):

HAKMEM_LEARN_RATE=0.1
HAKMEM_LEARN_DECAY=0.95
HAKMEM_LEARN_MIN_SAMPLES=1000
HAKMEM_TINY_LEARN_WINDOW=10000
HAKMEM_ADAPTIVE_SIZING_INTERVAL_MS=5000
... (7 more tuning parameters)

Impact:

  • Unpredictable behavior when multiple systems enabled
  • No documented interaction model
  • Difficult to debug performance issues
  • Unclear which system to tune

Recommendation: Consolidate to 2 learning systems:

  1. Allocation Learning: Size classes, TLS capacity, refill tuning
  2. Memory Learning: THP, RSS optimization, SuperSlab lifecycle

With clear boundaries and documented interaction semantics.
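
Until the interaction semantics are documented, a pre-flight check could at least report when several learners are active at once. The sketch below uses the six toggle names listed above. The "more than one is suspicious" policy and the function names are assumptions for illustration, not HAKMEM behavior.

```shell
#!/bin/sh
# The six learning toggles named in this document.
HAK_LEARN_TOGGLES="HAKMEM_LEARN HAKMEM_TINY_LEARN HAKMEM_TINY_CAP_LEARN HAKMEM_ADAPTIVE_SIZING HAKMEM_THP_LEARN HAKMEM_WMAX_LEARN"

# Echo the name of every learning toggle currently set to 1.
hak_enabled_learners() {
    for name in $HAK_LEARN_TOGGLES; do
        eval "value=\${$name:-0}"
        [ "$value" = "1" ] && echo "$name"
    done
    return 0
}

# Warn (and return 1) when more than one learner is enabled.
# This policy is an assumption, chosen because the interaction is undocumented.
hak_check_learners() {
    count=$(hak_enabled_learners | wc -l | tr -d ' ')
    if [ "$count" -gt 1 ]; then
        echo "warning: $count learning systems enabled; interaction is undocumented" >&2
        return 1
    fi
    return 0
}
```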


4. Scripts Anarchy (30 files, 3000 LOC, zero hierarchy)

The Problem: Scripts have accumulated organically with no organization. Multiple scripts do the same thing with subtle differences.

Examples:

Running Larson - 6 different ways:

scripts/run_larson.sh                # Which one to use?
scripts/run_larson_1t.sh             # 1 thread variant
scripts/run_larson_8t.sh             # 8 thread variant
scripts/larson_benchmark.sh          # Different from run_larson.sh?
scripts/bench_larson_preset.sh       # Uses JSON presets
scripts/quick_larson.sh              # Quick test variant

Which should I use? → Unclear.

Running Random Mixed - 3 different ways:

scripts/run_random_mixed.sh          # Hardcoded params
scripts/bench_random_mixed_json.sh   # Uses JSON preset
scripts/quick_random_mixed.sh        # Different defaults

ENV Setup Duplication (copy-pasted across 12+ scripts):

# This block appears in 12+ scripts:
export HAKMEM_TINY_HEADER_CLASSIDX=1
export HAKMEM_TINY_AGGRESSIVE_INLINE=1
export HAKMEM_TINY_PREWARM_TLS=1
export HAKMEM_SS_EMPTY_REUSE=1
export HAKMEM_TINY_UNIFIED_CACHE=1
# ... (20 more vars duplicated everywhere)

Impact:

  • New developers don't know where to start
  • Bug fixes need to be applied to 6+ scripts
  • Inconsistent behavior across scripts
  • No single source of truth

Recommendation: Reorganize to 8 entry points:

scripts/
├── bench/                     # Benchmarking entry points
│   ├── larson.sh             # Single Larson entry (flags for 1T/8T)
│   ├── random_mixed.sh       # Single Random Mixed entry
│   └── suite.sh              # Full benchmark suite
├── config/                    # Configuration presets
│   ├── production.env        # Production defaults
│   ├── debug.env             # Debug configuration
│   └── research.env          # Research/experimental
├── lib/                       # Shared utilities
│   ├── env_setup.sh          # Single source of ENV setup
│   └── validation.sh         # Config validation
└── README.md                  # Scripts guide
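
A shared `lib/env_setup.sh` could replace the copy-pasted export blocks with a single loader. The sketch below assumes the `.env` presets are plain `KEY=VALUE` files with `#` comments. That file format and the function name `hak_load_env_file` are assumptions, not something the current scripts define.

```shell
#!/bin/sh
# Sketch of scripts/lib/env_setup.sh: export KEY=VALUE pairs from a preset
# file, skipping blank lines and '#' comments. Only HAKMEM_* keys are exported.
hak_load_env_file() {
    file=$1
    [ -r "$file" ] || { echo "error: cannot read $file" >&2; return 1; }
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case $key in
            ''|'#'*) continue ;;
            *[!A-Z0-9_]*) echo "warning: skipping malformed key '$key'" >&2 ;;
            HAKMEM_*) export "$key=$value" ;;
            *) echo "warning: skipping non-HAKMEM key '$key'" >&2 ;;
        esac
    done < "$file"
}
```

A benchmark script would then source the library and call `hak_load_env_file scripts/config/production.env` instead of repeating the exports, so a fix to the defaults lands in one place.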

5. Zero Configuration Documentation

The Problem: 236 environment variables, 59+ build flags, 30 scripts → ZERO master documentation.

What's Missing:

  • Master list of all ENV variables
  • Categorization of variables by purpose
  • Default values documentation
  • Interaction semantics (which vars conflict?)
  • Preset selection guide
  • Deprecation timeline
  • Scripts coordination guide
  • Configuration examples for common use cases

Current State: Configuration knowledge exists only in:

  1. Source code (scattered across 100+ files)
  2. Git commit messages (hard to search)
  3. Claude's memory (not accessible to others)
  4. Tribal knowledge (not written down)

Impact:

  • 2+ weeks onboarding time for new developers
  • Configuration bugs in production
  • Wasted time experimenting with dead features
  • Duplicate questions ("Which Larson script should I use?")

Recommendation: Create 3 comprehensive guides:

  1. CONFIGURATION.md - Master reference (all vars categorized)
  2. PRESET_GUIDE.md - How to choose presets
  3. SCRIPTS_GUIDE.md - Scripts hierarchy and usage

🎯 Proposed Cleanup Strategy

Phase 0: Immediate Wins (P0, 2 days effort, LOW risk)

Goal: Quick improvements that establish cleanup patterns.

P0.1: Unify SuperSlab Variables (5 vars → 3 vars)

  • Merge: HAKMEM_SS_EMPTY_REUSE into HAKMEM_SUPERSLAB_REUSE (duplicate toggles)
  • Keep: HAKMEM_SUPERSLAB_REUSE, HAKMEM_SUPERSLAB_LAZY, HAKMEM_SUPERSLAB_PREWARM
  • Effort: 1 hour (grep + replace + deprecation notice)
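
The "deprecation notice" step can stay backward compatible if the old name is honored while a warning is printed. Below is a minimal shell-side sketch, assuming HAKMEM_SS_EMPTY_REUSE is folded into HAKMEM_SUPERSLAB_REUSE. The helper `hak_alias_var` is hypothetical, and the real change would also need a matching shim wherever the allocator reads the variable.

```shell
#!/bin/sh
# Hypothetical shim: if the old name is set, print a deprecation notice on
# stderr; if the new name is not set yet, copy the old value across.
# An explicitly set new name always wins.
hak_alias_var() {
    old=$1
    new=$2
    eval "old_set=\${$old+x} new_set=\${$new+x}"
    [ -n "$old_set" ] || return 0
    echo "warning: $old is deprecated; use $new instead" >&2
    if [ -z "$new_set" ]; then
        eval "$new=\$$old"
        export "$new"
    fi
}
```

A script would call `hak_alias_var HAKMEM_SS_EMPTY_REUSE HAKMEM_SUPERSLAB_REUSE` once, before launching the benchmark binary.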

P0.2: Create Master Preset Registry (1 file → 4 files)

  • presets/production.json - Recommended production config
  • presets/debug.json - Full debugging enabled
  • presets/research.json - Experimental features
  • presets/minimal.json - Minimal feature set
  • Effort: 2 hours (extract from current presets)

P0.3: Clean Up build.sh Pinned Flags

  • Document all pinned flags in BUILD_FLAGS.md
  • Remove obsolete flags (POOL_TLS_PHASE1=0, etc.)
  • Effort: 2 hours

P0.4: Consolidate Debug Variables (11 vars → 4 vars)

  • HAKMEM_DEBUG_LEVEL (0-3): 0=none, 1=errors, 2=info, 3=verbose
  • HAKMEM_DEBUG_TINY (0/1): TINY allocator specific
  • HAKMEM_DEBUG_POOL (0/1): Pool allocator specific
  • HAKMEM_DEBUG_MID (0/1): Mid-Large allocator specific
  • Effort: 3 hours (consolidate scattered debug toggles)
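
One way to keep old debug toggles working during migration is to derive HAKMEM_DEBUG_LEVEL from them when it is not set explicitly. The mapping below (verbose toggles to level 3, plain debug or log toggles to level 2) is an assumption for illustration, since the document does not define how legacy toggles translate. The function name `hak_debug_level` is hypothetical.

```shell
#!/bin/sh
# Echo the effective debug level (0-3). An explicit HAKMEM_DEBUG_LEVEL wins;
# otherwise the level is inferred from legacy toggles (assumed mapping).
hak_debug_level() {
    if [ -n "${HAKMEM_DEBUG_LEVEL:-}" ]; then
        echo "$HAKMEM_DEBUG_LEVEL"
    elif [ "${HAKMEM_TINY_DEBUG_VERBOSE:-0}" = "1" ] || [ "${HAKMEM_TINY_VERBOSE:-0}" = "1" ]; then
        echo 3
    elif [ "${HAKMEM_TINY_DEBUG:-0}" = "1" ] || [ "${HAKMEM_TINY_LOG:-0}" = "1" ]; then
        echo 2
    else
        echo 0
    fi
}
```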

P0.5: Create DEPRECATED.md

  • List all deprecated variables with sunset dates
  • Add deprecation warnings to code (TLS-cached, lightweight)
  • Effort: 1 hour

Total Phase 0 Effort: 2 days
Risk: LOW (backward compatible with deprecation warnings)


Phase 1: Structural Improvements (P1, 3 days effort, MEDIUM risk)

Goal: Reorganize and document configuration system.

P1.1: Reorganize Scripts Hierarchy

  • Move to scripts/{bench,config,lib}/ structure
  • Consolidate 6 Larson scripts → 1 with flags
  • Create shared lib/env_setup.sh
  • Effort: 1 day
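
Collapsing the thread-count variants into one script mostly comes down to argument parsing. The sketch below shows how a single `larson.sh` might accept a thread count and a quick mode. The `-t` and `-q` flags and the function name `larson_parse_args` are assumptions, and the function only prints the parsed settings rather than invoking the benchmark.

```shell
#!/bin/sh
# Sketch of argument handling for a unified scripts/bench/larson.sh.
# -t N selects the thread count (default 1); -q selects a quick run.
larson_parse_args() {
    threads=1
    quick=0
    OPTIND=1
    while getopts "t:q" opt; do
        case $opt in
            t) threads=$OPTARG ;;
            q) quick=1 ;;
            *) echo "usage: larson.sh [-t threads] [-q]" >&2; return 2 ;;
        esac
    done
    case $threads in
        ''|0|*[!0-9]*) echo "error: -t expects a positive integer" >&2; return 2 ;;
    esac
    echo "threads=$threads quick=$quick"
}
```

With this shape, `larson.sh -t 1` and `larson.sh -t 8` would cover the two thread variants, and `larson.sh -q` the quick test.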

P1.2: Create CONFIGURATION.md

  • Master reference for all 236 variables
  • Categorize by allocator/feature
  • Document defaults and interactions
  • Effort: 1 day

P1.3: Create PRESET_GUIDE.md

  • When to use each preset
  • How to customize presets
  • Common configuration patterns
  • Effort: 4 hours

P1.4: Add Preset Versioning

  • presets/v1/production.json (semantic versioning)
  • Migration guide for preset changes
  • Effort: 2 hours

P1.5: Add Configuration Validation

  • Runtime check for conflicting vars
  • Warning for deprecated vars (console + log)
  • Effort: 4 hours

Total Phase 1 Effort: 3 days
Risk: MEDIUM (scripts reorganization may break workflows)


Phase 2: Deep Cleanup (P2, 4 days effort, MEDIUM risk)

Goal: Remove dead code and consolidate overlapping features.

P2.1: Remove Dead Code

  • SFC (6 vars) → Remove
  • PAGE_ARENA (5 vars) → Remove or document as research
  • EXTERNAL_GUARD (3 vars) → Remove
  • MF2 (3 vars) → Remove
  • OLD_REFILL (5 vars) → Remove
  • Effort: 1 day (with 6-month deprecation period)

P2.2: Consolidate Learning Systems (6 systems → 2 systems)

  • Allocation Learning: size classes, TLS, refill
  • Memory Learning: THP, RSS, SuperSlab lifecycle
  • Document interaction semantics
  • Effort: 2 days (complex refactoring)

P2.3: Reorganize TINY Allocator Config (113 vars → ~40 vars)

  • Core allocation: 15 vars
  • TLS caching: 8 vars
  • Refill/drain: 6 vars
  • Debug: 5 vars
  • Learning: 6 vars
  • Effort: 2 days (with 6-month migration)

P2.4: Unify Profiling/Stats (15 vars → 4 vars)

  • HAKMEM_PROFILE_LEVEL (0-3)
  • HAKMEM_STATS_INTERVAL_MS
  • HAKMEM_STATS_OUTPUT_FILE
  • HAKMEM_TRACE_ALLOCATIONS (0/1)
  • Effort: 4 hours

P2.5: Remove Benchmark-Specific Hacks

  • HAKMEM_BENCH_FAST_MODE - should be a preset, not ENV var
  • HAKMEM_TINY_ULTRA_SIMPLE - merge into debug level
  • Effort: 2 hours

Total Phase 2 Effort: 4 days (the task estimates above add up to about 5.75 days if run back to back)
Risk: MEDIUM (requires careful migration planning)


📈 Success Metrics

Quantitative

ENV Variables:     236 → 80  (-66%)
Build Flags:        59 → 40  (-32%)
Shell Scripts:      30 → 8   (-73%)
Undocumented Vars:  16 → 0   (-100%)
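
The percentages above follow from (current - target) / current. As a quick check, the arithmetic can be reproduced with integer rounding in shell. The helper name `pct_reduction` is hypothetical, and the build-flag row uses 59 as the baseline even though the inventory lists "59+".

```shell
#!/bin/sh
# Percentage reduction from $1 to $2, rounded to the nearest integer.
pct_reduction() {
    from=$1
    to=$2
    echo $(( ((from - to) * 100 + from / 2) / from ))
}

pct_reduction 236 80   # ENV variables
pct_reduction 59 40    # build flags
pct_reduction 30 8     # shell scripts
pct_reduction 16 0     # undocumented variables
```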

Qualitative

  • New developer onboarding: 2 weeks → 2 days
  • Configuration bugs: Common → Rare
  • Testing matrix: Intractable → Manageable
  • Feature discovery: Trial-and-error → Documented

📅 Timeline

| Phase | Duration | Risk | Dependencies |
|---|---|---|---|
| Phase 0 | 2 days | LOW | None |
| Phase 1 | 3 days | MEDIUM | Phase 0 complete |
| Phase 2 | 4 days | MEDIUM | Phase 1 complete |
| Total | 9 days | Manageable | Incremental rollout |

Deprecation Period: 6 months (2025-11-26 → 2026-05-26)


🚀 Getting Started

Immediate Next Steps:

  1. Read this summary (you're done!)
  2. 📖 Review detailed analysis: hakmem_config_analysis.txt
  3. 🛠️ Review concrete proposal: hakmem_cleanup_proposal.txt
  4. 🎯 Start with P0.1 (SuperSlab unification) - lowest risk, sets pattern
  5. 📝 Track progress in CONFIG_CLEANUP_PROGRESS.md

Questions?

  • Technical details → hakmem_config_analysis.txt
  • Implementation plan → hakmem_cleanup_proposal.txt
  • Quick reference → This document

  • hakmem_config_analysis.txt (30-min read)
    • Complete inventory of 236 ENV variables
    • Detailed categorization and pain points
    • Scripts analysis and configuration drift examples
  • hakmem_cleanup_proposal.txt (30-min read)
    • Concrete implementation roadmap
    • Step-by-step instructions for each phase
    • Risk mitigation strategies
  • CONFIGURATION.md (to be created in P1.2)
    • Master reference for all configuration
    • Will become single source of truth

Last Updated: 2025-11-26
Next Review: After Phase 0 completion (est. 2025-11-28)