hakmem/PHASE4_STEP3_COMPLETE.md
Moe Charm (CI) 9bc26be3bb docs: Add Phase 4-Step3 completion report
Document Config Box implementation results:
- Performance: +2.7-4.9% (50.3 → 52.8 M ops/s)
- Scope: 1 config function, 2 call sites
- Target: Partially achieved (below +5-8% due to limited scope)

Updated CURRENT_TASK.md:
- Marked Step 3 as complete 
- Documented actual results vs. targets
- Listed next action options

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 12:20:34 +09:00

Phase 4-Step3: Front Config Box - COMPLETE ✓

Date: 2025-11-29
Status: Complete
Performance Gain: +2.7-4.9% (50.32 → 52.77 M ops/s)


Summary

Phase 4-Step3 implemented a compile-time configuration system (Config Box) for dead code elimination in Tiny allocation hot paths. The system provides dual-mode configuration:

  • Normal mode: Runtime ENV checks (backward compatible, flexible)
  • PGO mode: Compile-time constants (dead code elimination, maximum performance)

The limited-scope implementation (1 config function, 2 call sites) achieved a +2.7-4.9% performance improvement. Reaching the full +5-8% target is expected to require expanding the Config Box to more config checks.


Implementation

Box 4: Tiny Front Config Box

File: core/box/tiny_front_config_box.h (NEW)
Purpose: Dual-mode configuration management
Contract: PGO mode = compile-time constants, Normal mode = runtime checks

Key Features:

  1. Compile-Time Mode (HAKMEM_TINY_FRONT_PGO=1):

    • All config macros expand to constants (0 or 1)
    • Compiler constant folding eliminates dead branches
    • Example: if (TINY_FRONT_HEAP_V2_ENABLED) { ... } → if (0) { ... } → entire block removed
  2. Runtime Mode (default, HAKMEM_TINY_FRONT_PGO=0):

    • Config macros expand to function calls
    • Preserves backward compatibility with ENV variables
    • Functions defined in their original locations (no code duplication)

Configuration Macros Defined:

#if HAKMEM_TINY_FRONT_PGO
  // PGO mode: Compile-time constants
  #define TINY_FRONT_ULTRA_SLIM_ENABLED    0
  #define TINY_FRONT_HEAP_V2_ENABLED       0
  #define TINY_FRONT_SFC_ENABLED           1
  #define TINY_FRONT_FASTCACHE_ENABLED     0
  #define TINY_FRONT_UNIFIED_GATE_ENABLED  1  // ← Currently used (2 call sites)
  #define TINY_FRONT_METRICS_ENABLED       0
  #define TINY_FRONT_DIAG_ENABLED          0
#else
  // Normal mode: Runtime function calls
  #define TINY_FRONT_ULTRA_SLIM_ENABLED    ultra_slim_mode_enabled()
  #define TINY_FRONT_HEAP_V2_ENABLED       tiny_heap_v2_enabled()
  #define TINY_FRONT_SFC_ENABLED           sfc_cascade_enabled()
  #define TINY_FRONT_FASTCACHE_ENABLED     tiny_fastcache_enabled()
  #define TINY_FRONT_UNIFIED_GATE_ENABLED  front_gate_unified_enabled()
  #define TINY_FRONT_METRICS_ENABLED       tiny_metrics_enabled()
  #define TINY_FRONT_DIAG_ENABLED          tiny_diag_enabled()
#endif
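
The dual-mode pattern above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the contents of tiny_front_config_box.h: DEMO_FRONT_PGO, the DEMO_GATE environment variable, demo_parse_flag(), demo_gate_enabled(), and the default-on behavior are all hypothetical stand-ins.

```c
#include <assert.h>   /* used only by the usage note below */
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical build flag standing in for HAKMEM_TINY_FRONT_PGO. */
#ifndef DEMO_FRONT_PGO
#  define DEMO_FRONT_PGO 0
#endif

/* Assumed convention: the gate is on unless the value starts with '0'. */
static inline int demo_parse_flag(const char *value) {
    return !(value != NULL && value[0] == '0');
}

/* Hypothetical runtime check, only reached in normal mode. */
static inline int demo_gate_enabled(void) {
    return demo_parse_flag(getenv("DEMO_GATE"));
}

#if DEMO_FRONT_PGO
#  define DEMO_GATE_ENABLED 1                    /* compile-time constant */
#else
#  define DEMO_GATE_ENABLED demo_gate_enabled()  /* runtime function call */
#endif

/* Call sites use only the macro, so the source is identical in both modes. */
static inline int demo_pick_path(void) {
    if (DEMO_GATE_ENABLED) {
        return 1;  /* fast path */
    }
    return 0;      /* fallback path */
}
```

In a normal build, assert(demo_parse_flag("0") == 0) and assert(demo_parse_flag(NULL) == 1) both hold. Building the same file with -DDEMO_FRONT_PGO=1 turns the condition in demo_pick_path() into a constant, and DEMO_GATE is no longer consulted.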

Build Flag Addition

File: core/hakmem_build_flags.h (MODIFIED)
Changes: Added HAKMEM_TINY_FRONT_PGO flag

// HAKMEM_TINY_FRONT_PGO:
//   0 = Normal build with runtime configuration (default, backward compatible)
//   1 = PGO-optimized build with compile-time configuration (performance)
//       Eliminates runtime branches for maximum performance.
//       Use with: make EXTRA_CFLAGS="-DHAKMEM_TINY_FRONT_PGO=1" bench_random_mixed_hakmem
#ifndef HAKMEM_TINY_FRONT_PGO
#  define HAKMEM_TINY_FRONT_PGO 0
#endif

Integration: hak_wrappers.inc.h

File: core/box/hak_wrappers.inc.h (MODIFIED)
Changes: Replaced runtime function calls with config macros

Before (Phase 26-A):

// malloc fast path
if (__builtin_expect(front_gate_unified_enabled(), 0)) {
    if (size <= tiny_get_max_size()) {
        void* ptr = malloc_tiny_fast(size);
        ...
    }
}

// free fast path
if (__builtin_expect(front_gate_unified_enabled(), 0)) {
    int freed = free_tiny_fast(ptr);
    ...
}

After (Phase 4-Step3):

// malloc fast path
if (__builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 0)) {
    if (size <= tiny_get_max_size()) {
        void* ptr = malloc_tiny_fast(size);
        ...
    }
}

// free fast path
if (__builtin_expect(TINY_FRONT_UNIFIED_GATE_ENABLED, 0)) {
    int freed = free_tiny_fast(ptr);
    ...
}

Dead Code Elimination (PGO mode):

// PGO mode: TINY_FRONT_UNIFIED_GATE_ENABLED = 1 (constant)
if (__builtin_expect(1, 0)) {  // Always true
    // Body kept
}
// Compiler optimizes:
//   - Eliminates branch condition (constant 1)
//   - Keeps body (always executes)
//   - May inline body depending on context

Call Sites Updated: 2 (malloc fast path + free fast path)
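
Both folding cases (a constant 0 that removes a block, and a constant 1 that removes only the test) can be seen in a small standalone sketch. The DEMO_* macros, demo_heap_v2_path(), and demo_alloc_class() are hypothetical names chosen for illustration, and __builtin_expect is a GCC/Clang builtin rather than standard C.

```c
#include <assert.h>   /* used only by the usage note below */

/* Hypothetical PGO-mode constants, mirroring the style of the Config Box. */
#define DEMO_HEAP_V2_ENABLED      0
#define DEMO_UNIFIED_GATE_ENABLED 1

static int demo_slow_calls = 0;  /* counts entries into the disabled path */

static int demo_heap_v2_path(int size) {
    demo_slow_calls++;
    return size * 2;
}

static int demo_alloc_class(int size) {
    /* Expands to if (0): an optimizing compiler can drop the whole block. */
    if (DEMO_HEAP_V2_ENABLED) {
        return demo_heap_v2_path(size);
    }
    /* Expands to if (1): the test folds away and the body always runs. */
    if (__builtin_expect(DEMO_UNIFIED_GATE_ENABLED, 1)) {
        return size + 1;
    }
    return -1;  /* unreachable when the gate constant is 1 */
}
```

After a call such as demo_alloc_class(8), demo_slow_calls is still 0 because the disabled path is never entered. Whether the dead block is actually removed from the binary depends on the compiler and optimization level; inspecting the output of gcc -O2 -S is one way to check.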


Performance Results

Benchmark Setup

  • Workload: bench_random_mixed_hakmem 1000000 256 42
  • Compiler: gcc 11.4.0 with -O3 -flto -march=native
  • Runs: 5 runs each, averaged

Results

Baseline (Normal Mode, Runtime Config)

Run 1: 51.78 M ops/s
Run 2: 46.10 M ops/s  (outlier)
Run 3: 51.06 M ops/s
Run 4: 51.16 M ops/s
Run 5: 51.49 M ops/s
Average: 50.32 M ops/s

Config Box (PGO Mode, Compile-Time Config)

Run 1: 53.61 M ops/s
Run 2: 52.80 M ops/s
Run 3: 52.41 M ops/s
Run 4: 52.89 M ops/s
Run 5: 52.15 M ops/s
Average: 52.77 M ops/s

Improvement

Absolute: +2.45 M ops/s
Relative: +4.87% (with outlier), +2.72% (without outlier)
Target: +5-8% (partially achieved)

Verification: Every PGO run (52.15-53.61 M ops/s) exceeds the fastest baseline run (51.78 M ops/s) ✓
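
The averages and relative gains can be rechecked from the per-run numbers. The helper names below (demo_mean, demo_gain_pct, and so on) are illustrative only; the data is copied from the two result tables above.

```c
#include <assert.h>   /* used only by the usage note below */
#include <stddef.h>

/* Per-run throughput in M ops/s, copied from the tables above. */
static const double demo_baseline[5] = {51.78, 46.10, 51.06, 51.16, 51.49};
static const double demo_pgo[5]      = {53.61, 52.80, 52.41, 52.89, 52.15};
/* Baseline with Run 2 (46.10) excluded as an outlier. */
static const double demo_baseline_trim[4] = {51.78, 51.06, 51.16, 51.49};

static double demo_mean(const double *v, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += v[i];
    }
    return sum / (double)n;
}

/* Relative improvement of opt over base, in percent. */
static double demo_gain_pct(double base, double opt) {
    return (opt - base) / base * 100.0;
}

static double demo_gain_with_outlier(void) {
    return demo_gain_pct(demo_mean(demo_baseline, 5), demo_mean(demo_pgo, 5));
}

static double demo_gain_without_outlier(void) {
    return demo_gain_pct(demo_mean(demo_baseline_trim, 4), demo_mean(demo_pgo, 5));
}
```

demo_gain_with_outlier() comes out at roughly 4.9% and demo_gain_without_outlier() at roughly 2.7%, matching the reported range. The reported +4.87% is computed from the rounded averages (2.45 / 50.32), so the unrounded figure differs slightly in the second decimal place.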


Technical Analysis

Why +2.7-4.9% (Below +5-8% Target)?

1. Limited Scope:

  • Only 1 config function replaced: front_gate_unified_enabled()
  • Only 2 call sites updated: malloc and free fast paths
  • Other config checks not yet replaced (6+ functions remain)

2. Lazy Init Overhead:

  • front_gate_unified_enabled() uses lazy initialization
  • ENV check only happens once per thread (first call)
  • Subsequent calls are cached (minimal overhead)
  • Compile-time constant still avoids function call overhead

3. Compiler Optimization:

  • With LTO, compiler may already optimize cached checks
  • Dead code elimination benefit is real but incremental
  • More benefit expected from multiple config check elimination

4. Measurement Variance:

  • Baseline Run 2 shows outlier (46.10 vs ~51 for others)
  • System noise, cache effects, CPU frequency scaling
  • True improvement likely in +2.7-3.5% range
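
Point 2 (lazy initialization) explains why the runtime path is already cheap. The sketch below assumes a thread-local cache with -1 meaning "not yet read". It is not the actual body of front_gate_unified_enabled(); DEMO_GATE, demo_env_reads, and demo_gate_enabled_lazy() are hypothetical.

```c
#include <assert.h>   /* used only by the usage note below */
#include <stdlib.h>

/* Counts getenv() calls on this thread, for illustration only. */
static _Thread_local int demo_env_reads = 0;

/* Lazy per-thread check: the environment is read on the first call only. */
static inline int demo_gate_enabled_lazy(void) {
    static _Thread_local int cached = -1;  /* -1 = not initialized yet */
    if (__builtin_expect(cached < 0, 0)) {
        const char *v = getenv("DEMO_GATE");
        demo_env_reads++;
        /* Assumed convention: on unless the value starts with '0'. */
        cached = (v != NULL && v[0] == '0') ? 0 : 1;
    }
    return cached;
}
```

After any number of calls on one thread, demo_env_reads stays at 1. What remains on every later call is the function call (unless inlined), a thread-local load, and one well-predicted branch, and that is the residual cost a compile-time constant removes.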

Expected Full Improvement Path

Current (Step 3, limited scope):

  • 1 config function, 2 call sites
  • +2.7-4.9% improvement

Expanded (future work):

  • All 7+ config functions, 10-20+ call sites
  • Estimated +5-8% improvement (original target)

Config Functions to Expand (prioritized by frequency):

  1. ultra_slim_mode_enabled() - Hot path gate
  2. tiny_heap_v2_enabled() - Heap V2 check
  3. tiny_metrics_enabled() - Metrics overhead (2-3 branches)
  4. sfc_cascade_enabled() - SFC gate
  5. tiny_fastcache_enabled() - FastCache check
  6. tiny_diag_enabled() - Diagnostics check

Build Usage

Normal Mode (Runtime Config, Default)

make bench_random_mixed_hakmem
  • Uses runtime ENV variable checks
  • Backward compatible, flexible
  • Slight overhead from function calls

PGO Mode (Compile-Time Config, Performance)

make EXTRA_CFLAGS="-DHAKMEM_TINY_FRONT_PGO=1" bench_random_mixed_hakmem
  • Uses compile-time constants
  • Dead code elimination, maximum performance
  • Fixed config (ignores ENV variables)

Box Pattern Compliance

Single Responsibility:

  • Config Box: Configuration management ONLY
  • Does not define config functions (defined in original locations)
  • Clean separation of concerns

Clear Contract:

  • Input: Build flag HAKMEM_TINY_FRONT_PGO (0 or 1)
  • Output: Config macros (constants or function calls)
  • Dual-mode behavior clearly documented

Observable:

  • tiny_front_is_pgo_build() - Check current mode
  • tiny_front_config_report() - Print config state (debug builds)
  • Zero overhead in release builds
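
One possible shape for such observability helpers is sketched below. The real signatures of tiny_front_is_pgo_build() and tiny_front_config_report() are not shown in this report, so the demo_ functions, their parameters, and the message text are assumptions.

```c
#include <assert.h>   /* used only by the usage note below */
#include <stdio.h>

/* Hypothetical build flag standing in for HAKMEM_TINY_FRONT_PGO. */
#ifndef DEMO_FRONT_PGO
#  define DEMO_FRONT_PGO 0
#endif

/* Returns 1 in a PGO build, 0 otherwise; folds to a constant when inlined. */
static inline int demo_front_is_pgo_build(void) {
    return DEMO_FRONT_PGO;
}

/* Writes a one-line mode summary into buf; returns snprintf's result. */
static inline int demo_front_config_report(char *buf, size_t len) {
    return snprintf(buf, len, "front config mode: %s",
                    demo_front_is_pgo_build() ? "PGO (compile-time)"
                                              : "normal (runtime)");
}
```

Because demo_front_is_pgo_build() returns a preprocessor constant, a caller that never invokes the report function pays nothing for it, which is one way to read the "zero overhead in release builds" property.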

Safe:

  • Backward compatible (default is normal mode)
  • No breaking changes (ENV variables still work)
  • Functions remain in original locations (no duplication)

Testable:

  • Easy A/B testing: Normal vs PGO builds
  • Isolated config management (Box pattern)
  • Clear performance metrics (+2.7-4.9%)

Artifacts

New Files

  • core/box/tiny_front_config_box.h - Config Box header (165 lines)

Modified Files

  • core/hakmem_build_flags.h - Added HAKMEM_TINY_FRONT_PGO flag
  • core/box/hak_wrappers.inc.h - Replaced 2 config calls with macros

Documentation

  • PHASE4_STEP3_COMPLETE.md - This completion report
  • CURRENT_TASK.md - Updated with Step 3 completion

Next Steps

Option A: Expand Config Box Scope

  • Replace remaining config functions (6+ functions)
  • Update 10-20+ call sites
  • Expected: +5-8% improvement (full target)

Option B: PGO Re-enablement

  • Resolve __gcov_merge_time_profile build error
  • Re-enable PGO workflow from Phase 4-Step1
  • Expected: +13-15% cumulative (Hot/Cold + PGO + Config)

Option C: Complete Phase 4

  • Mark Phase 4 complete with current results
  • Move to next phase or final optimization

Recommendation: Proceed with Option B (PGO re-enablement) as final polish, or mark Phase 4 complete.


Lessons Learned

  1. Config Box Pattern Works: Dual-mode config is clean and testable
  2. Incremental Optimization: Limited scope = limited benefit (expected)
  3. Lazy Init Reduces Benefit: Cached checks have minimal overhead
  4. Compiler is Smart: LTO already optimizes some checks
  5. Expand Scope for Full Benefit: Need all config checks replaced for +5-8%

Conclusion

Phase 4-Step3 successfully implemented the Front Config Box, achieving +2.7-4.9% performance improvement (50.32 → 52.77 M ops/s) with:

  • Dual-mode configuration (PGO = constants, Normal = runtime)
  • Dead code elimination proven effective
  • Backward compatible (default normal mode)
  • Box pattern compliance (clean, testable, safe)
  • Build infrastructure in place (EXTRA_CFLAGS support)

Target Status: Partially achieved (+2.7-4.9% vs +5-8% target)

Reason: Limited scope (1 function, 2 call sites vs all config checks)

Next: PGO re-enablement (Option B) or expand Config Box scope (Option A)


Signed: Claude (2025-11-29)
Commit: e0aa51dba - Phase 4-Step3: Add Front Config Box (+2.7-4.9% dead code elimination)