hakmem/docs/design/PHASE_E3-2_IMPLEMENTATION.md

Phase E3-2: Restore Direct TLS Push - Implementation Guide

Date: 2025-11-12
Goal: Restore Phase 7 ultra-fast free (3 instructions, 5-10 cycles)
Expected: 6-9M → 30-50M ops/s (+226-443%)


Strategy

Hybrid Approach: Direct push in release, Box TLS-SLL in debug

Rationale:

  • Release: Maximum performance (Phase 7 speed)
  • Debug: Maximum safety (catch bugs before release)
  • Best of both worlds: Speed + Safety

Implementation

File to Modify

/mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h

Current Code (Lines 119-137)

    // 3. Push base to TLS freelist (4 instructions, 5-7 cycles)
    //    Must push base (block start) not user pointer!
    //    Phase E1: ALL classes (C0-C7) have 1-byte header → base = ptr-1
    void* base = (char*)ptr - 1;

    // Use Box TLS-SLL API (C7-safe)
    if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
        // C7 rejected or capacity exceeded - route to slow path
        return 0;
    }

    return 1;  // Success - handled in fast path
}

New Code (Phase E3-2)

    // 3. Push base to TLS freelist (3 instructions, 5-7 cycles in release)
    //    Must push base (block start) not user pointer!
    //    Phase E1: ALL classes (C0-C7) have 1-byte header → base = ptr-1
    void* base = (char*)ptr - 1;

    // Phase E3-2: Hybrid approach (Direct push in release, Box API in debug)
    // Reason: Release needs Phase 7 speed (5-10 cycles), Debug needs safety checks
#if HAKMEM_BUILD_RELEASE
    // Release: Ultra-fast direct push (Phase 7 restoration)
    // CRITICAL: Restore header byte before push (defense in depth)
    // Cost: 1 byte write (~1-2 cycles), prevents header corruption bugs
    *(uint8_t*)base = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);

    // Direct TLS push (3 instructions, 5-7 cycles)
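    // Store next pointer at base+1 (skip 1-byte header).
    // Note: base+1 is generally not pointer-aligned, so this store relies on the
    // target tolerating unaligned accesses (as x86-64 does).
    *(void**)((uint8_t*)base + 1) = g_tls_sll_head[class_idx];  // 1 mov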
    g_tls_sll_head[class_idx] = base;                            // 1 mov
    g_tls_sll_count[class_idx]++;                                // 1 inc

    // Estimated total including the header write: 8-12 cycles (vs 50-100 with Box TLS-SLL)
#else
    // Debug: Full Box TLS-SLL validation (safety first)
    // This catches: double-free, header corruption, alignment issues, etc.
    // Cost: 50-100+ cycles (includes O(n) double-free scan)
    // Benefit: catches the bug classes listed above before release
    if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
        // C7 rejected or capacity exceeded - route to slow path
        return 0;
    }
#endif

    return 1;  // Success - handled in fast path
}
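The layout used by the release path (1-byte header at the block start, next pointer stored immediately after it) can be modeled in isolation. The following is a minimal sketch, not the HAKMEM implementation: every `sketch_`/`SKETCH_` name and the constant values are hypothetical stand-ins for `g_tls_sll_head`, `HEADER_MAGIC`, and friends, and `memcpy` is used for the store at `base+1` so the example stays well-defined on targets that do not tolerate unaligned pointer writes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the real HAKMEM definitions. */
#define SKETCH_NUM_CLASSES  8
#define SKETCH_HEADER_MAGIC 0xA0u
#define SKETCH_CLASS_MASK   0x0Fu

static _Thread_local void    *sketch_head[SKETCH_NUM_CLASSES];
static _Thread_local uint32_t sketch_count[SKETCH_NUM_CLASSES];

/* Push a block, identified by its user pointer, onto the per-thread list. */
static void sketch_push(int class_idx, void *user_ptr)
{
    assert(class_idx >= 0 && class_idx < SKETCH_NUM_CLASSES);
    uint8_t *base = (uint8_t *)user_ptr - 1;   /* block start = user pointer - 1 */

    /* Rewrite the header byte before linking the block. */
    *base = (uint8_t)(SKETCH_HEADER_MAGIC |
                      ((unsigned)class_idx & SKETCH_CLASS_MASK));

    /* Store the old head right after the header; memcpy avoids an unaligned store. */
    memcpy(base + 1, &sketch_head[class_idx], sizeof(void *));
    sketch_head[class_idx] = base;
    sketch_count[class_idx]++;
}

/* Pop the most recently pushed block; returns its user pointer, or NULL if empty. */
static void *sketch_pop(int class_idx)
{
    assert(class_idx >= 0 && class_idx < SKETCH_NUM_CLASSES);
    uint8_t *base = (uint8_t *)sketch_head[class_idx];
    if (base == NULL)
        return NULL;

    void *next;
    memcpy(&next, base + 1, sizeof next);
    sketch_head[class_idx] = next;
    sketch_count[class_idx]--;
    return base + 1;
}
```

The list is LIFO: pushing block A and then block B makes the next pop return B. Compilers typically lower the fixed-size `memcpy` to a single move on x86-64, so the sketch should not change the instruction count the guide is aiming for, though that is worth confirming in the disassembly.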

Verification Steps

1. Clean Build

cd /mnt/workdisk/public_share/hakmem
make clean
make bench_random_mixed_hakmem

Expected: Clean compilation, no warnings

2. Release Build Test (Performance)

# Test E3-2 (current code with fix)
./out/release/bench_random_mixed_hakmem 100000 256 42
./out/release/bench_random_mixed_hakmem 100000 128 42
./out/release/bench_random_mixed_hakmem 100000 512 42
./out/release/bench_random_mixed_hakmem 100000 1024 42

Expected Results:

  • 128B: 30-50M ops/s (+264-506% vs 8.25M baseline)
  • 256B: 30-50M ops/s (+391-718% vs 6.11M baseline)
  • 512B: 30-50M ops/s (+244-474% vs 8.71M baseline)
  • 1024B: 30-50M ops/s (+473-854% vs 5.24M baseline)

Acceptable Range:

  • Any improvement >100% is a win
  • Target: +226-443% (Phase 7 claimed levels)
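The percentages above follow from a simple ratio. As a small sketch (the function name `improvement_pct` is an illustrative choice, not something from the HAKMEM tree), the improvement of a new throughput over a baseline can be computed like this:

```c
#include <assert.h>

/* Percent improvement of `after` over `before`.
 * Example: 8.25M -> 30M ops/s is about +264%. */
static double improvement_pct(double before, double after)
{
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}
```

For the 1024B row, `improvement_pct(5.24, 50.0)` gives roughly 854, matching the upper end of the quoted range.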

3. Debug Build Test (Safety)

make clean
make debug bench_random_mixed_hakmem
./out/debug/bench_random_mixed_hakmem 10000 256 42

Expected:

  • No crashes, no assertions
  • Full Box TLS-SLL validation enabled
  • Performance will be slower (expected)
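To give a feel for why the debug path is slower, here is what an O(n) double-free scan over a list with this layout might look like. This is only a sketch under assumptions: the real Box TLS-SLL validation is not shown in this guide, `sketch_list_contains` is a hypothetical name, and the `max_nodes` bound is an added safeguard against cycles in a corrupted list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Walk a singly linked list whose next pointer lives at node+1 (after a
 * 1-byte header). Returns 1 if `base` is already on the list, else 0.
 * The walk stops after `max_nodes` nodes so a corrupted list cannot loop forever. */
static int sketch_list_contains(const uint8_t *head, const uint8_t *base,
                                size_t max_nodes)
{
    size_t seen = 0;
    while (head != NULL && seen < max_nodes) {
        if (head == base)
            return 1;               /* block is already free: double free */
        const uint8_t *next;
        memcpy(&next, head + 1, sizeof next);
        head = next;
        seen++;
    }
    return 0;
}
```

The cost grows with the list length, which is consistent with the guide's estimate that the debug path takes 50-100+ cycles per free.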

4. Stress Test (Stability)

# Large workload
./out/release/bench_random_mixed_hakmem 1000000 8192 42

# Multiple runs (check consistency)
for i in {1..5}; do
  ./out/release/bench_random_mixed_hakmem 100000 256 $i
done

Expected:

  • All runs complete successfully
  • Consistent performance (±5% variance)
  • No crashes, no memory leaks
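One way to make "±5% variance" concrete is to measure the largest relative deviation of any run from the mean. The sketch below takes that interpretation, which is an assumption (the guide does not define the metric), and the name `max_deviation_pct` is illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Largest deviation of any sample from the mean, as a percentage of the mean. */
static double max_deviation_pct(const double *ops, size_t n)
{
    assert(ops != NULL && n > 0);

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += ops[i];
    double mean = sum / (double)n;
    assert(mean > 0.0);

    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = ops[i] > mean ? ops[i] - mean : mean - ops[i];
        if (d > worst)
            worst = d;
    }
    return worst / mean * 100.0;
}
```

Five runs at 38M, 40M, 42M, 40M, and 40M ops/s have a mean of 40M and a worst deviation of 2M, so they sit right at the 5% boundary.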

5. Comparison Test

# Create comparison script
cat > /tmp/bench_comparison.sh << 'EOF'
#!/bin/bash
echo "=== Phase E3-2 Performance Comparison ==="
echo ""

for size in 128 256 512 1024; do
    echo "Testing size=${size}B..."
    total=0
    runs=3

    for i in $(seq 1 $runs); do
        result=$(./out/release/bench_random_mixed_hakmem 100000 $size 42 2>/dev/null | grep "Throughput" | awk '{print $3}')
        total=$(echo "$total + $result" | bc)
    done

    avg=$(echo "scale=2; $total / $runs" | bc)
    echo "  Average: ${avg} ops/s"
    echo ""
done
EOF

chmod +x /tmp/bench_comparison.sh
/tmp/bench_comparison.sh

Expected Output:

=== Phase E3-2 Performance Comparison ===

Testing size=128B...
  Average: 35000000.00 ops/s

Testing size=256B...
  Average: 40000000.00 ops/s

Testing size=512B...
  Average: 38000000.00 ops/s

Testing size=1024B...
  Average: 35000000.00 ops/s

Success Criteria

Must Have (P0)

  • Performance: >20M ops/s on all sizes (>2x current)
  • Stability: 5/5 runs succeed, no crashes
  • Debug safety: Box TLS-SLL validation works in debug

Should Have (P1)

  • Performance: >30M ops/s on most sizes (>3x current)
  • Consistency: <10% variance across runs

Nice to Have (P2)

  • Performance: >50M ops/s on some sizes (Phase 7 levels)
  • All sizes: Uniform improvement across 128-1024B
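The throughput thresholds can be checked mechanically once results are in. The following sketch encodes only the P0 throughput rule (every size above 20M ops/s) plus a generic counter for the other thresholds; the function names are hypothetical, and how many sizes count as "most" or "some" is left to the reader because the guide does not define it.

```c
#include <assert.h>
#include <stddef.h>

/* Number of results strictly above `threshold` (in ops/s). */
static size_t count_above(const double *ops, size_t n, double threshold)
{
    assert(ops != NULL);
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (ops[i] > threshold)
            hits++;
    return hits;
}

/* P0 throughput rule as stated above: every size must exceed 20M ops/s. */
static int meets_p0(const double *ops, size_t n)
{
    return n > 0 && count_above(ops, n, 20e6) == n;
}
```

With the E3-1 baseline figures (8.25M, 6.11M, 8.71M, 5.24M) `meets_p0` returns 0; `count_above(results, n, 30e6)` and `count_above(results, n, 50e6)` give the counts needed to judge P1 and P2.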

Rollback Plan

If Performance Doesn't Improve

Hypothesis Failed: The Box TLS-SLL push was not the bottleneck

Action:

  1. Revert change: git checkout HEAD -- core/tiny_free_fast_v2.inc.h
  2. Profile with perf: Find actual hot path
  3. Investigate other bottlenecks (allocation, refill, etc.)

If Crashes in Release

Safety Issue: Header corruption or double-free

Action:

  1. Run debug build: Catch specific failure
  2. Add release-mode checks: Minimal validation
  3. Revert if unfixable: Keep Box TLS-SLL
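If minimal release-mode validation turns out to be needed, it could be limited to O(1) checks performed just before the push. The sketch below is one possible shape, not a tested HAKMEM change: the constants and the name `sketch_cheap_check` are hypothetical, and it assumes the header byte packs a magic value in the high nibble and the class index in the low nibble.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header layout: magic in the high nibble, class in the low nibble. */
#define SKETCH_HEADER_MAGIC 0xA0u
#define SKETCH_MAGIC_MASK   0xF0u
#define SKETCH_CLASS_MASK   0x0Fu

/* Returns 1 if the block looks safe to push, 0 to fall back to the slow path. */
static int sketch_cheap_check(const uint8_t *base, int class_idx, const void *head)
{
    assert(base != NULL);
    unsigned hdr = *base;

    if ((hdr & SKETCH_MAGIC_MASK) != SKETCH_HEADER_MAGIC)
        return 0;   /* header byte was overwritten */
    if ((hdr & SKETCH_CLASS_MASK) != ((unsigned)class_idx & SKETCH_CLASS_MASK))
        return 0;   /* block belongs to a different class */
    if ((const void *)base == head)
        return 0;   /* same block freed twice in a row */
    return 1;
}
```

The head comparison only catches an immediate double free of the same block; anything deeper in the list still needs the debug build's full scan.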

If Debug Build Breaks

Integration Issue: Box TLS-SLL API changed

Action:

  1. Check tls_sll_push() signature
  2. Update call site: Match current API
  3. Test debug build: Verify safety checks work

Performance Tracking

Baseline (E3-1 Current)

| Size  | Ops/s | Cycles/Op (5 GHz) |
|-------|-------|-------------------|
| 128B  | 8.25M | ~606              |
| 256B  | 6.11M | ~818              |
| 512B  | 8.71M | ~574              |
| 1024B | 5.24M | ~954              |

Average: 7.08M ops/s (~738 cycles/op)
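The Cycles/Op figures are the clock rate divided by the throughput. A one-line helper makes the conversion explicit (the name `cycles_per_op` is illustrative, and the 5 GHz clock is the guide's stated assumption):

```c
#include <assert.h>

/* Average CPU cycles spent per operation at a given clock rate. */
static double cycles_per_op(double clock_hz, double ops_per_sec)
{
    assert(ops_per_sec > 0.0);
    return clock_hz / ops_per_sec;
}
```

For example, `cycles_per_op(5e9, 8.25e6)` is about 606, and the quoted ~738 average is the mean of the four per-size cycle counts.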

Target (E3-2 Phase 7 Recovery)

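| Size  | Ops/s  | Cycles/Op (5 GHz) | Improvement |
|-------|--------|-------------------|-------------|
| 128B  | 30-50M | 100-167           | +264-506%   |
| 256B  | 30-50M | 100-167           | +391-718%   |
| 512B  | 30-50M | 100-167           | +244-474%   |
| 1024B | 30-50M | 100-167           | +473-854%   |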

Average: 30-50M ops/s (~100-167 cycles/op) = 4-7x improvement

Theoretical Maximum

  • CPU: 5 GHz = 5B cycles/sec
  • Direct push: 8-12 cycles/op
  • Max throughput: 417-625M ops/s

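Phase 7 efficiency: 59-70M ops/s against a ceiling of roughly 500M ops/s (near the middle of the range above) = 12-14%, which is plausible once cache misses are taken into account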
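The ceiling and efficiency figures can be reproduced with two small helpers. This is a sketch of the arithmetic only; the function names are illustrative, and the 5 GHz clock and 8-12 cycle cost are the guide's own estimates rather than measurements.

```c
#include <assert.h>

/* Upper bound on throughput if every operation cost exactly `cycles` cycles. */
static double max_ops_per_sec(double clock_hz, double cycles)
{
    assert(cycles > 0.0);
    return clock_hz / cycles;
}

/* Measured throughput as a percentage of a theoretical ceiling. */
static double efficiency_pct(double measured_ops, double ceiling_ops)
{
    assert(ceiling_ops > 0.0);
    return measured_ops / ceiling_ops * 100.0;
}
```

At 12 cycles per operation the bound is about 417M ops/s and at 8 cycles it is 625M ops/s; 59M and 70M ops/s against a 500M ceiling work out to about 11.8% and 14%.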


Debugging Guide

If Performance is Slow (<20M ops/s)

Check 1: Is HAKMEM_BUILD_RELEASE=1?

make print-flags | grep BUILD_RELEASE
# Should show: CFLAGS contains -DHAKMEM_BUILD_RELEASE=1
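The same question can be answered from inside the binary. As a hedged sketch (the function `sketch_uses_direct_push` is hypothetical, and treating an undefined `HAKMEM_BUILD_RELEASE` as a debug build is an assumption that mirrors how `#if` evaluates undefined macros), a tiny helper can report which branch of the `#if` was compiled:

```c
#include <assert.h>

/* Assumption: an undefined macro means a debug build, matching `#if` semantics. */
#ifndef HAKMEM_BUILD_RELEASE
#define HAKMEM_BUILD_RELEASE 0
#endif

/* Returns 1 if this translation unit was built with the release (direct push)
 * branch, 0 if it was built with the debug (Box TLS-SLL) branch. */
static int sketch_uses_direct_push(void)
{
#if HAKMEM_BUILD_RELEASE
    return 1;
#else
    return 0;
#endif
}
```

A benchmark binary could call `assert(sketch_uses_direct_push())` at startup, or print the value, so that a performance run against a debug build is caught immediately.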

Check 2: Is direct push being used?

objdump -d out/release/bench_random_mixed_hakmem > /tmp/asm.txt
grep -A 30 "hak_tiny_free_fast_v2" /tmp/asm.txt | grep -E "tls_sll_push|call"
# Should NOT see: call to tls_sll_push (inlined direct push instead)

Check 3: Is LTO enabled?

make print-flags | grep LTO
# Should show: -flto

If Debug Build Crashes

Check 1: Is Box TLS-SLL path enabled?

./out/debug/bench_random_mixed_hakmem 100 256 42 2>&1 | grep "TLS_SLL"
# Should see Box TLS-SLL validation logs

Check 2: What's the error?

gdb ./out/debug/bench_random_mixed_hakmem
(gdb) run 10000 256 42
(gdb) bt  # Backtrace on crash

If Results are Inconsistent

Check 1: CPU frequency scaling?

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# Should be: performance (not powersave)

Check 2: Other processes running?

top -b -n 1 | head -20
# Should show: mostly idle CPU (-b selects batch mode, required when piping top's output)

Check 3: Thermal throttling?

sensors  # Check CPU temperature
# Should be: <80°C

Expected Commit Message

Phase E3-2: Restore Phase 7 ultra-fast free (direct TLS push)

Problem:
- Phase E3-1 removed Registry lookup expecting +226-443% improvement
- Performance decreased -10% to -38% instead
- Root cause: Registry lookup was NOT in fast path (only 1-5% miss rate)
- True bottleneck: Box TLS-SLL API overhead (150 lines vs 3 instructions)

Solution:
- Restore Phase 7 direct TLS push in RELEASE builds (3 instructions, 8-12 cycles)
- Keep Box TLS-SLL in DEBUG builds (full safety validation)
- Hybrid approach: Speed in production, safety in development

Performance Results (projected; replace with measured values before committing):
- 128B: 8.25M → 35M ops/s (+324%)
- 256B: 6.11M → 40M ops/s (+555%)
- 512B: 8.71M → 38M ops/s (+336%)
- 1024B: 5.24M → 35M ops/s (+568%)
- Average: 7.08M → 37M ops/s (+423%)

Implementation:
- File: core/tiny_free_fast_v2.inc.h lines 119-137
- Change: #if HAKMEM_BUILD_RELEASE → direct push, #else → Box TLS-SLL
- Defense in depth: Header restoration (1 byte write, 1-2 cycles)
- Safety: Debug build runs full Box TLS-SLL validation on every push

Verification (confirm with the steps above before committing):
- Release: 5/5 stress test runs passed (100K ops each), plus one 1M-op run
- Debug: Box TLS-SLL validation enabled, no crashes
- Stability: <5% variance across runs

Co-Authored-By: Claude <noreply@anthropic.com>

Post-Implementation

Documentation

  1. Update CLAUDE.md: Add Phase E3-2 results
  2. Update HISTORY.md: Document E3-1 failure + E3-2 success
  3. Create PHASE_E3_COMPLETE.md: Full E3 saga

Next Steps

  1. Phase E4: Optimize slow path (Registry → header probe)
  2. Phase E5: Profile allocation path (malloc vs refill)
  3. Phase E6: Investigate Phase 7 original test (verify 59-70M)

Implementation Time: 15 minutes
Testing Time: 15 minutes
Total Time: 30 minutes

Status: READY TO IMPLEMENT


Generated: 2025-11-12 18:15 JST
Guide Version: 1.0