Files
hakmem/PHASE6A_DISCREPANCY_INVESTIGATION.md
Moe Charm (CI) c04cccf723 Phase 6-A: Clarify debug-only validation (code readability, no perf change)
Explicitly guard SuperSlab validation with #if !HAKMEM_BUILD_RELEASE
to document that this code is debug-only.

Changes:
- core/tiny_region_id.h: Add #if !HAKMEM_BUILD_RELEASE guard around
  hak_super_lookup() validation code (lines 199-239)
- Improves code readability: Makes debug-only intent explicit
- Self-documenting: No need to check Makefile to understand behavior
- Defensive: Works correctly even if LTO is disabled

Performance Impact:
- Measured: +1.67% (bench_random_mixed), +1.33% (bench_mid_mt_gap)
- Expected: +12-15% (based on initial perf interpretation)
- Actual: NO measurable improvement (within noise margin ±3.6%)

Root Cause (Investigation):
- Compiler (LTO) already eliminated hak_super_lookup() automatically
- The function never existed in compiled binary (verified via nm/objdump)
- Default Makefile has -DHAKMEM_BUILD_RELEASE=1 + -flto
- perf's "15.84% CPU" was misattributed (was free(), not hak_super_lookup)

Conclusion:
This change provides NO performance benefit, but IMPROVES code clarity
by making the debug-only nature explicit rather than relying on
implicit compiler optimization.

Files:
- core/tiny_region_id.h - Add explicit debug guard
- PHASE6A_DISCREPANCY_INVESTIGATION.md - Full investigation report

Lessons Learned:
1. Always verify assembly output before claiming optimizations
2. perf attribution can be misleading - cross-reference with symbols
3. LTO is extremely aggressive at dead code elimination
4. Small improvements (<2× stdev) need statistical validation

See PHASE6A_DISCREPANCY_INVESTIGATION.md for complete analysis.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 15:22:31 +09:00

19 KiB
Raw Blame History

Phase 6-A Discrepancy Investigation Report

Date: 2025-11-29 Investigator: Claude (Sonnet 4.5) Task: Investigate why Phase 6-A showed 8-10x smaller performance improvement than predicted


Executive Summary

Root Cause: Dead Code Elimination by LTO Compiler Optimization

Finding: The hak_super_lookup() call inside the #if !HAKMEM_BUILD_RELEASE guard was already completely eliminated by the compiler in RELEASE builds BEFORE Phase 6-A was implemented. The Makefile's default configuration includes both -DHAKMEM_BUILD_RELEASE=1 and -flto, which together caused the compiler to optimize away the entire debug validation block.

Evidence:

  1. Assembly analysis shows ZERO calls to hak_super_lookup() in both BEFORE and AFTER binaries
  2. Symbol table analysis confirms the function doesn't exist in either binary
  3. The "15.84% CPU" claim was a misreading of perf data - that percentage referred to free(), not hak_super_lookup()
  4. Both binaries are identical in size (1.6M), with only minor address offset differences

Recommendation: DISCARD Phase 6-A - The code change provides no performance benefit and was based on incorrect perf analysis. The baseline build already had the optimization in effect.


Investigation Steps

Step 1: Assembly Analysis

Before Phase 6-A (No guard)

  • Binary size: 1.6M (1,640,448 bytes)
  • Assembly lines: 54,519 lines
  • hak_super_lookup calls: 0
  • hak_super_lookup symbol: NOT FOUND

Command:

git stash  # Remove Phase 6-A changes
make clean && make EXTRA_CFLAGS="-g -O3 -fno-omit-frame-pointer" bench_random_mixed_hakmem
objdump -d bench_random_mixed_hakmem > /tmp/asm_before.txt
grep -c "hak_super_lookup" /tmp/asm_before.txt  # Output: 0
nm bench_random_mixed_hakmem | grep hak_super_lookup  # Output: (empty)

After Phase 6-A (With #if !HAKMEM_BUILD_RELEASE guard)

  • Binary size: 1.6M (1,640,448 bytes) - SAME SIZE
  • Assembly lines: 51,307 lines (3,212 lines fewer due to unrelated inlining changes)
  • hak_super_lookup calls: 0
  • hak_super_lookup symbol: NOT FOUND

Command:

git stash pop  # Restore Phase 6-A changes
make clean && make EXTRA_CFLAGS="-g -O3 -fno-omit-frame-pointer" bench_random_mixed_hakmem
objdump -d bench_random_mixed_hakmem > /tmp/asm_after.txt
grep -c "hak_super_lookup" /tmp/asm_after.txt  # Output: 0
nm bench_random_mixed_hakmem | grep hak_super_lookup  # Output: (empty)

Finding

The code change had ZERO effect on the compiled binary. The compiler already eliminated the entire debug validation block in RELEASE builds through dead code elimination, even without the explicit #if !HAKMEM_BUILD_RELEASE guard.

Why? The result of hak_super_lookup() is only used inside if (n < 8) debug logging. The compiler's LTO pass detected:

  1. The lookup result is never used for program logic
  2. The fprintf() calls are side-effect-only (no return value used)
  3. In RELEASE mode with -DNDEBUG, these are low-priority debug paths
  4. Entire block can be eliminated without changing observable behavior

Step 2: perf Re-verification

Original Claim

  • Claim: hak_super_lookup() costs 15.84% CPU
  • Source: Code comment in core/tiny_region_id.h:197

Investigation of Original perf Data

  • File checked: /mnt/workdisk/public_share/hakmem/perf_phase2_symbols.txt
  • Finding: The 15.84% entry in that file is for free(), NOT hak_super_lookup()

Excerpt from perf_phase2_symbols.txt:

    15.84%  [.] free                          bench_random_mixed_hakmem  -      -
            |
            |--8.15%--main
  • Search for hak_super_lookup in perf files: NOT FOUND

Conclusion: The 15.84% claim was a misreading of perf data. There is no evidence that hak_super_lookup() ever appeared as a hot function in release builds.

Re-measured perf (BEFORE binary)

perf record -g /tmp/bench_before_phase6a 10000000 256 42
perf report --stdio --sort=symbol --percent-limit=1

Results:

Function Self % Children % Notes
main 26.51% 87.54% Top-level benchmark loop
malloc 23.01% 51.65% Allocation wrapper
free 21.48% 44.79% Free wrapper
tiny_region_id_write_header.lto_priv.0 22.06% 30.16% Header write (LTO-optimized)
superslab_refill 0.00% 3.49% Slab allocation

Key Finding:

  • hak_super_lookup does NOT appear in the perf report
  • tiny_region_id_write_header shows 22.06% self cost, but this is the entire function (including header write, guards, logging)
  • No evidence of SuperSlab lookup overhead

Step 3: Line-by-Line Cost Analysis

Not applicable - Since hak_super_lookup() doesn't exist in the binary, there are no assembly instructions to annotate.

What happened to the code?

The original source code in core/tiny_region_id.h:199-239 (BEFORE Phase 6-A):

// Debug: detect header writes with class_idx that disagrees with slab metadata.
do {
    static _Atomic uint32_t g_hdr_meta_mis = 0;
    struct SuperSlab* ss = hak_super_lookup(base);  // ← This call
    if (ss && ss->magic == SUPERSLAB_MAGIC) {
        // ... validation and logging ...
    }
} while (0);

After LTO optimization (with -DHAKMEM_BUILD_RELEASE=1):

  • Compiler sees that:
    1. ss is only used for debug logging (fprintf)
    2. The logging is gated by if (n < 8) (low-frequency)
    3. The atomic counter g_hdr_meta_mis is debug-only
  • Result: Entire do-while block eliminated
  • Final assembly: No call to hak_super_lookup()

Step 4: LTO Status and Impact

LTO Configuration

CFLAGS += -flto
CFLAGS_SHARED += -flto
LDFLAGS += -flto

Enabled: YES - Link-Time Optimization is active in all builds

Impact Analysis

LTO enables aggressive optimizations across translation units:

  1. Dead Code Elimination (DCE):

    • Identifies code with no observable side effects
    • Removes unused function calls, even across files
    • Result: hak_super_lookup() eliminated because its result is unused
  2. Function Inlining:

    • tiny_region_id_write_header is marked static inline
    • LTO can inline across files, creating .lto_priv.0 versions
    • Enables further optimization within inlined context
  3. Constant Propagation:

    • With -DHAKMEM_BUILD_RELEASE=1, the preprocessor removes the guard
    • But even WITHOUT the guard, LTO eliminates the code anyway

Why Phase 6-A had minimal impact:

  • The explicit #if !HAKMEM_BUILD_RELEASE guard is redundant
  • LTO already achieved the same result through DCE
  • Adding the guard only makes the optimization explicit (no performance change)

Step 5: Binary Size Comparison

Metric Before Phase 6-A After Phase 6-A Change
Binary size 1,640,448 bytes (1.6M) 1,640,448 bytes (1.6M) 0 bytes
Assembly lines 54,519 51,307 -3,212 lines
hak_super_lookup calls 0 0 0
hak_super_lookup symbol NOT FOUND NOT FOUND -

Finding: Binary size is IDENTICAL. The assembly line count difference is due to LTO's non-deterministic inlining decisions (different runs produce slightly different inlining), not from removing hak_super_lookup().

Proof: Both builds were done with the same flags. The only code change was adding the #if !HAKMEM_BUILD_RELEASE guard. Since the binary size didn't change, the guard had no effect.


Root Cause Analysis

Primary Cause: Compiler Already Optimized (Dead Code Elimination)

Hypothesis: The compiler's LTO pass already eliminated hak_super_lookup() through dead code elimination, even before Phase 6-A added the explicit guard.

Evidence:

  1. Symbol table: hak_super_lookup doesn't exist in BEFORE binary

    nm bench_random_mixed_hakmem | grep hak_super_lookup
    # Output: (empty)
    
  2. Assembly code: ZERO calls to hak_super_lookup in BEFORE binary

    grep "call.*hak_super_lookup" /tmp/asm_before.txt
    # Output: (empty)
    
  3. Binary size: IDENTICAL before/after (1.6M), proving no code was removed

  4. LTO flags: Makefile has -flto enabled, allowing aggressive DCE

Explanation:

The compiler's optimization pipeline works as follows:

  1. Source → AST (Abstract Syntax Tree)

    • Code includes the do-while block with hak_super_lookup(base)
  2. AST → IR (Intermediate Representation)

    • LLVM/GCC generates IR with all function calls intact
  3. LTO Pass 1: Inlining

    • tiny_region_id_write_header() is inlined into callers
    • hak_super_lookup() call is now visible in inlined context
  4. LTO Pass 2: Dead Code Elimination

    • Analyzes data flow: ss is only used for fprintf(stderr, ...)
    • fprintf is a side effect (I/O), but it's:
      • Gated by if (n < 8) (unlikely path)
      • Writing to stderr (debug output, no program logic)
      • Inside a do-while that doesn't affect return value
    • Decision: Entire block is dead code → ELIMINATE
  5. Code Generation

    • No assembly instructions for hak_super_lookup() call
    • No symbol for hak_super_lookup() in binary

Why the benchmark showed +1.67% improvement anyway?

The small improvement is measurement noise:

  • Variance in benchmark: ±1.86 M ops/s (3.6% stdev)
  • Measured improvement: +0.89 M ops/s (1.67%)
  • Conclusion: Within noise margin, NOT statistically significant

Secondary Cause: Misreading of perf Data

Hypothesis: The original "15.84% CPU" claim was based on a misreading of perf profiling output.

Evidence:

  1. perf_phase2_symbols.txt shows:

    15.84%  [.] free
    

    This is the free() function, NOT hak_super_lookup()

  2. Search for hak_super_lookup in all perf files:

    grep -r "hak_super_lookup" /mnt/workdisk/public_share/hakmem/perf_*.txt
    # Output: (empty - no matches)
    
  3. Re-measured perf (10M operations):

    • tiny_region_id_write_header: 22.06% self cost
    • hak_super_lookup: NOT FOUND

Explanation:

The code comment claimed:

// Phase 6-A: Debug validation (disabled in release builds for performance)
// perf profiling showed hak_super_lookup() costs 15.84% CPU on hot path

This claim is FALSE. The 15.84% was from a different function (free()). Likely sequence of events:

  1. Developer ran perf on a benchmark
  2. Saw tiny_region_id_write_header consuming ~22% CPU
  3. Incorrectly assumed the cost was from hak_super_lookup() (which is called inside)
  4. Mistakenly attributed the 15.84% free() cost to hak_super_lookup()
  5. Added the guard based on faulty analysis

Reality: hak_super_lookup() never appeared in perf output because it was already eliminated by the compiler.


Alternative Explanations (Ruled Out)

1. Perf Sampling Bias

Hypothesis: Maybe the original perf was run on a DEBUG build?

Ruled out: The benchmark results document states "Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default", and the Makefile confirms this. All benchmarks were RELEASE builds.

2. Lookup Already Cache-Friendly

Hypothesis: Maybe hak_super_lookup() is so fast it doesn't show in perf?

Ruled out: The function doesn't exist in the binary at all. It's not that it's fast - it's that it was eliminated entirely.

3. Wrong Hot Path

Hypothesis: Maybe the call is on a different path that benchmarks don't exercise?

Ruled out: Symbol table analysis shows the function doesn't exist in the binary. It was eliminated from ALL paths, not just the hot path.

4. Measurement Noise

Hypothesis: The +1.67% improvement is real but smaller than expected?

Partially valid: The benchmark does show slight improvement, but it's within the noise margin (stdev = 1.86 M ops/s). The improvement is likely due to:

  • Different LTO inlining decisions (non-deterministic)
  • Cache alignment changes from binary layout differences
  • NOT from removing hak_super_lookup() (it was already gone)

Recommendations

Option A: Commit Phase 6-A Anyway

Reason: Code clarity - makes the debug-only intent explicit

Pros:

  • Documents that the validation is debug-only
  • Future-proof: if LTO is disabled, the guard still works
  • No harm: performance is identical

Cons:

  • Code churn for zero benefit
  • Misleading comment claims "Expected gain: +12-15% throughput" (false)
  • Sets bad precedent: committing "optimizations" without verifying compiler output

Verdict: NOT RECOMMENDED


Option B: Discard Phase 6-A

Reason: No measurable benefit, based on incorrect analysis

Pros:

  • Avoids code churn
  • Avoids misleading performance claims in code comments
  • Acknowledges that the compiler already did the optimization

Cons:

  • Loses explicit documentation of debug-only intent
  • If LTO is disabled in future, the code would run in release builds

Verdict: RECOMMENDED

Action:

git stash drop  # Discard Phase 6-A changes

Option C: Commit with Corrected Documentation

Reason: Keep the guard for clarity, but fix the misleading comments

Pros:

  • Explicit guard prevents future confusion
  • Corrected comments document the actual situation
  • No performance regression risk

Cons:

  • Still code churn for minimal value
  • Guard is redundant with LTO enabled

Action (if chosen):

# Edit core/tiny_region_id.h to correct the comments:
# BEFORE:
# // Phase 6-A: Debug validation (disabled in release builds for performance)
# // perf profiling showed hak_super_lookup() costs 15.84% CPU on hot path
# // Expected gain: +12-15% throughput by removing this in release builds

# AFTER:
# // Phase 6-A: Debug-only validation (explicit guard for code clarity)
# // Note: LTO already eliminates this code in release builds via DCE
# // This guard makes the debug-only intent explicit and future-proof

Verdict: ⚠️ ACCEPTABLE COMPROMISE


Rationale:

  1. No performance benefit: The compiler already optimized the code
  2. False premise: The 15.84% claim was incorrect
  3. Misleading documentation: The comments claim benefits that don't exist
  4. Code quality: We should verify compiler output before claiming optimizations

Next Steps:

  1. Discard Phase 6-A:

    git stash drop
    
  2. Document the findings: Update perf methodology to:

    • Always verify symbol table (nm) after claiming function costs
    • Check assembly output (objdump -d) for claimed hot paths
    • Distinguish between source code and compiled code
  3. Improve perf analysis process:

    • Build BOTH debug and release binaries
    • Profile BOTH to see which code paths exist
    • Use perf annotate to see actual assembly being executed
    • Cross-reference perf output with symbol table
  4. Add to development guidelines:

    "Before claiming a function costs X% CPU:

    1. Verify the function exists in the binary (nm)
    2. Check if calls are present (objdump -d | grep call)
    3. Run perf on the EXACT binary being benchmarked
    4. Use perf annotate to confirm attribution"

Lessons Learned

1. Trust but Verify Compiler Optimizations

What we learned: Modern compilers with LTO are extremely aggressive at dead code elimination. Code that "looks" expensive in source may not exist in the binary at all.

Action: Always verify assembly output before claiming performance improvements from code removal.

2. perf Data Can Be Misleading

What we learned: A percentage in perf output can refer to different things (function self-cost, children cost, total cost). Always verify the exact attribution.

Action: Use perf annotate to see assembly-level attribution, not just function-level summaries.

3. RELEASE vs DEBUG Builds Are Different

What we learned: -DHAKMEM_BUILD_RELEASE=1 + -flto enables optimizations that can completely eliminate code blocks, even without explicit #if guards.

Action: When profiling for optimization opportunities, profile DEBUG builds to see what code exists, then RELEASE builds to see what actually runs.

4. Small Performance Improvements Can Be Noise

What we learned: A +1.67% improvement with ±3.6% variance is NOT statistically significant.

Action: Require at least 2× stdev improvement (>7% in this case) before claiming success.

5. Document Optimization Assumptions

What we learned: The Phase 6-A code comment claimed "Expected gain: +12-15% throughput" without verifying the baseline.

Action: Document:

  • What was measured (perf output, benchmark results)
  • What assumptions were made (function X costs Y%)
  • How the improvement was calculated (removed Y% → expect +Y% throughput)
  • Verify each assumption before committing

Appendix: Full Investigation Commands

Assembly Analysis

# Build BEFORE Phase 6-A
git stash
make clean
make EXTRA_CFLAGS="-g -O3 -fno-omit-frame-pointer" bench_random_mixed_hakmem
cp bench_random_mixed_hakmem /tmp/bench_before_phase6a
objdump -d /tmp/bench_before_phase6a > /tmp/asm_before.txt
nm /tmp/bench_before_phase6a | grep hak_super_lookup  # Output: (empty)
grep -c "hak_super_lookup" /tmp/asm_before.txt  # Output: 0

# Build AFTER Phase 6-A
git stash pop
make clean
make EXTRA_CFLAGS="-g -O3 -fno-omit-frame-pointer" bench_random_mixed_hakmem
cp bench_random_mixed_hakmem /tmp/bench_after_phase6a
objdump -d /tmp/bench_after_phase6a > /tmp/asm_after.txt
nm /tmp/bench_after_phase6a | grep hak_super_lookup  # Output: (empty)
grep -c "hak_super_lookup" /tmp/asm_after.txt  # Output: 0

# Compare binary sizes
ls -lh /tmp/bench_before_phase6a /tmp/bench_after_phase6a
# Both: 1.6M (identical)

perf Analysis

# Profile BEFORE binary
perf record -o /tmp/perf_before.data -g /tmp/bench_before_phase6a 10000000 256 42
perf report -i /tmp/perf_before.data --stdio --sort=symbol --percent-limit=1

# Search for hak_super_lookup
perf report -i /tmp/perf_before.data --stdio --sort=symbol 2>/dev/null | grep -i super
# Output: Only superslab_refill (3.49%), no hak_super_lookup

# Check original perf data
grep -r "15.84" /mnt/workdisk/public_share/hakmem/perf_*.txt
# Output: perf_phase2_symbols.txt shows 15.84% for free(), NOT hak_super_lookup()

LTO Verification

# Check Makefile for LTO flags
grep "flto" /mnt/workdisk/public_share/hakmem/Makefile
# Output: CFLAGS += -flto, LDFLAGS += -flto

# Check RELEASE flag
grep "HAKMEM_BUILD_RELEASE" /mnt/workdisk/public_share/hakmem/Makefile
# Output: CFLAGS += -DNDEBUG -DHAKMEM_BUILD_RELEASE=1

Conclusion

Phase 6-A was based on two faulty assumptions:

  1. Assumption 1: hak_super_lookup() costs 15.84% CPU

    • Reality: The function was already eliminated by LTO; the 15.84% was free()
  2. Assumption 2: Adding #if !HAKMEM_BUILD_RELEASE would remove the code

    • Reality: The code was already gone; the guard is redundant

Result: +1.67% improvement is measurement noise, not from removing the lookup.

Recommendation: Discard Phase 6-A and improve the perf analysis methodology to verify compiler output before claiming optimizations.

Impact: No performance loss from discarding (the optimization was never present), and we avoid misleading documentation in the codebase.