Files

Moe Charm (CI) c04cccf723 Phase 6-A: Clarify debug-only validation (code readability, no perf change)

Explicitly guard SuperSlab validation with #if !HAKMEM_BUILD_RELEASE
to document that this code is debug-only.

Changes:
- core/tiny_region_id.h: Add #if !HAKMEM_BUILD_RELEASE guard around
  hak_super_lookup() validation code (lines 199-239)
- Improves code readability: Makes debug-only intent explicit
- Self-documenting: No need to check Makefile to understand behavior
- Defensive: Works correctly even if LTO is disabled

Performance Impact:
- Measured: +1.67% (bench_random_mixed), +1.33% (bench_mid_mt_gap)
- Expected: +12-15% (based on initial perf interpretation)
- Actual: NO measurable improvement (within noise margin ±3.6%)

Root Cause (Investigation):
- Compiler (LTO) already eliminated hak_super_lookup() automatically
- The function never existed in compiled binary (verified via nm/objdump)
- Default Makefile has -DHAKMEM_BUILD_RELEASE=1 + -flto
- perf's "15.84% CPU" was misattributed (was free(), not hak_super_lookup)

Conclusion:
This change provides NO performance benefit, but IMPROVES code clarity
by making the debug-only nature explicit rather than relying on
implicit compiler optimization.

Files:
- core/tiny_region_id.h - Add explicit debug guard
- PHASE6A_DISCREPANCY_INVESTIGATION.md - Full investigation report

Lessons Learned:
1. Always verify assembly output before claiming optimizations
2. perf attribution can be misleading - cross-reference with symbols
3. LTO is extremely aggressive at dead code elimination
4. Small improvements (<2× stdev) need statistical validation

See PHASE6A_DISCREPANCY_INVESTIGATION.md for complete analysis.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-29 15:22:31 +09:00

19 KiB

Raw Blame History

Phase 6-A Discrepancy Investigation Report

Date: 2025-11-29 Investigator: Claude (Sonnet 4.5) Task: Investigate why Phase 6-A showed 8-10x smaller performance improvement than predicted

Executive Summary

Root Cause: Dead Code Elimination by LTO Compiler Optimization

Finding: The hak_super_lookup() call inside the #if !HAKMEM_BUILD_RELEASE guard was already completely eliminated by the compiler in RELEASE builds BEFORE Phase 6-A was implemented. The Makefile's default configuration includes both -DHAKMEM_BUILD_RELEASE=1 and -flto, which together caused the compiler to optimize away the entire debug validation block.

Evidence:

Assembly analysis shows ZERO calls to hak_super_lookup() in both BEFORE and AFTER binaries
Symbol table analysis confirms the function doesn't exist in either binary
The "15.84% CPU" claim was a misreading of perf data - that percentage referred to free(), not hak_super_lookup()
Both binaries are identical in size (1.6M), with only minor address offset differences

Recommendation: DISCARD Phase 6-A - The code change provides no performance benefit and was based on incorrect perf analysis. The baseline build already had the optimization in effect.

Investigation Steps

Step 1: Assembly Analysis

Before Phase 6-A (No guard)

Binary size: 1.6M (1,640,448 bytes)
Assembly lines: 54,519 lines
hak_super_lookup calls: 0
hak_super_lookup symbol: NOT FOUND

Command:

git stash  # Remove Phase 6-A changes
make clean && make EXTRA_CFLAGS="-g -O3 -fno-omit-frame-pointer" bench_random_mixed_hakmem
objdump -d bench_random_mixed_hakmem > /tmp/asm_before.txt
grep -c "hak_super_lookup" /tmp/asm_before.txt  # Output: 0
nm bench_random_mixed_hakmem | grep hak_super_lookup  # Output: (empty)

After Phase 6-A (With `#if !HAKMEM_BUILD_RELEASE` guard)

Binary size: 1.6M (1,640,448 bytes) - SAME SIZE
Assembly lines: 51,307 lines (3,212 lines fewer due to unrelated inlining changes)
hak_super_lookup calls: 0
hak_super_lookup symbol: NOT FOUND

Command:

git stash pop  # Restore Phase 6-A changes
make clean && make EXTRA_CFLAGS="-g -O3 -fno-omit-frame-pointer" bench_random_mixed_hakmem
objdump -d bench_random_mixed_hakmem > /tmp/asm_after.txt
grep -c "hak_super_lookup" /tmp/asm_after.txt  # Output: 0
nm bench_random_mixed_hakmem | grep hak_super_lookup  # Output: (empty)

Finding

The code change had ZERO effect on the compiled binary. The compiler already eliminated the entire debug validation block in RELEASE builds through dead code elimination, even without the explicit #if !HAKMEM_BUILD_RELEASE guard.

Why? The result of hak_super_lookup() is only used inside if (n < 8) debug logging. The compiler's LTO pass detected:

The lookup result is never used for program logic
The fprintf() calls are side-effect-only (no return value used)
In RELEASE mode with -DNDEBUG, these are low-priority debug paths
Entire block can be eliminated without changing observable behavior

Step 2: perf Re-verification

Original Claim

Claim: hak_super_lookup() costs 15.84% CPU
Source: Code comment in core/tiny_region_id.h:197

Investigation of Original perf Data

File checked: /mnt/workdisk/public_share/hakmem/perf_phase2_symbols.txt
Finding: The 15.84% entry in that file is for free(), NOT hak_super_lookup()

Excerpt from perf_phase2_symbols.txt:

    15.84%  [.] free                          bench_random_mixed_hakmem  -      -
            |
            |--8.15%--main

Search for hak_super_lookup in perf files: NOT FOUND

Conclusion: The 15.84% claim was a misreading of perf data. There is no evidence that hak_super_lookup() ever appeared as a hot function in release builds.

Re-measured perf (BEFORE binary)

perf record -g /tmp/bench_before_phase6a 10000000 256 42
perf report --stdio --sort=symbol --percent-limit=1

Results:

Function	Self %	Children %	Notes
`main`	26.51%	87.54%	Top-level benchmark loop
`malloc`	23.01%	51.65%	Allocation wrapper
`free`	21.48%	44.79%	Free wrapper
`tiny_region_id_write_header.lto_priv.0`	22.06%	30.16%	Header write (LTO-optimized)
`superslab_refill`	0.00%	3.49%	Slab allocation

Key Finding:

hak_super_lookup does NOT appear in the perf report
tiny_region_id_write_header shows 22.06% self cost, but this is the entire function (including header write, guards, logging)
No evidence of SuperSlab lookup overhead

Step 3: Line-by-Line Cost Analysis

Not applicable - Since hak_super_lookup() doesn't exist in the binary, there are no assembly instructions to annotate.

What happened to the code?

The original source code in core/tiny_region_id.h:199-239 (BEFORE Phase 6-A):

// Debug: detect header writes with class_idx that disagrees with slab metadata.
do {
    static _Atomic uint32_t g_hdr_meta_mis = 0;
    struct SuperSlab* ss = hak_super_lookup(base);  // ← This call
    if (ss && ss->magic == SUPERSLAB_MAGIC) {
        // ... validation and logging ...
    }
} while (0);

After LTO optimization (with -DHAKMEM_BUILD_RELEASE=1):

Compiler sees that:
1. ss is only used for debug logging (fprintf)
2. The logging is gated by if (n < 8) (low-frequency)
3. The atomic counter g_hdr_meta_mis is debug-only
Result: Entire do-while block eliminated
Final assembly: No call to hak_super_lookup()

Step 4: LTO Status and Impact

LTO Configuration

CFLAGS += -flto
CFLAGS_SHARED += -flto
LDFLAGS += -flto

Enabled: YES - Link-Time Optimization is active in all builds

Impact Analysis

LTO enables aggressive optimizations across translation units:

Dead Code Elimination (DCE):
- Identifies code with no observable side effects
- Removes unused function calls, even across files
- Result: hak_super_lookup() eliminated because its result is unused
Function Inlining:
- tiny_region_id_write_header is marked static inline
- LTO can inline across files, creating .lto_priv.0 versions
- Enables further optimization within inlined context
Constant Propagation:
- With -DHAKMEM_BUILD_RELEASE=1, the preprocessor removes the guard
- But even WITHOUT the guard, LTO eliminates the code anyway

Why Phase 6-A had minimal impact:

The explicit #if !HAKMEM_BUILD_RELEASE guard is redundant
LTO already achieved the same result through DCE
Adding the guard only makes the optimization explicit (no performance change)

Step 5: Binary Size Comparison

Metric	Before Phase 6-A	After Phase 6-A	Change
Binary size	1,640,448 bytes (1.6M)	1,640,448 bytes (1.6M)	0 bytes
Assembly lines	54,519	51,307	-3,212 lines
`hak_super_lookup` calls	0	0	0
`hak_super_lookup` symbol	NOT FOUND	NOT FOUND	-

Finding: Binary size is IDENTICAL. The assembly line count difference is due to LTO's non-deterministic inlining decisions (different runs produce slightly different inlining), not from removing hak_super_lookup().

Proof: Both builds were done with the same flags. The only code change was adding the #if !HAKMEM_BUILD_RELEASE guard. Since the binary size didn't change, the guard had no effect.

Root Cause Analysis

Primary Cause: Compiler Already Optimized (Dead Code Elimination)

Hypothesis: The compiler's LTO pass already eliminated hak_super_lookup() through dead code elimination, even before Phase 6-A added the explicit guard.

Evidence:

Symbol table: hak_super_lookup doesn't exist in BEFORE binary

nm bench_random_mixed_hakmem | grep hak_super_lookup
# Output: (empty)

Assembly code: ZERO calls to hak_super_lookup in BEFORE binary

grep "call.*hak_super_lookup" /tmp/asm_before.txt
# Output: (empty)

Binary size: IDENTICAL before/after (1.6M), proving no code was removed
LTO flags: Makefile has -flto enabled, allowing aggressive DCE

Explanation:

The compiler's optimization pipeline works as follows:

Source → AST (Abstract Syntax Tree)
- Code includes the do-while block with hak_super_lookup(base)
AST → IR (Intermediate Representation)
- LLVM/GCC generates IR with all function calls intact
LTO Pass 1: Inlining
- tiny_region_id_write_header() is inlined into callers
- hak_super_lookup() call is now visible in inlined context
LTO Pass 2: Dead Code Elimination
- Analyzes data flow: ss is only used for fprintf(stderr, ...)
- fprintf is a side effect (I/O), but it's:
  - Gated by if (n < 8) (unlikely path)
  - Writing to stderr (debug output, no program logic)
  - Inside a do-while that doesn't affect return value
- Decision: Entire block is dead code → ELIMINATE
Code Generation
- No assembly instructions for hak_super_lookup() call
- No symbol for hak_super_lookup() in binary

Why the benchmark showed +1.67% improvement anyway?

The small improvement is measurement noise:

Variance in benchmark: ±1.86 M ops/s (3.6% stdev)
Measured improvement: +0.89 M ops/s (1.67%)
Conclusion: Within noise margin, NOT statistically significant

Secondary Cause: Misreading of perf Data

Hypothesis: The original "15.84% CPU" claim was based on a misreading of perf profiling output.

Evidence:

perf_phase2_symbols.txt shows:
```
15.84%  [.] free
```
This is the free() function, NOT hak_super_lookup()

Search for hak_super_lookup in all perf files:

grep -r "hak_super_lookup" /mnt/workdisk/public_share/hakmem/perf_*.txt
# Output: (empty - no matches)

Re-measured perf (10M operations):
- tiny_region_id_write_header: 22.06% self cost
- hak_super_lookup: NOT FOUND

Explanation:

The code comment claimed:

// Phase 6-A: Debug validation (disabled in release builds for performance)
// perf profiling showed hak_super_lookup() costs 15.84% CPU on hot path

This claim is FALSE. The 15.84% was from a different function (free()). Likely sequence of events:

Developer ran perf on a benchmark
Saw tiny_region_id_write_header consuming ~22% CPU
Incorrectly assumed the cost was from hak_super_lookup() (which is called inside)
Mistakenly attributed the 15.84% free() cost to hak_super_lookup()
Added the guard based on faulty analysis

Reality: hak_super_lookup() never appeared in perf output because it was already eliminated by the compiler.

Alternative Explanations (Ruled Out)

1. Perf Sampling Bias

Hypothesis: Maybe the original perf was run on a DEBUG build?

Ruled out: The benchmark results document states "Makefile sets -DHAKMEM_BUILD_RELEASE=1 by default", and the Makefile confirms this. All benchmarks were RELEASE builds.

2. Lookup Already Cache-Friendly

Hypothesis: Maybe hak_super_lookup() is so fast it doesn't show in perf?

Ruled out: The function doesn't exist in the binary at all. It's not that it's fast - it's that it was eliminated entirely.

3. Wrong Hot Path

Hypothesis: Maybe the call is on a different path that benchmarks don't exercise?

Ruled out: Symbol table analysis shows the function doesn't exist in the binary. It was eliminated from ALL paths, not just the hot path.

4. Measurement Noise

Hypothesis: The +1.67% improvement is real but smaller than expected?

Partially valid: The benchmark does show slight improvement, but it's within the noise margin (stdev = 1.86 M ops/s). The improvement is likely due to:

Different LTO inlining decisions (non-deterministic)
Cache alignment changes from binary layout differences
NOT from removing hak_super_lookup() (it was already gone)

Recommendations

Option A: Commit Phase 6-A Anyway

Reason: Code clarity - makes the debug-only intent explicit

Pros:

Documents that the validation is debug-only
Future-proof: if LTO is disabled, the guard still works
No harm: performance is identical

Cons:

Code churn for zero benefit
Misleading comment claims "Expected gain: +12-15% throughput" (false)
Sets bad precedent: committing "optimizations" without verifying compiler output

Verdict: ❌ NOT RECOMMENDED

Option B: Discard Phase 6-A

Reason: No measurable benefit, based on incorrect analysis

Pros:

Avoids code churn
Avoids misleading performance claims in code comments
Acknowledges that the compiler already did the optimization

Cons:

Loses explicit documentation of debug-only intent
If LTO is disabled in future, the code would run in release builds

Verdict: ✅ RECOMMENDED

Action:

git stash drop  # Discard Phase 6-A changes

Option C: Commit with Corrected Documentation

Reason: Keep the guard for clarity, but fix the misleading comments

Pros:

Explicit guard prevents future confusion
Corrected comments document the actual situation
No performance regression risk

Cons:

Still code churn for minimal value
Guard is redundant with LTO enabled

Action (if chosen):

# Edit core/tiny_region_id.h to correct the comments:
# BEFORE:
# // Phase 6-A: Debug validation (disabled in release builds for performance)
# // perf profiling showed hak_super_lookup() costs 15.84% CPU on hot path
# // Expected gain: +12-15% throughput by removing this in release builds

# AFTER:
# // Phase 6-A: Debug-only validation (explicit guard for code clarity)
# // Note: LTO already eliminates this code in release builds via DCE
# // This guard makes the debug-only intent explicit and future-proof

Verdict: ⚠️ ACCEPTABLE COMPROMISE

Recommended Action: Option B - Discard Phase 6-A

Rationale:

No performance benefit: The compiler already optimized the code
False premise: The 15.84% claim was incorrect
Misleading documentation: The comments claim benefits that don't exist
Code quality: We should verify compiler output before claiming optimizations

Next Steps:

Discard Phase 6-A:
```
git stash drop
```
Document the findings: Update perf methodology to:
- Always verify symbol table (nm) after claiming function costs
- Check assembly output (objdump -d) for claimed hot paths
- Distinguish between source code and compiled code
Improve perf analysis process:
- Build BOTH debug and release binaries
- Profile BOTH to see which code paths exist
- Use perf annotate to see actual assembly being executed
- Cross-reference perf output with symbol table
Add to development guidelines:
"Before claiming a function costs X% CPU:
1. Verify the function exists in the binary (nm)
2. Check if calls are present (objdump -d | grep call)
3. Run perf on the EXACT binary being benchmarked
4. Use perf annotate to confirm attribution"

Lessons Learned

1. Trust but Verify Compiler Optimizations

What we learned: Modern compilers with LTO are extremely aggressive at dead code elimination. Code that "looks" expensive in source may not exist in the binary at all.

Action: Always verify assembly output before claiming performance improvements from code removal.

2. perf Data Can Be Misleading

What we learned: A percentage in perf output can refer to different things (function self-cost, children cost, total cost). Always verify the exact attribution.

Action: Use perf annotate to see assembly-level attribution, not just function-level summaries.

3. RELEASE vs DEBUG Builds Are Different

What we learned: -DHAKMEM_BUILD_RELEASE=1 + -flto enables optimizations that can completely eliminate code blocks, even without explicit #if guards.

Action: When profiling for optimization opportunities, profile DEBUG builds to see what code exists, then RELEASE builds to see what actually runs.

4. Small Performance Improvements Can Be Noise

What we learned: A +1.67% improvement with ±3.6% variance is NOT statistically significant.

Action: Require at least 2× stdev improvement (>7% in this case) before claiming success.

5. Document Optimization Assumptions

What we learned: The Phase 6-A code comment claimed "Expected gain: +12-15% throughput" without verifying the baseline.

Action: Document:

What was measured (perf output, benchmark results)
What assumptions were made (function X costs Y%)
How the improvement was calculated (removed Y% → expect +Y% throughput)
Verify each assumption before committing

Appendix: Full Investigation Commands

Assembly Analysis

# Build BEFORE Phase 6-A
git stash
make clean
make EXTRA_CFLAGS="-g -O3 -fno-omit-frame-pointer" bench_random_mixed_hakmem
cp bench_random_mixed_hakmem /tmp/bench_before_phase6a
objdump -d /tmp/bench_before_phase6a > /tmp/asm_before.txt
nm /tmp/bench_before_phase6a | grep hak_super_lookup  # Output: (empty)
grep -c "hak_super_lookup" /tmp/asm_before.txt  # Output: 0

# Build AFTER Phase 6-A
git stash pop
make clean
make EXTRA_CFLAGS="-g -O3 -fno-omit-frame-pointer" bench_random_mixed_hakmem
cp bench_random_mixed_hakmem /tmp/bench_after_phase6a
objdump -d /tmp/bench_after_phase6a > /tmp/asm_after.txt
nm /tmp/bench_after_phase6a | grep hak_super_lookup  # Output: (empty)
grep -c "hak_super_lookup" /tmp/asm_after.txt  # Output: 0

# Compare binary sizes
ls -lh /tmp/bench_before_phase6a /tmp/bench_after_phase6a
# Both: 1.6M (identical)

perf Analysis

# Profile BEFORE binary
perf record -o /tmp/perf_before.data -g /tmp/bench_before_phase6a 10000000 256 42
perf report -i /tmp/perf_before.data --stdio --sort=symbol --percent-limit=1

# Search for hak_super_lookup
perf report -i /tmp/perf_before.data --stdio --sort=symbol 2>/dev/null | grep -i super
# Output: Only superslab_refill (3.49%), no hak_super_lookup

# Check original perf data
grep -r "15.84" /mnt/workdisk/public_share/hakmem/perf_*.txt
# Output: perf_phase2_symbols.txt shows 15.84% for free(), NOT hak_super_lookup()

LTO Verification

# Check Makefile for LTO flags
grep "flto" /mnt/workdisk/public_share/hakmem/Makefile
# Output: CFLAGS += -flto, LDFLAGS += -flto

# Check RELEASE flag
grep "HAKMEM_BUILD_RELEASE" /mnt/workdisk/public_share/hakmem/Makefile
# Output: CFLAGS += -DNDEBUG -DHAKMEM_BUILD_RELEASE=1

Conclusion

Phase 6-A was based on two faulty assumptions:

Assumption 1: hak_super_lookup() costs 15.84% CPU
- Reality: The function was already eliminated by LTO; the 15.84% was free()
Assumption 2: Adding #if !HAKMEM_BUILD_RELEASE would remove the code
- Reality: The code was already gone; the guard is redundant

Result: +1.67% improvement is measurement noise, not from removing the lookup.

Recommendation: Discard Phase 6-A and improve the perf analysis methodology to verify compiler output before claiming optimizations.

Impact: No performance loss from discarding (the optimization was never present), and we avoid misleading documentation in the codebase.

19 KiB Raw Blame History Unescape Escape

Phase 6-A Discrepancy Investigation Report

Executive Summary

Investigation Steps

Step 1: Assembly Analysis

Before Phase 6-A (No guard)

After Phase 6-A (With #if !HAKMEM_BUILD_RELEASE guard)

Finding

Step 2: perf Re-verification

Original Claim

Investigation of Original perf Data

Re-measured perf (BEFORE binary)

Step 3: Line-by-Line Cost Analysis

Step 4: LTO Status and Impact

LTO Configuration

Impact Analysis

Step 5: Binary Size Comparison

Root Cause Analysis

Primary Cause: Compiler Already Optimized (Dead Code Elimination)

Secondary Cause: Misreading of perf Data

Alternative Explanations (Ruled Out)

1. Perf Sampling Bias

2. Lookup Already Cache-Friendly

3. Wrong Hot Path

4. Measurement Noise

Recommendations

Option A: Commit Phase 6-A Anyway

Option B: Discard Phase 6-A

Option C: Commit with Corrected Documentation

Recommended Action: Option B - Discard Phase 6-A

Lessons Learned

1. Trust but Verify Compiler Optimizations

2. perf Data Can Be Misleading

3. RELEASE vs DEBUG Builds Are Different

4. Small Performance Improvements Can Be Noise

5. Document Optimization Assumptions

Appendix: Full Investigation Commands

Assembly Analysis

perf Analysis

LTO Verification

Conclusion

19 KiB

Raw Blame History

After Phase 6-A (With `#if !HAKMEM_BUILD_RELEASE` guard)