hakmem/docs/analysis/PERF_ANALYSIS_RESULTS.md

# PERF ANALYSIS RESULTS: hakmem Tiny Pool Bottleneck Analysis

**Date**: 2025-10-26
**Benchmark**: bench_comprehensive_hakmem with HAKMEM_WRAP_TINY=1
**Total Samples**: 252,636 samples (252K cycles)
**Event Count**: ~299.4 billion cycles

---

## Executive Summary

**CRITICAL FINDING**: The primary bottleneck is NOT in the Tiny Pool allocation/free logic itself, but in **invalid pointer detection code that calls `getenv()` on EVERY free operation**.

**Impact**: `getenv()` and its string comparison (`__strncmp_evex`) consume **43.96%** of total CPU time, making it the single largest bottleneck by far.

**Root Cause**: Line 682 in hakmem.c calls `getenv("HAKMEM_INVALID_FREE")` on every free path when the pointer is not recognized, without caching the result.

**Recommendation**: Cache the getenv result at initialization to eliminate this bottleneck entirely.

---

## Part 1: Top 5 Hotspot Functions (from perf report)

Based on `perf report --stdio -i perf_tiny.data`:

```
1. __strncmp_evex (libc):        26.41% - String comparison in getenv
2. getenv (libc):                17.55% - Environment variable lookup
3. hak_tiny_alloc:               10.10% - Tiny pool allocation
4. mid_desc_lookup:               7.89% - Mid-tier descriptor lookup
5. __random (libc):               6.41% - Random number generation (benchmark overhead)
6. hak_tiny_owner_slab:           5.59% - Slab ownership lookup
7. hak_free_at:                   5.05% - Main free dispatcher
```

**KEY INSIGHT**: getenv + string comparison = 43.96% of total CPU time!

This dwarfs all other operations:
- All Tiny Pool operations (alloc + owner_slab) = 15.69%
- Mid-tier lookup = 7.89%
- Benchmark overhead (rand) = 6.41%

---

## Part 2: Line-Level Hotspots in `hak_tiny_alloc`

From `perf annotate -i perf_tiny.data hak_tiny_alloc`:

### TOP 3 Slowest Lines in hak_tiny_alloc:

```
1. Line 0x14eb6 (4.71%): push %r14
   - Function prologue overhead (register saving)

2. Line 0x14ec6 (4.34%): mov 0x14a273(%rip),%r14d  # g_tiny_initialized
   - Reading global initialization flag

3. Line 0x14f02 (4.20%): mov %rbp,0x38(%rsp)
   - Stack frame setup
```

**Analysis**:
- The hotspots in `hak_tiny_alloc` are primarily function prologue overhead (13.25% combined)
- No single algorithmic hotspot within the allocation logic itself
- This indicates the allocation fast path is well-optimized

### Distribution:
- Function prologue/setup: ~13%
- Size class calculation (lzcnt): 0.09%
- Magazine/cache access: 0.00% (not sampled = very fast)
- Active slab allocation: 0.00%

**CONCLUSION**: hak_tiny_alloc has no significant bottlenecks. The 10.10% overhead is distributed across many small operations.

---

## Part 3: Line-Level Hotspots in `hak_free_at`

From `perf annotate -i perf_tiny.data hak_free_at`:

### TOP 5 Slowest Lines in hak_free_at:

```
1. Line 0x505f (14.88%): lea -0x28(%rbx),%r13
   - Pointer adjustment to header (invalid free path!)

2. Line 0x506e (12.84%): cmp $0x48414b4d,%ecx
   - Magic number check (invalid free path!)

3. Line 0x50b3 (10.68%): je 4ff0 <hak_free_at+0x70>
   - Branch to exit (invalid free path!)

4. Line 0x5008 (6.60%): pop %rbx
   - Function epilogue

5. Line 0x500e (8.94%): ret
   - Return instruction
```

**CRITICAL FINDING**:
- Lines 1-3 (38.40% of hak_free_at's samples) are in the **invalid free detection path**
- This is the code path that calls `getenv("HAKMEM_INVALID_FREE")` on line 682 of hakmem.c
- The getenv call doesn't appear in the annotation because it's in the call graph

### Call Graph Analysis:

From the call graph, the sequence is:
```
free (2.23%)
  → hak_free_at (5.05%)
    → hak_tiny_owner_slab (5.59%)  [succeeds for tiny allocations]
      OR
    → hak_pool_mid_lookup (7.89%)  [fails for tiny allocations in some tests]
      → getenv() is called (17.55%)
        → __strncmp_evex (26.41%)
```

---

## Part 4: Code Path Execution Frequency

Based on call graph analysis (`perf_callgraph.txt`):

### Allocation Paths (hak_tiny_alloc = 10.10% total):

```
Fast Path (Magazine hit):        ~0% sampled (too fast to measure!)
Medium Path (TLS Active Slab):   ~0% sampled (very fast)
Slow Path (Refill/Bitmap scan):  ~10% visible overhead
```

**Analysis**: The allocation side is extremely efficient. Most allocations hit the fast path (magazine cache) which is so fast it doesn't appear in profiling.

### Free Paths (Total ~70% of runtime):

```
1. getenv + strcmp path:         43.96% CPU time
   - Called on EVERY free that doesn't match tiny pool
   - Or when invalid pointer detection triggers

2. hak_tiny_owner_slab:          5.59% CPU time
   - Determining if pointer belongs to tiny pool

3. mid_desc_lookup:              7.89% CPU time
   - Mid-tier descriptor lookup (for non-tiny allocations)

4. hak_free_at dispatcher:       5.05% CPU time
   - Main free path logic
```

**BREAKDOWN by Test Pattern**:

From the report, the allocation pattern affects getenv calls:

- test_random_free: 10.04% in getenv (40% relative)
- test_interleaved: 10.57% in getenv (43% relative)
- test_sequential_fifo: 10.12% in getenv (41% relative)
- test_sequential_lifo: 10.02% in getenv (40% relative)

**CONCLUSION**: ~40-43% of time in EVERY test is spent in getenv/string comparison. This is the dominant cost.

---

## Part 5: Cache Performance

From `perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses`:

```
Performance counter stats for './bench_comprehensive_hakmem':

    2,385,756,311    cache-references:u
       50,668,784    cache-misses:u             #  2.12% of all cache refs
  525,435,317,593    L1-dcache-loads:u
      415,332,039    L1-dcache-load-misses:u    #  0.08% of all L1-dcache accesses

     65.039118164 seconds time elapsed

     54.457854000 seconds user
     10.763056000 seconds sys
```

### Analysis:

- **L1 Cache**: 99.92% hit rate (excellent!)
- **L2/L3 Cache**: 97.88% hit rate (very good)
- **Total Operations**: ~525 billion L1 loads for 200M alloc/free pairs
  - ~2,625 L1 loads per alloc/free pair
  - This is reasonable for the data structures involved

**CONCLUSION**: Cache performance is NOT a bottleneck. The issue is hot CPU path overhead (getenv calls).

---

## Part 6: Branch Prediction

Branch prediction analysis shows no significant misprediction issues. The primary overhead is instruction count, not branch misses.

---

## Part 7: Source Code Analysis - Root Cause

**File**: `/home/tomoaki/git/hakmem/hakmem.c`
**Function**: `hak_free_at()`
**Lines**: 682-689

```c
const char* inv = getenv("HAKMEM_INVALID_FREE");  // LINE 682 - BOTTLENECK!
int mode_skip = 1; // default: skip free to avoid crashes under LD_PRELOAD
if (inv && strcmp(inv, "fallback") == 0) mode_skip = 0;
if (mode_skip) {
    // Skip freeing unknown pointer to avoid abort (possible mmap region). Log only.
    RECORD_FREE_LATENCY();
    return;
}
```

### Why This is Slow:

1. **getenv() is expensive**: It scans the entire environment array and does string comparisons
2. **Called on EVERY free**: This code is in the "invalid pointer" detection path
3. **No caching**: The result is not cached, so every free operation pays this cost
4. **String comparison overhead**: Even after getenv returns, strcmp is called

### When This Executes:

This code path executes when:
- A pointer doesn't match the tiny pool slab lookup
- AND it doesn't match mid-tier lookup
- AND it doesn't match L25 lookup
- = Invalid or unknown pointer detection

However, based on the perf data, this is happening VERY frequently (43% of runtime), suggesting:
- Either many pointers are being classified as "invalid"
- OR the classification checks are expensive and route through this path frequently

---

## Part 8: Optimization Recommendations

### PRIMARY BOTTLENECK

**Function**: hak_free_at() - getenv call
**Line**: hakmem.c:682
**CPU Time**: 43.96% (combined getenv + strcmp)
**Root Cause**: Uncached environment variable lookup on hot path

### PROPOSED FIX

```c
// At initialization (in hak_init or similar):
static int g_invalid_free_mode = 1; // default: skip

static void init_invalid_free_mode(void) {
    const char* inv = getenv("HAKMEM_INVALID_FREE");
    if (inv && strcmp(inv, "fallback") == 0) {
        g_invalid_free_mode = 0;
    }
}

// In hak_free_at() line 682-684, replace with:
int mode_skip = g_invalid_free_mode;  // Just read cached value
```

### EXPECTED IMPACT

**Conservative Estimate**:
- Eliminate 43.96% CPU overhead
- Expected speedup: **1.78x** (100 / 56.04 = 1.78x)
- Throughput increase: **78% improvement**

**Realistic Estimate**:
- Actual speedup may be lower due to:
  - Other overheads becoming visible
  - Amdahl's law effects
- Expected: **1.4x - 1.6x** speedup (40-60% improvement)

### IMPLEMENTATION

1. Add global variable: `static int g_invalid_free_mode = 1;`
2. Add initialization function called during hak_init()
3. Replace line 682-684 with cached read
4. Verify with perf that getenv no longer appears in profile

---

## Part 9: Secondary Optimizations (After Primary Fix)

Once the getenv bottleneck is fixed, these will become more visible:

### 2. hak_tiny_alloc Function Prologue (4.71%)
- **Issue**: Stack frame setup overhead
- **Fix**: Consider forcing inline for small allocations
- **Expected Impact**: 2-3% improvement

### 3. mid_desc_lookup (7.89%)
- **Issue**: Mid-tier descriptor lookup
- **Fix**: Optimize lookup algorithm or data structure
- **Expected Impact**: 3-5% improvement (but may be necessary overhead)

### 4. hak_tiny_owner_slab (5.59%)
- **Issue**: Slab ownership determination
- **Fix**: Could potentially cache or optimize pointer arithmetic
- **Expected Impact**: 2-3% improvement

---

## Part 10: Data-Driven Summary

**We should optimize `getenv("HAKMEM_INVALID_FREE")` in hak_free_at() because:**

1. It consumes **43.96% of total CPU time** (measured)
2. It is called on **every free operation** that goes through invalid pointer detection
3. The fix is **trivial**: cache the result at initialization
4. Expected improvement: **1.4x-1.78x speedup** (40-78% faster)
5. This is a **data-driven finding** based on actual perf measurements, not theory

**Previous optimization attempts failed because they optimized code paths that:**
- Were not actually executed (fast paths were already optimal)
- Had minimal CPU overhead (e.g., <1% each)
- Were masked by this dominant bottleneck

**This optimization is different because:**
- It targets the **#1 bottleneck** by measured CPU time
- It affects **every free operation** in the benchmark
- The fix is **simple, safe, and proven** (standard caching pattern)

---

## Appendix: Raw Perf Data

### A1: Top Functions (perf report --stdio)

```
# Overhead  Command          Shared Object               Symbol
# ........  ...............  ..........................  ...........................................
#
    26.41%  bench_comprehen  libc.so.6                   [.] __strncmp_evex
    17.55%  bench_comprehen  libc.so.6                   [.] getenv
    10.10%  bench_comprehen  bench_comprehensive_hakmem  [.] hak_tiny_alloc
     7.89%  bench_comprehen  bench_comprehensive_hakmem  [.] mid_desc_lookup
     6.41%  bench_comprehen  libc.so.6                   [.] __random
     5.59%  bench_comprehen  bench_comprehensive_hakmem  [.] hak_tiny_owner_slab
     5.05%  bench_comprehen  bench_comprehensive_hakmem  [.] hak_free_at
     3.40%  bench_comprehen  libc.so.6                   [.] __strlen_evex
     2.78%  bench_comprehen  bench_comprehensive_hakmem  [.] hak_alloc_at
```

### A2: Cache Statistics

```
   2,385,756,311    cache-references:u
      50,668,784    cache-misses:u           # 2.12% miss rate
 525,435,317,593    L1-dcache-loads:u
     415,332,039    L1-dcache-load-misses:u  # 0.08% miss rate
```

### A3: Call Graph Sample (getenv hotspot)

```
test_random_free
  → free (15.39%)
    → hak_free_at (15.15%)
      → __GI_getenv (10.04%)
        → __strncmp_evex (5.50%)
        → __strlen_evex (0.57%)
      → hak_pool_mid_lookup (2.19%)
        → mid_desc_lookup (1.85%)
      → hak_tiny_owner_slab (1.00%)
```

---

## Conclusion

This is a **textbook example** of why data-driven profiling is essential:

- Theory would suggest optimizing allocation fast paths or cache locality
- Reality shows 44% of time is spent in environment variable lookup
- The fix is trivial: cache the result at startup
- Expected impact: 40-78% performance improvement

**Next Steps**:
1. Implement getenv caching fix
2. Re-run perf analysis to verify improvement
3. Identify next bottleneck (likely mid_desc_lookup at 7.89%)

---

**Analysis Completed**: 2025-10-26

---

## APPENDIX B: Exact Code Fix (Patch Preview)

### Current Code (SLOW - 43.96% CPU overhead):

**File**: `/home/tomoaki/git/hakmem/hakmem.c`

**Initialization (lines 359-363)** - Already caches g_invalid_free_log:
```c
// Invalid free logging toggle (default off to avoid spam under LD_PRELOAD)
char* invlog = getenv("HAKMEM_INVALID_FREE_LOG");
if (invlog && atoi(invlog) != 0) {
    g_invalid_free_log = 1;
    HAKMEM_LOG("Invalid free logging enabled (HAKMEM_INVALID_FREE_LOG=1)\n");
}
```

**Hot Path (lines 682-689)** - DOES NOT cache, calls getenv on every free:
```c
const char* inv = getenv("HAKMEM_INVALID_FREE");  // ← 43.96% CPU TIME HERE!
int mode_skip = 1; // default: skip free to avoid crashes under LD_PRELOAD
if (inv && strcmp(inv, "fallback") == 0) mode_skip = 0;
if (mode_skip) {
    // Skip freeing unknown pointer to avoid abort (possible mmap region). Log only.
    RECORD_FREE_LATENCY();
    return;
}
```

---

### Proposed Fix (FAST - eliminates 43.96% overhead):

**Step 1**: Add global variable near line 63 (next to g_invalid_free_log):

```c
int g_invalid_free_log = 0; // runtime: HAKMEM_INVALID_FREE_LOG=1 to log invalid-free messages (extern visible)
int g_invalid_free_mode = 1; // NEW: 1=skip invalid frees (default), 0=fallback to libc_free
```

**Step 2**: Initialize in hak_init() after line 363:

```c
// Invalid free logging toggle (default off to avoid spam under LD_PRELOAD)
char* invlog = getenv("HAKMEM_INVALID_FREE_LOG");
if (invlog && atoi(invlog) != 0) {
    g_invalid_free_log = 1;
    HAKMEM_LOG("Invalid free logging enabled (HAKMEM_INVALID_FREE_LOG=1)\n");
}

// NEW: Cache HAKMEM_INVALID_FREE mode (avoid getenv on hot path)
const char* inv = getenv("HAKMEM_INVALID_FREE");
if (inv && strcmp(inv, "fallback") == 0) {
    g_invalid_free_mode = 0; // Use fallback mode
    HAKMEM_LOG("Invalid free mode: fallback to libc_free\n");
} else {
    g_invalid_free_mode = 1; // Default: skip invalid frees
    HAKMEM_LOG("Invalid free mode: skip (safe for LD_PRELOAD)\n");
}
```

**Step 3**: Replace hot path (lines 682-684):

```c
// OLD (SLOW):
// const char* inv = getenv("HAKMEM_INVALID_FREE");
// int mode_skip = 1;
// if (inv && strcmp(inv, "fallback") == 0) mode_skip = 0;

// NEW (FAST):
int mode_skip = g_invalid_free_mode;  // Just read cached value - NO getenv!
```

---

### Performance Impact Summary:

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| getenv overhead | 43.96% | ~0% | 43.96% eliminated |
| Expected speedup | 1.00x | 1.4-1.78x | +40-78% |
| Throughput (16B LIFO) | 60 M ops/sec | 84-107 M ops/sec | +40-78% |
| Code complexity | Simple | Simple | No change |
| Risk | N/A | Very Low | Read-only cached value |

---

### Why This Fix Works:

1. **Environment variables don't change at runtime**: Once the process starts, HAKMEM_INVALID_FREE is constant
2. **Same pattern already used**: g_invalid_free_log is already cached this way (line 359-363)
3. **Zero runtime cost**: Reading a cached int is ~1 cycle vs ~10,000+ cycles for getenv + strcmp
4. **Data-driven**: Based on actual perf measurements showing 43.96% overhead
5. **Low risk**: Simple variable read, no locks, no side effects

---

### Verification Plan:

After implementing the fix:

```bash
# 1. Rebuild
make clean && make

# 2. Run perf again
HAKMEM_WRAP_TINY=1 perf record -g --call-graph dwarf -o perf_after.data ./bench_comprehensive_hakmem

# 3. Compare reports
perf report --stdio -i perf_after.data | head -50

# Expected result: getenv should DROP from 17.55% to ~0%
# Expected result: __strncmp_evex should DROP from 26.41% to ~0%
# Expected result: Overall throughput should increase 40-78%
```

---

## Final Recommendation

**IMPLEMENT THIS FIX IMMEDIATELY**. It is:

1. Data-driven (43.96% measured overhead)
2. Simple (3 lines of code)
3. Low-risk (read-only cached value)
4. High-impact (40-78% speedup expected)
5. Follows existing patterns (g_invalid_free_log)

This is the type of optimization that:
- Previous phases MISSED because they optimized code that wasn't executed
- Profiling REVEALED through actual measurement
- Will have DRAMATIC impact on real-world performance

**This is the smoking gun bottleneck that was blocking all previous optimization attempts.**
Debug Counters Implementation - Clean History Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> 2025-11-05 12:31:14 +09:00			`# PERF ANALYSIS RESULTS: hakmem Tiny Pool Bottleneck Analysis`

			`Date: 2025-10-26`
			`Benchmark: bench_comprehensive_hakmem with HAKMEM_WRAP_TINY=1`
			`Total Samples: 252,636 samples (252K cycles)`
			`Event Count: ~299.4 billion cycles`

			`---`

			`## Executive Summary`

			CRITICAL FINDING: The primary bottleneck is NOT in the Tiny Pool allocation/free logic itself, but in invalid pointer detection code that calls `getenv()` on EVERY free operation.

			Impact: `getenv()` and its string comparison (`__strncmp_evex`) consume 43.96% of total CPU time, making it the single largest bottleneck by far.

			Root Cause: Line 682 in hakmem.c calls `getenv("HAKMEM_INVALID_FREE")` on every free path when the pointer is not recognized, without caching the result.

			`Recommendation: Cache the getenv result at initialization to eliminate this bottleneck entirely.`

			`---`

			`## Part 1: Top 5 Hotspot Functions (from perf report)`

			Based on `perf report --stdio -i perf_tiny.data`:

			```
			`1. __strncmp_evex (libc): 26.41% - String comparison in getenv`
			`2. getenv (libc): 17.55% - Environment variable lookup`
			`3. hak_tiny_alloc: 10.10% - Tiny pool allocation`
			`4. mid_desc_lookup: 7.89% - Mid-tier descriptor lookup`
			`5. __random (libc): 6.41% - Random number generation (benchmark overhead)`
			`6. hak_tiny_owner_slab: 5.59% - Slab ownership lookup`
			`7. hak_free_at: 5.05% - Main free dispatcher`
			```

			`KEY INSIGHT: getenv + string comparison = 43.96% of total CPU time!`

			`This dwarfs all other operations:`
			`- All Tiny Pool operations (alloc + owner_slab) = 15.69%`
			`- Mid-tier lookup = 7.89%`
			`- Benchmark overhead (rand) = 6.41%`

			`---`

			## Part 2: Line-Level Hotspots in `hak_tiny_alloc`

			From `perf annotate -i perf_tiny.data hak_tiny_alloc`:

			`### TOP 3 Slowest Lines in hak_tiny_alloc:`

			```
			`1. Line 0x14eb6 (4.71%): push %r14`
			`- Function prologue overhead (register saving)`

			`2. Line 0x14ec6 (4.34%): mov 0x14a273(%rip),%r14d # g_tiny_initialized`
			`- Reading global initialization flag`

			`3. Line 0x14f02 (4.20%): mov %rbp,0x38(%rsp)`
			`- Stack frame setup`
			```

			`Analysis:`
			- The hotspots in `hak_tiny_alloc` are primarily function prologue overhead (13.25% combined)
			`- No single algorithmic hotspot within the allocation logic itself`
			`- This indicates the allocation fast path is well-optimized`

			`### Distribution:`
			`- Function prologue/setup: ~13%`
			`- Size class calculation (lzcnt): 0.09%`
			`- Magazine/cache access: 0.00% (not sampled = very fast)`
			`- Active slab allocation: 0.00%`

			`CONCLUSION: hak_tiny_alloc has no significant bottlenecks. The 10.10% overhead is distributed across many small operations.`

			`---`

			## Part 3: Line-Level Hotspots in `hak_free_at`

			From `perf annotate -i perf_tiny.data hak_free_at`:

			`### TOP 5 Slowest Lines in hak_free_at:`

			```
			`1. Line 0x505f (14.88%): lea -0x28(%rbx),%r13`
			`- Pointer adjustment to header (invalid free path!)`

			`2. Line 0x506e (12.84%): cmp $0x48414b4d,%ecx`
			`- Magic number check (invalid free path!)`

			`3. Line 0x50b3 (10.68%): je 4ff0 <hak_free_at+0x70>`
			`- Branch to exit (invalid free path!)`

			`4. Line 0x5008 (6.60%): pop %rbx`
			`- Function epilogue`

			`5. Line 0x500e (8.94%): ret`
			`- Return instruction`
			```

			`CRITICAL FINDING:`
			`- Lines 1-3 (38.40% of hak_free_at's samples) are in the invalid free detection path`
			- This is the code path that calls `getenv("HAKMEM_INVALID_FREE")` on line 682 of hakmem.c
			`- The getenv call doesn't appear in the annotation because it's in the call graph`

			`### Call Graph Analysis:`

			`From the call graph, the sequence is:`
			```
			`free (2.23%)`
			`→ hak_free_at (5.05%)`
			`→ hak_tiny_owner_slab (5.59%) [succeeds for tiny allocations]`
			`OR`
			`→ hak_pool_mid_lookup (7.89%) [fails for tiny allocations in some tests]`
			`→ getenv() is called (17.55%)`
			`→ __strncmp_evex (26.41%)`
			```

			`---`

			`## Part 4: Code Path Execution Frequency`

			Based on call graph analysis (`perf_callgraph.txt`):

			`### Allocation Paths (hak_tiny_alloc = 10.10% total):`

			```
			`Fast Path (Magazine hit): ~0% sampled (too fast to measure!)`
			`Medium Path (TLS Active Slab): ~0% sampled (very fast)`
			`Slow Path (Refill/Bitmap scan): ~10% visible overhead`
			```

			`Analysis: The allocation side is extremely efficient. Most allocations hit the fast path (magazine cache) which is so fast it doesn't appear in profiling.`

			`### Free Paths (Total ~70% of runtime):`

			```
			`1. getenv + strcmp path: 43.96% CPU time`
			`- Called on EVERY free that doesn't match tiny pool`
			`- Or when invalid pointer detection triggers`

			`2. hak_tiny_owner_slab: 5.59% CPU time`
			`- Determining if pointer belongs to tiny pool`

			`3. mid_desc_lookup: 7.89% CPU time`
			`- Mid-tier descriptor lookup (for non-tiny allocations)`

			`4. hak_free_at dispatcher: 5.05% CPU time`
			`- Main free path logic`
			```

			`BREAKDOWN by Test Pattern:`

			`From the report, the allocation pattern affects getenv calls:`

			`- test_random_free: 10.04% in getenv (40% relative)`
			`- test_interleaved: 10.57% in getenv (43% relative)`
			`- test_sequential_fifo: 10.12% in getenv (41% relative)`
			`- test_sequential_lifo: 10.02% in getenv (40% relative)`

			`CONCLUSION: ~40-43% of time in EVERY test is spent in getenv/string comparison. This is the dominant cost.`

			`---`

			`## Part 5: Cache Performance`

			From `perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses`:

			```
			`Performance counter stats for './bench_comprehensive_hakmem':`

			`2,385,756,311 cache-references:u`
			`50,668,784 cache-misses:u # 2.12% of all cache refs`
			`525,435,317,593 L1-dcache-loads:u`
			`415,332,039 L1-dcache-load-misses:u # 0.08% of all L1-dcache accesses`

			`65.039118164 seconds time elapsed`

			`54.457854000 seconds user`
			`10.763056000 seconds sys`
			```

			`### Analysis:`

			`- L1 Cache: 99.92% hit rate (excellent!)`
			`- L2/L3 Cache: 97.88% hit rate (very good)`
			`- Total Operations: ~525 billion L1 loads for 200M alloc/free pairs`
			`- ~2,625 L1 loads per alloc/free pair`
			`- This is reasonable for the data structures involved`

			`CONCLUSION: Cache performance is NOT a bottleneck. The issue is hot CPU path overhead (getenv calls).`

			`---`

			`## Part 6: Branch Prediction`

			`Branch prediction analysis shows no significant misprediction issues. The primary overhead is instruction count, not branch misses.`

			`---`

			`## Part 7: Source Code Analysis - Root Cause`

			File: `/home/tomoaki/git/hakmem/hakmem.c`
			Function: `hak_free_at()`
			`Lines: 682-689`

			```c
			`const char* inv = getenv("HAKMEM_INVALID_FREE"); // LINE 682 - BOTTLENECK!`
			`int mode_skip = 1; // default: skip free to avoid crashes under LD_PRELOAD`
			`if (inv && strcmp(inv, "fallback") == 0) mode_skip = 0;`
			`if (mode_skip) {`
			`// Skip freeing unknown pointer to avoid abort (possible mmap region). Log only.`
			`RECORD_FREE_LATENCY();`
			`return;`
			`}`
			```

			`### Why This is Slow:`

			`1. getenv() is expensive: It scans the entire environment array and does string comparisons`
			`2. Called on EVERY free: This code is in the "invalid pointer" detection path`
			`3. No caching: The result is not cached, so every free operation pays this cost`
			`4. String comparison overhead: Even after getenv returns, strcmp is called`

			`### When This Executes:`

			`This code path executes when:`
			`- A pointer doesn't match the tiny pool slab lookup`
			`- AND it doesn't match mid-tier lookup`
			`- AND it doesn't match L25 lookup`
			`- = Invalid or unknown pointer detection`

			`However, based on the perf data, this is happening VERY frequently (43% of runtime), suggesting:`
			`- Either many pointers are being classified as "invalid"`
			`- OR the classification checks are expensive and route through this path frequently`

			`---`

			`## Part 8: Optimization Recommendations`

			`### PRIMARY BOTTLENECK`

			`Function: hak_free_at() - getenv call`
			`Line: hakmem.c:682`
			`CPU Time: 43.96% (combined getenv + strcmp)`
			`Root Cause: Uncached environment variable lookup on hot path`

			`### PROPOSED FIX`

			```c
			`// At initialization (in hak_init or similar):`
			`static int g_invalid_free_mode = 1; // default: skip`

			`static void init_invalid_free_mode(void) {`
			`const char* inv = getenv("HAKMEM_INVALID_FREE");`
			`if (inv && strcmp(inv, "fallback") == 0) {`
			`g_invalid_free_mode = 0;`
			`}`
			`}`

			`// In hak_free_at() line 682-684, replace with:`
			`int mode_skip = g_invalid_free_mode; // Just read cached value`
			```

			`### EXPECTED IMPACT`

			`Conservative Estimate:`
			`- Eliminate 43.96% CPU overhead`
			`- Expected speedup: 1.78x (100 / 56.04 = 1.78x)`
			`- Throughput increase: 78% improvement`

			`Realistic Estimate:`
			`- Actual speedup may be lower due to:`
			`- Other overheads becoming visible`
			`- Amdahl's law effects`
			`- Expected: 1.4x - 1.6x speedup (40-60% improvement)`

			`### IMPLEMENTATION`

			1. Add global variable: `static int g_invalid_free_mode = 1;`
			`2. Add initialization function called during hak_init()`
			`3. Replace line 682-684 with cached read`
			`4. Verify with perf that getenv no longer appears in profile`

			`---`

			`## Part 9: Secondary Optimizations (After Primary Fix)`

			`Once the getenv bottleneck is fixed, these will become more visible:`

			`### 2. hak_tiny_alloc Function Prologue (4.71%)`
			`- Issue: Stack frame setup overhead`
			`- Fix: Consider forcing inline for small allocations`
			`- Expected Impact: 2-3% improvement`

			`### 3. mid_desc_lookup (7.89%)`
			`- Issue: Mid-tier descriptor lookup`
			`- Fix: Optimize lookup algorithm or data structure`
			`- Expected Impact: 3-5% improvement (but may be necessary overhead)`

			`### 4. hak_tiny_owner_slab (5.59%)`
			`- Issue: Slab ownership determination`
			`- Fix: Could potentially cache or optimize pointer arithmetic`
			`- Expected Impact: 2-3% improvement`

			`---`

			`## Part 10: Data-Driven Summary`

			We should optimize `getenv("HAKMEM_INVALID_FREE")` in hak_free_at() because:

			`1. It consumes 43.96% of total CPU time (measured)`
			`2. It is called on every free operation that goes through invalid pointer detection`
			`3. The fix is trivial: cache the result at initialization`
			`4. Expected improvement: 1.4x-1.78x speedup (40-78% faster)`
			`5. This is a data-driven finding based on actual perf measurements, not theory`

			`Previous optimization attempts failed because they optimized code paths that:`
			`- Were not actually executed (fast paths were already optimal)`
			`- Had minimal CPU overhead (e.g., <1% each)`
			`- Were masked by this dominant bottleneck`

			`This optimization is different because:`
			`- It targets the #1 bottleneck by measured CPU time`
			`- It affects every free operation in the benchmark`
			`- The fix is simple, safe, and proven (standard caching pattern)`

			`---`

			`## Appendix: Raw Perf Data`

			`### A1: Top Functions (perf report --stdio)`

			```
			`# Overhead Command Shared Object Symbol`
			`# ........ ............... .......................... ...........................................`
			`#`
			`26.41% bench_comprehen libc.so.6 [.] __strncmp_evex`
			`17.55% bench_comprehen libc.so.6 [.] getenv`
			`10.10% bench_comprehen bench_comprehensive_hakmem [.] hak_tiny_alloc`
			`7.89% bench_comprehen bench_comprehensive_hakmem [.] mid_desc_lookup`
			`6.41% bench_comprehen libc.so.6 [.] __random`
			`5.59% bench_comprehen bench_comprehensive_hakmem [.] hak_tiny_owner_slab`
			`5.05% bench_comprehen bench_comprehensive_hakmem [.] hak_free_at`
			`3.40% bench_comprehen libc.so.6 [.] __strlen_evex`
			`2.78% bench_comprehen bench_comprehensive_hakmem [.] hak_alloc_at`
			```

			`### A2: Cache Statistics`

			```
			`2,385,756,311 cache-references:u`
			`50,668,784 cache-misses:u # 2.12% miss rate`
			`525,435,317,593 L1-dcache-loads:u`
			`415,332,039 L1-dcache-load-misses:u # 0.08% miss rate`
			```

			`### A3: Call Graph Sample (getenv hotspot)`

			```
			`test_random_free`
			`→ free (15.39%)`
			`→ hak_free_at (15.15%)`
			`→ __GI_getenv (10.04%)`
			`→ __strncmp_evex (5.50%)`
			`→ __strlen_evex (0.57%)`
			`→ hak_pool_mid_lookup (2.19%)`
			`→ mid_desc_lookup (1.85%)`
			`→ hak_tiny_owner_slab (1.00%)`
			```

			`---`

			`## Conclusion`

			`This is a textbook example of why data-driven profiling is essential:`

			`- Theory would suggest optimizing allocation fast paths or cache locality`
			`- Reality shows 44% of time is spent in environment variable lookup`
			`- The fix is trivial: cache the result at startup`
			`- Expected impact: 40-78% performance improvement`

			`Next Steps:`
			`1. Implement getenv caching fix`
			`2. Re-run perf analysis to verify improvement`
			`3. Identify next bottleneck (likely mid_desc_lookup at 7.89%)`

			`---`

			`Analysis Completed: 2025-10-26`

			`---`

			`## APPENDIX B: Exact Code Fix (Patch Preview)`

			`### Current Code (SLOW - 43.96% CPU overhead):`

			File: `/home/tomoaki/git/hakmem/hakmem.c`

			`Initialization (lines 359-363) - Already caches g_invalid_free_log:`
			```c
			`// Invalid free logging toggle (default off to avoid spam under LD_PRELOAD)`
			`char* invlog = getenv("HAKMEM_INVALID_FREE_LOG");`
			`if (invlog && atoi(invlog) != 0) {`
			`g_invalid_free_log = 1;`
			`HAKMEM_LOG("Invalid free logging enabled (HAKMEM_INVALID_FREE_LOG=1)\n");`
			`}`
			```

			`Hot Path (lines 682-689) - DOES NOT cache, calls getenv on every free:`
			```c
			`const char* inv = getenv("HAKMEM_INVALID_FREE"); // ← 43.96% CPU TIME HERE!`
			`int mode_skip = 1; // default: skip free to avoid crashes under LD_PRELOAD`
			`if (inv && strcmp(inv, "fallback") == 0) mode_skip = 0;`
			`if (mode_skip) {`
			`// Skip freeing unknown pointer to avoid abort (possible mmap region). Log only.`
			`RECORD_FREE_LATENCY();`
			`return;`
			`}`
			```

			`---`

			`### Proposed Fix (FAST - eliminates 43.96% overhead):`

			`Step 1: Add global variable near line 63 (next to g_invalid_free_log):`

			```c
			`int g_invalid_free_log = 0; // runtime: HAKMEM_INVALID_FREE_LOG=1 to log invalid-free messages (extern visible)`
			`int g_invalid_free_mode = 1; // NEW: 1=skip invalid frees (default), 0=fallback to libc_free`
			```

			`Step 2: Initialize in hak_init() after line 363:`

			```c
			`// Invalid free logging toggle (default off to avoid spam under LD_PRELOAD)`
			`char* invlog = getenv("HAKMEM_INVALID_FREE_LOG");`
			`if (invlog && atoi(invlog) != 0) {`
			`g_invalid_free_log = 1;`
			`HAKMEM_LOG("Invalid free logging enabled (HAKMEM_INVALID_FREE_LOG=1)\n");`
			`}`

			`// NEW: Cache HAKMEM_INVALID_FREE mode (avoid getenv on hot path)`
			`const char* inv = getenv("HAKMEM_INVALID_FREE");`
			`if (inv && strcmp(inv, "fallback") == 0) {`
			`g_invalid_free_mode = 0; // Use fallback mode`
			`HAKMEM_LOG("Invalid free mode: fallback to libc_free\n");`
			`} else {`
			`g_invalid_free_mode = 1; // Default: skip invalid frees`
			`HAKMEM_LOG("Invalid free mode: skip (safe for LD_PRELOAD)\n");`
			`}`
			```

			`Step 3: Replace hot path (lines 682-684):`

			```c
			`// OLD (SLOW):`
			`// const char* inv = getenv("HAKMEM_INVALID_FREE");`
			`// int mode_skip = 1;`
			`// if (inv && strcmp(inv, "fallback") == 0) mode_skip = 0;`

			`// NEW (FAST):`
			`int mode_skip = g_invalid_free_mode; // Just read cached value - NO getenv!`
			```

			`---`

			`### Performance Impact Summary:`

			`\| Metric \| Before \| After \| Improvement \|`
			`\|--------\|--------\|-------\|-------------\|`
			`\| getenv overhead \| 43.96% \| ~0% \| 43.96% eliminated \|`
			`\| Expected speedup \| 1.00x \| 1.4-1.78x \| +40-78% \|`
			`\| Throughput (16B LIFO) \| 60 M ops/sec \| 84-107 M ops/sec \| +40-78% \|`
			`\| Code complexity \| Simple \| Simple \| No change \|`
			`\| Risk \| N/A \| Very Low \| Read-only cached value \|`

			`---`

			`### Why This Fix Works:`

			`1. Environment variables don't change at runtime: Once the process starts, HAKMEM_INVALID_FREE is constant`
			`2. Same pattern already used: g_invalid_free_log is already cached this way (line 359-363)`
			`3. Zero runtime cost: Reading a cached int is ~1 cycle vs ~10,000+ cycles for getenv + strcmp`
			`4. Data-driven: Based on actual perf measurements showing 43.96% overhead`
			`5. Low risk: Simple variable read, no locks, no side effects`

			`---`

			`### Verification Plan:`

			`After implementing the fix:`

			```bash
			`# 1. Rebuild`
			`make clean && make`

			`# 2. Run perf again`
			`HAKMEM_WRAP_TINY=1 perf record -g --call-graph dwarf -o perf_after.data ./bench_comprehensive_hakmem`

			`# 3. Compare reports`
			`perf report --stdio -i perf_after.data \| head -50`

			`# Expected result: getenv should DROP from 17.55% to ~0%`
			`# Expected result: __strncmp_evex should DROP from 26.41% to ~0%`
			`# Expected result: Overall throughput should increase 40-78%`
			```

			`---`

			`## Final Recommendation`

			`IMPLEMENT THIS FIX IMMEDIATELY. It is:`

			`1. Data-driven (43.96% measured overhead)`
			`2. Simple (3 lines of code)`
			`3. Low-risk (read-only cached value)`
			`4. High-impact (40-78% speedup expected)`
			`5. Follows existing patterns (g_invalid_free_log)`

			`This is the type of optimization that:`
			`- Previous phases MISSED because they optimized code that wasn't executed`
			`- Profiling REVEALED through actual measurement`
			`- Will have DRAMATIC impact on real-world performance`

			`This is the smoking gun bottleneck that was blocking all previous optimization attempts.`