C6-Heavy (257-768B) Visibility Analysis - Phase C6-H

Date: 2025-12-10
Benchmark: ./bench_mid_large_mt_hakmem 1 1000000 400 1 (1 thread, ws=400, iters=1M)
Size Range: 257-768B (Class 6: 512B allocations)
Configuration: C6_HEAVY_LEGACY_POOLV1 profile (C7_SAFE + C6_HOT=1)


Executive Summary

Performance Gap Analysis

  • HAKMEM: 9.84M ops/s (baseline)
  • mimalloc: 51.3M ops/s
  • Performance Gap: 5.2x (mimalloc is 421% faster)

This represents a critical performance deficit in the C6-heavy allocation path, where HAKMEM achieves only 19% of mimalloc's throughput.

Key Findings

  1. C6 does NOT use Pool flatten path - With HAKMEM_TINY_C6_HOT=1, allocations route through TinyHeap v1, bypassing pool flatten entirely
  2. Address lookup dominates CPU time - hak_super_lookup (9.3%) + mid_desc_lookup (8.2%) + classify_ptr (5.8%) = 23.3% of cycles
  3. Pool operations are expensive - Despite not using flatten, pool alloc/free combined still consume ~15% of cycles
  4. Mid_desc cache provides modest gains - +6.4% improvement (9.8M → 10.4M ops/s)

Phase C6-H1: Baseline Metrics

Test Configuration

export HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1
export HAKMEM_BENCH_MIN_SIZE=257
export HAKMEM_BENCH_MAX_SIZE=768

Baseline Results

| Configuration | Throughput (ops/s) | vs mimalloc | Notes |
|---|---|---|---|
| Baseline (C6_HOT=1, mid_desc_cache=1) | 9,836,420 | 19.2% | Default profile |
| C6_HOT=1, mid_desc_cache=0 | 9,805,954 | 19.1% | Without cache |
| C6_HOT=1, mid_desc_cache=1 | 10,435,480 | 20.3% | With cache (+6.4%) |
| C6_HOT=0 (pure legacy pool) | 9,938,473 | 19.4% | Pool path ~same as TinyHeap |
| mimalloc baseline | 51,297,877 | 100.0% | Reference |

Key Observations

  1. Mid_desc cache effect: +6.4% improvement, but far from closing the gap
  2. C6_HOT vs pool path: Nearly identical performance (~9.8M-9.9M ops/s), suggesting the bottleneck is in common infrastructure (address lookup, classification)
  3. Size class routing: 257-768B → Class 6 (512B) as expected

Phase C6-H2: Pool Flatten and Cache Analysis

Pool Flatten Test (ATTEMPTED)

Finding: Pool v1 flatten path is NOT USED for C6 allocations with HAKMEM_TINY_C6_HOT=1.

# Test with flatten enabled
export HAKMEM_POOL_V1_FLATTEN_ENABLED=1
export HAKMEM_POOL_V1_FLATTEN_STATS=1
# Result: [POOL_V1_FLAT] alloc_tls_hit=0 alloc_fb=0 free_tls_hit=0 free_fb=0

Root Cause:

  • With HAKMEM_TINY_C6_HOT=1, class 6 routes to TINY_ROUTE_HEAP (TinyHeap v1)
  • TinyHeap v1 uses its own allocation path via tiny_heap_box.h, not the pool flatten path
  • Pool flatten optimizations (Phase 80-82) only apply to the legacy pool path (when C6_HOT=0); see the routing sketch after this list
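
For reference, a minimal sketch of what this routing switch amounts to. `tiny_route_for_class`, `TINY_ROUTE_POOL`, and `env_flag` are illustrative names (only `TINY_ROUTE_HEAP` and the `HAKMEM_TINY_C6_HOT` variable appear in this analysis), so this is a conceptual model, not the actual HAKMEM routing code:

```c
#include <stdbool.h>
#include <stdlib.h>

typedef enum { TINY_ROUTE_POOL, TINY_ROUTE_HEAP } tiny_route_t;

/* reads an env toggle such as HAKMEM_TINY_C6_HOT ("1" = enabled) */
static bool env_flag(const char *name) {
    const char *v = getenv(name);
    return v && v[0] == '1';
}

/* With HAKMEM_TINY_C6_HOT=1, class 6 leaves the legacy pool path and is
 * served by TinyHeap v1, which is why the Phase 80-82 flatten counters
 * stay at zero in the test above. */
static tiny_route_t tiny_route_for_class(int class_idx) {
    if (class_idx == 6 && env_flag("HAKMEM_TINY_C6_HOT"))
        return TINY_ROUTE_HEAP;   /* tiny_heap_box.h path, flatten never runs */
    return TINY_ROUTE_POOL;       /* legacy pool path, flatten-capable */
}
```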

Mid_Desc Cache Analysis

| Metric | Without Cache | With Cache | Delta |
|---|---|---|---|
| Throughput | 9.81M ops/s | 10.44M ops/s | +6.4% |
| Expected self% reduction | mid_desc_lookup: 8.2% | ~6-7% (estimated) | ~1-2% |

Conclusion: Mid_desc cache provides measurable but insufficient improvement. The 8.2% CPU time in mid_desc_lookup is reduced, but other lookup costs (hak_super_lookup, classify_ptr) remain.
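
The mid_desc cache implementation itself is not shown in this report; the sketch below only illustrates the general shape of a one-entry thread-local cache placed in front of the hash lookup, with hypothetical types (`mid_desc_t`, `MID_PAGE_SIZE`) and a stubbed slow path:

```c
#include <stdint.h>
#include <stddef.h>

#define MID_PAGE_SIZE 4096u   /* illustrative page granularity */

typedef struct { uint32_t class_idx; uint32_t block_size; } mid_desc_t;  /* illustrative fields */

/* stand-in for the real hash-based mid_desc_lookup (the 8.2% hotspot) */
static mid_desc_t *mid_desc_lookup_slow(uintptr_t page_base) {
    (void)page_base;
    static mid_desc_t dummy = { 6, 512 };
    return &dummy;
}

static __thread uintptr_t   tls_cached_page = 0;
static __thread mid_desc_t *tls_cached_desc = NULL;

/* one compare on a hit; hash + probe only when the page changes */
static inline mid_desc_t *mid_desc_lookup_cached_sketch(const void *p) {
    uintptr_t page = (uintptr_t)p & ~(uintptr_t)(MID_PAGE_SIZE - 1);
    if (page == tls_cached_page && tls_cached_desc)
        return tls_cached_desc;
    mid_desc_t *d = mid_desc_lookup_slow(page);
    tls_cached_page = page;
    tls_cached_desc = d;
    return d;
}
```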


Phase C6-H3: CPU Hotspot Analysis

Perf Stat Results

Benchmark: 9,911,926 ops/s (0.101s runtime)
Cycles:      398,766,361 cycles:u
Instructions: 1,054,643,524 instructions:u
IPC:         2.64
Page Faults: 7,131
Task Clock:  119.08 ms

Analysis:

  • IPC 2.64: instruction-level parallelism is healthy, so the cost is dominated by the sheer volume of work per operation rather than pipeline stalls
  • Cycles per operation: 398,766,361 / 1,000,000 ≈ 399 cycles/op
  • Instructions per operation: 1,054,643,524 / 1,000,000 ≈ 1,055 instructions/op

Comparison estimate (mimalloc at 51.3M ops/s):

  • Estimated cycles/op for mimalloc: ~76 cycles/op (5.2x faster)
  • HAKMEM spends roughly 5.2x as many cycles per allocation/free pair

Perf Record Hotspots (Top 20 Functions)

| Function | Self % | Category | Description |
|---|---|---|---|
| hak_super_lookup | 9.32% | Address Lookup | Superslab registry lookup (largest single cost) |
| mid_desc_lookup | 8.23% | Address Lookup | Mid-size descriptor lookup |
| hak_pool_get_class_index | 5.87% | Classification | Size→class mapping |
| classify_ptr | 5.76% | Classification | Pointer classification for free |
| hak_pool_free_v1_impl | 5.52% | Pool Free | Pool free implementation |
| hak_pool_try_alloc_v1_impl | 5.46% | Pool Alloc | Pool allocation implementation |
| free | 4.54% | Front Gate | glibc free wrapper |
| worker_run | 4.47% | Benchmark | Benchmark driver |
| ss_map_lookup | 4.35% | Address Lookup | Superslab map lookup |
| super_reg_effective_mask | 4.32% | Address Lookup | Registry mask computation |
| mid_desc_hash | 3.69% | Address Lookup | Hash computation for mid_desc |
| mid_set_header | 3.27% | Metadata | Header initialization |
| mid_page_inuse_dec_and_maybe_dn | 3.17% | Metadata | Page occupancy tracking |
| mid_desc_init_once | 2.71% | Initialization | Descriptor initialization |
| malloc | 2.60% | Front Gate | glibc malloc wrapper |
| hak_free_at | 2.53% | Front Gate | Internal free dispatcher |
| hak_pool_mid_lookup_v1_impl | 2.17% | Pool Lookup | Pool-specific descriptor lookup |
| super_reg_effective_size | 1.87% | Address Lookup | Registry size computation |
| hak_pool_free_fast_v1_impl | 1.77% | Pool Free | Fast path for pool free |
| hak_pool_init | 1.44% | Initialization | Pool initialization |

Hotspot Category Breakdown

| Category | Combined Self % | Functions |
|---|---|---|
| Address Lookup & Classification | 41.5% | hak_super_lookup, mid_desc_lookup, classify_ptr, hak_pool_get_class_index, ss_map_lookup, super_reg_effective_mask, mid_desc_hash, super_reg_effective_size, hak_pool_mid_lookup_v1_impl |
| Pool Operations | 14.8% | hak_pool_try_alloc_v1_impl, hak_pool_free_v1_impl, hak_pool_free_fast_v1_impl |
| Metadata Management | 9.2% | mid_set_header, mid_page_inuse_dec_and_maybe_dn, mid_desc_init_once |
| Front Gate | 9.7% | malloc, free, hak_free_at |
| Benchmark Driver | 4.5% | worker_run |
| Other | 20.3% | Various helpers, initialization, etc. |

Root Cause Analysis

1. Address Lookup Dominates (41.5% of CPU)

The single largest performance killer is address→metadata lookup infrastructure:

  • hak_super_lookup (9.3%): Superslab registry lookup to find which allocator owns a pointer
  • mid_desc_lookup (8.2%): Hash-based descriptor lookup for mid-size allocations
  • ss_map_lookup (4.3%): Secondary map lookup within superslab
  • classify_ptr (5.8%): Pointer classification during free
  • hak_pool_get_class_index (5.9%): Size→class index computation

Why this matters: Every allocation AND free requires multiple dependent lookups (the free path is sketched after this list):

  • Alloc: size → class_idx → descriptor → block
  • Free: ptr → superslab → descriptor → classification → free handler
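
To make the cost structure concrete, a conceptual sketch of the free-path chain follows. Every type and helper here is an illustrative stand-in (suffixed `_stub`), not HAKMEM code; the self% figures in the comments are taken from the perf table above.

```c
#include <stddef.h>

typedef enum { KIND_TINY, KIND_MID, KIND_LARGE, KIND_FOREIGN } ptr_kind_t;
typedef struct { int class_idx; } superslab_t;
typedef struct { int class_idx; } mid_desc_t;

/* trivial stand-ins for the real lookups */
static ptr_kind_t   classify_ptr_stub(const void *p)       { return p ? KIND_MID : KIND_FOREIGN; }          /* ~5.8% self */
static superslab_t *hak_super_lookup_stub(const void *p)   { (void)p; static superslab_t ss = {6}; return &ss; } /* ~9.3% self */
static mid_desc_t  *mid_desc_lookup_stub(const void *p)    { (void)p; static mid_desc_t  d  = {6}; return &d;  } /* ~8.2% self */
static void         pool_free_stub(mid_desc_t *d, void *p) { (void)d; (void)p; }                              /* ~5.5% self */

/* free(): three dependent lookups before the block is actually released */
static void free_path_sketch(void *p) {
    if (!p) return;
    ptr_kind_t kind = classify_ptr_stub(p);           /* 1: what kind of pointer? */
    if (kind == KIND_MID) {
        superslab_t *ss = hak_super_lookup_stub(p);   /* 2: which superslab owns it? */
        (void)ss;
        mid_desc_t *d = mid_desc_lookup_stub(p);      /* 3: which page descriptor? */
        pool_free_stub(d, p);                         /* only now: the actual free */
    }
    /* tiny/large/foreign kinds dispatch to their own handlers */
}
```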

Comparison to mimalloc: mimalloc likely uses:

  • Thread-local caching with minimal lookup
  • Direct pointer arithmetic from block headers
  • Segment-based organization that reduces lookup depth (a simplified sketch follows below)
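
A simplified sketch of the segment-masking idea referred to above, where pointer→metadata resolution is pure arithmetic. Segment size, page size, and the metadata layout are illustrative choices, not mimalloc's actual constants:

```c
#include <stdint.h>
#include <stddef.h>

#define SEGMENT_SIZE ((uintptr_t)4 << 20)   /* e.g. 4 MiB, segment-aligned */
#define PAGE_SHIFT   16                     /* e.g. 64 KiB pages inside a segment */

typedef struct { uint32_t block_size; uint32_t class_idx; } page_meta_t;
typedef struct { page_meta_t pages[SEGMENT_SIZE >> PAGE_SHIFT]; } segment_t;

/* ptr -> segment: one AND; segment -> page metadata: one shift + index.
 * No registry lookup, no hash probe, no classification branch. */
static inline page_meta_t *page_meta_of(void *p) {
    segment_t *seg = (segment_t *)((uintptr_t)p & ~(SEGMENT_SIZE - 1));
    size_t     idx = ((uintptr_t)p & (SEGMENT_SIZE - 1)) >> PAGE_SHIFT;
    return &seg->pages[idx];
}
```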

2. Pool Operations Still Expensive (14.8%)

Despite C6 routing through TinyHeap (not pool flatten), pool operations still consume significant cycles:

  • hak_pool_try_alloc_v1_impl (5.5%)
  • hak_pool_free_v1_impl (5.5%)

Why: TinyHeap v1 likely calls into pool infrastructure for:

  • Page allocation from mid/smallmid pool
  • Descriptor management
  • Cross-thread handling

3. Metadata Overhead (9.2%)

Mid-size allocations carry significant metadata overhead:

  • Header initialization: mid_set_header (3.3%)
  • Occupancy tracking: mid_page_inuse_dec_and_maybe_dn (3.2%)
  • Descriptor init: mid_desc_init_once (2.7%)

4. Front Gate Overhead (9.7%)

The malloc/free wrappers add non-trivial cost:

  • Route determination
  • Cross-allocator checks (jemalloc, system)
  • Lock depth checks
  • Initialization checks

Recommendations for Next Phase

Priority 1: Address Lookup Reduction (Highest Impact)

Target: 41.5% → 20-25% of cycles

Strategies:

  1. TLS Descriptor Cache: Extend mid_desc_cache to cache full allocation context (class_idx + descriptor + page_info)
  2. Fast Path Header: Embed class_idx in allocation header for instant classification on free (similar to tiny allocations); see the sketch below
  3. Segment-Based Addressing: Consider segment-style addressing (like mimalloc) where ptr→metadata is direct pointer arithmetic
  4. Superslab Lookup Bypass: For C6-heavy workloads, skip superslab lookup when we know it's mid-size

Expected Gain: 10-15M ops/s (+100-150%)
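
A minimal sketch of strategy 2 (Fast Path Header): stash the class index in a small header in front of the block so free() can classify without any registry or hash lookup. Header layout, the magic value, and both helper names are hypothetical; a real version would also need an address-range check before reading memory in front of a foreign pointer, plus alignment padding.

```c
#include <stdint.h>
#include <string.h>

#define MIDHDR_MAGIC 0xC6u

typedef struct {
    uint8_t  magic;      /* marks a headered mid block */
    uint8_t  class_idx;  /* 6 for the 257-768B range in this workload */
    uint16_t flags;
} mid_hdr_t;             /* 4 bytes; a real header likely needs 8-16 for alignment */

/* alloc side: write the header, hand the caller the address just past it */
static inline void *mid_attach_header(void *raw, uint8_t class_idx) {
    mid_hdr_t h = { MIDHDR_MAGIC, class_idx, 0 };
    memcpy(raw, &h, sizeof h);
    return (uint8_t *)raw + sizeof h;
}

/* free side: one load + compare instead of classify_ptr + super/desc lookups.
 * A real version must first confirm the address lies in HAKMEM-managed memory
 * before reading in front of the pointer. Returns 0 to request the slow path. */
static inline int mid_class_from_ptr(void *user, uint8_t *class_out) {
    mid_hdr_t h;
    memcpy(&h, (uint8_t *)user - sizeof h, sizeof h);
    if (h.magic != MIDHDR_MAGIC) return 0;
    *class_out = h.class_idx;
    return 1;
}
```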

Priority 2: Pool Path Streamlining (Medium Impact)

Target: 14.8% → 8-10% of cycles

Strategies:

  1. Dedicated C6 Fast Path: Create a specialized alloc/free path for class 6 that skips pool generality
  2. TLS Block Cache: Implement TLS-local block cache for C6 (bypass pool ring buffer overhead); see the sketch below
  3. Inline Critical Helpers: Force-inline hak_pool_get_class_index and other hot helpers

Expected Gain: 3-5M ops/s (+30-50%)
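
A minimal sketch of strategy 2 (TLS Block Cache), assuming a fixed-capacity per-thread stack of free class-6 blocks consulted before the general pool path. Capacity and names are hypothetical, and cross-thread frees and cache draining are deliberately ignored.

```c
#include <stddef.h>

#define C6_TLS_CACHE_CAP 64   /* illustrative capacity */

static __thread void  *c6_tls_cache[C6_TLS_CACHE_CAP];
static __thread size_t c6_tls_count = 0;

/* alloc fast path: pop a cached block, touching no pool rings, descriptors, or locks */
static inline void *c6_cache_pop(void) {
    return c6_tls_count ? c6_tls_cache[--c6_tls_count] : NULL;
}

/* free fast path: push if there is room; 0 means "take the normal pool free path" */
static inline int c6_cache_push(void *p) {
    if (c6_tls_count == C6_TLS_CACHE_CAP) return 0;
    c6_tls_cache[c6_tls_count++] = p;
    return 1;
}
```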

Priority 3: Metadata Streamlining (Lower Impact)

Target: 9.2% → 5-6% of cycles

Strategies:

  1. Lazy Header Init: Only initialize headers when necessary (debug mode, cross-thread)
  2. Batch Occupancy Updates: Combine multiple inuse_dec calls into a single deferred update (see the sketch below)
  3. Cached Descriptors: Reduce descriptor initialization overhead

Expected Gain: 1-2M ops/s (+10-20%)
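
A sketch of strategy 2 (Batch Occupancy Updates) under the assumption that per-page in-use counts are atomic counters. Names are illustrative; unlike the real mid_page_inuse_dec_and_maybe_dn, this version ignores page retirement, which batching would delay.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <stddef.h>

typedef struct { _Atomic uint32_t inuse; } mid_page_t;   /* illustrative */

#define INUSE_BATCH_LIMIT 16   /* bounds how stale a page's count may get */

static __thread mid_page_t *pending_page = NULL;
static __thread uint32_t    pending_decs = 0;

/* publish the accumulated decrements for the current page */
static inline void inuse_flush(void) {
    if (pending_page && pending_decs) {
        atomic_fetch_sub_explicit(&pending_page->inuse, pending_decs,
                                  memory_order_release);
        pending_decs = 0;
    }
}

/* called once per free instead of performing an atomic sub per free */
static inline void inuse_dec_batched(mid_page_t *pg) {
    if (pg != pending_page) {   /* page changed: flush the previous batch */
        inuse_flush();
        pending_page = pg;
    }
    if (++pending_decs >= INUSE_BATCH_LIMIT)
        inuse_flush();
}
```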

Priority 4: Front Gate Thinning (Lower Impact)

Target: 9.7% → 6-7% of cycles

Strategies:

  1. Size-Based Fast Path: For the mid-size range (257-768B), skip most gate checks (see the sketch below)
  2. Compile-Time Routing: When jemalloc/system allocators are not used, eliminate checks

Expected Gain: 1-2M ops/s (+10-20%)
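
A sketch of strategy 1 (Size-Based Fast Path): a single unsigned range check in the malloc wrapper routes 257-768B requests to a dedicated path before any of the general gate checks run. hak_mid_alloc_fast and hak_malloc_general are hypothetical stand-ins (stubbed with plain malloc here).

```c
#include <stdlib.h>

/* stand-ins for the hypothetical dedicated path and the full front gate */
static void *hak_mid_alloc_fast(size_t size) { return malloc(size); }
static void *hak_malloc_general(size_t size) { return malloc(size); }

/* one range check picks the route; jemalloc/system checks, lock-depth and
 * init checks remain on the general path only */
static void *malloc_gate_sketch(size_t size) {
    if (size - 257u <= 768u - 257u)   /* true exactly for 257 <= size <= 768 */
        return hak_mid_alloc_fast(size);
    return hak_malloc_general(size);
}
```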


Comparison to Historical Baselines

| Phase | Configuration | Throughput | vs Current | Notes |
|---|---|---|---|---|
| Phase 54 | C7_SAFE, mixed 16-1024B | 28.1M ops/s | 2.9x | Mixed workload |
| Phase 80 | C6-heavy, flatten OFF | 23.1M ops/s | 2.4x | Legacy baseline |
| Phase 81 | C6-heavy, flatten ON | 25.9M ops/s | 2.6x | +10% from flatten |
| Phase 82 | C6-heavy, flatten ON | 26.7M ops/s | 2.7x | +13% from flatten |
| Current (C6-H) | C6-heavy, C6_HOT=1 | 9.8M ops/s | 1.0x | REGRESSION |

CRITICAL FINDING: Current baseline (9.8M ops/s) is 2.4-2.7x SLOWER than historical C6-heavy baselines (23-27M ops/s).

Possible Causes:

  1. Configuration difference: Historical tests may have used different profile (LEGACY vs C7_SAFE)
  2. Routing change: C6_HOT=1 may be forcing a slower path through TinyHeap
  3. Build/compiler difference: Flags or LTO settings may have changed
  4. Benchmark variance: Different workload characteristics

Action Required: Replicate historical Phase 80-82 configurations exactly to identify regression point.


Verification of Historical Configuration

To pinpoint the regression, compare the exact Phase 80-82 configuration against the current one:

Phase 80-82 Configuration (from CURRENT_TASK.md):

HAKMEM_BENCH_MIN_SIZE=257
HAKMEM_BENCH_MAX_SIZE=768
HAKMEM_TINY_HEAP_PROFILE=LEGACY  # ← Different!
HAKMEM_TINY_HOTHEAP_V2=0
HAKMEM_POOL_V2_ENABLED=0
HAKMEM_POOL_V1_FLATTEN_ENABLED=1
HAKMEM_POOL_V1_FLATTEN_STATS=1

Current Configuration:

HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1  # Sets TINY_HEAP_PROFILE=C7_SAFE
HAKMEM_TINY_C6_HOT=1  # ← Adds TinyHeap routing
HAKMEM_POOL_V1_FLATTEN_ENABLED=0  # ← Flatten OFF by default

Key Difference: Historical tests used TINY_HEAP_PROFILE=LEGACY, which likely routes C6 through pure pool path (no TinyHeap). Current C6_HEAVY_LEGACY_POOLV1 profile sets TINY_HEAP_PROFILE=C7_SAFE + TINY_C6_HOT=1, routing C6 through TinyHeap.


Action Items for Phase C6-H+1

  1. Replicate Historical Baseline (URGENT)

    export HAKMEM_BENCH_MIN_SIZE=257
    export HAKMEM_BENCH_MAX_SIZE=768
    export HAKMEM_TINY_HEAP_PROFILE=LEGACY
    export HAKMEM_TINY_HOTHEAP_V2=0
    export HAKMEM_POOL_V2_ENABLED=0
    export HAKMEM_POOL_V1_FLATTEN_ENABLED=0
    # Expected: ~23M ops/s
    
  2. Test Flatten ON with Historical Config

    # Same as above, but:
    export HAKMEM_POOL_V1_FLATTEN_ENABLED=1
    export HAKMEM_POOL_V1_FLATTEN_STATS=1
    # Expected: ~26M ops/s with active flatten stats
    
  3. Profile Comparison Matrix

    • LEGACY vs C7_SAFE profile
    • C6_HOT=0 vs C6_HOT=1
    • Flatten OFF vs ON
    • Identify which combination yields best performance
  4. Address Lookup Prototype

    • Implement TLS allocation context cache (class_idx + descriptor + page)
    • Measure impact on lookup overhead (target: 41.5% → 25%)
  5. Update ENV_PROFILE_PRESETS.md

    • Clarify that C6_HEAVY_LEGACY_POOLV1 uses C7_SAFE profile (not pure LEGACY)
    • Add note about C6_HOT routing implications
    • Document performance differences between profile choices

Success Criteria for Phase C6-H+1

  • Reproduce historical baseline: Achieve 23-27M ops/s with LEGACY profile
  • Understand routing impact: Quantify C6_HOT=0 vs C6_HOT=1 difference
  • Identify optimization path: Choose between:
    • Optimizing TinyHeap C6 path (if C6_HOT=1 is strategic)
    • Optimizing pool flatten path (if LEGACY/C6_HOT=0 is preferred)
    • Hybrid approach with runtime selection

Target: ~30M ops/s (closing roughly half of the current gap to the 51.3M ops/s mimalloc baseline) by the end of the next phase.


Appendix A: Full Perf Report Output

# Samples: 656  of event 'cycles:u'
# Event count (approx.): 409,174,521
#
# Overhead  Symbol
# ........  .....................................
     9.32%  [.] hak_super_lookup
     8.23%  [.] mid_desc_lookup
     5.87%  [.] hak_pool_get_class_index
     5.76%  [.] classify_ptr
     5.52%  [.] hak_pool_free_v1_impl
     5.46%  [.] hak_pool_try_alloc_v1_impl
     4.54%  [.] free
     4.47%  [.] worker_run
     4.35%  [.] ss_map_lookup
     4.32%  [.] super_reg_effective_mask
     3.69%  [.] mid_desc_hash
     3.27%  [.] mid_set_header
     3.17%  [.] mid_page_inuse_dec_and_maybe_dn
     2.71%  [.] mid_desc_init_once
     2.60%  [.] malloc
     2.53%  [.] hak_free_at
     2.17%  [.] hak_pool_mid_lookup_v1_impl
     1.87%  [.] super_reg_effective_size
     1.77%  [.] hak_pool_free_fast_v1_impl
     1.64%  [k] 0xffffffffae200ba0 (kernel)
     1.44%  [.] hak_pool_init
     1.42%  [.] hak_pool_is_poolable
     1.21%  [.] should_sample
     1.12%  [.] hak_pool_free
     1.11%  [.] hak_super_hash
     1.09%  [.] hak_pool_try_alloc
     0.95%  [.] mid_desc_lookup_cached
     0.93%  [.] hak_pool_v1_flatten_enabled
     0.76%  [.] hak_pool_v2_route
     0.57%  [.] ss_map_hash
     0.55%  [.] hak_in_wrapper

Appendix B: Test Commands Summary

# Baseline
export HAKMEM_PROFILE=C6_HEAVY_LEGACY_POOLV1
export HAKMEM_BENCH_MIN_SIZE=257
export HAKMEM_BENCH_MAX_SIZE=768
./bench_mid_large_mt_hakmem 1 1000000 400 1
# Result: 9,836,420 ops/s

# Mimalloc comparison
./bench_mid_large_mt_mi 1 1000000 400 1
# Result: 51,297,877 ops/s (5.2x faster)

# Mid_desc cache OFF
export HAKMEM_MID_DESC_CACHE_ENABLED=0
./bench_mid_large_mt_hakmem 1 1000000 400 1
# Result: 9,805,954 ops/s

# Mid_desc cache ON
export HAKMEM_MID_DESC_CACHE_ENABLED=1
./bench_mid_large_mt_hakmem 1 1000000 400 1
# Result: 10,435,480 ops/s (+6.4%)

# Perf stat
perf stat -e cycles:u,instructions:u,task-clock,page-faults:u \
  ./bench_mid_large_mt_hakmem 1 1000000 400 1
# Result: 398M cycles, 1.05B instructions, IPC=2.64

# Perf record
perf record -F 5000 --call-graph dwarf -e cycles:u \
  -o perf.data.c6_flat ./bench_mid_large_mt_hakmem 1 1000000 400 1
perf report -i perf.data.c6_flat --stdio --no-children

End of Report