Files

Moe Charm (CI) a9ddb52ad4 ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)

Phase 1 完了：環境変数整理 + fprintf デバッグガード

ENV変数削除（BG/HotMag系）:
- core/hakmem_tiny_init.inc: HotMag ENV 削除 (~131 lines)
- core/hakmem_tiny_bg_spill.c: BG spill ENV 削除
- core/tiny_refill.h: BG remote 固定値化
- core/hakmem_tiny_slow.inc: BG refs 削除

fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE):
- core/hakmem_shared_pool.c: Lock stats (~18 fprintf)
- core/page_arena.c: Init/Shutdown/Stats (~27 fprintf)
- core/hakmem.c: SIGSEGV init message

ドキュメント整理:
- 328 markdown files 削除（旧レポート・重複docs）

性能確認:
- Larson: 52.35M ops/s (前回52.8M、安定動作✅)
- ENV整理による機能影響なし
- Debug出力は一部残存（次phase で対応）

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-26 14:45:26 +09:00

9.2 KiB

Raw Blame History

Phase 7 Tiny Allocator - Syscall Bottleneck Investigation

Date: 2025-11-09 Issue: Phase 7 performance is 8x slower than System malloc (10.9M vs 89M ops/s) Root Cause: Excessive syscalls (442 mmaps vs System's 8 mmaps in 50k operations)

Executive Summary

Measured syscalls (50k operations, 256B working set):

HAKMEM Phase 7: 447 mmaps, 409 madvise (856 total syscalls)
System malloc: 8 mmaps, 1 munmap (9 total syscalls)
HAKMEM has 55-95x more syscalls than System malloc

Root cause breakdown:

Header overflow (1016-1024B): 206 allocations (0.82%) → 409 mmaps
SuperSlab initialization: 6 mmaps (one-time cost)
Alignment overhead: 32 additional mmaps from 2x allocation pattern

Performance impact:

Each mmap: ~500-1000 cycles
409 excessive mmaps: ~200,000-400,000 cycles total
Benchmark: 50,000 operations
Syscall overhead: 4-8 cycles per operation (significant!)

Detailed Analysis

1. Allocation Size Distribution

Benchmark: bench_random_mixed (size = 16 + (rand() & 0x3FF))
Total allocations: 25,063

Size Range       Count      Percentage   Classification
--------------------------------------------------------------
  16 -  127:      2,750     10.97%       Safe (no header overflow)
 128 -  255:      3,091     12.33%       Safe (no header overflow)
 256 -  511:      6,225     24.84%       Safe (no header overflow)
 512 - 1015:     12,384     49.41%       Safe (no header overflow)
1016 - 1024:        206      0.82%       ← CRITICAL: Header overflow!
1025 - 1039:          0      0.00%       (Out of benchmark range)

Key insight: Only 0.82% of allocations cause header overflow, but they generate 98% of syscalls.

2. MMAP Source Breakdown

Instrumentation results:

SuperSlab mmaps:        6  (TLS cache initialization, one-time)
Final fallback mmaps: 409  (header overflow 1016-1024B)
-------------------------------------------
TOTAL mmaps:          415  (measured by instrumentation)
Actual mmaps (strace):447  (32 unaccounted, likely alignment overhead)

madvise breakdown:

madvise calls: 409  (matches final fallback mmaps EXACTLY)

Why 409 mmaps for 206 allocations?

Each allocation triggers hak_alloc_mmap_impl(size)
Implementation allocates 2x size for alignment
Munmaps excess → triggers madvise for memory release
Each allocation = ~2 syscalls (mmap + madvise)

3. Code Path Analysis

What happens for a 1024B allocation with Phase 7 header:

// User requests 1024B
size_t size = 1024;

// Phase 7 adds 1-byte header
size_t alloc_size = size + 1;  // 1025B

// Check Tiny range
if (alloc_size > TINY_MAX_SIZE) {  // 1025 > 1024 → TRUE
    // Reject to Tiny, fall through to Mid/ACE
}

// Mid range check (8KB-32KB)
if (size >= 8192) → FALSE  // 1025 < 8192

// ACE check (disabled in benchmark)
→ Returns NULL

// Final fallback (core/box/hak_alloc_api.inc.h:161-181)
else if (size >= TINY_MAX_SIZE) {  // 1025 >= 1024 → TRUE
    ptr = hak_alloc_mmap_impl(size);  // ← SYSCALL!
}

Result: Every 1016-1024B allocation triggers mmap fallback.

4. Performance Impact Calculation

Syscall overhead:

mmap latency: ~500-1000 cycles (kernel mode switch + page table update)
madvise latency: ~300-500 cycles

Total cost for 206 header overflow allocations:

409 mmaps × 750 cycles = ~307,000 cycles
409 madvise × 400 cycles = ~164,000 cycles
Total: ~471,000 cycles overhead

Benchmark workload:

50,000 operations
Throughput: 10.9M ops/s (Phase 7) vs 89M ops/s (System)
Overhead per operation: 471,000 / 25,000 ≈ 19 cycles/alloc

Why this is catastrophic:

TLS cache hit (normal case): ~5-10 cycles
Header overflow case: ~19 cycles overhead + allocation cost
Net effect: 3-4x slowdown for affected sizes

5. Comparison with System Malloc

System malloc (glibc tcache):

mmap calls: 8 (initialization only)
  - Main arena: 1 mmap
  - Thread cache: 7 mmaps (one per thread/arena)
munmap calls: 1

System malloc strategy for 1024B:

Uses tcache (thread-local cache)
Pre-allocated from arena
No syscalls in hot path

HAKMEM Phase 7:

Header forces 1025B allocation
Exceeds TINY_MAX_SIZE
Falls to mmap syscall
Syscall on EVERY allocation

Root Cause Summary

Problem #1: Off-by-one TINY_MAX_SIZE boundary

TINY_MAX_SIZE = 1024
Header overhead = 1 byte
Request 1024B → allocate 1025B → reject to mmap
All 1KB allocations fall through to syscalls

Problem #2: Missing Mid allocator coverage

Gap: 1025-8191B (TINY_MAX_SIZE+1 to Mid 8KB)
ACE disabled in benchmark
No fallback except mmap
8KB gap forces syscalls

Problem #3: mmap overhead pattern

Each mmap allocates 2x size for alignment
Munmaps excess
Triggers madvise
Each allocation = 2+ syscalls

Quick Fixes (Priority Order)

Fix #1: Increase TINY_MAX_SIZE to 1025+ ⭐⭐⭐⭐⭐ (CRITICAL)

Change:

// core/hakmem_tiny.h:26
-#define TINY_MAX_SIZE 1024
+#define TINY_MAX_SIZE 1536  // Accommodate 1024B + header with margin

Effect:

All 1016-1024B allocations stay in Tiny
Eliminates 409 mmaps (92% reduction!)
Expected improvement: 10.9M → 40-60M ops/s (+270-450%)

Implementation time: 5 minutes Risk: Low (just increases Tiny range)

Fix #2: Add class 8 (2KB) to Tiny allocator ⭐⭐⭐⭐

Change:

// core/hakmem_tiny.h
-#define TINY_NUM_CLASSES 8
+#define TINY_NUM_CLASSES 9
#define TINY_MAX_SIZE 2048

static const size_t g_tiny_class_sizes[] = {
    8, 16, 32, 64, 128, 256, 512, 1024,
+   2048  // Class 8
};

Effect:

Covers 1025-2048B gap
Future-proof for larger headers (if needed)
Expected improvement: Same as Fix #1, plus better coverage

Implementation time: 30 minutes Risk: Medium (need to update SuperSlab capacity calculations)

Fix #3: Pre-warm TLS cache for class 7 (1KB) ⭐⭐⭐

Already implemented in Phase 7-3!

Effect:

First allocation hits TLS cache (not refill)
Reduces cold-start mmap calls
Expected improvement: Already done (+180-280%)

Fix #4: Optimize mmap alignment overhead ⭐⭐

Change: Use MAP_ALIGNED or posix_memalign instead of 2x mmap pattern

Effect:

Reduces mmap calls from 2 per allocation to 1
Eliminates madvise calls
Expected improvement: +10-15% (minor)

Implementation time: 2 hours Risk: Medium (platform-specific)

Recommended Action Plan

Immediate (今すぐ - 5 minutes):

Change TINY_MAX_SIZE from 1024 to 1536 ← DO THIS NOW!
Rebuild and test
Measure performance (expect 40-60M ops/s)

Short-term (今日中 - 2 hours):

Add class 8 (2KB) to Tiny allocator
Update SuperSlab configuration
Full benchmark suite validation

Long-term (今週 - 1 week):

Fill 1KB-8KB gap with Mid allocator extension
Optimize mmap alignment pattern
Consider adaptive TINY_MAX_SIZE based on workload

Expected Performance After Fix #1

Before (current):

bench_random_mixed 128B:  10.9M ops/s  (vs System 89M = 12%)
Bottleneck: 409 mmaps for 206 allocations (0.82%)

After (TINY_MAX_SIZE=1536):

bench_random_mixed 128B:  40-60M ops/s  (vs System 89M = 45-67%)
Improvement: +270-450% 🚀
Syscalls: 6-10 mmaps (initialization only)

Rationale:

Eliminates 409/447 mmaps (91% reduction)
Remaining 6 mmaps are SuperSlab initialization (one-time)
Hot path returns to 3-5 instruction TLS cache hit
Matches System malloc design (no syscalls in hot path)

Conclusion

Root cause: 1-byte header pushes 1024B allocations to 1025B, exceeding TINY_MAX_SIZE (1024), forcing mmap fallback for every allocation.

Impact: 98% of syscalls (409/447) come from 0.82% of allocations (206/25,063).

Solution: Increase TINY_MAX_SIZE to 1536+ to accommodate header overhead.

Expected result: +270-450% performance improvement (10.9M → 40-60M ops/s), approaching System malloc parity.

Next step: Implement Fix #1 (5 minutes), rebuild, and verify with benchmarks.

Appendix: Benchmark Data

Test command:

./bench_syscall_trace_hakmem 50000 256 42

strace output:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 53.52    0.002279           5       447           mmap
 44.79    0.001907           4       409           madvise
  1.69    0.000072           8         9           munmap
------ ----------- ----------- --------- --------- ----------------
100.00    0.004258           4       865           total

Instrumentation output:

SuperSlab mmaps:        6  (TLS cache initialization)
Final fallback mmaps: 409  (header overflow 1016-1024B)
-------------------------------------------
TOTAL mmaps:          415

Size distribution:

1016-1024B: 206 allocations (0.82%)
512-1015B: 12,384 allocations (49.41%)
All others: 12,473 allocations (49.77%)

Key metrics:

Total operations: 50,000
Total allocations: 25,063
Total frees: 25,063
Throughput: 10.9M ops/s (Phase 7) vs 89M ops/s (System) → 8.2x slower

Generated by: Claude Code (Task Agent) Date: 2025-11-09 Status: Investigation complete, fix identified, ready for implementation

9.2 KiB Raw Blame History Unescape Escape

Phase 7 Tiny Allocator - Syscall Bottleneck Investigation

Executive Summary

Detailed Analysis

1. Allocation Size Distribution

2. MMAP Source Breakdown

3. Code Path Analysis

4. Performance Impact Calculation

5. Comparison with System Malloc

Root Cause Summary

Quick Fixes (Priority Order)

Fix #1: Increase TINY_MAX_SIZE to 1025+ ⭐⭐⭐⭐⭐ (CRITICAL)

Fix #2: Add class 8 (2KB) to Tiny allocator ⭐⭐⭐⭐

Fix #3: Pre-warm TLS cache for class 7 (1KB) ⭐⭐⭐

Fix #4: Optimize mmap alignment overhead ⭐⭐

Recommended Action Plan

Expected Performance After Fix #1

Conclusion

Appendix: Benchmark Data

9.2 KiB

Raw Blame History