hakmem

tomoaki/hakmem

Fork 0

Commit Graph

Author	SHA1	Message	Date
Moe Charm (CI)	7311d32574	Phase 24 PageArena/HotSpanBox: Mid/VM page reuse cache (structural limit identified) Summary: - Implemented PageArena (Box PA1-PA3) for Mid-Large (8-52KB) / L25 (64KB-2MB) - Integration: Pool TLS Arena + L25 alloc/refill paths - Result: Minimal impact (+4.7% Mid, 0% VM page-fault reduction) - Conclusion: Structural limit - existing Arena/Pool/L25 already optimized Implementation: 1. Box PA1: Hot Page Cache (4KB pages, LIFO stack, 1024 slots) - core/page_arena.c: hot_page_alloc/free with mutex protection - TLS cache for 4KB pages 2. Box PA2: Warm Span Cache (64KB-2MB spans, size-bucketed) - 64KB/128KB/2MB span caches (256/128/64 slots) - Size-class based allocation 3. Box PA3: Cold Path (mmap fallback) - page_arena_alloc_pages/aligned with fallback to direct mmap Integration Points: 4. Pool TLS Arena (core/pool_tls_arena.c) - chunk_ensure(): Lazy init + page_arena_alloc_pages() hook - arena_cleanup_thread(): Return chunks to PageArena if enabled - Exponential growth preserved (1MB → 8MB) 5. L25 Pool (core/hakmem_l25_pool.c) - l25_alloc_new_run(): Lazy init + page_arena_alloc_aligned() hook - refill_freelist(): PageArena allocation for bundles - 2MB run carving preserved ENV Variables: - HAKMEM_PAGE_ARENA_ENABLE=1 (default: 0, OFF) - HAKMEM_PAGE_ARENA_HOT_SIZE=1024 (default: 1024) - HAKMEM_PAGE_ARENA_WARM_64K=256 (default: 256) - HAKMEM_PAGE_ARENA_WARM_128K=128 (default: 128) - HAKMEM_PAGE_ARENA_WARM_2M=64 (default: 64) Benchmark Results: - Mid-Large MT (4T, 40K iter, 2KB): - OFF: 84,535 page-faults, 726K ops/s - ON: 84,534 page-faults, 760K ops/s (+4.7% ops, -0.001% faults) - VM Mixed (200K iter): - OFF: 102,134 page-faults, 257K ops/s - ON: 102,134 page-faults, 255K ops/s (0% change) Root Cause Analysis: - Hypothesis: 50-66% page-fault reduction (80-100K → 30-40K) - Actual: <1% page-fault reduction, minimal performance impact - Reason: Structural limit - existing Arena/Pool/L25 already highly optimized - 1MB chunk sizes with high-density linear carving - TLS ring + exponential growth minimize mmap calls - PageArena becomes double-buffering layer with no benefit - Remaining page-faults from kernel zero-clear + app access patterns Lessons Learned: 1. Mid/Large allocators already page-optimal via Arena/Pool design 2. Middle-layer caching ineffective when base layer already optimized 3. Page-fault reduction requires app-level access pattern changes 4. Tiny layer (Phase 23) remains best target for frontend optimization Next Steps: - Defer PageArena (low ROI, structural limit reached) - Focus on upper layers (allocation pattern analysis, size distribution) - Consider app-side access pattern optimization 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-17 03:22:27 +09:00
Moe Charm (CI)	40be86425b	Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis Phase 12 SP-SLOT Box (Complete): - Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs - 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS - Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%) - Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md Mid-Large P0 Analysis (2025-11-14): - Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0) - Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%) - Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex - Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures) - Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md Next: Lock-free remote queue to reduce futex from 67% → <10% Files modified: - core/hakmem_shared_pool.c (SP-SLOT implementation) - core/pool_tls.c (debug logging + stdatomic.h) - core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h) - CURRENT_TASK.md (Phase 12 completion status) 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-14 14:18:56 +09:00
Moe Charm (CI)	1010a961fb	Tiny: fix header/stride mismatch and harden refill paths - Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte header during allocation, but linear carve/refill and initial slab capacity still used bare class block sizes. This mismatch could overrun slab usable space and corrupt freelists, causing reproducible SEGV at ~100k iters. Changes - Superslab: compute capacity with effective stride (block_size + header for classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a debug-only bound check in superslab_alloc_from_slab() to fail fast if carve would exceed usable bytes. - Refill (non-P0 and P0): use header-aware stride for all linear carving and TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h also uses stride, not raw class size. - Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes before splicing into freelist (already present). Notes - This unifies the memory layout across alloc/linear-carve/refill with a single stride definition and keeps class7 (1024B) headerless as designed. - Debug builds add fail-fast checks; release builds remain lean. Next - Re-run Tiny benches (256/1024B) in debug to confirm stability, then in release. If any remaining crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0 to isolate P0 batch carve, and continue reducing branch-miss as planned.	2025-11-09 18:55:50 +09:00

Author

SHA1

Message

Date

Moe Charm (CI)

7311d32574

Phase 24 PageArena/HotSpanBox: Mid/VM page reuse cache (structural limit identified)

Summary:
- Implemented PageArena (Box PA1-PA3) for Mid-Large (8-52KB) / L25 (64KB-2MB)
- Integration: Pool TLS Arena + L25 alloc/refill paths
- Result: Minimal impact (+4.7% Mid, 0% VM page-fault reduction)
- Conclusion: Structural limit - existing Arena/Pool/L25 already optimized

Implementation:
1. Box PA1: Hot Page Cache (4KB pages, LIFO stack, 1024 slots)
   - core/page_arena.c: hot_page_alloc/free with mutex protection
   - TLS cache for 4KB pages

2. Box PA2: Warm Span Cache (64KB-2MB spans, size-bucketed)
   - 64KB/128KB/2MB span caches (256/128/64 slots)
   - Size-class based allocation

3. Box PA3: Cold Path (mmap fallback)
   - page_arena_alloc_pages/aligned with fallback to direct mmap

Integration Points:
4. Pool TLS Arena (core/pool_tls_arena.c)
   - chunk_ensure(): Lazy init + page_arena_alloc_pages() hook
   - arena_cleanup_thread(): Return chunks to PageArena if enabled
   - Exponential growth preserved (1MB → 8MB)

5. L25 Pool (core/hakmem_l25_pool.c)
   - l25_alloc_new_run(): Lazy init + page_arena_alloc_aligned() hook
   - refill_freelist(): PageArena allocation for bundles
   - 2MB run carving preserved

ENV Variables:
- HAKMEM_PAGE_ARENA_ENABLE=1 (default: 0, OFF)
- HAKMEM_PAGE_ARENA_HOT_SIZE=1024 (default: 1024)
- HAKMEM_PAGE_ARENA_WARM_64K=256 (default: 256)
- HAKMEM_PAGE_ARENA_WARM_128K=128 (default: 128)
- HAKMEM_PAGE_ARENA_WARM_2M=64 (default: 64)

Benchmark Results:
- Mid-Large MT (4T, 40K iter, 2KB):
  - OFF: 84,535 page-faults, 726K ops/s
  - ON:  84,534 page-faults, 760K ops/s (+4.7% ops, -0.001% faults)
- VM Mixed (200K iter):
  - OFF: 102,134 page-faults, 257K ops/s
  - ON:  102,134 page-faults, 255K ops/s (0% change)

Root Cause Analysis:
- Hypothesis: 50-66% page-fault reduction (80-100K → 30-40K)
- Actual: <1% page-fault reduction, minimal performance impact
- Reason: Structural limit - existing Arena/Pool/L25 already highly optimized
  - 1MB chunk sizes with high-density linear carving
  - TLS ring + exponential growth minimize mmap calls
  - PageArena becomes double-buffering layer with no benefit
  - Remaining page-faults from kernel zero-clear + app access patterns

Lessons Learned:
1. Mid/Large allocators already page-optimal via Arena/Pool design
2. Middle-layer caching ineffective when base layer already optimized
3. Page-fault reduction requires app-level access pattern changes
4. Tiny layer (Phase 23) remains best target for frontend optimization

Next Steps:
- Defer PageArena (low ROI, structural limit reached)
- Focus on upper layers (allocation pattern analysis, size distribution)
- Consider app-side access pattern optimization

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-17 03:22:27 +09:00

Moe Charm (CI)

40be86425b

Phase 12 SP-SLOT + Mid-Large P0 fix: Pool TLS debug logging & analysis

Phase 12 SP-SLOT Box (Complete):
- Per-slot state tracking (UNUSED/ACTIVE/EMPTY) for shared SuperSlabs
- 3-stage allocation: EMPTY reuse → UNUSED reuse → New SS
- Results: 877 → 72 SuperSlabs (-92%), 563K → 1.30M ops/s (+131%)
- Reports: PHASE12_SP_SLOT_BOX_IMPLEMENTATION_REPORT.md, CURRENT_TASK.md

Mid-Large P0 Analysis (2025-11-14):
- Root cause: Pool TLS disabled by default (build.sh:106 → POOL_TLS_PHASE1=0)
- Fix: POOL_TLS_PHASE1=1 build flag → 0.24M → 0.97M ops/s (+304%)
- Identified P0-2: futex bottleneck (67% syscall time) in pool_remote_push mutex
- Added debug logging: pool_tls.c (refill failures), pool_tls_arena.c (mmap/chunk failures)
- Reports: MID_LARGE_P0_FIX_REPORT_20251114.md, BOTTLENECK_ANALYSIS_REPORT_20251114.md

Next: Lock-free remote queue to reduce futex from 67% → <10%

Files modified:
- core/hakmem_shared_pool.c (SP-SLOT implementation)
- core/pool_tls.c (debug logging + stdatomic.h)
- core/pool_tls_arena.c (debug logging + stdio.h/errno.h/stdatomic.h)
- CURRENT_TASK.md (Phase 12 completion status)

🤖 Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>

2025-11-14 14:18:56 +09:00

Moe Charm (CI)

1010a961fb

Tiny: fix header/stride mismatch and harden refill paths

- Root cause: header-based class indexing (HEADER_CLASSIDX=1) wrote a 1-byte
  header during allocation, but linear carve/refill and initial slab capacity
  still used bare class block sizes. This mismatch could overrun slab usable
  space and corrupt freelists, causing reproducible SEGV at ~100k iters.

Changes
- Superslab: compute capacity with effective stride (block_size + header for
  classes 0..6; class7 remains headerless) in superslab_init_slab(). Add a
  debug-only bound check in superslab_alloc_from_slab() to fail fast if carve
  would exceed usable bytes.
- Refill (non-P0 and P0): use header-aware stride for all linear carving and
  TLS window bump operations. Ensure alignment/validation in tiny_refill_opt.h
  also uses stride, not raw class size.
- Drain: keep existing defense-in-depth for remote sentinel and sanitize nodes
  before splicing into freelist (already present).

Notes
- This unifies the memory layout across alloc/linear-carve/refill with a single
  stride definition and keeps class7 (1024B) headerless as designed.
- Debug builds add fail-fast checks; release builds remain lean.

Next
- Re-run Tiny benches (256/1024B) in debug to confirm stability, then in
  release. If any remaining crash persists, bisect with HAKMEM_TINY_P0_BATCH_REFILL=0
  to isolate P0 batch carve, and continue reducing branch-miss as planned.

2025-11-09 18:55:50 +09:00

3 Commits