Implements header write optimization for C6 v5 allocator by moving
header initialization from per-alloc time to carve time (during page
refill). This eliminates redundant header writes on the hot path.
Implementation:
- Added HAKMEM_SMALL_HEAP_V5_HEADER_MODE ENV (full|light, default: full)
- Added header_mode field to SmallHeapCtxV5 (cached per-thread)
- Modified alloc fast/slow paths to skip header write in light mode
- Modified refill to write headers during carve in light mode
- Free path unchanged (header validation still works)
Benchmark Results (2M iterations, ws=400):
C6-HEAVY (257-768B):
- Baseline (v5 OFF): 47.95 Mops/s
- v5 full mode: 38.97 Mops/s (-18.7% vs baseline)
- v5 light mode: 39.25 Mops/s (-18.1% vs baseline, +0.7% vs full)
MIXED 16-1024B:
- v5 OFF: 43.59 Mops/s
- v5 full mode: 36.53 Mops/s (-16.2% vs OFF)
- v5 light mode: 38.04 Mops/s (-12.7% vs OFF, +4.1% vs full)
Analysis:
- Light mode shows modest improvement over full (+0.7-4.1%)
- C6 v5 performance gap vs baseline (-18%) indicates need for
further optimization beyond header writes
- Mixed workload benefits more from light mode (+4.1% vs full)
- No regressions in safety/correctness observed
Research findings:
- Header write optimization alone insufficient to close v5 gap
- Need to investigate other hot path costs (freelist ops, metadata access)
- Light mode validates the carve-time header concept
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
- Single TLS segment (eliminates slot search loop)
- O(1) page_meta_of() (direct segment range check, no iteration)
- __builtin_ctz for O(1) free page finding in bitmap
- Simplified free path using page_meta_of() only (no find_page)
- Partial limit 1 (minimal list traversal)
Performance:
- Before (v5-2): 14.7M ops/s
- After (v5-3): 38.5M ops/s (+162%)
- vs baseline: 44.9M ops/s (-14%)
- SEGV: None, stable at ws=800
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Implemented TLS fastlist logic for C6 in smallobject_hotbox_v4.c (alloc/free).
- Added SmallC6FastState struct and g_small_c6_fast TLS variable.
- Gated the fastlist logic with HAKMEM_SMALL_HEAP_V4_FASTLIST (default OFF) due to observed instability in mixed workloads.
- Fixed a memory leak in small_heap_free_fast_v4 fallback path by calling hak_pool_free.
- Updated CURRENT_TASK.md with phase report.
Implementation:
- SmallObject HotBox v4 (core/smallobject_hotbox_v4.c) now fully implements C6-only allocations and frees, including current/partial management and freelist operations.
- Cold Iface (tiny_heap based) for page refill/retire is integrated.
- Stats instrumentation (v4-mid-5) added to small_heap_alloc_fast_v4 and small_heap_free_fast_v4, with a new header file core/box/smallobject_hotbox_v4_stats_box.h and atexit dump function.
Updates:
- CURRENT_TASK.md has been condensed and updated with summaries of Phase v4-mid-2 (C6-only v4), Phase v4-mid-3 (C5-only v4 pilot), and the stats implementation (v4-mid-5).
- docs/analysis/SMALLOBJECT_V4_BOX_DESIGN.md updated with A/B results and conclusions for C6-only and C5-only v4 implementations.
- The previous CURRENT_TASK.md content has been archived to CURRENT_TASK_ARCHIVE_20251210.md.
- Update environment profile presets and visibility analysis
- Enhance small object and tiny segment v4 box implementations
- Refine C7 ultra and C6 heavy allocation strategies
- Add comprehensive performance metrics and design documentation
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
- Redefine TinyHotHeap v2 as per-thread Hot Box with clear boundaries
- Add comprehensive OS statistics tracking for SS allocations
- Implement route-based free handling for TinyHeap v2
- Add C6/C7 debugging and statistics improvements
- Update documentation with implementation guidelines and analysis
- Add new box headers for stats, routing, and front-end management
- Added experimental path in unified_cache_refill to test ss_tls_bind_one for C7 class.
- Guarded by HAKMEM_WARM_TLS_BIND_C7 env var and debug build.
- Updated Page Box comments to clarify future TLS Bind Box integration.
- Created core/box/ss_tls_bind_box.h containing ss_tls_bind_one().
- Refactored superslab_refill() to use the new box.
- Updated signatures to avoid circular dependencies (tiny_self_u32).
- Added future integration points for Warm Pool and Page Box.
- Implement tiny_page_box.c/h: per-thread page cache between UC and Shared Pool
- Integrate Page Box into Unified Cache refill path
- Remove legacy SuperSlab implementation (merged into smallmid)
- Add HAKMEM_TINY_PAGE_BOX_CLASSES env var for selective class enabling
- Update bench_random_mixed.c with Page Box statistics
Current status: Implementation safe, no regressions.
Page Box ON/OFF shows minimal difference - pool strategy needs tuning.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Key findings from 2025-12-05 session:
1. HAKMEM vs mimalloc: 27x slower (4.5M vs 122M ops/s)
2. Root cause investigation: madvise 1081 calls vs mimalloc 0 calls
3. madvise disable test: -15% performance (worse, not better!)
4. Conclusion: MADV_POPULATE_WRITE is actually helping, not hurting
5. ChatGPT was right: time to move to user-space optimization phase
Reports added:
- WORKLOAD_COMPARISON_20251205.md
- PARTIAL_RELEASE_INVESTIGATION_REPORT_20251205.md
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>