d9991f39ff
Phase ALLOC-TINY-FAST-DUALHOT-1 & Optimization Roadmap Update
...
Add comprehensive design docs and research boxes:
- docs/analysis/ALLOC_TINY_FAST_DUALHOT_1_DESIGN.md: ALLOC DUALHOT investigation
- docs/analysis/FREE_TINY_FAST_DUALHOT_1_DESIGN.md: FREE DUALHOT final specs
- docs/analysis/FREE_TINY_FAST_HOTCOLD_OPT_1_DESIGN.md: Hot/Cold split research
- docs/analysis/POOL_MID_INUSE_DEFERRED_DN_BATCH_DESIGN.md: Deferred batching design
- docs/analysis/POOL_MID_INUSE_DEFERRED_REGRESSION_ANALYSIS.md: Stats overhead findings
- docs/analysis/MID_DESC_CACHE_BENCHMARK_2025-12-12.md: Cache measurement results
- docs/analysis/LAST_MATCH_CACHE_IMPLEMENTATION.md: TLS cache investigation
Research boxes (SS page table):
- core/box/ss_pt_env_box.h: HAKMEM_SS_LOOKUP_KIND gate
- core/box/ss_pt_types_box.h: 2-level page table structures
- core/box/ss_pt_lookup_box.h: ss_pt_lookup() implementation
- core/box/ss_pt_register_box.h: Page table registration
- core/box/ss_pt_impl.c: Global definitions
Updates:
- docs/specs/ENV_VARS_COMPLETE.md: HOTCOLD, DEFERRED, SS_LOOKUP env vars
- core/box/hak_free_api.inc.h: FREE-DISPATCH-SSOT integration
- core/box/pool_mid_inuse_deferred_box.h: Deferred API updates
- core/box/pool_mid_inuse_deferred_stats_box.h: Stats collection
- core/hakmem_super_registry: SS page table integration
Current Status:
- FREE-TINY-FAST-DUALHOT-1: +13% improvement, ready for adoption
- ALLOC-TINY-FAST-DUALHOT-1: -2% regression, frozen as research box
- Next: Optimization roadmap per ROI (mimalloc gap 2.5x)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-13 05:35:46 +09:00
fda6cd2e67
Boxify superslab registry, add bench profile, and document C7 hotpath experiments
2025-12-07 03:12:27 +09:00
0546454168
WIP: Add TLS SLL validation and SuperSlab registry fallback
...
ChatGPT's diagnostic changes to address TLS_SLL_HDR_RESET issue.
Current status: Partial mitigation, but root cause remains.
Changes Applied:
1. SuperSlab Registry Fallback (hakmem_super_registry.h)
- Added legacy table probe when hash map lookup misses
- Prevents NULL returns for valid SuperSlabs during initialization
- Status: ✅ Works but may hide underlying registration issues
2. TLS SLL Push Validation (tls_sll_box.h)
- Reject push if SuperSlab lookup returns NULL
- Reject push if class_idx mismatch detected
- Added [TLS_SLL_PUSH_NO_SS] diagnostic message
- Status: ✅ Prevents list corruption (defensive)
3. SuperSlab Allocation Class Fix (superslab_allocate.c)
- Pass actual class_idx to sp_internal_allocate_superslab
- Prevents dummy class=8 causing OOB access
- Status: ✅ Root cause fix for allocation path
4. Debug Output Additions
- First 256 push/pop operations traced
- First 4 mismatches logged with details
- SuperSlab registration state logged
- Status: ✅ Diagnostic tool (not a fix)
5. TLS Hint Box Removed
- Deleted ss_tls_hint_box.{c,h} (Phase 1 optimization)
- Simplified to focus on stability first
- Status: ⏳ Can be re-added after root cause fixed
Current Problem (REMAINS UNSOLVED):
- [TLS_SLL_HDR_RESET] still occurs after ~60 seconds of sh8bench
- Pointer is 16 bytes offset from expected (class 1 → class 2 boundary)
- hak_super_lookup returns NULL for that pointer
- Suggests: Use-After-Free, Double-Free, or pointer arithmetic error
Root Cause Analysis:
- Pattern: Pointer offset by +16 (one class 1 stride)
- Timing: Cumulative problem (appears after 60s, not immediately)
- Location: Header corruption detected during TLS SLL pop
Remaining Issues:
⚠️ Registry fallback is defensive (may hide registration bugs)
⚠️ Push validation prevents symptoms but not root cause
⚠️ 16-byte pointer offset source unidentified
Next Steps for Investigation:
1. Full pointer arithmetic audit (Magazine ⇔ TLS SLL paths)
2. Enhanced logging at HDR_RESET point:
- Expected vs actual pointer value
- Pointer provenance (where it came from)
- Allocation trace for that block
3. Verify Headerless flag is OFF throughout build
4. Check for double-offset application in conversions
Technical Assessment:
- 60% root cause fixes (allocation class, validation)
- 40% defensive mitigation (registry fallback, push rejection)
Performance Impact:
- Registry fallback: +10-30 cycles on cold path (negligible)
- Push validation: +5-10 cycles per push (acceptable)
- Overall: < 2% performance impact estimated
Related Issues:
- Phase 1 TLS Hint Box removed temporarily
- Phase 2 Headerless blocked until stability achieved
🤖 Generated with Claude Code (https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-12-03 20:42:28 +09:00
3a040a545a
Refactor: Split monolithic hakmem_shared_pool.c into acquire/release modules
...
- Split core/hakmem_shared_pool.c into acquire/release modules for maintainability.
- Introduced core/hakmem_shared_pool_internal.h for shared internal API.
- Fixed incorrect function name usage (superslab_alloc -> superslab_allocate).
- Increased SUPER_REG_SIZE to 1M to support large working sets (Phase 9-2 fix).
- Updated Makefile.
- Verified with benchmarks.
2025-11-30 18:11:08 +09:00
87b7d30998
Phase 9: SuperSlab optimization & EMPTY slab recycling (WIP)
...
Phase 9-1: O(1) SuperSlab lookup optimization
- Created ss_addr_map_box: Hash table (8192 buckets) for O(1) SuperSlab lookup
- Created ss_tls_hint_box: TLS caching layer for SuperSlab hints
- Integrated hash table into registry (init, insert, remove, lookup)
- Modified hak_super_lookup() to use new hash table
- Expected: 50-80 cycles → 10-20 cycles (not verified - SuperSlab disabled by default)
Phase 9-2: EMPTY slab recycling implementation
- Created slab_recycling_box: SLAB_TRY_RECYCLE() macro following Box pattern
- Integrated into remote drain (superslab_slab.c)
- Integrated into TLS SLL drain (tls_sll_drain_box.h) with touched slab tracking
- Observable: Debug tracing via HAKMEM_SLAB_RECYCLE_TRACE
- Updated Makefile: Added new box objects to 3 build targets
Known Issues:
- SuperSlab registry exhaustion still occurs (unregistration not working)
- shared_pool_release_slab() may not be removing from g_super_reg[]
- Needs investigation before Phase 9-2 can be completed
Expected Impact (when fixed):
- Stage 1 hit rate: 0% → 80%
- shared_fail events: 4 → 0
- Kernel overhead: 55% → 15%
- Throughput: 16.5M → 25-30M ops/s (+50-80%)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-30 07:16:50 +09:00
f8b0f38f78
ENV Cleanup Step 8: Gate HAKMEM_SUPER_LOOKUP_DEBUG in header
...
Gate HAKMEM_SUPER_LOOKUP_DEBUG environment variable behind
#if !HAKMEM_BUILD_RELEASE in hakmem_super_registry.h inline function.
Changes:
- Wrap s_dbg initialization in conditional compilation
- Release builds use constant s_dbg = 0 for complete elimination
- Debug logging in hak_super_lookup() now fully compiled out in release
Performance: 30.3M ops/s Larson (stable, no regression)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-28 01:45:45 +09:00
6199e9ba01
Phase 15 Box Separation: Fix wrapper domain check to prevent BenchMeta→CoreAlloc violation
...
Fix free() wrapper unconditionally routing ALL pointers to hak_free_at(),
causing Box boundary violations (BenchMeta slots[] entering CoreAlloc).
Solution: Add domain check in wrapper using 1-byte header inspection:
- Non-page-aligned: Check ptr-1 for HEADER_MAGIC (0xa0/0xb0)
- Hakmem Tiny → route to hak_free_at()
- External/BenchMeta → route to __libc_free()
- Page-aligned: Full classification (cannot safely check header)
Results:
- 99.29% BenchMeta properly freed via __libc_free() ✅
- 0.71% page-aligned fallthrough → ExternalGuard leak (acceptable)
- No crashes (100K/500K iterations stable)
- Performance: 15.3M ops/s (maintained)
Changes:
- core/box/hak_wrappers.inc.h: Domain check logic (lines 227-256)
- core/box/external_guard_box.h: Conservative leak prevention
- core/hakmem_super_registry.h: SUPER_MAX_PROBE 8→32
- PHASE15_WRAPPER_DOMAIN_CHECK_FIX.md: Comprehensive analysis
Root cause identified by user: LD_PRELOAD intercepts __libc_free(),
wrapper needs defense-in-depth to maintain Box boundaries.
2025-11-16 00:38:29 +09:00
2be754853f
Phase 11: SuperSlab Prewarm implementation (startup pre-allocation)
...
## Summary
Pre-allocate SuperSlabs at startup to eliminate runtime mmap overhead.
Result: +6.4% improvement (8.82M → 9.38M ops/s) but still 9x slower than System malloc.
## Key Findings (Lesson Learned)
- Syscall reduction strategy targeted WRONG bottleneck
- Real bottleneck: SuperSlab allocation churn (877 SuperSlabs needed)
- Prewarm reduces mmap frequency but doesn't solve fundamental architecture issue
## Implementation
- Two-phase allocation with atomic bypass flag
- Environment variable: HAKMEM_PREWARM_SUPERSLABS (default: 0)
- Best result: Prewarm=8 → 9.38M ops/s (+6.4%)
## Next Step
Pivot to Phase 12: Shared SuperSlab Pool (mimalloc-style)
- Expected: 877 → 100-200 SuperSlabs (-70-80%)
- This addresses ROOT CAUSE (allocation churn) not symptoms
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-13 14:45:43 +09:00
fb10d1710b
Phase 9: SuperSlab Lazy Deallocation + mincore removal
...
Goal: Eliminate syscall overhead (99.2% CPU) to approach System malloc performance
Implementation:
1. mincore removal (100% elimination)
- Deleted: hakmem_internal.h hak_is_memory_readable() syscall
- Deleted: tiny_free_fast_v2.inc.h safety checks
- Alternative: Internal metadata (Registry + Header magic validation)
- Result: 841 mincore calls → 0 calls ✅
2. SuperSlab Lazy Deallocation
- Added LRU Cache Manager (470 lines in hakmem_super_registry.c)
- Extended SuperSlab: last_used_ns, generation, lru_prev/next
- Deallocation policy: Count/Memory/TTL based eviction
- Environment variables:
* HAKMEM_SUPERSLAB_MAX_CACHED=256 (default)
* HAKMEM_SUPERSLAB_MAX_MEMORY_MB=512 (default)
* HAKMEM_SUPERSLAB_TTL_SEC=60 (default)
3. Integration
- superslab_allocate: Try LRU cache first before mmap
- superslab_free: Push to LRU cache instead of immediate munmap
- Lazy deallocation: Defer munmap until cache limits exceeded
Performance Results (100K iterations, 256B allocations):
Before (Phase 7-8):
- Performance: 2.76M ops/s
- Syscalls: 3,412 (mmap:1,250, munmap:1,321, mincore:841)
After (Phase 9):
- Performance: 9.71M ops/s (+251%) 🏆
- Syscalls: 1,729 (mmap:877, munmap:852, mincore:0) (-49%)
Key Achievements:
- ✅ mincore: 100% elimination (841 → 0)
- ✅ mmap: -30% reduction (1,250 → 877)
- ✅ munmap: -35% reduction (1,321 → 852)
- ✅ Total syscalls: -49% reduction (3,412 → 1,729)
- ✅ Performance: +251% improvement (2.76M → 9.71M ops/s)
System malloc comparison:
- HAKMEM: 9.71M ops/s
- System malloc: 90.04M ops/s
- Achievement: 10.8% (target: 93%)
Next optimization:
- Further mmap/munmap reduction (1,729 vs System's 13 = 133x gap)
- Pre-warm LRU cache
- Adaptive LRU sizing
- Per-class LRU cache
Production ready with recommended settings:
export HAKMEM_SUPERSLAB_MAX_CACHED=256
export HAKMEM_SUPERSLAB_MAX_MEMORY_MB=512
./bench_random_mixed_hakmem
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-13 14:05:39 +09:00
382980d450
Phase 6-2.4: Fix SuperSlab free SEGV: remove guess loop and add memory readability check; add registry atomic consistency (base as _Atomic uintptr_t with acq/rel); add debug toggles (SUPER_REG_DEBUG/REQTRACE); update CURRENT_TASK with results and next steps; capture suite results.
2025-11-07 18:07:48 +09:00
4978340c02
Tiny/SuperSlab: implement per-class registry optimization for fast refill scan
...
Replace 262K linear registry scan with per-class indexed registry:
- Add g_super_reg_by_class[TINY_NUM_CLASSES][16384] for O(class_size) scan
- Update hak_super_register/unregister to maintain both hash table + per-class index
- Optimize refill scan in hakmem_tiny_free.inc (262K → ~10-100 entries per class)
- Optimize mmap gate scan in tiny_mmap_gate.h (same optimization)
Performance impact (Larson benchmark):
- threads=1: 2.59M → 2.61M ops/s (+0.8%)
- threads=4: 3.62M → 4.19M ops/s (+15.7%) 🎉
Root cause analysis via perf:
- superslab_refill consumed 28.51% CPU time (97.65% in loop instructions)
- 262,144-entry linear scan with 2 atomic loads per iteration
- Per-class registry reduces scan target by 98.4% (262K → 16K per class)
Registry capacity:
- SUPER_REG_PER_CLASS = 16384 (increased from 4096 to avoid exhaustion)
- Total: 8 classes × 16384 = 128K entries (vs 262K unified registry)
Design:
- Dual registry: Hash table (address lookup) + Per-class index (refill scan)
- O(1) registration/unregistration with swap-with-last removal
- Lock-free reads, mutex-protected writes (same as before)
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-05 17:02:31 +09:00
52386401b3
Debug Counters Implementation - Clean History
...
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation
Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files
Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)
This is a clean repository without large log files.
🤖 Generated with [Claude Code](https://claude.com/claude-code )
Co-Authored-By: Claude <noreply@anthropic.com >
2025-11-05 12:31:14 +09:00