|
|
5cc1f93622
|
Phase 13-A Step 1: TinyHeapV2 NO-REFILL L0 cache implementation
Implement TinyHeapV2 as a minimal "lucky hit" L0 cache that avoids
circular dependency with FastCache by eliminating self-refill.
Key Changes:
- New: core/front/tiny_heap_v2.h - NO-REFILL L0 cache implementation
- tiny_heap_v2_alloc(): Pop from magazine if available, else return NULL
- tiny_heap_v2_refill_mag(): No-op stub (no backend refill)
- ENV: HAKMEM_TINY_HEAP_V2=1 to enable
- ENV: HAKMEM_TINY_HEAP_V2_CLASS_MASK=bitmask (C0-C3 control)
- ENV: HAKMEM_TINY_HEAP_V2_STATS=1 to print statistics
- Modified: core/hakmem_tiny_alloc_new.inc - Add TinyHeapV2 hook
- Hook at entry point (after class_idx calculation)
- Fallback to existing front if TinyHeapV2 returns NULL
- Modified: core/hakmem_tiny_alloc.inc - Add hook for legacy path
- Modified: core/hakmem_tiny.c - Add TLS variables and stats wrapper
- TinyHeapV2Mag: Per-class magazine (capacity=16)
- TinyHeapV2Stats: Per-class counters (alloc_calls, mag_hits, etc.)
- tiny_heap_v2_print_stats(): Statistics output at exit
- New: TINY_HEAP_V2_TASK_SPEC.md - Phase 13 specification
Root Cause Fixed:
- BEFORE: TinyHeapV2 refilled from FastCache → circular dependency
- TinyHeapV2 intercepted all allocs → FastCache never populated
- Result: 100% backend OOM, 0% hit rate, 99% slowdown
- AFTER: TinyHeapV2 is passive L0 cache (no refill)
- Magazine empty → return NULL → existing front handles it
- Result: 0% overhead, stable baseline performance
A/B Test Results (100K iterations, fixed-size bench):
- C1 (8B): Baseline 9,688 ops/s → HeapV2 ON 9,762 ops/s (+0.76%)
- C2 (16B): Baseline 9,804 ops/s → HeapV2 ON 9,845 ops/s (+0.42%)
- C3 (32B): Baseline 9,840 ops/s → HeapV2 ON 9,814 ops/s (-0.26%)
- All within noise range: NO PERFORMANCE REGRESSION ✅
Statistics (HeapV2 ON, C1-C3):
- alloc_calls: 200K (hook works correctly)
- mag_hits: 0 (0%) - Magazine empty as expected
- refill_calls: 0 - No refill executed (circular dependency avoided)
- backend_oom: 0 - No backend access
Next Steps (Phase 13-A Step 2):
- Implement magazine supply strategy (from existing front or free path)
- Goal: Populate magazine with "leftover" blocks from existing pipeline
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-15 01:42:57 +09:00 |
|
|
|
897ce8873f
|
Phase B: Set refill=64 as default (A/B optimized)
A/B testing showed refill=64 provides best balanced performance:
- 128B: +15.5% improvement (8.27M → 9.55M ops/s)
- 256B: +7.2% improvement (7.90M → 8.47M ops/s)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-14 19:35:56 +09:00 |
|
|
|
3f738c0d6e
|
Phase B: TinyFrontC23Box - Ultra-simple front path for C2/C3
Implemented dedicated fast path for C2/C3 (128B/256B) to bypass
SFC/SLL/Magazine complexity and directly access FastCache + SuperSlab.
Changes:
- core/front/tiny_front_c23.h: New ultra-simple front path (NEW)
- Direct FC → SS refill (2 layers vs 5+ in generic path)
- ENV-gated: HAKMEM_TINY_FRONT_C23_SIMPLE=1
- Refill target: 64 blocks (optimized via A/B testing)
- core/tiny_alloc_fast.inc.h: Hook at entry point (+11 lines)
- Early return for C2/C3 when C23 path enabled
- Safe fallback to generic path on failure
Results (100K iterations, A/B tested refill=16/32/64/128):
- 128B: 8.27M → 9.55M ops/s (+15.5% with refill=64) ✅
- 256B: 7.90M → 8.61M ops/s (+9.0% with refill=32) ✅
- 256B: 7.90M → 8.47M ops/s (+7.2% with refill=64) ✅
Optimal Refill: 64 blocks
- Balanced performance across C2/C3
- 128B best case: +15.5%
- 256B good performance: +7.2%
- Simple single-value default
Architecture:
- Flow: FC pop → (miss) → ss_refill_fc_fill(64) → FC pop retry
- Bypassed layers: SLL, Magazine, SFC, MidTC
- Preserved: Box boundaries, safety checks, fallback paths
- Free path: Unchanged (TLS SLL + drain)
Box Theory Compliance:
- Clear Front ← Backend boundary (ss_refill_fc_fill)
- ENV-gated A/B testing (default OFF, opt-in)
- Safe fallback: NULL → generic path handles slow case
- Zero impact when disabled
Performance Gap Analysis:
- Current: 8-9M ops/s
- After Phase B: 9-10M ops/s (+10-15%)
- Target: 15-20M ops/s
- Remaining gap: ~2x (suggests deeper bottlenecks remain)
Next Steps:
- Perf profiling to identify next bottleneck
- Current hypotheses: classify_ptr, drain overhead, refill path
- Phase C candidates: FC-direct free, inline optimizations
ENV Usage:
# Enable C23 fast path (default: OFF)
export HAKMEM_TINY_FRONT_C23_SIMPLE=1
# Optional: Override refill target (default: 64)
export HAKMEM_TINY_FRONT_C23_REFILL=32
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-11-14 19:27:45 +09:00 |
|
|
|
ccf604778c
|
Front-Direct implementation: SS→FC direct refill + SLL complete bypass
## Summary
Implemented Front-Direct architecture with complete SLL bypass:
- Direct SuperSlab → FastCache refill (1-hop, bypasses SLL)
- SLL-free allocation/free paths when Front-Direct enabled
- Legacy path sealing (SLL inline opt-in, SFC cascade ENV-only)
## New Modules
- core/refill/ss_refill_fc.h (236 lines): Standard SS→FC refill entry point
- Remote drain → Freelist → Carve priority
- Header restoration for C1-C6 (NOT C0/C7)
- ENV: HAKMEM_TINY_P0_DRAIN_THRESH, HAKMEM_TINY_P0_NO_DRAIN
- core/front/fast_cache.h: FastCache (L1) type definition
- core/front/quick_slot.h: QuickSlot (L0) type definition
## Allocation Path (core/tiny_alloc_fast.inc.h)
- Added s_front_direct_alloc TLS flag (lazy ENV check)
- SLL pop guarded by: g_tls_sll_enable && !s_front_direct_alloc
- Refill dispatch:
- Front-Direct: ss_refill_fc_fill() → fastcache_pop() (1-hop)
- Legacy: sll_refill_batch_from_ss() → SLL → FC (2-hop, A/B only)
- SLL inline pop sealed (requires HAKMEM_TINY_INLINE_SLL=1 opt-in)
## Free Path (core/hakmem_tiny_free.inc, core/hakmem_tiny_fastcache.inc.h)
- FC priority: Try fastcache_push() first (same-thread free)
- tiny_fast_push() bypass: Returns 0 when s_front_direct_free || !g_tls_sll_enable
- Fallback: Magazine/slow path (safe, bypasses SLL)
## Legacy Sealing
- SFC cascade: Default OFF (ENV-only via HAKMEM_TINY_SFC_CASCADE=1)
- Deleted: core/hakmem_tiny_free.inc.bak, core/pool_refill_legacy.c.bak
- Documentation: ss_refill_fc_fill() promoted as CANONICAL refill entry
## ENV Controls
- HAKMEM_TINY_FRONT_DIRECT=1: Enable Front-Direct (SS→FC direct)
- HAKMEM_TINY_P0_DIRECT_FC_ALL=1: Same as above (alt name)
- HAKMEM_TINY_REFILL_BATCH=1: Enable batch refill (also enables Front-Direct)
- HAKMEM_TINY_SFC_CASCADE=1: Enable SFC cascade (default OFF)
- HAKMEM_TINY_INLINE_SLL=1: Enable inline SLL pop (default OFF, requires AGGRESSIVE_INLINE)
## Benchmarks (Front-Direct Enabled)
```bash
ENV: HAKMEM_BENCH_FAST_FRONT=1 HAKMEM_TINY_FRONT_DIRECT=1
HAKMEM_TINY_REFILL_BATCH=1 HAKMEM_TINY_P0_DIRECT_FC_ALL=1
HAKMEM_TINY_REFILL_COUNT_HOT=256 HAKMEM_TINY_REFILL_COUNT_MID=96
HAKMEM_TINY_BUMP_CHUNK=256
bench_random_mixed (16-1040B random, 200K iter):
256 slots: 1.44M ops/s (STABLE, 0 SEGV)
128 slots: 1.44M ops/s (STABLE, 0 SEGV)
bench_fixed_size (fixed size, 200K iter):
256B: 4.06M ops/s (has debug logs, expected >10M without logs)
128B: Similar (debug logs affect)
```
## Verification
- TRACE_RING test (10K iter): **0 SLL events** detected ✅
- Complete SLL bypass confirmed when Front-Direct=1
- Stable execution: 200K iterations × multiple sizes, 0 SEGV
## Next Steps
- Disable debug logs in hak_alloc_api.inc.h (call_num 14250-14280 range)
- Re-benchmark with clean Release build (target: 10-15M ops/s)
- 128/256B shortcut path optimization (FC hit rate improvement)
Co-Authored-By: ChatGPT <chatgpt@openai.com>
Suggested-By: ultrathink
|
2025-11-14 05:41:49 +09:00 |
|