Commit Graph

6 Commits

Author SHA1 Message Date
1a8652a91a Phase TLS-UNIFY-3: C6 intrusive freelist implementation (完成)
Implement C6 ULTRA intrusive LIFO freelist with ENV gating:
- Single-linked LIFO using next pointer at USER+1 offset
- tiny_next_store/tiny_next_load for pointer access (single source of truth)
- Segment learning via ss_fast_lookup (per-class seg_base/seg_end)
- ENV gate: HAKMEM_TINY_C6_ULTRA_INTRUSIVE_FL (default OFF)
- Counters: c6_ifl_push/pop/fallback in FREE_PATH_STATS

Files:
- core/box/tiny_ultra_tls_box.h: Added c6_head field for intrusive LIFO
- core/box/tiny_ultra_tls_box.c: Pop/push with intrusive branching (case 6)
- core/box/tiny_c6_ultra_intrusive_env_box.h: ENV gate (new)
- core/box/tiny_c6_intrusive_freelist_box.h: L1 pure LIFO (new)
- core/tiny_debug_ring.h: C6_IFL events
- core/box/free_path_stats_box.h/c: c6_ifl_* counters

A/B Test Results (1M iterations, ws=200, 257-512B):
- ENV_OFF (array): 56.6 Mop/s avg
- ENV_ON (intrusive): 57.6 Mop/s avg (+1.8%, within noise)
- Counters verified: c6_ifl_push=265890, c6_ifl_pop=265815, fallback=0

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-12 16:26:42 +09:00
fb88725a43 Phase FREE-LEGACY-OPT-6: C4 ULTRA Implementation
Implement C4 ULTRA free TLS cache with parasitic free+alloc pattern,
achieving 99.7-99.9% elimination of C4 legacy fallback calls.

Key Features:
- TLS cache cap=64 (tuned for L1 cache fit, smaller than C5/C6's 128)
- Segment learning via ss_fast_lookup() on first free
- Free-side cache push + alloc-side TLS pop pattern
- ENV gate: HAKMEM_TINY_C4_ULTRA_FREE_ENABLED (default OFF)
- Full FREE_PATH_STATS instrumentation

Benchmark Results:
C4-heavy (65-128B range):
  - C4 legacy: 591,583 → 1,711 (-99.7%)
  - c4_ultra cache hits: ~599k (free) + ~599k (alloc)
  - Mixed load: 340,732 → 284 C4 legacy (-99.9%)

Legacy fallback reduction:
  - C4-heavy: 589,872 fewer legacy calls (-10.9% total)
  - Mixed: 340,448 fewer C4 legacy calls (-12.8% in mixed)

Performance note: ~2% throughput cost in isolated C4-heavy case,
acceptable tradeoff for 99%+ legacy elimination per class.

Files:
  NEW: core/box/tiny_c4_ultra_free_box.h/c
  NEW: core/box/tiny_c4_ultra_free_env_box.h
  MOD: core/box/tiny_ultra_classes_box.h (added C4 macros)
  MOD: core/box/free_path_stats_box.h/c (C4 ULTRA counters)
  MOD: core/front/malloc_tiny_fast.h (C4 alloc+free integration)
  MOD: Makefile (added C4 ULTRA object)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 19:38:27 +09:00
ea6ed1a6e4 Phase FREE-LEGACY-OPT-5-1/5-2: C5 ULTRA free+alloc integration
Summary:
========
Implemented C5 ULTRA TLS cache pattern following the successful C6 ULTRA design:
- Phase 5-1: Free-side TLS cache + segment learning
- Phase 5-2: Alloc-side TLS pop for complete free+alloc cycle integration

Targets C5 class (129-256B) as next legacy reduction after C6 completion.

Key Changes:
============

1. NEW FILES:
   - core/box/tiny_c5_ultra_free_box.h: C5 ULTRA TLS cache structure
   - core/box/tiny_c5_ultra_free_box.c: C5 free path implementation (same pattern as C6)
   - core/box/tiny_c5_ultra_free_env_box.h: ENV gating (HAKMEM_TINY_C5_ULTRA_FREE_ENABLED)

2. MODIFIED FILES:
   - core/front/malloc_tiny_fast.h:
     * Added C5 ULTRA includes
     * Added C5 alloc-side TLS pop at lines 186-194 (integrated with C6)
     * Added C5 free path at lines 333-337 (integrated with C6)

   - core/box/tiny_ultra_classes_box.h:
     * Added TINY_CLASS_C5 constant
     * Added tiny_class_is_c5() macro
     * Extended tiny_class_is_ultra() to include C5

   - core/box/free_path_stats_box.h:
     * Added c5_ultra_free_fast counter
     * Added c5_ultra_alloc_hit counter

   - core/box/free_path_stats_box.c:
     * Updated stats dump to output C5 counters

   - Makefile:
     * Added core/box/tiny_c5_ultra_free_box.o to all object lists

3. Design Rationale:
   - Exact copy of C6 ULTRA pattern (proven effective)
   - TLS cache capacity: 128 blocks (same as C6 for consistency)
   - Segment learning on first C5 free via ss_fast_lookup()
   - Alloc-side pop integrated directly in malloc_tiny_fast.h hotpath
   - Legacy fallback unification via tiny_legacy_fallback_free_base()

4. Expected Impact:
   - C5 legacy calls: 68,871 → 0 (100% elimination)
   - Total legacy reduction: ~53% of remaining 129,623
   - Mixed workload: Minimal regression (C5 is smaller class, fewer allocations)

5. Stats Collection:
   Run with: HAKMEM_TINY_C5_ULTRA_FREE_ENABLED=1 HAKMEM_FREE_PATH_STATS=1 ./bench_allocators_hakmem

   Expected output:
   [FREE_PATH_STATS] ... c5_ultra_free=68871 c5_ultra_alloc=68871 ... legacy_fb=60752 ...
   [FREE_PATH_STATS_LEGACY_BY_CLASS] ... c5=0 ...

Status:
=======
- Code:  COMPLETE (3 new files + 5 modified files)
- Compilation:  Verified (no errors, only unused variable warnings unrelated to C5)
- Functionality: Ready to benchmark (ENV gating: default OFF, opt-in via ENV)

Phase Progression:
==================
 Phase 4-4: C6 ULTRA free+alloc (legacy C6: 137,319 → 0)
 Phase 5-1/5-2: C5 ULTRA free+alloc (legacy C5: 68,871 → 0 expected)
 Phase 4.5: C4 ULTRA (34,727 remaining)
📋 Future: C3/C2 ULTRA if beneficial

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-11 19:26:51 +09:00
9830eff6cc Phase FREE-LEGACY-OPT-4-4: C6 ULTRA free+alloc integration
Parasitic TLS cache: alloc now pops from the TLS freelist filled by free.

Implementation:
- malloc_tiny_fast(): C6 class-specific TLS pop check before route switch
  - if (class_idx == 6 && tiny_c6_ultra_free_enabled())
  - pop from TinyC6UltraFreeTLS.freelist[--count]
  - return USER pointer (BASE + 1)

- FreePathStats: Added c6_ultra_alloc_hit counter for observability

Results (Mixed 16-1024B):
- OFF: 40.2M ops/s baseline
- ON:  42.2M ops/s (+4.9%) stable

Per-profile:
- Mixed:     +4.9% (40.2M → 42.2M)
- C6-heavy:  +7.6% (40.7M → 43.8M)

Free-alloc loop:
- free: TLS push (all C6 frees)
- alloc: TLS pop (all C6 allocs in steady state)
- Cache never fills, no legacy overflow
- C6 legacy_by_class reduced from 137K to 0 (100% elimination)

Key insight:
- Free-only TLS cache fails without alloc integration
- Once integrated, creates perfect load-balancing loop
- Alloc drains exactly what free fills

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 18:47:21 +09:00
1b196b3ac0 Phase FREE-LEGACY-OPT-4-2/4-3: C6 ULTRA-free TLS cache + segment learning
Phase 4-2:
- Add TinyC6UltraFreeTLS structure with 128-slot TLS freelist
- Implement tiny_c6_ultra_free_fast/slow for C6 free hot path
- Add c6_ultra_free_fast counter to FreePathStats
- ENV gate: HAKMEM_TINY_C6_ULTRA_FREE_ENABLED (default: OFF)

Phase 4-3:
- Add segment learning on first C6 free via ss_fast_lookup()
- Learn seg_base/seg_end from SuperSlab for range check
- Increase cache capacity from 32 to 128 blocks

Results:
- Segment learning works: fast path captures blocks in segment
- However, without alloc integration, cache fills up and overflows to legacy
- Net effect: +1-3% (within noise range)
- Drain strategy also tested: no benefit (equal overhead)

Conclusion:
- Free-only TLS cache is limited without alloc-side integration
- Core v6 already has alloc/free integrated TLS (but -12% slower)
- Keep as research box (ENV default OFF)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2025-12-11 18:34:27 +09:00
210633117a Phase FREE-LEGACY-OPT-4-1: Legacy per-class breakdown analysis
## 目的
Legacy fallback 49.2% の内訳を per-class で分析し、最も Legacy を使用しているクラスを特定。

## 実装内容

1. FreePathStats 構造体の拡張
   - legacy_by_class[8] フィールドを追加(C0-C7 の Legacy fallback 内訳)

2. デストラクタ出力の更新
   - [FREE_PATH_STATS_LEGACY_BY_CLASS] 行を追加し、C0-C7 の内訳を出力

3. カウンタの散布
   - free_tiny_fast() の Legacy fallback 経路で legacy_by_class[class_idx] をインクリメント
   - class_idx の範囲チェック(0-7)を実施

## 測定結果(Mixed 16-1024B)

**測定安定性**: 完全に安定(3 回とも同一の値、決定的測定)

Legacy per-class 内訳:
- C0: 0 (0.0%)
- C1: 0 (0.0%)
- C2: 8,746 (3.3% of legacy)
- C3: 17,279 (6.5% of legacy)
- C4: 34,727 (13.0% of legacy)
- C5: 68,871 (25.8% of legacy)
- C6: 137,319 (51.4% of legacy) ← 最大シェア
- C7: 0 (0.0%)

合計: 266,942 (49.2% of total free calls)

## 分析結果

**最大シェアクラス**: C6 (513-1024B) が Legacy の 51.4% を占める

**理由**:
- Mixed 16-1024B では C6 サイズのアロケーションが多い
- C7 ULTRA は C7 専用で C6 は未対応
- v3/v4 も C6 をカバーしていない
- Route 設定で C6 は Legacy に直接落ちている

## 次のアクション

Phase FREE-LEGACY-OPT-4-2 で C6 クラスに ULTRA-Free lane を実装:
- Legacy fallback を 51% 削減(C6 分)
- Legacy: 49.2% → 24-27% に改善(半減)
- Mixed 16-1024B: 44.8M → 47-48M ops/s 程度(+5-8% 改善)

## 変更ファイル

- core/box/free_path_stats_box.h: FreePathStats 構造体に legacy_by_class[8] 追加
- core/box/free_path_stats_box.c: デストラクタに per-class 出力追加
- core/front/malloc_tiny_fast.h: Legacy fallback 経路に per-class カウンタ追加
- docs/analysis/FREE_LEGACY_PATH_ANALYSIS.md: Phase 4-1 分析結果を記録

Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-11 18:04:14 +09:00