Commit Graph

5 Commits

Author SHA1 Message Date
030132f911 Phase 10: TLS/SFC aggressive cache tuning (syscall reduction failed)
Goal: Reduce backend transitions by increasing frontend hit rate
Result: +2% best case, syscalls unchanged (root cause: SuperSlab churn)

Implementation:

1. Cache capacity expansion (2-8x per-class)
   - Hot classes (C0-C3): 4x increase (512 slots)
   - Medium classes (C4-C6): 2-3x increase
   - Class 7 (1KB): 2x increase (128 slots)
   - Fast cache: 2x default capacity

2. Refill batch size increase (4-8x)
   - Global default: 16 → 64 (4x)
   - Hot classes: 128 (8x) via HAKMEM_TINY_REFILL_COUNT_HOT
   - Mid classes: 96 (6x) via HAKMEM_TINY_REFILL_COUNT_MID
   - Class 7: 64 → 128 (2x)
   - SFC refill: 64 → 128 (2x)
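
The per-class refill counts above could be resolved roughly as follows. This is a minimal sketch, not the actual HAKMEM code: `parse_count_or` and `refill_count_for_class` are hypothetical helpers, the upper bound of 4096 is an assumed sanity limit, and the hot/mid grouping (C0-C3, C4-C6) is borrowed from the capacity list in item 1.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: parse a positive decimal count from `s`.
 * Returns `fallback` when `s` is NULL, empty, malformed, or out of range. */
int parse_count_or(const char *s, int fallback, int max_value) {
    assert(fallback > 0 && fallback <= max_value);
    if (s == NULL || *s == '\0') return fallback;
    errno = 0;
    char *end = NULL;
    long v = strtol(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0' || v <= 0 || v > max_value)
        return fallback;
    return (int)v;
}

/* Assumed grouping: C0-C3 hot, C4-C6 mid, C7 separate (from the commit text).
 * Defaults are the Phase 10 values listed above; 4096 is an assumed cap. */
int refill_count_for_class(int cls) {
    if (cls >= 0 && cls <= 3)
        return parse_count_or(getenv("HAKMEM_TINY_REFILL_COUNT_HOT"), 128, 4096);
    if (cls >= 4 && cls <= 6)
        return parse_count_or(getenv("HAKMEM_TINY_REFILL_COUNT_MID"), 96, 4096);
    if (cls == 7)
        return 128;
    return 64; /* global default */
}
```

Falling back to the default on malformed input keeps a mistyped export from silently disabling refills.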

3. Adaptive sizing aggressive parameters
   - Grow threshold: 80% → 70% (expand earlier)
   - Shrink threshold: 20% → 10% (shrink less)
   - Growth rate: 2x → 1.5x (smoother growth)
   - Max capacity: 2048 → 4096 (2x ceiling)
   - Adapt frequency: Every 10 → 5 refills (more responsive)
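
One way to read the adaptive parameters above is sketched below. The commit does not say what the thresholds are measured against, so this sketch assumes utilization (peak used slots divided by capacity) over each adaptation window. The type `tiny_cache_t`, the function name, the halving shrink factor, and the 16-slot floor are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t   capacity;   /* current slot capacity */
    size_t   peak_used;  /* high-water mark since the last adaptation */
    unsigned refills;    /* refills since the last adaptation */
} tiny_cache_t;

enum {
    GROW_PCT     = 70,    /* grow at >= 70% utilization (was 80%) */
    SHRINK_PCT   = 10,    /* shrink below 10% utilization (was 20%) */
    ADAPT_EVERY  = 5,     /* re-evaluate every 5 refills (was 10) */
    MAX_CAPACITY = 4096,  /* ceiling (was 2048) */
    MIN_CAPACITY = 16     /* assumed floor, not stated in the commit */
};

/* Call once per refill. Returns the (possibly updated) capacity. */
size_t tiny_cache_on_refill(tiny_cache_t *c) {
    assert(c != NULL);
    if (++c->refills < ADAPT_EVERY) return c->capacity;

    size_t util_pct = c->capacity ? (c->peak_used * 100) / c->capacity : 0;
    if (util_pct >= GROW_PCT) {
        size_t grown = c->capacity + c->capacity / 2;   /* 1.5x growth */
        c->capacity = grown > MAX_CAPACITY ? MAX_CAPACITY : grown;
    } else if (util_pct < SHRINK_PCT) {
        size_t shrunk = c->capacity / 2;                /* assumed shrink factor */
        c->capacity = shrunk < MIN_CAPACITY ? MIN_CAPACITY : shrunk;
    }
    c->refills = 0;      /* start a new adaptation window */
    c->peak_used = 0;
    return c->capacity;
}
```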

Performance Results (100K iterations):

Before (Phase 9):
- Performance: 9.71M ops/s
- Syscalls: 1,729 (mmap:877, munmap:852)

After (Phase 10):
- Default settings: 8.77M ops/s (-9.7%) ⚠️
- Optimal ENV: 9.89M ops/s (+2%) 
- Syscalls: 1,729 (unchanged) 
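
The percentages above follow from a plain relative-change calculation against the Phase 9 baseline. The helper name `percent_change` is illustrative only.

```c
#include <assert.h>

/* Relative change of `after` versus `before`, in percent. */
double percent_change(double before, double after) {
    assert(before != 0.0);
    return (after - before) / before * 100.0;
}
```

With the figures above, `percent_change(9.71, 8.77)` is about -9.7, and `percent_change(9.71, 9.89)` is about +1.9, which the commit rounds to +2%.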

Optimal ENV configuration:
export HAKMEM_TINY_REFILL_COUNT_HOT=256
export HAKMEM_TINY_REFILL_COUNT_MID=192

Root Cause Analysis:

The bottleneck is NOT the TLS/SFC hit rate but SuperSlab allocation churn:
- 877 SuperSlabs allocated (877MB via mmap)
- Phase 9 LRU cache not utilized (no frees during benchmark)
- All SuperSlabs retained until program exit
- System malloc: 9 syscalls vs HAKMEM: 1,729 syscalls (192x gap)

Conclusion:

TLS/SFC tuning cannot solve the SuperSlab allocation policy problem.
Next step: Phase 11 SuperSlab Prewarm strategy to eliminate
mmap/munmap during benchmark execution.
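
A prewarm strategy of this kind might look like the sketch below. It is only an illustration under stated assumptions, not the planned Phase 11 design: the 1 MiB SuperSlab size is inferred from "877 SuperSlabs allocated (877MB via mmap)", and `ss_pool_t`, `ss_prewarm`, `ss_acquire`, `ss_release`, and the pool bound are hypothetical names and values.

```c
#define _DEFAULT_SOURCE            /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define SS_SIZE     ((size_t)1 << 20)  /* assumed 1 MiB per SuperSlab */
#define SS_POOL_MAX 1024               /* hypothetical pool bound */

typedef struct {
    void  *slabs[SS_POOL_MAX];
    size_t count;
} ss_pool_t;

/* Map up to `n` SuperSlab-sized regions up front. Returns how many were mapped. */
size_t ss_prewarm(ss_pool_t *pool, size_t n) {
    assert(pool != NULL);
    if (n > SS_POOL_MAX) n = SS_POOL_MAX;
    pool->count = 0;
    for (size_t i = 0; i < n; i++) {
        void *p = mmap(NULL, SS_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) break;
        pool->slabs[pool->count++] = p;
    }
    return pool->count;
}

/* Hot path: pop a prewarmed region without a syscall.
 * NULL means the pool is empty and the caller must fall back to mmap. */
void *ss_acquire(ss_pool_t *pool) {
    return pool->count ? pool->slabs[--pool->count] : NULL;
}

/* Return a region to the pool instead of unmapping it; unmap only when full. */
void ss_release(ss_pool_t *pool, void *p) {
    if (pool->count < SS_POOL_MAX) pool->slabs[pool->count++] = p;
    else munmap(p, SS_SIZE);
}
```

The point of the design is that all mapping cost is paid before the measured region, so `ss_acquire` and `ss_release` issue no syscalls as long as the pool is neither empty nor full.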

ChatGPT review: Strategy validated, Option A (Prewarm) recommended.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-13 14:25:54 +09:00
862e8ea7db Infrastructure and build updates
- Update build configuration and flags
- Add missing header files and dependencies
- Update TLS list implementation with proper scoping
- Fix various compilation warnings and issues
- Update debug ring and tiny allocation infrastructure
- Update benchmark results documentation

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-11-11 21:49:05 +09:00
94e7d54a17 Tiny P0/FC tuning: per-class FastCache caps honored; defaults C5=96, C7=48
- Raise direct-FC drain threshold default to 64.
- Default class7 direct-FC OFF for stability.
- 256B fixed-size shows branch-miss drop (~11%→~8.9%) and ~4.5M ops/s on Ryzen 7 5825U.
- Note: 1KB fixed-size currently SEGVs even with direct-FC OFF, pointing to non-direct P0 path; propose gating P0 for C7 and triage next (adopt-before-map recheck, bounds asserts).
- Update CURRENT_TASK.md with changes and results path.
2025-11-10 00:25:02 +09:00
47797a3ba0 Tiny: enable class7 (1024B) fast_cap by default (64); add 1T A/B switch for Remote Side (HAKMEM_TINY_ASSUME_1T)
Changes
- core/hakmem_tiny_config.c: set g_fast_cap_defaults[7]=64 (was 0) to reduce SuperSlab path frequency for 1024B.
- core/tiny_remote.c: add env HAKMEM_TINY_ASSUME_1T=1 to disable Remote Side table in single‑thread runs (A/B friendly).
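
The environment switch described above could be resolved along these lines. This is a sketch, not the code in core/tiny_remote.c: the function names are hypothetical, and both the default (Remote Side on) and the precedence (an explicit HAKMEM_TINY_REMOTE_SIDE value wins over HAKMEM_TINY_ASSUME_1T) are assumptions.

```c
#include <stdlib.h>
#include <string.h>

/* Decide from the two raw env values whether the Remote Side table is used.
 * Returns 1 for enabled, 0 for disabled. */
int remote_side_enabled_from(const char *assume_1t, const char *remote_side) {
    /* Assumed precedence: an explicit HAKMEM_TINY_REMOTE_SIDE setting wins. */
    if (remote_side != NULL && *remote_side != '\0')
        return strcmp(remote_side, "0") != 0;
    /* HAKMEM_TINY_ASSUME_1T=1 disables the table for single-thread runs. */
    if (assume_1t != NULL && strcmp(assume_1t, "1") == 0)
        return 0;
    return 1; /* assumed default: Remote Side on */
}

/* Resolve the environment once and cache the answer.
 * The cache is not thread-safe in this sketch. */
int remote_side_enabled(void) {
    static int cached = -1;
    if (cached < 0)
        cached = remote_side_enabled_from(getenv("HAKMEM_TINY_ASSUME_1T"),
                                          getenv("HAKMEM_TINY_REMOTE_SIDE"));
    return cached;
}
```

Caching the decision keeps `getenv` off the allocation fast path, which matters when the switch exists for A/B timing.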

A/B (1T, cpu2 pinned, 500k iters)
- 256B: cycles ↓ ~119.7M → ~60.0M, time 95.4ms → 83.2ms (~12% faster), IPC ~0.92→0.88, branch‑miss ~11%.
- 1024B: cycles ↓ ~74.4M → ~27.3M, time 83.3ms → 73.5ms (~12% faster), IPC ~0.82→0.75, branch‑miss ~11%.

Notes
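- The branch‑miss rate is still high. Next: tighten it by cleaning up the branches at the adopt boundary, adding an ultra-simple refill (class7 special case), and readjusting fast path priority.
- A/B: export HAKMEM_TINY_ASSUME_1T=1 turns the Remote Side OFF in 1T runs. HAKMEM_TINY_REMOTE_SIDE can also be used for explicit control.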
2025-11-09 17:00:37 +09:00
52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation
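
Free Pipeline counters like the ones named above are often implemented as relaxed atomic increments that compile out in release builds. The sketch below shows that pattern only: the counter names come from the commit, while `HAK_DEBUG_COUNTERS`, `FREE_COUNT`, `free_counters_t`, and `free_counters_dump` are hypothetical.

```c
#include <stdatomic.h>
#include <stdio.h>

#ifndef HAK_DEBUG_COUNTERS          /* hypothetical build flag */
#define HAK_DEBUG_COUNTERS 1
#endif

typedef struct {
    atomic_ulong ss_local;
    atomic_ulong ss_remote;
    atomic_ulong tls_sll;
} free_counters_t;

free_counters_t g_free_counters;    /* static storage: starts at zero */

#if HAK_DEBUG_COUNTERS
/* Relaxed ordering is enough: the counters are statistics, not synchronization. */
#define FREE_COUNT(field) \
    ((void)atomic_fetch_add_explicit(&g_free_counters.field, 1UL, \
                                     memory_order_relaxed))
#else
#define FREE_COUNT(field) ((void)0)
#endif

/* Print all counters on one line, e.g. at exit or from a benchmark runner. */
void free_counters_dump(FILE *out) {
    fprintf(out, "ss_local=%lu ss_remote=%lu tls_sll=%lu\n",
            atomic_load(&g_free_counters.ss_local),
            atomic_load(&g_free_counters.ss_remote),
            atomic_load(&g_free_counters.tls_sll));
}
```

A free path would then call `FREE_COUNT(ss_local);` at the point where it takes that branch.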

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00