Files
hakmem/docs/PHASE1_TLS_HINT_BENCHMARK.md
Moe Charm (CI) 94f9ea5104 Implement Phase 1: TLS SuperSlab Hint Box for Headerless performance
Design: Cache recently-used SuperSlab references in TLS to accelerate
ptr→SuperSlab resolution in Headerless mode free() path.

## Implementation

### New Box: core/box/tls_ss_hint_box.h
- Header-only Box (4-slot FIFO cache per thread)
- Functions: tls_ss_hint_init(), tls_ss_hint_update(), tls_ss_hint_lookup(), tls_ss_hint_clear()
- Memory overhead: 112 bytes per thread (negligible)
- Statistics API for debug builds (hit/miss counters)

### Integration Points

1. **Free path** (core/hakmem_tiny_free.inc):
   - Lines 477-481: Fast path hint lookup before hak_super_lookup()
   - Lines 550-555: Second lookup location (fallback path)
   - Expected savings: 10-50 cycles → 2-5 cycles on cache hit

2. **Allocation path** (core/tiny_superslab_alloc.inc.h):
   - Lines 115-122: Linear allocation return path
   - Lines 179-186: Freelist allocation return path
   - Cache update on successful allocation

3. **TLS variable** (core/hakmem_tiny_tls_state_box.inc):
   - `__thread TlsSsHintCache g_tls_ss_hint = {0};`

### Build System

- **Build flag** (core/hakmem_build_flags.h):
  - HAKMEM_TINY_SS_TLS_HINT (default: 0, disabled)
  - Validation: requires HAKMEM_TINY_HEADERLESS=1

- **Makefile**:
  - Removed old ss_tls_hint_box.o (conflicting implementation)
  - Header-only design eliminates compiled object files

### Testing

- **Unit tests** (tests/test_tls_ss_hint.c):
  - 6 test functions covering init, lookup, FIFO rotation, duplicates, clear, stats
  - All tests PASSING

- **Build validation**:
  -  Compiles with hint disabled (default)
  -  Compiles with hint enabled (HAKMEM_TINY_SS_TLS_HINT=1)

### Documentation

- **Benchmark report** (docs/PHASE1_TLS_HINT_BENCHMARK.md):
  - Implementation summary
  - Build validation results
  - Benchmark methodology (to be executed)
  - Performance analysis framework

## Expected Performance

- **Hit rate**: 85-95% (single-threaded), 70-85% (multi-threaded)
- **Cycle savings**: 80-95% on cache hit (10-50 cycles → 2-5 cycles)
- **Target improvement**: 15-20% throughput increase vs Headerless baseline
- **Memory overhead**: 112 bytes per thread

## Box Theory

**Mission**: Cache hot SuperSlabs to avoid global registry lookup

**Boundary**: ptr → SuperSlab* or NULL (miss)

**Invariant**: hint.base ≤ ptr < hint.end → hit is valid

**Fallback**: Always safe to miss (triggers hak_super_lookup)

**Thread Safety**: TLS storage, no synchronization required

**Risk**: Low (read-only cache, fail-safe fallback, magic validation)

## Next Steps

1. Run full benchmark suite (sh8bench, cfrac, larson)
2. Measure actual hit rate with stats enabled
3. If performance target met (15-20% improvement), enable by default
4. Consider increasing cache slots if hit rate < 80%

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 18:06:24 +09:00

213 lines
6.4 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Phase 1: TLS SuperSlab Hint Box - Benchmark Report
## Implementation Summary
**Date**: 2025-12-03
**Status**: Implementation Complete - Benchmarking Required
**Commit**: [Pending]
### What Was Implemented
1. **TLS SuperSlab Hint Box** (`/mnt/workdisk/public_share/hakmem/core/box/tls_ss_hint_box.h`)
- Header-only Box implementation
- 4-slot FIFO cache per thread (112 bytes TLS overhead)
- Inline functions: `tls_ss_hint_init()`, `tls_ss_hint_update()`, `tls_ss_hint_lookup()`, `tls_ss_hint_clear()`
- Statistics API for debug builds
2. **Build Flag** (`/mnt/workdisk/public_share/hakmem/core/hakmem_build_flags.h`)
- `HAKMEM_TINY_SS_TLS_HINT` (default: 0, disabled)
- Validation check: requires `HAKMEM_TINY_HEADERLESS=1`
3. **Integration Points**
- **Free path** (`core/hakmem_tiny_free.inc`): Lines 477-481, 550-555
- Fast path hint lookup before expensive `hak_super_lookup()`
- **Allocation path** (`core/tiny_superslab_alloc.inc.h`): Lines 115-122, 179-186
- Cache update on successful allocation (both linear and freelist modes)
4. **TLS Variable Definition** (`core/hakmem_tiny_tls_state_box.inc`)
- `__thread TlsSsHintCache g_tls_ss_hint = {0};`
5. **Unit Tests** (`tests/test_tls_ss_hint.c`)
- 6 test functions (init, basic lookup, FIFO rotation, duplicate detection, clear, stats)
- All tests PASSING
6. **Build System**
- Removed old conflicting `ss_tls_hint_box.c` (different implementation)
- Updated Makefile to remove compiled object files (header-only design)
---
## Environment
- **CPU**: [Run: lscpu | grep "Model name"]
- **OS**: Linux 6.8.0-87-generic
- **Compiler**: gcc (Ubuntu)
- **Build Date**: 2025-12-03
- **Hakmem Commit**: [Git log -1 --oneline]
---
## Build Validation
### Build 1: Hint Disabled (Baseline)
```bash
make clean
make shared -j8
```
**Result**: ✅ SUCCESS
### Build 2: Hint Enabled
```bash
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_SS_TLS_HINT=1 -DHAKMEM_TINY_HEADERLESS=1"
```
**Result**: ✅ SUCCESS
### Unit Tests
```bash
gcc -o tests/test_tls_ss_hint tests/test_tls_ss_hint.c -I./core \
-DHAKMEM_TINY_SS_TLS_HINT=1 -DHAKMEM_BUILD_RELEASE=0 -DHAKMEM_TINY_HEADERLESS=1
./tests/test_tls_ss_hint
```
**Result**: ✅ ALL 6 TESTS PASSED
---
## Benchmark Results (To Be Run)
### Methodology
Run each benchmark configuration 3 times and take the median:
```bash
# Configuration 1: Baseline (Headerless OFF, Hint OFF)
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
# Configuration 2: Headerless ON, Hint OFF
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
# Configuration 3: Headerless ON, Hint ON
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
```
### sh8bench (Memory Stress Test)
| Configuration | Time (sec) | Mops/s | Relative to Baseline | Improvement vs Headerless |
|---------------|-----------|---------|----------------------|---------------------------|
| Baseline (Headerless OFF, Hint OFF) | TBD | TBD | 100% | - |
| Headerless ON, Hint OFF | TBD | TBD | TBD | 0% |
| Headerless ON, Hint ON | TBD | TBD | TBD | **TBD** |
**Expected**: Headerless w/ Hint should recover 15-20% of Headerless performance loss
### cfrac (Factorization Test)
```bash
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 17545186520809
```
| Configuration | Status | Time (sec) | Notes |
|---------------|--------|-----------|-------|
| Baseline | TBD | TBD | - |
| Headerless ON, Hint OFF | TBD | TBD | - |
| Headerless ON, Hint ON | TBD | TBD | No regressions expected |
### larson (Multi-threaded Stress)
```bash
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8
```
| Configuration | Status | Ops/sec | Notes |
|---------------|--------|---------|-------|
| Baseline | TBD | TBD | - |
| Headerless ON, Hint OFF | TBD | TBD | - |
| Headerless ON, Hint ON | TBD | TBD | Multi-threaded hit rate: 70-85% |
---
## Performance Analysis
### Expected Hit Rate
Based on design analysis (Section 9 of TLS_SS_HINT_BOX_DESIGN.md):
- **Single-threaded**: 85-95%
- **Multi-threaded**: 70-85%
### Cycle Count Savings
| Operation | Without Hint | With Hint (Hit) | Savings |
|-----------|-------------|----------------|---------|
| ptr→SuperSlab lookup | 10-50 cycles | 2-5 cycles | **80-95%** |
### Memory Overhead
- Per-thread: 112 bytes (4 slots × 24 bytes + 16 bytes metadata)
- 1000 threads: 112 KB (negligible)
---
## Next Steps
1. **Run Benchmarks**: Execute benchmark suite on dedicated machine
2. **Measure Hit Rate**: Enable `HAKMEM_BUILD_RELEASE=0` and add stats dump at exit
3. **Performance Tuning**: If hit rate < 80%, consider increasing slots to 8
4. **Production Rollout**: If results meet target (15-20% improvement), enable by default
---
## Success Criteria
**Code Quality**
- [x] Header-only Box design (zero runtime overhead when disabled)
- [x] Follows Box Theory architecture
- [x] Comprehensive unit tests (6/6 passing)
- [x] Fail-safe fallback (miss hak_super_lookup)
**Build System**
- [x] Compiles with hint disabled (default)
- [x] Compiles with hint enabled
- [x] No regressions in existing tests
**Performance** (Benchmarking Required)
- [ ] sh8bench: +15-20% throughput vs Headerless baseline
- [ ] cfrac: No regressions
- [ ] larson: No regressions, +15-20% ideal case
---
## Risk Assessment
**Risk Level**: Low
- Thread-local storage (no cache coherency issues)
- Read-only cache (never modifies SuperSlab state)
- Magic number validation (catches stale entries)
- Fail-safe fallback (miss hak_super_lookup)
- Minimal integration surface (2 locations modified)
- Zero overhead when disabled (compile-time flag)
---
## Conclusion
**Implementation Status**: Complete
The TLS SuperSlab Hint Box has been successfully implemented as a header-only Box with clean integration into the free and allocation paths. All unit tests pass, and the build succeeds in both configurations (hint enabled/disabled).
**Next Action**: Run full benchmark suite to validate performance targets (15-20% improvement over Headerless baseline).
**Recommendation**: If benchmarks show >= 15% improvement with no regressions, merge to master and plan for default enable in Phase 2.
---
**Generated**: 2025-12-03
**Author**: hakmem team