hakmem/docs/PHASE1_TLS_HINT_BENCHMARK.md
Moe Charm (CI) 94f9ea5104 Implement Phase 1: TLS SuperSlab Hint Box for Headerless performance
Design: Cache recently-used SuperSlab references in TLS to accelerate
ptr→SuperSlab resolution in Headerless mode free() path.

## Implementation

### New Box: core/box/tls_ss_hint_box.h
- Header-only Box (4-slot FIFO cache per thread)
- Functions: tls_ss_hint_init(), tls_ss_hint_update(), tls_ss_hint_lookup(), tls_ss_hint_clear()
- Memory overhead: 112 bytes per thread (negligible)
- Statistics API for debug builds (hit/miss counters)

### Integration Points

1. **Free path** (core/hakmem_tiny_free.inc):
   - Lines 477-481: Fast path hint lookup before hak_super_lookup()
   - Lines 550-555: Second lookup location (fallback path)
   - Expected savings: 10-50 cycles → 2-5 cycles on cache hit

2. **Allocation path** (core/tiny_superslab_alloc.inc.h):
   - Lines 115-122: Linear allocation return path
   - Lines 179-186: Freelist allocation return path
   - Cache update on successful allocation

3. **TLS variable** (core/hakmem_tiny_tls_state_box.inc):
   - `__thread TlsSsHintCache g_tls_ss_hint = {0};`

### Build System

- **Build flag** (core/hakmem_build_flags.h):
  - HAKMEM_TINY_SS_TLS_HINT (default: 0, disabled)
  - Validation: requires HAKMEM_TINY_HEADERLESS=1

- **Makefile**:
  - Removed old ss_tls_hint_box.o (conflicting implementation)
  - Header-only design eliminates compiled object files
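The flag validation described above could be enforced at compile time roughly as follows. This is a sketch, not the contents of core/hakmem_build_flags.h: the default of 0 for HAKMEM_TINY_HEADERLESS, the error text, and the `SKETCH_HINT_LOOKUP` wrapper with its `sketch_hint_lookup` declaration are assumptions made for illustration.

```c
#include <stddef.h>

/* Defaults: the hint is off unless requested (default 0 per the report).
 * The HEADERLESS default of 0 is an assumption of this sketch. */
#ifndef HAKMEM_TINY_HEADERLESS
#define HAKMEM_TINY_HEADERLESS 0
#endif
#ifndef HAKMEM_TINY_SS_TLS_HINT
#define HAKMEM_TINY_SS_TLS_HINT 0
#endif

/* Validation: the hint only makes sense in Headerless mode. */
#if HAKMEM_TINY_SS_TLS_HINT && !HAKMEM_TINY_HEADERLESS
#error "HAKMEM_TINY_SS_TLS_HINT=1 requires HAKMEM_TINY_HEADERLESS=1"
#endif

/* Hypothetical call-site wrapper: when the flag is 0 the lookup collapses
 * to a constant NULL, so the disabled build pays nothing at run time. */
#if HAKMEM_TINY_SS_TLS_HINT
void *sketch_hint_lookup(const void *ptr); /* would come from the hint box */
#define SKETCH_HINT_LOOKUP(ptr) sketch_hint_lookup(ptr)
#else
#define SKETCH_HINT_LOOKUP(ptr) ((void)(ptr), (void *)NULL)
#endif
```

With this shape, passing only `-DHAKMEM_TINY_SS_TLS_HINT=1` stops the build at the `#error` line instead of producing a misconfigured library.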

### Testing

- **Unit tests** (tests/test_tls_ss_hint.c):
  - 6 test functions covering init, lookup, FIFO rotation, duplicates, clear, stats
  - All tests PASSING

- **Build validation**:
  - Compiles with hint disabled (default)
  - Compiles with hint enabled (HAKMEM_TINY_SS_TLS_HINT=1)

### Documentation

- **Benchmark report** (docs/PHASE1_TLS_HINT_BENCHMARK.md):
  - Implementation summary
  - Build validation results
  - Benchmark methodology (to be executed)
  - Performance analysis framework

## Expected Performance

- **Hit rate**: 85-95% (single-threaded), 70-85% (multi-threaded)
- **Cycle savings**: 80-95% on cache hit (10-50 cycles → 2-5 cycles)
- **Target improvement**: 15-20% throughput increase vs Headerless baseline
- **Memory overhead**: 112 bytes per thread

## Box Theory

**Mission**: Cache hot SuperSlabs to avoid global registry lookup

**Boundary**: ptr → SuperSlab* or NULL (miss)

**Invariant**: hint.base ≤ ptr < hint.end → hit is valid

**Fallback**: Always safe to miss (triggers hak_super_lookup)

**Thread Safety**: TLS storage, no synchronization required

**Risk**: Low (read-only cache, fail-safe fallback, magic validation)
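
The boundary, invariant, and fallback above can be sketched in a few lines of C. This is an illustration under stated assumptions, not the code in core/box/tls_ss_hint_box.h: the type names, the function names (`hint_update`, `hint_lookup`, `resolve`), the magic value, and `slow_lookup_stub` (a stand-in for the global registry lookup) are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define HINT_SLOTS 4            /* 4-slot cache, as in the report */
#define SS_MAGIC   0x53534C42u  /* hypothetical magic value */

/* Hypothetical stand-in for the real SuperSlab header. */
typedef struct SuperSlab { uint32_t magic; } SuperSlab;

typedef struct {
    uintptr_t  base;  /* first byte covered by the SuperSlab */
    uintptr_t  end;   /* one past the last covered byte */
    SuperSlab *ss;    /* NULL marks an empty slot */
} HintEntry;

typedef struct {
    HintEntry slot[HINT_SLOTS];
    unsigned  next;   /* FIFO cursor: slot to overwrite next */
} HintCache;

static __thread HintCache g_hint;  /* zero-initialized per thread */

/* Record a SuperSlab after a successful allocation. Duplicates are
 * skipped so one hot SuperSlab cannot occupy several slots. */
static void hint_update(SuperSlab *ss, uintptr_t base, size_t size) {
    for (unsigned i = 0; i < HINT_SLOTS; i++)
        if (g_hint.slot[i].ss == ss) return;
    HintEntry *e = &g_hint.slot[g_hint.next];
    e->base = base;
    e->end  = base + size;
    e->ss   = ss;
    g_hint.next = (g_hint.next + 1) % HINT_SLOTS;  /* oldest entry goes first */
}

/* Boundary: ptr -> SuperSlab* on a hit, NULL on a miss.
 * Invariant: a hit requires base <= ptr < end. The magic check assumes
 * the SuperSlab header is still mapped when it is read. */
static SuperSlab *hint_lookup(const void *ptr) {
    uintptr_t p = (uintptr_t)ptr;
    for (unsigned i = 0; i < HINT_SLOTS; i++) {
        const HintEntry *e = &g_hint.slot[i];
        if (e->ss && p >= e->base && p < e->end && e->ss->magic == SS_MAGIC)
            return e->ss;
    }
    return NULL;
}

/* Stand-in for the global registry lookup; it only counts calls. */
static unsigned g_slow_calls;
static SuperSlab *slow_lookup_stub(const void *ptr) {
    (void)ptr;
    g_slow_calls++;
    return NULL;
}

/* Fallback: a miss is always safe because the slow path still runs. */
static SuperSlab *resolve(const void *ptr) {
    SuperSlab *ss = hint_lookup(ptr);
    return ss ? ss : slow_lookup_stub(ptr);
}
```

Because the cache lives in thread-local storage and a miss only costs the extra probe, a wrong eviction choice affects speed but not correctness.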

## Next Steps

1. Run full benchmark suite (sh8bench, cfrac, larson)
2. Measure actual hit rate with stats enabled
3. If performance target met (15-20% improvement), enable by default
4. Consider increasing cache slots if hit rate < 80%

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-03 18:06:24 +09:00


Phase 1: TLS SuperSlab Hint Box - Benchmark Report

Implementation Summary

Date: 2025-12-03
Status: Implementation Complete - Benchmarking Required
Commit: [Pending]

What Was Implemented

  1. TLS SuperSlab Hint Box (core/box/tls_ss_hint_box.h)

    • Header-only Box implementation
    • 4-slot FIFO cache per thread (112 bytes TLS overhead)
    • Inline functions: tls_ss_hint_init(), tls_ss_hint_update(), tls_ss_hint_lookup(), tls_ss_hint_clear()
    • Statistics API for debug builds
  2. Build Flag (core/hakmem_build_flags.h)

    • HAKMEM_TINY_SS_TLS_HINT (default: 0, disabled)
    • Validation check: requires HAKMEM_TINY_HEADERLESS=1
  3. Integration Points

    • Free path (core/hakmem_tiny_free.inc): Lines 477-481, 550-555
      • Fast path hint lookup before expensive hak_super_lookup()
    • Allocation path (core/tiny_superslab_alloc.inc.h): Lines 115-122, 179-186
      • Cache update on successful allocation (both linear and freelist modes)
  4. TLS Variable Definition (core/hakmem_tiny_tls_state_box.inc)

    • __thread TlsSsHintCache g_tls_ss_hint = {0};
  5. Unit Tests (tests/test_tls_ss_hint.c)

    • 6 test functions (init, basic lookup, FIFO rotation, duplicate detection, clear, stats)
    • All tests PASSING
  6. Build System

    • Removed old conflicting ss_tls_hint_box.c (different implementation)
    • Updated Makefile to remove compiled object files (header-only design)

Environment

  • CPU: [Run: lscpu | grep "Model name"]
  • OS: Linux 6.8.0-87-generic
  • Compiler: gcc (Ubuntu)
  • Build Date: 2025-12-03
  • Hakmem Commit: [Run: git log -1 --oneline]

Build Validation

Build 1: Hint Disabled (Baseline)

```bash
make clean
make shared -j8
```

Result: SUCCESS

Build 2: Hint Enabled

```bash
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_SS_TLS_HINT=1 -DHAKMEM_TINY_HEADERLESS=1"
```

Result: SUCCESS

Unit Tests

```bash
gcc -o tests/test_tls_ss_hint tests/test_tls_ss_hint.c -I./core \
    -DHAKMEM_TINY_SS_TLS_HINT=1 -DHAKMEM_BUILD_RELEASE=0 -DHAKMEM_TINY_HEADERLESS=1
./tests/test_tls_ss_hint
```

Result: ALL 6 TESTS PASSED


Benchmark Results (To Be Run)

Methodology

Run each benchmark configuration 3 times and take the median:

```bash
# Configuration 1: Baseline (Headerless OFF, Hint OFF)
make clean
make shared -j8
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench

# Configuration 2: Headerless ON, Hint OFF
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=0"
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench

# Configuration 3: Headerless ON, Hint ON
make clean
make shared -j8 EXTRA_CFLAGS="-DHAKMEM_TINY_HEADERLESS=1 -DHAKMEM_TINY_SS_TLS_HINT=1"
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/sh8bench
```

sh8bench (Memory Stress Test)

| Configuration | Time (sec) | Mops/s | Relative to Baseline | Improvement vs Headerless |
|---|---|---|---|---|
| Baseline (Headerless OFF, Hint OFF) | TBD | TBD | 100% | - |
| Headerless ON, Hint OFF | TBD | TBD | TBD | 0% |
| Headerless ON, Hint ON | TBD | TBD | TBD | TBD |

Expected: Headerless with Hint should improve throughput by 15-20% over Headerless with the hint disabled.
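
Once the three runs per configuration exist, the table columns can be filled in with a small helper like the one below. It is a sketch only: the function names are hypothetical, and it assumes the comparison is made on the median Mops/s of each configuration.

```c
/* Median of three runs, as called for in the methodology. */
static double median3(double a, double b, double c) {
    double t;
    if (a > b) { t = a; a = b; b = t; }
    if (b > c) { t = b; b = c; c = t; }
    if (a > b) { t = a; a = b; b = t; }
    return b;
}

/* "Relative to Baseline" column: throughput as a percentage of the
 * Headerless OFF, Hint OFF configuration. */
static double relative_to_baseline_pct(double mops, double baseline_mops) {
    return 100.0 * mops / baseline_mops;
}

/* "Improvement vs Headerless" column: percentage gain of Hint ON over
 * Headerless ON, Hint OFF (which is 0% by definition). */
static double improvement_pct(double hint_mops, double headerless_mops) {
    return 100.0 * (hint_mops - headerless_mops) / headerless_mops;
}
```

The 15-20% target corresponds to `improvement_pct` returning a value between 15 and 20 for the Hint ON row.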

cfrac (Factorization Test)

```bash
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/cfrac 17545186520809
```

| Configuration | Status | Time (sec) | Notes |
|---|---|---|---|
| Baseline | TBD | TBD | - |
| Headerless ON, Hint OFF | TBD | TBD | - |
| Headerless ON, Hint ON | TBD | TBD | No regressions expected |

larson (Multi-threaded Stress)

```bash
LD_PRELOAD=./libhakmem.so ./mimalloc-bench/out/bench/larson 8
```

| Configuration | Status | Ops/sec | Notes |
|---|---|---|---|
| Baseline | TBD | TBD | - |
| Headerless ON, Hint OFF | TBD | TBD | - |
| Headerless ON, Hint ON | TBD | TBD | Multi-threaded hit rate: 70-85% |

Performance Analysis

Expected Hit Rate

Based on design analysis (Section 9 of TLS_SS_HINT_BOX_DESIGN.md):

  • Single-threaded: 85-95%
  • Multi-threaded: 70-85%

Cycle Count Savings

| Operation | Without Hint | With Hint (Hit) | Savings |
|---|---|---|---|
| ptr→SuperSlab lookup | 10-50 cycles | 2-5 cycles | 80-95% |
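
The per-hit savings only translate into an average gain once the hit rate is factored in. The sketch below models that with a simple weighted average. It assumes that a miss pays for the failed hint probe plus the full lookup, which is a modelling assumption of this example and not a measured property. The function name is hypothetical.

```c
/* Expected cycles per ptr->SuperSlab resolution.
 * hit_rate is a fraction in [0, 1]; a miss is assumed to cost the
 * failed hint probe (hit_cycles) plus the slow lookup (slow_cycles). */
static double expected_lookup_cycles(double hit_rate,
                                     double hit_cycles,
                                     double slow_cycles) {
    return hit_rate * hit_cycles
         + (1.0 - hit_rate) * (hit_cycles + slow_cycles);
}
```

Under this model, a 90% hit rate with a 5-cycle hit and a 50-cycle lookup averages about 10 cycles per resolution, so the hit rate measured in the benchmarks matters as much as the per-hit cost.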

Memory Overhead

  • Per-thread: 112 bytes (4 slots × 24 bytes + 16 bytes metadata)
  • 1000 threads: 112 KB (negligible)
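
The 112-byte figure can be checked against a concrete layout. The sketch below is one layout that matches the arithmetic above on a 64-bit target. The field names and the split of the 16 metadata bytes are assumptions, since the report states only the totals.

```c
#include <stddef.h>
#include <stdint.h>

/* One slot: three 8-byte fields on a 64-bit target = 24 bytes. */
typedef struct {
    void *base;  /* start of the SuperSlab range */
    void *end;   /* one past the end of the range */
    void *ss;    /* the cached SuperSlab pointer */
} HintEntrySketch;

/* 4 slots (96 bytes) + 16 bytes of metadata = 112 bytes.
 * How the metadata is divided here is purely illustrative. */
typedef struct {
    HintEntrySketch slot[4];
    uint32_t count;     /* slots in use */
    uint32_t next;      /* FIFO cursor */
    uint64_t reserved;  /* e.g. room for debug counters */
} HintCacheSketch;

/* Fails to compile if the layout drifts from the documented size
 * (assumes 8-byte pointers). */
_Static_assert(sizeof(HintEntrySketch) == 24, "slot must be 24 bytes");
_Static_assert(sizeof(HintCacheSketch) == 112, "cache must be 112 bytes");

/* Total TLS cost for a given number of threads. */
static size_t hint_tls_bytes(size_t threads) {
    return threads * sizeof(HintCacheSketch);
}
```

A static assertion of this kind keeps the documented overhead honest if fields are added later, for example when increasing the slot count.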

Next Steps

  1. Run Benchmarks: Execute benchmark suite on dedicated machine
  2. Measure Hit Rate: Build with HAKMEM_BUILD_RELEASE=0 (debug build, which enables the statistics API) and add a stats dump at exit
  3. Performance Tuning: If hit rate < 80%, consider increasing slots to 8
  4. Production Rollout: If results meet target (15-20% improvement), enable by default
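
For step 2, a stats dump at exit might look like the sketch below. The counter struct, the function names, and the output format are hypothetical and are not the Box's actual statistics API. Note that an `atexit` handler reads the thread-local counters of whichever thread calls `exit`, so per-thread totals would need separate aggregation.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical per-thread hit/miss counters. */
typedef struct { uint64_t hits, misses; } HintStatsSketch;
static __thread HintStatsSketch g_hint_stats;

/* Call once per lookup with the outcome. */
static void hint_stats_record(int hit) {
    if (hit) g_hint_stats.hits++;
    else     g_hint_stats.misses++;
}

/* Hit rate in percent; 0 when nothing has been recorded. */
static double hint_hit_rate_pct(const HintStatsSketch *s) {
    uint64_t total = s->hits + s->misses;
    return total ? 100.0 * (double)s->hits / (double)total : 0.0;
}

/* Prints the counters of the thread that runs the exit handlers. */
static void hint_stats_dump(void) {
    fprintf(stderr, "[tls_ss_hint] hits=%" PRIu64 " misses=%" PRIu64
                    " hit_rate=%.1f%%\n",
            g_hint_stats.hits, g_hint_stats.misses,
            hint_hit_rate_pct(&g_hint_stats));
}

/* Register the dump; returns 0 on success, like atexit(). */
static int hint_stats_install(void) {
    return atexit(hint_stats_dump);
}
```

The printed hit rate can then be compared against the 80% threshold in step 3 when deciding whether to try 8 slots.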

Success Criteria

Code Quality

  • Header-only Box design (zero runtime overhead when disabled)
  • Follows Box Theory architecture
  • Comprehensive unit tests (6/6 passing)
  • Fail-safe fallback (miss → hak_super_lookup)

Build System

  • Compiles with hint disabled (default)
  • Compiles with hint enabled
  • No regressions in existing tests

Performance (Benchmarking Required)

  • sh8bench: +15-20% throughput vs Headerless baseline
  • cfrac: No regressions
  • larson: No regressions; +15-20% in the ideal case

Risk Assessment

Risk Level: Low

  • Thread-local storage (no cache coherency issues)
  • Read-only cache (never modifies SuperSlab state)
  • Magic number validation (catches stale entries)
  • Fail-safe fallback (miss → hak_super_lookup)
  • Minimal integration surface (2 hot-path files modified: free path and allocation path)
  • Zero overhead when disabled (compile-time flag)

Conclusion

Implementation Status: Complete

The TLS SuperSlab Hint Box has been successfully implemented as a header-only Box with clean integration into the free and allocation paths. All unit tests pass, and the build succeeds in both configurations (hint enabled/disabled).

Next Action: Run full benchmark suite to validate performance targets (15-20% improvement over Headerless baseline).

Recommendation: If benchmarks show >= 15% improvement with no regressions, merge to master and plan for default enable in Phase 2.


Generated: 2025-12-03
Author: hakmem team