Profiling Results (2025-10-22)

This document records the overheads currently observed in lightweight sampling runs. Throughput is ignored; the focus is on average nanoseconds per code path.

Environment: larson, LD_PRELOAD hakmem, sampling profiler ON (HAKMEM_PROF=1), sample rates as indicated.

10s reference (subset)

  • BURST 10s, 1T
    • tiny_alloc: avg ~28.8 ns (samples ~65k)
  • BURST 10s, 4T
    • malloc_alloc: avg ~98.7 ns (samples ~16k)
  • LOOP 10s, 4T
    • malloc_alloc: avg ~40.9 ns (samples ~35k)

2s sweep (1/256), 1T/4T (mid/large ranges)

  • Mid 2–32KiB, 1T

    • ace_alloc: ~26–31 ns
    • malloc_alloc: ~150–220 ns
  • Mid 2–32KiB, 4T

    • ace_alloc: ~31–35 ns
    • malloc_alloc: ~250–315 ns
  • Large 64KiB–1MiB, 1T

    • ace_alloc: ~32–58 ns → 31–50 ns (after W_MAX tuning)
    • malloc_alloc: ~960–1,690 ns → 840–1,330 ns (slight drop)
  • Large 64KiB–1MiB, 4T

    • ace_alloc: ~52–72 ns
    • malloc_alloc: ~2.5–4.1 µs

Notes:

  • Tiny path healthy: tiny_alloc ~18–29 ns, tiny_reg_lookup ~17–23 ns.
  • Registry registration ~0.6–1.3 µs (rare: slab creation only).
  • Pool/L25 lock/refill paths are present (instrumented) but have low sample counts in 2s runs; use focused size ranges and a higher sampling rate for deeper analysis.

1s sweep (1/256), 1T (W_MAX_mid=1.40, W_MAX_large=1.30)

  • Mid 2–32KiB, 1T
    • ace_alloc: ~26.6 ns
    • malloc_alloc: ~153.2 ns (slight drop)
  • Large 64KiB–1MiB, 1T
    • ace_alloc: ~31.9–50.6 ns
    • malloc_alloc: ~927.9–1,334.1 ns (modest drop)