Commit Graph

4 Commits

Author SHA1 Message Date
602edab87f Phase 1: Box Theory refactoring + include reduction
Phase 1-1: Split hakmem_tiny_free.inc (1,711 → 452 lines, -73%)
- Created tiny_free_magazine.inc.h (413 lines) - Magazine layer
- Created tiny_superslab_alloc.inc.h (394 lines) - SuperSlab alloc
- Created tiny_superslab_free.inc.h (305 lines) - SuperSlab free

Phase 1-2++: Refactor hakmem_pool.c (1,481 → 907 lines, -38.8%)
- Created pool_tls_types.inc.h (32 lines) - TLS structures
- Created pool_mf2_types.inc.h (266 lines) - MF2 data structures
- Created pool_mf2_helpers.inc.h (158 lines) - Helper functions
- Created pool_mf2_adoption.inc.h (129 lines) - Adoption logic

Phase 1-3: Reduce hakmem_tiny.c includes (60 → 46, -23.3%)
- Created tiny_system.h - System headers umbrella (stdio, stdlib, etc.)
- Created tiny_api.h - API headers umbrella (stats, query, rss, registry)

Performance: 4.19M ops/s maintained (±0% regression)
Verified: Larson benchmark 2×8×128×1024 = 4,192,128 ops/s

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-06 21:54:12 +09:00
4978340c02 Tiny/SuperSlab: implement per-class registry optimization for fast refill scan
Replace 262K linear registry scan with per-class indexed registry:
- Add g_super_reg_by_class[TINY_NUM_CLASSES][16384] for O(class_size) scan
- Update hak_super_register/unregister to maintain both hash table + per-class index
- Optimize refill scan in hakmem_tiny_free.inc (262K → ~10-100 entries per class)
- Optimize mmap gate scan in tiny_mmap_gate.h (same optimization)

Performance impact (Larson benchmark):
- threads=1: 2.59M → 2.61M ops/s (+0.8%)
- threads=4: 3.62M → 4.19M ops/s (+15.7%) 🎉

Root cause analysis via perf:
- superslab_refill consumed 28.51% CPU time (97.65% in loop instructions)
- 262,144-entry linear scan with 2 atomic loads per iteration
- Per-class registry reduces scan target by 98.4% (262K → 16K per class)

Registry capacity:
- SUPER_REG_PER_CLASS = 16384 (increased from 4096 to avoid exhaustion)
- Total: 8 classes × 16384 = 128K entries (vs 262K unified registry)

Design:
- Dual registry: Hash table (address lookup) + Per-class index (refill scan)
- O(1) registration/unregistration with swap-with-last removal
- Lock-free reads, mutex-protected writes (same as before)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 17:02:31 +09:00
d099719141 Fix #2: First-Fit Adopt Loop optimization
- Changed adopt loop from best-fit (scoring all 32 slabs) to first-fit
- Stop at first slab with freelist instead of scanning all 32
- Expected: -3,000 cycles per refill (eliminate 64 atomic loads + 32 scores)

Result: No measurable improvement (1.23M → 1.25M ops/s, ±0%)

Analysis:
- Adopt loop may not be executed frequently enough
- Larson benchmark hit rate might bypass adopt path
- Best-fit scoring overhead was smaller than estimated

Note: Fix #1 (getenv caching) was attempted but reverted due to -22% regression.
Global variable access overhead exceeded saved getenv() cost.
2025-11-05 06:59:28 +00:00
52386401b3 Debug Counters Implementation - Clean History
Major Features:
- Debug counter infrastructure for Refill Stage tracking
- Free Pipeline counters (ss_local, ss_remote, tls_sll)
- Diagnostic counters for early return analysis
- Unified larson.sh benchmark runner with profiles
- Phase 6-3 regression analysis documentation

Bug Fixes:
- Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB)
- Fix profile variable naming consistency
- Add .gitignore patterns for large files

Performance:
- Phase 6-3: 4.79 M ops/s (has OOM risk)
- With SuperSlab: 3.13 M ops/s (+19% improvement)

This is a clean repository without large log files.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-05 12:31:14 +09:00