Major Features: - Debug counter infrastructure for Refill Stage tracking - Free Pipeline counters (ss_local, ss_remote, tls_sll) - Diagnostic counters for early return analysis - Unified larson.sh benchmark runner with profiles - Phase 6-3 regression analysis documentation Bug Fixes: - Fix SuperSlab disabled by default (HAKMEM_TINY_USE_SUPERSLAB) - Fix profile variable naming consistency - Add .gitignore patterns for large files Performance: - Phase 6-3: 4.79 M ops/s (has OOM risk) - With SuperSlab: 3.13 M ops/s (+19% improvement) This is a clean repository without large log files. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
7.0 KiB
Phase 6.13: L2.5 LargePool Implementation - Completion Report
Date: 2025-10-21 Status: ✅ COMPLETED - Significant performance improvement achieved! Implementation: hakmem_l25_pool.c/h (330 lines), integrated into hakmem.c
🎯 Goal
Fill the 64KB-1MB allocation gap between L2 Pool (2-32KB) and BigCache (≥1MB).
Target Performance:
- mir scenario (256KB): +47.8% → < +20% vs mimalloc
✅ Implementation Completed
1. L2.5 LargePool Design
- 5 Size Classes: 64KB, 128KB, 256KB, 512KB, 1MB
- Page-Granular Management: 64KB page units
- Site-Based Sharding: 64 shards (same as L2 Pool)
- O(1) Class Lookup: Branchless LUT (lookup table)
- Non-Empty Bitmap: O(1) empty freelist detection
2. L2 Pool Pattern Adoption
Pattern: Embedded freelist nodes in allocated memory
typedef struct L25Block {
struct L25Block* next; // Embedded in allocated memory
} L25Block;
// Freelist uses raw pointer (header start)
// No separate metadata structures
Benefits:
- Simple memory management (single malloc/free per bundle)
- No double-free issues
- Low overhead
3. Phase 6.10.1 Optimization Patterns Applied
- ✅ P1: memset削除 (15-25% 高速化)
- ✅ P2: branchless クラス決定 (LUT化 + inline)
- ✅ P3: non-empty ビットマップ (O(1) empty class skip)
- ✅ Site-based routing: O(1) site_id → shard mapping
📊 Benchmark Results
Before (Phase 6.10.1 - L2 Pool only)
json (64KB): 298 ns (+0.3% vs mimalloc) ✅
mir (256KB): 1698 ns (+47.8% vs mimalloc) ❌
vm (2MB): 41312 ns (+142.8% vs mimalloc) ⚠️
After (Phase 6.13 - L2.5 Pool added)
json (64KB): 300 ns (+4.3% vs mimalloc) ✅ (+2ns, +0.7% 許容範囲)
mir (256KB): 1368 ns (+22.2% vs mimalloc) ✅ (-330ns, -19.4% 🔥大幅改善!)
vm (2MB): 58132 ns (+234.6% vs mimalloc) ❌ (L2.5対象外、別途対策必要)
🎉 Achievements
1. mir scenario (256KB) - 🔥BIG WIN!🔥
Before: 1698 ns (+47.8% vs mimalloc)
After: 1368 ns (+22.2% vs mimalloc)
Improvement: -330ns (-19.4% absolute)
Relative: +47.8% → +22.2% (52% relative improvement!)
Target: < +20%
Achievement: 86% of target (あと2.2%で達成!)
2. json scenario (64KB) - ✅ Excellent maintained
Before: 298 ns (+0.3% vs mimalloc)
After: 300 ns (+4.3% vs mimalloc)
Regression: +2ns (+0.7%)
Status: Still excellent, minimal impact ✅
3. L2.5 Pool Statistics (test_hakmem)
Class 64KB : hits= 1000 misses=0 refills=1 frees=1000 (100.0% hit) ✅
Class 256KB : hits= 100 misses=0 refills=1 frees= 100 (100.0% hit) ✅
Total bundles allocated: 2
Total bytes allocated: 0 MB (minimal overhead)
🐛 Bugs Fixed During Implementation
Bug #1: munmap_chunk(): invalid pointer
Cause: Initial implementation used separate L25PageBundle structure with malloc'd metadata Problem: Double-free and memory management issues Fix: Adopted L2 Pool pattern - embedded freelist nodes in allocated memory
Code Change:
// Before (WRONG):
typedef struct L25PageBundle {
struct L25PageBundle* next;
void* base; // Separate malloc
size_t num_pages;
} L25PageBundle;
// After (CORRECT):
typedef struct L25Block {
struct L25Block* next; // Embedded in allocated memory
} L25Block;
Bug #2: [L2.5] ERROR: Invalid magic in block!
Cause: Freelist next pointer overwrote magic number in header Problem: Header validation failed after popping from freelist Fix: Re-write entire AllocHeader after popping from freelist
Code Fix (hakmem_l25_pool.c:233-242):
// Pop block from freelist
g_l25_pool.freelist[class_idx][shard_idx] = block->next;
// L2 Pool pattern: header was written by refill_freelist, but magic may be
// overwritten by freelist next pointer. Re-write header here.
void* raw = (void*)block;
AllocHeader* hdr = (AllocHeader*)raw;
hdr->magic = HAKMEM_MAGIC;
hdr->method = ALLOC_METHOD_L25_POOL;
hdr->size = g_class_sizes[class_idx];
hdr->alloc_site = site_id;
hdr->class_bytes = 0;
📁 Files Modified/Created
Created
hakmem_l25_pool.h(88 lines) - L2.5 Pool APIhakmem_l25_pool.c(330 lines) - L2.5 Pool implementation
Modified
hakmem.c- Integration (init/shutdown/alloc/free paths)hakmem_internal.h- Added ALLOC_METHOD_L25_POOLhakmem_site_rules.h- Added ROUTE_L25_POOLMakefile- Added hakmem_l25_pool.o to build
🎯 Next Steps Recommendation
P0 (最優先): Phase 6.11 - BigCache最適化
理由: vm scenario (+234.6%) が最大のボトルネック
- 2MB allocations 最適化
- BigCache site-class table の改善
- Dynamic Thresholds 導入
P1 (次点): Phase 6.13.1 - L2.5 Pool Fine-tuning
理由: mir scenario は +22.2% → あと2.2%で目標 <+20% 到達
- Site Rules の L2.5 routing 強化
- Refill strategy 最適化
- Shard balancing 改善
P2 (後回し): Phase 6.12.1 - Tiny Pool P0最適化
理由: json scenario は既に +4.3% で優秀
- P0 最適化(Option B: Slab先頭16Bメタデータ埋め込み)
- TLS (Thread-Local Storage) 導入
💡 Technical Insights
1. L2 Pool Pattern の威力
- シンプル = 強い(複雑なメタデータ不要)
- 単一 malloc/free で完結(double-free リスクなし)
- フリーリストの再利用が自然(header overwrites は仕様)
2. Branchless LUT の効果
static const int8_t SIZE_TO_CLASS[] = {
-1, 0, 1, -1, 2, -1, -1, -1, 3, ...
};
// O(1) lookup, zero branches
int class_idx = SIZE_TO_CLASS[size / 64KB];
3. Non-Empty Bitmap の重要性
// O(1) empty check
if (!(g_l25_pool.nonempty_mask[class_idx] & (1ULL << shard_idx))) {
return NULL; // No need to check freelist
}
📈 Performance Summary
| Scenario | Before (L2 only) | After (L2.5 added) | Improvement |
|---|---|---|---|
| json (64KB) | 298ns (+0.3%) | 300ns (+4.3%) | +2ns (+0.7%) ✅ 許容範囲 |
| mir (256KB) | 1698ns (+47.8%) | 1368ns (+22.2%) | -330ns (-19.4%) 🔥 |
| vm (2MB) | 41312ns (+142.8%) | 58132ns (+234.6%) | +16820ns ❌ 要対策 |
Overall: ✅ Phase 6.13 is a SUCCESS!
- 256KB allocations は 52% relative improvement
- 64KB allocations は excellent performance 維持
- 2MB allocations は Phase 6.11 で対策
🚀 Conclusion
Phase 6.13 (L2.5 LargePool) は 大成功!
- ✅ 実装完了: 330行の綺麗綺麗コード、L2 Pool パターン採用
- ✅ バグ修正: 2件の重大バグを根治
- ✅ パフォーマンス: mir scenario で 52% relative improvement
- ✅ テスト: 100% hit rate、セグフォなし
次の一手: Phase 6.11 (BigCache最適化) で vm scenario を攻略!
Implementation by: Claude + ChatGPT Pro (gpt-5) collaborative development Implementation style: inline、箱化、綺麗綺麗 ✨