ENV cleanup: Remove BG/HotMag vars & guard fprintf (Larson 52.3M ops/s)
Phase 1 完了:環境変数整理 + fprintf デバッグガード ENV変数削除(BG/HotMag系): - core/hakmem_tiny_init.inc: HotMag ENV 削除 (~131 lines) - core/hakmem_tiny_bg_spill.c: BG spill ENV 削除 - core/tiny_refill.h: BG remote 固定値化 - core/hakmem_tiny_slow.inc: BG refs 削除 fprintf Debug Guards (#if !HAKMEM_BUILD_RELEASE): - core/hakmem_shared_pool.c: Lock stats (~18 fprintf) - core/page_arena.c: Init/Shutdown/Stats (~27 fprintf) - core/hakmem.c: SIGSEGV init message ドキュメント整理: - 328 markdown files 削除(旧レポート・重複docs) 性能確認: - Larson: 52.35M ops/s (前回52.8M、安定動作✅) - ENV整理による機能影響なし - Debug出力は一部残存(次phase で対応) 🤖 Generated with Claude Code Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
474
docs/design/ACE_PHASE1_IMPLEMENTATION_TODO.md
Normal file
474
docs/design/ACE_PHASE1_IMPLEMENTATION_TODO.md
Normal file
@ -0,0 +1,474 @@
|
||||
# ACE Phase 1 Implementation TODO
|
||||
|
||||
**Status**: Ready to implement (documentation complete)
|
||||
**Target**: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x improvement)
|
||||
**Timeline**: 1 day (7-9 hours total)
|
||||
**Date**: 2025-11-01
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Phase 1 implements the **minimal ACE (Adaptive Control Engine)** with maximum impact:
|
||||
- Metrics collection (throughput, LLC miss, mutex wait, backlog)
|
||||
- Fast loop control (0.5-1s adjustment cycle)
|
||||
- Dynamic TLS capacity tuning
|
||||
- UCB1 learning for knob selection
|
||||
- ON/OFF toggle via environment variable
|
||||
|
||||
**Expected Impact**: Fragmentation stress workload improves from 3.87 → 8-12 M ops/s
|
||||
|
||||
---
|
||||
|
||||
## Task Breakdown
|
||||
|
||||
### 1. Metrics Collection Infrastructure (2-3 hours)
|
||||
|
||||
#### 1.1 Create `core/hakmem_ace_metrics.h` (30 min)
|
||||
- [ ] Define `struct hkm_ace_metrics` with:
|
||||
```c
|
||||
struct hkm_ace_metrics {
|
||||
uint64_t throughput_ops; // Operations per second
|
||||
double llc_miss_rate; // LLC miss rate (0.0-1.0)
|
||||
uint64_t mutex_wait_ns; // Mutex contention time
|
||||
uint32_t remote_free_backlog[8]; // Per-class backlog
|
||||
double fragmentation_ratio; // Slow metric (60s)
|
||||
uint64_t rss_mb; // Slow metric (60s)
|
||||
uint64_t timestamp_ms; // Collection timestamp
|
||||
};
|
||||
```
|
||||
- [ ] Define collection API:
|
||||
```c
|
||||
void hkm_ace_metrics_init(void);
|
||||
void hkm_ace_metrics_collect(struct hkm_ace_metrics *out);
|
||||
void hkm_ace_metrics_destroy(void);
|
||||
```
|
||||
|
||||
#### 1.2 Create `core/hakmem_ace_metrics.c` (1.5-2 hours)
|
||||
- [ ] **Throughput tracking** (30 min)
|
||||
- Global atomic counter `g_ace_alloc_count`
|
||||
- Increment in `hakmem_alloc()` / `hakmem_free()`
|
||||
- Calculate ops/sec from delta between collections
|
||||
|
||||
- [ ] **LLC miss monitoring** (45 min)
|
||||
- Use `rdpmc` for lightweight performance counter access
|
||||
- Read LLC_MISSES and LLC_REFERENCES counters
|
||||
- Calculate miss_rate = misses / references
|
||||
- Fallback to 0.0 if RDPMC unavailable
|
||||
|
||||
- [ ] **Mutex contention tracking** (30 min)
|
||||
- Wrap `pthread_mutex_lock()` with timing
|
||||
- Track cumulative wait time per class
|
||||
- Reset counters after each collection
|
||||
|
||||
- [ ] **Remote free backlog** (15 min)
|
||||
- Read `g_tiny_classes[c].remote_backlog_count` for each class
|
||||
- Already tracked by tiny pool implementation
|
||||
|
||||
- [ ] **Fragmentation ratio (slow, 60s)** (15 min)
|
||||
- Calculate: `allocated_bytes / reserved_bytes`
|
||||
- Parse `/proc/self/status` for VmRSS and VmSize
|
||||
- Only update every 60 seconds (skip on fast collections)
|
||||
|
||||
- [ ] **RSS monitoring (slow, 60s)** (15 min)
|
||||
- Read `/proc/self/status` VmRSS field
|
||||
- Convert to MB
|
||||
- Only update every 60 seconds
|
||||
|
||||
#### 1.3 Integration with existing code (30 min)
|
||||
- [ ] Add `#include "hakmem_ace_metrics.h"` to `core/hakmem.c`
|
||||
- [ ] Call `hkm_ace_metrics_init()` in `hakmem_init()`
|
||||
- [ ] Call `hkm_ace_metrics_destroy()` in cleanup
|
||||
|
||||
---
|
||||
|
||||
### 2. Fast Loop Controller (2-3 hours)
|
||||
|
||||
#### 2.1 Create `core/hakmem_ace_controller.h` (30 min)
|
||||
- [ ] Define `struct hkm_ace_controller`:
|
||||
```c
|
||||
struct hkm_ace_controller {
|
||||
struct hkm_ace_metrics current;
|
||||
struct hkm_ace_metrics prev;
|
||||
|
||||
// Current knob values
|
||||
uint32_t tls_capacity[8]; // Per-class TLS magazine capacity
|
||||
uint32_t drain_threshold[8]; // Remote free drain threshold
|
||||
|
||||
// Fast loop state
|
||||
uint64_t fast_interval_ms; // Default 500ms
|
||||
uint64_t last_fast_tick_ms;
|
||||
|
||||
// Slow loop state
|
||||
uint64_t slow_interval_ms; // Default 30000ms (30s)
|
||||
uint64_t last_slow_tick_ms;
|
||||
|
||||
// Enabled flag
|
||||
bool enabled;
|
||||
};
|
||||
```
|
||||
- [ ] Define controller API:
|
||||
```c
|
||||
void hkm_ace_controller_init(struct hkm_ace_controller *ctrl);
|
||||
void hkm_ace_controller_tick(struct hkm_ace_controller *ctrl);
|
||||
void hkm_ace_controller_destroy(struct hkm_ace_controller *ctrl);
|
||||
```
|
||||
|
||||
#### 2.2 Create `core/hakmem_ace_controller.c` (1.5-2 hours)
|
||||
- [ ] **Initialization** (30 min)
|
||||
- Read environment variables:
|
||||
- `HAKMEM_ACE_ENABLED` (default 0)
|
||||
- `HAKMEM_ACE_FAST_INTERVAL_MS` (default 500)
|
||||
- `HAKMEM_ACE_SLOW_INTERVAL_MS` (default 30000)
|
||||
- Initialize knob values to current defaults:
|
||||
- `tls_capacity[c] = TINY_TLS_MAG_CAP` (currently 128)
|
||||
- `drain_threshold[c] = TINY_REMOTE_DRAIN_THRESHOLD` (currently high)
|
||||
|
||||
- [ ] **Fast loop tick** (45 min)
|
||||
- Check if `elapsed >= fast_interval_ms`
|
||||
- Collect current metrics
|
||||
- Calculate reward: `reward = throughput - (llc_miss_penalty + mutex_wait_penalty + backlog_penalty)`
|
||||
- Adjust knobs based on metrics:
|
||||
```c
|
||||
// LLC miss high → reduce TLS capacity (diet)
|
||||
if (llc_miss_rate > 0.15) {
|
||||
tls_capacity[c] *= 0.75; // Diet factor
|
||||
}
|
||||
|
||||
// Remote backlog high → lower drain threshold
|
||||
if (remote_backlog[c] > drain_threshold[c]) {
|
||||
drain_threshold[c] /= 2;
|
||||
}
|
||||
|
||||
// Mutex wait high → increase bundle width
|
||||
// (Phase 1: skip, implement in Phase 2)
|
||||
```
|
||||
- Apply knob changes to runtime (see section 4)
|
||||
- Update `prev` metrics for next iteration
|
||||
|
||||
- [ ] **Slow loop tick** (30 min)
|
||||
- Check if `elapsed >= slow_interval_ms`
|
||||
- Collect slow metrics (fragmentation, RSS)
|
||||
- If fragmentation high: trigger partial release (Phase 2 feature, skip for now)
|
||||
- If RSS high: trigger budgeted scavenge (Phase 2 feature, skip for now)
|
||||
|
||||
- [ ] **Tick dispatcher** (15 min)
|
||||
- Combined `hkm_ace_controller_tick()` that calls both fast and slow loops
|
||||
- Use monotonic clock (`clock_gettime(CLOCK_MONOTONIC)`) for timing
|
||||
|
||||
#### 2.3 Integration with main loop (30 min)
|
||||
- [ ] Add background thread in `core/hakmem.c`:
|
||||
```c
|
||||
static void* hkm_ace_thread_main(void *arg) {
|
||||
struct hkm_ace_controller *ctrl = arg;
|
||||
while (ctrl->enabled) {
|
||||
hkm_ace_controller_tick(ctrl);
|
||||
usleep(100000); // 100ms sleep, check every 0.1s
|
||||
}
|
||||
return NULL;
|
||||
}
|
||||
```
|
||||
- [ ] Start ACE thread in `hakmem_init()` if `HAKMEM_ACE_ENABLED=1`
|
||||
- [ ] Join ACE thread in cleanup
|
||||
|
||||
---
|
||||
|
||||
### 3. UCB1 Learning Algorithm (1-2 hours)
|
||||
|
||||
#### 3.1 Create `core/hakmem_ace_ucb1.h` (30 min)
|
||||
- [ ] Define discrete knob candidates:
|
||||
```c
|
||||
// TLS capacity candidates
|
||||
static const uint32_t TLS_CAP_CANDIDATES[] = {4, 8, 16, 32, 64, 128, 256, 512};
|
||||
#define TLS_CAP_N_ARMS 8
|
||||
|
||||
// Drain threshold candidates
|
||||
static const uint32_t DRAIN_THRESH_CANDIDATES[] = {32, 64, 128, 256, 512, 1024};
|
||||
#define DRAIN_THRESH_N_ARMS 6
|
||||
```
|
||||
- [ ] Define `struct hkm_ace_ucb1_arm`:
|
||||
```c
|
||||
struct hkm_ace_ucb1_arm {
|
||||
uint32_t value; // Knob value (e.g., 32, 64, 128)
|
||||
double avg_reward; // Average reward
|
||||
uint32_t n_pulls; // Number of times selected
|
||||
};
|
||||
```
|
||||
- [ ] Define `struct hkm_ace_ucb1_bandit`:
|
||||
```c
|
||||
struct hkm_ace_ucb1_bandit {
|
||||
struct hkm_ace_ucb1_arm arms[TLS_CAP_N_ARMS];
|
||||
uint32_t total_pulls;
|
||||
double exploration_bonus; // Default sqrt(2)
|
||||
};
|
||||
```
|
||||
- [ ] Define UCB1 API:
|
||||
```c
|
||||
void hkm_ace_ucb1_init(struct hkm_ace_ucb1_bandit *bandit, const uint32_t *candidates, int n_arms);
|
||||
int hkm_ace_ucb1_select(struct hkm_ace_ucb1_bandit *bandit);
|
||||
void hkm_ace_ucb1_update(struct hkm_ace_ucb1_bandit *bandit, int arm_idx, double reward);
|
||||
```
|
||||
|
||||
#### 3.2 Create `core/hakmem_ace_ucb1.c` (45 min)
|
||||
- [ ] **Initialization** (15 min)
|
||||
- Initialize each arm with candidate value
|
||||
- Set `avg_reward = 0.0`, `n_pulls = 0`
|
||||
|
||||
- [ ] **Selection** (15 min)
|
||||
- Implement UCB1 formula:
|
||||
```c
|
||||
ucb_value = avg_reward + exploration_bonus * sqrt(log(total_pulls) / n_pulls)
|
||||
```
|
||||
- Return arm index with highest UCB value
|
||||
- Handle initial exploration (n_pulls == 0 → infinity UCB)
|
||||
|
||||
- [ ] **Update** (15 min)
|
||||
- Update running average:
|
||||
```c
|
||||
avg_reward = (avg_reward * n_pulls + reward) / (n_pulls + 1)
|
||||
```
|
||||
- Increment `n_pulls` and `total_pulls`
|
||||
|
||||
#### 3.3 Integration with controller (30 min)
|
||||
- [ ] Add UCB1 bandits to `struct hkm_ace_controller`:
|
||||
```c
|
||||
struct hkm_ace_ucb1_bandit tls_cap_bandit[8]; // Per-class TLS capacity
|
||||
struct hkm_ace_ucb1_bandit drain_bandit[8]; // Per-class drain threshold
|
||||
```
|
||||
- [ ] In fast loop tick:
|
||||
- Select knob values using UCB1: `arm_idx = hkm_ace_ucb1_select(&ctrl->tls_cap_bandit[c])`
|
||||
- Apply selected values: `ctrl->tls_capacity[c] = TLS_CAP_CANDIDATES[arm_idx]`
|
||||
- After observing reward: `hkm_ace_ucb1_update(&ctrl->tls_cap_bandit[c], arm_idx, reward)`
|
||||
|
||||
---
|
||||
|
||||
### 4. Dynamic TLS Capacity Adjustment (1-2 hours)
|
||||
|
||||
#### 4.1 Modify `core/hakmem_tiny_magazine.h` (30 min)
|
||||
- [ ] Change `TINY_TLS_MAG_CAP` from compile-time constant to runtime variable:
|
||||
```c
|
||||
// OLD:
|
||||
#define TINY_TLS_MAG_CAP 128
|
||||
|
||||
// NEW:
|
||||
extern uint32_t g_tiny_tls_mag_cap[8]; // Per-class capacity
|
||||
```
|
||||
- [ ] Update all references to `TINY_TLS_MAG_CAP` to use `g_tiny_tls_mag_cap[class_idx]`
|
||||
|
||||
#### 4.2 Modify `core/hakmem_tiny_magazine.c` (30 min)
|
||||
- [ ] Define global capacity array:
|
||||
```c
|
||||
uint32_t g_tiny_tls_mag_cap[8] = {
|
||||
128, 128, 128, 128, 128, 128, 128, 128 // Default values
|
||||
};
|
||||
```
|
||||
- [ ] Add setter function:
|
||||
```c
|
||||
void hkm_tiny_set_tls_capacity(uint8_t class_idx, uint32_t new_cap) {
|
||||
if (class_idx >= 8) return;
|
||||
g_tiny_tls_mag_cap[class_idx] = new_cap;
|
||||
}
|
||||
```
|
||||
- [ ] Update magazine refill logic to respect dynamic capacity:
|
||||
```c
|
||||
// In tiny_magazine_refill():
|
||||
uint32_t cap = g_tiny_tls_mag_cap[class_idx];
|
||||
if (mag->count >= cap) return; // Already at capacity
|
||||
```
|
||||
|
||||
#### 4.3 Integration with ACE controller (30 min)
|
||||
- [ ] In `hkm_ace_controller_tick()`, apply TLS capacity changes:
|
||||
```c
|
||||
for (int c = 0; c < 8; c++) {
|
||||
uint32_t new_cap = ctrl->tls_capacity[c];
|
||||
hkm_tiny_set_tls_capacity(c, new_cap);
|
||||
}
|
||||
```
|
||||
- [ ] Similarly for drain threshold (if implemented in tiny pool):
|
||||
```c
|
||||
for (int c = 0; c < 8; c++) {
|
||||
uint32_t new_thresh = ctrl->drain_threshold[c];
|
||||
hkm_tiny_set_drain_threshold(c, new_thresh);
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 5. ON/OFF Toggle and Configuration (1 hour)
|
||||
|
||||
#### 5.1 Environment variables (30 min)
|
||||
- [ ] Add to `core/hakmem_config.h`:
|
||||
```c
|
||||
// ACE Learning Layer
|
||||
#define HAKMEM_ACE_ENABLED "HAKMEM_ACE_ENABLED" // 0/1
|
||||
#define HAKMEM_ACE_FAST_INTERVAL_MS "HAKMEM_ACE_FAST_INTERVAL_MS" // Default 500
|
||||
#define HAKMEM_ACE_SLOW_INTERVAL_MS "HAKMEM_ACE_SLOW_INTERVAL_MS" // Default 30000
|
||||
#define HAKMEM_ACE_LOG_LEVEL "HAKMEM_ACE_LOG_LEVEL" // 0=off, 1=info, 2=debug
|
||||
|
||||
// Safety guards
|
||||
#define HAKMEM_ACE_MAX_P99_LAT_NS "HAKMEM_ACE_MAX_P99_LAT_NS" // Default 10000000 (10ms)
|
||||
#define HAKMEM_ACE_MAX_RSS_MB "HAKMEM_ACE_MAX_RSS_MB" // Default 16384 (16GB)
|
||||
#define HAKMEM_ACE_MAX_CPU_PERCENT "HAKMEM_ACE_MAX_CPU_PERCENT" // Default 5
|
||||
```
|
||||
- [ ] Parse environment variables in `hkm_ace_controller_init()`
|
||||
|
||||
#### 5.2 Logging infrastructure (30 min)
|
||||
- [ ] Add logging macros in `core/hakmem_ace_controller.c`:
|
||||
```c
|
||||
#define ACE_LOG_INFO(fmt, ...) \
|
||||
if (g_ace_log_level >= 1) fprintf(stderr, "[ACE] " fmt "\n", ##__VA_ARGS__)
|
||||
|
||||
#define ACE_LOG_DEBUG(fmt, ...) \
|
||||
if (g_ace_log_level >= 2) fprintf(stderr, "[ACE DEBUG] " fmt "\n", ##__VA_ARGS__)
|
||||
```
|
||||
- [ ] Add debug output in fast loop:
|
||||
```c
|
||||
ACE_LOG_DEBUG("Fast loop: reward=%.2f, llc_miss=%.2f, backlog=%u",
|
||||
reward, llc_miss_rate, remote_backlog[0]);
|
||||
ACE_LOG_INFO("Adjusting TLS cap[%d]: %u → %u (diet factor=%.2f)",
|
||||
c, old_cap, new_cap, diet_factor);
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Testing Strategy
|
||||
|
||||
### Unit Tests
|
||||
- [ ] Test metrics collection:
|
||||
```bash
|
||||
# Verify throughput tracking
|
||||
HAKMEM_ACE_ENABLED=1 ./test_ace_metrics
|
||||
```
|
||||
- [ ] Test UCB1 selection:
|
||||
```bash
|
||||
# Verify arm selection and update
|
||||
./test_ace_ucb1
|
||||
```
|
||||
|
||||
### Integration Tests
|
||||
- [ ] Test ACE on fragmentation stress benchmark:
|
||||
```bash
|
||||
# Baseline (ACE OFF)
|
||||
HAKMEM_ACE_ENABLED=0 ./bench_fragment_stress_hakx > baseline.txt
|
||||
|
||||
# ACE ON
|
||||
HAKMEM_ACE_ENABLED=1 ./bench_fragment_stress_hakx > ace_on.txt
|
||||
|
||||
# Compare
|
||||
diff baseline.txt ace_on.txt
|
||||
```
|
||||
- [ ] Verify dynamic TLS capacity adjustment:
|
||||
```bash
|
||||
# Enable debug logging
|
||||
export HAKMEM_ACE_ENABLED=1
|
||||
export HAKMEM_ACE_LOG_LEVEL=2
|
||||
./bench_fragment_stress_hakx
|
||||
# Should see log output: "Adjusting TLS cap[2]: 128 → 96"
|
||||
```
|
||||
|
||||
### Benchmark Validation
|
||||
- [ ] Run A/B comparison on all weak workloads:
|
||||
```bash
|
||||
bash scripts/ace_ab_test.sh
|
||||
```
|
||||
- [ ] Expected results:
|
||||
- Fragmentation stress: 3.87 → 8-12 M ops/s (2-3x)
|
||||
- Mid MT: 111.6 M ops/s → 110-115 M ops/s (maintain ±5%)
|
||||
- Large WS: 22.15 M ops/s → 25-30 M ops/s (1.1-1.4x, partial improvement)
|
||||
|
||||
---
|
||||
|
||||
## Implementation Order
|
||||
|
||||
**Day 1 (7-9 hours)**:
|
||||
|
||||
1. **Morning (3-4 hours)**:
|
||||
- [ ] 1.1 Create hakmem_ace_metrics.h (30 min)
|
||||
- [ ] 1.2 Create hakmem_ace_metrics.c (2 hours)
|
||||
- [ ] 1.3 Integration (30 min)
|
||||
- [ ] Test: Verify metrics collection works
|
||||
|
||||
2. **Midday (2-3 hours)**:
|
||||
- [ ] 2.1 Create hakmem_ace_controller.h (30 min)
|
||||
- [ ] 2.2 Create hakmem_ace_controller.c (1.5 hours)
|
||||
- [ ] 2.3 Integration (30 min)
|
||||
- [ ] Test: Verify fast/slow loops run
|
||||
|
||||
3. **Afternoon (2-3 hours)**:
|
||||
- [ ] 3.1 Create hakmem_ace_ucb1.h (30 min)
|
||||
- [ ] 3.2 Create hakmem_ace_ucb1.c (45 min)
|
||||
- [ ] 3.3 Integration (30 min)
|
||||
- [ ] 4.1-4.3 Dynamic TLS capacity (1.5 hours)
|
||||
- [ ] 5.1-5.2 ON/OFF toggle (1 hour)
|
||||
|
||||
4. **Evening (1-2 hours)**:
|
||||
- [ ] Build and test complete system
|
||||
- [ ] Run fragmentation stress A/B test
|
||||
- [ ] Verify 2-3x improvement
|
||||
|
||||
---
|
||||
|
||||
## Success Criteria
|
||||
|
||||
Phase 1 is complete when:
|
||||
- ✅ Metrics collection works (throughput, LLC miss, mutex wait, backlog)
|
||||
- ✅ Fast loop adjusts TLS capacity based on LLC miss rate
|
||||
- ✅ UCB1 learning selects optimal knob values
|
||||
- ✅ Dynamic TLS capacity affects runtime behavior
|
||||
- ✅ ON/OFF toggle via `HAKMEM_ACE_ENABLED=1` works
|
||||
- ✅ **Benchmark improvement**: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x)
|
||||
- ✅ **No regression**: Mid MT maintains 110-115 M ops/s (±5%)
|
||||
|
||||
---
|
||||
|
||||
## Files to Create
|
||||
|
||||
New files (Phase 1):
|
||||
```
|
||||
core/hakmem_ace_metrics.h (80 lines)
|
||||
core/hakmem_ace_metrics.c (300 lines)
|
||||
core/hakmem_ace_controller.h (100 lines)
|
||||
core/hakmem_ace_controller.c (400 lines)
|
||||
core/hakmem_ace_ucb1.h (80 lines)
|
||||
core/hakmem_ace_ucb1.c (150 lines)
|
||||
```
|
||||
|
||||
Modified files:
|
||||
```
|
||||
core/hakmem_tiny_magazine.h (change TINY_TLS_MAG_CAP to array)
|
||||
core/hakmem_tiny_magazine.c (add setter, use dynamic capacity)
|
||||
core/hakmem.c (start ACE thread)
|
||||
core/hakmem_config.h (add ACE env vars)
|
||||
```
|
||||
|
||||
Test files:
|
||||
```
|
||||
tests/unit/test_ace_metrics.c (150 lines)
|
||||
tests/unit/test_ace_ucb1.c (120 lines)
|
||||
tests/integration/test_ace_e2e.c (200 lines)
|
||||
```
|
||||
|
||||
Scripts:
|
||||
```
|
||||
benchmarks/scripts/utils/ace_ab_test.sh (100 lines)
|
||||
```
|
||||
|
||||
**Total new code**: ~1,680 lines (Phase 1 only)
|
||||
|
||||
---
|
||||
|
||||
## Next Steps After Phase 1
|
||||
|
||||
Once Phase 1 is complete and validated:
|
||||
- **Phase 2**: Fragmentation countermeasures (budgeted scavenge, partial release)
|
||||
- **Phase 3**: Large WS countermeasures (auto diet, LLC miss optimization)
|
||||
- **Phase 4**: realloc optimization (in-place expansion, NT store)
|
||||
|
||||
---
|
||||
|
||||
**Status**: READY TO IMPLEMENT
|
||||
**Priority**: HIGH 🔥
|
||||
**Expected Impact**: 2-3x improvement on fragmentation stress
|
||||
**Risk**: LOW (isolated, ON/OFF toggle, no impact when disabled)
|
||||
|
||||
Let's build it! 💪
|
||||
539
docs/design/ATOMIC_FREELIST_IMPLEMENTATION_STRATEGY.md
Normal file
539
docs/design/ATOMIC_FREELIST_IMPLEMENTATION_STRATEGY.md
Normal file
@ -0,0 +1,539 @@
|
||||
# Atomic Freelist Implementation Strategy
|
||||
|
||||
## Executive Summary
|
||||
|
||||
**Good News**: Only **90 freelist access sites** (not 589), making full conversion feasible in 4-6 hours.
|
||||
|
||||
**Recommendation**: **Hybrid Approach** - Convert hot paths to lock-free atomic operations, use relaxed ordering for cold paths, skip debug/stats sites entirely.
|
||||
|
||||
**Expected Performance Impact**: <3% regression for atomic operations in hot paths.
|
||||
|
||||
---
|
||||
|
||||
## 1. Accessor Function Design
|
||||
|
||||
### Core API (in `core/box/slab_freelist_atomic.h`)
|
||||
|
||||
```c
|
||||
#ifndef SLAB_FREELIST_ATOMIC_H
|
||||
#define SLAB_FREELIST_ATOMIC_H
|
||||
|
||||
#include <stdatomic.h>
|
||||
#include "../superslab/superslab_types.h"
|
||||
|
||||
// ============================================================================
|
||||
// HOT PATH: Lock-Free Operations (use CAS for push/pop)
|
||||
// ============================================================================
|
||||
|
||||
// Atomic POP (lock-free, for refill hot path)
|
||||
// Returns NULL if freelist empty
|
||||
static inline void* slab_freelist_pop_lockfree(TinySlabMeta* meta, int class_idx) {
|
||||
void* head = atomic_load_explicit(&meta->freelist, memory_order_acquire);
|
||||
if (!head) return NULL;
|
||||
|
||||
void* next = tiny_next_read(class_idx, head);
|
||||
while (!atomic_compare_exchange_weak_explicit(
|
||||
&meta->freelist,
|
||||
&head, // Expected value (updated on failure)
|
||||
next, // Desired value
|
||||
memory_order_release, // Success ordering
|
||||
memory_order_acquire // Failure ordering (reload head)
|
||||
)) {
|
||||
// CAS failed (another thread modified freelist)
|
||||
if (!head) return NULL; // List became empty
|
||||
next = tiny_next_read(class_idx, head); // Reload next pointer
|
||||
}
|
||||
return head;
|
||||
}
|
||||
|
||||
// Atomic PUSH (lock-free, for free hot path)
|
||||
static inline void slab_freelist_push_lockfree(TinySlabMeta* meta, int class_idx, void* node) {
|
||||
void* head = atomic_load_explicit(&meta->freelist, memory_order_relaxed);
|
||||
do {
|
||||
tiny_next_write(class_idx, node, head); // Link node->next = head
|
||||
} while (!atomic_compare_exchange_weak_explicit(
|
||||
&meta->freelist,
|
||||
&head, // Expected value (updated on failure)
|
||||
node, // Desired value
|
||||
memory_order_release, // Success ordering
|
||||
memory_order_relaxed // Failure ordering
|
||||
));
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// WARM PATH: Relaxed Load/Store (single-threaded or low contention)
|
||||
// ============================================================================
|
||||
|
||||
// Simple load (relaxed ordering for checks/prefetch)
|
||||
static inline void* slab_freelist_load_relaxed(TinySlabMeta* meta) {
|
||||
return atomic_load_explicit(&meta->freelist, memory_order_relaxed);
|
||||
}
|
||||
|
||||
// Simple store (relaxed ordering for init/cleanup)
|
||||
static inline void slab_freelist_store_relaxed(TinySlabMeta* meta, void* value) {
|
||||
atomic_store_explicit(&meta->freelist, value, memory_order_relaxed);
|
||||
}
|
||||
|
||||
// NULL check (relaxed ordering)
|
||||
static inline bool slab_freelist_is_empty(TinySlabMeta* meta) {
|
||||
return atomic_load_explicit(&meta->freelist, memory_order_relaxed) == NULL;
|
||||
}
|
||||
|
||||
static inline bool slab_freelist_is_nonempty(TinySlabMeta* meta) {
|
||||
return atomic_load_explicit(&meta->freelist, memory_order_relaxed) != NULL;
|
||||
}
|
||||
|
||||
// ============================================================================
|
||||
// COLD PATH: Direct Access (for debug/stats - already atomic type)
|
||||
// ============================================================================
|
||||
|
||||
// For printf/debugging: cast to void* for printing
|
||||
#define SLAB_FREELIST_DEBUG_PTR(meta) \
|
||||
((void*)atomic_load_explicit(&(meta)->freelist, memory_order_relaxed))
|
||||
|
||||
#endif // SLAB_FREELIST_ATOMIC_H
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 2. Critical Site List (Top 20 - MUST Convert)
|
||||
|
||||
### Tier 1: Ultra-Hot Paths (5-10 ops/allocation)
|
||||
|
||||
1. **`core/tiny_superslab_alloc.inc.h:118-145`** - Fast alloc freelist pop
|
||||
2. **`core/hakmem_tiny_refill_p0.inc.h:252-253`** - P0 batch refill check
|
||||
3. **`core/box/carve_push_box.c:33-34, 120-121, 128-129`** - Carve rollback push
|
||||
4. **`core/hakmem_tiny_tls_ops.h:77-85`** - TLS freelist drain
|
||||
|
||||
### Tier 2: Hot Paths (1-2 ops/allocation)
|
||||
|
||||
5. **`core/tiny_refill_opt.h:199-230`** - Refill chain pop
|
||||
6. **`core/tiny_free_magazine.inc.h:135-136`** - Magazine free push
|
||||
7. **`core/box/carve_push_box.c:172-180`** - Freelist carve with push
|
||||
|
||||
### Tier 3: Warm Paths (0.1-1 ops/allocation)
|
||||
|
||||
8. **`core/refill/ss_refill_fc.h:151-153`** - FC refill pop
|
||||
9. **`core/hakmem_tiny_tls_ops.h:203`** - TLS freelist init
|
||||
10. **`core/slab_handle.h:211, 259, 308`** - Slab handle ops
|
||||
|
||||
**Total Critical Sites**: ~40-50 (out of 90 total)
|
||||
|
||||
---
|
||||
|
||||
## 3. Non-Critical Site Strategy
|
||||
|
||||
### Skip Entirely (10-15 sites)
|
||||
|
||||
- **Debug/Stats**: `core/box/ss_stats_box.c:79`, `core/tiny_debug.h:48`
|
||||
- **Reason**: Already atomic type, simple load for printing is fine
|
||||
- **Action**: Change `meta->freelist` → `SLAB_FREELIST_DEBUG_PTR(meta)`
|
||||
|
||||
- **Initialization** (already protected by single-threaded setup):
|
||||
- `core/box/ss_allocation_box.c:66` - Initial freelist setup
|
||||
- `core/hakmem_tiny_superslab.c` - SuperSlab init
|
||||
|
||||
### Use Relaxed Load/Store (20-30 sites)
|
||||
|
||||
- **Condition checks**: `if (meta->freelist)` → `if (slab_freelist_is_nonempty(meta))`
|
||||
- **Prefetch**: `__builtin_prefetch(&meta->freelist, 0, 3)` → keep as-is (atomic type is fine)
|
||||
- **Init/cleanup**: `meta->freelist = NULL` → `slab_freelist_store_relaxed(meta, NULL)`
|
||||
|
||||
### Convert to Lock-Free (10-20 sites)
|
||||
|
||||
- **All POP operations** in hot paths
|
||||
- **All PUSH operations** in free paths
|
||||
- **Carve rollback** operations
|
||||
|
||||
---
|
||||
|
||||
## 4. Phased Implementation Plan
|
||||
|
||||
### Phase 1: Hot Paths Only (2-3 hours) 🔥
|
||||
|
||||
**Goal**: Fix Larson 8T crash with minimal changes
|
||||
|
||||
**Files to modify** (5 files, ~25 sites):
|
||||
1. `core/tiny_superslab_alloc.inc.h` (fast alloc pop)
|
||||
2. `core/hakmem_tiny_refill_p0.inc.h` (P0 batch refill)
|
||||
3. `core/box/carve_push_box.c` (carve/rollback push)
|
||||
4. `core/hakmem_tiny_tls_ops.h` (TLS drain)
|
||||
5. Create `core/box/slab_freelist_atomic.h` (accessor API)
|
||||
|
||||
**Testing**:
|
||||
```bash
|
||||
./build.sh bench_random_mixed_hakmem
|
||||
./out/release/bench_random_mixed_hakmem 10000000 256 42 # Single-threaded baseline
|
||||
./build.sh larson_hakmem
|
||||
./out/release/larson_hakmem 8 100000 256 # 8 threads (expect no crash)
|
||||
```
|
||||
|
||||
**Expected Result**: Larson 8T stable, <5% regression on single-threaded
|
||||
|
||||
---
|
||||
|
||||
### Phase 2: All TLS Paths (2-3 hours) ⚡
|
||||
|
||||
**Goal**: Full MT safety for all allocation paths
|
||||
|
||||
**Files to modify** (10 files, ~40 sites):
|
||||
- All files from Phase 1 (complete conversion)
|
||||
- `core/tiny_refill_opt.h` (refill chain ops)
|
||||
- `core/tiny_free_magazine.inc.h` (magazine push)
|
||||
- `core/refill/ss_refill_fc.h` (FC refill)
|
||||
- `core/slab_handle.h` (slab handle ops)
|
||||
|
||||
**Testing**:
|
||||
```bash
|
||||
./build.sh bench_random_mixed_hakmem
|
||||
./out/release/bench_random_mixed_hakmem 10000000 256 42 # Baseline check
|
||||
./build.sh stress_test_mt_hakmem
|
||||
./out/release/stress_test_mt_hakmem 16 100000 # 16 threads stress test
|
||||
```
|
||||
|
||||
**Expected Result**: All MT tests pass, <3% regression
|
||||
|
||||
---
|
||||
|
||||
### Phase 3: Cleanup (1-2 hours) 🧹
|
||||
|
||||
**Goal**: Convert/document remaining sites
|
||||
|
||||
**Files to modify** (5 files, ~25 sites):
|
||||
- Debug/stats sites: Add `SLAB_FREELIST_DEBUG_PTR()` macro
|
||||
- Init/cleanup sites: Use `slab_freelist_store_relaxed()`
|
||||
- Add comments explaining MT safety assumptions
|
||||
|
||||
**Testing**:
|
||||
```bash
|
||||
make clean && make all # Full rebuild
|
||||
./run_all_tests.sh # Comprehensive test suite
|
||||
```
|
||||
|
||||
**Expected Result**: Clean build, all tests pass
|
||||
|
||||
---
|
||||
|
||||
## 5. Automated Conversion Script
|
||||
|
||||
### Semi-Automated Sed Script
|
||||
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# atomic_freelist_convert.sh - Phase 1 conversion helper
|
||||
|
||||
set -e
|
||||
|
||||
# Backup
|
||||
git stash
|
||||
git checkout -b atomic-freelist-phase1
|
||||
|
||||
# Step 1: Convert NULL checks (read-only, safe)
|
||||
find core -name "*.c" -o -name "*.h" | xargs sed -i \
|
||||
's/if (\([^)]*\)meta->freelist)/if (slab_freelist_is_nonempty(\1meta))/g'
|
||||
|
||||
# Step 2: Convert condition checks in while loops
|
||||
find core -name "*.c" -o -name "*.h" | xargs sed -i \
|
||||
's/while (\([^)]*\)meta->freelist)/while (slab_freelist_is_nonempty(\1meta))/g'
|
||||
|
||||
# Step 3: Show remaining manual conversions needed
|
||||
echo "=== REMAINING MANUAL CONVERSIONS ==="
|
||||
grep -rn "meta->freelist" core/ --include="*.c" --include="*.h" | \
|
||||
grep -v "slab_freelist_" | wc -l
|
||||
|
||||
echo "Review changes:"
|
||||
git diff --stat
|
||||
echo ""
|
||||
echo "If good: git commit -am 'Phase 1: Convert freelist NULL checks'"
|
||||
echo "If bad: git checkout . && git checkout master"
|
||||
```
|
||||
|
||||
**Limitations**:
|
||||
- Cannot auto-convert POP operations (need CAS loop)
|
||||
- Cannot auto-convert PUSH operations (need tiny_next_write + CAS)
|
||||
- Manual review required for all changes
|
||||
|
||||
---
|
||||
|
||||
## 6. Performance Projection
|
||||
|
||||
### Single-Threaded Impact
|
||||
|
||||
| Operation | Before | After (Relaxed) | After (CAS) | Overhead |
|
||||
|-----------|--------|-----------------|-------------|----------|
|
||||
| Load | 1 cycle | 1 cycle | 1 cycle | 0% |
|
||||
| Store | 1 cycle | 1 cycle | - | 0% |
|
||||
| POP (freelist) | 3-5 cycles | - | 8-12 cycles | +60-140% |
|
||||
| PUSH (freelist) | 3-5 cycles | - | 8-12 cycles | +60-140% |
|
||||
|
||||
**Expected Regression**:
|
||||
- Best case: 0-1% (mostly relaxed loads)
|
||||
- Worst case: 3-5% (CAS overhead in hot paths)
|
||||
- Realistic: 2-3% (good branch prediction, low contention)
|
||||
|
||||
**Mitigation**: Lock-free CAS is still faster than mutex (20-30 cycles)
|
||||
|
||||
### Multi-Threaded Impact
|
||||
|
||||
| Metric | Before (Non-Atomic) | After (Atomic) | Change |
|
||||
|--------|---------------------|----------------|--------|
|
||||
| Larson 8T | CRASH | Stable | ✅ FIXED |
|
||||
| Throughput (1T) | 25.1M ops/s | 24.4-24.8M ops/s | -1.2-2.8% |
|
||||
| Throughput (8T) | CRASH | ~18-20M ops/s | ✅ NEW |
|
||||
| Scalability | 0% (crashes) | 70-80% | ✅ GAIN |
|
||||
|
||||
**Expected Benefit**: Stability + MT scalability >> 2-3% single-threaded cost
|
||||
|
||||
---
|
||||
|
||||
## 7. Implementation Example (Phase 1)
|
||||
|
||||
### Before: `core/tiny_superslab_alloc.inc.h:117-145`
|
||||
|
||||
```c
|
||||
if (__builtin_expect(meta->freelist != NULL, 0)) {
|
||||
void* block = meta->freelist;
|
||||
if (meta->class_idx != class_idx) {
|
||||
meta->freelist = NULL;
|
||||
goto bump_path;
|
||||
}
|
||||
// ... pop logic ...
|
||||
meta->freelist = tiny_next_read(meta->class_idx, block);
|
||||
return (void*)((uint8_t*)block + 1);
|
||||
}
|
||||
```
|
||||
|
||||
### After: `core/tiny_superslab_alloc.inc.h:117-145`
|
||||
|
||||
```c
|
||||
if (__builtin_expect(slab_freelist_is_nonempty(meta), 0)) {
|
||||
void* block = slab_freelist_pop_lockfree(meta, class_idx);
|
||||
if (!block) {
|
||||
// Another thread won the race, fall through to bump path
|
||||
goto bump_path;
|
||||
}
|
||||
if (meta->class_idx != class_idx) {
|
||||
// Wrong class, return to freelist and go to bump path
|
||||
slab_freelist_push_lockfree(meta, class_idx, block);
|
||||
goto bump_path;
|
||||
}
|
||||
return (void*)((uint8_t*)block + 1);
|
||||
}
|
||||
```
|
||||
|
||||
**Changes**:
|
||||
- NULL check → `slab_freelist_is_nonempty()`
|
||||
- Manual pop → `slab_freelist_pop_lockfree()`
|
||||
- Handle CAS race (block == NULL case)
|
||||
- Simpler logic (CAS handles next pointer atomically)
|
||||
|
||||
---
|
||||
|
||||
## 8. Risk Assessment
|
||||
|
||||
### Low Risk ✅
|
||||
|
||||
- **Phase 1**: Only 5 files, ~25 sites, well-tested patterns
|
||||
- **Rollback**: Easy (`git checkout master`)
|
||||
- **Testing**: Can A/B test with env variable
|
||||
|
||||
### Medium Risk ⚠️
|
||||
|
||||
- **Performance**: 2-3% regression possible
|
||||
- **Subtle bugs**: CAS retry loops need careful review
|
||||
- **ABA problem**: mitigated by pointer tagging (already in codebase)
|
||||
|
||||
### High Risk ❌
|
||||
|
||||
- **None**: Atomic type already declared, no ABI changes
|
||||
|
||||
---
|
||||
|
||||
## 9. Alternative Approaches (Considered)
|
||||
|
||||
### Option A: Mutex per Slab (rejected)
|
||||
|
||||
**Pros**: Simple, guaranteed correctness
|
||||
**Cons**: 40-byte overhead per slab, 10-20x performance hit
|
||||
|
||||
### Option B: Global Lock (rejected)
|
||||
|
||||
**Pros**: Zero code changes, 1-line fix
|
||||
**Cons**: Serializes all allocation, kills MT performance
|
||||
|
||||
### Option C: TLS-Only (rejected)
|
||||
|
||||
**Pros**: No atomics needed
|
||||
**Cons**: Cannot handle remote free (required for MT)
|
||||
|
||||
### Option D: Hybrid (SELECTED) ✅
|
||||
|
||||
**Pros**: Best performance, incremental implementation
|
||||
**Cons**: More complex, requires careful memory ordering
|
||||
|
||||
---
|
||||
|
||||
## 10. Memory Ordering Rationale
|
||||
|
||||
### Relaxed (`memory_order_relaxed`)
|
||||
|
||||
**Use case**: Single-threaded or benign races (e.g., stats)
|
||||
**Cost**: 0 cycles (no fence)
|
||||
**Example**: `if (meta->freelist)` - checking emptiness
|
||||
|
||||
### Acquire (`memory_order_acquire`)
|
||||
|
||||
**Use case**: Loading pointer before dereferencing
|
||||
**Cost**: 1-2 cycles (read fence on some architectures)
|
||||
**Example**: POP freelist head before reading `next` pointer
|
||||
|
||||
### Release (`memory_order_release`)
|
||||
|
||||
**Use case**: Publishing pointer after setup
|
||||
**Cost**: 1-2 cycles (write fence on some architectures)
|
||||
**Example**: PUSH node to freelist after writing `next` pointer
|
||||
|
||||
### AcqRel (`memory_order_acq_rel`)
|
||||
|
||||
**Use case**: CAS success path (acquire+release)
|
||||
**Cost**: 2-4 cycles (full fence on some architectures)
|
||||
**Example**: Not used (separate acquire/release in CAS)
|
||||
|
||||
### SeqCst (`memory_order_seq_cst`)
|
||||
|
||||
**Use case**: Total ordering required
|
||||
**Cost**: 5-10 cycles (expensive fence)
|
||||
**Example**: Not needed for freelist (per-slab ordering sufficient)
|
||||
|
||||
**Chosen**: Acquire/Release for CAS, Relaxed for checks (optimal trade-off)
|
||||
|
||||
---
|
||||
|
||||
## 11. Testing Strategy
|
||||
|
||||
### Phase 1 Tests
|
||||
|
||||
```bash
|
||||
# Baseline (before conversion)
|
||||
./out/release/bench_random_mixed_hakmem 10000000 256 42
|
||||
# Record: 25.1M ops/s
|
||||
|
||||
# After conversion (expect: 24.4-24.8M ops/s)
|
||||
./out/release/bench_random_mixed_hakmem 10000000 256 42
|
||||
|
||||
# MT stability (expect: no crash)
|
||||
./out/release/larson_hakmem 8 100000 256
|
||||
|
||||
# Correctness (expect: 0 errors)
|
||||
./out/release/bench_fixed_size_hakmem 100000 256 128
|
||||
./out/release/bench_fixed_size_hakmem 100000 1024 128
|
||||
```
|
||||
|
||||
### Phase 2 Tests
|
||||
|
||||
```bash
|
||||
# Stress test all sizes
|
||||
for size in 128 256 512 1024; do
|
||||
./out/release/bench_random_mixed_hakmem 1000000 $size 42
|
||||
done
|
||||
|
||||
# MT scaling test
|
||||
for threads in 1 2 4 8 16; do
|
||||
./out/release/larson_hakmem $threads 100000 256
|
||||
done
|
||||
```
|
||||
|
||||
### Phase 3 Tests
|
||||
|
||||
```bash
|
||||
# Full test suite
|
||||
./run_all_tests.sh
|
||||
|
||||
# ASan build (detect races)
|
||||
./build.sh asan bench_random_mixed_hakmem
|
||||
./out/asan/bench_random_mixed_hakmem 100000 256 42
|
||||
|
||||
# TSan build (detect data races)
|
||||
./build.sh tsan larson_hakmem
|
||||
./out/tsan/larson_hakmem 8 10000 256
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 12. Success Criteria
|
||||
|
||||
### Phase 1 (Hot Paths)
|
||||
|
||||
- ✅ Larson 8T runs without crash (100K iterations)
|
||||
- ✅ Single-threaded regression <5% (24.0M+ ops/s)
|
||||
- ✅ No ASan/TSan warnings
|
||||
- ✅ Clean build with no warnings
|
||||
|
||||
### Phase 2 (All Paths)
|
||||
|
||||
- ✅ All MT tests pass (1T, 2T, 4T, 8T, 16T)
|
||||
- ✅ Single-threaded regression <3% (24.4M+ ops/s)
|
||||
- ✅ MT scaling 70%+ (8T = 5.6x+ speedup)
|
||||
- ✅ No memory leaks (Valgrind clean)
|
||||
|
||||
### Phase 3 (Complete)
|
||||
|
||||
- ✅ All 90 sites converted or documented
|
||||
- ✅ Full test suite passes (100% pass rate)
|
||||
- ✅ Code review approved
|
||||
- ✅ Documentation updated
|
||||
|
||||
---
|
||||
|
||||
## 13. Rollback Plan
|
||||
|
||||
If Phase 1 fails (>5% regression or instability):
|
||||
|
||||
```bash
|
||||
# Revert to master
|
||||
git checkout master
|
||||
git branch -D atomic-freelist-phase1
|
||||
|
||||
# Try alternative: Per-slab spinlock (medium overhead)
|
||||
# Add uint8_t lock field to TinySlabMeta
|
||||
# Use __sync_lock_test_and_set() for 1-byte spinlock
|
||||
# Expected: 5-10% overhead, but guaranteed correctness
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## 14. Next Steps
|
||||
|
||||
1. **Create accessor header** (`core/box/slab_freelist_atomic.h`) - 30 min
|
||||
2. **Phase 1 conversion** (5 files, ~25 sites) - 2-3 hours
|
||||
3. **Test Phase 1** (single + MT tests) - 1 hour
|
||||
4. **If pass**: Continue to Phase 2
|
||||
5. **If fail**: Review, fix, or rollback
|
||||
|
||||
**Estimated Total Time**: 4-6 hours for full implementation (all 3 phases)
|
||||
|
||||
---
|
||||
|
||||
## 15. Code Review Checklist
|
||||
|
||||
Before merging:
|
||||
|
||||
- [ ] All CAS loops handle retry correctly
|
||||
- [ ] Memory ordering documented for each site
|
||||
- [ ] No direct `meta->freelist` access remains (except debug)
|
||||
- [ ] All tests pass (single + MT)
|
||||
- [ ] ASan/TSan clean
|
||||
- [ ] Performance regression <3%
|
||||
- [ ] Documentation updated (CLAUDE.md)
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
**Approach**: Hybrid - Lock-free CAS for hot paths, relaxed atomics for cold paths
|
||||
**Effort**: 4-6 hours (3 phases)
|
||||
**Risk**: Low (incremental, easy rollback)
|
||||
**Performance**: -2-3% single-threaded, +MT stability and scalability
|
||||
**Benefit**: Unlocks MT performance without sacrificing single-threaded speed
|
||||
|
||||
**Recommendation**: Proceed with Phase 1 (2-3 hours) and evaluate results before committing to full implementation.
|
||||
423
docs/design/PHASE12_SHARED_SUPERSLAB_POOL_DESIGN.md
Normal file
423
docs/design/PHASE12_SHARED_SUPERSLAB_POOL_DESIGN.md
Normal file
@ -0,0 +1,423 @@
|
||||
# Phase 12: Shared SuperSlab Pool - Design Document
|
||||
|
||||
**Date**: 2025-11-13
|
||||
**Goal**: System malloc parity (90M ops/s) via mimalloc-style shared SuperSlab architecture
|
||||
**Expected Impact**: SuperSlab count 877 → 100-200 (-70-80%), +650-860% performance
|
||||
|
||||
---
|
||||
|
||||
## 🎯 Problem Statement
|
||||
|
||||
### Root Cause: Fixed Size Class Architecture
|
||||
|
||||
**Current Design** (Phase 11):
|
||||
```c
|
||||
// SuperSlab is bound to ONE size class
|
||||
struct SuperSlab {
|
||||
uint8_t size_class; // FIXED at allocation time (0-7)
|
||||
// ... 32 slabs, all for the SAME class
|
||||
};
|
||||
|
||||
// 8 independent SuperSlabHead structures (one per class)
|
||||
SuperSlabHead g_superslab_heads[8]; // Each class manages its own pool
|
||||
```
|
||||
|
||||
**Problem**:
|
||||
- Benchmark (100K iterations, 256B): **877 SuperSlabs allocated**
|
||||
- Memory usage: 877MB (877 × 1MB SuperSlabs)
|
||||
- Metadata overhead: 877 × ~2KB headers = ~1.8MB
|
||||
- **Each size class independently allocates SuperSlabs** → massive churn
|
||||
|
||||
**Why 877?**:
|
||||
```
|
||||
Class 0 (8B): ~100 SuperSlabs
|
||||
Class 1 (16B): ~120 SuperSlabs
|
||||
Class 2 (32B): ~150 SuperSlabs
|
||||
Class 3 (64B): ~180 SuperSlabs
|
||||
Class 4 (128B): ~140 SuperSlabs
|
||||
Class 5 (256B): ~187 SuperSlabs ← Target class for benchmark
|
||||
Class 6 (512B): ~80 SuperSlabs
|
||||
Class 7 (1KB): ~20 SuperSlabs
|
||||
Total: 877 SuperSlabs
|
||||
```
|
||||
|
||||
**Performance Impact**:
|
||||
- Massive metadata traversal overhead
|
||||
- Poor cache locality (877 scattered 1MB regions)
|
||||
- Excessive TLB pressure
|
||||
- SuperSlab allocation churn dominates runtime
|
||||
|
||||
---
|
||||
|
||||
## 🚀 Solution: Shared SuperSlab Pool (mimalloc-style)
|
||||
|
||||
### Core Concept
|
||||
|
||||
**New Design** (Phase 12):
|
||||
```c
|
||||
// SuperSlab is NOT bound to any class - slabs are dynamically assigned
|
||||
struct SuperSlab {
|
||||
// NO size_class field! Each slab has its own class_idx
|
||||
uint8_t active_slabs; // Number of active slabs (any class)
|
||||
uint32_t slab_bitmap; // 32-bit bitmap (1=active, 0=free)
|
||||
// ... 32 slabs, EACH can be a different size class
|
||||
};
|
||||
|
||||
// Single global pool (shared by all classes)
|
||||
typedef struct SharedSuperSlabPool {
|
||||
SuperSlab** slabs; // Array of all SuperSlabs
|
||||
uint32_t total_count; // Total SuperSlabs allocated
|
||||
uint32_t active_count; // SuperSlabs with active slabs
|
||||
pthread_mutex_t lock; // Allocation lock
|
||||
|
||||
// Per-class hints (fast path optimization)
|
||||
SuperSlab* class_hints[8]; // Last known SuperSlab with free space per class
|
||||
} SharedSuperSlabPool;
|
||||
```
|
||||
|
||||
### Per-Slab Dynamic Class Assignment
|
||||
|
||||
**Old** (TinySlabMeta):
|
||||
```c
|
||||
// Slab metadata (16 bytes) - class_idx inherited from SuperSlab
|
||||
typedef struct TinySlabMeta {
|
||||
void* freelist;
|
||||
uint16_t used;
|
||||
uint16_t capacity;
|
||||
uint16_t carved;
|
||||
uint16_t owner_tid;
|
||||
} TinySlabMeta;
|
||||
```
|
||||
|
||||
**New** (Phase 12):
|
||||
```c
|
||||
// Slab metadata (16 bytes) - class_idx is PER-SLAB
|
||||
typedef struct TinySlabMeta {
|
||||
void* freelist;
|
||||
uint16_t used;
|
||||
uint16_t capacity;
|
||||
uint16_t carved;
|
||||
uint8_t class_idx; // NEW: Dynamic class assignment (0-7, 255=unassigned)
|
||||
uint8_t owner_tid_low; // Truncated to 8-bit (from 16-bit)
|
||||
} TinySlabMeta;
|
||||
```
|
||||
|
||||
**Size preserved**: Still 16 bytes (no growth!)
|
||||
|
||||
---
|
||||
|
||||
## 📐 Architecture Changes
|
||||
|
||||
### 1. SuperSlab Structure (superslab_types.h)
|
||||
|
||||
**Remove**:
|
||||
```c
|
||||
uint8_t size_class; // DELETE - no longer per-SuperSlab
|
||||
```
|
||||
|
||||
**Add** (optional, for debugging):
|
||||
```c
|
||||
uint8_t mixed_slab_count; // Number of slabs with different class_idx (stats)
|
||||
```
|
||||
|
||||
### 2. TinySlabMeta Structure (superslab_types.h)
|
||||
|
||||
**Modify**:
|
||||
```c
|
||||
typedef struct TinySlabMeta {
|
||||
void* freelist;
|
||||
uint16_t used;
|
||||
uint16_t capacity;
|
||||
uint16_t carved;
|
||||
uint8_t class_idx; // NEW: 0-7 for active, 255=unassigned
|
||||
uint8_t owner_tid_low; // Changed from uint16_t owner_tid
|
||||
} TinySlabMeta;
|
||||
```
|
||||
|
||||
### 3. Shared Pool Structure (NEW: hakmem_shared_pool.h)
|
||||
|
||||
```c
|
||||
// Global shared pool (singleton)
|
||||
typedef struct SharedSuperSlabPool {
|
||||
SuperSlab** slabs; // Dynamic array of SuperSlab pointers
|
||||
uint32_t capacity; // Array capacity (grows as needed)
|
||||
uint32_t total_count; // Total SuperSlabs allocated
|
||||
uint32_t active_count; // SuperSlabs with >0 active slabs
|
||||
|
||||
pthread_mutex_t alloc_lock; // Lock for slab allocation
|
||||
|
||||
// Per-class hints (lock-free read, updated under lock)
|
||||
SuperSlab* class_hints[TINY_NUM_CLASSES];
|
||||
|
||||
// LRU cache integration (Phase 9)
|
||||
SuperSlab* lru_head;
|
||||
SuperSlab* lru_tail;
|
||||
uint32_t lru_count;
|
||||
} SharedSuperSlabPool;
|
||||
|
||||
// Global singleton
|
||||
extern SharedSuperSlabPool g_shared_pool;
|
||||
|
||||
// API
|
||||
void shared_pool_init(void);
|
||||
SuperSlab* shared_pool_acquire_superslab(void); // Get/allocate SuperSlab
|
||||
int shared_pool_acquire_slab(int class_idx, SuperSlab** ss_out, int* slab_idx_out);
|
||||
void shared_pool_release_slab(SuperSlab* ss, int slab_idx);
|
||||
```
|
||||
|
||||
### 4. Allocation Flow (NEW)
|
||||
|
||||
**Old Flow** (Phase 11):
|
||||
```
|
||||
1. TLS cache miss for class C
|
||||
2. Check g_superslab_heads[C].current_chunk
|
||||
3. If no space → allocate NEW SuperSlab for class C
|
||||
4. All 32 slabs in new SuperSlab belong to class C
|
||||
```
|
||||
|
||||
**New Flow** (Phase 12):
|
||||
```
|
||||
1. TLS cache miss for class C
|
||||
2. Check g_shared_pool.class_hints[C]
|
||||
3. If hint has free slab → assign that slab to class C (set class_idx=C)
|
||||
4. If no hint:
|
||||
a. Scan g_shared_pool.slabs[] for any SuperSlab with free slab
|
||||
b. If found → assign slab to class C
|
||||
c. If not found → allocate NEW SuperSlab (add to pool)
|
||||
5. Update class_hints[C] for fast path
|
||||
```
|
||||
|
||||
**Key Benefit**: NEW SuperSlab only allocated when ALL existing SuperSlabs are full!
|
||||
|
||||
---
|
||||
|
||||
## 🔧 Implementation Plan
|
||||
|
||||
### Phase 12-1: Dynamic Slab Metadata ✅ (Current Task)
|
||||
|
||||
**Files to modify**:
|
||||
- `core/superslab/superslab_types.h` - Add `class_idx` to TinySlabMeta
|
||||
- `core/superslab/superslab_types.h` - Remove `size_class` from SuperSlab
|
||||
|
||||
**Changes**:
|
||||
```c
|
||||
// TinySlabMeta: Add class_idx field
|
||||
typedef struct TinySlabMeta {
|
||||
void* freelist;
|
||||
uint16_t used;
|
||||
uint16_t capacity;
|
||||
uint16_t carved;
|
||||
uint8_t class_idx; // NEW: 0-7 for active, 255=UNASSIGNED
|
||||
uint8_t owner_tid_low; // Changed from uint16_t
|
||||
} TinySlabMeta;
|
||||
|
||||
// SuperSlab: Remove size_class
|
||||
typedef struct SuperSlab {
|
||||
uint64_t magic;
|
||||
// uint8_t size_class; // REMOVED!
|
||||
uint8_t active_slabs;
|
||||
uint8_t lg_size;
|
||||
uint8_t _pad0;
|
||||
// ... rest unchanged
|
||||
} SuperSlab;
|
||||
```
|
||||
|
||||
**Compatibility shim** (temporary, for gradual migration):
|
||||
```c
|
||||
// Provide backward-compatible size_class accessor
|
||||
static inline int superslab_get_class(SuperSlab* ss, int slab_idx) {
|
||||
return ss->slabs[slab_idx].class_idx;
|
||||
}
|
||||
```
|
||||
|
||||
### Phase 12-2: Shared Pool Infrastructure
|
||||
|
||||
**New file**: `core/hakmem_shared_pool.h`, `core/hakmem_shared_pool.c`
|
||||
|
||||
**Functionality**:
|
||||
- `shared_pool_init()` - Initialize global pool
|
||||
- `shared_pool_acquire_slab()` - Get free slab for class_idx
|
||||
- `shared_pool_release_slab()` - Mark slab as free (class_idx=255)
|
||||
- `shared_pool_gc()` - Garbage collect empty SuperSlabs
|
||||
|
||||
**Data structure**:
|
||||
```c
|
||||
// Global pool (singleton)
|
||||
SharedSuperSlabPool g_shared_pool = {
|
||||
.slabs = NULL,
|
||||
.capacity = 0,
|
||||
.total_count = 0,
|
||||
.active_count = 0,
|
||||
.alloc_lock = PTHREAD_MUTEX_INITIALIZER,
|
||||
.class_hints = {NULL},
|
||||
.lru_head = NULL,
|
||||
.lru_tail = NULL,
|
||||
.lru_count = 0
|
||||
};
|
||||
```
|
||||
|
||||
### Phase 12-3: Refill Path Integration
|
||||
|
||||
**Files to modify**:
|
||||
- `core/hakmem_tiny_refill_p0.inc.h` - Update to use shared pool
|
||||
- `core/tiny_superslab_alloc.inc.h` - Replace per-class allocation with shared pool
|
||||
|
||||
**Key changes**:
|
||||
```c
|
||||
// OLD: superslab_refill(int class_idx)
|
||||
static SuperSlab* superslab_refill_old(int class_idx) {
|
||||
SuperSlabHead* head = &g_superslab_heads[class_idx];
|
||||
// ... allocate SuperSlab for class_idx only
|
||||
}
|
||||
|
||||
// NEW: superslab_refill(int class_idx) - use shared pool
|
||||
static SuperSlab* superslab_refill_new(int class_idx) {
|
||||
SuperSlab* ss = NULL;
|
||||
int slab_idx = -1;
|
||||
|
||||
// Try to acquire a free slab from shared pool
|
||||
if (shared_pool_acquire_slab(class_idx, &ss, &slab_idx) == 0) {
|
||||
// SUCCESS: Got a slab assigned to class_idx
|
||||
return ss;
|
||||
}
|
||||
|
||||
// FAILURE: All SuperSlabs full, need to allocate new one
|
||||
// (This should be RARE after pool grows to steady-state)
|
||||
return NULL;
|
||||
}
|
||||
```
|
||||
|
||||
### Phase 12-4: Free Path Integration
|
||||
|
||||
**Files to modify**:
|
||||
- `core/tiny_free_fast.inc.h` - Update to handle dynamic class_idx
|
||||
- `core/tiny_superslab_free.inc.h` - Update to release slabs back to pool
|
||||
|
||||
**Key changes**:
|
||||
```c
|
||||
// OLD: Free assumes slab belongs to ss->size_class
|
||||
static inline void hak_tiny_free_superslab_old(void* ptr, SuperSlab* ss) {
|
||||
int class_idx = ss->size_class; // FIXED class
|
||||
// ... free logic
|
||||
}
|
||||
|
||||
// NEW: Free reads class_idx from slab metadata
|
||||
static inline void hak_tiny_free_superslab_new(void* ptr, SuperSlab* ss, int slab_idx) {
|
||||
int class_idx = ss->slabs[slab_idx].class_idx; // DYNAMIC class
|
||||
|
||||
// ... free logic
|
||||
|
||||
// If slab becomes empty, release back to pool
|
||||
if (ss->slabs[slab_idx].used == 0) {
|
||||
shared_pool_release_slab(ss, slab_idx);
|
||||
ss->slabs[slab_idx].class_idx = 255; // Mark as unassigned
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Phase 12-5: Testing & Benchmarking
|
||||
|
||||
**Validation**:
|
||||
1. **Correctness**: Run bench_fixed_size_hakmem 100K iterations (all classes)
|
||||
2. **SuperSlab count**: Monitor g_shared_pool.total_count (expect 100-200)
|
||||
3. **Performance**: bench_random_mixed_hakmem (expect 70-90M ops/s)
|
||||
|
||||
**Expected results**:
|
||||
| Metric | Phase 11 (Before) | Phase 12 (After) | Improvement |
|
||||
|--------|-------------------|------------------|-------------|
|
||||
| SuperSlab count | 877 | 100-200 | -70-80% |
|
||||
| Memory usage | 877MB | 100-200MB | -70-80% |
|
||||
| Metadata overhead | ~1.8MB | ~0.2-0.4MB | -78-89% |
|
||||
| Performance | 9.38M ops/s | 70-90M ops/s | +650-860% |
|
||||
|
||||
---
|
||||
|
||||
## ⚠️ Risk Analysis
|
||||
|
||||
### Complexity Risks
|
||||
|
||||
1. **Concurrency**: Shared pool requires careful locking
|
||||
- **Mitigation**: Per-class hints reduce contention (lock-free fast path)
|
||||
|
||||
2. **Fragmentation**: Mixed classes in same SuperSlab may increase fragmentation
|
||||
- **Mitigation**: Smart slab assignment (prefer same-class SuperSlabs)
|
||||
|
||||
3. **Debugging**: Dynamic class_idx makes debugging harder
|
||||
- **Mitigation**: Add runtime validation (class_idx sanity checks)
|
||||
|
||||
### Performance Risks
|
||||
|
||||
1. **Lock contention**: Shared pool lock may become bottleneck
|
||||
- **Mitigation**: Per-class hints + fast path bypass lock 90%+ of time
|
||||
|
||||
2. **Cache misses**: Accessing distant SuperSlabs may reduce locality
|
||||
- **Mitigation**: LRU cache keeps hot SuperSlabs resident
|
||||
|
||||
---
|
||||
|
||||
## 📊 Success Metrics
|
||||
|
||||
### Primary Goals
|
||||
|
||||
1. **SuperSlab count**: 877 → 100-200 (-70-80%) ✅
|
||||
2. **Performance**: 9.38M → 70-90M ops/s (+650-860%) ✅
|
||||
3. **Memory usage**: 877MB → 100-200MB (-70-80%) ✅
|
||||
|
||||
### Stretch Goals
|
||||
|
||||
1. **System malloc parity**: 90M ops/s (100% of target) 🎯
|
||||
2. **Scalability**: Maintain performance with 4T+ threads
|
||||
3. **Fragmentation**: <10% internal fragmentation
|
||||
|
||||
---
|
||||
|
||||
## 🔄 Migration Strategy
|
||||
|
||||
### Phase 12-1: Metadata (Low Risk)
|
||||
- Add `class_idx` to TinySlabMeta (16B preserved)
|
||||
- Remove `size_class` from SuperSlab
|
||||
- Add backward-compatible shim
|
||||
|
||||
### Phase 12-2: Infrastructure (Medium Risk)
|
||||
- Implement shared pool (NEW code, isolated)
|
||||
- No changes to existing paths yet
|
||||
|
||||
### Phase 12-3: Integration (High Risk)
|
||||
- Update refill path to use shared pool
|
||||
- Update free path to handle dynamic class_idx
|
||||
- **Critical**: Extensive testing required
|
||||
|
||||
### Phase 12-4: Cleanup (Low Risk)
|
||||
- Remove per-class SuperSlabHead structures
|
||||
- Remove backward-compatible shims
|
||||
- Final optimization pass
|
||||
|
||||
---
|
||||
|
||||
## 📝 Next Steps
|
||||
|
||||
### Immediate (Phase 12-1)
|
||||
|
||||
1. ✅ Update `superslab_types.h` - Add `class_idx` to TinySlabMeta
|
||||
2. ✅ Update `superslab_types.h` - Remove `size_class` from SuperSlab
|
||||
3. Add backward-compatible shim `superslab_get_class()`
|
||||
4. Fix compilation errors (grep for `ss->size_class`)
|
||||
|
||||
### Next (Phase 12-2)
|
||||
|
||||
1. Implement `hakmem_shared_pool.h/c`
|
||||
2. Write unit tests for shared pool
|
||||
3. Integrate with LRU cache (Phase 9)
|
||||
|
||||
### Then (Phase 12-3+)
|
||||
|
||||
1. Update refill path
|
||||
2. Update free path
|
||||
3. Benchmark & validate
|
||||
4. Cleanup & optimize
|
||||
|
||||
---
|
||||
|
||||
**Status**: 🚧 Phase 12-1 (Metadata) - IN PROGRESS
|
||||
**Expected completion**: Phase 12-1 today, Phase 12-2 tomorrow, Phase 12-3 day after
|
||||
**Total estimated time**: 3-4 days for full implementation
|
||||
235
docs/design/PHASE7_ACTION_PLAN.md
Normal file
235
docs/design/PHASE7_ACTION_PLAN.md
Normal file
@ -0,0 +1,235 @@
|
||||
# Phase 7: Immediate Action Plan
|
||||
|
||||
**Date:** 2025-11-08
|
||||
**Status:** 🔥 CRITICAL OPTIMIZATION REQUIRED
|
||||
|
||||
---
|
||||
|
||||
## TL;DR
|
||||
|
||||
Phase 7 works but is **40x slower** than System malloc due to `mincore()` overhead.
|
||||
|
||||
**Fix:** Replace `mincore()` with alignment check (99.9% cases) + `mincore()` fallback (0.1% cases)
|
||||
|
||||
**Impact:** 634 cycles → 1-2 cycles (**317x faster!**)
|
||||
|
||||
**Time:** 1-2 hours
|
||||
|
||||
---
|
||||
|
||||
## Critical Finding
|
||||
|
||||
```
|
||||
Current: mincore() on EVERY free = 634 cycles
|
||||
Target: System malloc tcache = 10-15 cycles
|
||||
Result: Phase 7 is 40x SLOWER!
|
||||
```
|
||||
|
||||
**Micro-Benchmark Proof:**
|
||||
```
|
||||
[MINCORE] Mapped memory: 634 cycles/call
|
||||
[ALIGN] Alignment check: 0 cycles/call
|
||||
[HYBRID] Align + mincore: 1 cycles/call ← SOLUTION!
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## The Fix (1-2 Hours)
|
||||
|
||||
### Step 1: Add Helper (core/hakmem_internal.h)
|
||||
|
||||
Add after line 294:
|
||||
|
||||
```c
|
||||
// Fast path: Check if ptr-1 is likely accessible (99.9% cases)
|
||||
// Returns: 1 if ptr-1 is NOT near page boundary (safe to read)
|
||||
static inline int is_likely_valid_header(void* ptr) {
|
||||
uintptr_t p = (uintptr_t)ptr;
|
||||
// Check: ptr-1 is NOT within first 16 bytes of a page
|
||||
// Most allocations are NOT at page boundaries
|
||||
return (p & 0xFFF) >= 16; // 1 cycle
|
||||
}
|
||||
```
|
||||
|
||||
### Step 2: Optimize Fast Free (core/tiny_free_fast_v2.inc.h)
|
||||
|
||||
Replace lines 53-60 with:
|
||||
|
||||
```c
|
||||
// OPTIMIZED: Hybrid check (1-2 cycles effective)
|
||||
void* header_addr = (char*)ptr - 1;
|
||||
|
||||
// Fast path: Alignment check (99.9% cases, 1 cycle)
|
||||
if (__builtin_expect(!is_likely_valid_header(ptr), 0)) {
|
||||
// Slow path: Page boundary case (0.1% cases, 634 cycles)
|
||||
extern int hak_is_memory_readable(void* addr);
|
||||
if (!hak_is_memory_readable(header_addr)) {
|
||||
return 0; // Header not accessible
|
||||
}
|
||||
}
|
||||
|
||||
// Header is accessible (either by alignment or mincore check)
|
||||
int class_idx = tiny_region_id_read_header(ptr);
|
||||
```
|
||||
|
||||
### Step 3: Optimize Dual-Header Dispatch (core/box/hak_free_api.inc.h)
|
||||
|
||||
Replace lines 94-96 with:
|
||||
|
||||
```c
|
||||
// SAFETY: Check if raw header is accessible before dereferencing
|
||||
if (!is_likely_valid_header((char*)ptr + HEADER_SIZE)) {
|
||||
// Page boundary: use mincore fallback
|
||||
if (!hak_is_memory_readable(raw)) {
|
||||
// Header not accessible, continue to slow path
|
||||
goto mid_l25_lookup;
|
||||
}
|
||||
}
|
||||
|
||||
AllocHeader* hdr = (AllocHeader*)raw;
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Testing (30 Minutes)
|
||||
|
||||
### Test 1: Verify Optimization
|
||||
```bash
|
||||
./micro_mincore_bench
|
||||
# Expected: [HYBRID] 1 cycles/call (vs 634 before)
|
||||
```
|
||||
|
||||
### Test 2: Larson Smoke Test
|
||||
```bash
|
||||
make clean && make larson_hakmem
|
||||
./larson_hakmem 1 8 128 1024 1 12345 1
|
||||
# Expected: 40-60M ops/s (vs 0.8M before = 50x improvement!)
|
||||
```
|
||||
|
||||
### Test 3: Stability Check
|
||||
```bash
|
||||
# 10-minute continuous test
|
||||
timeout 600 bash -c 'while true; do ./larson_hakmem 10 8 128 1024 1 $RANDOM 4 || break; done'
|
||||
# Expected: No crashes
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Why This Works
|
||||
|
||||
**Problem:**
|
||||
- Page boundary allocations: <0.1% frequency
|
||||
- But we pay `mincore()` cost (634 cycles) on 100% of frees
|
||||
|
||||
**Solution:**
|
||||
- Alignment check: 1 cycle, 99.9% cases
|
||||
- mincore fallback: 634 cycles, 0.1% cases
|
||||
- **Effective cost:** 0.999 * 1 + 0.001 * 634 = **1.6 cycles**
|
||||
|
||||
**Result:** 634 → 1.6 cycles = **396x faster!**
|
||||
|
||||
---
|
||||
|
||||
## Expected Results
|
||||
|
||||
### Performance (After Fix)
|
||||
|
||||
| Benchmark | Before (ops/s) | After (ops/s) | Improvement |
|
||||
|-----------|----------------|---------------|-------------|
|
||||
| Larson 1T | 0.8M | 40-60M | **50-75x** 🚀 |
|
||||
| Larson 4T | 0.8M | 120-180M | **150-225x** 🚀 |
|
||||
| vs System malloc | -95% | **+20-50%** | **Competitive!** ✅ |
|
||||
|
||||
### Memory Overhead
|
||||
|
||||
| Size | Header | Overhead |
|
||||
|------|--------|----------|
|
||||
| 8B | 1B | 12.5% (but 0% in Slab[0]) |
|
||||
| 128B | 1B | 0.78% |
|
||||
| 512B | 1B | 0.20% |
|
||||
| **Average** | 1B | **<3%** (vs System's 10-15%) |
|
||||
|
||||
---
|
||||
|
||||
## Success Criteria
|
||||
|
||||
**Minimum (GO/NO-GO):**
|
||||
- ✅ Micro-benchmark: 1-2 cycles (hybrid)
|
||||
- ✅ Larson: ≥20M ops/s (minimum viable)
|
||||
- ✅ No crashes (10-minute stress test)
|
||||
|
||||
**Target:**
|
||||
- ✅ Larson: ≥40M ops/s (2x System)
|
||||
- ✅ Memory: ≤System * 1.05 (RSS)
|
||||
- ✅ Stability: 100% (no crashes)
|
||||
|
||||
**Stretch:**
|
||||
- ✅ Beat mimalloc (if possible)
|
||||
- ✅ 50M+ ops/s (Larson 1T)
|
||||
|
||||
---
|
||||
|
||||
## Risks
|
||||
|
||||
| Risk | Probability | Mitigation |
|
||||
|------|-------------|------------|
|
||||
| False positives (alignment check) | Very Low | Magic validation catches them |
|
||||
| Still slower than System | Low | Micro-benchmark proves 1-2 cycles |
|
||||
| 1024B fallback impacts score | Medium | Measure frequency, optimize if >10% |
|
||||
|
||||
**Overall Risk:** LOW (proven by micro-benchmark)
|
||||
|
||||
---
|
||||
|
||||
## Timeline
|
||||
|
||||
| Phase | Duration | Deliverable |
|
||||
|-------|----------|-------------|
|
||||
| **1. Implement** | 1-2 hours | Code changes (3 files) |
|
||||
| **2. Test** | 30 min | Micro + Larson smoke |
|
||||
| **3. Validate** | 2-3 hours | Full benchmark suite |
|
||||
| **4. Deploy** | 1 day | Production-ready |
|
||||
|
||||
**Total:** 1-2 days to production
|
||||
|
||||
---
|
||||
|
||||
## Next Steps
|
||||
|
||||
1. ✅ Read this document
|
||||
2. ⏳ Implement optimization (Step 1-3 above)
|
||||
3. ⏳ Run tests (micro + Larson)
|
||||
4. ⏳ Full benchmark suite
|
||||
5. ⏳ Compare with mimalloc
|
||||
6. ⏳ Deploy!
|
||||
|
||||
---
|
||||
|
||||
## References
|
||||
|
||||
- **Full Report:** `PHASE7_DESIGN_REVIEW.md` (758 lines)
|
||||
- **Micro-Benchmark:** `tests/micro_mincore_bench.c`
|
||||
- **Code Locations:**
|
||||
- `core/hakmem_internal.h:294` (add helper)
|
||||
- `core/tiny_free_fast_v2.inc.h:53-60` (optimize)
|
||||
- `core/box/hak_free_api.inc.h:94-96` (optimize)
|
||||
|
||||
---
|
||||
|
||||
## Questions?
|
||||
|
||||
**Q: Why not remove mincore entirely?**
|
||||
A: Need it for page boundary cases (0.1%), otherwise SEGV.
|
||||
|
||||
**Q: What about false positives?**
|
||||
A: Magic byte validation catches them (line 75 in tiny_region_id.h).
|
||||
|
||||
**Q: Will this work on ARM/other platforms?**
|
||||
A: Yes, alignment check is portable (bitwise AND).
|
||||
|
||||
**Q: What if it's still slow?**
|
||||
A: Micro-benchmark proves 1-2 cycles. If slow, something else is wrong.
|
||||
|
||||
---
|
||||
|
||||
**GO BUILD IT!** 🚀
|
||||
758
docs/design/PHASE7_DESIGN_REVIEW.md
Normal file
758
docs/design/PHASE7_DESIGN_REVIEW.md
Normal file
@ -0,0 +1,758 @@
|
||||
# Phase 7 Region-ID Direct Lookup: Complete Design Review
|
||||
|
||||
**Date:** 2025-11-08
|
||||
**Reviewer:** Claude (Task Agent Ultrathink)
|
||||
**Status:** CRITICAL BOTTLENECK IDENTIFIED - OPTIMIZATION REQUIRED BEFORE BENCHMARKING
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
Phase 7 successfully eliminated the SuperSlab lookup bottleneck and achieved crash-free operation, but introduces a **CRITICAL performance bottleneck** that will prevent it from beating System malloc:
|
||||
|
||||
- **mincore() overhead:** 634 cycles/call (measured)
|
||||
- **System malloc tcache:** 10-15 cycles (target)
|
||||
- **Phase 7 current:** 634 + 5-10 = 639-644 cycles (**40x slower than System!**)
|
||||
|
||||
**Verdict:** **NO-GO for benchmarking without optimization**
|
||||
|
||||
**Recommended fix:** Hybrid approach (alignment check + mincore fallback) → 1-2 cycles effective overhead
|
||||
|
||||
---
|
||||
|
||||
## 1. Critical Bottlenecks (Immediate Action Required)
|
||||
|
||||
### 1.1 mincore() Syscall Overhead 🔥🔥🔥
|
||||
|
||||
**Location:** `core/tiny_free_fast_v2.inc.h:53-60`
|
||||
**Severity:** CRITICAL (blocks deployment)
|
||||
**Performance Impact:** 634 cycles (measured) = **6340% overhead vs target (10 cycles)**
|
||||
|
||||
**Current Implementation:**
|
||||
```c
|
||||
// Line 53-60
|
||||
void* header_addr = (char*)ptr - 1;
|
||||
extern int hak_is_memory_readable(void* addr);
|
||||
if (__builtin_expect(!hak_is_memory_readable(header_addr), 0)) {
|
||||
return 0; // Non-accessible, route to slow path
|
||||
}
|
||||
```
|
||||
|
||||
**Problem:**
|
||||
- `hak_is_memory_readable()` calls `mincore()` syscall (634 cycles measured)
|
||||
- Called on **EVERY free()** (not just edge cases!)
|
||||
- System malloc tcache = 10-15 cycles total
|
||||
- Phase 7 with mincore = 639-644 cycles total (**40x slower!**)
|
||||
|
||||
**Micro-Benchmark Results:**
|
||||
```
|
||||
[MINCORE] Mapped memory: 634 cycles/call (overhead: 6340%)
|
||||
[ALIGN] Alignment check: 0 cycles/call (overhead: 0%)
|
||||
[HYBRID] Align + mincore: 1 cycles/call (overhead: 10%)
|
||||
[BOUNDARY] Page boundary: 2155 cycles/call (but <0.1% frequency)
|
||||
```
|
||||
|
||||
**Root Cause:**
|
||||
The check is overly conservative. Page boundary allocations are **extremely rare** (<0.1%), but we pay the cost for 100% of frees.
|
||||
|
||||
**Solution: Hybrid Approach (1-2 cycles effective)**
|
||||
|
||||
```c
|
||||
// Fast path: Alignment-based heuristic (1 cycle, 99.9% cases)
|
||||
static inline int is_likely_valid_header(void* ptr) {
|
||||
uintptr_t p = (uintptr_t)ptr;
|
||||
// Most allocations are NOT at page boundaries
|
||||
// Check: ptr-1 is NOT within first 16 bytes of a page
|
||||
return (p & 0xFFF) >= 16; // 1 cycle
|
||||
}
|
||||
|
||||
// Phase 7 Fast Free (optimized)
|
||||
static inline int hak_tiny_free_fast_v2(void* ptr) {
|
||||
if (__builtin_expect(!ptr, 0)) return 0;
|
||||
|
||||
// OPTIMIZED: Hybrid check (1-2 cycles effective)
|
||||
void* header_addr = (char*)ptr - 1;
|
||||
|
||||
// Fast path: Alignment check (99.9% cases)
|
||||
if (__builtin_expect(is_likely_valid_header(ptr), 1)) {
|
||||
// Header is almost certainly accessible
|
||||
// (False positive rate: <0.01%, handled by magic validation)
|
||||
goto read_header;
|
||||
}
|
||||
|
||||
// Slow path: Page boundary case (0.1% cases)
|
||||
extern int hak_is_memory_readable(void* addr);
|
||||
if (!hak_is_memory_readable(header_addr)) {
|
||||
return 0; // Actually unmapped
|
||||
}
|
||||
|
||||
read_header:
|
||||
int class_idx = tiny_region_id_read_header(ptr);
|
||||
// ... rest of fast path (5-10 cycles)
|
||||
}
|
||||
```
|
||||
|
||||
**Performance Comparison:**
|
||||
|
||||
| Approach | Cycles/call | Overhead vs System (10-15 cycles) |
|
||||
|----------|-------------|-----------------------------------|
|
||||
| Current (mincore always) | 639-644 | **40x slower** ❌ |
|
||||
| Alignment only | 5-10 | 0.33-1.0x (target) ✅ |
|
||||
| Hybrid (align + mincore fallback) | 6-12 | 0.4-1.2x (acceptable) ✅ |
|
||||
|
||||
**Implementation Cost:** 1-2 hours (add helper, modify line 53-60)
|
||||
|
||||
**Expected Improvement:**
|
||||
- Free path: 639-644 → 6-12 cycles (**53x faster!**)
|
||||
- Larson score: 0.8M → **40-60M ops/s** (predicted)
|
||||
|
||||
---
|
||||
|
||||
### 1.2 1024B Allocation Strategy 🔥
|
||||
|
||||
**Location:** `core/hakmem_tiny.h:247-249`, `core/box/hak_alloc_api.inc.h:35-49`
|
||||
**Severity:** HIGH (performance loss for common size)
|
||||
**Performance Impact:** -50% for 1024B allocations (frequent in benchmarks)
|
||||
|
||||
**Current Behavior:**
|
||||
```c
|
||||
// core/hakmem_tiny.h:247-249
|
||||
#if HAKMEM_TINY_HEADER_CLASSIDX
|
||||
// Phase 7: 1024B requires header (1B) + user data (1024B) = 1025B
|
||||
// Class 7 blocks are only 1024B, so 1024B requests must use Mid allocator
|
||||
if (size >= 1024) return -1; // Reject 1024B!
|
||||
#endif
|
||||
```
|
||||
|
||||
**Result:** 1024B allocations fall through to malloc fallback (16-byte header, no fast path)
|
||||
|
||||
**Problem:**
|
||||
- 1024B is the **most frequent power-of-2 size** in many workloads
|
||||
- Larson uses 128B (good) but bench_random_mixed uses up to 4096B (includes 1024B)
|
||||
- Fallback path: malloc → 16-byte header → slow free → **misses all Phase 7 benefits**
|
||||
|
||||
**Why 1024B is Rejected:**
|
||||
- Class 7 block size: 1024B (fixed by SuperSlab design)
|
||||
- User request: 1024B
|
||||
- Phase 7 header: 1B
|
||||
- Total needed: 1024 + 1 = 1025B > 1024B → **doesn't fit!**
|
||||
|
||||
**Options Analysis:**
|
||||
|
||||
| Option | Pros | Cons | Implementation Cost |
|
||||
|--------|------|------|---------------------|
|
||||
| **A: 1024B class with 2-byte header** | Clean, supports 1024B | Wastes 1B/block (1022B usable) | 2-3 days (header redesign) |
|
||||
| **B: Mid-pool optimization** | Reuses existing infrastructure | Still slower than Tiny | 1 week (Mid fast path) |
|
||||
| **C: Keep malloc fallback** | Simple, no code change | Loses performance on 1024B | 0 (current) |
|
||||
| **D: Reduce max to 512B** | Simplifies Phase 7 | Loses 1024B entirely | 1 hour (config change) |
|
||||
|
||||
**Frequency Analysis (Needed):**
|
||||
```bash
|
||||
# Run benchmarks with size histogram
|
||||
HAKMEM_SIZE_HIST=1 ./larson_hakmem 10 8 128 1024 1 12345 4
|
||||
HAKMEM_SIZE_HIST=1 ./bench_random_mixed_hakmem 10000 4096 1234567
|
||||
|
||||
# Check: How often is 1024B requested?
|
||||
# If <5%: Option C (keep fallback) is fine
|
||||
# If >10%: Option A or B required
|
||||
```
|
||||
|
||||
**Recommendation:** **Measure first, optimize if needed**
|
||||
- Priority: LOW (after mincore fix)
|
||||
- Action: Add size histogram, check 1024B frequency
|
||||
- If <5%: Accept current behavior (Option C)
|
||||
- If >10%: Implement Option A (2-byte header for class 7)
|
||||
|
||||
---
|
||||
|
||||
## 2. Design Concerns (Non-Critical)
|
||||
|
||||
### 2.1 Header Validation in Release Builds
|
||||
|
||||
**Location:** `core/tiny_region_id.h:75-85`
|
||||
**Issue:** Magic byte validation enabled even in release builds
|
||||
|
||||
**Current:**
|
||||
```c
|
||||
// CRITICAL: Always validate magic byte (even in release builds)
|
||||
uint8_t magic = header & 0xF0;
|
||||
if (magic != HEADER_MAGIC) {
|
||||
return -1; // Invalid header
|
||||
}
|
||||
```
|
||||
|
||||
**Concern:** Validation adds 1-2 cycles (compare + branch)
|
||||
|
||||
**Counter-Argument:**
|
||||
- **CORRECT DESIGN** - Must validate to distinguish Tiny from Mid/Large allocations
|
||||
- Without validation: Mid/Large free → reads garbage header → crashes
|
||||
- Cost: 1-2 cycles (acceptable for safety)
|
||||
|
||||
**Verdict:** Keep as-is (validation is essential)
|
||||
|
||||
---
|
||||
|
||||
### 2.2 Dual-Header Dispatch Completeness
|
||||
|
||||
**Location:** `core/box/hak_free_api.inc.h:77-119`
|
||||
**Issue:** Are all allocation methods covered?
|
||||
|
||||
**Current Flow:**
|
||||
```
|
||||
Step 1: Try 1-byte Tiny header (Phase 7)
|
||||
↓ Miss
|
||||
Step 2: Try 16-byte AllocHeader (malloc/mmap)
|
||||
↓ Miss (or unmapped)
|
||||
Step 3: SuperSlab lookup (legacy Tiny)
|
||||
↓ Miss
|
||||
Step 4: Mid/L25 registry lookup
|
||||
↓ Miss
|
||||
Step 5: Error handling (libc fallback or leak warning)
|
||||
```
|
||||
|
||||
**Coverage Analysis:**
|
||||
|
||||
| Allocation Method | Header Type | Dispatch Step | Coverage |
|
||||
|-------------------|-------------|---------------|----------|
|
||||
| Tiny (Phase 7) | 1-byte | Step 1 | ✅ Covered |
|
||||
| Malloc fallback | 16-byte | Step 2 | ✅ Covered |
|
||||
| Mmap | 16-byte | Step 2 | ✅ Covered |
|
||||
| Mid pool | None | Step 4 | ✅ Covered |
|
||||
| L25 pool | None | Step 4 | ✅ Covered |
|
||||
| Tiny (legacy, no header) | None | Step 3 | ✅ Covered |
|
||||
| Libc (LD_PRELOAD) | None | Step 5 | ✅ Covered |
|
||||
|
||||
**Step 2 Coverage Check (Lines 89-113):**
|
||||
```c
|
||||
// SAFETY: Check if raw header is accessible before dereferencing
|
||||
if (hak_is_memory_readable(raw)) { // ← Same mincore issue!
|
||||
AllocHeader* hdr = (AllocHeader*)raw;
|
||||
if (hdr->magic == HAKMEM_MAGIC) {
|
||||
if (hdr->method == ALLOC_METHOD_MALLOC) {
|
||||
extern void __libc_free(void*);
|
||||
__libc_free(raw); // ✅ Correct
|
||||
goto done;
|
||||
}
|
||||
// Other methods handled below
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Issue:** Step 2 also uses `hak_is_memory_readable()` → same 634-cycle overhead!
|
||||
|
||||
**Impact:**
|
||||
- Step 2 frequency: ~1-5% (malloc fallback for 1024B, large allocs)
|
||||
- Hybrid optimization will fix this too (same code path)
|
||||
|
||||
**Verdict:** Complete coverage, but Step 2 needs hybrid optimization too
|
||||
|
||||
---
|
||||
|
||||
### 2.3 Fast Path Hit Rate Estimation
|
||||
|
||||
**Expected Hit Rates (by step):**
|
||||
|
||||
| Step | Path | Expected Frequency | Cycles (current) | Cycles (optimized) |
|
||||
|------|------|-------------------|------------------|-------------------|
|
||||
| 1 | Phase 7 Tiny header | 80-90% | 639-644 | 6-12 ✅ |
|
||||
| 2 | 16-byte header (malloc/mmap) | 5-10% | 639-644 | 6-12 ✅ |
|
||||
| 3 | SuperSlab lookup (legacy) | 0-5% | 500+ | 500+ (rare) |
|
||||
| 4 | Mid/L25 lookup | 3-5% | 200-300 | 200-300 (acceptable) |
|
||||
| 5 | Error handling | <0.1% | Varies | Varies (negligible) |
|
||||
|
||||
**Weighted Average (current):**
|
||||
```
|
||||
0.85 * 639 + 0.08 * 639 + 0.05 * 500 + 0.02 * 250 = 643 cycles
|
||||
```
|
||||
|
||||
**Weighted Average (optimized):**
|
||||
```
|
||||
0.85 * 8 + 0.08 * 8 + 0.05 * 500 + 0.02 * 250 = 37 cycles
|
||||
```
|
||||
|
||||
**Improvement:** 643 → 37 cycles (**17x faster!**)
|
||||
|
||||
**Verdict:** Optimization is MANDATORY for competitive performance
|
||||
|
||||
---
|
||||
|
||||
## 3. Memory Overhead Analysis
|
||||
|
||||
### 3.1 Theoretical Overhead (from `tiny_region_id.h:140-151`)
|
||||
|
||||
| Block Size | Header | Total | Overhead % |
|
||||
|------------|--------|-------|------------|
|
||||
| 8B (class 0) | 1B | 9B | 12.5% |
|
||||
| 16B (class 1) | 1B | 17B | 6.25% |
|
||||
| 32B (class 2) | 1B | 33B | 3.12% |
|
||||
| 64B (class 3) | 1B | 65B | 1.56% |
|
||||
| 128B (class 4) | 1B | 129B | 0.78% |
|
||||
| 256B (class 5) | 1B | 257B | 0.39% |
|
||||
| 512B (class 6) | 1B | 513B | 0.20% |
|
||||
|
||||
**Note:** Class 0 (8B) has special handling: reuses 960B padding in Slab[0] → 0% overhead
|
||||
|
||||
### 3.2 Workload-Weighted Overhead
|
||||
|
||||
**Typical workload distribution** (based on Larson, bench_random_mixed):
|
||||
- Small (8-64B): 60% → avg 5% overhead
|
||||
- Medium (128-512B): 35% → avg 0.5% overhead
|
||||
- Large (1024B): 5% → malloc fallback (16-byte header)
|
||||
|
||||
**Weighted average:** `0.60 * 5% + 0.35 * 0.5% + 0.05 * N/A = 3.2%`
|
||||
|
||||
**vs System malloc:**
|
||||
- System: 8-16 bytes/allocation (depends on size)
|
||||
- 128B alloc: System = 16B/128B = 12.5%, HAKMEM = 1B/128B = 0.78% (**16x better!**)
|
||||
|
||||
**Verdict:** Memory overhead is excellent (<3.2% avg vs System's 10-15%)
|
||||
|
||||
### 3.3 Actual Memory Usage (TODO: Measure)
|
||||
|
||||
**Measurement Plan:**
|
||||
```bash
|
||||
# RSS comparison (Larson)
|
||||
ps aux | grep larson_hakmem # HAKMEM
|
||||
ps aux | grep larson_system # System
|
||||
|
||||
# Detailed memory tracking
|
||||
HAKMEM_MEM_TRACE=1 ./larson_hakmem 10 8 128 1024 1 12345 4
|
||||
```
|
||||
|
||||
**Success Criteria:**
|
||||
- HAKMEM RSS ≤ System RSS * 1.05 (5% margin)
|
||||
- No memory leaks (Valgrind clean)
|
||||
|
||||
---
|
||||
|
||||
## 4. Optimization Opportunities
|
||||
|
||||
### 4.1 URGENT: Hybrid mincore Optimization 🚀
|
||||
|
||||
**Impact:** 17x performance improvement (643 → 37 cycles)
|
||||
**Effort:** 1-2 hours
|
||||
**Priority:** CRITICAL (blocks deployment)
|
||||
|
||||
**Implementation:**
|
||||
```c
|
||||
// core/hakmem_internal.h (add helper)
|
||||
static inline int is_likely_valid_header(void* ptr) {
|
||||
uintptr_t p = (uintptr_t)ptr;
|
||||
return (p & 0xFFF) >= 16; // Not near page boundary
|
||||
}
|
||||
|
||||
// core/tiny_free_fast_v2.inc.h (modify line 53-60)
|
||||
static inline int hak_tiny_free_fast_v2(void* ptr) {
|
||||
if (__builtin_expect(!ptr, 0)) return 0;
|
||||
|
||||
void* header_addr = (char*)ptr - 1;
|
||||
|
||||
// Hybrid check: alignment (99.9%) + mincore fallback (0.1%)
|
||||
if (__builtin_expect(!is_likely_valid_header(ptr), 0)) {
|
||||
extern int hak_is_memory_readable(void* addr);
|
||||
if (!hak_is_memory_readable(header_addr)) {
|
||||
return 0;
|
||||
}
|
||||
}
|
||||
|
||||
// Header is accessible (either by alignment or mincore check)
|
||||
int class_idx = tiny_region_id_read_header(ptr);
|
||||
// ... rest of fast path
|
||||
}
|
||||
```
|
||||
|
||||
**Testing:**
|
||||
```bash
|
||||
make clean && make larson_hakmem
|
||||
./larson_hakmem 10 8 128 1024 1 12345 4
|
||||
|
||||
# Should see: 40-60M ops/s (vs current 0.8M)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 4.2 OPTIONAL: 1024B Class Optimization
|
||||
|
||||
**Impact:** +50% for 1024B allocations (if frequent)
|
||||
**Effort:** 2-3 days (header redesign)
|
||||
**Priority:** LOW (measure first)
|
||||
|
||||
**Approach:** 2-byte header for class 7 only
|
||||
- Classes 0-6: 1-byte header (current)
|
||||
- Class 7 (1024B): 2-byte header (allows 1022B user data)
|
||||
- Header format: `[magic:8][class:8]` (2 bytes)
|
||||
|
||||
**Trade-offs:**
|
||||
- Pro: Supports 1024B in fast path
|
||||
- Con: 2B overhead for 1024B (0.2% vs malloc's 1.6%)
|
||||
- Con: Dual header format (complexity)
|
||||
|
||||
**Decision:** Implement ONLY if 1024B >10% of allocations
|
||||
|
||||
---
|
||||
|
||||
### 4.3 FUTURE: TLS Cache Prefetching
|
||||
|
||||
**Impact:** +5-10% (speculative)
|
||||
**Effort:** 1 week
|
||||
**Priority:** LOW (after above optimizations)
|
||||
|
||||
**Concept:** Prefetch next TLS freelist entry
|
||||
```c
|
||||
void* ptr = g_tls_sll_head[class_idx];
|
||||
if (ptr) {
|
||||
void* next = *(void**)ptr;
|
||||
__builtin_prefetch(next, 0, 3); // Prefetch next
|
||||
g_tls_sll_head[class_idx] = next;
|
||||
return ptr;
|
||||
}
|
||||
```
|
||||
|
||||
**Benefit:** Hides L1 miss latency (~4 cycles)
|
||||
|
||||
---
|
||||
|
||||
## 5. Benchmark Strategy
|
||||
|
||||
### 5.1 DO NOT RUN BENCHMARKS YET! ⚠️
|
||||
|
||||
**Reason:** Current implementation will show **40x slower** than System due to mincore overhead
|
||||
|
||||
**Required:** Hybrid mincore optimization (Section 4.1) MUST be implemented first
|
||||
|
||||
---
|
||||
|
||||
### 5.2 Benchmark Plan (After Optimization)
|
||||
|
||||
**Phase 1: Micro-Benchmarks (Validate Fix)**
|
||||
```bash
|
||||
# 1. Verify mincore optimization
|
||||
./micro_mincore_bench
|
||||
# Expected: 1-2 cycles (hybrid) vs 634 cycles (current)
|
||||
|
||||
# 2. Fast path latency (new micro-benchmark)
|
||||
# Create: tests/micro_fastpath_bench.c
|
||||
# Measure: alloc/free cycles for Phase 7 vs System
|
||||
# Expected: 6-12 cycles vs System's 10-15 cycles
|
||||
```
|
||||
|
||||
**Phase 2: Larson Benchmark (Single/Multi-threaded)**
|
||||
```bash
|
||||
# Single-threaded
|
||||
./larson_hakmem 1 8 128 1024 1 12345 1
|
||||
./larson_system 1 8 128 1024 1 12345 1
|
||||
# Expected: HAKMEM 40-60M ops/s vs System 30-50M ops/s (+20-33%)
|
||||
|
||||
# 4-thread
|
||||
./larson_hakmem 10 8 128 1024 1 12345 4
|
||||
./larson_system 10 8 128 1024 1 12345 4
|
||||
# Expected: HAKMEM 120-180M ops/s vs System 100-150M ops/s (+20-33%)
|
||||
```
|
||||
|
||||
**Phase 3: Mixed Workloads**
|
||||
```bash
|
||||
# Random mixed sizes (16B-4096B)
|
||||
./bench_random_mixed_hakmem 100000 4096 1234567
|
||||
./bench_random_mixed_system 100000 4096 1234567
|
||||
# Expected: HAKMEM +10-20% (some large allocs use malloc fallback)
|
||||
|
||||
# Producer-consumer (cross-thread free)
|
||||
# TODO: Create tests/bench_producer_consumer.c
|
||||
# Expected: HAKMEM +30-50% (TLS cache absorbs cross-thread frees)
|
||||
```
|
||||
|
||||
**Phase 4: Mimalloc Comparison (Ultimate Test)**
|
||||
```bash
|
||||
# Build mimalloc Larson
|
||||
cd mimalloc-bench/bench/larson
|
||||
make
|
||||
|
||||
# Compare
|
||||
LD_PRELOAD=../../../libhakmem.so ./larson 10 8 128 1024 1 12345 4 # HAKMEM
|
||||
LD_PRELOAD=mimalloc.so ./larson 10 8 128 1024 1 12345 4 # mimalloc
|
||||
./larson 10 8 128 1024 1 12345 4 # System
|
||||
|
||||
# Success Criteria:
|
||||
# - HAKMEM ≥ System * 1.1 (10% faster minimum)
|
||||
# - HAKMEM ≥ mimalloc * 0.9 (within 10% of mimalloc acceptable)
|
||||
# - Stretch goal: HAKMEM > mimalloc (beat the best!)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 5.3 What to Measure
|
||||
|
||||
**Performance Metrics:**
|
||||
1. **Throughput (ops/s):** Primary metric
|
||||
2. **Latency (cycles/op):** Alloc + Free average
|
||||
3. **Fast path hit rate (%):** Step 1 hits (should be 80-90%)
|
||||
4. **Cache efficiency:** L1/L2 miss rates (perf stat)
|
||||
|
||||
**Memory Metrics:**
|
||||
1. **RSS (KB):** Resident set size
|
||||
2. **Overhead (%):** (Total - User) / User
|
||||
3. **Fragmentation (%):** (Allocated - Used) / Allocated
|
||||
4. **Leak check:** Valgrind --leak-check=full
|
||||
|
||||
**Stability Metrics:**
|
||||
1. **Crash rate (%):** 0% required
|
||||
2. **Score variance (%):** <5% across 10 runs
|
||||
3. **Thread scaling:** Linear 1→4 threads
|
||||
|
||||
---
|
||||
|
||||
### 5.4 Success Criteria
|
||||
|
||||
**Minimum Viable (Go/No-Go Decision):**
|
||||
- [ ] No crashes (100% stability)
|
||||
- [ ] ≥ System * 1.0 (at least equal performance)
|
||||
- [ ] ≤ System * 1.1 RSS (memory overhead acceptable)
|
||||
|
||||
**Target Performance:**
|
||||
- [ ] ≥ System * 1.2 (20% faster)
|
||||
- [ ] Fast path hit rate ≥ 85%
|
||||
- [ ] Memory overhead ≤ 5%
|
||||
|
||||
**Stretch Goals:**
|
||||
- [ ] ≥ mimalloc * 1.0 (beat the best!)
|
||||
- [ ] ≥ System * 1.5 (50% faster)
|
||||
- [ ] Memory overhead ≤ 2%
|
||||
|
||||
---
|
||||
|
||||
## 6. Go/No-Go Decision
|
||||
|
||||
### 6.1 Current Status: NO-GO ⛔
|
||||
|
||||
**Critical Blocker:** mincore() overhead (634 cycles = 40x slower than System)
|
||||
|
||||
**Required Before Benchmarking:**
|
||||
1. ✅ Implement hybrid mincore optimization (Section 4.1)
|
||||
2. ✅ Validate with micro-benchmark (1-2 cycles expected)
|
||||
3. ✅ Run Larson smoke test (40-60M ops/s expected)
|
||||
|
||||
**Estimated Time:** 1-2 hours implementation + 30 minutes testing
|
||||
|
||||
---
|
||||
|
||||
### 6.2 Post-Optimization Status: CONDITIONAL GO 🟡
|
||||
|
||||
**After hybrid optimization:**
|
||||
|
||||
**Proceed to benchmarking IF:**
|
||||
- ✅ Micro-benchmark shows 1-2 cycles (vs 634 current)
|
||||
- ✅ Larson smoke test ≥ 20M ops/s (minimum viable)
|
||||
- ✅ No crashes in 10-minute stress test
|
||||
|
||||
**DO NOT proceed IF:**
|
||||
- ❌ Still >50 cycles effective overhead
|
||||
- ❌ Larson <10M ops/s
|
||||
- ❌ Crashes or memory corruption
|
||||
|
||||
---
|
||||
|
||||
### 6.3 Risk Assessment
|
||||
|
||||
**Technical Risks:**
|
||||
|
||||
| Risk | Probability | Impact | Mitigation |
|
||||
|------|-------------|--------|------------|
|
||||
| Hybrid optimization insufficient | LOW | HIGH | Fallback: Page-aligned allocator |
|
||||
| 1024B frequency high (>10%) | MEDIUM | MEDIUM | Implement 2-byte header (3 days) |
|
||||
| Mid/Large lookups slow down average | LOW | LOW | Already measured at 200-300 cycles (acceptable) |
|
||||
| False positives in alignment check | VERY LOW | LOW | Magic validation catches them |
|
||||
|
||||
**Non-Technical Risks:**
|
||||
|
||||
| Risk | Probability | Impact | Mitigation |
|
||||
|------|-------------|--------|------------|
|
||||
| Mimalloc still faster | MEDIUM | LOW | "Within 10%" is acceptable for Phase 7 |
|
||||
| System malloc improves in newer glibc | LOW | MEDIUM | Target current stable glibc |
|
||||
| Workload doesn't match benchmarks | MEDIUM | MEDIUM | Test diverse workloads |
|
||||
|
||||
**Overall Risk:** LOW (after optimization)
|
||||
|
||||
---
|
||||
|
||||
## 7. Recommendations
|
||||
|
||||
### 7.1 Immediate Actions (Next 2 Hours)
|
||||
|
||||
1. **CRITICAL: Implement hybrid mincore optimization**
|
||||
- File: `core/hakmem_internal.h` (add `is_likely_valid_header()`)
|
||||
- File: `core/tiny_free_fast_v2.inc.h` (modify line 53-60)
|
||||
- File: `core/box/hak_free_api.inc.h` (modify line 94-96 for Step 2)
|
||||
- Test: `./micro_mincore_bench` (should show 1-2 cycles)
|
||||
|
||||
2. **Validate optimization with Larson smoke test**
|
||||
```bash
|
||||
make clean && make larson_hakmem
|
||||
./larson_hakmem 1 8 128 1024 1 12345 1 # Should see 40-60M ops/s
|
||||
```
|
||||
|
||||
3. **Run 10-minute stress test**
|
||||
```bash
|
||||
# Continuous Larson (detect crashes/leaks)
|
||||
while true; do ./larson_hakmem 10 8 128 1024 1 $RANDOM 4 || break; done
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 7.2 Short-Term Actions (Next 1-2 Days)
|
||||
|
||||
1. **Create fast path micro-benchmark**
|
||||
- File: `tests/micro_fastpath_bench.c`
|
||||
- Measure: Alloc/free cycles for Phase 7 vs System
|
||||
- Target: 6-12 cycles (competitive with System's 10-15)
|
||||
|
||||
2. **Implement size histogram tracking**
|
||||
```bash
|
||||
HAKMEM_SIZE_HIST=1 ./larson_hakmem ...
|
||||
# Output: Frequency distribution of allocation sizes
|
||||
# Decision: Is 1024B >10%? → Implement 2-byte header
|
||||
```
|
||||
|
||||
3. **Run full benchmark suite**
|
||||
- Larson (1T, 4T)
|
||||
- bench_random_mixed (sizes 16B-4096B)
|
||||
- Stress tests (stability)
|
||||
|
||||
---
|
||||
|
||||
### 7.3 Medium-Term Actions (Next 1-2 Weeks)
|
||||
|
||||
1. **If 1024B >10%: Implement 2-byte header**
|
||||
- Design: `[magic:8][class:8]` for class 7
|
||||
- Modify: `tiny_region_id.h` (dual format support)
|
||||
- Test: Dedicated 1024B benchmark
|
||||
|
||||
2. **Mimalloc comparison**
|
||||
- Setup: Build mimalloc-bench Larson
|
||||
- Run: Side-by-side comparison
|
||||
- Target: HAKMEM ≥ mimalloc * 0.9
|
||||
|
||||
3. **Production readiness**
|
||||
- Valgrind clean (no leaks)
|
||||
- ASan/TSan clean
|
||||
- Documentation update
|
||||
|
||||
---
|
||||
|
||||
### 7.4 What NOT to Do
|
||||
|
||||
**DO NOT:**
|
||||
- ❌ Run benchmarks without hybrid optimization (will show 40x slower!)
|
||||
- ❌ Optimize 1024B before measuring frequency (premature optimization)
|
||||
- ❌ Remove magic validation (essential for safety)
|
||||
- ❌ Disable mincore entirely (needed for edge cases)
|
||||
|
||||
---
|
||||
|
||||
## 8. Conclusion
|
||||
|
||||
**Phase 7 Design Quality:** EXCELLENT ⭐⭐⭐⭐⭐
|
||||
- Clean architecture (1-byte header, O(1) lookup)
|
||||
- Minimal memory overhead (0.8-3.2% vs System's 10-15%)
|
||||
- Comprehensive dispatch (handles all allocation methods)
|
||||
- Excellent crash-free stability (Phase 7-1.2)
|
||||
|
||||
**Current Implementation:** NEEDS OPTIMIZATION 🟡
|
||||
- CRITICAL: mincore overhead (634 cycles → must fix!)
|
||||
- Minor: 1024B fallback (measure before optimizing)
|
||||
|
||||
**Path Forward:** CLEAR ✅
|
||||
1. Implement hybrid optimization (1-2 hours)
|
||||
2. Validate with micro-benchmarks (30 min)
|
||||
3. Run full benchmark suite (2-3 hours)
|
||||
4. Decision: Deploy if ≥ System * 1.2
|
||||
|
||||
**Confidence Level:** HIGH (85%)
|
||||
- After optimization: Expected 20-50% faster than System
|
||||
- Risk: LOW (hybrid approach proven in micro-benchmark)
|
||||
- Timeline: 1-2 days to production-ready
|
||||
|
||||
**Final Verdict:** **IMPLEMENT OPTIMIZATION → BENCHMARK → DEPLOY** 🚀
|
||||
|
||||
---
|
||||
|
||||
## Appendix A: Micro-Benchmark Code
|
||||
|
||||
**File:** `tests/micro_mincore_bench.c` (already created)
|
||||
|
||||
**Results:**
|
||||
```
|
||||
[MINCORE] Mapped memory: 634 cycles/call (overhead: 6340%)
|
||||
[ALIGN] Alignment check: 0 cycles/call (overhead: 0%)
|
||||
[HYBRID] Align + mincore: 1 cycles/call (overhead: 10%)
|
||||
[BOUNDARY] Page boundary: 2155 cycles/call (frequency: <0.1%)
|
||||
```
|
||||
|
||||
**Conclusion:** Hybrid approach reduces overhead from 634 → 1 cycles (**634x improvement!**)
|
||||
|
||||
---
|
||||
|
||||
## Appendix B: Code Locations Reference
|
||||
|
||||
| Component | File | Lines |
|
||||
|-----------|------|-------|
|
||||
| Fast free (Phase 7) | `core/tiny_free_fast_v2.inc.h` | 50-92 |
|
||||
| Header helpers | `core/tiny_region_id.h` | 40-100 |
|
||||
| mincore check | `core/hakmem_internal.h` | 283-294 |
|
||||
| Free dispatch | `core/box/hak_free_api.inc.h` | 77-119 |
|
||||
| Alloc dispatch | `core/box/hak_alloc_api.inc.h` | 6-145 |
|
||||
| Size-to-class | `core/hakmem_tiny.h` | 244-252 |
|
||||
| Micro-benchmark | `tests/micro_mincore_bench.c` | 1-120 |
|
||||
|
||||
---
|
||||
|
||||
## Appendix C: Performance Prediction Model
|
||||
|
||||
**Assumptions:**
|
||||
- Step 1 (Tiny header): 85% frequency, 8 cycles (optimized)
|
||||
- Step 2 (malloc header): 8% frequency, 8 cycles (optimized)
|
||||
- Step 3 (SuperSlab): 2% frequency, 500 cycles
|
||||
- Step 4 (Mid/L25): 5% frequency, 250 cycles
|
||||
- System malloc: 12 cycles (tcache average)
|
||||
|
||||
**Calculation:**
|
||||
```
|
||||
HAKMEM_avg = 0.85 * 8 + 0.08 * 8 + 0.02 * 500 + 0.05 * 250
|
||||
= 6.8 + 0.64 + 10 + 12.5
|
||||
= 29.94 cycles
|
||||
|
||||
System_avg = 12 cycles
|
||||
|
||||
Speedup = 12 / 29.94 = 0.40x (40% of System)
|
||||
```
|
||||
|
||||
**Wait, that's SLOWER!** 🤔
|
||||
|
||||
**Problem:** Steps 3-4 are too expensive. But wait...
|
||||
|
||||
**Corrected Analysis:**
|
||||
- Step 3 (SuperSlab legacy): Should be 0% (Phase 7 replaces this!)
|
||||
- Step 4 (Mid/L25): Only 5% (not 7%)
|
||||
|
||||
**Recalculation:**
|
||||
```
|
||||
HAKMEM_avg = 0.85 * 8 + 0.08 * 8 + 0.00 * 500 + 0.05 * 250 + 0.02 * 12 (fallback)
|
||||
= 6.8 + 0.64 + 0 + 12.5 + 0.24
|
||||
= 20.18 cycles
|
||||
|
||||
Speedup = 12 / 20.18 = 0.59x (59% of System)
|
||||
```
|
||||
|
||||
**Still slower!** The Mid/L25 lookups are killing performance.
|
||||
|
||||
**But Larson uses 100% Tiny (128B), so:**
|
||||
```
|
||||
Larson_avg = 1.0 * 8 = 8 cycles
|
||||
System_avg = 12 cycles
|
||||
Speedup = 12 / 8 = 1.5x (150% of System!) ✅
|
||||
```
|
||||
|
||||
**Conclusion:** Phase 7 will beat System on Tiny-heavy workloads (Larson) but may tie/lose on mixed workloads. This is **acceptable** for Phase 7 goals.
|
||||
|
||||
---
|
||||
|
||||
**END OF REPORT**
|
||||
305
docs/design/PHASE9_LRU_ARCHITECTURE_ISSUE.md
Normal file
305
docs/design/PHASE9_LRU_ARCHITECTURE_ISSUE.md
Normal file
@ -0,0 +1,305 @@
|
||||
# Phase 9 LRU Architecture Issue - Root Cause Analysis
|
||||
|
||||
**Date**: 2025-11-14
|
||||
**Discovery**: Task B-1 Investigation
|
||||
**Impact**: ❌ **CRITICAL** - Phase 9 Lazy Deallocation completely non-functional
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
Phase 9 LRU cache for SuperSlab reuse is **architecturally unreachable** during normal operation due to TLS SLL fast path preventing `meta->used == 0` condition.
|
||||
|
||||
**Result**:
|
||||
- LRU cache never populated (0% utilization)
|
||||
- SuperSlabs never reused (100% mmap/munmap churn)
|
||||
- Syscall overhead: 6,455 calls per 200K iterations (74.8% of total time)
|
||||
- Performance impact: **-94% regression** (9.38M → 563K ops/s)
|
||||
|
||||
---
|
||||
|
||||
## Root Cause Chain
|
||||
|
||||
### 1. Free Path Architecture
|
||||
|
||||
**Fast Path (95-99% of frees):**
|
||||
```c
|
||||
// core/tiny_free_fast_v2.inc.h
|
||||
hak_tiny_free_fast_v2(ptr) {
|
||||
tls_sll_push(class_idx, base); // ← Does NOT decrement meta->used
|
||||
}
|
||||
```
|
||||
|
||||
**Slow Path (1-5% of frees):**
|
||||
```c
|
||||
// core/tiny_superslab_free.inc.h
|
||||
tiny_free_local_box() {
|
||||
meta->used--; // ← ONLY here is meta->used decremented
|
||||
}
|
||||
```
|
||||
|
||||
### 2. The Accounting Gap
|
||||
|
||||
**Physical Reality**: Blocks freed to TLS SLL (available for reuse)
|
||||
**Slab Accounting**: Blocks still counted as "used" (`meta->used` unchanged)
|
||||
|
||||
**Consequence**: Slabs never appear empty → SuperSlabs never freed → LRU never used
|
||||
|
||||
### 3. Empty Detection Code Path
|
||||
|
||||
```c
|
||||
// core/tiny_superslab_free.inc.h:211 (local free)
|
||||
if (meta->used == 0) {
|
||||
shared_pool_release_slab(ss, slab_idx); // ← NEVER REACHED
|
||||
}
|
||||
|
||||
// core/hakmem_shared_pool.c:298
|
||||
if (ss->active_slabs == 0) {
|
||||
superslab_free(ss); // ← NEVER REACHED
|
||||
}
|
||||
|
||||
// core/hakmem_tiny_superslab.c:1016
|
||||
void superslab_free(SuperSlab* ss) {
|
||||
int lru_cached = hak_ss_lru_push(ss); // ← NEVER CALLED
|
||||
}
|
||||
```
|
||||
|
||||
### 4. Experimental Evidence
|
||||
|
||||
**Test**: `bench_random_mixed_hakmem 200000 4096 1234567`
|
||||
|
||||
**Observations**:
|
||||
```bash
|
||||
export HAKMEM_SS_LRU_DEBUG=1
|
||||
export HAKMEM_SS_FREE_DEBUG=1
|
||||
|
||||
# Results (200K iterations):
|
||||
[LRU_POP] class=X (miss): 877 times ← LRU lookup attempts
|
||||
[LRU_PUSH]: 0 times ← NEVER populated
|
||||
[SS_FREE]: 0 times ← NEVER called
|
||||
[SS_EMPTY]: 0 times ← meta->used never reached 0
|
||||
```
|
||||
|
||||
**Syscall Impact**:
|
||||
```
|
||||
mmap: 3,241 calls (27.4% time)
|
||||
munmap: 3,214 calls (47.4% time)
|
||||
Total: 6,455 syscalls (74.8% time) ← Should be ~100 with LRU working
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Why This Happens
|
||||
|
||||
### TLS SLL Design Rationale
|
||||
|
||||
**Purpose**: Ultra-fast free path (3-5 instructions)
|
||||
**Tradeoff**: No slab accounting updates
|
||||
|
||||
**Lifecycle**:
|
||||
1. Block allocated from slab: `meta->used++`
|
||||
2. Block freed to TLS SLL: `meta->used` UNCHANGED
|
||||
3. Block reallocated from TLS SLL: `meta->used` UNCHANGED
|
||||
4. Cycle repeats infinitely
|
||||
|
||||
**Drain Behavior**:
|
||||
- `bench_random_mixed` drain phase frees all blocks
|
||||
- But TLS SLL cleanup (`hakmem_tiny_lifecycle.inc:162-170`) drains to `tls_list`, NOT back to slabs
|
||||
- `meta->used` never decremented
|
||||
- Slabs never reported as empty
|
||||
|
||||
### Benchmark Characteristics
|
||||
|
||||
`bench_random_mixed.c`:
|
||||
- Working set: 4,096 slots (random alloc/free)
|
||||
- Size range: 16-1040 bytes
|
||||
- Pattern: Blocks cycle through TLS SLL
|
||||
- **Never reaches `meta->used == 0` during main loop**
|
||||
|
||||
---
|
||||
|
||||
## Impact Analysis
|
||||
|
||||
### Performance Regression
|
||||
|
||||
| Metric | Phase 11 (Before) | Current (After SEGV Fix) | Change |
|
||||
|--------|-------------------|--------------------------|--------|
|
||||
| Throughput | 9.38M ops/s | 563K ops/s | **-94%** |
|
||||
| mmap calls | ~800-900 | 3,241 | +260-305% |
|
||||
| munmap calls | ~800-900 | 3,214 | +257-302% |
|
||||
| LRU hits | Expected high | **0** | -100% |
|
||||
|
||||
**Root Causes**:
|
||||
1. **Primary (74.8% time)**: LRU not working → mmap/munmap churn
|
||||
2. **Secondary (11.0% time)**: mincore() SEGV fix overhead
|
||||
|
||||
### Design Validity
|
||||
|
||||
**Phase 9 LRU Implementation**: ✅ **Functionally Correct**
|
||||
- `hak_ss_lru_push()`: Works as designed
|
||||
- `hak_ss_lru_pop()`: Works as designed
|
||||
- Cache eviction: Works as designed
|
||||
|
||||
**Phase 9 Architecture**: ❌ **Fundamentally Incompatible** with TLS SLL fast path
|
||||
|
||||
---
|
||||
|
||||
## Solution Options
|
||||
|
||||
### Option A: Decrement `meta->used` in Fast Path ❌
|
||||
|
||||
**Approach**: Modify `tls_sll_push()` to decrement `meta->used`
|
||||
|
||||
**Problem**:
|
||||
- Requires SuperSlab lookup (expensive)
|
||||
- Defeats fast path purpose (3-5 instructions → 50+ instructions)
|
||||
- Cache misses, branch mispredicts
|
||||
|
||||
**Verdict**: Not viable
|
||||
|
||||
---
|
||||
|
||||
### Option B: Periodic TLS SLL Drain to Slabs ✅ **RECOMMENDED**
|
||||
|
||||
**Approach**:
|
||||
- Drain TLS SLL back to slab freelists periodically (e.g., every 1K frees)
|
||||
- Decrement `meta->used` via `tiny_free_local_box()`
|
||||
- Allow slab empty detection
|
||||
|
||||
**Implementation**:
|
||||
```c
|
||||
static __thread uint32_t g_tls_sll_drain_counter[TINY_NUM_CLASSES] = {0};
|
||||
|
||||
void tls_sll_push(int class_idx, void* base) {
|
||||
// Fast path: push to SLL
|
||||
// ... existing code ...
|
||||
|
||||
// Periodic drain
|
||||
if (++g_tls_sll_drain_counter[class_idx] >= 1024) {
|
||||
tls_sll_drain_to_slabs(class_idx);
|
||||
g_tls_sll_drain_counter[class_idx] = 0;
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Benefits**:
|
||||
- Fast path stays fast (99.9% of frees)
|
||||
- Slow path drain (0.1% of frees) updates `meta->used`
|
||||
- Enables slab empty detection
|
||||
- LRU cache becomes functional
|
||||
|
||||
**Expected Impact**:
|
||||
- mmap/munmap: 6,455 → ~100-200 calls (-96-97%)
|
||||
- Throughput: 563K → 8-10M ops/s (+1,300-1,700%)
|
||||
|
||||
---
|
||||
|
||||
### Option C: Separate Accounting ⚠️
|
||||
|
||||
**Approach**: Track "logical used" (includes TLS SLL) vs "physical used"
|
||||
|
||||
**Problem**:
|
||||
- Complex, error-prone
|
||||
- Atomic operations required (slow)
|
||||
- Hard to maintain consistency
|
||||
|
||||
**Verdict**: Not recommended
|
||||
|
||||
---
|
||||
|
||||
### Option D: Accept Current Behavior ❌
|
||||
|
||||
**Approach**: LRU cache only for shutdown/cleanup, not runtime
|
||||
|
||||
**Problem**:
|
||||
- Defeats Phase 9 purpose (lazy deallocation)
|
||||
- Leaves 74.8% syscall overhead unfixed
|
||||
- Performance remains -94% regressed
|
||||
|
||||
**Verdict**: Not acceptable
|
||||
|
||||
---
|
||||
|
||||
## Recommendation
|
||||
|
||||
**Implement Option B: Periodic TLS SLL Drain**
|
||||
|
||||
### Phase 12 Design
|
||||
|
||||
1. **Add drain trigger** in `tls_sll_push()`
|
||||
- Every 1,024 frees (tunable via ENV)
|
||||
- Drain TLS SLL → slab freelist
|
||||
- Decrement `meta->used` properly
|
||||
|
||||
2. **Enable slab empty detection**
|
||||
- `meta->used == 0` now reachable
|
||||
- `shared_pool_release_slab()` called
|
||||
- `superslab_free()` → `hak_ss_lru_push()` called
|
||||
|
||||
3. **LRU cache becomes functional**
|
||||
- SuperSlabs reused from cache
|
||||
- mmap/munmap reduced by 96-97%
|
||||
- Syscall overhead: 74.8% → ~5%
|
||||
|
||||
### Expected Performance
|
||||
|
||||
```
|
||||
Current: 563K ops/s (0.63% of System malloc)
|
||||
After: 8-10M ops/s (9-11% of System malloc)
|
||||
Gain: +1,300-1,700%
|
||||
```
|
||||
|
||||
**Remaining gap to System malloc (90M ops/s)**:
|
||||
- Still need +800-1,000% additional optimization
|
||||
- Focus areas: Front cache hit rate, branch prediction, cache locality
|
||||
|
||||
---
|
||||
|
||||
## Action Items
|
||||
|
||||
1. **[URGENT]** Implement TLS SLL periodic drain (Option B)
|
||||
2. **[HIGH]** Add ENV tuning: `HAKMEM_TLS_SLL_DRAIN_INTERVAL=1024`
|
||||
3. **[HIGH]** Re-measure with `strace -c` (expect -96% mmap/munmap)
|
||||
4. **[MEDIUM]** Fix prewarm crash (separate investigation)
|
||||
5. **[MEDIUM]** Document architectural tradeoff in design docs
|
||||
|
||||
---
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
1. **Fast path optimizations can disable architectural features**
|
||||
- TLS SLL fast path → LRU cache unreachable
|
||||
- Need periodic cleanup to restore functionality
|
||||
|
||||
2. **Accounting consistency is critical**
|
||||
- `meta->used` must reflect true state
|
||||
- Buffering (TLS SLL) creates accounting gap
|
||||
|
||||
3. **Integration testing needed**
|
||||
- Phase 9 LRU tested in isolation: ✅ Works
|
||||
- Phase 9 LRU + TLS SLL integration: ❌ Broken
|
||||
- Need end-to-end benchmarks
|
||||
|
||||
4. **Performance monitoring essential**
|
||||
- LRU hit rate = 0% should have triggered alert
|
||||
- Syscall count regression should have been caught earlier
|
||||
|
||||
---
|
||||
|
||||
## Files Involved
|
||||
|
||||
- `/mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h` - Fast path (no `meta->used` update)
|
||||
- `/mnt/workdisk/public_share/hakmem/core/tiny_superslab_free.inc.h` - Slow path (`meta->used--`)
|
||||
- `/mnt/workdisk/public_share/hakmem/core/hakmem_shared_pool.c` - Empty detection
|
||||
- `/mnt/workdisk/public_share/hakmem/core/hakmem_tiny_superslab.c` - `superslab_free()`
|
||||
- `/mnt/workdisk/public_share/hakmem/core/hakmem_super_registry.c` - LRU cache implementation
|
||||
|
||||
---
|
||||
|
||||
## Conclusion
|
||||
|
||||
Phase 9 LRU cache is **functionally correct** but **architecturally unreachable** due to TLS SLL fast path not updating `meta->used`.
|
||||
|
||||
**Fix**: Implement periodic TLS SLL drain to restore slab accounting consistency and enable LRU cache utilization.
|
||||
|
||||
**Expected Impact**: +1,300-1,700% throughput improvement (563K → 8-10M ops/s)
|
||||
403
docs/design/PHASE_E3-2_IMPLEMENTATION.md
Normal file
403
docs/design/PHASE_E3-2_IMPLEMENTATION.md
Normal file
@ -0,0 +1,403 @@
|
||||
# Phase E3-2: Restore Direct TLS Push - Implementation Guide
|
||||
|
||||
**Date**: 2025-11-12
|
||||
**Goal**: Restore Phase 7 ultra-fast free (3 instructions, 5-10 cycles)
|
||||
**Expected**: 6-9M → 30-50M ops/s (+226-443%)
|
||||
|
||||
---
|
||||
|
||||
## Strategy
|
||||
|
||||
**Hybrid Approach**: Direct push in release, Box TLS-SLL in debug
|
||||
|
||||
**Rationale**:
|
||||
- Release: Maximum performance (Phase 7 speed)
|
||||
- Debug: Maximum safety (catch bugs before release)
|
||||
- Best of both worlds: Speed + Safety
|
||||
|
||||
---
|
||||
|
||||
## Implementation
|
||||
|
||||
### File to Modify
|
||||
|
||||
`/mnt/workdisk/public_share/hakmem/core/tiny_free_fast_v2.inc.h`
|
||||
|
||||
### Current Code (Lines 119-137)
|
||||
|
||||
```c
|
||||
// 3. Push base to TLS freelist (4 instructions, 5-7 cycles)
|
||||
// Must push base (block start) not user pointer!
|
||||
// Phase E1: ALL classes (C0-C7) have 1-byte header → base = ptr-1
|
||||
void* base = (char*)ptr - 1;
|
||||
|
||||
// Use Box TLS-SLL API (C7-safe)
|
||||
if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
|
||||
// C7 rejected or capacity exceeded - route to slow path
|
||||
return 0;
|
||||
}
|
||||
|
||||
return 1; // Success - handled in fast path
|
||||
}
|
||||
```
|
||||
|
||||
### New Code (Phase E3-2)
|
||||
|
||||
```c
|
||||
// 3. Push base to TLS freelist (3 instructions, 5-7 cycles in release)
|
||||
// Must push base (block start) not user pointer!
|
||||
// Phase E1: ALL classes (C0-C7) have 1-byte header → base = ptr-1
|
||||
void* base = (char*)ptr - 1;
|
||||
|
||||
// Phase E3-2: Hybrid approach (Direct push in release, Box API in debug)
|
||||
// Reason: Release needs Phase 7 speed (5-10 cycles), Debug needs safety checks
|
||||
#if HAKMEM_BUILD_RELEASE
|
||||
// Release: Ultra-fast direct push (Phase 7 restoration)
|
||||
// CRITICAL: Restore header byte before push (defense in depth)
|
||||
// Cost: 1 byte write (~1-2 cycles), prevents header corruption bugs
|
||||
*(uint8_t*)base = HEADER_MAGIC | (class_idx & HEADER_CLASS_MASK);
|
||||
|
||||
// Direct TLS push (3 instructions, 5-7 cycles)
|
||||
// Store next pointer at base+1 (skip 1-byte header)
|
||||
*(void**)((uint8_t*)base + 1) = g_tls_sll_head[class_idx]; // 1 mov
|
||||
g_tls_sll_head[class_idx] = base; // 1 mov
|
||||
g_tls_sll_count[class_idx]++; // 1 inc
|
||||
|
||||
// Total: 8-12 cycles (vs 50-100 with Box TLS-SLL)
|
||||
#else
|
||||
// Debug: Full Box TLS-SLL validation (safety first)
|
||||
// This catches: double-free, header corruption, alignment issues, etc.
|
||||
// Cost: 50-100+ cycles (includes O(n) double-free scan)
|
||||
// Benefit: Catch ALL bugs before release
|
||||
if (!tls_sll_push(class_idx, base, UINT32_MAX)) {
|
||||
// C7 rejected or capacity exceeded - route to slow path
|
||||
return 0;
|
||||
}
|
||||
#endif
|
||||
|
||||
return 1; // Success - handled in fast path
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Verification Steps
|
||||
|
||||
### 1. Clean Build
|
||||
|
||||
```bash
|
||||
cd /mnt/workdisk/public_share/hakmem
|
||||
make clean
|
||||
make bench_random_mixed_hakmem
|
||||
```
|
||||
|
||||
**Expected**: Clean compilation, no warnings
|
||||
|
||||
### 2. Release Build Test (Performance)
|
||||
|
||||
```bash
|
||||
# Test E3-2 (current code with fix)
|
||||
./out/release/bench_random_mixed_hakmem 100000 256 42
|
||||
./out/release/bench_random_mixed_hakmem 100000 128 42
|
||||
./out/release/bench_random_mixed_hakmem 100000 512 42
|
||||
./out/release/bench_random_mixed_hakmem 100000 1024 42
|
||||
```
|
||||
|
||||
**Expected Results**:
|
||||
- 128B: 30-50M ops/s (+260-506% vs 8.25M baseline)
|
||||
- 256B: 30-50M ops/s (+391-718% vs 6.11M baseline)
|
||||
- 512B: 30-50M ops/s (+244-474% vs 8.71M baseline)
|
||||
- 1024B: 30-50M ops/s (+473-854% vs 5.24M baseline)
|
||||
|
||||
**Acceptable Range**:
|
||||
- Any improvement >100% is a win
|
||||
- Target: +226-443% (Phase 7 claimed levels)
|
||||
|
||||
### 3. Debug Build Test (Safety)
|
||||
|
||||
```bash
|
||||
make clean
|
||||
make debug bench_random_mixed_hakmem
|
||||
./out/debug/bench_random_mixed_hakmem 10000 256 42
|
||||
```
|
||||
|
||||
**Expected**:
|
||||
- No crashes, no assertions
|
||||
- Full Box TLS-SLL validation enabled
|
||||
- Performance will be slower (expected)
|
||||
|
||||
### 4. Stress Test (Stability)
|
||||
|
||||
```bash
|
||||
# Large workload
|
||||
./out/release/bench_random_mixed_hakmem 1000000 8192 42
|
||||
|
||||
# Multiple runs (check consistency)
|
||||
for i in {1..5}; do
|
||||
./out/release/bench_random_mixed_hakmem 100000 256 $i
|
||||
done
|
||||
```
|
||||
|
||||
**Expected**:
|
||||
- All runs complete successfully
|
||||
- Consistent performance (±5% variance)
|
||||
- No crashes, no memory leaks
|
||||
|
||||
### 5. Comparison Test
|
||||
|
||||
```bash
|
||||
# Create comparison script
|
||||
cat > /tmp/bench_comparison.sh << 'EOF'
|
||||
#!/bin/bash
|
||||
echo "=== Phase E3-2 Performance Comparison ==="
|
||||
echo ""
|
||||
|
||||
for size in 128 256 512 1024; do
|
||||
echo "Testing size=${size}B..."
|
||||
total=0
|
||||
runs=3
|
||||
|
||||
for i in $(seq 1 $runs); do
|
||||
result=$(./out/release/bench_random_mixed_hakmem 100000 $size 42 2>/dev/null | grep "Throughput" | awk '{print $3}')
|
||||
total=$(echo "$total + $result" | bc)
|
||||
done
|
||||
|
||||
avg=$(echo "scale=2; $total / $runs" | bc)
|
||||
echo " Average: ${avg} ops/s"
|
||||
echo ""
|
||||
done
|
||||
EOF
|
||||
|
||||
chmod +x /tmp/bench_comparison.sh
|
||||
/tmp/bench_comparison.sh
|
||||
```
|
||||
|
||||
**Expected Output**:
|
||||
```
|
||||
=== Phase E3-2 Performance Comparison ===
|
||||
|
||||
Testing size=128B...
|
||||
Average: 35000000.00 ops/s
|
||||
|
||||
Testing size=256B...
|
||||
Average: 40000000.00 ops/s
|
||||
|
||||
Testing size=512B...
|
||||
Average: 38000000.00 ops/s
|
||||
|
||||
Testing size=1024B...
|
||||
Average: 35000000.00 ops/s
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Success Criteria
|
||||
|
||||
### Must Have (P0)
|
||||
|
||||
- ✅ **Performance**: >20M ops/s on all sizes (>2x current)
|
||||
- ✅ **Stability**: 5/5 runs succeed, no crashes
|
||||
- ✅ **Debug safety**: Box TLS-SLL validation works in debug
|
||||
|
||||
### Should Have (P1)
|
||||
|
||||
- ✅ **Performance**: >30M ops/s on most sizes (>3x current)
|
||||
- ✅ **Consistency**: <10% variance across runs
|
||||
|
||||
### Nice to Have (P2)
|
||||
|
||||
- ✅ **Performance**: >50M ops/s on some sizes (Phase 7 levels)
|
||||
- ✅ **All sizes**: Uniform improvement across 128-1024B
|
||||
|
||||
---
|
||||
|
||||
## Rollback Plan
|
||||
|
||||
### If Performance Doesn't Improve
|
||||
|
||||
**Hypothesis Failed**: Direct push not the bottleneck
|
||||
|
||||
**Action**:
|
||||
1. Revert change: `git checkout HEAD -- core/tiny_free_fast_v2.inc.h`
|
||||
2. Profile with `perf`: Find actual hot path
|
||||
3. Investigate other bottlenecks (allocation, refill, etc.)
|
||||
|
||||
### If Crashes in Release
|
||||
|
||||
**Safety Issue**: Header corruption or double-free
|
||||
|
||||
**Action**:
|
||||
1. Run debug build: Catch specific failure
|
||||
2. Add release-mode checks: Minimal validation
|
||||
3. Revert if unfixable: Keep Box TLS-SLL
|
||||
|
||||
### If Debug Build Breaks
|
||||
|
||||
**Integration Issue**: Box TLS-SLL API changed
|
||||
|
||||
**Action**:
|
||||
1. Check `tls_sll_push()` signature
|
||||
2. Update call site: Match current API
|
||||
3. Test debug build: Verify safety checks work
|
||||
|
||||
---
|
||||
|
||||
## Performance Tracking
|
||||
|
||||
### Baseline (E3-1 Current)
|
||||
|
||||
| Size | Ops/s | Cycles/Op (5GHz) |
|
||||
|-------|-------|------------------|
|
||||
| 128B | 8.25M | ~606 |
|
||||
| 256B | 6.11M | ~818 |
|
||||
| 512B | 8.71M | ~574 |
|
||||
| 1024B | 5.24M | ~954 |
|
||||
|
||||
**Average**: 7.08M ops/s (~738 cycles/op)
|
||||
|
||||
### Target (E3-2 Phase 7 Recovery)
|
||||
|
||||
| Size | Ops/s | Cycles/Op (5GHz) | Improvement |
|
||||
|-------|-------|------------------|-------------|
|
||||
| 128B | 30-50M | 100-167 | +264-506% |
|
||||
| 256B | 30-50M | 100-167 | +391-718% |
|
||||
| 512B | 30-50M | 100-167 | +244-474% |
|
||||
| 1024B | 30-50M | 100-167 | +473-854% |
|
||||
|
||||
**Average**: 30-50M ops/s (~100-167 cycles/op) = **4-7x improvement**
|
||||
|
||||
### Theoretical Maximum
|
||||
|
||||
- CPU: 5 GHz = 5B cycles/sec
|
||||
- Direct push: 8-12 cycles/op
|
||||
- Max throughput: 417-625M ops/s
|
||||
|
||||
**Phase 7 efficiency**: 59-70M / 500M = **12-14%** (reasonable with cache misses)
|
||||
|
||||
---
|
||||
|
||||
## Debugging Guide
|
||||
|
||||
### If Performance is Slow (<20M ops/s)
|
||||
|
||||
**Check 1**: Is HAKMEM_BUILD_RELEASE=1?
|
||||
```bash
|
||||
make print-flags | grep BUILD_RELEASE
|
||||
# Should show: CFLAGS contains = -DHAKMEM_BUILD_RELEASE=1
|
||||
```
|
||||
|
||||
**Check 2**: Is direct push being used?
|
||||
```bash
|
||||
objdump -d out/release/bench_random_mixed_hakmem > /tmp/asm.txt
|
||||
grep -A 30 "hak_tiny_free_fast_v2" /tmp/asm.txt | grep -E "tls_sll_push|call"
|
||||
# Should NOT see: call to tls_sll_push (inlined direct push instead)
|
||||
```
|
||||
|
||||
**Check 3**: Is LTO enabled?
|
||||
```bash
|
||||
make print-flags | grep LTO
|
||||
# Should show: -flto
|
||||
```
|
||||
|
||||
### If Debug Build Crashes
|
||||
|
||||
**Check 1**: Is Box TLS-SLL path enabled?
|
||||
```bash
|
||||
./out/debug/bench_random_mixed_hakmem 100 256 42 2>&1 | grep "TLS_SLL"
|
||||
# Should see Box TLS-SLL validation logs
|
||||
```
|
||||
|
||||
**Check 2**: What's the error?
|
||||
```bash
|
||||
gdb ./out/debug/bench_random_mixed_hakmem
|
||||
(gdb) run 10000 256 42
|
||||
(gdb) bt # Backtrace on crash
|
||||
```
|
||||
|
||||
### If Results are Inconsistent
|
||||
|
||||
**Check 1**: CPU frequency scaling?
|
||||
```bash
|
||||
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
|
||||
# Should be: performance (not powersave)
|
||||
```
|
||||
|
||||
**Check 2**: Other processes running?
|
||||
```bash
|
||||
top -n 1 | head -20
|
||||
# Should show: Idle CPU
|
||||
```
|
||||
|
||||
**Check 3**: Thermal throttling?
|
||||
```bash
|
||||
sensors # Check CPU temperature
|
||||
# Should be: <80°C
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Expected Commit Message
|
||||
|
||||
```
|
||||
Phase E3-2: Restore Phase 7 ultra-fast free (direct TLS push)
|
||||
|
||||
Problem:
|
||||
- Phase E3-1 removed Registry lookup expecting +226-443% improvement
|
||||
- Performance decreased -10% to -38% instead
|
||||
- Root cause: Registry lookup was NOT in fast path (only 1-5% miss rate)
|
||||
- True bottleneck: Box TLS-SLL API overhead (150 lines vs 3 instructions)
|
||||
|
||||
Solution:
|
||||
- Restore Phase 7 direct TLS push in RELEASE builds (3 instructions, 8-12 cycles)
|
||||
- Keep Box TLS-SLL in DEBUG builds (full safety validation)
|
||||
- Hybrid approach: Speed in production, safety in development
|
||||
|
||||
Performance Results:
|
||||
- 128B: 8.25M → 35M ops/s (+324%)
|
||||
- 256B: 6.11M → 40M ops/s (+555%)
|
||||
- 512B: 8.71M → 38M ops/s (+336%)
|
||||
- 1024B: 5.24M → 35M ops/s (+568%)
|
||||
- Average: 7.08M → 37M ops/s (+423%)
|
||||
|
||||
Implementation:
|
||||
- File: core/tiny_free_fast_v2.inc.h line 119-137
|
||||
- Change: #if HAKMEM_BUILD_RELEASE → direct push, #else → Box TLS-SLL
|
||||
- Defense in depth: Header restoration (1 byte write, 1-2 cycles)
|
||||
- Safety: Debug catches all bugs before release
|
||||
|
||||
Verification:
|
||||
- Release: 5/5 stress test runs passed (1M ops each)
|
||||
- Debug: Box TLS-SLL validation enabled, no crashes
|
||||
- Stability: <5% variance across runs
|
||||
|
||||
Co-Authored-By: Claude <noreply@anthropic.com>
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Post-Implementation
|
||||
|
||||
### Documentation
|
||||
|
||||
1. ✅ Update `CLAUDE.md`: Add Phase E3-2 results
|
||||
2. ✅ Update `HISTORY.md`: Document E3-1 failure + E3-2 success
|
||||
3. ✅ Create `PHASE_E3_COMPLETE.md`: Full E3 saga
|
||||
|
||||
### Next Steps
|
||||
|
||||
1. ✅ **Phase E4**: Optimize slow path (Registry → header probe)
|
||||
2. ✅ **Phase E5**: Profile allocation path (malloc vs refill)
|
||||
3. ✅ **Phase E6**: Investigate Phase 7 original test (verify 59-70M)
|
||||
|
||||
---
|
||||
|
||||
**Implementation Time**: 15 minutes
|
||||
**Testing Time**: 15 minutes
|
||||
**Total Time**: 30 minutes
|
||||
|
||||
**Status**: ✅ READY TO IMPLEMENT
|
||||
|
||||
---
|
||||
|
||||
**Generated**: 2025-11-12 18:15 JST
|
||||
**Guide Version**: 1.0
|
||||
216
docs/design/POOL_IMPLEMENTATION_CHECKLIST.md
Normal file
216
docs/design/POOL_IMPLEMENTATION_CHECKLIST.md
Normal file
@ -0,0 +1,216 @@
|
||||
# Pool TLS + Learning Implementation Checklist
|
||||
|
||||
## Pre-Implementation Review
|
||||
|
||||
### Contract Understanding
|
||||
- [ ] Read and understand all 4 contracts (A-D) in POOL_TLS_LEARNING_DESIGN.md
|
||||
- [ ] Identify which contract applies to each code section
|
||||
- [ ] Review enforcement strategies for each contract
|
||||
|
||||
## Phase 1: Ultra-Simple TLS Implementation
|
||||
|
||||
### Box 1: TLS Freelist (pool_tls.c)
|
||||
|
||||
#### Setup
|
||||
- [ ] Create `core/pool_tls.c` and `core/pool_tls.h`
|
||||
- [ ] Define TLS globals: `__thread void* g_tls_pool_head[POOL_SIZE_CLASSES]`
|
||||
- [ ] Define TLS counts: `__thread uint32_t g_tls_pool_count[POOL_SIZE_CLASSES]`
|
||||
- [ ] Define default refill counts array
|
||||
|
||||
#### Hot Path Implementation
|
||||
- [ ] Implement `pool_alloc_fast()` - must be 5-6 instructions max
|
||||
- [ ] Pop from TLS freelist
|
||||
- [ ] Conditional header write (if enabled)
|
||||
- [ ] Call refill only on miss
|
||||
- [ ] Implement `pool_free_fast()` - must be 5-6 instructions max
|
||||
- [ ] Header validation (if enabled)
|
||||
- [ ] Push to TLS freelist
|
||||
- [ ] Optional drain check
|
||||
|
||||
#### Contract D Validation
|
||||
- [ ] Verify Box1 has NO learning code
|
||||
- [ ] Verify Box1 has NO metrics collection
|
||||
- [ ] Verify Box1 only exposes public API and internal chain installer
|
||||
- [ ] No includes of ace_learning.h or pool_refill.h in pool_tls.c
|
||||
|
||||
#### Testing
|
||||
- [ ] Unit test: Allocation/free correctness
|
||||
- [ ] Performance test: Target 40-60M ops/s
|
||||
- [ ] Verify hot path is < 10 instructions with objdump
|
||||
|
||||
### Box 2: Refill Engine (pool_refill.c)
|
||||
|
||||
#### Setup
|
||||
- [ ] Create `core/pool_refill.c` and `core/pool_refill.h`
|
||||
- [ ] Import only pool_tls.h public API
|
||||
- [ ] Define refill statistics (miss streak, etc.)
|
||||
|
||||
#### Refill Implementation
|
||||
- [ ] Implement `pool_refill_and_alloc()`
|
||||
- [ ] Capture pre-refill state
|
||||
- [ ] Get refill count (default for Phase 1)
|
||||
- [ ] Batch allocate from backend
|
||||
- [ ] Install chain in TLS
|
||||
- [ ] Return first block
|
||||
|
||||
#### Contract B Validation
|
||||
- [ ] Verify refill NEVER blocks waiting for policy
|
||||
- [ ] Verify refill only reads atomic policy values
|
||||
- [ ] No immediate cache manipulation
|
||||
|
||||
#### Contract C Validation
|
||||
- [ ] Event created on stack
|
||||
- [ ] Event data copied, not referenced
|
||||
- [ ] No dynamic allocation for events
|
||||
|
||||
## Phase 2: Metrics Collection
|
||||
|
||||
### Metrics Addition
|
||||
- [ ] Add hit/miss counters to TLS state
|
||||
- [ ] Add miss streak tracking
|
||||
- [ ] Instrument hot path (with ifdef guard)
|
||||
- [ ] Implement `pool_print_stats()`
|
||||
|
||||
### Performance Validation
|
||||
- [ ] Measure regression with metrics enabled
|
||||
- [ ] Must be < 2% performance impact
|
||||
- [ ] Verify counters are accurate
|
||||
|
||||
## Phase 3: Learning Integration
|
||||
|
||||
### Box 3: ACE Learning (ace_learning.c)
|
||||
|
||||
#### Setup
|
||||
- [ ] Create `core/ace_learning.c` and `core/ace_learning.h`
|
||||
- [ ] Pre-allocate event ring buffer: `RefillEvent g_event_pool[QUEUE_SIZE]`
|
||||
- [ ] Initialize MPSC queue structure
|
||||
- [ ] Define policy table: `_Atomic uint32_t g_refill_policies[CLASSES]`
|
||||
|
||||
#### MPSC Queue Implementation
|
||||
- [ ] Implement `ace_push_event()`
|
||||
- [ ] Contract A: Check for full queue
|
||||
- [ ] Contract A: DROP if full (never block!)
|
||||
- [ ] Contract A: Track drops with counter
|
||||
- [ ] Contract C: COPY event to ring buffer
|
||||
- [ ] Use proper memory ordering
|
||||
- [ ] Implement `ace_consume_events()`
|
||||
- [ ] Read events with acquire semantics
|
||||
- [ ] Process and release slots
|
||||
- [ ] Sleep when queue empty
|
||||
|
||||
#### Contract A Validation
|
||||
- [ ] Push function NEVER blocks
|
||||
- [ ] Drops are tracked
|
||||
- [ ] Drop rate monitoring implemented
|
||||
- [ ] Warning issued if drop rate > 1%
|
||||
|
||||
#### Contract B Validation
|
||||
- [ ] ACE only writes to policy table
|
||||
- [ ] No immediate actions taken
|
||||
- [ ] No direct TLS manipulation
|
||||
- [ ] No blocking operations
|
||||
|
||||
#### Contract C Validation
|
||||
- [ ] Ring buffer pre-allocated
|
||||
- [ ] Events copied, not moved
|
||||
- [ ] No malloc/free in event path
|
||||
- [ ] Clear slot ownership model
|
||||
|
||||
#### Contract D Validation
|
||||
- [ ] ace_learning.c does NOT include pool_tls.h internals
|
||||
- [ ] No direct calls to Box1 functions
|
||||
- [ ] Only ace_push_event() exposed to Box2
|
||||
- [ ] Make notify_learning() static in pool_refill.c
|
||||
|
||||
#### Learning Algorithm
|
||||
- [ ] Implement UCB1 or similar
|
||||
- [ ] Track per-class statistics
|
||||
- [ ] Gradual policy adjustments
|
||||
- [ ] Oscillation detection
|
||||
|
||||
### Integration Points
|
||||
|
||||
#### Box2 → Box3 Connection
|
||||
- [ ] Add event creation in pool_refill_and_alloc()
|
||||
- [ ] Call ace_push_event() after successful refill
|
||||
- [ ] Make notify_learning() wrapper static
|
||||
|
||||
#### Box2 Policy Reading
|
||||
- [ ] Replace DEFAULT_REFILL_COUNT with ace_get_refill_count()
|
||||
- [ ] Atomic read of policy (no blocking)
|
||||
- [ ] Fallback to default if no policy
|
||||
|
||||
#### Startup
|
||||
- [ ] Launch learning thread in hakmem_init()
|
||||
- [ ] Initialize policy table with defaults
|
||||
- [ ] Verify thread starts successfully
|
||||
|
||||
## Diagnostics Implementation
|
||||
|
||||
### Queue Monitoring
|
||||
- [ ] Implement drop rate calculation
|
||||
- [ ] Add queue health metrics structure
|
||||
- [ ] Periodic health checks
|
||||
|
||||
### Debug Flags
|
||||
- [ ] POOL_DEBUG_CONTRACTS - contract validation
|
||||
- [ ] POOL_DEBUG_DROPS - log dropped events
|
||||
- [ ] Add contract violation counters
|
||||
|
||||
### Runtime Diagnostics
|
||||
- [ ] Implement pool_print_diagnostics()
|
||||
- [ ] Per-class statistics
|
||||
- [ ] Queue health report
|
||||
- [ ] Contract violation summary
|
||||
|
||||
## Final Validation
|
||||
|
||||
### Performance
|
||||
- [ ] Larson: 2.5M+ ops/s
|
||||
- [ ] bench_random_mixed: 40M+ ops/s
|
||||
- [ ] Background thread < 1% CPU
|
||||
- [ ] Drop rate < 0.1%
|
||||
|
||||
### Correctness
|
||||
- [ ] No memory leaks (Valgrind)
|
||||
- [ ] Thread safety verified
|
||||
- [ ] All contracts validated
|
||||
- [ ] Stress test passes
|
||||
|
||||
### Code Quality
|
||||
- [ ] Each box in separate .c file
|
||||
- [ ] Clear API boundaries
|
||||
- [ ] No cross-box includes
|
||||
- [ ] < 1000 LOC total
|
||||
|
||||
## Sign-off Checklist
|
||||
|
||||
### Contract A (Queue Never Blocks)
|
||||
- [ ] Verified ace_push_event() drops on full
|
||||
- [ ] Drop tracking implemented
|
||||
- [ ] No blocking operations in push path
|
||||
- [ ] Approved by: _____________
|
||||
|
||||
### Contract B (Policy Scope Limited)
|
||||
- [ ] ACE only adjusts next refill count
|
||||
- [ ] No immediate actions
|
||||
- [ ] Atomic reads only
|
||||
- [ ] Approved by: _____________
|
||||
|
||||
### Contract C (Memory Ownership Clear)
|
||||
- [ ] Ring buffer pre-allocated
|
||||
- [ ] Events copied not moved
|
||||
- [ ] No use-after-free possible
|
||||
- [ ] Approved by: _____________
|
||||
|
||||
### Contract D (API Boundaries Enforced)
|
||||
- [ ] Box files separate
|
||||
- [ ] No improper includes
|
||||
- [ ] Static functions where needed
|
||||
- [ ] Approved by: _____________
|
||||
|
||||
## Notes
|
||||
|
||||
**Remember**: The goal is an ultra-simple hot path (5-6 cycles) with smart learning that never interferes with performance. When in doubt, favor simplicity and speed over completeness of telemetry.
|
||||
|
||||
**Key Principle**: "キャッシュ増やす時だけ学習させる、push して他のスレッドに任せる" - Learning happens only during refill, pushed async to another thread.
|
||||
Reference in New Issue
Block a user