# ACE Phase 1 Implementation TODO
**Status**: Ready to implement (documentation complete)
**Target**: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x improvement)
**Timeline**: 1 day (7-9 hours total)
**Date**: 2025-11-01
---
## Overview
Phase 1 implements the **minimal ACE (Adaptive Control Engine)** with maximum impact:
- Metrics collection (throughput, LLC miss, mutex wait, backlog)
- Fast loop control (0.5-1s adjustment cycle)
- Dynamic TLS capacity tuning
- UCB1 learning for knob selection
- ON/OFF toggle via environment variable
**Expected Impact**: Fragmentation stress workload improves from 3.87 → 8-12 M ops/s
---
## Task Breakdown
### 1. Metrics Collection Infrastructure (2-3 hours)
#### 1.1 Create `core/hakmem_ace_metrics.h` (30 min)
- [ ] Define `struct hkm_ace_metrics` with:
```c
struct hkm_ace_metrics {
    uint64_t throughput_ops;          // Operations per second
    double   llc_miss_rate;           // LLC miss rate (0.0-1.0)
    uint64_t mutex_wait_ns;           // Mutex contention time
    uint32_t remote_free_backlog[8];  // Per-class backlog
    double   fragmentation_ratio;     // Slow metric (60s)
    uint64_t rss_mb;                  // Slow metric (60s)
    uint64_t timestamp_ms;            // Collection timestamp
};
```
- [ ] Define collection API:
```c
void hkm_ace_metrics_init(void);
void hkm_ace_metrics_collect(struct hkm_ace_metrics *out);
void hkm_ace_metrics_destroy(void);
```
#### 1.2 Create `core/hakmem_ace_metrics.c` (1.5-2 hours)
- [ ] **Throughput tracking** (30 min)
  - Global atomic counter `g_ace_alloc_count`
  - Increment in `hakmem_alloc()` / `hakmem_free()`
  - Calculate ops/sec from delta between collections
- [ ] **LLC miss monitoring** (45 min)
  - Use `rdpmc` for lightweight performance counter access
  - Read LLC_MISSES and LLC_REFERENCES counters
  - Calculate miss_rate = misses / references
  - Fallback to 0.0 if RDPMC unavailable
- [ ] **Mutex contention tracking** (30 min)
  - Wrap `pthread_mutex_lock()` with timing
  - Track cumulative wait time per class
  - Reset counters after each collection
- [ ] **Remote free backlog** (15 min)
  - Read `g_tiny_classes[c].remote_backlog_count` for each class
  - Already tracked by tiny pool implementation
- [ ] **Fragmentation ratio (slow, 60s)** (15 min)
  - Calculate: `allocated_bytes / reserved_bytes`
  - Parse `/proc/self/status` for VmRSS and VmSize
  - Only update every 60 seconds (skip on fast collections)
- [ ] **RSS monitoring (slow, 60s)** (15 min)
  - Read `/proc/self/status` VmRSS field
  - Convert to MB
  - Only update every 60 seconds
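The throughput and RSS items above reduce to two small calculations. The sketch below shows one way they might look; the helper names (`ace_ops_per_sec`, `ace_parse_vmrss_kb`, `ace_read_rss_mb`) are hypothetical and not part of the existing hakmem code, and the 4 KiB read buffer is an assumption about the size of `/proc/self/status`.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: ops/sec from two snapshots of the global
 * operation counter taken dt_ms milliseconds apart. */
static uint64_t ace_ops_per_sec(uint64_t prev_count, uint64_t cur_count,
                                uint64_t dt_ms)
{
    if (dt_ms == 0 || cur_count < prev_count)
        return 0;  /* empty interval or counter reset */
    return (cur_count - prev_count) * 1000u / dt_ms;
}

/* Hypothetical helper: extract the VmRSS value (in kB) from the text of
 * /proc/self/status. Returns 0 if the field is missing. */
static uint64_t ace_parse_vmrss_kb(const char *status_text)
{
    const char *p = status_text;
    while (p != NULL && *p != '\0') {
        unsigned long long kb;
        if (strncmp(p, "VmRSS:", 6) == 0 && sscanf(p + 6, "%llu", &kb) == 1)
            return (uint64_t)kb;
        p = strchr(p, '\n');  /* advance to the start of the next line */
        if (p != NULL)
            p++;
    }
    return 0;
}

/* Hypothetical helper: current RSS in MB, or 0 when /proc is unavailable. */
static uint64_t ace_read_rss_mb(void)
{
    char buf[4096];  /* assumed large enough to reach the VmRSS line */
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return 0;
    size_t n = fread(buf, 1, sizeof buf - 1, f);
    fclose(f);
    buf[n] = '\0';
    return ace_parse_vmrss_kb(buf) / 1024;
}
```

Keeping the parser separate from the file read makes it possible to unit-test the parsing against a fixed string, which fits the `test_ace_metrics` unit test planned later in this document.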
#### 1.3 Integration with existing code (30 min)
- [ ] Add `#include "hakmem_ace_metrics.h"` to `core/hakmem.c`
- [ ] Call `hkm_ace_metrics_init()` in `hakmem_init()`
- [ ] Call `hkm_ace_metrics_destroy()` in cleanup
---
### 2. Fast Loop Controller (2-3 hours)
#### 2.1 Create `core/hakmem_ace_controller.h` (30 min)
- [ ] Define `struct hkm_ace_controller`:
```c
struct hkm_ace_controller {
    struct hkm_ace_metrics current;
    struct hkm_ace_metrics prev;

    // Current knob values
    uint32_t tls_capacity[8];     // Per-class TLS magazine capacity
    uint32_t drain_threshold[8];  // Remote free drain threshold

    // Fast loop state
    uint64_t fast_interval_ms;    // Default 500ms
    uint64_t last_fast_tick_ms;

    // Slow loop state
    uint64_t slow_interval_ms;    // Default 30000ms (30s)
    uint64_t last_slow_tick_ms;

    // Enabled flag
    bool enabled;
};
```
- [ ] Define controller API:
```c
void hkm_ace_controller_init(struct hkm_ace_controller *ctrl);
void hkm_ace_controller_tick(struct hkm_ace_controller *ctrl);
void hkm_ace_controller_destroy(struct hkm_ace_controller *ctrl);
```
#### 2.2 Create `core/hakmem_ace_controller.c` (1.5-2 hours)
- [ ] **Initialization** (30 min)
  - Read environment variables:
    - `HAKMEM_ACE_ENABLED` (default 0)
    - `HAKMEM_ACE_FAST_INTERVAL_MS` (default 500)
    - `HAKMEM_ACE_SLOW_INTERVAL_MS` (default 30000)
  - Initialize knob values to current defaults:
    - `tls_capacity[c] = TINY_TLS_MAG_CAP` (currently 128)
    - `drain_threshold[c] = TINY_REMOTE_DRAIN_THRESHOLD` (currently high)
- [ ] **Fast loop tick** (45 min)
  - Check if `elapsed >= fast_interval_ms`
  - Collect current metrics
  - Calculate reward: `reward = throughput - (llc_miss_penalty + mutex_wait_penalty + backlog_penalty)`
  - Adjust knobs based on metrics:
    ```c
    // LLC miss high → reduce TLS capacity (diet)
    if (llc_miss_rate > 0.15) {
        tls_capacity[c] = (uint32_t)(tls_capacity[c] * 0.75); // Diet factor
    }
    // Remote backlog high → lower drain threshold
    if (remote_backlog[c] > drain_threshold[c]) {
        drain_threshold[c] /= 2;
    }
    // Mutex wait high → increase bundle width
    // (Phase 1: skip, implement in Phase 2)
    ```
  - Apply knob changes to runtime (see section 4)
  - Update `prev` metrics for next iteration
- [ ] **Slow loop tick** (30 min)
  - Check if `elapsed >= slow_interval_ms`
  - Collect slow metrics (fragmentation, RSS)
  - If fragmentation high: trigger partial release (Phase 2 feature, skip for now)
  - If RSS high: trigger budgeted scavenge (Phase 2 feature, skip for now)
- [ ] **Tick dispatcher** (15 min)
  - Combined `hkm_ace_controller_tick()` that calls both fast and slow loops
  - Use monotonic clock (`clock_gettime(CLOCK_MONOTONIC)`) for timing
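The reward formula, the interval check, and the diet step in the fast loop are all pure calculations, so they can be sketched and tested without a running allocator. The penalty weights below are placeholders chosen for illustration only (this document does not specify them), and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical penalty weights; the TODO does not define them. */
#define ACE_W_LLC     1.0e6   /* ops/s charged per 1.0 of LLC miss rate */
#define ACE_W_MUTEX   1.0e-3  /* ops/s charged per ns of mutex wait */
#define ACE_W_BACKLOG 10.0    /* ops/s charged per backlogged block */

/* reward = throughput - (llc_miss_penalty + mutex_wait_penalty + backlog_penalty) */
static double ace_reward(uint64_t throughput_ops, double llc_miss_rate,
                         uint64_t mutex_wait_ns, uint32_t backlog_total)
{
    double penalty = ACE_W_LLC * llc_miss_rate
                   + ACE_W_MUTEX * (double)mutex_wait_ns
                   + ACE_W_BACKLOG * (double)backlog_total;
    return (double)throughput_ops - penalty;
}

/* True once at least interval_ms has passed since last_tick_ms.
 * now_ms is expected to come from a monotonic clock. */
static bool ace_interval_elapsed(uint64_t now_ms, uint64_t last_tick_ms,
                                 uint64_t interval_ms)
{
    return now_ms - last_tick_ms >= interval_ms;
}

/* Integer form of the 0.75 diet factor, clamped so that repeated diets
 * never shrink the capacity below min_cap. */
static uint32_t ace_diet_capacity(uint32_t cap, uint32_t min_cap)
{
    uint32_t next = (uint32_t)((uint64_t)cap * 3 / 4);
    return next < min_cap ? min_cap : next;
}
```

Passing `now_ms` in as a parameter, rather than reading the clock inside the helper, keeps the timing logic deterministic under test. The floor in `ace_diet_capacity` is an addition of this sketch; the same idea would apply to the halving of `drain_threshold`, which otherwise can reach zero.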
#### 2.3 Integration with main loop (30 min)
- [ ] Add background thread in `core/hakmem.c`:
```c
static void* hkm_ace_thread_main(void *arg) {
    struct hkm_ace_controller *ctrl = arg;
    while (ctrl->enabled) {
        hkm_ace_controller_tick(ctrl);
        usleep(100000); // 100ms sleep, check every 0.1s
    }
    return NULL;
}
```
- [ ] Start ACE thread in `hakmem_init()` if `HAKMEM_ACE_ENABLED=1`
- [ ] Join ACE thread in cleanup
---
### 3. UCB1 Learning Algorithm (1-2 hours)
#### 3.1 Create `core/hakmem_ace_ucb1.h` (30 min)
- [ ] Define discrete knob candidates:
```c
// TLS capacity candidates
static const uint32_t TLS_CAP_CANDIDATES[] = {4, 8, 16, 32, 64, 128, 256, 512};
#define TLS_CAP_N_ARMS 8
// Drain threshold candidates
static const uint32_t DRAIN_THRESH_CANDIDATES[] = {32, 64, 128, 256, 512, 1024};
#define DRAIN_THRESH_N_ARMS 6
```
- [ ] Define `struct hkm_ace_ucb1_arm`:
```c
struct hkm_ace_ucb1_arm {
    uint32_t value;       // Knob value (e.g., 32, 64, 128)
    double   avg_reward;  // Average reward
    uint32_t n_pulls;     // Number of times selected
};
```
- [ ] Define `struct hkm_ace_ucb1_bandit`:
```c
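struct hkm_ace_ucb1_bandit {
    struct hkm_ace_ucb1_arm arms[TLS_CAP_N_ARMS]; // Sized for the largest candidate set
    int      n_arms;             // Arms in use (6 for drain threshold, 8 for TLS capacity)
    uint32_t total_pulls;
    double   exploration_bonus;  // Default sqrt(2)
};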
```
- [ ] Define UCB1 API:
```c
void hkm_ace_ucb1_init(struct hkm_ace_ucb1_bandit *bandit, const uint32_t *candidates, int n_arms);
int hkm_ace_ucb1_select(struct hkm_ace_ucb1_bandit *bandit);
void hkm_ace_ucb1_update(struct hkm_ace_ucb1_bandit *bandit, int arm_idx, double reward);
```
#### 3.2 Create `core/hakmem_ace_ucb1.c` (45 min)
- [ ] **Initialization** (15 min)
  - Initialize each arm with candidate value
  - Set `avg_reward = 0.0`, `n_pulls = 0`
- [ ] **Selection** (15 min)
  - Implement UCB1 formula:
    ```c
    ucb_value = avg_reward + exploration_bonus * sqrt(log(total_pulls) / n_pulls)
    ```
  - Return arm index with highest UCB value
  - Handle initial exploration (n_pulls == 0 → infinity UCB)
- [ ] **Update** (15 min)
  - Update running average:
    ```c
    avg_reward = (avg_reward * n_pulls + reward) / (n_pulls + 1)
    ```
  - Increment `n_pulls` and `total_pulls`
#### 3.3 Integration with controller (30 min)
- [ ] Add UCB1 bandits to `struct hkm_ace_controller`:
```c
struct hkm_ace_ucb1_bandit tls_cap_bandit[8]; // Per-class TLS capacity
struct hkm_ace_ucb1_bandit drain_bandit[8]; // Per-class drain threshold
```
- [ ] In fast loop tick:
  - Select knob values using UCB1: `arm_idx = hkm_ace_ucb1_select(&ctrl->tls_cap_bandit[c])`
  - Apply selected values: `ctrl->tls_capacity[c] = TLS_CAP_CANDIDATES[arm_idx]`
  - After observing reward: `hkm_ace_ucb1_update(&ctrl->tls_cap_bandit[c], arm_idx, reward)`
---
### 4. Dynamic TLS Capacity Adjustment (1-2 hours)
#### 4.1 Modify `core/hakmem_tiny_magazine.h` (30 min)
- [ ] Change `TINY_TLS_MAG_CAP` from compile-time constant to runtime variable:
```c
// OLD:
#define TINY_TLS_MAG_CAP 128
// NEW:
extern uint32_t g_tiny_tls_mag_cap[8]; // Per-class capacity
```
- [ ] Update all references to `TINY_TLS_MAG_CAP` to use `g_tiny_tls_mag_cap[class_idx]`
#### 4.2 Modify `core/hakmem_tiny_magazine.c` (30 min)
- [ ] Define global capacity array:
```c
uint32_t g_tiny_tls_mag_cap[8] = {
    128, 128, 128, 128, 128, 128, 128, 128 // Default values
};
```
- [ ] Add setter function:
```c
void hkm_tiny_set_tls_capacity(uint8_t class_idx, uint32_t new_cap) {
    if (class_idx >= 8) return;
    g_tiny_tls_mag_cap[class_idx] = new_cap;
}
```
- [ ] Update magazine refill logic to respect dynamic capacity:
```c
// In tiny_magazine_refill():
uint32_t cap = g_tiny_tls_mag_cap[class_idx];
if (mag->count >= cap) return; // Already at capacity
```
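Because the ACE thread writes the capacity while allocator threads read it, a plain `uint32_t` array is a data race in the C11 memory model. One possible variant, sketched below under that assumption, uses relaxed atomics and clamps the value to the range of `TLS_CAP_CANDIDATES` (4 to 512). The `ace_sketch_*` names and the `g_sketch_tls_cap` array are hypothetical stand-ins, not the existing hakmem symbols.

```c
#include <stdatomic.h>
#include <stdint.h>

#define ACE_SKETCH_N_CLASSES 8
#define ACE_SKETCH_CAP_MIN   4u    /* smallest TLS_CAP_CANDIDATES entry */
#define ACE_SKETCH_CAP_MAX   512u  /* largest TLS_CAP_CANDIDATES entry */

/* Hypothetical atomic stand-in for g_tiny_tls_mag_cap. */
static _Atomic uint32_t g_sketch_tls_cap[ACE_SKETCH_N_CLASSES] = {
    128, 128, 128, 128, 128, 128, 128, 128
};

/* Called by the ACE thread. Out-of-range classes are ignored and the
 * capacity is clamped to [ACE_SKETCH_CAP_MIN, ACE_SKETCH_CAP_MAX]. */
void ace_sketch_set_tls_capacity(uint8_t class_idx, uint32_t new_cap)
{
    if (class_idx >= ACE_SKETCH_N_CLASSES)
        return;
    if (new_cap < ACE_SKETCH_CAP_MIN)
        new_cap = ACE_SKETCH_CAP_MIN;
    if (new_cap > ACE_SKETCH_CAP_MAX)
        new_cap = ACE_SKETCH_CAP_MAX;
    atomic_store_explicit(&g_sketch_tls_cap[class_idx], new_cap,
                          memory_order_relaxed);
}

/* Called on the allocation path. A relaxed load suffices because the
 * capacity is only a tuning hint and publishes no other data. */
uint32_t ace_sketch_get_tls_capacity(uint8_t class_idx)
{
    if (class_idx >= ACE_SKETCH_N_CLASSES)
        return 0;
    return atomic_load_explicit(&g_sketch_tls_cap[class_idx],
                                memory_order_relaxed);
}
```

Note that the refill check shown above only stops a magazine from growing past the new capacity. A magazine that already holds more blocks than a lowered cap would presumably shrink only as allocations drain it, unless an explicit trim step is added.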
#### 4.3 Integration with ACE controller (30 min)
- [ ] In `hkm_ace_controller_tick()`, apply TLS capacity changes:
```c
for (int c = 0; c < 8; c++) {
    uint32_t new_cap = ctrl->tls_capacity[c];
    hkm_tiny_set_tls_capacity(c, new_cap);
}
```
- [ ] Similarly for drain threshold (if implemented in tiny pool):
```c
for (int c = 0; c < 8; c++) {
    uint32_t new_thresh = ctrl->drain_threshold[c];
    hkm_tiny_set_drain_threshold(c, new_thresh);
}
```
---
### 5. ON/OFF Toggle and Configuration (1 hour)
#### 5.1 Environment variables (30 min)
- [ ] Add to `core/hakmem_config.h`:
```c
// ACE Learning Layer
#define HAKMEM_ACE_ENABLED "HAKMEM_ACE_ENABLED" // 0/1
#define HAKMEM_ACE_FAST_INTERVAL_MS "HAKMEM_ACE_FAST_INTERVAL_MS" // Default 500
#define HAKMEM_ACE_SLOW_INTERVAL_MS "HAKMEM_ACE_SLOW_INTERVAL_MS" // Default 30000
#define HAKMEM_ACE_LOG_LEVEL "HAKMEM_ACE_LOG_LEVEL" // 0=off, 1=info, 2=debug
// Safety guards
#define HAKMEM_ACE_MAX_P99_LAT_NS "HAKMEM_ACE_MAX_P99_LAT_NS" // Default 10000000 (10ms)
#define HAKMEM_ACE_MAX_RSS_MB "HAKMEM_ACE_MAX_RSS_MB" // Default 16384 (16GB)
#define HAKMEM_ACE_MAX_CPU_PERCENT "HAKMEM_ACE_MAX_CPU_PERCENT" // Default 5
```
- [ ] Parse environment variables in `hkm_ace_controller_init()`
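Parsing could be done with a small helper that falls back to the documented default whenever a variable is unset or malformed. The sketch below is one possible shape; `ace_parse_u64` and `ace_env_u64` are hypothetical names, and rejecting trailing characters (such as `500ms`) is a design choice of this sketch rather than a requirement stated here.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a non-negative decimal integer, returning
 * def when s is NULL, empty, negative, out of range, or has trailing
 * characters. */
static uint64_t ace_parse_u64(const char *s, uint64_t def)
{
    if (s == NULL || *s == '\0' || *s == '-')
        return def;
    errno = 0;
    char *end = NULL;
    unsigned long long v = strtoull(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0')
        return def;
    return (uint64_t)v;
}

/* Hypothetical helper: read an environment variable as a u64 with a
 * default, e.g. ace_env_u64("HAKMEM_ACE_FAST_INTERVAL_MS", 500). */
static uint64_t ace_env_u64(const char *name, uint64_t def)
{
    return ace_parse_u64(getenv(name), def);
}
```

With this helper, `HAKMEM_ACE_ENABLED` could be read as `ace_env_u64("HAKMEM_ACE_ENABLED", 0) != 0`, so that an unset or garbled value leaves ACE disabled.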
#### 5.2 Logging infrastructure (30 min)
- [ ] Add logging macros in `core/hakmem_ace_controller.c`:
```c
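// Wrapped in do { } while (0) so the macros are safe inside unbraced if/else
#define ACE_LOG_INFO(fmt, ...) \
    do { if (g_ace_log_level >= 1) fprintf(stderr, "[ACE] " fmt "\n", ##__VA_ARGS__); } while (0)
#define ACE_LOG_DEBUG(fmt, ...) \
    do { if (g_ace_log_level >= 2) fprintf(stderr, "[ACE DEBUG] " fmt "\n", ##__VA_ARGS__); } while (0)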
```
- [ ] Add debug output in fast loop:
```c
ACE_LOG_DEBUG("Fast loop: reward=%.2f, llc_miss=%.2f, backlog=%u",
              reward, llc_miss_rate, remote_backlog[0]);
ACE_LOG_INFO("Adjusting TLS cap[%d]: %u → %u (diet factor=%.2f)",
             c, old_cap, new_cap, diet_factor);
```
---
## Testing Strategy
### Unit Tests
- [ ] Test metrics collection:
```bash
# Verify throughput tracking
HAKMEM_ACE_ENABLED=1 ./test_ace_metrics
```
- [ ] Test UCB1 selection:
```bash
# Verify arm selection and update
./test_ace_ucb1
```
### Integration Tests
- [ ] Test ACE on fragmentation stress benchmark:
```bash
# Baseline (ACE OFF)
HAKMEM_ACE_ENABLED=0 ./bench_fragment_stress_hakx > baseline.txt
# ACE ON
HAKMEM_ACE_ENABLED=1 ./bench_fragment_stress_hakx > ace_on.txt
# Compare
diff baseline.txt ace_on.txt
```
- [ ] Verify dynamic TLS capacity adjustment:
```bash
# Enable debug logging
export HAKMEM_ACE_ENABLED=1
export HAKMEM_ACE_LOG_LEVEL=2
./bench_fragment_stress_hakx
# Should see log output: "Adjusting TLS cap[2]: 128 → 96"
```
### Benchmark Validation
- [ ] Run A/B comparison on all weak workloads:
```bash
bash scripts/ace_ab_test.sh
```
- [ ] Expected results:
- Fragmentation stress: 3.87 → 8-12 M ops/s (2-3x)
- Mid MT: 111.6 M ops/s → 110-115 M ops/s (maintain ±5%)
- Large WS: 22.15 M ops/s → 25-30 M ops/s (1.1-1.4x, partial improvement)
---
## Implementation Order
**Day 1 (7-9 hours)**:
1. **Morning (3-4 hours)**:
   - [ ] 1.1 Create hakmem_ace_metrics.h (30 min)
   - [ ] 1.2 Create hakmem_ace_metrics.c (2 hours)
   - [ ] 1.3 Integration (30 min)
   - [ ] Test: Verify metrics collection works
2. **Midday (2-3 hours)**:
   - [ ] 2.1 Create hakmem_ace_controller.h (30 min)
   - [ ] 2.2 Create hakmem_ace_controller.c (1.5 hours)
   - [ ] 2.3 Integration (30 min)
   - [ ] Test: Verify fast/slow loops run
3. **Afternoon (2-3 hours)**:
   - [ ] 3.1 Create hakmem_ace_ucb1.h (30 min)
   - [ ] 3.2 Create hakmem_ace_ucb1.c (45 min)
   - [ ] 3.3 Integration (30 min)
   - [ ] 4.1-4.3 Dynamic TLS capacity (1.5 hours)
   - [ ] 5.1-5.2 ON/OFF toggle (1 hour)
4. **Evening (1-2 hours)**:
   - [ ] Build and test complete system
   - [ ] Run fragmentation stress A/B test
   - [ ] Verify 2-3x improvement
---
## Success Criteria
Phase 1 is complete when:
- ✅ Metrics collection works (throughput, LLC miss, mutex wait, backlog)
- ✅ Fast loop adjusts TLS capacity based on LLC miss rate
- ✅ UCB1 learning selects optimal knob values
- ✅ Dynamic TLS capacity affects runtime behavior
- ✅ ON/OFF toggle via `HAKMEM_ACE_ENABLED=1` works
- ✅ **Benchmark improvement**: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x)
- ✅ **No regression**: Mid MT maintains 110-115 M ops/s (±5%)
---
## Files to Create
New files (Phase 1):
```
core/hakmem_ace_metrics.h (80 lines)
core/hakmem_ace_metrics.c (300 lines)
core/hakmem_ace_controller.h (100 lines)
core/hakmem_ace_controller.c (400 lines)
core/hakmem_ace_ucb1.h (80 lines)
core/hakmem_ace_ucb1.c (150 lines)
```
Modified files:
```
core/hakmem_tiny_magazine.h (change TINY_TLS_MAG_CAP to array)
core/hakmem_tiny_magazine.c (add setter, use dynamic capacity)
core/hakmem.c (start ACE thread)
core/hakmem_config.h (add ACE env vars)
```
Test files:
```
tests/unit/test_ace_metrics.c (150 lines)
tests/unit/test_ace_ucb1.c (120 lines)
tests/integration/test_ace_e2e.c (200 lines)
```
Scripts:
```
benchmarks/scripts/utils/ace_ab_test.sh (100 lines)
```
**Total new code**: ~1,680 lines (Phase 1 only)
---
## Next Steps After Phase 1
Once Phase 1 is complete and validated:
- **Phase 2**: Fragmentation countermeasures (budgeted scavenge, partial release)
- **Phase 3**: Large WS countermeasures (auto diet, LLC miss optimization)
- **Phase 4**: realloc optimization (in-place expansion, NT store)
---
**Status**: READY TO IMPLEMENT
**Priority**: HIGH 🔥
**Expected Impact**: 2-3x improvement on fragmentation stress
**Risk**: LOW (isolated, ON/OFF toggle, no impact when disabled)
Let's build it! 💪