# ACE Phase 1 Implementation TODO

**Status**: Ready to implement (documentation complete)
**Target**: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x improvement)
**Timeline**: 1 day (7-9 hours total)
**Date**: 2025-11-01

---

## Overview

Phase 1 implements the **minimal ACE (Adaptive Control Engine)** with maximum impact:

- Metrics collection (throughput, LLC miss, mutex wait, backlog)
- Fast loop control (0.5-1s adjustment cycle)
- Dynamic TLS capacity tuning
- UCB1 learning for knob selection
- ON/OFF toggle via environment variable

**Expected Impact**: Fragmentation stress workload improves from 3.87 to 8-12 M ops/s

---

## Task Breakdown

### 1. Metrics Collection Infrastructure (2-3 hours)

#### 1.1 Create `core/hakmem_ace_metrics.h` (30 min)

- [ ] Define `struct hkm_ace_metrics` with:
  ```c
  #include <stdint.h>

  struct hkm_ace_metrics {
      uint64_t throughput_ops;          // Operations per second
      double   llc_miss_rate;           // LLC miss rate (0.0-1.0)
      uint64_t mutex_wait_ns;           // Mutex contention time
      uint32_t remote_free_backlog[8];  // Per-class backlog
      double   fragmentation_ratio;     // Slow metric (60s)
      uint64_t rss_mb;                  // Slow metric (60s)
      uint64_t timestamp_ms;            // Collection timestamp
  };
  ```

- [ ] Define collection API:
  ```c
  void hkm_ace_metrics_init(void);
  void hkm_ace_metrics_collect(struct hkm_ace_metrics *out);
  void hkm_ace_metrics_destroy(void);
  ```

#### 1.2 Create `core/hakmem_ace_metrics.c` (1.5-2 hours)

- [ ] **Throughput tracking** (30 min)
  - Global atomic counter `g_ace_alloc_count`
  - Increment in `hakmem_alloc()` / `hakmem_free()`
  - Calculate ops/sec from delta between collections

- [ ] **LLC miss monitoring** (45 min)
  - Use `rdpmc` for lightweight performance counter access
  - Read LLC_MISSES and LLC_REFERENCES counters
  - Calculate `miss_rate = misses / references`
  - Fallback to 0.0 if RDPMC is unavailable (or if `references` is 0)

- [ ] **Mutex contention tracking** (30 min)
  - Wrap `pthread_mutex_lock()` with timing
  - Track cumulative wait time per class
  - Reset counters after each collection

- [ ] **Remote free backlog**
  (15 min)
  - Read `g_tiny_classes[c].remote_backlog_count` for each class
  - Already tracked by tiny pool implementation

- [ ] **Fragmentation ratio (slow, 60s)** (15 min)
  - Calculate: `allocated_bytes / reserved_bytes`
  - Parse `/proc/self/status` for VmRSS and VmSize
  - Only update every 60 seconds (skip on fast collections)

- [ ] **RSS monitoring (slow, 60s)** (15 min)
  - Read `/proc/self/status` VmRSS field
  - Convert to MB
  - Only update every 60 seconds

#### 1.3 Integration with existing code (30 min)

- [ ] Add `#include "hakmem_ace_metrics.h"` to `core/hakmem.c`
- [ ] Call `hkm_ace_metrics_init()` in `hakmem_init()`
- [ ] Call `hkm_ace_metrics_destroy()` in cleanup

---

### 2. Fast Loop Controller (2-3 hours)

#### 2.1 Create `core/hakmem_ace_controller.h` (30 min)

- [ ] Define `struct hkm_ace_controller`:
  ```c
  #include <stdatomic.h>
  #include <stdint.h>

  struct hkm_ace_controller {
      struct hkm_ace_metrics current;
      struct hkm_ace_metrics prev;

      // Current knob values
      uint32_t tls_capacity[8];     // Per-class TLS magazine capacity
      uint32_t drain_threshold[8];  // Remote free drain threshold

      // Fast loop state
      uint64_t fast_interval_ms;    // Default 500ms
      uint64_t last_fast_tick_ms;

      // Slow loop state
      uint64_t slow_interval_ms;    // Default 30000ms (30s)
      uint64_t last_slow_tick_ms;

      // Enabled flag; atomic because cleanup clears it while the ACE thread reads it
      atomic_bool enabled;
  };
  ```

- [ ] Define controller API:
  ```c
  void hkm_ace_controller_init(struct hkm_ace_controller *ctrl);
  void hkm_ace_controller_tick(struct hkm_ace_controller *ctrl);
  void hkm_ace_controller_destroy(struct hkm_ace_controller *ctrl);
  ```

#### 2.2 Create `core/hakmem_ace_controller.c` (1.5-2 hours)

- [ ] **Initialization** (30 min)
  - Read environment variables:
    - `HAKMEM_ACE_ENABLED` (default 0)
    - `HAKMEM_ACE_FAST_INTERVAL_MS` (default 500)
    - `HAKMEM_ACE_SLOW_INTERVAL_MS` (default 30000)
  - Initialize knob values to current defaults:
    - `tls_capacity[c] = TINY_TLS_MAG_CAP` (currently 128)
    - `drain_threshold[c] = TINY_REMOTE_DRAIN_THRESHOLD` (currently high)

- [ ] **Fast loop tick** (45 min)
  - Check if `elapsed >=
    fast_interval_ms`
  - Collect current metrics
  - Calculate reward: `reward = throughput - (llc_miss_penalty + mutex_wait_penalty + backlog_penalty)`
  - Adjust knobs based on metrics:
    ```c
    // LLC miss high → reduce TLS capacity (diet)
    if (llc_miss_rate > 0.15) {
        // Diet factor 0.75 in integer math, floored at the smallest candidate (4)
        uint32_t dieted = tls_capacity[c] * 3 / 4;
        tls_capacity[c] = dieted < 4 ? 4 : dieted;
    }

    // Remote backlog high → lower drain threshold
    if (remote_backlog[c] > drain_threshold[c]) {
        // Halve, but never below the smallest candidate (32)
        uint32_t halved = drain_threshold[c] / 2;
        drain_threshold[c] = halved < 32 ? 32 : halved;
    }

    // Mutex wait high → increase bundle width
    // (Phase 1: skip, implement in Phase 2)
    ```
  - Apply knob changes to runtime (see section 4)
  - Update `prev` metrics for next iteration

- [ ] **Slow loop tick** (30 min)
  - Check if `elapsed >= slow_interval_ms`
  - Collect slow metrics (fragmentation, RSS)
  - If fragmentation high: trigger partial release (Phase 2 feature, skip for now)
  - If RSS high: trigger budgeted scavenge (Phase 2 feature, skip for now)

- [ ] **Tick dispatcher** (15 min)
  - Combined `hkm_ace_controller_tick()` that calls both fast and slow loops
  - Use monotonic clock (`clock_gettime(CLOCK_MONOTONIC)`) for timing

#### 2.3 Integration with main loop (30 min)

- [ ] Add background thread in `core/hakmem.c`:
  ```c
  #include <unistd.h>  // usleep

  static void* hkm_ace_thread_main(void *arg) {
      struct hkm_ace_controller *ctrl = arg;
      // `enabled` is cleared by cleanup on another thread, so it needs to be
      // atomic (or otherwise synchronized) for this loop to observe the change
      while (ctrl->enabled) {
          hkm_ace_controller_tick(ctrl);
          usleep(100000);  // 100ms sleep, check every 0.1s
      }
      return NULL;
  }
  ```

- [ ] Start ACE thread in `hakmem_init()` if `HAKMEM_ACE_ENABLED=1`
- [ ] Clear `enabled` and join ACE thread in cleanup

---

### 3. UCB1 Learning Algorithm (1-2 hours)

#### 3.1 Create `core/hakmem_ace_ucb1.h` (30 min)

- [ ] Define discrete knob candidates:
  ```c
  // TLS capacity candidates
  static const uint32_t TLS_CAP_CANDIDATES[] = {4, 8, 16, 32, 64, 128, 256, 512};
  #define TLS_CAP_N_ARMS 8

  // Drain threshold candidates
  static const uint32_t DRAIN_THRESH_CANDIDATES[] = {32, 64, 128, 256, 512, 1024};
  #define DRAIN_THRESH_N_ARMS 6
  ```

- [ ] Define `struct hkm_ace_ucb1_arm`:
  ```c
  struct hkm_ace_ucb1_arm {
      uint32_t value;       // Knob value (e.g., 32, 64, 128)
      double   avg_reward;  // Average reward
      uint32_t n_pulls;     // Number of times selected
  };
  ```

- [ ] Define `struct hkm_ace_ucb1_bandit`:
  ```c
  struct hkm_ace_ucb1_bandit {
      // Sized for the larger candidate set; drain bandits use only the
      // first DRAIN_THRESH_N_ARMS entries
      struct hkm_ace_ucb1_arm arms[TLS_CAP_N_ARMS];
      int      n_arms;             // Arms in use (<= TLS_CAP_N_ARMS), set by init
      uint32_t total_pulls;
      double   exploration_bonus;  // Default sqrt(2)
  };
  ```

- [ ] Define UCB1 API:
  ```c
  void hkm_ace_ucb1_init(struct hkm_ace_ucb1_bandit *bandit, const uint32_t *candidates, int n_arms);
  int hkm_ace_ucb1_select(struct hkm_ace_ucb1_bandit *bandit);
  void hkm_ace_ucb1_update(struct hkm_ace_ucb1_bandit *bandit, int arm_idx, double reward);
  ```

#### 3.2 Create `core/hakmem_ace_ucb1.c` (45 min)

- [ ] **Initialization** (15 min)
  - Initialize each arm with candidate value
  - Set `avg_reward = 0.0`, `n_pulls = 0`

- [ ] **Selection** (15 min)
  - Implement UCB1 formula:
    ```c
    ucb_value = avg_reward + exploration_bonus * sqrt(log(total_pulls) / n_pulls)
    ```
  - Return arm index with highest UCB value
  - Handle initial exploration (`n_pulls == 0` → infinite UCB, which also avoids dividing by zero)

- [ ] **Update** (15 min)
  - Update running average:
    ```c
    avg_reward = (avg_reward * n_pulls + reward) / (n_pulls + 1)
    ```
  - Increment `n_pulls` and `total_pulls`

#### 3.3 Integration with controller (30 min)

- [ ] Add UCB1 bandits to `struct hkm_ace_controller`:
  ```c
  struct hkm_ace_ucb1_bandit tls_cap_bandit[8];  // Per-class TLS capacity
  struct hkm_ace_ucb1_bandit drain_bandit[8];    // Per-class drain threshold
  ```

- [ ] In fast loop tick:
  - Select knob values using UCB1: `arm_idx =
    hkm_ace_ucb1_select(&ctrl->tls_cap_bandit[c])`
  - Apply selected values: `ctrl->tls_capacity[c] = TLS_CAP_CANDIDATES[arm_idx]`
  - After observing reward: `hkm_ace_ucb1_update(&ctrl->tls_cap_bandit[c], arm_idx, reward)`

---

### 4. Dynamic TLS Capacity Adjustment (1-2 hours)

#### 4.1 Modify `core/hakmem_tiny_magazine.h` (30 min)

- [ ] Change `TINY_TLS_MAG_CAP` from compile-time constant to runtime variable:
  ```c
  // OLD:
  #define TINY_TLS_MAG_CAP 128

  // NEW:
  extern uint32_t g_tiny_tls_mag_cap[8];  // Per-class capacity
  ```

- [ ] Update all references to `TINY_TLS_MAG_CAP` to use `g_tiny_tls_mag_cap[class_idx]`

#### 4.2 Modify `core/hakmem_tiny_magazine.c` (30 min)

- [ ] Define global capacity array:
  ```c
  uint32_t g_tiny_tls_mag_cap[8] = {
      128, 128, 128, 128, 128, 128, 128, 128  // Default values
  };
  ```

- [ ] Add setter function:
  ```c
  // Called from the ACE thread while allocating threads read the array;
  // consider relaxed atomics for the store and the loads to avoid a data race.
  void hkm_tiny_set_tls_capacity(uint8_t class_idx, uint32_t new_cap) {
      if (class_idx >= 8) return;
      g_tiny_tls_mag_cap[class_idx] = new_cap;
  }
  ```

- [ ] Update magazine refill logic to respect dynamic capacity:
  ```c
  // In tiny_magazine_refill():
  uint32_t cap = g_tiny_tls_mag_cap[class_idx];
  if (mag->count >= cap) return;  // Already at capacity
  ```

#### 4.3 Integration with ACE controller (30 min)

- [ ] In `hkm_ace_controller_tick()`, apply TLS capacity changes:
  ```c
  for (int c = 0; c < 8; c++) {
      uint32_t new_cap = ctrl->tls_capacity[c];
      hkm_tiny_set_tls_capacity(c, new_cap);
  }
  ```

- [ ] Similarly for drain threshold (if implemented in tiny pool):
  ```c
  for (int c = 0; c < 8; c++) {
      uint32_t new_thresh = ctrl->drain_threshold[c];
      hkm_tiny_set_drain_threshold(c, new_thresh);
  }
  ```

---

### 5. ON/OFF Toggle and Configuration (1 hour)

#### 5.1 Environment variables (30 min)

- [ ] Add to `core/hakmem_config.h`:
  ```c
  // ACE Learning Layer
  #define HAKMEM_ACE_ENABLED           "HAKMEM_ACE_ENABLED"           // 0/1
  #define HAKMEM_ACE_FAST_INTERVAL_MS  "HAKMEM_ACE_FAST_INTERVAL_MS"  // Default 500
  #define HAKMEM_ACE_SLOW_INTERVAL_MS  "HAKMEM_ACE_SLOW_INTERVAL_MS"  // Default 30000
  #define HAKMEM_ACE_LOG_LEVEL         "HAKMEM_ACE_LOG_LEVEL"         // 0=off, 1=info, 2=debug

  // Safety guards
  #define HAKMEM_ACE_MAX_P99_LAT_NS    "HAKMEM_ACE_MAX_P99_LAT_NS"    // Default 10000000 (10ms)
  #define HAKMEM_ACE_MAX_RSS_MB        "HAKMEM_ACE_MAX_RSS_MB"        // Default 16384 (16GB)
  #define HAKMEM_ACE_MAX_CPU_PERCENT   "HAKMEM_ACE_MAX_CPU_PERCENT"   // Default 5
  ```

- [ ] Parse environment variables in `hkm_ace_controller_init()`

#### 5.2 Logging infrastructure (30 min)

- [ ] Add logging macros in `core/hakmem_ace_controller.c`:
  ```c
  // do { ... } while (0) keeps each macro safe inside unbraced if/else
  #define ACE_LOG_INFO(fmt, ...) \
      do { if (g_ace_log_level >= 1) fprintf(stderr, "[ACE] " fmt "\n", ##__VA_ARGS__); } while (0)
  #define ACE_LOG_DEBUG(fmt, ...) \
      do { if (g_ace_log_level >= 2) fprintf(stderr, "[ACE DEBUG] " fmt "\n", ##__VA_ARGS__); } while (0)
  ```

- [ ] Add debug output in fast loop:
  ```c
  ACE_LOG_DEBUG("Fast loop: reward=%.2f, llc_miss=%.2f, backlog=%u",
                reward, llc_miss_rate, remote_backlog[0]);
  ACE_LOG_INFO("Adjusting TLS cap[%d]: %u → %u (diet factor=%.2f)",
               c, old_cap, new_cap, diet_factor);
  ```

---

## Testing Strategy

### Unit Tests

- [ ] Test metrics collection:
  ```bash
  # Verify throughput tracking
  HAKMEM_ACE_ENABLED=1 ./test_ace_metrics
  ```

- [ ] Test UCB1 selection:
  ```bash
  # Verify arm selection and update
  ./test_ace_ucb1
  ```

### Integration Tests

- [ ] Test ACE on fragmentation stress benchmark:
  ```bash
  # Baseline (ACE OFF)
  HAKMEM_ACE_ENABLED=0 ./bench_fragment_stress_hakx > baseline.txt

  # ACE ON
  HAKMEM_ACE_ENABLED=1 ./bench_fragment_stress_hakx > ace_on.txt

  # Compare
  diff baseline.txt ace_on.txt
  ```

- [ ] Verify dynamic TLS capacity adjustment:
  ```bash
  # Enable debug logging
  export HAKMEM_ACE_ENABLED=1
  export HAKMEM_ACE_LOG_LEVEL=2
  ./bench_fragment_stress_hakx
  # Should see log output: "Adjusting TLS cap[2]: 128 → 96"
  ```

### Benchmark Validation

- [ ] Run A/B comparison on all weak workloads:
  ```bash
  bash benchmarks/scripts/utils/ace_ab_test.sh
  ```

- [ ] Expected results:
  - Fragmentation stress: 3.87 → 8-12 M ops/s (2-3x)
  - Mid MT: 111.6 M ops/s → 110-115 M ops/s (maintain ±5%)
  - Large WS: 22.15 M ops/s → 25-30 M ops/s (1.1-1.4x, partial improvement)

---

## Implementation Order

**Day 1 (7-9 hours)**:

1. **Morning (3-4 hours)**:
   - [ ] 1.1 Create hakmem_ace_metrics.h (30 min)
   - [ ] 1.2 Create hakmem_ace_metrics.c (2 hours)
   - [ ] 1.3 Integration (30 min)
   - [ ] Test: Verify metrics collection works

2. **Midday (2-3 hours)**:
   - [ ] 2.1 Create hakmem_ace_controller.h (30 min)
   - [ ] 2.2 Create hakmem_ace_controller.c (1.5 hours)
   - [ ] 2.3 Integration (30 min)
   - [ ] Test: Verify fast/slow loops run

3.
   **Afternoon (2-3 hours)**:
   - [ ] 3.1 Create hakmem_ace_ucb1.h (30 min)
   - [ ] 3.2 Create hakmem_ace_ucb1.c (45 min)
   - [ ] 3.3 Integration (30 min)
   - [ ] 4.1-4.3 Dynamic TLS capacity (1.5 hours)
   - [ ] 5.1-5.2 ON/OFF toggle (1 hour)

4. **Evening (1-2 hours)**:
   - [ ] Build and test complete system
   - [ ] Run fragmentation stress A/B test
   - [ ] Verify 2-3x improvement

---

## Success Criteria

Phase 1 is complete when:

- ✅ Metrics collection works (throughput, LLC miss, mutex wait, backlog)
- ✅ Fast loop adjusts TLS capacity based on LLC miss rate
- ✅ UCB1 learning selects optimal knob values
- ✅ Dynamic TLS capacity affects runtime behavior
- ✅ ON/OFF toggle via `HAKMEM_ACE_ENABLED=1` works
- ✅ **Benchmark improvement**: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x)
- ✅ **No regression**: Mid MT maintains 110-115 M ops/s (±5%)

---

## Files to Create

New files (Phase 1):
```
core/hakmem_ace_metrics.h       (80 lines)
core/hakmem_ace_metrics.c       (300 lines)
core/hakmem_ace_controller.h    (100 lines)
core/hakmem_ace_controller.c    (400 lines)
core/hakmem_ace_ucb1.h          (80 lines)
core/hakmem_ace_ucb1.c          (150 lines)
```

Modified files:
```
core/hakmem_tiny_magazine.h     (change TINY_TLS_MAG_CAP to array)
core/hakmem_tiny_magazine.c     (add setter, use dynamic capacity)
core/hakmem.c                   (start ACE thread)
core/hakmem_config.h            (add ACE env vars)
```

Test files:
```
tests/unit/test_ace_metrics.c       (150 lines)
tests/unit/test_ace_ucb1.c          (120 lines)
tests/integration/test_ace_e2e.c    (200 lines)
```

Scripts:
```
benchmarks/scripts/utils/ace_ab_test.sh    (100 lines)
```

**Total new code**: ~1,680 lines (Phase 1 only)

---

## Next Steps After Phase 1

Once Phase 1 is complete and validated:

- **Phase 2**: Fragmentation countermeasures (budgeted scavenge, partial release)
- **Phase 3**: Large WS countermeasures (auto diet, LLC miss optimization)
- **Phase 4**: realloc optimization (in-place expansion, NT store)

---

**Status**: READY TO IMPLEMENT
**Priority**: HIGH 🔥
**Expected Impact**: 2-3x improvement on fragmentation stress
**Risk**: LOW (isolated, ON/OFF toggle, no impact when disabled)

Let's build it! 💪