# ACE Phase 1 Implementation TODO

**Status**: Ready to implement (documentation complete)
**Target**: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x improvement)
**Timeline**: 1 day (7-9 hours total)
**Date**: 2025-11-01

---

## Overview

Phase 1 implements the **minimal ACE (Adaptive Control Engine)**, limited to the components expected to have the largest effect:
- Metrics collection (throughput, LLC miss, mutex wait, backlog)
- Fast loop control (0.5-1s adjustment cycle)
- Dynamic TLS capacity tuning
- UCB1 learning for knob selection
- ON/OFF toggle via environment variable

**Expected Impact**: Fragmentation stress workload improves from 3.87 → 8-12 M ops/s

---

## Task Breakdown

### 1. Metrics Collection Infrastructure (2-3 hours)

#### 1.1 Create `core/hakmem_ace_metrics.h` (30 min)
- [ ] Define `struct hkm_ace_metrics` with:
  ```c
  struct hkm_ace_metrics {
      uint64_t throughput_ops;        // Operations per second
      double llc_miss_rate;           // LLC miss rate (0.0-1.0)
      uint64_t mutex_wait_ns;         // Mutex contention time
      uint32_t remote_free_backlog[8]; // Per-class backlog
      double fragmentation_ratio;     // Slow metric (60s)
      uint64_t rss_mb;                // Slow metric (60s)
      uint64_t timestamp_ms;          // Collection timestamp
  };
  ```
- [ ] Define collection API:
  ```c
  void hkm_ace_metrics_init(void);
  void hkm_ace_metrics_collect(struct hkm_ace_metrics *out);
  void hkm_ace_metrics_destroy(void);
  ```

#### 1.2 Create `core/hakmem_ace_metrics.c` (1.5-2 hours)
- [ ] **Throughput tracking** (30 min)
  - Global atomic counter `g_ace_alloc_count`
  - Increment in `hakmem_alloc()` / `hakmem_free()`
  - Calculate ops/sec from delta between collections

- [ ] **LLC miss monitoring** (45 min)
  - Use `rdpmc` for lightweight performance counter access
  - Read LLC_MISSES and LLC_REFERENCES counters
  - Calculate `miss_rate = misses / references`
  - Fall back to 0.0 if RDPMC is unavailable or `references` is zero

- [ ] **Mutex contention tracking** (30 min)
  - Wrap `pthread_mutex_lock()` with timing
  - Track cumulative wait time per class
  - Reset counters after each collection

- [ ] **Remote free backlog** (15 min)
  - Read `g_tiny_classes[c].remote_backlog_count` for each class
  - Already tracked by tiny pool implementation

- [ ] **Fragmentation ratio (slow, 60s)** (15 min)
  - Calculate: `allocated_bytes / reserved_bytes`
  - Parse `/proc/self/status` for VmRSS and VmSize
  - Only update every 60 seconds (skip on fast collections)

- [ ] **RSS monitoring (slow, 60s)** (15 min)
  - Read `/proc/self/status` VmRSS field
  - Convert to MB
  - Only update every 60 seconds
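
The calculations in this list reduce to a few small pure functions. The following is a minimal sketch rather than the hakmem implementation: only `g_ace_alloc_count` comes from this plan, while `ace_count_op`, `ace_ops_per_sec`, `ace_miss_rate`, and `ace_parse_vmrss_mb` are hypothetical helper names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

// Counter named in the plan; bumped on every alloc/free.
static _Atomic uint64_t g_ace_alloc_count;

// Hypothetical helper: relaxed increment, since only the running total matters.
static inline void ace_count_op(void) {
    atomic_fetch_add_explicit(&g_ace_alloc_count, 1, memory_order_relaxed);
}

// Ops/sec from the counter delta between two collections.
static uint64_t ace_ops_per_sec(uint64_t prev_count, uint64_t cur_count,
                                uint64_t elapsed_ms) {
    assert(cur_count >= prev_count);
    if (elapsed_ms == 0) return 0;  // No time elapsed: avoid division by zero
    return (cur_count - prev_count) * 1000 / elapsed_ms;
}

// LLC miss rate in [0.0, 1.0]; 0.0 when no references were observed.
static double ace_miss_rate(uint64_t misses, uint64_t references) {
    if (references == 0) return 0.0;
    double rate = (double)misses / (double)references;
    return rate > 1.0 ? 1.0 : rate;
}

// Parse one line of /proc/self/status, e.g. "VmRSS:\t  204800 kB".
// Returns 1 and stores the value in MB on a match, 0 otherwise.
static int ace_parse_vmrss_mb(const char *line, uint64_t *out_mb) {
    unsigned long long kb;
    if (sscanf(line, "VmRSS: %llu kB", &kb) != 1) return 0;
    *out_mb = kb / 1024;
    return 1;
}
```

Keeping these as pure functions lets the unit tests exercise them without performance counters or a live `/proc` filesystem.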

#### 1.3 Integration with existing code (30 min)
- [ ] Add `#include "hakmem_ace_metrics.h"` to `core/hakmem.c`
- [ ] Call `hkm_ace_metrics_init()` in `hakmem_init()`
- [ ] Call `hkm_ace_metrics_destroy()` in cleanup

---

### 2. Fast Loop Controller (2-3 hours)

#### 2.1 Create `core/hakmem_ace_controller.h` (30 min)
- [ ] Define `struct hkm_ace_controller`:
  ```c
  struct hkm_ace_controller {
      struct hkm_ace_metrics current;
      struct hkm_ace_metrics prev;

      // Current knob values
      uint32_t tls_capacity[8];       // Per-class TLS magazine capacity
      uint32_t drain_threshold[8];    // Remote free drain threshold

      // Fast loop state
      uint64_t fast_interval_ms;      // Default 500ms
      uint64_t last_fast_tick_ms;

      // Slow loop state
      uint64_t slow_interval_ms;      // Default 30000ms (30s)
      uint64_t last_slow_tick_ms;

      // Enabled flag (atomic: cleared during cleanup, polled by the ACE thread)
      atomic_bool enabled;            // Requires <stdatomic.h>
  };
  ```
- [ ] Define controller API:
  ```c
  void hkm_ace_controller_init(struct hkm_ace_controller *ctrl);
  void hkm_ace_controller_tick(struct hkm_ace_controller *ctrl);
  void hkm_ace_controller_destroy(struct hkm_ace_controller *ctrl);
  ```

#### 2.2 Create `core/hakmem_ace_controller.c` (1.5-2 hours)
- [ ] **Initialization** (30 min)
  - Read environment variables:
    - `HAKMEM_ACE_ENABLED` (default 0)
    - `HAKMEM_ACE_FAST_INTERVAL_MS` (default 500)
    - `HAKMEM_ACE_SLOW_INTERVAL_MS` (default 30000)
  - Initialize knob values to current defaults:
    - `tls_capacity[c] = TINY_TLS_MAG_CAP` (currently 128)
    - `drain_threshold[c] = TINY_REMOTE_DRAIN_THRESHOLD` (currently high)

- [ ] **Fast loop tick** (45 min)
  - Check if `elapsed >= fast_interval_ms`
  - Collect current metrics
  - Calculate reward: `reward = throughput - (llc_miss_penalty + mutex_wait_penalty + backlog_penalty)`
  - Adjust knobs based on metrics:
    ```c
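    // LLC miss high → reduce TLS capacity (diet)
    if (llc_miss_rate > 0.15) {
        // Integer math avoids the implicit uint32_t → double → uint32_t conversion
        tls_capacity[c] = tls_capacity[c] * 3 / 4;     // Diet factor 0.75
        if (tls_capacity[c] < 4) tls_capacity[c] = 4;  // Floor: smallest TLS candidate
    }

    // Remote backlog high → lower drain threshold
    if (remote_backlog[c] > drain_threshold[c]) {
        drain_threshold[c] /= 2;
        if (drain_threshold[c] < 32) drain_threshold[c] = 32;  // Floor: smallest drain candidate
    }

    // Mutex wait high → increase bundle width
    // (Phase 1: skip, implement in Phase 2)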
    ```
  - Apply knob changes to runtime (see section 4)
  - Update `prev` metrics for next iteration

- [ ] **Slow loop tick** (30 min)
  - Check if `elapsed >= slow_interval_ms`
  - Collect slow metrics (fragmentation, RSS)
  - If fragmentation high: trigger partial release (Phase 2 feature, skip for now)
  - If RSS high: trigger budgeted scavenge (Phase 2 feature, skip for now)

- [ ] **Tick dispatcher** (15 min)
  - Combined `hkm_ace_controller_tick()` that calls both fast and slow loops
  - Use monotonic clock (`clock_gettime(CLOCK_MONOTONIC)`) for timing
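
The interval checks and the dispatcher described above can be sketched as follows. This is an illustration under stated assumptions: `struct ace_timer`, `ace_now_ms`, `ace_timer_due`, and `ace_dispatch` are hypothetical names, and the real controller would run the fast and slow loop bodies where the comments indicate.

```c
#define _POSIX_C_SOURCE 200809L  // For clock_gettime under strict -std=c11
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

// Current monotonic time in milliseconds (unaffected by wall-clock changes).
static uint64_t ace_now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000u + (uint64_t)ts.tv_nsec / 1000000u;
}

// Hypothetical reduced view of the controller's timing fields.
struct ace_timer {
    uint64_t interval_ms;
    uint64_t last_tick_ms;
};

// Returns true and re-arms the timer once the interval has elapsed.
static bool ace_timer_due(struct ace_timer *t, uint64_t now_ms) {
    assert(now_ms >= t->last_tick_ms);  // Monotonic time never goes backwards
    if (now_ms - t->last_tick_ms < t->interval_ms) return false;
    t->last_tick_ms = now_ms;
    return true;
}

// Dispatcher: bit 0 is set if the fast loop ran, bit 1 if the slow loop ran.
static unsigned ace_dispatch(struct ace_timer *fast, struct ace_timer *slow,
                             uint64_t now_ms) {
    unsigned ran = 0;
    if (ace_timer_due(fast, now_ms)) ran |= 1u;  // Fast loop body would go here
    if (ace_timer_due(slow, now_ms)) ran |= 2u;  // Slow loop body would go here
    return ran;
}
```

Passing `now_ms` in as a parameter, instead of reading the clock inside the dispatcher, keeps the timing logic testable with synthetic timestamps.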

#### 2.3 Integration with main loop (30 min)
- [ ] Add background thread in `core/hakmem.c`:
  ```c
  static void* hkm_ace_thread_main(void *arg) {
      struct hkm_ace_controller *ctrl = arg;
      while (ctrl->enabled) {
          hkm_ace_controller_tick(ctrl);
          usleep(100000);  // 100ms sleep, check every 0.1s
      }
      return NULL;
  }
  ```
- [ ] Start ACE thread in `hakmem_init()` if `HAKMEM_ACE_ENABLED=1`
- [ ] Join ACE thread in cleanup

---

### 3. UCB1 Learning Algorithm (1-2 hours)

#### 3.1 Create `core/hakmem_ace_ucb1.h` (30 min)
- [ ] Define discrete knob candidates:
  ```c
  // TLS capacity candidates
  static const uint32_t TLS_CAP_CANDIDATES[] = {4, 8, 16, 32, 64, 128, 256, 512};
  #define TLS_CAP_N_ARMS 8

  // Drain threshold candidates
  static const uint32_t DRAIN_THRESH_CANDIDATES[] = {32, 64, 128, 256, 512, 1024};
  #define DRAIN_THRESH_N_ARMS 6
  ```
- [ ] Define `struct hkm_ace_ucb1_arm`:
  ```c
  struct hkm_ace_ucb1_arm {
      uint32_t value;           // Knob value (e.g., 32, 64, 128)
      double avg_reward;        // Average reward
      uint32_t n_pulls;         // Number of times selected
  };
  ```
- [ ] Define `struct hkm_ace_ucb1_bandit`:
  ```c
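  struct hkm_ace_ucb1_bandit {
      struct hkm_ace_ucb1_arm arms[TLS_CAP_N_ARMS];  // Sized for the larger candidate set (8)
      int n_arms;                // Arms in use (8 for TLS capacity, 6 for drain threshold)
      uint32_t total_pulls;
      double exploration_bonus;  // Default sqrt(2)
  };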
  ```
- [ ] Define UCB1 API:
  ```c
  void hkm_ace_ucb1_init(struct hkm_ace_ucb1_bandit *bandit, const uint32_t *candidates, int n_arms);
  int hkm_ace_ucb1_select(struct hkm_ace_ucb1_bandit *bandit);
  void hkm_ace_ucb1_update(struct hkm_ace_ucb1_bandit *bandit, int arm_idx, double reward);
  ```

#### 3.2 Create `core/hakmem_ace_ucb1.c` (45 min)
- [ ] **Initialization** (15 min)
  - Initialize each arm with candidate value
  - Set `avg_reward = 0.0`, `n_pulls = 0`

- [ ] **Selection** (15 min)
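  - Implement UCB1 formula:
    ```c
    ucb_value = avg_reward + exploration_bonus * sqrt(log((double)total_pulls) / (double)n_pulls);
    ```
  - Return arm index with highest UCB value
  - Handle initial exploration: an arm with `n_pulls == 0` is treated as having infinite UCB and is selected first, which also avoids dividing by zero and evaluating `log(0)`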

- [ ] **Update** (15 min)
  - Update running average:
    ```c
    avg_reward = (avg_reward * n_pulls + reward) / (n_pulls + 1)
    ```
  - Increment `n_pulls` and `total_pulls`

#### 3.3 Integration with controller (30 min)
- [ ] Add UCB1 bandits to `struct hkm_ace_controller`:
  ```c
  struct hkm_ace_ucb1_bandit tls_cap_bandit[8];   // Per-class TLS capacity
  struct hkm_ace_ucb1_bandit drain_bandit[8];     // Per-class drain threshold
  ```
- [ ] In fast loop tick:
  - Select knob values using UCB1: `arm_idx = hkm_ace_ucb1_select(&ctrl->tls_cap_bandit[c])`
  - Apply selected values: `ctrl->tls_capacity[c] = TLS_CAP_CANDIDATES[arm_idx]`
  - After observing reward: `hkm_ace_ucb1_update(&ctrl->tls_cap_bandit[c], arm_idx, reward)`
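
One caveat when feeding the reward into UCB1: the exploration term `sqrt(2) * sqrt(log(t) / n)` is scaled for rewards in the range [0, 1]. A raw throughput value in the millions of ops/s would dwarf it, and the bandit would effectively stop exploring. The sketch below shows one way to normalize the reward. `ace_normalized_reward`, `ref_ops`, and `miss_weight` are hypothetical names and tuning parameters, not part of this plan, and the penalty here covers only the LLC miss rate.

```c
#include <assert.h>

// Clamp x into [lo, hi].
static double ace_clamp(double x, double lo, double hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

// Hypothetical reward in [0, 1]: throughput relative to a reference level,
// minus a weighted LLC-miss penalty. ref_ops and miss_weight are
// illustrative tuning parameters.
static double ace_normalized_reward(double throughput_ops, double ref_ops,
                                    double llc_miss_rate, double miss_weight) {
    assert(ref_ops > 0.0);
    double gain = throughput_ops / ref_ops;
    double penalty = miss_weight * ace_clamp(llc_miss_rate, 0.0, 1.0);
    return ace_clamp(gain - penalty, 0.0, 1.0);
}
```

For example, with `ref_ops` set to the 12 M ops/s upper target, a measured 6 M ops/s with no miss penalty yields a reward of 0.5.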

---

### 4. Dynamic TLS Capacity Adjustment (1-2 hours)

#### 4.1 Modify `core/hakmem_tiny_magazine.h` (30 min)
- [ ] Change `TINY_TLS_MAG_CAP` from compile-time constant to runtime variable:
  ```c
  // OLD:
  #define TINY_TLS_MAG_CAP 128

  // NEW:
  extern uint32_t g_tiny_tls_mag_cap[8];  // Per-class capacity
  ```
- [ ] Update all references to `TINY_TLS_MAG_CAP` to use `g_tiny_tls_mag_cap[class_idx]`

#### 4.2 Modify `core/hakmem_tiny_magazine.c` (30 min)
- [ ] Define global capacity array:
  ```c
  uint32_t g_tiny_tls_mag_cap[8] = {
      128, 128, 128, 128, 128, 128, 128, 128  // Default values
  };
  ```
- [ ] Add setter function:
  ```c
  void hkm_tiny_set_tls_capacity(uint8_t class_idx, uint32_t new_cap) {
      if (class_idx >= 8) return;
      g_tiny_tls_mag_cap[class_idx] = new_cap;
  }
  ```
- [ ] Update magazine refill logic to respect dynamic capacity:
  ```c
  // In tiny_magazine_refill():
  uint32_t cap = g_tiny_tls_mag_cap[class_idx];
  if (mag->count >= cap) return;  // Already at capacity
  ```
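
Because the ACE thread writes the capacity while allocating threads read it, a plain `uint32_t` store is formally a data race. The following sketch shows one way to avoid it with C11 relaxed atomics and to clamp out-of-range values. The bounds are assumed from the UCB1 candidate list (4 to 512), and `hkm_tiny_get_tls_capacity`, `tiny_magazine_room`, and the `TINY_*` constants are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TINY_NUM_CLASSES     8    // Class count used throughout this plan
#define TINY_TLS_MAG_CAP_MIN 4    // Assumed bounds, taken from the candidate list
#define TINY_TLS_MAG_CAP_MAX 512

static _Atomic uint32_t g_tiny_tls_mag_cap[TINY_NUM_CLASSES] = {
    128, 128, 128, 128, 128, 128, 128, 128  // Default values
};

// Writer side (ACE thread): clamp, then publish with a relaxed store.
void hkm_tiny_set_tls_capacity(uint8_t class_idx, uint32_t new_cap) {
    if (class_idx >= TINY_NUM_CLASSES) return;
    if (new_cap < TINY_TLS_MAG_CAP_MIN) new_cap = TINY_TLS_MAG_CAP_MIN;
    if (new_cap > TINY_TLS_MAG_CAP_MAX) new_cap = TINY_TLS_MAG_CAP_MAX;
    atomic_store_explicit(&g_tiny_tls_mag_cap[class_idx], new_cap,
                          memory_order_relaxed);
}

// Reader side (allocation path): relaxed load of the current capacity.
uint32_t hkm_tiny_get_tls_capacity(uint8_t class_idx) {
    assert(class_idx < TINY_NUM_CLASSES);
    return atomic_load_explicit(&g_tiny_tls_mag_cap[class_idx],
                                memory_order_relaxed);
}

// Refill-side check: how many more blocks fit in a magazine holding `count`.
// Returns 0 when the magazine is at or above a (possibly shrunken) capacity.
uint32_t tiny_magazine_room(uint8_t class_idx, uint32_t count) {
    uint32_t cap = hkm_tiny_get_tls_capacity(class_idx);
    return count >= cap ? 0 : cap - count;
}
```

Relaxed ordering should be sufficient here because the capacity is an independent tuning value: a reader that briefly sees the old capacity only refills to the previous limit.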

#### 4.3 Integration with ACE controller (30 min)
- [ ] In `hkm_ace_controller_tick()`, apply TLS capacity changes:
  ```c
  for (int c = 0; c < 8; c++) {
      uint32_t new_cap = ctrl->tls_capacity[c];
      hkm_tiny_set_tls_capacity(c, new_cap);
  }
  ```
- [ ] Similarly for drain threshold (if implemented in tiny pool):
  ```c
  for (int c = 0; c < 8; c++) {
      uint32_t new_thresh = ctrl->drain_threshold[c];
      hkm_tiny_set_drain_threshold(c, new_thresh);
  }
  ```

---

### 5. ON/OFF Toggle and Configuration (1 hour)

#### 5.1 Environment variables (30 min)
- [ ] Add to `core/hakmem_config.h`:
  ```c
  // ACE Learning Layer
  #define HAKMEM_ACE_ENABLED              "HAKMEM_ACE_ENABLED"              // 0/1
  #define HAKMEM_ACE_FAST_INTERVAL_MS     "HAKMEM_ACE_FAST_INTERVAL_MS"     // Default 500
  #define HAKMEM_ACE_SLOW_INTERVAL_MS     "HAKMEM_ACE_SLOW_INTERVAL_MS"     // Default 30000
  #define HAKMEM_ACE_LOG_LEVEL            "HAKMEM_ACE_LOG_LEVEL"            // 0=off, 1=info, 2=debug

  // Safety guards
  #define HAKMEM_ACE_MAX_P99_LAT_NS       "HAKMEM_ACE_MAX_P99_LAT_NS"       // Default 10000000 (10ms)
  #define HAKMEM_ACE_MAX_RSS_MB           "HAKMEM_ACE_MAX_RSS_MB"           // Default 16384 (16GB)
  #define HAKMEM_ACE_MAX_CPU_PERCENT      "HAKMEM_ACE_MAX_CPU_PERCENT"      // Default 5
  ```
- [ ] Parse environment variables in `hkm_ace_controller_init()`
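
Parsing these variables can be sketched with a small helper that falls back to the documented default on missing or malformed input. `ace_parse_u64` and `ace_env_u64` are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Parse a non-negative decimal integer. Returns dflt on NULL, empty,
// negative, malformed, or out-of-range input.
static uint64_t ace_parse_u64(const char *s, uint64_t dflt) {
    if (s == NULL || *s == '\0' || strchr(s, '-') != NULL) return dflt;
    char *end = NULL;
    errno = 0;
    unsigned long long v = strtoull(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0') return dflt;
    return (uint64_t)v;
}

// Read an environment variable by name, falling back to dflt when unset.
static uint64_t ace_env_u64(const char *name, uint64_t dflt) {
    return ace_parse_u64(getenv(name), dflt);
}
```

In `hkm_ace_controller_init()` this could be used as `ctrl->fast_interval_ms = ace_env_u64(HAKMEM_ACE_FAST_INTERVAL_MS, 500);`, so an unset or mistyped variable leaves the default in place.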

#### 5.2 Logging infrastructure (30 min)
- [ ] Add logging macros in `core/hakmem_ace_controller.c`:
  ```c
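  // do { ... } while (0) keeps each macro a single statement, so it is safe inside if/else
  #define ACE_LOG_INFO(fmt, ...) \
      do { if (g_ace_log_level >= 1) fprintf(stderr, "[ACE] " fmt "\n", ##__VA_ARGS__); } while (0)

  #define ACE_LOG_DEBUG(fmt, ...) \
      do { if (g_ace_log_level >= 2) fprintf(stderr, "[ACE DEBUG] " fmt "\n", ##__VA_ARGS__); } while (0)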
  ```
- [ ] Add debug output in fast loop:
  ```c
  ACE_LOG_DEBUG("Fast loop: reward=%.2f, llc_miss=%.2f, backlog=%u",
                reward, llc_miss_rate, remote_backlog[0]);
  ACE_LOG_INFO("Adjusting TLS cap[%d]: %u → %u (diet factor=%.2f)",
               c, old_cap, new_cap, diet_factor);
  ```

---

## Testing Strategy

### Unit Tests
- [ ] Test metrics collection:
  ```bash
  # Verify throughput tracking
  HAKMEM_ACE_ENABLED=1 ./test_ace_metrics
  ```
- [ ] Test UCB1 selection:
  ```bash
  # Verify arm selection and update
  ./test_ace_ucb1
  ```

### Integration Tests
- [ ] Test ACE on fragmentation stress benchmark:
  ```bash
  # Baseline (ACE OFF)
  HAKMEM_ACE_ENABLED=0 ./bench_fragment_stress_hakx > baseline.txt

  # ACE ON
  HAKMEM_ACE_ENABLED=1 ./bench_fragment_stress_hakx > ace_on.txt

  # Compare
  diff baseline.txt ace_on.txt
  ```
- [ ] Verify dynamic TLS capacity adjustment:
  ```bash
  # Enable debug logging
  export HAKMEM_ACE_ENABLED=1
  export HAKMEM_ACE_LOG_LEVEL=2
  ./bench_fragment_stress_hakx
  # Should see log output: "Adjusting TLS cap[2]: 128 → 96"
  ```

### Benchmark Validation
- [ ] Run A/B comparison on all weak workloads:
  ```bash
  bash benchmarks/scripts/utils/ace_ab_test.sh
  ```
- [ ] Expected results:
  - Fragmentation stress: 3.87 → 8-12 M ops/s (2-3x)
  - Mid MT: 111.6 M ops/s → 110-115 M ops/s (maintain ±5%)
  - Large WS: 22.15 M ops/s → 25-30 M ops/s (1.1-1.4x, partial improvement)

---

## Implementation Order

**Day 1 (7-9 hours)**:

1. **Morning (3-4 hours)**:
   - [ ] 1.1 Create hakmem_ace_metrics.h (30 min)
   - [ ] 1.2 Create hakmem_ace_metrics.c (2 hours)
   - [ ] 1.3 Integration (30 min)
   - [ ] Test: Verify metrics collection works

2. **Midday (2-3 hours)**:
   - [ ] 2.1 Create hakmem_ace_controller.h (30 min)
   - [ ] 2.2 Create hakmem_ace_controller.c (1.5 hours)
   - [ ] 2.3 Integration (30 min)
   - [ ] Test: Verify fast/slow loops run

3. **Afternoon (2-3 hours)**:
   - [ ] 3.1 Create hakmem_ace_ucb1.h (30 min)
   - [ ] 3.2 Create hakmem_ace_ucb1.c (45 min)
   - [ ] 3.3 Integration (30 min)
   - [ ] 4.1-4.3 Dynamic TLS capacity (1.5 hours)
   - [ ] 5.1-5.2 ON/OFF toggle (1 hour)

4. **Evening (1-2 hours)**:
   - [ ] Build and test complete system
   - [ ] Run fragmentation stress A/B test
   - [ ] Verify 2-3x improvement

---

## Success Criteria

Phase 1 is complete when:
- ✅ Metrics collection works (throughput, LLC miss, mutex wait, backlog)
- ✅ Fast loop adjusts TLS capacity based on LLC miss rate
- ✅ UCB1 learning selects optimal knob values
- ✅ Dynamic TLS capacity affects runtime behavior
- ✅ ON/OFF toggle via `HAKMEM_ACE_ENABLED=1` works
- ✅ **Benchmark improvement**: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x)
- ✅ **No regression**: Mid MT maintains 110-115 M ops/s (±5%)

---

## Files to Create

New files (Phase 1):
```
core/hakmem_ace_metrics.h         (80 lines)
core/hakmem_ace_metrics.c         (300 lines)
core/hakmem_ace_controller.h      (100 lines)
core/hakmem_ace_controller.c      (400 lines)
core/hakmem_ace_ucb1.h            (80 lines)
core/hakmem_ace_ucb1.c            (150 lines)
```

Modified files:
```
core/hakmem_tiny_magazine.h       (change TINY_TLS_MAG_CAP to array)
core/hakmem_tiny_magazine.c       (add setter, use dynamic capacity)
core/hakmem.c                     (start ACE thread)
core/hakmem_config.h              (add ACE env vars)
```

Test files:
```
tests/unit/test_ace_metrics.c     (150 lines)
tests/unit/test_ace_ucb1.c        (120 lines)
tests/integration/test_ace_e2e.c  (200 lines)
```

Scripts:
```
benchmarks/scripts/utils/ace_ab_test.sh  (100 lines)
```

**Total new code**: ~1,680 lines (Phase 1 only)

---

## Next Steps After Phase 1

Once Phase 1 is complete and validated:
- **Phase 2**: Fragmentation countermeasures (budgeted scavenge, partial release)
- **Phase 3**: Large WS countermeasures (auto diet, LLC miss optimization)
- **Phase 4**: realloc optimization (in-place expansion, NT store)

---

**Status**: READY TO IMPLEMENT
**Priority**: HIGH 🔥
**Expected Impact**: 2-3x improvement on fragmentation stress
**Risk**: LOW (isolated, ON/OFF toggle, no impact when disabled)

Let's build it! 💪