# ACE Phase 1 Implementation TODO

**Status**: Ready to implement (documentation complete)
**Target**: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x improvement)
**Timeline**: 1 day (7-9 hours total)
**Date**: 2025-11-01

---

## Overview

Phase 1 implements the **minimal ACE (Adaptive Control Engine)** with maximum impact:
- Metrics collection (throughput, LLC miss, mutex wait, backlog)
- Fast loop control (0.5-1s adjustment cycle)
- Dynamic TLS capacity tuning
- UCB1 learning for knob selection
- ON/OFF toggle via environment variable

**Expected Impact**: Fragmentation stress workload improves from 3.87 → 8-12 M ops/s

---

## Task Breakdown

### 1. Metrics Collection Infrastructure (2-3 hours)

#### 1.1 Create `core/hakmem_ace_metrics.h` (30 min)
- [ ] Define `struct hkm_ace_metrics` with:
  ```c
  struct hkm_ace_metrics {
      uint64_t throughput_ops;        // Operations per second
      double llc_miss_rate;           // LLC miss rate (0.0-1.0)
      uint64_t mutex_wait_ns;         // Mutex contention time
      uint32_t remote_free_backlog[8]; // Per-class backlog
      double fragmentation_ratio;     // Slow metric (60s)
      uint64_t rss_mb;                // Slow metric (60s)
      uint64_t timestamp_ms;          // Collection timestamp
  };
  ```
- [ ] Define collection API:
  ```c
  void hkm_ace_metrics_init(void);
  void hkm_ace_metrics_collect(struct hkm_ace_metrics *out);
  void hkm_ace_metrics_destroy(void);
  ```

#### 1.2 Create `core/hakmem_ace_metrics.c` (1.5-2 hours)
- [ ] **Throughput tracking** (30 min)
  - Global atomic counter `g_ace_alloc_count`
  - Increment in `hakmem_alloc()` / `hakmem_free()`
  - Calculate ops/sec from delta between collections

- [ ] **LLC miss monitoring** (45 min)
  - Use `rdpmc` for lightweight performance counter access
  - Read LLC_MISSES and LLC_REFERENCES counters
  - Calculate miss_rate = misses / references (treat as 0.0 when references is 0)
  - Fallback to 0.0 if RDPMC unavailable

- [ ] **Mutex contention tracking** (30 min)
  - Wrap `pthread_mutex_lock()` with timing
  - Track cumulative wait time per class
  - Reset counters after each collection

- [ ] **Remote free backlog** (15 min)
  - Read `g_tiny_classes[c].remote_backlog_count` for each class
  - Already tracked by tiny pool implementation

- [ ] **Fragmentation ratio (slow, 60s)** (15 min)
  - Calculate: `allocated_bytes / reserved_bytes`
  - Parse `/proc/self/status` for VmRSS and VmSize
  - Only update every 60 seconds (skip on fast collections)

- [ ] **RSS monitoring (slow, 60s)** (15 min)
  - Read `/proc/self/status` VmRSS field
  - Convert to MB
  - Only update every 60 seconds
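
The throughput and RSS items above can be sketched roughly as follows. This is a minimal sketch, not the planned implementation: `ace_count_op`, `ace_throughput_sample`, `struct ace_throughput_state`, and `ace_parse_rss_mb` are hypothetical names, and the parser takes the text of `/proc/self/status` as a string so it can be exercised without the file.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Global operation counter (named g_ace_alloc_count in the TODO). */
static _Atomic uint64_t g_ace_alloc_count;

/* Hot-path hook (hypothetical): relaxed ordering suffices for a statistic. */
static inline void ace_count_op(void) {
    atomic_fetch_add_explicit(&g_ace_alloc_count, 1, memory_order_relaxed);
}

/* State carried between two collections (hypothetical helper struct). */
struct ace_throughput_state {
    uint64_t last_count;
    uint64_t last_ms;
};

/* Returns ops/sec since the previous sample, or 0 if no time has elapsed. */
static uint64_t ace_throughput_sample(struct ace_throughput_state *st,
                                      uint64_t now_ms) {
    assert(st != NULL);
    uint64_t count = atomic_load_explicit(&g_ace_alloc_count,
                                          memory_order_relaxed);
    uint64_t d_ops = count - st->last_count;
    uint64_t d_ms = now_ms - st->last_ms;
    st->last_count = count;
    st->last_ms = now_ms;
    return d_ms ? (d_ops * 1000u) / d_ms : 0;
}

/* Extracts VmRSS (reported in kB) from /proc/self/status-style text and
 * converts it to MB. Returns 0 if the field is missing or malformed. */
static uint64_t ace_parse_rss_mb(const char *status_text) {
    assert(status_text != NULL);
    const char *p = strstr(status_text, "VmRSS:");
    unsigned long long kb = 0;
    if (p == NULL || sscanf(p, "VmRSS: %llu", &kb) != 1) return 0;
    return (uint64_t)(kb / 1024u);
}
```

Passing `now_ms` in, rather than reading the clock inside the sampler, keeps the rate calculation deterministic in unit tests.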

#### 1.3 Integration with existing code (30 min)
- [ ] Add `#include "hakmem_ace_metrics.h"` to `core/hakmem.c`
- [ ] Call `hkm_ace_metrics_init()` in `hakmem_init()`
- [ ] Call `hkm_ace_metrics_destroy()` in cleanup

---

### 2. Fast Loop Controller (2-3 hours)

#### 2.1 Create `core/hakmem_ace_controller.h` (30 min)
- [ ] Define `struct hkm_ace_controller`:
  ```c
  struct hkm_ace_controller {
      struct hkm_ace_metrics current;
      struct hkm_ace_metrics prev;

      // Current knob values
      uint32_t tls_capacity[8];       // Per-class TLS magazine capacity
      uint32_t drain_threshold[8];    // Remote free drain threshold

      // Fast loop state
      uint64_t fast_interval_ms;      // Default 500ms
      uint64_t last_fast_tick_ms;

      // Slow loop state
      uint64_t slow_interval_ms;      // Default 30000ms (30s)
      uint64_t last_slow_tick_ms;

      // Enabled flag
      bool enabled;
  };
  ```
- [ ] Define controller API:
  ```c
  void hkm_ace_controller_init(struct hkm_ace_controller *ctrl);
  void hkm_ace_controller_tick(struct hkm_ace_controller *ctrl);
  void hkm_ace_controller_destroy(struct hkm_ace_controller *ctrl);
  ```

#### 2.2 Create `core/hakmem_ace_controller.c` (1.5-2 hours)
- [ ] **Initialization** (30 min)
  - Read environment variables:
    - `HAKMEM_ACE_ENABLED` (default 0)
    - `HAKMEM_ACE_FAST_INTERVAL_MS` (default 500)
    - `HAKMEM_ACE_SLOW_INTERVAL_MS` (default 30000)
  - Initialize knob values to current defaults:
    - `tls_capacity[c] = TINY_TLS_MAG_CAP` (currently 128)
    - `drain_threshold[c] = TINY_REMOTE_DRAIN_THRESHOLD` (currently high)

- [ ] **Fast loop tick** (45 min)
  - Check if `elapsed >= fast_interval_ms`
  - Collect current metrics
  - Calculate reward: `reward = throughput - (llc_miss_penalty + mutex_wait_penalty + backlog_penalty)`
  - Adjust knobs based on metrics:
    ```c
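    // LLC miss high → reduce TLS capacity (diet)
    if (llc_miss_rate > 0.15) {
        // Integer math avoids a double → uint32_t conversion (128 → 96)
        tls_capacity[c] = tls_capacity[c] * 3 / 4;  // Diet factor 0.75
        if (tls_capacity[c] < 4) tls_capacity[c] = 4;  // Floor: smallest TLS candidate
    }

    // Remote backlog high → lower drain threshold
    if (remote_backlog[c] > drain_threshold[c]) {
        drain_threshold[c] /= 2;
        if (drain_threshold[c] < 32) drain_threshold[c] = 32;  // Floor: smallest drain candidate
    }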

    // Mutex wait high → increase bundle width
    // (Phase 1: skip, implement in Phase 2)
    ```
  - Apply knob changes to runtime (see section 4)
  - Update `prev` metrics for next iteration

- [ ] **Slow loop tick** (30 min)
  - Check if `elapsed >= slow_interval_ms`
  - Collect slow metrics (fragmentation, RSS)
  - If fragmentation high: trigger partial release (Phase 2 feature, skip for now)
  - If RSS high: trigger budgeted scavenge (Phase 2 feature, skip for now)

- [ ] **Tick dispatcher** (15 min)
  - Combined `hkm_ace_controller_tick()` that calls both fast and slow loops
  - Use monotonic clock (`clock_gettime(CLOCK_MONOTONIC)`) for timing
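
One way the tick dispatcher could gate the two loops is sketched below. The names `struct ace_tick_state`, `ace_now_ms`, and `ace_tick_due` are hypothetical; the struct only mirrors the interval fields of `struct hkm_ace_controller`, and the loop bodies are left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical timing state, mirroring the controller's interval fields. */
struct ace_tick_state {
    uint64_t fast_interval_ms;   /* default 500 */
    uint64_t last_fast_tick_ms;
    uint64_t slow_interval_ms;   /* default 30000 */
    uint64_t last_slow_tick_ms;
};

enum { ACE_RAN_FAST = 1, ACE_RAN_SLOW = 2 };

/* Milliseconds from the monotonic clock (immune to wall-clock jumps). */
static uint64_t ace_now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000u + (uint64_t)ts.tv_nsec / 1000000u;
}

/* Decides which loops are due at now_ms, records the tick time for each,
 * and returns a bitmask of ACE_RAN_FAST / ACE_RAN_SLOW. */
static int ace_tick_due(struct ace_tick_state *st, uint64_t now_ms) {
    assert(st != NULL);
    int ran = 0;
    if (now_ms - st->last_fast_tick_ms >= st->fast_interval_ms) {
        st->last_fast_tick_ms = now_ms;
        ran |= ACE_RAN_FAST;   /* the fast loop body would run here */
    }
    if (now_ms - st->last_slow_tick_ms >= st->slow_interval_ms) {
        st->last_slow_tick_ms = now_ms;
        ran |= ACE_RAN_SLOW;   /* the slow loop body would run here */
    }
    return ran;
}
```

In the background thread, `hkm_ace_controller_tick()` would call something like `ace_tick_due(&state, ace_now_ms())` on every wake-up and do nothing when the mask is 0.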

#### 2.3 Integration with main loop (30 min)
- [ ] Add background thread in `core/hakmem.c`:
  ```c
  static void* hkm_ace_thread_main(void *arg) {
      struct hkm_ace_controller *ctrl = arg;
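      // NOTE: `enabled` is cleared by another thread at shutdown, so it should be
      // an atomic (e.g., atomic_bool) or otherwise synchronized to avoid a data race.
      while (ctrl->enabled) {
          hkm_ace_controller_tick(ctrl);
          usleep(100000);  // Poll every 100 ms; tick() decides whether a loop is due
      }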
      return NULL;
  }
  ```
- [ ] Start ACE thread in `hakmem_init()` if `HAKMEM_ACE_ENABLED=1`
- [ ] Join ACE thread in cleanup

---

### 3. UCB1 Learning Algorithm (1-2 hours)

#### 3.1 Create `core/hakmem_ace_ucb1.h` (30 min)
- [ ] Define discrete knob candidates:
  ```c
  // TLS capacity candidates
  static const uint32_t TLS_CAP_CANDIDATES[] = {4, 8, 16, 32, 64, 128, 256, 512};
  #define TLS_CAP_N_ARMS 8

  // Drain threshold candidates
  static const uint32_t DRAIN_THRESH_CANDIDATES[] = {32, 64, 128, 256, 512, 1024};
  #define DRAIN_THRESH_N_ARMS 6
  ```
- [ ] Define `struct hkm_ace_ucb1_arm`:
  ```c
  struct hkm_ace_ucb1_arm {
      uint32_t value;           // Knob value (e.g., 32, 64, 128)
      double avg_reward;        // Average reward
      uint32_t n_pulls;         // Number of times selected
  };
  ```
- [ ] Define `struct hkm_ace_ucb1_bandit`:
  ```c
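  struct hkm_ace_ucb1_bandit {
      struct hkm_ace_ucb1_arm arms[TLS_CAP_N_ARMS];  // Sized for the larger candidate list (8)
      int n_arms;                // Arms in use (8 for TLS capacity, 6 for drain threshold)
      uint32_t total_pulls;
      double exploration_bonus;  // Default sqrt(2)
  };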
  ```
- [ ] Define UCB1 API:
  ```c
  void hkm_ace_ucb1_init(struct hkm_ace_ucb1_bandit *bandit, const uint32_t *candidates, int n_arms);
  int hkm_ace_ucb1_select(struct hkm_ace_ucb1_bandit *bandit);
  void hkm_ace_ucb1_update(struct hkm_ace_ucb1_bandit *bandit, int arm_idx, double reward);
  ```

#### 3.2 Create `core/hakmem_ace_ucb1.c` (45 min)
- [ ] **Initialization** (15 min)
  - Initialize each arm with candidate value
  - Set `avg_reward = 0.0`, `n_pulls = 0`

- [ ] **Selection** (15 min)
  - Implement UCB1 formula:
    ```c
    ucb_value = avg_reward + exploration_bonus * sqrt(log(total_pulls) / n_pulls)
    ```
  - Return arm index with highest UCB value
  - Handle initial exploration (n_pulls == 0 → infinity UCB)

- [ ] **Update** (15 min)
  - Update running average:
    ```c
    avg_reward = (avg_reward * n_pulls + reward) / (n_pulls + 1)
    ```
  - Increment `n_pulls` and `total_pulls`

#### 3.3 Integration with controller (30 min)
- [ ] Add UCB1 bandits to `struct hkm_ace_controller`:
  ```c
  struct hkm_ace_ucb1_bandit tls_cap_bandit[8];   // Per-class TLS capacity
  struct hkm_ace_ucb1_bandit drain_bandit[8];     // Per-class drain threshold
  ```
- [ ] In fast loop tick:
  - Select knob values using UCB1: `arm_idx = hkm_ace_ucb1_select(&ctrl->tls_cap_bandit[c])`
  - Apply selected values: `ctrl->tls_capacity[c] = TLS_CAP_CANDIDATES[arm_idx]`
  - After observing reward: `hkm_ace_ucb1_update(&ctrl->tls_cap_bandit[c], arm_idx, reward)`
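
The standard UCB1 analysis assumes rewards bounded in [0, 1]. If the reward is raw throughput in ops/s, an exploration bonus of sqrt(2) is negligible next to it and the bandit effectively stops exploring. A minimal sketch of a normalized reward follows; the weights and the function names `ace_clamp01` and `ace_reward` are assumptions, since the TODO does not specify how the penalties are scaled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical penalty weights; the TODO leaves these unspecified. */
#define ACE_W_LLC     0.5
#define ACE_W_MUTEX   0.3
#define ACE_W_BACKLOG 0.2

static double ace_clamp01(double x) {
    return x < 0.0 ? 0.0 : (x > 1.0 ? 1.0 : x);
}

/* Every input is expected to be pre-scaled to roughly [0, 1]:
 *   throughput_frac  = measured throughput / a reference throughput
 *   llc_miss_rate    = misses / references
 *   mutex_wait_frac  = mutex wait time / interval length
 *   backlog_frac     = backlog / drain threshold
 * Returns throughput minus weighted penalties, clamped to [0, 1]. */
static double ace_reward(double throughput_frac, double llc_miss_rate,
                         double mutex_wait_frac, double backlog_frac) {
    assert(throughput_frac == throughput_frac);  /* reject NaN input */
    double penalty = ACE_W_LLC * ace_clamp01(llc_miss_rate)
                   + ACE_W_MUTEX * ace_clamp01(mutex_wait_frac)
                   + ACE_W_BACKLOG * ace_clamp01(backlog_frac);
    return ace_clamp01(ace_clamp01(throughput_frac) - penalty);
}
```

The reference throughput could be the ACE-off baseline for the workload; that choice is also an assumption.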

---

### 4. Dynamic TLS Capacity Adjustment (1-2 hours)

#### 4.1 Modify `core/hakmem_tiny_magazine.h` (30 min)
- [ ] Change `TINY_TLS_MAG_CAP` from compile-time constant to runtime variable:
  ```c
  // OLD:
  #define TINY_TLS_MAG_CAP 128

  // NEW:
  extern uint32_t g_tiny_tls_mag_cap[8];  // Per-class capacity
  ```
- [ ] Update all references to `TINY_TLS_MAG_CAP` to use `g_tiny_tls_mag_cap[class_idx]`

#### 4.2 Modify `core/hakmem_tiny_magazine.c` (30 min)
- [ ] Define global capacity array:
  ```c
  uint32_t g_tiny_tls_mag_cap[8] = {
      128, 128, 128, 128, 128, 128, 128, 128  // Default values
  };
  ```
- [ ] Add setter function:
  ```c
  void hkm_tiny_set_tls_capacity(uint8_t class_idx, uint32_t new_cap) {
      if (class_idx >= 8) return;
      g_tiny_tls_mag_cap[class_idx] = new_cap;
  }
  ```
- [ ] Update magazine refill logic to respect dynamic capacity:
  ```c
  // In tiny_magazine_refill():
  uint32_t cap = g_tiny_tls_mag_cap[class_idx];
  if (mag->count >= cap) return;  // Already at capacity
  ```
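
Because the ACE thread writes the capacity while allocator threads read it, a plain `uint32_t` store is formally a data race in C11. The sketch below shows one possible alternative using relaxed atomics; `tiny_set_tls_capacity`, `tiny_get_tls_capacity`, and `tiny_refill_room` are hypothetical names, not the planned API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TINY_NUM_CLASSES 8  /* the TODO hard-codes 8 size classes */

/* Per-class capacity, written by the ACE thread and read by allocators. */
static _Atomic uint32_t g_tls_mag_cap[TINY_NUM_CLASSES] = {
    128, 128, 128, 128, 128, 128, 128, 128
};

/* Publishes a new capacity. Returns 0 on success, -1 on a bad class index. */
static int tiny_set_tls_capacity(unsigned class_idx, uint32_t new_cap) {
    if (class_idx >= TINY_NUM_CLASSES) return -1;
    atomic_store_explicit(&g_tls_mag_cap[class_idx], new_cap,
                          memory_order_relaxed);
    return 0;
}

static uint32_t tiny_get_tls_capacity(unsigned class_idx) {
    assert(class_idx < TINY_NUM_CLASSES);
    return atomic_load_explicit(&g_tls_mag_cap[class_idx],
                                memory_order_relaxed);
}

/* How many blocks a refill may add. If the capacity was lowered below the
 * current count, the answer is 0 and the magazine shrinks through normal use. */
static uint32_t tiny_refill_room(unsigned class_idx, uint32_t count) {
    uint32_t cap = tiny_get_tls_capacity(class_idx);
    return count >= cap ? 0 : cap - count;
}
```

Relaxed ordering is enough here because the capacity is an independent tuning value; no other data is published through it.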

#### 4.3 Integration with ACE controller (30 min)
- [ ] In `hkm_ace_controller_tick()`, apply TLS capacity changes:
  ```c
  for (int c = 0; c < 8; c++) {
      uint32_t new_cap = ctrl->tls_capacity[c];
      hkm_tiny_set_tls_capacity(c, new_cap);
  }
  ```
- [ ] Similarly for drain threshold (if implemented in tiny pool):
  ```c
  for (int c = 0; c < 8; c++) {
      uint32_t new_thresh = ctrl->drain_threshold[c];
      hkm_tiny_set_drain_threshold(c, new_thresh);
  }
  ```

---

### 5. ON/OFF Toggle and Configuration (1 hour)

#### 5.1 Environment variables (30 min)
- [ ] Add to `core/hakmem_config.h`:
  ```c
  // ACE Learning Layer
  #define HAKMEM_ACE_ENABLED              "HAKMEM_ACE_ENABLED"              // 0/1
  #define HAKMEM_ACE_FAST_INTERVAL_MS     "HAKMEM_ACE_FAST_INTERVAL_MS"     // Default 500
  #define HAKMEM_ACE_SLOW_INTERVAL_MS     "HAKMEM_ACE_SLOW_INTERVAL_MS"     // Default 30000
  #define HAKMEM_ACE_LOG_LEVEL            "HAKMEM_ACE_LOG_LEVEL"            // 0=off, 1=info, 2=debug

  // Safety guards
  #define HAKMEM_ACE_MAX_P99_LAT_NS       "HAKMEM_ACE_MAX_P99_LAT_NS"       // Default 10000000 (10ms)
  #define HAKMEM_ACE_MAX_RSS_MB           "HAKMEM_ACE_MAX_RSS_MB"           // Default 16384 (16GB)
  #define HAKMEM_ACE_MAX_CPU_PERCENT      "HAKMEM_ACE_MAX_CPU_PERCENT"      // Default 5
  ```
- [ ] Parse environment variables in `hkm_ace_controller_init()`
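
Parsing could go through one small helper that falls back to the default on anything it cannot read. This is a sketch; `ace_env_u64` is a hypothetical name, and rejecting negative or malformed values outright (rather than clamping) is a design assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Reads an unsigned decimal integer from the environment. Returns def when
 * the variable is unset, empty, not purely numeric, or out of range. */
static uint64_t ace_env_u64(const char *name, uint64_t def) {
    assert(name != NULL);
    const char *s = getenv(name);
    if (s == NULL || s[0] < '0' || s[0] > '9') return def;  /* also rejects "-1" */
    errno = 0;
    char *end = NULL;
    unsigned long long v = strtoull(s, &end, 10);
    if (errno != 0 || end == s || *end != '\0') return def;
    return (uint64_t)v;
}
```

With this helper, `hkm_ace_controller_init()` would set, for example, `ctrl->fast_interval_ms = ace_env_u64("HAKMEM_ACE_FAST_INTERVAL_MS", 500)`.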

#### 5.2 Logging infrastructure (30 min)
- [ ] Add logging macros in `core/hakmem_ace_controller.c`:
  ```c
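  // do { ... } while (0) keeps each macro a single statement, so it is safe in if/else
  #define ACE_LOG_INFO(fmt, ...) \
      do { if (g_ace_log_level >= 1) fprintf(stderr, "[ACE] " fmt "\n", ##__VA_ARGS__); } while (0)

  #define ACE_LOG_DEBUG(fmt, ...) \
      do { if (g_ace_log_level >= 2) fprintf(stderr, "[ACE DEBUG] " fmt "\n", ##__VA_ARGS__); } while (0)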
  ```
- [ ] Add debug output in fast loop:
  ```c
  ACE_LOG_DEBUG("Fast loop: reward=%.2f, llc_miss=%.2f, backlog=%u",
                reward, llc_miss_rate, remote_backlog[0]);
  ACE_LOG_INFO("Adjusting TLS cap[%d]: %u → %u (diet factor=%.2f)",
               c, old_cap, new_cap, diet_factor);
  ```

---

## Testing Strategy

### Unit Tests
- [ ] Test metrics collection:
  ```bash
  # Verify throughput tracking
  HAKMEM_ACE_ENABLED=1 ./test_ace_metrics
  ```
- [ ] Test UCB1 selection:
  ```bash
  # Verify arm selection and update
  ./test_ace_ucb1
  ```

### Integration Tests
- [ ] Test ACE on fragmentation stress benchmark:
  ```bash
  # Baseline (ACE OFF)
  HAKMEM_ACE_ENABLED=0 ./bench_fragment_stress_hakx > baseline.txt

  # ACE ON
  HAKMEM_ACE_ENABLED=1 ./bench_fragment_stress_hakx > ace_on.txt

  # Compare
  diff baseline.txt ace_on.txt
  ```
- [ ] Verify dynamic TLS capacity adjustment:
  ```bash
  # Enable debug logging
  export HAKMEM_ACE_ENABLED=1
  export HAKMEM_ACE_LOG_LEVEL=2
  ./bench_fragment_stress_hakx
  # Should see log output: "Adjusting TLS cap[2]: 128 → 96"
  ```

### Benchmark Validation
- [ ] Run A/B comparison on all weak workloads:
  ```bash
  bash benchmarks/scripts/utils/ace_ab_test.sh
  ```
- [ ] Expected results:
  - Fragmentation stress: 3.87 → 8-12 M ops/s (2-3x)
  - Mid MT: 111.6 M ops/s → 110-115 M ops/s (maintain ±5%)
  - Large WS: 22.15 M ops/s → 25-30 M ops/s (1.1-1.4x, partial improvement)

---

## Implementation Order

**Day 1 (7-9 hours)**:

1. **Morning (3-4 hours)**:
   - [ ] 1.1 Create hakmem_ace_metrics.h (30 min)
   - [ ] 1.2 Create hakmem_ace_metrics.c (2 hours)
   - [ ] 1.3 Integration (30 min)
   - [ ] Test: Verify metrics collection works

2. **Midday (2-3 hours)**:
   - [ ] 2.1 Create hakmem_ace_controller.h (30 min)
   - [ ] 2.2 Create hakmem_ace_controller.c (1.5 hours)
   - [ ] 2.3 Integration (30 min)
   - [ ] Test: Verify fast/slow loops run

3. **Afternoon (2-3 hours)**:
   - [ ] 3.1 Create hakmem_ace_ucb1.h (30 min)
   - [ ] 3.2 Create hakmem_ace_ucb1.c (45 min)
   - [ ] 3.3 Integration (30 min)
   - [ ] 4.1-4.3 Dynamic TLS capacity (1.5 hours)
   - [ ] 5.1-5.2 ON/OFF toggle (1 hour)

4. **Evening (1-2 hours)**:
   - [ ] Build and test complete system
   - [ ] Run fragmentation stress A/B test
   - [ ] Verify 2-3x improvement

---

## Success Criteria

Phase 1 is complete when:
- ✅ Metrics collection works (throughput, LLC miss, mutex wait, backlog)
- ✅ Fast loop adjusts TLS capacity based on LLC miss rate
- ✅ UCB1 learning selects optimal knob values
- ✅ Dynamic TLS capacity affects runtime behavior
- ✅ ON/OFF toggle via `HAKMEM_ACE_ENABLED=1` works
- ✅ **Benchmark improvement**: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x)
- ✅ **No regression**: Mid MT maintains 110-115 M ops/s (±5%)

---

## Files to Create

New files (Phase 1):
```
core/hakmem_ace_metrics.h         (80 lines)
core/hakmem_ace_metrics.c         (300 lines)
core/hakmem_ace_controller.h      (100 lines)
core/hakmem_ace_controller.c      (400 lines)
core/hakmem_ace_ucb1.h            (80 lines)
core/hakmem_ace_ucb1.c            (150 lines)
```

Modified files:
```
core/hakmem_tiny_magazine.h       (change TINY_TLS_MAG_CAP to array)
core/hakmem_tiny_magazine.c       (add setter, use dynamic capacity)
core/hakmem.c                     (start ACE thread)
core/hakmem_config.h              (add ACE env vars)
```

Test files:
```
tests/unit/test_ace_metrics.c     (150 lines)
tests/unit/test_ace_ucb1.c        (120 lines)
tests/integration/test_ace_e2e.c  (200 lines)
```

Scripts:
```
benchmarks/scripts/utils/ace_ab_test.sh  (100 lines)
```

**Total new code**: ~1,680 lines (Phase 1 only)

---

## Next Steps After Phase 1

Once Phase 1 is complete and validated:
- **Phase 2**: Fragmentation countermeasures (budgeted scavenge, partial release)
- **Phase 3**: Large WS countermeasures (auto diet, LLC miss optimization)
- **Phase 4**: realloc optimization (in-place expansion, NT store)

---

**Status**: READY TO IMPLEMENT
**Priority**: HIGH 🔥
**Expected Impact**: 2-3x improvement on fragmentation stress
**Risk**: LOW (isolated, ON/OFF toggle, no impact when disabled)

Let's build it! 💪