ACE Phase 1 Implementation TODO

Status: Ready to implement (documentation complete)
Target: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x improvement)
Timeline: 1 day (7-9 hours total)
Date: 2025-11-01


Overview

Phase 1 implements the minimal ACE (Adaptive Control Engine) with maximum impact:

  • Metrics collection (throughput, LLC miss, mutex wait, backlog)
  • Fast loop control (0.5-1s adjustment cycle)
  • Dynamic TLS capacity tuning
  • UCB1 learning for knob selection
  • ON/OFF toggle via environment variable

Expected Impact: Fragmentation stress workload improves from 3.87 → 8-12 M ops/s


Task Breakdown

1. Metrics Collection Infrastructure (2-3 hours)

1.1 Create core/hakmem_ace_metrics.h (30 min)

  • Define struct hkm_ace_metrics with:
    struct hkm_ace_metrics {
        uint64_t throughput_ops;        // Operations per second
        double llc_miss_rate;           // LLC miss rate (0.0-1.0)
        uint64_t mutex_wait_ns;         // Mutex contention time
        uint32_t remote_free_backlog[8]; // Per-class backlog
        double fragmentation_ratio;     // Slow metric (60s)
        uint64_t rss_mb;                // Slow metric (60s)
        uint64_t timestamp_ms;          // Collection timestamp
    };
    
  • Define collection API:
    void hkm_ace_metrics_init(void);
    void hkm_ace_metrics_collect(struct hkm_ace_metrics *out);
    void hkm_ace_metrics_destroy(void);
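
For a quick smoke test of this API, a polling loop along these lines should work (a hypothetical harness; only the three functions and the struct above are assumed):

```c
#include <stdio.h>
#include <unistd.h>
#include "hakmem_ace_metrics.h"

// Hypothetical smoke test: poll the collector once per second and print
// the fast metrics. Slow fields (fragmentation, RSS) update every 60s.
int main(void) {
    hkm_ace_metrics_init();
    for (int i = 0; i < 5; i++) {
        sleep(1);
        struct hkm_ace_metrics m;
        hkm_ace_metrics_collect(&m);
        printf("ops/s=%llu llc_miss=%.3f backlog[0]=%u\n",
               (unsigned long long)m.throughput_ops,
               m.llc_miss_rate, m.remote_free_backlog[0]);
    }
    hkm_ace_metrics_destroy();
    return 0;
}
```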
    

1.2 Create core/hakmem_ace_metrics.c (1.5-2 hours)

  • Throughput tracking (30 min)

    • Global atomic counter g_ace_alloc_count
    • Increment in hakmem_alloc() / hakmem_free()
    • Calculate ops/sec from delta between collections
  • LLC miss monitoring (45 min)

    • Use rdpmc for lightweight performance counter access
    • Read LLC_MISSES and LLC_REFERENCES counters
    • Calculate miss_rate = misses / references
    • Fall back to 0.0 if rdpmc is unavailable
  • Mutex contention tracking (30 min)

    • Wrap pthread_mutex_lock() with timing
    • Track cumulative wait time per class
    • Reset counters after each collection
  • Remote free backlog (15 min)

    • Read g_tiny_classes[c].remote_backlog_count for each class
    • Already tracked by tiny pool implementation
  • Fragmentation ratio (slow, 60s) (15 min)

    • Calculate: allocated_bytes / reserved_bytes
    • Parse /proc/self/status for VmRSS and VmSize
    • Only update every 60 seconds (skip on fast collections)
  • RSS monitoring (slow, 60s) (15 min)

    • Read /proc/self/status VmRSS field
    • Convert to MB
    • Only update every 60 seconds
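
A minimal sketch of the /proc/self/status parsing shared by the two slow metrics above (assumes the Linux "VmRSS:   12345 kB" field layout; read_proc_status_kb is a placeholder name):

```c
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

// Read VmRSS and VmSize (both reported in kB) from /proc/self/status.
// Returns 0 on success, -1 if the file cannot be opened.
static int read_proc_status_kb(uint64_t *rss_kb, uint64_t *vmsize_kb) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return -1;
    char line[256];
    *rss_kb = *vmsize_kb = 0;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0)
            sscanf(line + 6, "%" SCNu64, rss_kb);
        else if (strncmp(line, "VmSize:", 7) == 0)
            sscanf(line + 7, "%" SCNu64, vmsize_kb);
    }
    fclose(f);
    return 0;
}
// Callers: rss_mb = rss_kb / 1024; fragmentation uses allocated/reserved
// from the allocator's own counters, with VmRSS/VmSize as a cross-check.
```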

1.3 Integration with existing code (30 min)

  • Add #include "hakmem_ace_metrics.h" to core/hakmem.c
  • Call hkm_ace_metrics_init() in hakmem_init()
  • Call hkm_ace_metrics_destroy() in cleanup

2. Fast Loop Controller (2-3 hours)

2.1 Create core/hakmem_ace_controller.h (30 min)

  • Define struct hkm_ace_controller:
    struct hkm_ace_controller {
        struct hkm_ace_metrics current;
        struct hkm_ace_metrics prev;
    
        // Current knob values
        uint32_t tls_capacity[8];       // Per-class TLS magazine capacity
        uint32_t drain_threshold[8];    // Remote free drain threshold
    
        // Fast loop state
        uint64_t fast_interval_ms;      // Default 500ms
        uint64_t last_fast_tick_ms;
    
        // Slow loop state
        uint64_t slow_interval_ms;      // Default 30000ms (30s)
        uint64_t last_slow_tick_ms;
    
        // Enabled flag
        bool enabled;
    };
    
  • Define controller API:
    void hkm_ace_controller_init(struct hkm_ace_controller *ctrl);
    void hkm_ace_controller_tick(struct hkm_ace_controller *ctrl);
    void hkm_ace_controller_destroy(struct hkm_ace_controller *ctrl);
    

2.2 Create core/hakmem_ace_controller.c (1.5-2 hours)

  • Initialization (30 min)

    • Read environment variables:
      • HAKMEM_ACE_ENABLED (default 0)
      • HAKMEM_ACE_FAST_INTERVAL_MS (default 500)
      • HAKMEM_ACE_SLOW_INTERVAL_MS (default 30000)
    • Initialize knob values to current defaults:
      • tls_capacity[c] = TINY_TLS_MAG_CAP (currently 128)
      • drain_threshold[c] = TINY_REMOTE_DRAIN_THRESHOLD (currently high)
  • Fast loop tick (45 min)

    • Check if elapsed >= fast_interval_ms
    • Collect current metrics
    • Calculate reward: reward = throughput - (llc_miss_penalty + mutex_wait_penalty + backlog_penalty) (see the tick sketch after this list)
    • Adjust knobs based on metrics:
      // LLC miss high → reduce TLS capacity (diet)
      if (llc_miss_rate > 0.15) {
          tls_capacity[c] *= 0.75;  // Diet factor
      }
      
      // Remote backlog high → lower drain threshold
      if (remote_backlog[c] > drain_threshold[c]) {
          drain_threshold[c] /= 2;
      }
      
      // Mutex wait high → increase bundle width
      // (Phase 1: skip, implement in Phase 2)
      
    • Apply knob changes to runtime (see section 4)
    • Update prev metrics for next iteration
  • Slow loop tick (30 min)

    • Check if elapsed >= slow_interval_ms
    • Collect slow metrics (fragmentation, RSS)
    • If fragmentation high: trigger partial release (Phase 2 feature, skip for now)
    • If RSS high: trigger budgeted scavenge (Phase 2 feature, skip for now)
  • Tick dispatcher (15 min)

    • Combined hkm_ace_controller_tick() that calls both fast and slow loops
    • Use monotonic clock (clock_gettime(CLOCK_MONOTONIC)) for timing
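
A sketch of the dispatcher and the reward calculation referenced above (the 1e6/1e-3 penalty weights are invented placeholders to be calibrated; the backlog penalty is elided):

```c
#include <time.h>

// CLOCK_MONOTONIC timestamp so wall-clock adjustments cannot skew the loops.
static uint64_t now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000u + (uint64_t)ts.tv_nsec / 1000000u;
}

void hkm_ace_controller_tick(struct hkm_ace_controller *ctrl) {
    uint64_t now = now_ms();

    if (now - ctrl->last_fast_tick_ms >= ctrl->fast_interval_ms) {
        hkm_ace_metrics_collect(&ctrl->current);
        // Reward = throughput minus weighted penalties. The weights here
        // are placeholders, not tuned values.
        double reward = (double)ctrl->current.throughput_ops
                      - 1e6  * ctrl->current.llc_miss_rate
                      - 1e-3 * (double)ctrl->current.mutex_wait_ns;
        (void)reward;  // consumed by the UCB1 update (section 3.3)
        ctrl->prev = ctrl->current;
        ctrl->last_fast_tick_ms = now;
    }

    if (now - ctrl->last_slow_tick_ms >= ctrl->slow_interval_ms) {
        // Slow metrics only; scavenge/partial release are Phase 2.
        ctrl->last_slow_tick_ms = now;
    }
}
```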

2.3 Integration with main loop (30 min)

  • Add background thread in core/hakmem.c:
    static void* hkm_ace_thread_main(void *arg) {
        struct hkm_ace_controller *ctrl = arg;
        // Atomic load: the shutdown path clears 'enabled' from another thread
        while (__atomic_load_n(&ctrl->enabled, __ATOMIC_RELAXED)) {
            hkm_ace_controller_tick(ctrl);
            usleep(100000);  // 100ms sleep, check every 0.1s
        }
        return NULL;
    }
    
  • Start ACE thread in hakmem_init() if HAKMEM_ACE_ENABLED=1
  • Join ACE thread in cleanup
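
The start/stop wiring could look like this (hakmem_ace_start/hakmem_ace_stop are hypothetical names for the hooks called from hakmem_init() and cleanup):

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_t g_ace_thread;
static struct hkm_ace_controller g_ace_ctrl;

// Called from hakmem_init(): spawn the ACE thread only when enabled.
void hakmem_ace_start(void) {
    hkm_ace_controller_init(&g_ace_ctrl);
    if (g_ace_ctrl.enabled)
        pthread_create(&g_ace_thread, NULL, hkm_ace_thread_main, &g_ace_ctrl);
}

// Called from cleanup: signal shutdown, then join.
void hakmem_ace_stop(void) {
    if (g_ace_ctrl.enabled) {
        __atomic_store_n(&g_ace_ctrl.enabled, false, __ATOMIC_RELAXED);
        pthread_join(g_ace_thread, NULL);
    }
    hkm_ace_controller_destroy(&g_ace_ctrl);
}
```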

3. UCB1 Learning Algorithm (1-2 hours)

3.1 Create core/hakmem_ace_ucb1.h (30 min)

  • Define discrete knob candidates:
    // TLS capacity candidates
    static const uint32_t TLS_CAP_CANDIDATES[] = {4, 8, 16, 32, 64, 128, 256, 512};
    #define TLS_CAP_N_ARMS 8
    
    // Drain threshold candidates
    static const uint32_t DRAIN_THRESH_CANDIDATES[] = {32, 64, 128, 256, 512, 1024};
    #define DRAIN_THRESH_N_ARMS 6
    
  • Define struct hkm_ace_ucb1_arm:
    struct hkm_ace_ucb1_arm {
        uint32_t value;           // Knob value (e.g., 32, 64, 128)
        double avg_reward;        // Average reward
        uint32_t n_pulls;         // Number of times selected
    };
    
  • Define struct hkm_ace_ucb1_bandit:
    struct hkm_ace_ucb1_bandit {
        struct hkm_ace_ucb1_arm arms[TLS_CAP_N_ARMS];  // Sized for the largest candidate set
        int n_arms;                // Active arm count (≤ TLS_CAP_N_ARMS)
        uint32_t total_pulls;
        double exploration_bonus;  // Default sqrt(2)
    };
    
  • Define UCB1 API:
    void hkm_ace_ucb1_init(struct hkm_ace_ucb1_bandit *bandit, const uint32_t *candidates, int n_arms);
    int hkm_ace_ucb1_select(struct hkm_ace_ucb1_bandit *bandit);
    void hkm_ace_ucb1_update(struct hkm_ace_ucb1_bandit *bandit, int arm_idx, double reward);
    

3.2 Create core/hakmem_ace_ucb1.c (45 min)

  • Initialization (15 min)

    • Initialize each arm with candidate value
    • Set avg_reward = 0.0, n_pulls = 0
  • Selection (15 min)

    • Implement UCB1 formula:
      ucb_value = avg_reward + exploration_bonus * sqrt(log(total_pulls) / n_pulls)
      
    • Return arm index with highest UCB value
    • Handle initial exploration (n_pulls == 0 → infinity UCB)
  • Update (15 min)

    • Update running average:
      avg_reward = (avg_reward * n_pulls + reward) / (n_pulls + 1)
      
    • Increment n_pulls and total_pulls
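
Pulling the init/select/update steps together, a sketch of core/hakmem_ace_ucb1.c (relies on the n_arms field noted in 3.1):

```c
#include <math.h>

void hkm_ace_ucb1_init(struct hkm_ace_ucb1_bandit *b,
                       const uint32_t *candidates, int n_arms) {
    b->n_arms = n_arms;
    b->total_pulls = 0;
    b->exploration_bonus = sqrt(2.0);
    for (int i = 0; i < n_arms; i++) {
        b->arms[i].value = candidates[i];
        b->arms[i].avg_reward = 0.0;
        b->arms[i].n_pulls = 0;
    }
}

int hkm_ace_ucb1_select(struct hkm_ace_ucb1_bandit *b) {
    int best = 0;
    double best_ucb = -INFINITY;
    for (int i = 0; i < b->n_arms; i++) {
        if (b->arms[i].n_pulls == 0) return i;  // unexplored arm: UCB = +inf
        double ucb = b->arms[i].avg_reward
                   + b->exploration_bonus
                     * sqrt(log((double)b->total_pulls) / b->arms[i].n_pulls);
        if (ucb > best_ucb) { best_ucb = ucb; best = i; }
    }
    return best;
}

void hkm_ace_ucb1_update(struct hkm_ace_ucb1_bandit *b, int arm, double reward) {
    struct hkm_ace_ucb1_arm *a = &b->arms[arm];
    a->avg_reward = (a->avg_reward * a->n_pulls + reward) / (a->n_pulls + 1);
    a->n_pulls++;
    b->total_pulls++;
}
```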

3.3 Integration with controller (30 min)

  • Add UCB1 bandits to struct hkm_ace_controller:
    struct hkm_ace_ucb1_bandit tls_cap_bandit[8];   // Per-class TLS capacity
    struct hkm_ace_ucb1_bandit drain_bandit[8];     // Per-class drain threshold
    
  • In fast loop tick:
    • Select knob values using UCB1: arm_idx = hkm_ace_ucb1_select(&ctrl->tls_cap_bandit[c])
    • Apply selected values: ctrl->tls_capacity[c] = TLS_CAP_CANDIDATES[arm_idx]
    • After observing reward: hkm_ace_ucb1_update(&ctrl->tls_cap_bandit[c], arm_idx, reward)
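
Because the reward for an arm is only observable one interval later, the fast loop has to credit the previous tick's arm before selecting the next one. A sketch (last_arm[] is an assumed extra controller field, initialized to -1):

```c
// One fast-loop pass over the eight size classes.
for (int c = 0; c < 8; c++) {
    // 1. Credit the arm applied last tick with the reward just measured.
    if (ctrl->last_arm[c] >= 0)
        hkm_ace_ucb1_update(&ctrl->tls_cap_bandit[c], ctrl->last_arm[c], reward);

    // 2. Select and apply the arm for the coming interval.
    int arm = hkm_ace_ucb1_select(&ctrl->tls_cap_bandit[c]);
    ctrl->tls_capacity[c] = TLS_CAP_CANDIDATES[arm];
    ctrl->last_arm[c] = arm;
}
```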

4. Dynamic TLS Capacity Adjustment (1-2 hours)

4.1 Modify core/hakmem_tiny_magazine.h (30 min)

  • Change TINY_TLS_MAG_CAP from compile-time constant to runtime variable:
    // OLD:
    #define TINY_TLS_MAG_CAP 128
    
    // NEW:
    extern uint32_t g_tiny_tls_mag_cap[8];  // Per-class capacity
    
  • Update all references to TINY_TLS_MAG_CAP to use g_tiny_tls_mag_cap[class_idx]

4.2 Modify core/hakmem_tiny_magazine.c (30 min)

  • Define global capacity array:
    uint32_t g_tiny_tls_mag_cap[8] = {
        128, 128, 128, 128, 128, 128, 128, 128  // Default values
    };
    
  • Add setter function:
    void hkm_tiny_set_tls_capacity(uint8_t class_idx, uint32_t new_cap) {
        if (class_idx >= 8) return;
        g_tiny_tls_mag_cap[class_idx] = new_cap;
    }
    
  • Update magazine refill logic to respect dynamic capacity:
    // In tiny_magazine_refill():
    uint32_t cap = g_tiny_tls_mag_cap[class_idx];
    if (mag->count >= cap) return;  // Already at capacity
    

4.3 Integration with ACE controller (30 min)

  • In hkm_ace_controller_tick(), apply TLS capacity changes:
    for (int c = 0; c < 8; c++) {
        uint32_t new_cap = ctrl->tls_capacity[c];
        hkm_tiny_set_tls_capacity(c, new_cap);
    }
    
  • Similarly for drain threshold (if implemented in tiny pool):
    for (int c = 0; c < 8; c++) {
        uint32_t new_thresh = ctrl->drain_threshold[c];
        hkm_tiny_set_drain_threshold(c, new_thresh);
    }
    

5. ON/OFF Toggle and Configuration (1 hour)

5.1 Environment variables (30 min)

  • Add to core/hakmem_config.h:
    // ACE Learning Layer
    #define HAKMEM_ACE_ENABLED              "HAKMEM_ACE_ENABLED"              // 0/1
    #define HAKMEM_ACE_FAST_INTERVAL_MS     "HAKMEM_ACE_FAST_INTERVAL_MS"     // Default 500
    #define HAKMEM_ACE_SLOW_INTERVAL_MS     "HAKMEM_ACE_SLOW_INTERVAL_MS"     // Default 30000
    #define HAKMEM_ACE_LOG_LEVEL            "HAKMEM_ACE_LOG_LEVEL"            // 0=off, 1=info, 2=debug
    
    // Safety guards
    #define HAKMEM_ACE_MAX_P99_LAT_NS       "HAKMEM_ACE_MAX_P99_LAT_NS"       // Default 10000000 (10ms)
    #define HAKMEM_ACE_MAX_RSS_MB           "HAKMEM_ACE_MAX_RSS_MB"           // Default 16384 (16GB)
    #define HAKMEM_ACE_MAX_CPU_PERCENT      "HAKMEM_ACE_MAX_CPU_PERCENT"      // Default 5
    
  • Parse environment variables in hkm_ace_controller_init()
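
A sketch of that parsing (env_u64 is a local helper, not an existing API; defaults mirror the table above, and the drain threshold default is a placeholder):

```c
#include <stdlib.h>

// Read an unsigned integer from the environment, falling back to a default.
static uint64_t env_u64(const char *name, uint64_t def) {
    const char *s = getenv(name);
    return s ? strtoull(s, NULL, 10) : def;
}

void hkm_ace_controller_init(struct hkm_ace_controller *ctrl) {
    ctrl->enabled          = env_u64(HAKMEM_ACE_ENABLED, 0) != 0;
    ctrl->fast_interval_ms = env_u64(HAKMEM_ACE_FAST_INTERVAL_MS, 500);
    ctrl->slow_interval_ms = env_u64(HAKMEM_ACE_SLOW_INTERVAL_MS, 30000);
    for (int c = 0; c < 8; c++) {
        ctrl->tls_capacity[c]    = 128;   // TINY_TLS_MAG_CAP default
        ctrl->drain_threshold[c] = 1024;  // placeholder for the current default
    }
}
```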

5.2 Logging infrastructure (30 min)

  • Add logging macros in core/hakmem_ace_controller.c:
    #define ACE_LOG_INFO(fmt, ...) do { \
        if (g_ace_log_level >= 1) fprintf(stderr, "[ACE] " fmt "\n", ##__VA_ARGS__); \
    } while (0)
    
    #define ACE_LOG_DEBUG(fmt, ...) do { \
        if (g_ace_log_level >= 2) fprintf(stderr, "[ACE DEBUG] " fmt "\n", ##__VA_ARGS__); \
    } while (0)
    
  • Add debug output in fast loop:
    ACE_LOG_DEBUG("Fast loop: reward=%.2f, llc_miss=%.2f, backlog=%u",
                  reward, llc_miss_rate, remote_backlog[0]);
    ACE_LOG_INFO("Adjusting TLS cap[%d]: %u → %u (diet factor=%.2f)",
                 c, old_cap, new_cap, diet_factor);
    

Testing Strategy

Unit Tests

  • Test metrics collection:
    # Verify throughput tracking
    HAKMEM_ACE_ENABLED=1 ./test_ace_metrics
    
  • Test UCB1 selection:
    # Verify arm selection and update
    ./test_ace_ucb1
    

Integration Tests

  • Test ACE on fragmentation stress benchmark:
    # Baseline (ACE OFF)
    HAKMEM_ACE_ENABLED=0 ./bench_fragment_stress_hakx > baseline.txt
    
    # ACE ON
    HAKMEM_ACE_ENABLED=1 ./bench_fragment_stress_hakx > ace_on.txt
    
    # Compare
    diff baseline.txt ace_on.txt
    
  • Verify dynamic TLS capacity adjustment:
    # Enable debug logging
    export HAKMEM_ACE_ENABLED=1
    export HAKMEM_ACE_LOG_LEVEL=2
    ./bench_fragment_stress_hakx
    # Should see log output: "Adjusting TLS cap[2]: 128 → 96"
    

Benchmark Validation

  • Run A/B comparison on all weak workloads:
    bash scripts/ace_ab_test.sh
    
  • Expected results:
    • Fragmentation stress: 3.87 → 8-12 M ops/s (2-3x)
    • Mid MT: 111.6 M ops/s → 110-115 M ops/s (maintain ±5%)
    • Large WS: 22.15 M ops/s → 25-30 M ops/s (1.1-1.4x, partial improvement)

Implementation Order

Day 1 (7-9 hours):

  1. Morning (3-4 hours):

    • 1.1 Create hakmem_ace_metrics.h (30 min)
    • 1.2 Create hakmem_ace_metrics.c (2 hours)
    • 1.3 Integration (30 min)
    • Test: Verify metrics collection works
  2. Midday (2-3 hours):

    • 2.1 Create hakmem_ace_controller.h (30 min)
    • 2.2 Create hakmem_ace_controller.c (1.5 hours)
    • 2.3 Integration (30 min)
    • Test: Verify fast/slow loops run
  3. Afternoon (2-3 hours):

    • 3.1 Create hakmem_ace_ucb1.h (30 min)
    • 3.2 Create hakmem_ace_ucb1.c (45 min)
    • 3.3 Integration (30 min)
    • 4.1-4.3 Dynamic TLS capacity (1.5 hours)
    • 5.1-5.2 ON/OFF toggle (1 hour)
  4. Evening (1-2 hours):

    • Build and test complete system
    • Run fragmentation stress A/B test
    • Verify 2-3x improvement

Success Criteria

Phase 1 is complete when:

  • Metrics collection works (throughput, LLC miss, mutex wait, backlog)
  • Fast loop adjusts TLS capacity based on LLC miss rate
  • UCB1 learning selects optimal knob values
  • Dynamic TLS capacity affects runtime behavior
  • ON/OFF toggle via HAKMEM_ACE_ENABLED=1 works
  • Benchmark improvement: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x)
  • No regression: Mid MT maintains 110-115 M ops/s (±5%)

Files to Create

New files (Phase 1):

core/hakmem_ace_metrics.h         (80 lines)
core/hakmem_ace_metrics.c         (300 lines)
core/hakmem_ace_controller.h      (100 lines)
core/hakmem_ace_controller.c      (400 lines)
core/hakmem_ace_ucb1.h            (80 lines)
core/hakmem_ace_ucb1.c            (150 lines)

Modified files:

core/hakmem_tiny_magazine.h       (change TINY_TLS_MAG_CAP to array)
core/hakmem_tiny_magazine.c       (add setter, use dynamic capacity)
core/hakmem.c                     (start ACE thread)
core/hakmem_config.h              (add ACE env vars)

Test files:

tests/unit/test_ace_metrics.c     (150 lines)
tests/unit/test_ace_ucb1.c        (120 lines)
tests/integration/test_ace_e2e.c  (200 lines)

Scripts:

benchmarks/scripts/utils/ace_ab_test.sh  (100 lines)

Total new code: ~1,680 lines (Phase 1 only)


Next Steps After Phase 1

Once Phase 1 is complete and validated:

  • Phase 2: Fragmentation countermeasures (budgeted scavenge, partial release)
  • Phase 3: Large WS countermeasures (auto diet, LLC miss optimization)
  • Phase 4: realloc optimization (in-place expansion, NT store)

Status: READY TO IMPLEMENT
Priority: HIGH 🔥
Expected Impact: 2-3x improvement on fragmentation stress
Risk: LOW (isolated, ON/OFF toggle, no impact when disabled)

Let's build it! 💪