ACE Phase 1 Implementation TODO

Status: Ready to implement (documentation complete)
Target: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x improvement)
Timeline: 1 day (7-9 hours total)
Date: 2025-11-01


Overview

Phase 1 implements the minimal ACE (Adaptive Control Engine) with maximum impact:

  • Metrics collection (throughput, LLC miss, mutex wait, backlog)
  • Fast loop control (0.5-1s adjustment cycle)
  • Dynamic TLS capacity tuning
  • UCB1 learning for knob selection
  • ON/OFF toggle via environment variable

Expected Impact: Fragmentation stress workload improves from 3.87 → 8-12 M ops/s


Task Breakdown

1. Metrics Collection Infrastructure (2-3 hours)

1.1 Create core/hakmem_ace_metrics.h (30 min)

  • Define struct hkm_ace_metrics with:
    struct hkm_ace_metrics {
        uint64_t throughput_ops;        // Operations per second
        double llc_miss_rate;           // LLC miss rate (0.0-1.0)
        uint64_t mutex_wait_ns;         // Mutex contention time
        uint32_t remote_free_backlog[8]; // Per-class backlog
        double fragmentation_ratio;     // Slow metric (60s)
        uint64_t rss_mb;                // Slow metric (60s)
        uint64_t timestamp_ms;          // Collection timestamp
    };
    
  • Define collection API:
    void hkm_ace_metrics_init(void);
    void hkm_ace_metrics_collect(struct hkm_ace_metrics *out);
    void hkm_ace_metrics_destroy(void);
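
For a quick smoke test of this API, a polling loop along these lines should work (a hypothetical harness; only the three functions and the struct above are assumed):

```c
#include <stdio.h>
#include <unistd.h>
#include "hakmem_ace_metrics.h"

// Hypothetical smoke test: poll the collector once per second and print
// the fast metrics. Slow fields (fragmentation, RSS) update every 60s.
int main(void) {
    hkm_ace_metrics_init();
    for (int i = 0; i < 5; i++) {
        sleep(1);
        struct hkm_ace_metrics m;
        hkm_ace_metrics_collect(&m);
        printf("ops/s=%llu llc_miss=%.3f backlog[0]=%u\n",
               (unsigned long long)m.throughput_ops,
               m.llc_miss_rate, m.remote_free_backlog[0]);
    }
    hkm_ace_metrics_destroy();
    return 0;
}
```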
    

1.2 Create core/hakmem_ace_metrics.c (1.5-2 hours)

  • Throughput tracking (30 min)

    • Global atomic counter g_ace_alloc_count
    • Increment in hakmem_alloc() / hakmem_free()
    • Calculate ops/sec from delta between collections
  • LLC miss monitoring (45 min)

    • Use rdpmc for lightweight performance counter access
    • Read LLC_MISSES and LLC_REFERENCES counters
    • Calculate miss_rate = misses / references
    • Fall back to 0.0 if rdpmc is unavailable
  • Mutex contention tracking (30 min)

    • Wrap pthread_mutex_lock() with timing
    • Track cumulative wait time per class
    • Reset counters after each collection
  • Remote free backlog (15 min)

    • Read g_tiny_classes[c].remote_backlog_count for each class
    • Already tracked by tiny pool implementation
  • Fragmentation ratio (slow, 60s) (15 min)

    • Calculate: allocated_bytes / reserved_bytes
    • Parse /proc/self/status for VmRSS and VmSize
    • Only update every 60 seconds (skip on fast collections)
  • RSS monitoring (slow, 60s) (15 min)

    • Read /proc/self/status VmRSS field
    • Convert to MB
    • Only update every 60 seconds
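
A minimal sketch of the /proc/self/status parsing shared by the two slow metrics above (assumes the Linux "VmRSS:   12345 kB" field layout; read_proc_status_kb is a placeholder name):

```c
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

// Read VmRSS and VmSize (both reported in kB) from /proc/self/status.
// Returns 0 on success, -1 if the file cannot be opened.
static int read_proc_status_kb(uint64_t *rss_kb, uint64_t *vmsize_kb) {
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return -1;
    char line[256];
    *rss_kb = *vmsize_kb = 0;
    while (fgets(line, sizeof line, f)) {
        if (strncmp(line, "VmRSS:", 6) == 0)
            sscanf(line + 6, "%" SCNu64, rss_kb);
        else if (strncmp(line, "VmSize:", 7) == 0)
            sscanf(line + 7, "%" SCNu64, vmsize_kb);
    }
    fclose(f);
    return 0;
}
// Callers: rss_mb = rss_kb / 1024; fragmentation uses allocated/reserved
// from the allocator's own counters, with VmRSS/VmSize as a cross-check.
```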

1.3 Integration with existing code (30 min)

  • Add #include "hakmem_ace_metrics.h" to core/hakmem.c
  • Call hkm_ace_metrics_init() in hakmem_init()
  • Call hkm_ace_metrics_destroy() in cleanup

2. Fast Loop Controller (2-3 hours)

2.1 Create core/hakmem_ace_controller.h (30 min)

  • Define struct hkm_ace_controller:
    struct hkm_ace_controller {
        struct hkm_ace_metrics current;
        struct hkm_ace_metrics prev;
    
        // Current knob values
        uint32_t tls_capacity[8];       // Per-class TLS magazine capacity
        uint32_t drain_threshold[8];    // Remote free drain threshold
    
        // Fast loop state
        uint64_t fast_interval_ms;      // Default 500ms
        uint64_t last_fast_tick_ms;
    
        // Slow loop state
        uint64_t slow_interval_ms;      // Default 30000ms (30s)
        uint64_t last_slow_tick_ms;
    
        // Enabled flag
        bool enabled;
    };
    
  • Define controller API:
    void hkm_ace_controller_init(struct hkm_ace_controller *ctrl);
    void hkm_ace_controller_tick(struct hkm_ace_controller *ctrl);
    void hkm_ace_controller_destroy(struct hkm_ace_controller *ctrl);
    

2.2 Create core/hakmem_ace_controller.c (1.5-2 hours)

  • Initialization (30 min)

    • Read environment variables:
      • HAKMEM_ACE_ENABLED (default 0)
      • HAKMEM_ACE_FAST_INTERVAL_MS (default 500)
      • HAKMEM_ACE_SLOW_INTERVAL_MS (default 30000)
    • Initialize knob values to current defaults:
      • tls_capacity[c] = TINY_TLS_MAG_CAP (currently 128)
      • drain_threshold[c] = TINY_REMOTE_DRAIN_THRESHOLD (currently high)
  • Fast loop tick (45 min)

    • Check if elapsed >= fast_interval_ms
    • Collect current metrics
    • Calculate reward: reward = throughput - (llc_miss_penalty + mutex_wait_penalty + backlog_penalty) (see the tick sketch after this list)
    • Adjust knobs based on metrics:
      // LLC miss high → reduce TLS capacity (diet)
      if (llc_miss_rate > 0.15) {
          tls_capacity[c] *= 0.75;  // Diet factor
      }
      
      // Remote backlog high → lower drain threshold
      if (remote_backlog[c] > drain_threshold[c]) {
          drain_threshold[c] /= 2;
      }
      
      // Mutex wait high → increase bundle width
      // (Phase 1: skip, implement in Phase 2)
      
    • Apply knob changes to runtime (see section 4)
    • Update prev metrics for next iteration
  • Slow loop tick (30 min)

    • Check if elapsed >= slow_interval_ms
    • Collect slow metrics (fragmentation, RSS)
    • If fragmentation high: trigger partial release (Phase 2 feature, skip for now)
    • If RSS high: trigger budgeted scavenge (Phase 2 feature, skip for now)
  • Tick dispatcher (15 min)

    • Combined hkm_ace_controller_tick() that calls both fast and slow loops
    • Use monotonic clock (clock_gettime(CLOCK_MONOTONIC)) for timing
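
A sketch of the dispatcher and the reward calculation referenced above (the 1e6/1e-3 penalty weights are invented placeholders to be calibrated; the backlog penalty is elided):

```c
#include <time.h>

// CLOCK_MONOTONIC timestamp so wall-clock adjustments cannot skew the loops.
static uint64_t now_ms(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000u + (uint64_t)ts.tv_nsec / 1000000u;
}

void hkm_ace_controller_tick(struct hkm_ace_controller *ctrl) {
    uint64_t now = now_ms();

    if (now - ctrl->last_fast_tick_ms >= ctrl->fast_interval_ms) {
        hkm_ace_metrics_collect(&ctrl->current);
        // Reward = throughput minus weighted penalties. The weights here
        // are placeholders, not tuned values.
        double reward = (double)ctrl->current.throughput_ops
                      - 1e6  * ctrl->current.llc_miss_rate
                      - 1e-3 * (double)ctrl->current.mutex_wait_ns;
        (void)reward;  // consumed by the UCB1 update (section 3.3)
        ctrl->prev = ctrl->current;
        ctrl->last_fast_tick_ms = now;
    }

    if (now - ctrl->last_slow_tick_ms >= ctrl->slow_interval_ms) {
        // Slow metrics only; scavenge/partial release are Phase 2.
        ctrl->last_slow_tick_ms = now;
    }
}
```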

2.3 Integration with main loop (30 min)

  • Add background thread in core/hakmem.c:
    static void* hkm_ace_thread_main(void *arg) {
        struct hkm_ace_controller *ctrl = arg;
        // Atomic load: the shutdown path clears 'enabled' from another thread
        while (__atomic_load_n(&ctrl->enabled, __ATOMIC_RELAXED)) {
            hkm_ace_controller_tick(ctrl);
            usleep(100000);  // 100ms sleep, check every 0.1s
        }
        return NULL;
    }
    
  • Start ACE thread in hakmem_init() if HAKMEM_ACE_ENABLED=1
  • Join ACE thread in cleanup
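
The start/stop wiring could look like this (hakmem_ace_start/hakmem_ace_stop are hypothetical names for the hooks called from hakmem_init() and cleanup):

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_t g_ace_thread;
static struct hkm_ace_controller g_ace_ctrl;

// Called from hakmem_init(): spawn the ACE thread only when enabled.
void hakmem_ace_start(void) {
    hkm_ace_controller_init(&g_ace_ctrl);
    if (g_ace_ctrl.enabled)
        pthread_create(&g_ace_thread, NULL, hkm_ace_thread_main, &g_ace_ctrl);
}

// Called from cleanup: signal shutdown, then join.
void hakmem_ace_stop(void) {
    if (g_ace_ctrl.enabled) {
        __atomic_store_n(&g_ace_ctrl.enabled, false, __ATOMIC_RELAXED);
        pthread_join(g_ace_thread, NULL);
    }
    hkm_ace_controller_destroy(&g_ace_ctrl);
}
```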

3. UCB1 Learning Algorithm (1-2 hours)

3.1 Create core/hakmem_ace_ucb1.h (30 min)

  • Define discrete knob candidates:
    // TLS capacity candidates
    static const uint32_t TLS_CAP_CANDIDATES[] = {4, 8, 16, 32, 64, 128, 256, 512};
    #define TLS_CAP_N_ARMS 8
    
    // Drain threshold candidates
    static const uint32_t DRAIN_THRESH_CANDIDATES[] = {32, 64, 128, 256, 512, 1024};
    #define DRAIN_THRESH_N_ARMS 6
    
  • Define struct hkm_ace_ucb1_arm:
    struct hkm_ace_ucb1_arm {
        uint32_t value;           // Knob value (e.g., 32, 64, 128)
        double avg_reward;        // Average reward
        uint32_t n_pulls;         // Number of times selected
    };
    
  • Define struct hkm_ace_ucb1_bandit:
    struct hkm_ace_ucb1_bandit {
        struct hkm_ace_ucb1_arm arms[TLS_CAP_N_ARMS];  // Sized for the largest candidate set
        int n_arms;                // Active arm count (≤ TLS_CAP_N_ARMS)
        uint32_t total_pulls;
        double exploration_bonus;  // Default sqrt(2)
    };
    
  • Define UCB1 API:
    void hkm_ace_ucb1_init(struct hkm_ace_ucb1_bandit *bandit, const uint32_t *candidates, int n_arms);
    int hkm_ace_ucb1_select(struct hkm_ace_ucb1_bandit *bandit);
    void hkm_ace_ucb1_update(struct hkm_ace_ucb1_bandit *bandit, int arm_idx, double reward);
    

3.2 Create core/hakmem_ace_ucb1.c (45 min)

  • Initialization (15 min)

    • Initialize each arm with candidate value
    • Set avg_reward = 0.0, n_pulls = 0
  • Selection (15 min)

    • Implement UCB1 formula:
      ucb_value = avg_reward + exploration_bonus * sqrt(log(total_pulls) / n_pulls)
      
    • Return arm index with highest UCB value
    • Handle initial exploration (n_pulls == 0 → infinity UCB)
  • Update (15 min)

    • Update running average:
      avg_reward = (avg_reward * n_pulls + reward) / (n_pulls + 1)
      
    • Increment n_pulls and total_pulls
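
Pulling the init/select/update steps together, a sketch of core/hakmem_ace_ucb1.c (relies on the n_arms field noted in 3.1):

```c
#include <math.h>

void hkm_ace_ucb1_init(struct hkm_ace_ucb1_bandit *b,
                       const uint32_t *candidates, int n_arms) {
    b->n_arms = n_arms;
    b->total_pulls = 0;
    b->exploration_bonus = sqrt(2.0);
    for (int i = 0; i < n_arms; i++) {
        b->arms[i].value = candidates[i];
        b->arms[i].avg_reward = 0.0;
        b->arms[i].n_pulls = 0;
    }
}

int hkm_ace_ucb1_select(struct hkm_ace_ucb1_bandit *b) {
    int best = 0;
    double best_ucb = -INFINITY;
    for (int i = 0; i < b->n_arms; i++) {
        if (b->arms[i].n_pulls == 0) return i;  // unexplored arm: UCB = +inf
        double ucb = b->arms[i].avg_reward
                   + b->exploration_bonus
                     * sqrt(log((double)b->total_pulls) / b->arms[i].n_pulls);
        if (ucb > best_ucb) { best_ucb = ucb; best = i; }
    }
    return best;
}

void hkm_ace_ucb1_update(struct hkm_ace_ucb1_bandit *b, int arm, double reward) {
    struct hkm_ace_ucb1_arm *a = &b->arms[arm];
    a->avg_reward = (a->avg_reward * a->n_pulls + reward) / (a->n_pulls + 1);
    a->n_pulls++;
    b->total_pulls++;
}
```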

3.3 Integration with controller (30 min)

  • Add UCB1 bandits to struct hkm_ace_controller:
    struct hkm_ace_ucb1_bandit tls_cap_bandit[8];   // Per-class TLS capacity
    struct hkm_ace_ucb1_bandit drain_bandit[8];     // Per-class drain threshold
    
  • In fast loop tick:
    • Select knob values using UCB1: arm_idx = hkm_ace_ucb1_select(&ctrl->tls_cap_bandit[c])
    • Apply selected values: ctrl->tls_capacity[c] = TLS_CAP_CANDIDATES[arm_idx]
    • After observing reward: hkm_ace_ucb1_update(&ctrl->tls_cap_bandit[c], arm_idx, reward)
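
Because the reward for an arm is only observable one interval later, the fast loop has to credit the previous tick's arm before selecting the next one. A sketch (last_arm[] is an assumed extra controller field, initialized to -1):

```c
// One fast-loop pass over the eight size classes.
for (int c = 0; c < 8; c++) {
    // 1. Credit the arm applied last tick with the reward just measured.
    if (ctrl->last_arm[c] >= 0)
        hkm_ace_ucb1_update(&ctrl->tls_cap_bandit[c], ctrl->last_arm[c], reward);

    // 2. Select and apply the arm for the coming interval.
    int arm = hkm_ace_ucb1_select(&ctrl->tls_cap_bandit[c]);
    ctrl->tls_capacity[c] = TLS_CAP_CANDIDATES[arm];
    ctrl->last_arm[c] = arm;
}
```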

4. Dynamic TLS Capacity Adjustment (1-2 hours)

4.1 Modify core/hakmem_tiny_magazine.h (30 min)

  • Change TINY_TLS_MAG_CAP from compile-time constant to runtime variable:
    // OLD:
    #define TINY_TLS_MAG_CAP 128
    
    // NEW:
    extern uint32_t g_tiny_tls_mag_cap[8];  // Per-class capacity
    
  • Update all references to TINY_TLS_MAG_CAP to use g_tiny_tls_mag_cap[class_idx]

4.2 Modify core/hakmem_tiny_magazine.c (30 min)

  • Define global capacity array:
    uint32_t g_tiny_tls_mag_cap[8] = {
        128, 128, 128, 128, 128, 128, 128, 128  // Default values
    };
    
  • Add setter function:
    void hkm_tiny_set_tls_capacity(uint8_t class_idx, uint32_t new_cap) {
        if (class_idx >= 8) return;
        g_tiny_tls_mag_cap[class_idx] = new_cap;
    }
    
  • Update magazine refill logic to respect dynamic capacity:
    // In tiny_magazine_refill():
    uint32_t cap = g_tiny_tls_mag_cap[class_idx];
    if (mag->count >= cap) return;  // Already at capacity
    

4.3 Integration with ACE controller (30 min)

  • In hkm_ace_controller_tick(), apply TLS capacity changes:
    for (int c = 0; c < 8; c++) {
        uint32_t new_cap = ctrl->tls_capacity[c];
        hkm_tiny_set_tls_capacity(c, new_cap);
    }
    
  • Similarly for drain threshold (if implemented in tiny pool):
    for (int c = 0; c < 8; c++) {
        uint32_t new_thresh = ctrl->drain_threshold[c];
        hkm_tiny_set_drain_threshold(c, new_thresh);
    }
    

5. ON/OFF Toggle and Configuration (1 hour)

5.1 Environment variables (30 min)

  • Add to core/hakmem_config.h:
    // ACE Learning Layer
    #define HAKMEM_ACE_ENABLED              "HAKMEM_ACE_ENABLED"              // 0/1
    #define HAKMEM_ACE_FAST_INTERVAL_MS     "HAKMEM_ACE_FAST_INTERVAL_MS"     // Default 500
    #define HAKMEM_ACE_SLOW_INTERVAL_MS     "HAKMEM_ACE_SLOW_INTERVAL_MS"     // Default 30000
    #define HAKMEM_ACE_LOG_LEVEL            "HAKMEM_ACE_LOG_LEVEL"            // 0=off, 1=info, 2=debug
    
    // Safety guards
    #define HAKMEM_ACE_MAX_P99_LAT_NS       "HAKMEM_ACE_MAX_P99_LAT_NS"       // Default 10000000 (10ms)
    #define HAKMEM_ACE_MAX_RSS_MB           "HAKMEM_ACE_MAX_RSS_MB"           // Default 16384 (16GB)
    #define HAKMEM_ACE_MAX_CPU_PERCENT      "HAKMEM_ACE_MAX_CPU_PERCENT"      // Default 5
    
  • Parse environment variables in hkm_ace_controller_init()
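
A sketch of that parsing (env_u64 is a local helper, not an existing API; defaults mirror the table above, and the drain threshold default is a placeholder):

```c
#include <stdlib.h>

// Read an unsigned integer from the environment, falling back to a default.
static uint64_t env_u64(const char *name, uint64_t def) {
    const char *s = getenv(name);
    return s ? strtoull(s, NULL, 10) : def;
}

void hkm_ace_controller_init(struct hkm_ace_controller *ctrl) {
    ctrl->enabled          = env_u64(HAKMEM_ACE_ENABLED, 0) != 0;
    ctrl->fast_interval_ms = env_u64(HAKMEM_ACE_FAST_INTERVAL_MS, 500);
    ctrl->slow_interval_ms = env_u64(HAKMEM_ACE_SLOW_INTERVAL_MS, 30000);
    for (int c = 0; c < 8; c++) {
        ctrl->tls_capacity[c]    = 128;   // TINY_TLS_MAG_CAP default
        ctrl->drain_threshold[c] = 1024;  // placeholder for the current default
    }
}
```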

5.2 Logging infrastructure (30 min)

  • Add logging macros in core/hakmem_ace_controller.c:
    #define ACE_LOG_INFO(fmt, ...) do { \
        if (g_ace_log_level >= 1) fprintf(stderr, "[ACE] " fmt "\n", ##__VA_ARGS__); \
    } while (0)
    
    #define ACE_LOG_DEBUG(fmt, ...) do { \
        if (g_ace_log_level >= 2) fprintf(stderr, "[ACE DEBUG] " fmt "\n", ##__VA_ARGS__); \
    } while (0)
    
  • Add debug output in fast loop:
    ACE_LOG_DEBUG("Fast loop: reward=%.2f, llc_miss=%.2f, backlog=%u",
                  reward, llc_miss_rate, remote_backlog[0]);
    ACE_LOG_INFO("Adjusting TLS cap[%d]: %u → %u (diet factor=%.2f)",
                 c, old_cap, new_cap, diet_factor);
    

Testing Strategy

Unit Tests

  • Test metrics collection:
    # Verify throughput tracking
    HAKMEM_ACE_ENABLED=1 ./test_ace_metrics
    
  • Test UCB1 selection:
    # Verify arm selection and update
    ./test_ace_ucb1
    

Integration Tests

  • Test ACE on fragmentation stress benchmark:
    # Baseline (ACE OFF)
    HAKMEM_ACE_ENABLED=0 ./bench_fragment_stress_hakx > baseline.txt
    
    # ACE ON
    HAKMEM_ACE_ENABLED=1 ./bench_fragment_stress_hakx > ace_on.txt
    
    # Compare
    diff baseline.txt ace_on.txt
    
  • Verify dynamic TLS capacity adjustment:
    # Enable debug logging
    export HAKMEM_ACE_ENABLED=1
    export HAKMEM_ACE_LOG_LEVEL=2
    ./bench_fragment_stress_hakx
    # Should see log output: "Adjusting TLS cap[2]: 128 → 96"
    

Benchmark Validation

  • Run A/B comparison on all weak workloads:
    bash scripts/ace_ab_test.sh
    
  • Expected results:
    • Fragmentation stress: 3.87 → 8-12 M ops/s (2-3x)
    • Mid MT: 111.6 M ops/s → 110-115 M ops/s (maintain ±5%)
    • Large WS: 22.15 M ops/s → 25-30 M ops/s (1.1-1.4x, partial improvement)

Implementation Order

Day 1 (7-9 hours):

  1. Morning (3-4 hours):

    • 1.1 Create hakmem_ace_metrics.h (30 min)
    • 1.2 Create hakmem_ace_metrics.c (2 hours)
    • 1.3 Integration (30 min)
    • Test: Verify metrics collection works
  2. Midday (2-3 hours):

    • 2.1 Create hakmem_ace_controller.h (30 min)
    • 2.2 Create hakmem_ace_controller.c (1.5 hours)
    • 2.3 Integration (30 min)
    • Test: Verify fast/slow loops run
  3. Afternoon (2-3 hours):

    • 3.1 Create hakmem_ace_ucb1.h (30 min)
    • 3.2 Create hakmem_ace_ucb1.c (45 min)
    • 3.3 Integration (30 min)
    • 4.1-4.3 Dynamic TLS capacity (1.5 hours)
    • 5.1-5.2 ON/OFF toggle (1 hour)
  4. Evening (1-2 hours):

    • Build and test complete system
    • Run fragmentation stress A/B test
    • Verify 2-3x improvement

Success Criteria

Phase 1 is complete when:

  • Metrics collection works (throughput, LLC miss, mutex wait, backlog)
  • Fast loop adjusts TLS capacity based on LLC miss rate
  • UCB1 learning selects optimal knob values
  • Dynamic TLS capacity affects runtime behavior
  • ON/OFF toggle via HAKMEM_ACE_ENABLED=1 works
  • Benchmark improvement: Fragmentation stress 3.87 → 8-12 M ops/s (2-3x)
  • No regression: Mid MT maintains 110-115 M ops/s (±5%)

Files to Create

New files (Phase 1):

core/hakmem_ace_metrics.h         (80 lines)
core/hakmem_ace_metrics.c         (300 lines)
core/hakmem_ace_controller.h      (100 lines)
core/hakmem_ace_controller.c      (400 lines)
core/hakmem_ace_ucb1.h            (80 lines)
core/hakmem_ace_ucb1.c            (150 lines)

Modified files:

core/hakmem_tiny_magazine.h       (change TINY_TLS_MAG_CAP to array)
core/hakmem_tiny_magazine.c       (add setter, use dynamic capacity)
core/hakmem.c                     (start ACE thread)
core/hakmem_config.h              (add ACE env vars)

Test files:

tests/unit/test_ace_metrics.c     (150 lines)
tests/unit/test_ace_ucb1.c        (120 lines)
tests/integration/test_ace_e2e.c  (200 lines)

Scripts:

benchmarks/scripts/utils/ace_ab_test.sh  (100 lines)

Total new code: ~1,680 lines (Phase 1 only)


Next Steps After Phase 1

Once Phase 1 is complete and validated:

  • Phase 2: Fragmentation countermeasures (budgeted scavenge, partial release)
  • Phase 3: Large WS countermeasures (auto diet, LLC miss optimization)
  • Phase 4: realloc optimization (in-place expansion, NT store)

Status: READY TO IMPLEMENT
Priority: HIGH 🔥
Expected Impact: 2-3x improvement on fragmentation stress
Risk: LOW (isolated, ON/OFF toggle, no impact when disabled)

Let's build it! 💪