
Moe Charm (CI) 4ef0171bc0 feat: Add ACE allocation failure tracing and debug hooks

This commit introduces a comprehensive tracing mechanism for allocation failures within the Adaptive Cache Engine (ACE) component. This feature allows for precise identification of the root cause for Out-Of-Memory (OOM) issues related to ACE allocations.

Key changes include:
- **ACE Tracing Implementation**:
  - Added  environment variable to enable/disable detailed logging of allocation failures.
  - Instrumented , , and  to distinguish between "Threshold" (size class mismatch), "Exhaustion" (pool depletion), and "MapFail" (OS memory allocation failure).
- **Build System Fixes**:
  - Corrected  to ensure  is properly linked into , resolving an  error.
- **LD_PRELOAD Wrapper Adjustments**:
  - Investigated and understood the  wrapper's behavior under , particularly its interaction with  and  checks.
  - Enabled debugging flags for  environment to prevent unintended fallbacks to 's  for non-tiny allocations, allowing comprehensive testing of the  allocator.
- **Debugging & Verification**:
  - Introduced temporary verbose logging to pinpoint execution flow issues within  interception and  routing. These temporary logs have been removed.
  - Created  to facilitate testing of the tracing features.

This feature will significantly aid in diagnosing and resolving allocation-related OOM issues in  by providing clear insights into the failure pathways.

2025-12-01 16:37:59 +09:00


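ChatGPT Debug Instructions: Phase 9-2 (EMPTY Slab Recycle)

Debugging instructions for pinpointing why EMPTY slabs do not return to Stage 1 and shared_fail→legacy still occurs, while keeping to the Box Theory principle of "one boundary, always revertible".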

1. Current implementation summary

Implementation: Phase 9-2 integrates SLAB_TRY_RECYCLE() at the Remote/TLS drain boundaries
- core/superslab_slab.c:113 (EMPTY check after the remote drain)
- core/box/tls_sll_drain_box.h:246-254 (checks the slabs touched by the TLS SLL drain)
ChatGPT's previous fix (cleared the registry congestion)
- sp_meta_sync_slots_from_ss() syncs SLOT_ACTIVE mismatches
- shared_pool_release_slab() re-reads slot_state to avoid the early return ("registry full" no longer occurs)
Problems
- No performance gain: SuperSlab ON 16.15 M ops/s vs OFF 16.23 M ops/s (-0.5%)
- shared_fail→legacy cls=7 occurs 4 times (Stage 1 hit rate near 0%)

2. Debugging tasks

Building a debug build (removes the release guard)

make clean
make CFLAGS="-O2 -g -DHAKMEM_BUILD_RELEASE=0" bench_random_mixed_hakmem

Using the trace flags

HAKMEM_TINY_USE_SUPERSLAB=1 \
HAKMEM_SLAB_RECYCLE_TRACE=1 \
HAKMEM_SS_ACQUIRE_DEBUG=1 \
HAKMEM_SHARED_POOL_STAGE_STATS=1 \
  ./bench_random_mixed_hakmem 10000000 8192 2>&1 | tee debug_output.log
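
As a rough illustration of how an environment flag such as HAKMEM_SLAB_RECYCLE_TRACE can gate tracing, here is a minimal sketch. The helper names (demo_flag_value_enabled, demo_env_flag_enabled) and the rule "unset, empty, or 0 means off" are assumptions made for this example; the actual parsing in hakmem may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical rule for this sketch: a flag is on when its value is
 * present, non-empty, and not exactly "0". */
static bool demo_flag_value_enabled(const char *value) {
    return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}

/* Looks the flag up in the process environment. */
static bool demo_env_flag_enabled(const char *name) {
    assert(name != NULL);
    return demo_flag_value_enabled(getenv(name));
}
```

A trace site would then guard its fprintf(stderr, ...) call with demo_env_flag_enabled("HAKMEM_SLAB_RECYCLE_TRACE"), ideally caching the result so the hot path does not call getenv repeatedly.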

Log output to check (one-shot + fail-fast)
- [SLAB_RECYCLE]: counts of EMPTY/SUCCESS/SKIP_* and the slab/class involved
- [SS_ACQUIRE]: ratio of Stage 1 HIT to Stage 3
- Whether shared_fail→legacy cls=7 still appears
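
The counts above can be collected from debug_output.log with a simple substring counter. The sketch below is one way to do it in C; the function name demo_count_occurrences is hypothetical, and the exact log line format is an assumption (only the tags named in this document are used as search strings).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts non-overlapping occurrences of `needle` in `text`.
 * The needle must be a non-empty string. */
static size_t demo_count_occurrences(const char *text, const char *needle) {
    assert(text != NULL && needle != NULL && needle[0] != '\0');
    size_t count = 0;
    size_t step = strlen(needle);
    for (const char *p = strstr(text, needle); p != NULL;
         p = strstr(p + step, needle)) {
        count++;
    }
    return count;
}
```

Reading the log into a buffer and calling this with "Stage 1 HIT", "Stage 3", and "shared_fail" gives the three numbers needed for the Stage 1 ratio and the residual fallback count.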

3. Investigation points (per Box)

Is SLAB_TRY_RECYCLE() actually called (in both the remote drain and the TLS SLL drain)?
Does slab_is_empty(meta) correctly return true (meta->used==0 && capacity>0)?
Does shared_pool_release_slab() run all the way to the freelist insertion (no early return after the slot_state sync)?
Are Stage 1 hits occurring (expected 80%+, currently close to 0%)?
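
To make the EMPTY condition concrete, the following sketch models the predicate meta->used==0 && capacity>0 and a single drain step. The struct demo_slab_meta_t and both function names are hypothetical stand-ins; the real metadata layout and slab_is_empty() in hakmem are not shown in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the slab metadata; only the two fields
 * named in the text are modeled. */
typedef struct {
    uint16_t used;      /* blocks currently handed out */
    uint16_t capacity;  /* total blocks; 0 means never initialized */
} demo_slab_meta_t;

/* EMPTY means: initialized (capacity > 0) and nothing in use. */
static bool demo_slab_is_empty(const demo_slab_meta_t *meta) {
    assert(meta != NULL);
    return meta->used == 0 && meta->capacity > 0;
}

/* Models one drained block (used--). Returns true when this drain
 * made the slab EMPTY, i.e. when a recycle attempt would be due. */
static bool demo_drain_one(demo_slab_meta_t *meta) {
    assert(meta != NULL && meta->used > 0);
    meta->used--;
    return demo_slab_is_empty(meta);
}
```

The capacity > 0 term matters: a zero-initialized, never-used slab also has used == 0, and without that guard it would be recycled by mistake.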

4. Expected flow

Correct flow (11 steps; the boundary is a single path: recycle→release→Stage1)
1. alloc from SuperSlab Class 7
2. free → TLS SLL
3. TLS SLL drain (used--)
4. Remote drain (used--)
5. SLAB_TRY_RECYCLE() detects EMPTY
6. ss_mark_slab_empty(ss, slab_idx)
7. shared_pool_release_slab(ss, slab_idx)
8. sp_slot_mark_empty() transitions the slot to SLOT_EMPTY
9. Insert into sp_meta->empty_list (the Stage 1 freelist)
10. Unregister from g_super_reg (stable since the previous fix)
11. Stage 1 HIT on the next alloc (reuse)
Current flow (hypotheses on where it stalls)
- Between steps 7 and 8, sp_slot_mark_empty() fails and triggers an early return (the slab is never inserted into the freelist)
- Alternatively, the EMPTY check fails at step 5 and recycle never runs at all
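
Steps 7 to 11 and the first stall hypothesis can be modeled with a toy pool. This is only a sketch under stated assumptions: demo_pool_t, the fixed slot count, and every demo_* function are invented for illustration and do not mirror the real shared pool data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_slot_state { DEMO_SLOT_ACTIVE = 0, DEMO_SLOT_EMPTY };

#define DEMO_MAX_SLOTS 8

typedef struct {
    enum demo_slot_state state[DEMO_MAX_SLOTS];
    int empty_list[DEMO_MAX_SLOTS];  /* LIFO of reusable slot indices */
    int empty_count;
} demo_pool_t;

/* Step 8: ACTIVE -> EMPTY. Returns 0 on success, -1 otherwise. */
static int demo_slot_mark_empty(demo_pool_t *p, int idx) {
    if (p->state[idx] != DEMO_SLOT_ACTIVE) return -1;
    p->state[idx] = DEMO_SLOT_EMPTY;
    return 0;
}

/* Steps 7-9: returns true only if the slot reached the empty list. */
static bool demo_release_slab(demo_pool_t *p, int idx) {
    assert(p != NULL && idx >= 0 && idx < DEMO_MAX_SLOTS);
    if (demo_slot_mark_empty(p, idx) != 0) {
        return false;  /* early return: nothing is inserted */
    }
    p->empty_list[p->empty_count++] = idx;
    return true;
}

/* Step 11: Stage 1 acquire. Returns a slot index, or -1 on a miss. */
static int demo_stage1_acquire(demo_pool_t *p) {
    assert(p != NULL);
    if (p->empty_count == 0) return -1;
    int idx = p->empty_list[--p->empty_count];
    p->state[idx] = DEMO_SLOT_ACTIVE;
    return idx;
}
```

In this model a failed state transition leaves empty_count unchanged, so every later Stage 1 acquire misses. That is the same symptom as the near-0% Stage 1 hit rate, which is why the return value of the mark-empty step is worth tracing first.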

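5. Four possible problems

Issue A: EMPTY detection fails (slab_is_empty() returns false)
- meta->used is not being decremented by the drains, or the capacity == 0 case is not handled
Issue B: shared_pool_release_slab() returns early
- Check whether sp_slot_mark_empty() still returns non-zero and aborts even after the slot_state resync
Issue C: Freelist insertion does not happen
- The slot becomes SLOT_EMPTY but is never linked into empty_list, so Stage 1 runs dry
Issue D: A Class 7-specific problem
- With a SuperSlab capacity of 512KB the block count is small, so recycle cannot keep up and allocations fall through to legacy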

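6. Expected output format (reply template for ChatGPT)

Debug log analysis: counts and ratios of the key events, with sample log lines
Root cause: which step breaks the boundary (name the Box and the boundary explicitly)
Proposed fix: a concrete patch or an experimental flag (so that it can be A/B tested)
Verification plan: which benchmark and which flags to re-measure with (including success conditions)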

7. Success criteria (revertible via A/B)

shared_fail→legacy cls=7: 4 → 0
Stage 1 hit rate: 0% → 80%+
Performance: 16.5 M ops/s → 25–30 M ops/s (SuperSlab ON clearly wins)
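
The three criteria can be checked mechanically once the counters are extracted from a run. The sketch below encodes the thresholds stated above (0 fallbacks, 80% Stage 1 hit rate, 25 M ops/s as the lower bound of the target range); the struct demo_run_t and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of one benchmark run. */
typedef struct {
    unsigned long stage1_hits;       /* Stage 1 HIT count */
    unsigned long total_acquires;    /* all acquire attempts */
    unsigned long shared_fail_cls7;  /* shared_fail→legacy cls=7 count */
    double mops;                     /* throughput in M ops/s */
} demo_run_t;

/* Stage 1 hit rate in [0, 1]; 0 when nothing was acquired. */
static double demo_stage1_hit_rate(const demo_run_t *r) {
    assert(r != NULL);
    if (r->total_acquires == 0) return 0.0;
    return (double)r->stage1_hits / (double)r->total_acquires;
}

/* True only when all three success criteria hold. */
static bool demo_run_passes(const demo_run_t *r) {
    assert(r != NULL);
    return r->shared_fail_cls7 == 0
        && demo_stage1_hit_rate(r) >= 0.80
        && r->mops >= 25.0;
}
```

Running the same check on the SuperSlab ON and OFF results keeps the A/B comparison honest: a change that fixes the fallback count but misses the throughput target still fails.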
