hakmem/scripts/make_chatgpt_pro_packet_free_path.sh
Moe Charm (CI) e4c5f05355 Phase 86: Free Path Legacy Mask (NO-GO, +0.25%)
## Summary

Implemented Phase 86 "mask-only commit" optimization for free path:
- Bitset mask (0x7f for C0-C6) to identify LEGACY classes
- Direct call to tiny_legacy_fallback_free_base_with_env()
- No indirect function pointers (avoids Phase 85's -0.86% regression)
- Fail-fast on LARSON_FIX=1 (cross-thread validation incompatibility)
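
The mask test in the first bullet comes down to one shift and one AND. The sketch below models only that bit arithmetic in shell: 0x7f sets bits 0 through 6, so C0-C6 match and C7 does not. The helper name `is_legacy_class` is hypothetical and used here for illustration only; the real check is C code in `core/front/malloc_tiny_fast.h`.

```bash
#!/usr/bin/env bash
# Illustrative only: models the Phase 86 class mask with shell arithmetic.
# LEGACY_MASK uses the 0x7f value from the summary; is_legacy_class is a
# hypothetical helper name, not a function in the hakmem tree.
LEGACY_MASK=$((0x7f))

is_legacy_class() {
  local class_idx="$1"
  # Bit N set in the mask means class CN takes the legacy free path.
  # (( ... )) returns success (0) when the expression is non-zero.
  (( (LEGACY_MASK >> class_idx) & 1 ))
}

for c in 0 6 7; do
  if is_legacy_class "$c"; then
    echo "C${c}: legacy"
  else
    echo "C${c}: not legacy"
  fi
done
```

A single mask lookup of this kind replaces a per-class table walk, which is why the approach avoids the indirect calls used in Phase 85.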

## Results (10-run SSOT)

**NO-GO**: +0.25% improvement (threshold: +1.0%)
- Control:    51,750,467 ops/s (CV: 2.26%)
- Treatment:  51,881,055 ops/s (CV: 2.32%)
- Delta:      +0.25% (mean), -0.15% (median)
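
The mean delta follows directly from the two throughput figures above. A minimal sketch of the arithmetic, assuming the delta is the relative change of the treatment mean against the control mean (`pct_delta` is a hypothetical helper name):

```bash
#!/usr/bin/env bash
# Relative change of mean throughput, in percent, using the figures reported above.
control=51750467
treatment=51881055

# awk does the floating-point division that shell integer arithmetic lacks.
pct_delta() {
  awk -v c="$1" -v t="$2" 'BEGIN { printf "%+.2f%%\n", (t - c) / c * 100 }'
}

pct_delta "$control" "$treatment"   # prints +0.25%
```

The result sits well inside the reported CV of roughly 2.3% and below the +1.0% threshold, which matches the NO-GO verdict.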

## Root Cause

Competing optimizations plateau:
1. Phase 9/10 MONO LEGACY (+1.89%) already captures most of the free path benefit
2. Remaining margin insufficient to overcome:
   - Two branch checks (mask_enabled + has_class)
   - I-cache layout tax in hot path
   - Direct function call overhead

## Phase 85 vs Phase 86

| Metric | Phase 85 | Phase 86 |
|--------|----------|----------|
| Approach | Indirect calls + table | Bitset mask + direct call |
| Result | -0.86% | +0.25% |
| Verdict | NO-GO (regression) | NO-GO (insufficient) |

Phase 86 correctly avoided indirect call penalties but revealed an architectural
limit: the free path cannot escape the Phase 9/10 overlay without restructuring.

## Recommendation

Free path optimization layer has reached practical ceiling:
- Phase 9/10 +1.89% + Phase 6/19/FASTLANE +16-27% ≈ 18-29% total
- Further attempts on ceremony elimination face same constraints
- Recommend focus on different optimization layers (malloc, etc.)

## Files Changed

### New
- core/box/free_path_legacy_mask_box.h (API + globals)
- core/box/free_path_legacy_mask_box.c (refresh logic)

### Modified
- core/bench_profile.h (added refresh call)
- core/front/malloc_tiny_fast.h (added Phase 86 fast path check)
- Makefile (added object files)
- CURRENT_TASK.md (documented result)

All changes conditional on HAKMEM_FREE_PATH_LEGACY_MASK=1 (default OFF).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-18 22:05:34 +09:00


#!/usr/bin/env bash
set -euo pipefail
# Generate a compact "free-path review packet" for sharing with ChatGPT Pro.
# Output: Markdown to stdout (copy/paste).
#
# Usage:
# scripts/make_chatgpt_pro_packet_free_path.sh > /tmp/free_path_packet.md
#
# Notes:
# - Extracts key functions with a simple brace counter.
# - Clips each snippet to keep it shareable.
root_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
cd "${root_dir}"
# Default clip is intentionally small; you can override via CLIP_LINES=...
clip="${CLIP_LINES:-160}"
need() { command -v "$1" >/dev/null 2>&1 || { echo "[packet] missing $1" >&2; exit 1; }; }
need awk
need sed
extract_func_n_clip() {
  local file="$1"
  local re="$2"
  local nth="$3"
  local clip_lines="$4"
  awk -v re="${re}" -v nth="${nth}" '
    function count_char(s, c, i,n) { n=0; for (i=1;i<=length(s);i++) if (substr(s,i,1)==c) n++; return n }
    BEGIN { hit=0; started=0; depth=0; seen_open=0 }
    {
      if (!started) {
        if ($0 ~ re) {
          hit++;
          if (hit == nth) {
            started=1;
          }
        }
      }
      if (started) {
        print $0;
        depth += count_char($0, "{");
        if (count_char($0, "{") > 0) seen_open=1;
        depth -= count_char($0, "}");
        if (seen_open && depth <= 0) exit 0;
      }
    }
  ' "${file}" | sed -n "1,${clip_lines}p"
}
extract_func() {
  extract_func_n_clip "$1" "$2" 1 "${clip}"
}
md_code() {
  local lang="$1"
  local file="$2"
  echo ""
  echo "### \`${file}\`"
  echo "\`\`\`${lang}"
  cat
  echo "\`\`\`"
}
cat <<'MD'
# Hakmem free-path review packet (compact)
Goal: understand remaining fixed costs vs mimalloc/tcmalloc, with Box Theory (single boundary, reversible ENV gates).
SSOT bench conditions (current practice):
- `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE`
- `ITERS=20000000 WS=400 RUNS=10`
- run via `scripts/run_mixed_10_cleanenv.sh`
Request:
1) Where is the dominant fixed cost on free path now?
2) What structural change would give +5-10% without breaking Box Theory?
3) What NOT to do (layout tax pitfalls)?
MD
echo ""
echo "## Code excerpts (clipped)"
# We focus on the hot tiny-free pipeline (the most actionable for instruction/branch work).
# If the reviewer needs wrapper/registry code too, we can provide a larger packet.
# A) tiny_free_gate_try_fast(): user_ptr -> class_idx/base -> tiny_hot_free_fast()/fallback
extract_func core/box/tiny_free_gate_box.h '^static inline int tiny_free_gate_try_fast\\(void\\* user_ptr\\)' | md_code c core/box/tiny_free_gate_box.h
# B) free_tiny_fast(): main Tiny free dispatcher (hot/cold + env snapshot)
extract_func_n_clip core/front/malloc_tiny_fast.h '^static inline int free_tiny_fast\\(void\\* ptr\\)' 1 220 | md_code c core/front/malloc_tiny_fast.h
# C) tiny_hot_free_fast(): TLS unified cache push
extract_func core/box/tiny_front_hot_box.h '^static inline int tiny_hot_free_fast\\(int class_idx, void\\* base\\)' | md_code c core/box/tiny_front_hot_box.h
# D) tiny_legacy_fallback_free_base_with_env(): inline-slots cascade + unified_cache_push(_fast)
extract_func_n_clip core/box/tiny_legacy_fallback_box.h '^static inline void tiny_legacy_fallback_free_base_with_env\\(void\\* base, uint32_t class_idx, const HakmemEnvSnapshot\\* env\\)' 1 260 | md_code c core/box/tiny_legacy_fallback_box.h
cat <<'MD'
## Questions to answer (please be concrete)
1) In these snippets, which checks/branches are still "per-op fixed taxes" on the hot free path?
- Please point to specific lines/conditions and estimate cost (branches/instructions or dependency chain).
2) Is `tiny_hot_free_fast()` already close to optimal, and the real bottleneck is upstream (user->base/classify/route)?
- If yes, what's the smallest structural refactor that removes that upstream fixed tax?
3) Should we introduce a "commit once" plan (freeze the chosen free path) — or is branch prediction already making lazy-init checks ~free here?
- If "commit once", where should it live to avoid runtime gate overhead (bench_profile refresh boundary vs per-op)?
4) We have had many layout-tax regressions from code removal/reordering.
- What patterns here are most likely to trigger layout tax if changed?
- How would you stage a safe A/B (same binary, ENV toggle) for your proposal?
5) If you could change just ONE of:
- pointer classification to base/class_idx,
- route determination,
- unified cache push/pop structure,
which is highest ROI for +5-10% on WS=400?
MD
echo ""
# Status goes to stderr so it does not end up in the Markdown packet on stdout.
echo "[packet] done" >&2