Files
hakmem/scripts/make_chatgpt_pro_packet_free_path.sh

128 lines
4.4 KiB
Bash
Raw Normal View History

Phase 86: Free Path Legacy Mask (NO-GO, +0.25%) ## Summary Implemented Phase 86 "mask-only commit" optimization for free path: - Bitset mask (0x7f for C0-C6) to identify LEGACY classes - Direct call to tiny_legacy_fallback_free_base_with_env() - No indirect function pointers (avoids Phase 85's -0.86% regression) - Fail-fast on LARSON_FIX=1 (cross-thread validation incompatibility) ## Results (10-run SSOT) **NO-GO**: +0.25% improvement (threshold: +1.0%) - Control: 51,750,467 ops/s (CV: 2.26%) - Treatment: 51,881,055 ops/s (CV: 2.32%) - Delta: +0.25% (mean), -0.15% (median) ## Root Cause Competing optimizations plateau: 1. Phase 9/10 MONO LEGACY (+1.89%) already capture most free path benefit 2. Remaining margin insufficient to overcome: - Two branch checks (mask_enabled + has_class) - I-cache layout tax in hot path - Direct function call overhead ## Phase 85 vs Phase 86 | Metric | Phase 85 | Phase 86 | |--------|----------|----------| | Approach | Indirect calls + table | Bitset mask + direct call | | Result | -0.86% | +0.25% | | Verdict | NO-GO (regression) | NO-GO (insufficient) | Phase 86 correctly avoided indirect call penalties but revealed architectural limit: can't escape Phase 9/10 overlay without restructuring. ## Recommendation Free path optimization layer has reached practical ceiling: - Phase 9/10 +1.89% + Phase 6/19/FASTLANE +16-27% ≈ 18-29% total - Further attempts on ceremony elimination face same constraints - Recommend focus on different optimization layers (malloc, etc.) ## Files Changed ### New - core/box/free_path_legacy_mask_box.h (API + globals) - core/box/free_path_legacy_mask_box.c (refresh logic) ### Modified - core/bench_profile.h (added refresh call) - core/front/malloc_tiny_fast.h (added Phase 86 fast path check) - Makefile (added object files) - CURRENT_TASK.md (documented result) All changes conditional on HAKMEM_FREE_PATH_LEGACY_MASK=1 (default OFF). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-18 22:05:34 +09:00
#!/usr/bin/env bash
set -euo pipefail
# Generate a compact "free-path review packet" for sharing with ChatGPT Pro.
# Output: Markdown to stdout (copy/paste).
#
# Usage:
# scripts/make_chatgpt_pro_packet_free_path.sh > /tmp/free_path_packet.md
#
# Notes:
# - Extracts key functions with a simple brace counter.
# - Clips each snippet to keep it shareable.
root_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
cd "${root_dir}"
# Default clip is intentionally small; you can override via CLIP_LINES=...
clip="${CLIP_LINES:-160}"
need() { command -v "$1" >/dev/null 2>&1 || { echo "[packet] missing $1" >&2; exit 1; }; }
need awk
need sed
extract_func_n_clip() {
local file="$1"
local re="$2"
local nth="$3"
local clip_lines="$4"
awk -v re="${re}" -v nth="${nth}" '
function count_char(s, c, i,n) { n=0; for (i=1;i<=length(s);i++) if (substr(s,i,1)==c) n++; return n }
BEGIN { hit=0; started=0; depth=0; seen_open=0 }
{
if (!started) {
if ($0 ~ re) {
hit++;
if (hit == nth) {
started=1;
}
}
}
if (started) {
print $0;
depth += count_char($0, "{");
if (count_char($0, "{") > 0) seen_open=1;
depth -= count_char($0, "}");
if (seen_open && depth <= 0) exit 0;
}
}
' "${file}" | sed -n "1,${clip_lines}p"
}
extract_func() {
extract_func_n_clip "$1" "$2" 1 "${clip}"
}
md_code() {
local lang="$1"
local file="$2"
echo ""
echo "### \`${file}\`"
echo "\`\`\`${lang}"
cat
echo "\`\`\`"
}
cat <<'MD'
# Hakmem free-path review packet (compact)
Goal: understand remaining fixed costs vs mimalloc/tcmalloc, with Box Theory (single boundary, reversible ENV gates).
SSOT bench conditions (current practice):
- `HAKMEM_PROFILE=MIXED_TINYV3_C7_SAFE`
- `ITERS=20000000 WS=400 RUNS=10`
- run via `scripts/run_mixed_10_cleanenv.sh`
Request:
1) Where is the dominant fixed cost on free path now?
2) What structural change would give +510% without breaking Box Theory?
3) What NOT to do (layout tax pitfalls)?
MD
echo ""
echo "## Code excerpts (clipped)"
# We focus on the hot tiny-free pipeline (the most actionable for instruction/branch work).
# If the reviewer needs wrapper/registry code too, we can provide a larger packet.
# A) tiny_free_gate_try_fast(): user_ptr -> class_idx/base -> tiny_hot_free_fast()/fallback
extract_func core/box/tiny_free_gate_box.h '^static inline int tiny_free_gate_try_fast\\(void\\* user_ptr\\)' | md_code c core/box/tiny_free_gate_box.h
# B) free_tiny_fast(): main Tiny free dispatcher (hot/cold + env snapshot)
extract_func_n_clip core/front/malloc_tiny_fast.h '^static inline int free_tiny_fast\\(void\\* ptr\\)' 1 220 | md_code c core/front/malloc_tiny_fast.h
# C) tiny_hot_free_fast(): TLS unified cache push
extract_func core/box/tiny_front_hot_box.h '^static inline int tiny_hot_free_fast\\(int class_idx, void\\* base\\)' | md_code c core/box/tiny_front_hot_box.h
# D) tiny_legacy_fallback_free_base_with_env(): inline-slots cascade + unified_cache_push(_fast)
extract_func_n_clip core/box/tiny_legacy_fallback_box.h '^static inline void tiny_legacy_fallback_free_base_with_env\\(void\\* base, uint32_t class_idx, const HakmemEnvSnapshot\\* env\\)' 1 260 | md_code c core/box/tiny_legacy_fallback_box.h
cat <<'MD'
## Questions to answer (please be concrete)
1) In these snippets, which checks/branches are still "per-op fixed taxes" on the hot free path?
- Please point to specific lines/conditions and estimate cost (branches/instructions or dependency chain).
2) Is `tiny_hot_free_fast()` already close to optimal, and the real bottleneck is upstream (user->base/classify/route)?
- If yes, whats the smallest structural refactor that removes that upstream fixed tax?
3) Should we introduce a "commit once" plan (freeze the chosen free path) — or is branch prediction already making lazy-init checks ~free here?
- If "commit once", where should it live to avoid runtime gate overhead (bench_profile refresh boundary vs per-op)?
4) We have had many layout-tax regressions from code removal/reordering.
- What patterns here are most likely to trigger layout tax if changed?
- How would you stage a safe A/B (same binary, ENV toggle) for your proposal?
5) If you could change just ONE of:
- pointer classification to base/class_idx,
- route determination,
- unified cache push/pop structure,
which is highest ROI for +510% on WS=400?
MD
echo ""
echo "[packet] done"