hakmem/CHECKPOINT_PHASE2_COMPLETE.md
Moe Charm (CI) ca6e8ecaf1 Checkpoint: Phase 2 Box化 complete - 100% stable (0% crash rate)
Validation: 100/100 test iterations passed
Commits included:
- dea7ced42: Phase 1b fix (12% → 0% crash)
- 4f2bcb7d3: Phase 2 Box化 (3-level contract design)

Key achievements:
✓ 0% crash rate (100/100 iterations)
✓ Clear safety contracts (UNSAFE/SAFE/GUARDED)
✓ Future optimization paths documented
✓ Backward compatibility maintained

See CHECKPOINT_PHASE2_COMPLETE.md for full analysis.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 08:48:43 +09:00


CHECKPOINT: Phase 2 Box化 (Boxification) Complete

Date: 2025-11-29
Status: ✓ STABLE (100/100 tests passed, 0% crash rate)

Problem Summary

Initial State (Phase 12)

  • Implementation: Direct mask+dereference for SuperSlab lookup
  • Performance: ~5-10 cycles (fast)
  • Safety: ⚠️ UNSAFE - 12% crash rate
  • Root Cause: Arbitrary pointers → unmapped addresses → SEGFAULT
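The mask step behind this failure can be sketched as follows. This is a minimal illustration, not the project's code: the 2 MiB SuperSlab size, the `SuperSlabSketch` layout, and both function names are assumptions made for the example. The point is that the mask itself is harmless arithmetic; the fault comes from reading the header at the resulting address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: SuperSlabs are 2 MiB in size and 2 MiB-aligned.
 * The source does not state the real size. */
#define SS_SIZE_SKETCH ((uintptr_t)1 << 21)

/* Hypothetical header layout, for illustration only. */
typedef struct SuperSlabSketch {
    uint64_t magic;
} SuperSlabSketch;

/* Round a pointer down to the enclosing 2 MiB boundary.
 * Pure arithmetic: no memory is read, so this cannot fault. */
static inline uintptr_t ss_mask_base(const void *ptr) {
    return (uintptr_t)ptr & ~(SS_SIZE_SKETCH - 1);
}

/* The unsafe pattern: read the header at the masked address.
 * If ptr did not come from the allocator, the masked address may be
 * unmapped, and this load raises SIGSEGV. */
static inline uint64_t ss_read_magic_unchecked(const void *ptr) {
    assert(ptr != NULL);
    const SuperSlabSketch *ss = (const SuperSlabSketch *)ss_mask_base(ptr);
    return ss->magic; /* faults when the page is not mapped */
}
```

For a stack or global pointer, `ss_mask_base` still returns a well-formed address up to 2 MiB below it; whether that page happens to be mapped depends on the process layout, which is consistent with the intermittent crash rate reported here.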

Evolution

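| Phase    | Approach         | Performance   | Safety         | Result                |
|----------|------------------|---------------|----------------|-----------------------|
| Phase 12 | mask+dereference | 5-10 cycles   | ⚠️ UNSAFE      | 12% crash             |
| Phase 1a | Range checks     | 10-20 cycles  | ⚠️ UNSAFE      | 10-12% crash (failed) |
| Phase 1b | Registry lookup  | 50-100 cycles | ✓ SAFE         | 0% crash              |
| Phase 2  | Box化 (3 levels) | Selectable    | Contract-based | 0% crash              |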

Solution (Phase 1b + Phase 2)

Phase 1b: Immediate Fix

Commit: dea7ced42
Change: Replace ss_fast_lookup() with safe registry lookup
Result: 12% → 0% crash rate

Phase 2: Box化

Commit: 4f2bcb7d3
Design: SuperSlab Lookup Box with 3 contract levels

// Contract Level 1: UNSAFE (5-10 cycles)
ss_lookup_unsafe(ptr);   // Internal use only, requires validated pointer

// Contract Level 2: SAFE (50-100 cycles) - RECOMMENDED
ss_lookup_safe(ptr);     // Works with arbitrary pointers, 0% crash

// Contract Level 3: GUARDED (100-200 cycles)
ss_lookup_guarded(ptr);  // Debug builds only, full validation
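How the three contract levels might differ internally can be sketched like this. Everything below is an assumption for illustration: the `_sketch` function names, the 2 MiB size, the magic value, and the linear-scan registry are hypothetical stand-ins, and the real signatures in superslab_lookup_box.h are not shown in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SS_SIZE_SKETCH  ((uintptr_t)1 << 21)            /* assumed 2 MiB */
#define SS_MAGIC_SKETCH UINT64_C(0x5353424F58534B54)    /* hypothetical  */
#define REG_CAP_SKETCH  64                              /* hypothetical  */

typedef struct SuperSlab { uint64_t magic; } SuperSlab; /* hypothetical layout */

/* Toy registry: bases of all live SuperSlabs (a real one would be hashed). */
static uintptr_t g_registry[REG_CAP_SKETCH];
static size_t    g_registry_len;

static inline int ss_registry_add_sketch(SuperSlab *ss) {
    assert(((uintptr_t)ss & (SS_SIZE_SKETCH - 1)) == 0); /* must be aligned */
    if (g_registry_len == REG_CAP_SKETCH) return -1;
    ss->magic = SS_MAGIC_SKETCH;
    g_registry[g_registry_len++] = (uintptr_t)ss;
    return 0;
}

/* Level 1, UNSAFE: mask only. Precondition: ptr is a validated allocation.
 * For any other pointer the result may point at unmapped memory. */
static inline SuperSlab *ss_lookup_unsafe_sketch(void *ptr) {
    return (SuperSlab *)((uintptr_t)ptr & ~(SS_SIZE_SKETCH - 1));
}

/* Level 2, SAFE: consult the registry before trusting the masked base.
 * Never dereferences ptr or the base, so arbitrary pointers are fine. */
static inline SuperSlab *ss_lookup_safe_sketch(void *ptr) {
    uintptr_t base = (uintptr_t)ptr & ~(SS_SIZE_SKETCH - 1);
    for (size_t i = 0; i < g_registry_len; i++) {
        if (g_registry[i] == base) return (SuperSlab *)base;
    }
    return NULL;
}

/* Level 3, GUARDED: registry lookup plus a header check.
 * The header read is safe because the registry confirmed the mapping. */
static inline SuperSlab *ss_lookup_guarded_sketch(void *ptr) {
    SuperSlab *ss = ss_lookup_safe_sketch(ptr);
    if (ss == NULL || ss->magic != SS_MAGIC_SKETCH) return NULL;
    return ss;
}
```

In this sketch a caller on the `free()` path would use the safe variant and treat `NULL` as "not ours", while code that has already validated the pointer could use the unsafe variant.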

Testing Results

Final Checkpoint Validation

Test: 100 iterations of bench_random_mixed_hakmem (200K ops)
SUCCESS: 100/100 (100%)
CRASH: 0/100 (0%)

✓ CHECKPOINT VERIFIED: 100% STABLE
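A validation loop of this kind can be sketched as a small POSIX harness that runs a command repeatedly and counts signal-terminated runs. This is an illustrative sketch, not the script used for the numbers above; `RunStats` and `run_iterations` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Tally of outcomes across all iterations. */
typedef struct RunStats {
    int success; /* exited normally with status 0             */
    int crash;   /* terminated by a signal (e.g. SIGSEGV)     */
    int other;   /* non-zero exit, or fork/exec/wait failure  */
} RunStats;

/* Run argv[0] with arguments argv `iterations` times, one child per run. */
static inline RunStats run_iterations(char *const argv[], int iterations) {
    RunStats stats = {0, 0, 0};
    assert(argv != NULL && argv[0] != NULL && iterations >= 0);

    for (int i = 0; i < iterations; i++) {
        pid_t pid = fork();
        if (pid < 0) { stats.other++; continue; }
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127); /* reached only if exec failed */
        }
        int status = 0;
        if (waitpid(pid, &status, 0) < 0) { stats.other++; continue; }

        if (WIFSIGNALED(status)) {
            stats.crash++;
        } else if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            stats.success++;
        } else {
            stats.other++;
        }
    }
    return stats;
}
```

To reproduce a check like the one above, `argv` would name the benchmark binary with its arguments (the exact command line is not given in this document) and `iterations` would be 100; the run is stable when `crash` and `other` are both 0.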

Performance Impact

  • Phase 12 (unsafe): 5-10 cycles, 12% crash
  • Phase 1b/2 (safe): 50-100 cycles, 0% crash
  • Trade-off: roughly 10x slower (5-20x across the quoted ranges), but crash-free
  • Still faster than mincore() syscall (5000-10000 cycles)

Files Modified

Core Implementation

  • core/superslab/superslab_inline.h - Box integration
  • core/box/superslab_lookup_box.h - NEW - Box definition

Cleanup (removed conflicting extern declarations)

  • core/box/tls_sll_drain_box.h
  • core/box/external_guard_box.h
  • core/tiny_free_fast.inc.h

Future Optimization Opportunities

Documented in superslab_lookup_box.h:

Phase 2.1: Hybrid Lookup

  • Try UNSAFE first (optimistic fast path)
  • Fallback to SAFE on magic check failure
  • Best of both: 5-10 cycles (hit), 50-100 cycles (miss)

Phase 2.2: Per-Thread Cache

  • Cache last N lookups in TLS (ptr → SuperSlab)
  • Expected hit rate: 80-90%
  • Cost: 1-2 cycles (hit), 50-100 cycles (miss)
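The per-thread cache idea can be sketched as below. Note the substitution: this uses a small direct-mapped table keyed by SuperSlab base rather than a literal "last N lookups" list, because it is simpler to show. The slot count, the 2 MiB shift, the function names, and the stub slow path are all assumptions for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SS_SHIFT_SKETCH 21           /* assumed: 2 MiB SuperSlabs   */
#define TLS_CACHE_SLOTS 8            /* hypothetical cache size (N) */
#define DEMO_BASE ((uintptr_t)0x40000000u) /* fake base, never dereferenced */

typedef struct SuperSlab { uint64_t magic; } SuperSlab; /* hypothetical */
typedef SuperSlab *(*ss_slow_lookup_fn)(void *ptr);

typedef struct SsCacheEntry {
    uintptr_t  base; /* SuperSlab base address this entry describes */
    SuperSlab *ss;   /* NULL means the slot is empty                 */
} SsCacheEntry;

/* One table per thread, so no locking is needed on the hit path. */
static _Thread_local SsCacheEntry tls_ss_cache[TLS_CACHE_SLOTS];

static inline SuperSlab *ss_lookup_cached_sketch(void *ptr, ss_slow_lookup_fn slow) {
    uintptr_t base = ((uintptr_t)ptr >> SS_SHIFT_SKETCH) << SS_SHIFT_SKETCH;
    SsCacheEntry *e = &tls_ss_cache[(base >> SS_SHIFT_SKETCH) % TLS_CACHE_SLOTS];

    if (e->ss != NULL && e->base == base) return e->ss; /* hit */

    SuperSlab *ss = slow(ptr);       /* miss: fall back to the slow lookup */
    if (ss != NULL) {                /* only successful lookups are cached */
        e->base = base;
        e->ss = ss;
    }
    return ss;
}

/* Stub slow path standing in for a registry lookup; counts its calls. */
static SuperSlab g_demo_slab;
static int g_slow_calls;

static inline SuperSlab *slow_lookup_stub(void *ptr) {
    g_slow_calls++;
    uintptr_t base = ((uintptr_t)ptr >> SS_SHIFT_SKETCH) << SS_SHIFT_SKETCH;
    return base == DEMO_BASE ? &g_demo_slab : NULL;
}
```

A real version would also need to invalidate entries when a SuperSlab is released, since a stale entry in another thread's cache would otherwise point at freed memory; this sketch omits that.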

Phase 2.3: Hardware-Assisted Validation

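  • Use hardware pointer tagging or authentication (e.g. ARM PAC; on x86, an address-masking feature such as Intel LAM, with support detected via CPUID)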
  • Validate pointer origin without registry lookup
  • Requires kernel support / specific hardware

Key Insights

Why SEGFAULT Occurred (Even with Correct Code)

  1. Public API Nature

    void free(void* ptr);  // Accepts ANY pointer
    
    • Users can pass wrong pointers (stack, global, garbage)
    • The signature cannot rule this out, so free() must expect such pointers in practice
  2. Implementation Mismatch

    • Phase 12 assumed: "pointer is HAKMEM allocation"
    • Actual usage: Called BEFORE header validation
    • Result: Unsafe dereference of arbitrary pointers
  3. Probabilistic Failure

    • Depends on memory layout
    • Masked address may or may not be mapped
    • Benchmark: 12% probability of unmapped address

Why Box Pattern is Important

  • Clear Contracts: Each API documents preconditions
  • Multiple Levels: Choose speed vs safety based on context
  • Future-Proof: Enable optimizations without breaking code
  • Safety by Default: Recommended API (SAFE) is crash-free

References

  • Root cause analysis: In-session rr debugging (run 21/50)
  • Test methodology: 50-100 iteration validation loops
  • Design discussion: Option A/B/C analysis (user chose Option C)

Conclusion: Phase 2 Box化 provides both immediate stability (0% crash) and future optimization flexibility. This checkpoint represents a robust, well-documented state suitable for production deployment.

🤖 Generated with Claude Code