Commit Graph

10 Commits

Author SHA1 Message Date
3f461ba25f Cleanup: Consolidate debug ENV vars to HAKMEM_DEBUG_LEVEL
Integrated 4 new debug environment variables added during bug fixes
into the existing unified HAKMEM_DEBUG_LEVEL system (expanded to 0-5 levels).

Changes:

1. Expanded HAKMEM_DEBUG_LEVEL from 0-3 to 0-5 levels:
   - 0 = OFF (production)
   - 1 = ERROR (critical errors)
   - 2 = WARN (warnings)
   - 3 = INFO (allocation paths, header validation, stats)
   - 4 = DEBUG (guard instrumentation, failfast)
   - 5 = TRACE (verbose tracing)

2. Integrated 4 environment variables:
   - HAKMEM_ALLOC_PATH_TRACE → HAKMEM_DEBUG_LEVEL >= 3 (INFO)
   - HAKMEM_TINY_SLL_VALIDATE_HDR → HAKMEM_DEBUG_LEVEL >= 3 (INFO)
   - HAKMEM_TINY_REFILL_FAILFAST → HAKMEM_DEBUG_LEVEL >= 4 (DEBUG)
   - HAKMEM_TINY_GUARD → HAKMEM_DEBUG_LEVEL >= 4 (DEBUG)

3. Kept 2 special-purpose variables (fine-grained control):
   - HAKMEM_TINY_GUARD_CLASS (target class for guard)
   - HAKMEM_TINY_GUARD_MAX (max guard events)

4. Backward compatibility:
   - Legacy ENV vars still work via hak_debug_check_level()
   - New code uses unified system
   - No behavior changes for existing users

Updated files:
- core/hakmem_debug_master.h (level 0-5 expansion)
- core/hakmem_tiny_superslab_internal.h (alloc path trace)
- core/box/tls_sll_box.h (header validation)
- core/tiny_failfast.c (failfast level)
- core/tiny_refill_opt.h (failfast guard)
- core/hakmem_tiny_ace_guard_box.inc (guard enable)
- core/hakmem_tiny.c (include hakmem_debug_master.h)

Impact:
- Simpler debug control: HAKMEM_DEBUG_LEVEL=3 instead of 4 separate ENVs
- Easier to discover/use
- Consistent debug levels across codebase
- Reduces ENV variable proliferation (43+ vars surveyed)

Future work:
- Consolidate remaining 39+ debug variables (documented in survey)
- Gradual migration over 2-3 releases

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 06:57:03 +09:00
20f8d6f179 Cleanup: Add tiny_debug_api.h to eliminate guard/failfast implicit warnings
Created central header for debug instrumentation API to fix implicit
function declaration warnings across the codebase.

Changes:
1. Created core/tiny_debug_api.h
   - Declares guard system API (3 functions)
   - Declares failfast debugging API (3 functions)
   - Uses forward declarations for SuperSlab/TinySlabMeta

2. Updated 3 files to include tiny_debug_api.h:
   - core/tiny_region_id.h (removed inline externs)
   - core/hakmem_tiny_tls_ops.h
   - core/tiny_superslab_alloc.inc.h

Warnings eliminated (6 of 11 total):
 tiny_guard_is_enabled()
 tiny_guard_on_alloc()
 tiny_guard_on_invalid()
 tiny_failfast_log()
 tiny_failfast_abort_ptr()
 tiny_refill_failfast_level()

Remaining warnings (deferred to P1):
- ss_active_add (2 occurrences)
- expand_superslab_head
- hkm_ace_set_tls_capacity
- smallmid_backend_free

Impact:
- Cleaner build output
- Better type safety for debug functions
- No behavior changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 06:47:13 +09:00
5a5aaf7514 Cleanup: Reformat super-long line in pool_api.inc.h for readability
Refactored the extremely compressed line 312 (previously 600+ chars) into
properly indented, readable code while preserving identical logic:

- Broke down TLS local freelist spill operation into clear steps
- Added clarifying comment for spill operation
- Improved atomic CAS loop formatting
- No functional changes, only formatting improvements

Performance verified: 16-18M ops/s maintained (same as before)
2025-11-28 17:10:32 +09:00
f8b0f38f78 ENV Cleanup Step 8: Gate HAKMEM_SUPER_LOOKUP_DEBUG in header
Gate HAKMEM_SUPER_LOOKUP_DEBUG environment variable behind
#if !HAKMEM_BUILD_RELEASE in hakmem_super_registry.h inline function.

Changes:
- Wrap s_dbg initialization in conditional compilation
- Release builds use constant s_dbg = 0 for complete elimination
- Debug logging in hak_super_lookup() now fully compiled out in release

Performance: 30.3M ops/s Larson (stable, no regression)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-28 01:45:45 +09:00
bcfb4f6b59 Remove dead code: UltraHot, RingCache, FrontC23, Class5 Hotpath
(cherry-picked from 225b6fcc7, conflicts resolved)
2025-11-26 12:33:49 +09:00
25d963a4aa Code Cleanup: Remove false positives, redundant validations, and reduce verbose logging
Following the C7 stride upgrade fix (commit 23c0d9541), this commit performs
comprehensive cleanup to improve code quality and reduce debug noise.

## Changes

### 1. Disable False Positive Checks (tiny_nextptr.h)
- **Disabled**: NXT_MISALIGN validation block with `#if 0`
- **Reason**: Produces false positives due to slab base offsets (2048, 65536)
  not being stride-aligned, causing all blocks to appear "misaligned"
- **TODO**: Reimplement to check stride DISTANCE between consecutive blocks
  instead of absolute alignment to stride boundaries

### 2. Remove Redundant Geometry Validations

**hakmem_tiny_refill_p0.inc.h (P0 batch refill)**
- Removed 25-line CARVE_GEOMETRY_FIX validation block
- Replaced with NOTE explaining redundancy
- **Reason**: Stride table is now correct in tiny_block_stride_for_class(),
  defense-in-depth validation adds overhead without benefit

**ss_legacy_backend_box.c (legacy backend)**
- Removed 18-line LEGACY_FIX_GEOMETRY validation block
- Replaced with NOTE explaining redundancy
- **Reason**: Shared_pool validates geometry at acquisition time

### 3. Reduce Verbose Logging

**hakmem_shared_pool.c (sp_fix_geometry_if_needed)**
- Made SP_FIX_GEOMETRY logging conditional on `!HAKMEM_BUILD_RELEASE`
- **Reason**: Geometry fixes are expected during stride upgrades,
  no need to log in release builds

### 4. Verification
- Build:  Successful (LTO warnings expected)
- Test:  10K iterations (1.87M ops/s, no crashes)
- NXT_MISALIGN false positives:  Eliminated

## Files Modified
- core/tiny_nextptr.h - Disabled false positive NXT_MISALIGN check
- core/hakmem_tiny_refill_p0.inc.h - Removed redundant CARVE validation
- core/box/ss_legacy_backend_box.c - Removed redundant LEGACY validation
- core/hakmem_shared_pool.c - Made SP_FIX_GEOMETRY logging debug-only

## Impact
- **Code clarity**: Removed 43 lines of redundant validation code
- **Debug noise**: Reduced false positive diagnostics
- **Performance**: Eliminated overhead from redundant geometry checks
- **Maintainability**: Single source of truth for geometry validation

🧹 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-21 23:00:24 +09:00
6afaa5703a Phase 12-1.1: EMPTY Slab Detection + Immediate Reuse (+13% improvement, 10.2M→11.5M ops/s)
Implementation of Task-sensei Priority 1 recommendation: Add empty_mask to SuperSlab
for immediate EMPTY slab detection and reuse, reducing Stage 3 (mmap) overhead.

## Changes

### 1. SuperSlab Structure (core/superslab/superslab_types.h)
- Added `empty_mask` (uint32_t): Bitmap for EMPTY slabs (used==0)
- Added `empty_count` (uint8_t): Quick check for EMPTY slab availability

### 2. EMPTY Detection API (core/box/ss_hot_cold_box.h)
- Added `ss_is_slab_empty()`: Returns true if slab is completely EMPTY
- Added `ss_mark_slab_empty()`: Marks slab as EMPTY (highest reuse priority)
- Added `ss_clear_slab_empty()`: Removes EMPTY state when reactivated
- Updated `ss_update_hot_cold_indices()`: Classify EMPTY/Hot/Cold slabs
- Updated `ss_init_hot_cold()`: Initialize empty_mask/empty_count

### 3. Free Path Integration (core/box/free_local_box.c)
- After `meta->used--`, check if `meta->used == 0`
- If true, call `ss_mark_slab_empty()` to update empty_mask
- Enables immediate EMPTY detection on every free operation

### 4. Shared Pool Stage 0.5 (core/hakmem_shared_pool.c)
- New Stage 0.5 before Stage 1: Scan existing SuperSlabs for EMPTY slabs
- Iterate over `g_super_reg_by_class[class_idx][]` (first 16 entries)
- Check `ss->empty_count > 0` → scan `empty_mask` with `__builtin_ctz()`
- Reuse EMPTY slab directly, avoiding Stage 3 (mmap/lock overhead)
- ENV control: `HAKMEM_SS_EMPTY_REUSE=1` (default OFF for A/B testing)
- ENV tunable: `HAKMEM_SS_EMPTY_SCAN_LIMIT=N` (default 16 SuperSlabs)

## Performance Results

```
Benchmark: Random Mixed 256B (100K iterations)

OFF (default):  10.2M ops/s (baseline)
ON  (ENV=1):    11.5M ops/s (+13.0% improvement) 
```

## Expected Impact (from Task-sensei analysis)

**Current bottleneck**:
- Stage 1: 2-5% hit rate (free list broken)
- Stage 2: 3-8% hit rate (rare UNUSED)
- Stage 3: 87-95% hit rate (lock + mmap overhead) ← bottleneck

**Expected with Phase 12-1.1**:
- Stage 0.5: 20-40% hit rate (EMPTY scan)
- Stage 1-2: 20-30% hit rate (combined)
- Stage 3: 30-50% hit rate (significantly reduced)

**Theoretical max**: 25M → 55-70M ops/s (+120-180%)

## Current Gap Analysis

**Observed**: 11.5M ops/s (+13%)
**Expected**: 55-70M ops/s (+120-180%)
**Gap**: Performance regression or missing complementary optimizations

Possible causes:
1. Phase 3d-C (25.1M→10.2M) regression - unrelated to this change
2. EMPTY scan overhead (16 SuperSlabs × empty_count check)
3. Missing Priority 2-5 optimizations (Lazy SS deallocation, etc.)
4. Stage 0.5 too conservative (scan_limit=16, should be higher?)

## Usage

```bash
# Enable EMPTY reuse optimization
export HAKMEM_SS_EMPTY_REUSE=1

# Optional: increase scan limit (trade-off: throughput vs latency)
export HAKMEM_SS_EMPTY_SCAN_LIMIT=32

./bench_random_mixed_hakmem 100000 256 42
```

## Next Steps

**Priority 1-A**: Investigate Phase 3d-C→12-1.1 regression (25.1M→10.2M)
**Priority 1-B**: Implement Phase 12-1.2 (Lazy SS deallocation) for complementary effect
**Priority 1-C**: Profile Stage 0.5 overhead (scan_limit tuning)

## Files Modified

Core implementation:
- `core/superslab/superslab_types.h` - empty_mask/empty_count fields
- `core/box/ss_hot_cold_box.h` - EMPTY detection/marking API
- `core/box/free_local_box.c` - Free path EMPTY detection
- `core/hakmem_shared_pool.c` - Stage 0.5 EMPTY scan

Documentation:
- `CURRENT_TASK.md` - Task-sensei investigation report

---

🎯 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Task-sensei (investigation & design analysis)
2025-11-21 04:56:48 +09:00
9b0d746407 Phase 3d-B: TLS Cache Merge - Unified g_tls_sll[] structure (+12-18% expected)
Merge separate g_tls_sll_head[] and g_tls_sll_count[] arrays into unified
TinyTLSSLL struct to improve L1D cache locality. Expected performance gain:
+12-18% from reducing cache line splits (2 loads → 1 load per operation).

Changes:
- core/hakmem_tiny.h: Add TinyTLSSLL type (16B aligned, head+count+pad)
- core/hakmem_tiny.c: Replace separate arrays with g_tls_sll[8]
- core/box/tls_sll_box.h: Update Box API (13 sites) for unified access
- Updated 32+ files: All g_tls_sll_head[i] → g_tls_sll[i].head
- Updated 32+ files: All g_tls_sll_count[i] → g_tls_sll[i].count
- core/hakmem_tiny_integrity.h: Unified canary guards
- core/box/integrity_box.c: Simplified canary validation
- Makefile: Added core/box/tiny_sizeclass_hist_box.o to link

Build:  PASS (10K ops sanity test)
Warnings: Only pre-existing LTO type mismatches (unrelated)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-20 07:32:30 +09:00
03ba62df4d Phase 23 Unified Cache + PageFaultTelemetry generalization: Mid/VM page-fault bottleneck identified
Summary:
- Phase 23 Unified Cache: +30% improvement (Random Mixed 256B: 18.18M → 23.68M ops/s)
- PageFaultTelemetry: Extended to generic buckets (C0-C7, MID, L25, SSM)
- Measurement-driven decision: Mid/VM page-faults (80-100K) >> Tiny (6K) → prioritize Mid/VM optimization

Phase 23 Changes:
1. Unified Cache implementation (core/front/tiny_unified_cache.{c,h})
   - Direct SuperSlab carve (TLS SLL bypass)
   - Self-contained pop-or-refill pattern
   - ENV: HAKMEM_TINY_UNIFIED_CACHE=1, HAKMEM_TINY_UNIFIED_C{0-7}=128

2. Fast path pruning (tiny_alloc_fast.inc.h, tiny_free_fast_v2.inc.h)
   - Unified ON → direct cache access (skip all intermediate layers)
   - Alloc: unified_cache_pop_or_refill() → immediate fail to slow
   - Free: unified_cache_push() → fallback to SLL only if full

PageFaultTelemetry Changes:
3. Generic bucket architecture (core/box/pagefault_telemetry_box.{c,h})
   - PF_BUCKET_{C0-C7, MID, L25, SSM} for domain-specific measurement
   - Integration: hak_pool_try_alloc(), l25_alloc_new_run(), shared_pool_allocate_superslab_unlocked()

4. Measurement results (Random Mixed 500K / 256B):
   - Tiny C2-C7: 2-33 pages, high reuse (64-3.8 touches/page)
   - SSM: 512 pages (initialization footprint)
   - MID/L25: 0 (unused in this workload)
   - Mid/Large VM benchmarks: 80-100K page-faults (13-16x higher than Tiny)

Ring Cache Enhancements:
5. Hot Ring Cache (core/front/tiny_ring_cache.{c,h})
   - ENV: HAKMEM_TINY_HOT_RING_ENABLE=1, HAKMEM_TINY_HOT_RING_C{0-7}=size
   - Conditional compilation cleanup

Documentation:
6. Analysis reports
   - RANDOM_MIXED_BOTTLENECK_ANALYSIS.md: Page-fault breakdown
   - RANDOM_MIXED_SUMMARY.md: Phase 23 summary
   - RING_CACHE_ACTIVATION_GUIDE.md: Ring cache usage
   - CURRENT_TASK.md: Updated with Phase 23 results and Phase 24 plan

Next Steps (Phase 24):
- Target: Mid/VM PageArena/HotSpanBox (page-fault reduction 80-100K → 30-40K)
- Tiny SSM optimization deferred (low ROI, ~6K page-faults already optimal)
- Expected improvement: +30-50% for Mid/Large workloads

Generated with Claude Code

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-17 02:47:58 +09:00
fcf098857a Phase12 debug: restore SUPERSLAB constants/APIs, implement Box2 drain boundary, fix tiny_fast_pop to return BASE, honor TLS SLL toggle in alloc/free fast paths, add fail-fast stubs, and quiet capacity sentinel. Update CURRENT_TASK with A/B results (SLL-off stable; SLL-on crash). 2025-11-14 01:02:00 +09:00