# Nyash Syntax Torture (10 minimal repros)
Date: 2025-09-16
Purpose: stress-test the parser → AST → MIR(Core-13/PURE) → Interpreter/VM/LLVM(AOT) pipeline for consistency.
Each test is **one phenomenon per file**, tiny and deterministic.
## How to run (suggested)
```bash
# 1) Run all modes and compare outputs
bash run_spec_smoke.sh
# 2) PURE mode (surface MIR violations):
NYASH_MIR_CORE13_PURE=1 bash run_spec_smoke.sh
# 3) Extra logging when a case fails:
NYASH_VM_STATS=1 NYASH_VM_STATS_JSON=1 NYASH_VM_DEBUG_BOXCALL=1 bash run_spec_smoke.sh
# For LLVM diagnostics (when applicable):
# NYASH_LLVM_VINVOKE_TRACE=1 NYASH_LLVM_VINVOKE_PREFER_I64=1 bash run_spec_smoke.sh
```
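Conceptually, the suite runs each test under every backend and diffs the single-line outputs. The loop below is a minimal standalone sketch of that idea only: the commented-out `nyash --backend` invocation is a hypothetical placeholder (the real suite drives the backends via `run_spec_smoke.sh`), and a stub command stands in so the sketch runs on its own.

```shell
# Sketch of a cross-backend diff harness. The `nyash --backend` flag shown in
# the comment is an ASSUMPTION, not the real CLI; adapt to the actual runner.
set -u

compare_modes() {
  local file=$1; shift
  local ref=""
  for mode in "$@"; do
    # out=$(nyash --backend "$mode" "$file")   # hypothetical real invocation
    out="stub-output"                          # placeholder so the sketch runs standalone
    if [ -z "$ref" ]; then
      ref=$out                                 # first backend sets the reference line
    elif [ "$out" != "$ref" ]; then
      echo "DIFF in $file ($mode): $out != $ref"
      return 1                                 # fail fast on the first divergence
    fi
  done
  echo "OK $file"
}

compare_modes 01_ops_assoc.hako interp vm llvm
```

Because every test emits exactly one line, a plain string comparison is enough; no structured diffing is needed.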
## Expected outputs (goldens)
We deliberately **print a single line** per test to make diffing trivial.
See the inline comments in each `*.hako` file for the expected line.
## File list
1. `01_ops_assoc.hako` - operator associativity & coercion order
2. `02_deep_parens.hako` - deep parentheses & arithmetic nesting
3. `03_array_map_nested.hako` - nested array/map literals & access
4. `04_map_array_mix.hako` - object/array cross indexing & updates
5. `05_string_concat_unicode.hako` - string/number/Unicode concatenation
6. `06_control_flow_loopform.hako` - break/continue/dispatch shape
7. `07_await_nowait_mix.hako` - nowait/await interleave determinism
8. `08_visibility_access.hako` - private/public & override routing
9. `09_lambda_closure_scope.hako` - closure capture & shadowing
10. `10_match_result_early_return.hako` - early return vs. branch merge
## CI hint
- Add this suite **before** your self-host smokes:
  - `make spec-smoke` -> `make smoke-selfhost`
- Fail fast on any diff across Interpreter/VM/LLVM.
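The ordering above might be wired up as follows in a Makefile. This is a sketch only: this document names `spec-smoke`, `smoke-selfhost`, and `run_spec_smoke.sh`, but the self-host smoke recipe itself is a placeholder assumption.

```make
# Sketch: spec smoke gates the self-host smokes (fail fast on any diff).
spec-smoke:
	bash run_spec_smoke.sh

# Runs only if spec-smoke succeeded; the recipe body is a placeholder --
# substitute the project's real self-host smoke command.
smoke-selfhost: spec-smoke
	@echo "run self-host smokes here"
```

Listing `spec-smoke` as a prerequisite of `smoke-selfhost` means `make smoke-selfhost` enforces the ordering automatically, and a non-zero exit from the spec suite stops the chain.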