Commit cd58afa
🤖 ci: replace colons in TB artifact names with hyphens (#488)
## Problem
The nightly Terminal-Bench workflow completes the benchmark run but fails
at the artifact upload step because artifact names contain colons (from
model names like `anthropic:claude-sonnet-4-5`).
GitHub Actions artifact names cannot contain colons due to filesystem
restrictions (NTFS compatibility).
## Solution
### 1. Fixed Artifact Names
Use the `replace()` function in the artifact name template to convert
colons to hyphens:
- `anthropic:claude-sonnet-4-5` → `anthropic-claude-sonnet-4-5`
- `openai:gpt-5-codex` → `openai-gpt-5-codex`
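For illustration, the same mapping the template's `replace()` performs can be sketched in Python (this is a standalone sketch, not the workflow's actual code):

```python
def sanitize_artifact_name(model: str) -> str:
    """Replace colons, which GitHub Actions artifact names forbid, with hyphens."""
    return model.replace(":", "-")

print(sanitize_artifact_name("anthropic:claude-sonnet-4-5"))
# → anthropic-claude-sonnet-4-5
```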
### 2. Added Results Logging
**Important**: Since the previous run produced 720 files that failed to
upload, this change adds a new step that prints `results.json` to the
workflow logs before artifact upload:
```yaml
- name: Print results summary
  if: always()
  run: |
    # Outputs full results.json
    # Plus per-task summary: task_id: ✓ PASS / ✗ FAIL
```
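As an illustration, the summary logic could look like the Python sketch below. The field names `results`, `task_id`, and `is_resolved` are assumptions about the `results.json` schema, not confirmed by this PR:

```python
import json

def print_summary(path: str) -> None:
    """Print the full results file, then a per-task PASS/FAIL line."""
    with open(path) as f:
        data = json.load(f)
    print(json.dumps(data, indent=2))         # full results.json
    for task in data.get("results", []):      # field names are assumed
        status = "✓ PASS" if task.get("is_resolved") else "✗ FAIL"
        print(f"{task['task_id']}: {status}")
```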
This ensures task-level results are preserved in logs even if artifact
upload fails.
### 3. Added Model Verification Logging
Added logging in `agentSessionCli.ts` to confirm model names:
```typescript
console.error(`[cmux-cli] Using model: ${model}`);
```
## Investigation: Identical 42.50% Accuracy
During the manual run (#18913267878), **both models achieved exactly
42.50% accuracy** (34/80 tasks). Investigation revealed:
### Facts:
- Both models hit 24 timeouts (against the 360s per-task limit)
- Only 12 tasks timed out for both models (50% overlap)
- Each model attempted 56 non-timeout tasks and passed 34 (a 60.7% pass
rate)
- Results stored in separate timestamped directories
(`runs/2025-10-29__15-29-47` vs `15-29-29`)
- **720 files were ready to upload but artifact upload failed**
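The counts above are internally consistent, as a quick arithmetic check shows:

```python
total_tasks = 80
timeouts_per_model = 24
shared_timeouts = 12
passed = 34

attempted = total_tasks - timeouts_per_model  # 56 non-timeout tasks per model
pass_rate = passed / attempted                # ≈ 60.7% on attempted tasks
overall = passed / total_tasks                # 0.425 → the reported 42.50%
overlap = shared_timeouts / timeouts_per_model  # 0.5 → the 50% overlap

print(f"{pass_rate:.1%} {overall:.2%} {overlap:.0%}")
# → 60.7% 42.50% 50%
```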
### Code Path Verification:
Traced model parameter through the entire chain:
1. ✅ Workflow → `TB_ARGS: --agent-kwarg model_name=<model>`
2. ✅ Makefile → Passes `$TB_ARGS` to terminal-bench
3. ✅ cmux_agent.py → Constructor accepts `model_name`, sets `CMUX_MODEL`
env var
4. ✅ cmux-run.sh → Passes `--model "${CMUX_MODEL}"` to CLI
5. ✅ agentSessionCli.ts → Parses `--model` flag and uses it
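Step 3 of the chain can be condensed into a Python sketch. The class and constructor shape here are hypothetical; the real logic lives in `cmux_agent.py`:

```python
import os

class CmuxAgent:
    """Sketch of the agent constructor: it accepts model_name and exports
    it via the CMUX_MODEL env var for cmux-run.sh to consume (names as
    described in the chain above; the class shape is assumed)."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        os.environ["CMUX_MODEL"] = model_name

agent = CmuxAgent("anthropic:claude-sonnet-4-5")
print(os.environ["CMUX_MODEL"])
# → anthropic:claude-sonnet-4-5
```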
**The code is correct.** Identical aggregate scores are statistically
unlikely, but possible when differing timeout and failure patterns happen
to offset.
### Next Steps:
With the new results logging, the next benchmark run will show:
- ✅ Model name used (in stderr logs)
- ✅ Full results.json (in workflow logs)
- ✅ Per-task pass/fail breakdown (in workflow logs)
- ✅ Artifacts uploaded successfully (with fixed names)
This allows full verification that models produce different task-level
results.
## Testing
The next nightly run (tonight at 00:00 UTC) will:
- Successfully upload artifacts with names like:
- `terminal-bench-results-anthropic-claude-sonnet-4-5-<run_id>`
- `terminal-bench-results-openai-gpt-5-codex-<run_id>`
- Show task-level results in workflow logs (survives even if upload
fails)
- Confirm each model in logs: `[cmux-cli] Using model: <model_name>`
---
_Generated with `cmux`_
2 files changed (+26 −1): `.github/workflows`, `src/debug`