Commit cd58afa
🤖 ci: replace colons in TB artifact names with hyphens (#488)
## Problem
The nightly Terminal-Bench workflow completes the benchmark run but fails
at the artifact upload step because artifact names contain colons (from
model names like `anthropic:claude-sonnet-4-5`).
GitHub Actions artifact names cannot contain colons due to filesystem
restrictions (NTFS compatibility).
## Solution
### 1. Fixed Artifact Names
Use the `replace()` function in the artifact name template to convert
colons to hyphens:
- `anthropic:claude-sonnet-4-5` → `anthropic-claude-sonnet-4-5`
- `openai:gpt-5-codex` → `openai-gpt-5-codex`
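For illustration, the same mapping the template's `replace()` performs can be sketched in Python (this is a standalone sketch, not the workflow's actual code):

```python
def sanitize_artifact_name(model: str) -> str:
    """Replace colons, which GitHub Actions artifact names forbid, with hyphens."""
    return model.replace(":", "-")

print(sanitize_artifact_name("anthropic:claude-sonnet-4-5"))
# → anthropic-claude-sonnet-4-5
```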
### 2. Added Results Logging
**Important**: Since the previous run produced 720 files that failed to
upload, this change adds a new step that prints `results.json` to the
workflow logs before artifact upload:
```yaml
- name: Print results summary
  if: always()
  run: |
    # Outputs full results.json
    # Plus per-task summary: task_id: ✓ PASS / ✗ FAIL
```
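As an illustration, the summary logic could look like the Python sketch below. The field names `results`, `task_id`, and `is_resolved` are assumptions about the `results.json` schema, not confirmed by this PR:

```python
import json

def print_summary(path: str) -> None:
    """Print the full results file, then a per-task PASS/FAIL line."""
    with open(path) as f:
        data = json.load(f)
    print(json.dumps(data, indent=2))         # full results.json
    for task in data.get("results", []):      # field names are assumed
        status = "✓ PASS" if task.get("is_resolved") else "✗ FAIL"
        print(f"{task['task_id']}: {status}")
```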
This ensures task-level results are preserved in logs even if artifact
upload fails.
### 3. Added Model Verification Logging
Added logging in `agentSessionCli.ts` to confirm model names:
```typescript
console.error(`[cmux-cli] Using model: ${model}`);
```
## Investigation: Identical 42.50% Accuracy
During the manual run (#18913267878), **both models achieved exactly
42.50% accuracy** (34/80 tasks). Investigation revealed:
### Facts:
- Both models hit 24 timeouts (against the 360s per-task limit)
- Only 12 tasks timed out for both models (50% overlap)
- Each model attempted 56 non-timeout tasks and passed 34 (a 60.7% pass
rate)
- Results stored in separate timestamped directories
(`runs/2025-10-29__15-29-47` vs `15-29-29`)
- **720 files were ready to upload but artifact upload failed**
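The counts above are internally consistent, as a quick arithmetic check shows:

```python
total_tasks = 80
timeouts_per_model = 24
shared_timeouts = 12
passed = 34

attempted = total_tasks - timeouts_per_model  # 56 non-timeout tasks per model
pass_rate = passed / attempted                # ≈ 60.7% on attempted tasks
overall = passed / total_tasks                # 0.425 → the reported 42.50%
overlap = shared_timeouts / timeouts_per_model  # 0.5 → the 50% overlap

print(f"{pass_rate:.1%} {overall:.2%} {overlap:.0%}")
# → 60.7% 42.50% 50%
```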
### Code Path Verification:
Traced model parameter through the entire chain:
1. ✅ Workflow → `TB_ARGS: --agent-kwarg model_name=<model>`
2. ✅ Makefile → Passes `$TB_ARGS` to terminal-bench
3. ✅ cmux_agent.py → Constructor accepts `model_name`, sets `CMUX_MODEL`
env var
4. ✅ cmux-run.sh → Passes `--model "${CMUX_MODEL}"` to CLI
5. ✅ agentSessionCli.ts → Parses `--model` flag and uses it
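Step 3 of the chain can be condensed into a Python sketch. The class and constructor shape here are hypothetical; the real logic lives in `cmux_agent.py`:

```python
import os

class CmuxAgent:
    """Sketch of the agent constructor: it accepts model_name and exports
    it via the CMUX_MODEL env var for cmux-run.sh to consume (names as
    described in the chain above; the class shape is assumed)."""

    def __init__(self, model_name: str):
        self.model_name = model_name
        os.environ["CMUX_MODEL"] = model_name

agent = CmuxAgent("anthropic:claude-sonnet-4-5")
print(os.environ["CMUX_MODEL"])
# → anthropic:claude-sonnet-4-5
```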
**The code is correct.** Identical aggregate scores are statistically
unlikely, but possible when differing timeout and failure patterns happen
to offset.
### Next Steps:
With the new results logging, the next benchmark run will show:
- ✅ Model name used (in stderr logs)
- ✅ Full results.json (in workflow logs)
- ✅ Per-task pass/fail breakdown (in workflow logs)
- ✅ Artifacts uploaded successfully (with fixed names)
This allows full verification that models produce different task-level
results.
## Testing
The next nightly run (tonight at 00:00 UTC) will:
- Successfully upload artifacts with names like:
- `terminal-bench-results-anthropic-claude-sonnet-4-5-<run_id>`
- `terminal-bench-results-openai-gpt-5-codex-<run_id>`
- Show task-level results in workflow logs (survives even if upload
fails)
- Confirm each model in logs: `[cmux-cli] Using model: <model_name>`
---
_Generated with `cmux`_
2 files changed (+26 −1): `.github/workflows`, `src/debug`