feat: add Terminal-Bench adapter and headless agent CLI #198

ThomasK33 · 2025-10-12T13:43:38Z

This commit introduces a headless agent runner and an adapter for the
Terminal-Bench framework, enabling automated, programmatic evaluation of
the agent's capabilities.

The core of this change is a new `AgentSession` class that encapsulates
the logic for managing a single workspace session. This refactors logic
out of `ipcMain`, allowing the agent core to be used in environments
without an Electron UI, such as the new headless CLI.
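
To illustrate the shape of this refactor, here is a minimal sketch of a session object that knows nothing about its transport, with one front end per environment. Everything in it is an assumption made for illustration: the event shape, the method names (`onEvent`, `sendMessage`), and the helper functions are hypothetical and are not taken from the actual `AgentSession` in this PR.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not cmux's real API.
type SessionEvent =
  | { type: "message"; text: string }
  | { type: "done" };

type Listener = (event: SessionEvent) => void;

class AgentSession {
  readonly workspaceId: string;
  private listeners = new Set<Listener>();

  constructor(workspaceId: string) {
    this.workspaceId = workspaceId;
  }

  // Subscribe to session events; returns an unsubscribe function.
  onEvent(listener: Listener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Stand-in for the real agent loop: echoes the prompt, then signals completion.
  sendMessage(prompt: string): void {
    this.emit({ type: "message", text: `echo: ${prompt}` });
    this.emit({ type: "done" });
  }

  private emit(event: SessionEvent): void {
    this.listeners.forEach((listener) => listener(event));
  }
}

// A UI front end could forward events over an IPC-style channel...
function attachToIpc(
  session: AgentSession,
  send: (channel: string, payload: SessionEvent) => void,
): () => void {
  return session.onEvent((e) => send(`workspace:${session.workspaceId}`, e));
}

// ...while a headless front end could write the same events as JSON lines.
function attachToStdout(
  session: AgentSession,
  write: (line: string) => void,
): () => void {
  return session.onEvent((e) => write(JSON.stringify(e)));
}
```

The point of the sketch is only that the session emits events and each front end decides how to deliver them, so the same core can run with or without Electron.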

Key components added:

- `src/debug/agentSessionCli.ts`: A CLI for running an agent session
  headlessly. It can be driven programmatically and supports JSON output.
- `benchmarks/terminal_bench/`: A Python adapter for Terminal-Bench that
  packages the cmux application, installs it in a task container, and
  runs it against benchmark instructions using the agent CLI.
- `Makefile` target `benchmark-terminal` and `docs/benchmarking.md` to
  provide an easy entrypoint and documentation for running benchmarks.
- Integration tests for the new `agentSessionCli` to ensure its
  correctness.
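
As a rough idea of what "driven programmatically" with JSON output can look like on the consuming side, the sketch below parses captured CLI output into events. It assumes the CLI prints one JSON object per line with a string `type` field; the PR description does not specify the output format, so that assumption, the `CliEvent` type, and the `parseJsonLines` helper are all hypothetical.

```typescript
// Assumption: one JSON object per line, each with a string "type" field.
// The real output format of agentSessionCli may differ.
interface CliEvent {
  type: string;
  [key: string]: unknown;
}

function parseJsonLines(output: string): CliEvent[] {
  const events: CliEvent[] = [];
  for (const raw of output.split("\n")) {
    const line = raw.trim();
    if (line === "") continue; // skip blank lines
    try {
      const value: unknown = JSON.parse(line);
      // Keep only objects that carry a string "type" discriminator.
      if (
        typeof value === "object" &&
        value !== null &&
        typeof (value as { type?: unknown }).type === "string"
      ) {
        events.push(value as CliEvent);
      }
    } catch {
      // Ignore lines that are not JSON, such as plain log output.
    }
  }
  return events;
}
```

A harness could run the CLI as a child process, collect its stdout, and pass the captured text to `parseJsonLines` to decide whether the run finished.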

ThomasK33 · 2025-10-12T14:06:23Z

@codex review

chatgpt-codex-connector · 2025-10-12T14:12:06Z

Codex Review: Didn't find any major issues. Delightful!

ammario

tbh i just gave it a cursory review -- plz just make sure urself that there are no regressions -- as an aside i think the whole "aiService" middleware file is unnecessary abstraction but we can fix that later

ThomasK33 · 2025-10-13T08:47:50Z

> tbh i just gave it a cursory review -- plz just make sure urself that there are no regressions -- as an aside i think the whole "aiService" middleware file is unnecessary abstraction but we can fix that later

I don't think there should be any breaking changes here; this is primarily a refactoring to separate the "agentic" components from the IPC layer, as the benchmarks assume files are modified in place.
And since my local setup, E2Es, and integration tests continue working, it should be rather safe to merge. (Famous last words)

ThomasK33 force-pushed the thomask33/10-12-add_terminal_bench_integration branch from 10bb4bb to a70399a Compare October 12, 2025 13:49

coder deleted a comment from chatgpt-codex-connector bot Oct 12, 2025

ThomasK33 requested a review from ammario October 12, 2025 14:06

ThomasK33 force-pushed the thomask33/10-12-add_terminal_bench_integration branch from a70399a to feea514 Compare October 12, 2025 17:33

ammario reviewed Oct 12, 2025

ThomasK33 force-pushed the thomask33/10-12-add_terminal_bench_integration branch 2 times, most recently from c53fe34 to fd4dddb Compare October 12, 2025 22:56

ThomasK33 force-pushed the thomask33/10-12-add_terminal_bench_integration branch from fd4dddb to 6b30e43 Compare October 13, 2025 08:31

ThomasK33 added this pull request to the merge queue Oct 13, 2025

Merged via the queue into main with commit d5c343c Oct 13, 2025
9 checks passed

ThomasK33 deleted the thomask33/10-12-add_terminal_bench_integration branch October 13, 2025 08:58