@ThomasK33
Member

This commit introduces a headless agent runner and an adapter for the
Terminal-Bench framework, enabling automated, programmatic evaluation of
the agent's capabilities.

The core of this change is a new AgentSession class that encapsulates
the logic for managing a single workspace session. This refactors logic
out of ipcMain, allowing the agent core to be used in environments
without an Electron UI, such as the new headless CLI.
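As a rough illustration of that separation, here is a minimal sketch of a session object that knows nothing about Electron and only emits events to subscribers. Every name in it (`AgentSessionSketch`, `SessionEvent`, `Responder`, `subscribe`, `sendMessage`) is hypothetical and chosen for this example; it is not the PR's actual `AgentSession` API.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the PR's real API.
type SessionEvent =
  | { type: "message"; text: string }
  | { type: "done" };

type Listener = (event: SessionEvent) => void;

// Assumed stand-in for whatever produces the agent's reply.
interface Responder {
  respond(prompt: string): Promise<string>;
}

class AgentSessionSketch {
  readonly workspaceId: string;
  private readonly responder: Responder;
  private readonly listeners = new Set<Listener>();

  constructor(workspaceId: string, responder: Responder) {
    this.workspaceId = workspaceId;
    this.responder = responder;
  }

  // Returns an unsubscribe function so any front end can detach cleanly.
  subscribe(listener: Listener): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Runs one turn and notifies every subscriber; no IPC is involved here.
  async sendMessage(prompt: string): Promise<void> {
    const text = await this.responder.respond(prompt);
    this.emit({ type: "message", text });
    this.emit({ type: "done" });
  }

  private emit(event: SessionEvent): void {
    this.listeners.forEach((listener) => listener(event));
  }
}
```

Under this assumed design, an Electron IPC handler and a headless CLI would both be thin subscribers that forward the same events to a window or to stdout respectively.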

Key components added:

  • src/debug/agentSessionCli.ts: A CLI for running an agent session
    headlessly. It can be driven programmatically and supports JSON output.
  • benchmarks/terminal_bench/: A Python adapter for Terminal-Bench that
    packages the cmux application, installs it in a task container, and
    runs it against benchmark instructions using the agent CLI.
  • Makefile target benchmark-terminal and docs/benchmarking.md to
    provide an easy entrypoint and documentation for running benchmarks.
  • Integration tests for the new agentSessionCli to ensure its
    correctness.
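To make the "driven programmatically, with JSON output" idea concrete, the following sketch shows one way a headless CLI could parse its arguments and print one event per line. The `--json` flag, the event shapes, and the function names are assumptions made for this example and may not match `agentSessionCli.ts`.

```typescript
// Hypothetical sketch: flag names and event shapes are assumptions for illustration.
interface CliOptions {
  json: boolean;
  message: string;
}

type OutputEvent =
  | { type: "message"; text: string }
  | { type: "done" };

// Parse a minimal argv: an optional --json flag; the remaining words form the message.
function parseCliArgs(argv: string[]): CliOptions {
  const json = argv.indexOf("--json") !== -1;
  const message = argv.filter((arg) => arg !== "--json").join(" ");
  if (message.length === 0) {
    throw new Error("usage: <cli> [--json] <message>");
  }
  return { json, message };
}

// Render one event per line: JSON Lines for programs, plain text for humans.
function formatEvent(event: OutputEvent, json: boolean): string {
  if (json) {
    return JSON.stringify(event);
  }
  return event.type === "message" ? event.text : "[done]";
}

const options = parseCliArgs(["--json", "fix", "the", "tests"]);
console.log(formatEvent({ type: "message", text: options.message }, options.json));
// prints {"type":"message","text":"fix the tests"}
```

Emitting one JSON object per line keeps the output easy to consume from a benchmark harness, which can read stdout line by line and parse each line independently.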

@ThomasK33 ThomasK33 force-pushed the thomask33/10-12-add_terminal_bench_integration branch from 10bb4bb to a70399a Compare October 12, 2025 13:49
@coder coder deleted a comment from chatgpt-codex-connector bot Oct 12, 2025
@ThomasK33 ThomasK33 requested a review from ammario October 12, 2025 14:06
@ThomasK33
Member Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Delightful!

@ThomasK33 ThomasK33 force-pushed the thomask33/10-12-add_terminal_bench_integration branch from a70399a to feea514 Compare October 12, 2025 17:33
Member

@ammario ammario left a comment

tbh i just gave it a cursory review -- plz just make sure urself that there are no regressions -- as an aside i think the whole "aiService" middleware file is unnecessary abstraction but we can fix that later

@ThomasK33 ThomasK33 force-pushed the thomask33/10-12-add_terminal_bench_integration branch 2 times, most recently from c53fe34 to fd4dddb Compare October 12, 2025 22:56
@ThomasK33 ThomasK33 force-pushed the thomask33/10-12-add_terminal_bench_integration branch from fd4dddb to 6b30e43 Compare October 13, 2025 08:31
@ThomasK33
Member Author

> tbh i just gave it a cursory review -- plz just make sure urself that there are no regressions -- as an aside i think the whole "aiService" middleware file is unnecessary abstraction but we can fix that later

I don't think there should be any breaking changes here; this is primarily a refactoring to separate the "agentic" components from the IPC layer, as the benchmarks assume files are modified in place.
And since my local setup, the E2E tests, and the integration tests continue working, it should be rather safe to merge. (Famous last words)

@ThomasK33 ThomasK33 added this pull request to the merge queue Oct 13, 2025
Merged via the queue into main with commit d5c343c Oct 13, 2025
9 checks passed
@ThomasK33 ThomasK33 deleted the thomask33/10-12-add_terminal_bench_integration branch October 13, 2025 08:58