Vivaria is METR's tool for running evaluations and conducting agent elicitation research.
Updated Oct 3, 2025 - TypeScript
Repository for benchmark saturation research project.
A fully open-source database of AI models with benchmark scores, prices, and capabilities.
Plug-and-play human-in-the-loop integration for agentic workflows.
A production-grade platform to evaluate and compare the performance of Large Language Models (LLMs) from providers such as OpenAI, Anthropic, and Google (PaLM). It features real-time analytics, hallucination detection, and cost-performance benchmarking using standardized datasets (e.g., GSM8K).