The LLM Evaluation Framework
A framework for few-shot evaluation of language models (a usage sketch appears after this list).
A tool for fine-tuning LLMs, generating synthetic data, and collaborating on datasets.
Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends
This is the repository for our article published at RecSys 2019, "Are We Really Making Much Progress? A Worrying Analysis of Recent Neural Recommendation Approaches", and for several follow-up studies.
Data-Driven Evaluation for LLM-Powered Applications
AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.
Metrics to evaluate the quality of responses of your Retrieval Augmented Generation (RAG) applications (see the sketch after this list).
Python SDK for running evaluations on LLM generated responses
Moonshot - A simple and modular tool to evaluate and red-team any LLM application.
PySODEvalToolkit: A Python-based Evaluation Toolbox for Salient Object Detection and Camouflaged Object Detection
The official evaluation suite and dynamic data release for MixEval.
A research library for automating experiments on Deep Graph Networks
MedEvalKit: A Unified Medical Evaluation Framework
A framework for standardizing evaluations of large foundation models, beyond single-score reporting and rankings.
🔥[VLDB'24] Official repository for the paper “The Dawn of Natural Language to SQL: Are We Fully Ready?”
Evaluation framework for oncology foundation models (FMs)
The robust European language model benchmark.
Multilingual Large Language Models Evaluation Benchmark
WritingBench: A Comprehensive Benchmark for Generative Writing
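
For the few-shot evaluation framework listed above, here is a minimal usage sketch. It assumes the entry refers to EleutherAI's lm-evaluation-harness (installable as the lm-eval package); the backend, checkpoint, and task names below are illustrative placeholders, not part of that listing.

# Minimal sketch, assuming the few-shot evaluation entry above is EleutherAI's
# lm-evaluation-harness (pip install lm-eval). Checkpoint and task names are
# illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face Transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF checkpoint id
    tasks=["hellaswag"],
    num_fewshot=5,                                   # number of in-context examples
    batch_size=8,
)
print(results["results"]["hellaswag"])               # per-task metric dict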
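For the RAG response-quality metrics listed above, a similar sketch follows, assuming that entry refers to the Ragas library and its older Dataset-based evaluate interface; newer releases structure samples differently, the metrics are LLM-judged (so a judge model or API key must be configured), and the sample row is invented for illustration.

# Minimal sketch, assuming the RAG-metrics entry above is the Ragas library
# (pip install ragas datasets). Uses the older Dataset-based interface; newer
# Ragas releases wrap samples differently. Metrics are LLM-judged, so a judge
# model or API key must be configured. The sample row is illustrative.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

data = Dataset.from_dict({
    "question": ["When was the Eiffel Tower completed?"],
    "answer":   ["The Eiffel Tower was completed in 1889."],
    "contexts": [["The Eiffel Tower was completed in 1889 for the World's Fair."]],
})

scores = evaluate(data, metrics=[faithfulness, answer_relevancy])
print(scores)  # aggregate score per metric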