This project implements a proof‑of‑concept evaluation‑driven fine‑tuning loop on top of Tinker. The goal is to continuously improve a model by training it using LoRA and then measuring its performance on a suite of evaluation tasks. When the model fails to meet a specified threshold, the loop collects additional data or modifies hyperparameters and launches a new fine‑tuning job.
┌─────────────────────────────────────────────────────────────────┐
│                     Evaluation-Driven Loop                      │
└─────────────────────────────────────────────────────────────────┘
        ┌──────────────┐
        │ Load Config  │
        │   & Data     │
        └──────┬───────┘
               │
               ▼
        ┌──────────────┐
        │  Fine-Tune   │◄─────────────────┐
        │  with LoRA   │                  │
        │   (Tinker)   │                  │
        └──────┬───────┘                  │
               │                          │
               ▼                          │
        ┌──────────────┐                  │
        │     Save     │                  │
        │  Checkpoint  │                  │
        └──────┬───────┘                  │
               │                          │
               ▼                          │
        ┌──────────────┐                  │
        │  Run Evals   │                  │
        │ (Inspect AI) │                  │
        └──────┬───────┘                  │
               │                          │
               ├────────────────┐         │
               │                │         │
               ▼                ▼         │
        ┌──────────────┐  ┌────────────┐  │
        │  Submit to   │  │  Score ≥   │  │
        │   EvalOps    │  │ Threshold? │  │
        │  (optional)  │  └─────┬──┬───┘  │
        └──────────────┘        │  │      │
                                │  │      │ No: Adjust LR
                      Yes: ✓    │  │      │ & select data
                                │  └──────┘
                                ▼
                           ┌──────────┐
                           │   Done   │
                           └──────────┘
Tinker is a low‑level API for LoRA fine‑tuning that offloads distributed training to managed infrastructure. It also provides an evaluation API that can run inline or offline evaluations and integrate with the Inspect AI library. These features make it possible to build a higher‑level service that:
- Automatically benchmarks models using standard tasks or custom tasks.
- Identifies weakness patterns, such as specific topics or reasoning styles, from evaluation results.
- Launches targeted fine‑tuning jobs with new data or tuned hyperparameters when metrics fall below thresholds.
- Tracks progress over multiple rounds of fine‑tuning to quantify the impact of each intervention.
Such a loop can be particularly useful for domains where quality requirements are high and failure modes are diverse (e.g., legal drafting, safety moderation, tutoring).
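In code, that loop reduces to a small amount of control flow. The sketch below is illustrative only: the three callables and the `train_examples` key are placeholders, not the actual interfaces of trainer_with_eval.py.

```python
def evaluation_driven_loop(config, fine_tune_round, run_evals, select_hard_examples):
    """Illustrative control flow only: the three callables stand in for the
    project's real training, evaluation, and data-selection steps."""
    lr = config["learning_rate"]
    dataset = list(config["train_examples"])      # "train_examples" is a placeholder key
    checkpoint = None

    for _ in range(config["max_rounds"]):
        checkpoint = fine_tune_round(dataset, lr)  # LoRA fine-tuning via Tinker
        score, failures = run_evals(checkpoint)    # e.g. Inspect AI tasks

        if score >= config["eval_threshold"]:      # quality target met: stop
            break

        lr *= 0.5                                  # back off the learning rate
        dataset += select_hard_examples(failures)  # mine harder examples for the next round

    return checkpoint
```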
- Mock Mode: Run the entire demo locally without the Tinker API (perfect for CI and reviewers)
- Proper Tinker Integration: Uses renderers for correct loss masking, async futures for performance, and recommended LR schedules
- EvalOps Integration: Optional automatic submission of evaluation results for centralized tracking
- Pydantic Config Validation: Type-safe configuration with clear error messages
- Production-Grade Hyperparameters: Model-specific LR formula, warmup/cosine scheduling, configurable LoRA rank
- Async Batching: Overlapping forward/backward and optimizer steps for faster training
- Comprehensive Tests: 37 unit and integration tests covering all components
This project contains the following components:
| File | Description |
|---|---|
| `trainer_with_eval.py` | The main script that orchestrates training and evaluation. It connects to Tinker, creates a LoRA training client, runs fine-tuning, performs evaluations via Inspect AI, and decides whether to launch further training rounds. |
| `eval_loop_config.json` | A sample configuration file specifying the base model, dataset paths, evaluation tasks, thresholds, and hyperparameters. |
| `evalops_client.py` | Python SDK for submitting evaluation results to the EvalOps platform. |
| `config_schema.py` | Pydantic schema for configuration validation with hyperparameter tuning. |
| `data_loader.py` | JSONL data loader with proper Tinker renderers, loss masking, validation, and deduplication. |
| `data_selector.py` | Utilities for mining hard examples based on evaluation failures. |
| `hyperparam_utils.py` | Tinker's recommended LR formula and warmup/cosine scheduler. |
| `simple_eval.py` | Minimal working evaluator for the demo (fallback if Inspect AI is unavailable). |
| `inspect_eval.py` | Inspect AI task integration with a Tinker sampling adapter. |
| `logger.py` | Structured JSON logging for metrics and events. |
| `checkpoint_manager.py` | Checkpoint save/resume for interrupted runs. |
| `error_handling.py` | Retry logic with exponential backoff and rate limiting. |
| `mock_tinker.py` | Mock Tinker client for offline demos and CI. |
| `requirements.txt` | Dependencies required to run the script. |
| `tests/` | Unit and integration tests for all components. |
Try the minimal working demo (uses 20 sample QA pairs and simulated evaluation):

Option 1: Mock Mode (No API Key Required)

    ./run_demo.sh  # Automatically uses mock mode if no API key

Option 2: Real Tinker Infrastructure

    export TINKER_API_KEY=sk-...  # Your Tinker API key
    ./run_demo.sh
This will run 1-3 training rounds, automatically adjusting the learning rate and triggering a second round when scores fall below the threshold (0.75). Mock mode simulates training locally without any cloud calls, making it a good way to understand the loop before using real infrastructure.
See DEMO.md for a detailed walkthrough of what happens during the demo run.
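Under the hood, the demo picks mock or real infrastructure from the environment. The sketch below shows one way such a switch can work; the `MockTrainingClient` name is a stand-in for whatever mock_tinker.py exposes, and the `ServiceClient` / `create_lora_training_client` calls reflect my reading of the Tinker docs rather than verified code from this repo.

```python
import os

def make_training_client(base_model: str, lora_rank: int):
    """Return a mock client when no API key is present (or TINKER_MOCK=1),
    otherwise a real Tinker LoRA training client.

    The Tinker calls below are assumptions based on the public docs;
    MockTrainingClient stands in for whatever mock_tinker.py provides.
    """
    if os.environ.get("TINKER_MOCK") == "1" or not os.environ.get("TINKER_API_KEY"):
        from mock_tinker import MockTrainingClient           # local, offline stand-in
        return MockTrainingClient(base_model=base_model, lora_rank=lora_rank)

    import tinker
    service_client = tinker.ServiceClient()                  # reads TINKER_API_KEY
    return service_client.create_lora_training_client(
        base_model=base_model, rank=lora_rank
    )
```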
- Install dependencies:

      pip install -r requirements.txt

- Set your Tinker API key in your environment (see the Tinker docs for how to obtain one). For example:

      export TINKER_API_KEY=sk-...

- (Optional) Configure EvalOps integration to automatically track all evaluation results:

      export EVALOPS_API_KEY=your-evalops-api-key
      export EVALOPS_API_URL=https://api.evalops.dev  # or your self-hosted instance

  Then update `eval_loop_config.json` to enable EvalOps:

      {
        ...
        "evalops_enabled": true,
        "evalops_test_suite_id": "your-test-suite-id"
      }

- Prepare your data (e.g., instruction/output pairs in JSON or JSONL format) and update `eval_loop_config.json` with the correct paths. Optionally specify evaluation tasks and thresholds. (A data-loading sketch follows this list.)

- Run the evaluation-driven loop:

      python trainer_with_eval.py --config eval_loop_config.json

  The script will fine-tune the specified base model using LoRA, run evaluations, and iteratively improve the model until it meets your quality targets or reaches the maximum number of rounds. If EvalOps integration is enabled, each evaluation round is automatically submitted to your EvalOps workspace for tracking and analysis.

- Resume an interrupted run:

      python trainer_with_eval.py --config eval_loop_config.json --resume

  This continues from the latest checkpoint in the `runs/` directory.
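For reference, the data-loading step can be as simple as the sketch below. It assumes an instruction/output JSONL schema and shows only validation and deduplication; the renderer formatting and loss masking that data_loader.py performs are omitted.

```python
import json
from pathlib import Path

def load_jsonl_pairs(path: str) -> list[dict]:
    """Load instruction/output pairs from a JSONL file, skipping malformed
    and duplicate records. Simplified sketch: the real data_loader.py also
    builds renderer-formatted, loss-masked training examples."""
    seen: set[tuple[str, str]] = set()
    examples: list[dict] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue                                   # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue                                   # skip lines that are not valid JSON
        if "instruction" not in record or "output" not in record:
            continue                                   # drop records missing required keys
        key = (record["instruction"], record["output"])
        if key in seen:
            continue                                   # deduplicate exact repeats
        seen.add(key)
        examples.append(record)
    return examples
```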
This project includes built-in integration with EvalOps to automatically track evaluation results across training rounds. The `evalops_client.py` module provides a lightweight Python SDK that:

- Submits each training round's evaluation results as a test run in EvalOps
- Tracks metrics, model checkpoints, and metadata for each iteration
- Enables centralized monitoring and comparison of fine-tuning experiments

To use EvalOps integration:

- Create a test suite in your EvalOps workspace
- Set `evalops_enabled: true` and `evalops_test_suite_id` in your config
- Provide your EvalOps API key via the `EVALOPS_API_KEY` environment variable
The client will automatically submit results after each evaluation round, making it easy to track progress over time and compare different fine-tuning runs.
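A call into the client might look roughly like the following. The `EvalOpsClient` constructor and `submit_test_run` method are illustrative guesses at the SDK's surface, not its confirmed API.

```python
import os
from evalops_client import EvalOpsClient   # assumed interface; see evalops_client.py

def report_round(round_idx: int, score: float, checkpoint_path: str, test_suite_id: str):
    """Submit one training round's evaluation results as an EvalOps test run."""
    client = EvalOpsClient(
        api_key=os.environ["EVALOPS_API_KEY"],
        base_url=os.environ.get("EVALOPS_API_URL", "https://api.evalops.dev"),
    )
    client.submit_test_run(
        test_suite_id=test_suite_id,
        metrics={"score": score},
        metadata={"round": round_idx, "checkpoint": checkpoint_path},
    )
```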
Key configuration parameters in `eval_loop_config.json`:
| Parameter | Default | Description |
|---|---|---|
| `base_model` | - | Model to fine-tune (e.g., "meta-llama/Llama-3.1-8B-Instruct") |
| `lora_rank` | 16 | LoRA adapter rank (1-256) |
| `learning_rate` | 0.0001 | Initial learning rate |
| `use_recommended_lr` | false | Use Tinker's model-specific LR formula |
| `warmup_steps` | 100 | LR warmup steps |
| `max_steps` | 1000 | Total training steps for cosine decay |
| `batch_size` | 8 | Training batch size |
| `steps_per_round` | 1 | Training steps per evaluation round |
| `eval_threshold` | 0.85 | Minimum score to stop training |
| `max_rounds` | 3 | Maximum training rounds |
| `renderer_name` | "llama3" | Renderer for proper chat formatting |
| `evalops_enabled` | false | Enable EvalOps integration |
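As a rough illustration, a Pydantic model covering these fields could look like the sketch below. Field names follow the table; defaults and validation details in the real config_schema.py may differ.

```python
from pydantic import BaseModel, Field

class EvalLoopConfig(BaseModel):
    """Sketch of a config schema mirroring the table above; see config_schema.py
    for the project's actual validation logic."""
    base_model: str
    lora_rank: int = Field(default=16, ge=1, le=256)
    learning_rate: float = 1e-4
    use_recommended_lr: bool = False
    warmup_steps: int = 100
    max_steps: int = 1000
    batch_size: int = 8
    steps_per_round: int = 1
    eval_threshold: float = 0.85
    max_rounds: int = 3
    renderer_name: str = "llama3"
    evalops_enabled: bool = False
    evalops_test_suite_id: str | None = None
```

With Pydantic v2, loading then reduces to something like `EvalLoopConfig.model_validate_json(Path("eval_loop_config.json").read_text())`, which raises a clear validation error for any out-of-range or mistyped field.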
This prototype demonstrates production-grade practices from the Tinker documentation. Future extensions could include:
- Custom data selection based on evaluation feedback. For example, automatically mine additional examples from your corpora that match prompts where the model performs poorly (a simple sketch of this idea follows this list).
- Dynamic hyperparameter tuning (e.g., adjusting LoRA rank or learning rate) using heuristics or Bayesian optimisation.
- Feedback integration from users or human graders to generate reward signals for RL fine-tuning (Tinker supports PPO and importance-sampling losses).
- Advanced EvalOps features, such as quality gates, automated alerts when metrics drop below thresholds, or integration with regression testing schedules.
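A very simple version of that failure-driven selection, in the spirit of data_selector.py but not its actual implementation, could rank candidate examples by word overlap with the prompts the model failed on:

```python
def select_hard_examples(failed_prompts: list[str], candidates: list[dict], top_k: int = 50) -> list[dict]:
    """Rank candidate examples by word overlap with prompts the model failed on
    and return the top_k. Illustrative only; data_selector.py may use a
    different similarity measure."""
    failed_vocab = {w.lower() for p in failed_prompts for w in p.split()}

    def overlap(example: dict) -> int:
        words = {w.lower() for w in example.get("instruction", "").split()}
        return len(words & failed_vocab)

    return sorted(candidates, key=overlap, reverse=True)[:top_k]
```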
Run the test suite to validate all components:

    pytest tests/ -v
The test suite includes:
- Unit tests for EvalOps client, config validation, and data loading
- Integration tests for the training loop with mocked Tinker/EvalOps services
- Coverage for early stopping, LR decay, and error handling
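As an illustration of the style of these tests, one might exercise the stopping decision as a pure function. The `should_stop` helper below is hypothetical and exists only for this example; the real suite tests trainer_with_eval.py against the mocked Tinker/EvalOps services.

```python
# Hypothetical example in the spirit of the suite; not taken from tests/.

def should_stop(score: float, threshold: float, round_idx: int, max_rounds: int) -> bool:
    """Stop when the eval score meets the threshold or rounds are exhausted."""
    return score >= threshold or round_idx + 1 >= max_rounds

def test_stops_early_when_threshold_met():
    assert should_stop(score=0.9, threshold=0.85, round_idx=0, max_rounds=3)

def test_continues_below_threshold():
    assert not should_stop(score=0.7, threshold=0.85, round_idx=0, max_rounds=3)

def test_stops_at_max_rounds_even_below_threshold():
    assert should_stop(score=0.7, threshold=0.85, round_idx=2, max_rounds=3)
```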
Based on Tinker documentation:

- Uses `renderers.build_supervised_example()` for proper loss masking (trains only on assistant outputs)
- Implements async futures with `forward_backward_async()` and `optim_step_async()` for performance
- Uses `save_weights_for_sampler()` for evaluation (not `save_state()`, which includes optimizer state)
- Supports Tinker's recommended LR formula, LR = 5e-5 × 10 × (2000/H_m)^P_m, with model-specific exponents (see the sketch after this list)
- Includes a warmup + cosine decay scheduler for stable training
- Inspect AI integration with a fallback to the simple evaluator
- Structured JSON logging to `runs/<timestamp>/metrics.jsonl`
- Checkpoint resume with the `--resume` flag
- Mock mode for offline demos and CI (set `TINKER_MOCK=1`)
- Gracefully falls back when tinker-cookbook is unavailable
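The LR formula and scheduler can be approximated as below. Interpreting H_m as the model's hidden size and P_m as a model-specific exponent is my reading of the formula above; the constants actually used live in hyperparam_utils.py.

```python
import math

def recommended_lr(hidden_size: int, exponent: float) -> float:
    """LR = 5e-5 * 10 * (2000 / H_m)^P_m, taking H_m as the model's hidden size
    and P_m as a model-specific exponent (an assumption; see hyperparam_utils.py
    for the values the project actually uses)."""
    return 5e-5 * 10 * (2000 / hidden_size) ** exponent

def lr_at_step(step: int, base_lr: float, warmup_steps: int, max_steps: int) -> float:
    """Linear warmup to base_lr, then cosine decay toward zero over max_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
```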
This code requires an active Tinker API key and appropriate computing quotas to execute training and evaluation. The implementation follows Tinker's documented best practices and is suitable for production use with real evaluation tasks. The simple evaluator is for demo purposes only—replace with Inspect AI integration for production deployments.