Commit 78f978a
Author: Lech
Message: Gemini 2.5 Pro Preview 06-05
Parent: bac66ca

525 files changed: +75,267 / −151 lines


README.md

Lines changed: 156 additions & 151 deletions
@@ -8,7 +8,7 @@ This benchmark tests how well large language models (LLMs) incorporate a set of
 ---
 ## Method Summary
 Each LLM produces 500 short stories, each approximately 400–500 words long, that must organically incorporate all assigned random elements.
-In the updated April 2025 version of the benchmark, which uses newer grader LLMs, 33 of the latest models are evaluated. In the earlier version, 38 LLMs were assessed.
+In the updated April 2025 version of the benchmark, which uses newer grader LLMs, 39 of the latest models are evaluated. In the earlier version, 38 LLMs were assessed.
 
 Seven LLMs grade each of these stories on 16 questions regarding:
 1. Character Development & Motivation
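For context on the hunk above: each model's headline Mean is an average over 500 stories, seven graders, and 16 questions per story. Below is a minimal sketch of that aggregation, assuming a flat list of per-(story, grader, question) grades; the function name and data layout are illustrative, not the repository's actual code.

```python
from collections import defaultdict
from statistics import mean

def headline_means(grades):
    """grades: list of (llm, story_id, grader, question, score) tuples.

    Returns {llm: mean score pooled over stories, graders, and questions}.
    """
    by_llm = defaultdict(list)
    for llm, _story, _grader, _question, score in grades:
        by_llm[llm].append(score)
    return {llm: round(mean(scores), 2) for llm, scores in by_llm.items()}

# Tiny demo with made-up grades.
demo = [
    ("model-a", 1, "grader-1", "Q1", 8.5),
    ("model-a", 1, "grader-2", "Q1", 8.0),
    ("model-b", 1, "grader-1", "Q1", 7.0),
]
print(headline_means(demo))  # -> {'model-a': 8.25, 'model-b': 7.0}
```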
@@ -38,43 +38,44 @@ The new grading LLMs are:
 | Rank | LLM | Mean |
 |-----:|-------------------|------:|
 | 1 | o3 (medium reasoning) | 8.39 |
-| 2 | Claude Opus 4 Thinking 16K | 8.36 |
-| 3 | Claude Opus 4 (no reasoning) | 8.31 |
-| 4 | Qwen 3 235B A22B | 8.30 |
-| 5 | DeepSeek R1 | 8.30 |
-| 6 | DeepSeek R1 05/28 | 8.19 |
-| 7 | GPT-4o Mar 2025 | 8.18 |
-| 8 | Claude Sonnet 4 Thinking 16K | 8.14 |
-| 9 | Claude 3.7 Sonnet Thinking 16K | 8.11 |
-| 10 | Claude Sonnet 4 (no reasoning) | 8.09 |
-| 11 | Gemini 2.5 Pro Preview 05-06 | 8.09 |
-| 12 | Gemini 2.5 Pro Exp 03-25 | 8.05 |
-| 13 | Claude 3.5 Sonnet 2024-10-22 | 8.03 |
-| 14 | Qwen QwQ-32B 16K | 8.02 |
-| 15 | Gemma 3 27B | 7.99 |
-| 16 | Claude 3.7 Sonnet | 7.94 |
-| 17 | Mistral Medium 3 | 7.73 |
-| 18 | DeepSeek V3-0324 | 7.70 |
-| 19 | Gemini 2.5 Flash Preview 24K | 7.65 |
-| 20 | Grok 3 Beta (no reasoning) | 7.64 |
-| 21 | GPT-4.5 Preview | 7.56 |
-| 22 | Qwen 3 30B A3B | 7.53 |
-| 23 | o4-mini (medium reasoning) | 7.50 |
-| 24 | Gemini 2.0 Flash Think Exp 01-21 | 7.38 |
-| 25 | Claude 3.5 Haiku | 7.35 |
-| 26 | Grok 3 Mini Beta (low) | 7.35 |
-| 27 | Qwen 2.5 Max | 7.29 |
-| 28 | Gemini 2.0 Flash Exp | 7.15 |
-| 29 | o1 (medium reasoning) | 7.02 |
-| 30 | Mistral Large 2 | 6.90 |
-| 31 | GPT-4o mini | 6.72 |
-| 32 | o1-mini | 6.49 |
-| 33 | Grok 2 12-12 | 6.36 |
-| 34 | Microsoft Phi-4 | 6.26 |
-| 35 | Llama 4 Maverick | 6.20 |
-| 36 | o3-mini (high reasoning) | 6.17 |
-| 37 | o3-mini (medium reasoning) | 6.15 |
-| 38 | Amazon Nova Pro | 6.05 |
+| 2 | Gemini 2.5 Pro Preview 06-05 | 8.38 |
+| 3 | Claude Opus 4 Thinking 16K | 8.36 |
+| 4 | Claude Opus 4 (no reasoning) | 8.31 |
+| 5 | Qwen 3 235B A22B | 8.30 |
+| 6 | DeepSeek R1 | 8.30 |
+| 7 | DeepSeek R1 05/28 | 8.19 |
+| 8 | GPT-4o Mar 2025 | 8.18 |
+| 9 | Claude Sonnet 4 Thinking 16K | 8.14 |
+| 10 | Claude 3.7 Sonnet Thinking 16K | 8.11 |
+| 11 | Claude Sonnet 4 (no reasoning) | 8.09 |
+| 12 | Gemini 2.5 Pro Preview 05-06 | 8.09 |
+| 13 | Gemini 2.5 Pro Exp 03-25 | 8.05 |
+| 14 | Claude 3.5 Sonnet 2024-10-22 | 8.03 |
+| 15 | Qwen QwQ-32B 16K | 8.02 |
+| 16 | Gemma 3 27B | 7.99 |
+| 17 | Claude 3.7 Sonnet | 7.94 |
+| 18 | Mistral Medium 3 | 7.73 |
+| 19 | DeepSeek V3-0324 | 7.70 |
+| 20 | Gemini 2.5 Flash Preview 24K | 7.65 |
+| 21 | Grok 3 Beta (no reasoning) | 7.64 |
+| 22 | GPT-4.5 Preview | 7.56 |
+| 23 | Qwen 3 30B A3B | 7.53 |
+| 24 | o4-mini (medium reasoning) | 7.50 |
+| 25 | Gemini 2.0 Flash Think Exp 01-21 | 7.38 |
+| 26 | Claude 3.5 Haiku | 7.35 |
+| 27 | Grok 3 Mini Beta (low) | 7.35 |
+| 28 | Qwen 2.5 Max | 7.29 |
+| 29 | Gemini 2.0 Flash Exp | 7.15 |
+| 30 | o1 (medium reasoning) | 7.02 |
+| 31 | Mistral Large 2 | 6.90 |
+| 32 | GPT-4o mini | 6.72 |
+| 33 | o1-mini | 6.49 |
+| 34 | Grok 2 12-12 | 6.36 |
+| 35 | Microsoft Phi-4 | 6.26 |
+| 36 | Llama 4 Maverick | 6.20 |
+| 37 | o3-mini (high reasoning) | 6.17 |
+| 38 | o3-mini (medium reasoning) | 6.15 |
+| 39 | Amazon Nova Pro | 6.05 |
 ---
 
 ### Overall Strip Plot of Questions
@@ -219,88 +220,90 @@ Excluding the 10% worst stories per LLM does not significantly change the rankings:
 | LLM Full | Old Rank | Old Mean | New Rank | New Mean |
 |----------|---------:|---------:|---------:|---------:|
 | o3 (medium reasoning) | 1 | 8.39 | 1 | 8.44 |
-| Claude Opus 4 Thinking 16K | 2 | 8.36 | 2 | 8.43 |
-| Claude Opus 4 (no reasoning) | 3 | 8.31 | 3 | 8.39 |
-| Qwen 3 235B A22B | 4 | 8.30 | 4 | 8.36 |
-| DeepSeek R1 | 5 | 8.30 | 5 | 8.36 |
-| DeepSeek R1 05/28 | 6 | 8.19 | 6 | 8.25 |
-| GPT-4o Mar 2025 | 7 | 8.18 | 7 | 8.23 |
-| Claude Sonnet 4 Thinking 16K | 8 | 8.14 | 8 | 8.21 |
-| Claude 3.7 Sonnet Thinking 16K | 9 | 8.11 | 9 | 8.17 |
-| Claude Sonnet 4 (no reasoning) | 10 | 8.09 | 10 | 8.16 |
-| Gemini 2.5 Pro Preview 05-06 | 11 | 8.09 | 11 | 8.15 |
-| Gemini 2.5 Pro Exp 03-25 | 12 | 8.05 | 12 | 8.11 |
-| Claude 3.5 Sonnet 2024-10-22 | 13 | 8.03 | 13 | 8.09 |
-| Qwen QwQ-32B 16K | 14 | 8.02 | 14 | 8.09 |
-| Gemma 3 27B | 15 | 7.99 | 15 | 8.06 |
-| Claude 3.7 Sonnet | 16 | 7.94 | 16 | 8.00 |
-| Mistral Medium 3 | 17 | 7.73 | 17 | 7.82 |
-| DeepSeek V3-0324 | 18 | 7.69 | 18 | 7.77 |
-| Gemini 2.5 Flash Preview 24K | 19 | 7.65 | 19 | 7.73 |
-| Grok 3 Beta (no reasoning) | 20 | 7.64 | 20 | 7.70 |
-| GPT-4.5 Preview | 21 | 7.56 | 21 | 7.63 |
-| Qwen 3 30B A3B | 22 | 7.53 | 22 | 7.61 |
-| o4-mini (medium reasoning) | 23 | 7.50 | 23 | 7.58 |
-| Gemini 2.0 Flash Think Exp 01-21 | 24 | 7.38 | 24 | 7.47 |
-| Claude 3.5 Haiku | 25 | 7.35 | 25 | 7.43 |
-| Grok 3 Mini Beta (low) | 26 | 7.35 | 26 | 7.42 |
-| Qwen 2.5 Max | 27 | 7.29 | 27 | 7.37 |
-| Gemini 2.0 Flash Exp | 28 | 7.15 | 28 | 7.24 |
-| o1 (medium reasoning) | 29 | 7.02 | 29 | 7.11 |
-| Mistral Large 2 | 30 | 6.90 | 30 | 7.00 |
-| GPT-4o mini | 31 | 6.72 | 31 | 6.80 |
-| o1-mini | 32 | 6.49 | 32 | 6.58 |
-| Grok 2 12-12 | 33 | 6.36 | 33 | 6.46 |
-| Microsoft Phi-4 | 34 | 6.26 | 34 | 6.35 |
-| Llama 4 Maverick | 35 | 6.20 | 35 | 6.29 |
-| o3-mini (high reasoning) | 36 | 6.17 | 36 | 6.26 |
-| o3-mini (medium reasoning) | 37 | 6.15 | 37 | 6.24 |
-| Amazon Nova Pro | 38 | 6.05 | 38 | 6.15 |
+| Gemini 2.5 Pro Preview 06-05 | 2 | 8.38 | 2 | 8.44 |
+| Claude Opus 4 Thinking 16K | 3 | 8.36 | 3 | 8.43 |
+| Claude Opus 4 (no reasoning) | 4 | 8.31 | 4 | 8.39 |
+| Qwen 3 235B A22B | 5 | 8.30 | 5 | 8.36 |
+| DeepSeek R1 | 6 | 8.30 | 6 | 8.36 |
+| DeepSeek R1 05/28 | 7 | 8.19 | 7 | 8.25 |
+| GPT-4o Mar 2025 | 8 | 8.18 | 8 | 8.23 |
+| Claude Sonnet 4 Thinking 16K | 9 | 8.14 | 9 | 8.21 |
+| Claude 3.7 Sonnet Thinking 16K | 10 | 8.11 | 10 | 8.17 |
+| Claude Sonnet 4 (no reasoning) | 11 | 8.09 | 11 | 8.16 |
+| Gemini 2.5 Pro Preview 05-06 | 12 | 8.09 | 12 | 8.15 |
+| Gemini 2.5 Pro Exp 03-25 | 13 | 8.05 | 13 | 8.11 |
+| Claude 3.5 Sonnet 2024-10-22 | 14 | 8.03 | 14 | 8.09 |
+| Qwen QwQ-32B 16K | 15 | 8.02 | 15 | 8.09 |
+| Gemma 3 27B | 16 | 7.99 | 16 | 8.06 |
+| Claude 3.7 Sonnet | 17 | 7.94 | 17 | 8.00 |
+| Mistral Medium 3 | 18 | 7.73 | 18 | 7.82 |
+| DeepSeek V3-0324 | 19 | 7.69 | 19 | 7.77 |
+| Gemini 2.5 Flash Preview 24K | 20 | 7.65 | 20 | 7.73 |
+| Grok 3 Beta (no reasoning) | 21 | 7.64 | 21 | 7.70 |
+| GPT-4.5 Preview | 22 | 7.56 | 22 | 7.63 |
+| Qwen 3 30B A3B | 23 | 7.53 | 23 | 7.61 |
+| o4-mini (medium reasoning) | 24 | 7.50 | 24 | 7.58 |
+| Gemini 2.0 Flash Think Exp 01-21 | 25 | 7.38 | 25 | 7.47 |
+| Claude 3.5 Haiku | 26 | 7.35 | 26 | 7.43 |
+| Grok 3 Mini Beta (low) | 27 | 7.35 | 27 | 7.42 |
+| Qwen 2.5 Max | 28 | 7.29 | 28 | 7.37 |
+| Gemini 2.0 Flash Exp | 29 | 7.15 | 29 | 7.24 |
+| o1 (medium reasoning) | 30 | 7.02 | 30 | 7.11 |
+| Mistral Large 2 | 31 | 6.90 | 31 | 7.00 |
+| GPT-4o mini | 32 | 6.72 | 32 | 6.80 |
+| o1-mini | 33 | 6.49 | 33 | 6.58 |
+| Grok 2 12-12 | 34 | 6.36 | 34 | 6.46 |
+| Microsoft Phi-4 | 35 | 6.26 | 35 | 6.35 |
+| Llama 4 Maverick | 36 | 6.20 | 36 | 6.29 |
+| o3-mini (high reasoning) | 37 | 6.17 | 37 | 6.26 |
+| o3-mini (medium reasoning) | 38 | 6.15 | 38 | 6.24 |
+| Amazon Nova Pro | 39 | 6.05 | 39 | 6.15 |
 
 
 Excluding any one LLM from grading also does not significantly change the rankings. For example, here is what happens when Llama 4 Maverick is excluded:
 ### Ranking after Excluding Llama 4 Maverick from Grading
 
 | LLM | Old Rank | Old Mean | New Rank | New Mean |
 |--------------------|---------:|---------:|---------:|---------:|
-| o3 (medium reasoning) | 1 | 8.39 | 1 | 8.27 |
-| Claude Opus 4 Thinking 16K | 2 | 8.36 | 2 | 8.25 |
-| Claude Opus 4 (no reasoning) | 3 | 8.31 | 3 | 8.20 |
-| DeepSeek R1 | 5 | 8.30 | 4 | 8.19 |
-| Qwen 3 235B A22B | 4 | 8.30 | 5 | 8.19 |
-| DeepSeek R1 05/28 | 6 | 8.19 | 6 | 8.07 |
-| GPT-4o Mar 2025 | 7 | 8.18 | 7 | 8.04 |
-| Claude Sonnet 4 Thinking 16K | 8 | 8.14 | 8 | 8.01 |
-| Claude 3.7 Sonnet Thinking 16K | 9 | 8.11 | 9 | 7.98 |
-| Gemini 2.5 Pro Preview 05-06 | 11 | 8.09 | 10 | 7.96 |
-| Claude Sonnet 4 (no reasoning) | 10 | 8.09 | 11 | 7.95 |
-| Gemini 2.5 Pro Exp 03-25 | 12 | 8.05 | 12 | 7.92 |
-| Claude 3.5 Sonnet 2024-10-22 | 13 | 8.03 | 13 | 7.89 |
-| Qwen QwQ-32B 16K | 14 | 8.02 | 14 | 7.88 |
-| Gemma 3 27B | 15 | 7.99 | 15 | 7.85 |
-| Claude 3.7 Sonnet | 16 | 7.94 | 16 | 7.78 |
-| Mistral Medium 3 | 17 | 7.73 | 17 | 7.55 |
-| DeepSeek V3-0324 | 18 | 7.69 | 18 | 7.51 |
-| Gemini 2.5 Flash Preview 24K | 19 | 7.65 | 19 | 7.46 |
-| Grok 3 Beta (no reasoning) | 20 | 7.64 | 20 | 7.44 |
-| GPT-4.5 Preview | 21 | 7.56 | 21 | 7.36 |
-| Qwen 3 30B A3B | 22 | 7.53 | 22 | 7.32 |
-| o4-mini (medium reasoning) | 23 | 7.50 | 23 | 7.26 |
-| Gemini 2.0 Flash Think Exp 01-21 | 24 | 7.38 | 24 | 7.14 |
-| Claude 3.5 Haiku | 25 | 7.35 | 25 | 7.11 |
-| Grok 3 Mini Beta (low) | 26 | 7.35 | 26 | 7.10 |
-| Qwen 2.5 Max | 27 | 7.29 | 27 | 7.08 |
-| Gemini 2.0 Flash Exp | 28 | 7.15 | 28 | 6.89 |
-| o1 (medium reasoning) | 29 | 7.02 | 29 | 6.74 |
-| Mistral Large 2 | 30 | 6.90 | 30 | 6.63 |
-| GPT-4o mini | 31 | 6.72 | 31 | 6.43 |
-| o1-mini | 32 | 6.49 | 32 | 6.13 |
-| Grok 2 12-12 | 33 | 6.36 | 33 | 6.03 |
-| Microsoft Phi-4 | 34 | 6.26 | 34 | 5.90 |
-| Llama 4 Maverick | 35 | 6.20 | 35 | 5.83 |
-| o3-mini (high reasoning) | 36 | 6.17 | 36 | 5.76 |
-| o3-mini (medium reasoning) | 37 | 6.15 | 37 | 5.73 |
-| Amazon Nova Pro | 38 | 6.05 | 38 | 5.67 |
+| Gemini 2.5 Pro Preview 06-05 | 2 | 8.38 | 1 | 8.29 |
+| o3 (medium reasoning) | 1 | 8.39 | 2 | 8.27 |
+| Claude Opus 4 Thinking 16K | 3 | 8.36 | 3 | 8.25 |
+| Claude Opus 4 (no reasoning) | 4 | 8.31 | 4 | 8.20 |
+| DeepSeek R1 | 6 | 8.30 | 5 | 8.19 |
+| Qwen 3 235B A22B | 5 | 8.30 | 6 | 8.19 |
+| DeepSeek R1 05/28 | 7 | 8.19 | 7 | 8.07 |
+| GPT-4o Mar 2025 | 8 | 8.18 | 8 | 8.04 |
+| Claude Sonnet 4 Thinking 16K | 9 | 8.14 | 9 | 8.01 |
+| Claude 3.7 Sonnet Thinking 16K | 10 | 8.11 | 10 | 7.98 |
+| Gemini 2.5 Pro Preview 05-06 | 12 | 8.09 | 11 | 7.96 |
+| Claude Sonnet 4 (no reasoning) | 11 | 8.09 | 12 | 7.95 |
+| Gemini 2.5 Pro Exp 03-25 | 13 | 8.05 | 13 | 7.92 |
+| Claude 3.5 Sonnet 2024-10-22 | 14 | 8.03 | 14 | 7.89 |
+| Qwen QwQ-32B 16K | 15 | 8.02 | 15 | 7.88 |
+| Gemma 3 27B | 16 | 7.99 | 16 | 7.85 |
+| Claude 3.7 Sonnet | 17 | 7.94 | 17 | 7.78 |
+| Mistral Medium 3 | 18 | 7.73 | 18 | 7.55 |
+| DeepSeek V3-0324 | 19 | 7.69 | 19 | 7.51 |
+| Gemini 2.5 Flash Preview 24K | 20 | 7.65 | 20 | 7.46 |
+| Grok 3 Beta (no reasoning) | 21 | 7.64 | 21 | 7.44 |
+| GPT-4.5 Preview | 22 | 7.56 | 22 | 7.36 |
+| Qwen 3 30B A3B | 23 | 7.53 | 23 | 7.32 |
+| o4-mini (medium reasoning) | 24 | 7.50 | 24 | 7.26 |
+| Gemini 2.0 Flash Think Exp 01-21 | 25 | 7.38 | 25 | 7.14 |
+| Claude 3.5 Haiku | 26 | 7.35 | 26 | 7.11 |
+| Grok 3 Mini Beta (low) | 27 | 7.35 | 27 | 7.10 |
+| Qwen 2.5 Max | 28 | 7.29 | 28 | 7.08 |
+| Gemini 2.0 Flash Exp | 29 | 7.15 | 29 | 6.89 |
+| o1 (medium reasoning) | 30 | 7.02 | 30 | 6.74 |
+| Mistral Large 2 | 31 | 6.90 | 31 | 6.63 |
+| GPT-4o mini | 32 | 6.72 | 32 | 6.43 |
+| o1-mini | 33 | 6.49 | 33 | 6.13 |
+| Grok 2 12-12 | 34 | 6.36 | 34 | 6.03 |
+| Microsoft Phi-4 | 35 | 6.26 | 35 | 5.90 |
+| Llama 4 Maverick | 36 | 6.20 | 36 | 5.83 |
+| o3-mini (high reasoning) | 37 | 6.17 | 37 | 5.76 |
+| o3-mini (medium reasoning) | 38 | 6.15 | 38 | 5.73 |
+| Amazon Nova Pro | 39 | 6.05 | 39 | 5.67 |
 
 Normalizing each grader’s scores doesn’t significantly alter the rankings:
 
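The two tables diffed above are robustness checks: re-ranking after dropping each model's worst 10% of stories, and re-ranking with one grader excluded. A minimal sketch of both computations follows, under assumed data shapes; the helper names are hypothetical, not from the repository.

```python
from statistics import mean

def mean_excluding_worst(story_means, frac=0.10):
    """Mean after dropping the worst `frac` of per-story means for one model."""
    kept = sorted(story_means, reverse=True)
    return mean(kept[: len(kept) - int(len(kept) * frac)])

def mean_excluding_grader(scores_by_grader, excluded):
    """Pooled mean over all graders except `excluded`, for one model."""
    pooled = [s for grader, scores in scores_by_grader.items()
              if grader != excluded for s in scores]
    return mean(pooled)

print(mean_excluding_worst([8, 9, 7, 2, 8, 9, 8, 7, 8, 9]))       # drops the 2
print(mean_excluding_grader({"g1": [8, 9], "g2": [3, 4]}, "g2"))  # 8.5
```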
@@ -309,44 +312,45 @@ Normalizing each grader’s scores doesn’t significantly alter the rankings:
 
 | Rank | LLM | Normalized Mean |
 |-----:|------------------------|-----------------:|
-| 1 | o3 (medium reasoning) | 0.948 |
-| 2 | Claude Opus 4 Thinking 16K | 0.892 |
-| 3 | DeepSeek R1 | 0.845 |
-| 4 | Qwen 3 235B A22B | 0.843 |
-| 5 | Claude Opus 4 (no reasoning) | 0.838 |
-| 6 | GPT-4o Mar 2025 | 0.744 |
-| 7 | DeepSeek R1 05/28 | 0.695 |
-| 8 | Claude Sonnet 4 Thinking 16K | 0.661 |
-| 9 | Claude 3.7 Sonnet Thinking 16K | 0.653 |
-| 10 | Claude Sonnet 4 (no reasoning) | 0.604 |
-| 11 | Claude 3.5 Sonnet 2024-10-22 | 0.571 |
-| 12 | Qwen QwQ-32B 16K | 0.563 |
-| 13 | Gemini 2.5 Pro Preview 05-06 | 0.551 |
-| 14 | Gemini 2.5 Pro Exp 03-25 | 0.550 |
-| 15 | Gemma 3 27B | 0.495 |
-| 16 | Claude 3.7 Sonnet | 0.479 |
-| 17 | DeepSeek V3-0324 | 0.226 |
-| 18 | Mistral Medium 3 | 0.219 |
-| 19 | Gemini 2.5 Flash Preview 24K | 0.162 |
-| 20 | Grok 3 Beta (no reasoning) | 0.157 |
-| 21 | GPT-4.5 Preview | 0.108 |
-| 22 | Qwen 3 30B A3B | 0.079 |
-| 23 | o4-mini (medium reasoning) | 0.064 |
-| 24 | Grok 3 Mini Beta (low) | -0.086 |
-| 25 | Gemini 2.0 Flash Think Exp 01-21 | -0.095 |
-| 26 | Claude 3.5 Haiku | -0.103 |
-| 27 | Qwen 2.5 Max | -0.251 |
-| 28 | Gemini 2.0 Flash Exp | -0.325 |
-| 29 | o1 (medium reasoning) | -0.469 |
-| 30 | Mistral Large 2 | -0.642 |
-| 31 | GPT-4o mini | -0.866 |
-| 32 | o1-mini | -0.984 |
-| 33 | Grok 2 12-12 | -1.244 |
-| 34 | o3-mini (high reasoning) | -1.280 |
-| 35 | o3-mini (medium reasoning) | -1.294 |
-| 36 | Microsoft Phi-4 | -1.317 |
-| 37 | Llama 4 Maverick | -1.408 |
-| 38 | Amazon Nova Pro | -1.581 |
+| 1 | o3 (medium reasoning) | 0.925 |
+| 2 | Claude Opus 4 Thinking 16K | 0.868 |
+| 3 | Gemini 2.5 Pro Preview 06-05 | 0.846 |
+| 4 | DeepSeek R1 | 0.822 |
+| 5 | Qwen 3 235B A22B | 0.820 |
+| 6 | Claude Opus 4 (no reasoning) | 0.814 |
+| 7 | GPT-4o Mar 2025 | 0.722 |
+| 8 | DeepSeek R1 05/28 | 0.671 |
+| 9 | Claude Sonnet 4 Thinking 16K | 0.636 |
+| 10 | Claude 3.7 Sonnet Thinking 16K | 0.631 |
+| 11 | Claude Sonnet 4 (no reasoning) | 0.580 |
+| 12 | Claude 3.5 Sonnet 2024-10-22 | 0.549 |
+| 13 | Qwen QwQ-32B 16K | 0.541 |
+| 14 | Gemini 2.5 Pro Exp 03-25 | 0.527 |
+| 15 | Gemini 2.5 Pro Preview 05-06 | 0.526 |
+| 16 | Gemma 3 27B | 0.472 |
+| 17 | Claude 3.7 Sonnet | 0.457 |
+| 18 | DeepSeek V3-0324 | 0.205 |
+| 19 | Mistral Medium 3 | 0.195 |
+| 20 | Gemini 2.5 Flash Preview 24K | 0.140 |
+| 21 | Grok 3 Beta (no reasoning) | 0.135 |
+| 22 | GPT-4.5 Preview | 0.087 |
+| 23 | Qwen 3 30B A3B | 0.057 |
+| 24 | o4-mini (medium reasoning) | 0.043 |
+| 25 | Grok 3 Mini Beta (low) | -0.107 |
+| 26 | Gemini 2.0 Flash Think Exp 01-21 | -0.117 |
+| 27 | Claude 3.5 Haiku | -0.125 |
+| 28 | Qwen 2.5 Max | -0.273 |
+| 29 | Gemini 2.0 Flash Exp | -0.347 |
+| 30 | o1 (medium reasoning) | -0.490 |
+| 31 | Mistral Large 2 | -0.664 |
+| 32 | GPT-4o mini | -0.888 |
+| 33 | o1-mini | -1.005 |
+| 34 | Grok 2 12-12 | -1.265 |
+| 35 | o3-mini (high reasoning) | -1.301 |
+| 36 | o3-mini (medium reasoning) | -1.315 |
+| 37 | Microsoft Phi-4 | -1.338 |
+| 38 | Llama 4 Maverick | -1.430 |
+| 39 | Amazon Nova Pro | -1.603 |
 
 
 
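The Normalized Mean column diffed above comes from normalizing each grader's scores before averaging. The exact scheme isn't shown in this hunk; a per-grader z-score, given here purely as an assumption, is one standard way to put lenient and harsh graders on the same scale.

```python
from statistics import mean, pstdev

def normalized_means(grades):
    """grades: list of (llm, grader, score) -> {llm: mean per-grader z-score}."""
    by_grader = {}
    for _llm, grader, score in grades:
        by_grader.setdefault(grader, []).append(score)
    # Per-grader mean and population standard deviation.
    stats = {g: (mean(v), pstdev(v)) for g, v in by_grader.items()}
    z_by_llm = {}
    for llm, grader, score in grades:
        mu, sd = stats[grader]
        z_by_llm.setdefault(llm, []).append((score - mu) / sd if sd else 0.0)
    return {llm: mean(zs) for llm, zs in z_by_llm.items()}
```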
@@ -430,6 +434,7 @@ It's important to note that each story is graded individually rather than as par
 - [LLM Divergent Thinking Creativity Benchmark](https://github.com/lechmazur/divergent/)
 ---
 ## Updates
+- June 5, 2025: Gemini 2.5 Pro Preview 06-05 added.
 - May 29, 2025: DeepSeek R1 05/28 added.
 - May 23, 2025: Claude 4 added.
 - May 8, 2025: Gemini 2.5 Pro Preview 05-06 and Mistral Medium 3 added.
