This benchmark tests how well large language models (LLMs) incorporate a set of assigned random elements into short creative stories.
---
## Method Summary
Each LLM produces 500 short stories, each approximately 400–500 words long, that must organically incorporate all assigned random elements.
In the updated April 2025 version of the benchmark, which uses newer grader LLMs, 39 of the latest models are evaluated. In the earlier version, 38 LLMs were assessed.
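To make the generation step concrete, here is a minimal sketch of how a per-story prompt might be assembled. The element pool, category names, and prompt wording below are illustrative assumptions, not the benchmark's actual prompts; only the 500-story count, the 400–500 word target, and the requirement to incorporate all elements come from the description above.

```python
import random

# Hypothetical element pool; the real benchmark assigns its own
# required elements per story, which are not reproduced here.
ELEMENT_POOL = {
    "character": ["a lighthouse keeper", "a retired astronaut"],
    "object": ["a cracked compass", "a wax seal"],
    "setting": ["a floating market", "an abandoned funicular"],
}

def make_prompt(story_id: int, seed: int = 0) -> str:
    """Draw one random element per category and build the story prompt."""
    rng = random.Random(seed + story_id)  # reproducible per-story draw
    elements = {cat: rng.choice(opts) for cat, opts in ELEMENT_POOL.items()}
    required = "; ".join(f"{cat}: {val}" for cat, val in elements.items())
    return (
        "Write a short story of approximately 400-500 words that "
        f"organically incorporates all of these elements: {required}."
    )

prompts = [make_prompt(i) for i in range(500)]  # 500 stories per LLM
```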
Seven LLMs grade each of these stories on 16 questions regarding:
1. Character Development & Motivation
The overall ranking by mean score across all graders and questions:
| Rank | LLM | Mean |
|-----:|-------------------|------:|
| 1 | o3 (medium reasoning) | 8.39 |
| 2 | Gemini 2.5 Pro Preview 06-05 | 8.38 |
| 3 | Claude Opus 4 Thinking 16K | 8.36 |
| 4 | Claude Opus 4 (no reasoning) | 8.31 |
| 5 | Qwen 3 235B A22B | 8.30 |
| 6 | DeepSeek R1 | 8.30 |
| 7 | DeepSeek R1 05/28 | 8.19 |
| 8 | GPT-4o Mar 2025 | 8.18 |
| 9 | Claude Sonnet 4 Thinking 16K | 8.14 |
| 10 | Claude 3.7 Sonnet Thinking 16K | 8.11 |
| 11 | Claude Sonnet 4 (no reasoning) | 8.09 |
| 12 | Gemini 2.5 Pro Preview 05-06 | 8.09 |
| 13 | Gemini 2.5 Pro Exp 03-25 | 8.05 |
| 14 | Claude 3.5 Sonnet 2024-10-22 | 8.03 |
| 15 | Qwen QwQ-32B 16K | 8.02 |
| 16 | Gemma 3 27B | 7.99 |
| 17 | Claude 3.7 Sonnet | 7.94 |
| 18 | Mistral Medium 3 | 7.73 |
| 19 | DeepSeek V3-0324 | 7.70 |
| 20 | Gemini 2.5 Flash Preview 24K | 7.65 |
| 21 | Grok 3 Beta (no reasoning) | 7.64 |
| 22 | GPT-4.5 Preview | 7.56 |
| 23 | Qwen 3 30B A3B | 7.53 |
| 24 | o4-mini (medium reasoning) | 7.50 |
| 25 | Gemini 2.0 Flash Think Exp 01-21 | 7.38 |
| 26 | Claude 3.5 Haiku | 7.35 |
| 27 | Grok 3 Mini Beta (low) | 7.35 |
| 28 | Qwen 2.5 Max | 7.29 |
| 29 | Gemini 2.0 Flash Exp | 7.15 |
| 30 | o1 (medium reasoning) | 7.02 |
| 31 | Mistral Large 2 | 6.90 |
| 32 | GPT-4o mini | 6.72 |
| 33 | o1-mini | 6.49 |
| 34 | Grok 2 12-12 | 6.36 |
| 35 | Microsoft Phi-4 | 6.26 |
| 36 | Llama 4 Maverick | 6.20 |
| 37 | o3-mini (high reasoning) | 6.17 |
| 38 | o3-mini (medium reasoning) | 6.15 |
| 39 | Amazon Nova Pro | 6.05 |
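The Mean column can be read as an equal-weight average over every grade a model receives (500 stories × 7 graders × 16 questions). Below is a minimal sketch of that aggregation, assuming scores are stored per story, grader, and question; the repository's exact weighting may differ.

```python
import statistics

def model_mean(scores: list[list[list[float]]]) -> float:
    """Average every individual grade with equal weight.

    scores[story][grader][question]: 500 stories x 7 graders x 16 questions.
    This is the simplest reading of the Mean column; the actual
    aggregation in the repository may normalize or weight differently.
    """
    flat = [g for story in scores for grader in story for g in grader]
    return round(statistics.mean(flat), 2)
```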
---
### Overall Strip Plot of Questions
Excluding the worst 10% of stories per LLM does not significantly change the rankings:
| LLM | Old Rank | Old Mean | New Rank | New Mean |
|-----|---------:|---------:|---------:|---------:|
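One way to reproduce this check, assuming each story is first collapsed to its per-story mean: sort an LLM's 500 story means, drop the lowest 10%, and re-average. A sketch under that assumption (the repository's exact procedure may differ):

```python
def trimmed_mean(story_means: list[float], drop_frac: float = 0.10) -> float:
    """Re-average one LLM's stories after dropping its worst fraction.

    story_means: one per-story average per generated story (500 values here).
    """
    kept = sorted(story_means)[int(len(story_means) * drop_frac):]
    return sum(kept) / len(kept)
```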
Excluding any one LLM from grading also does not significantly change the rankings. For example, here is what happens when Llama 4 Maverick is excluded:
### Ranking after Excluding Llama 4 Maverick from Grading
| LLM | Old Rank | Old Mean | New Rank | New Mean |
|-----|---------:|---------:|---------:|---------:|
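The leave-one-grader-out check can be sketched the same way: drop one grader's scores and rebuild each model's mean from the remaining six. A minimal sketch, assuming per-model scores are keyed by grader name (the layout is illustrative):

```python
def mean_excluding_grader(scores: dict[str, list[float]], excluded: str) -> float:
    """Rebuild a model's mean from the remaining graders' scores.

    scores: grader name -> that grader's individual grades for this model.
    """
    kept = [g for name, grades in scores.items() if name != excluded for g in grades]
    return sum(kept) / len(kept)

# e.g. mean_excluding_grader(per_grader_scores, "Llama 4 Maverick")
```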