Each LLM produces 500 short stories, each approximately 400–500 words long, that must organically incorporate all assigned random elements.

In the updated April 2025 version of the benchmark, which uses newer grader LLMs, 31 of the latest models are evaluated. In the earlier version, 38 LLMs were assessed.

Seven LLMs grade each of these stories on 16 questions regarding:

1. Character Development & Motivation
2. Plot Structure & Coherence
3. World & Atmosphere
4. DeepSeek V3-0324
5. Grok 3 Beta (no reasoning)
6. Gemini 2.5 Pro Exp
7. Qwen 3 235B
**Leaderboard:**

| Rank | LLM | Mean |
|-----:|-------------------|------:|
| 1 | o3 (medium reasoning) | 8.39 |
| 2 | Qwen 3 235B A22B | 8.30 |
| 3 | DeepSeek R1 | 8.30 |
| 4 | GPT-4o Mar 2025 | 8.18 |
| 5 | Claude 3.7 Sonnet Thinking 16K | 8.11 |
| 6 | Qwen QwQ-32B 16K | 8.07 |
| 7 | Gemini 2.5 Pro Exp 03-25 | 8.05 |
| 8 | Claude 3.5 Sonnet 2024-10-22 | 8.03 |
| 9 | Gemma 3 27B | 7.99 |
| 10 | Claude 3.7 Sonnet | 7.94 |
| 11 | DeepSeek V3-0324 | 7.70 |
| 12 | Gemini 2.5 Flash Preview 24K | 7.65 |
| 13 | Grok 3 Beta (no reasoning) | 7.64 |
| 14 | GPT-4.5 Preview | 7.56 |
| 15 | Qwen 3 30B A3B | 7.53 |
| 16 | o4-mini (medium reasoning) | 7.50 |
| 17 | Gemini 2.0 Flash Think Exp 01-21 | 7.38 |
| 18 | Claude 3.5 Haiku | 7.35 |
| 19 | Grok 3 Mini Beta (low) | 7.35 |
| 20 | Qwen 2.5 Max | 7.29 |
| 21 | Gemini 2.0 Flash Exp | 7.15 |
| 22 | o1 (medium reasoning) | 7.02 |
| 23 | Mistral Large 2 | 6.90 |
| 24 | GPT-4o mini | 6.72 |
| 25 | o1-mini | 6.49 |
| 26 | Grok 2 12-12 | 6.36 |
| 27 | Microsoft Phi-4 | 6.26 |
| 28 | Llama 4 Maverick | 6.20 |
| 29 | o3-mini (high reasoning) | 6.17 |
| 30 | o3-mini (medium reasoning) | 6.15 |
| 31 | Amazon Nova Pro | 6.05 |
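The Mean column above is, as described, an average over each model's 500 stories, the seven graders, and the 16 grading questions. A minimal sketch of that aggregation, assuming grades are kept per story and per grader (the data layout and function names here are illustrative, not the repository's actual code):

```python
# Illustrative aggregation sketch (hypothetical data layout, not the repo's code):
# each story gets a list of 16 question scores from every grader LLM, and a
# model's leaderboard "Mean" is the average over all stories, graders, and questions.
from statistics import mean

def model_mean(grades):
    """grades: {story_id: {grader_name: [16 question scores]}} for one model."""
    scores = [
        s
        for graders in grades.values()
        for question_scores in graders.values()
        for s in question_scores
    ]
    return round(mean(scores), 2)

def model_mean_excluding(grades, excluded_grader):
    """Leave-one-out re-scoring: drop one grader's scores before averaging."""
    filtered = {
        story: {g: qs for g, qs in graders.items() if g != excluded_grader}
        for story, graders in grades.items()
    }
    return model_mean(filtered)
```

The second function mirrors the leave-one-out check discussed later in this document, where excluding any single grader LLM leaves the rankings largely unchanged.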
---
### Overall Strip Plot of Questions
Here, we list the top 3 and the bottom 3 individual stories (written by any LLM).

### Top 3 Individual Stories (All Graders)

* **Story**: [story_403.txt](stories_wc/r1/story_403.txt) by DeepSeek R1
Excluding any one LLM from grading also does not significantly change the rankings. For example, here is what happens when Llama 4 Maverick is excluded:

### Ranking after Excluding Llama 4 Maverick from Grading
| LLM | Old Rank | Old Mean | New Rank | New Mean |
|-----|---------:|---------:|---------:|---------:|