This benchmark tests how well large language models (LLMs) incorporate a set of 10 mandatory story elements (characters, objects, core concepts, attributes, motivations, etc.) in a short narrative. This is particularly relevant for creative LLM use cases. Because every story has the same required building blocks and similar length, their resulting cohesiveness and creativity become directly comparable across models. A wide variety of required random elements ensures that LLMs must create diverse stories and cannot resort to repetition. The benchmark captures both constraint satisfaction (did the LLM incorporate all elements properly?) and literary quality (how engaging or coherent is the final piece?). By applying a multi-question grading rubric and multiple "grader" LLMs, we can pinpoint differences in how well each model integrates the assigned elements, develops characters, maintains atmosphere, and sustains an overall coherent plot. It measures more than fluency or style: it probes whether each model can adapt to rigid requirements, remain original, and produce a cohesive story that meaningfully uses every single assigned element.

---

![Overall scores](/images/llm_overall_bar_zoomed_with_err.png)

---
## Method Summary

Each of the 37 LLMs produces 500 short stories - each targeted at 400–500 words - that must organically integrate all assigned random elements. In total, 37 * 500 = 18,500 unique stories are generated.

Six LLMs grade each of these stories on 16 questions regarding:
1. Character Development & Motivation

The grading LLMs are:

5. Grok 2 12-12
6. Gemini 1.5 Pro (Sept)

In total, 37 * 500 * 6 * 16 = 1,776,000 grades are generated.
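
For scale, a minimal sketch of the grading volume these numbers imply (the constant names below are illustrative, not taken from the benchmark's code):

```python
# Back-of-the-envelope check of the totals quoted above.
NUM_WRITERS = 37          # LLMs that write stories
STORIES_PER_WRITER = 500
NUM_GRADERS = 6           # grader LLMs
QUESTIONS_PER_STORY = 16  # rubric questions

stories = NUM_WRITERS * STORIES_PER_WRITER             # 18,500 stories
grades = stories * NUM_GRADERS * QUESTIONS_PER_STORY   # 1,776,000 grades
print(f"{stories:,} stories, {grades:,} grades")
```
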
---
## Results

| Rank | LLM | Mean |
| -----:| ------------------------| ------:|
| 3 | Claude 3.5 Sonnet 2024-10-22 | 8.47 |
| 4 | Claude 3.7 Sonnet | 8.39 |
| 5 | Qwen QwQ-32B 16K | 8.34 |
| 6 | Gemini 2.5 Pro Exp 03-25 | 8.30 |
| 7 | Gemma 3 27B | 8.22 |
| 8 | DeepSeek V3-0324 | 8.09 |
| 9 | Gemini 2.0 Pro Exp 02-05 | 8.08 |
| 10 | GPT-4.5 Preview | 8.07 |
| 11 | Claude 3.5 Haiku | 8.07 |
| 12 | Gemini 1.5 Pro (Sept) | 7.97 |
| 13 | GPT-4o Feb 2025 | 7.96 |
| 14 | Gemini 2.0 Flash Thinking Exp Old | 7.87 |
| 15 | GPT-4o 2024-11-20 | 7.87 |
| 16 | Gemini 2.0 Flash Thinking Exp 01-21 | 7.82 |
| 17 | o1-preview | 7.74 |
| 18 | Gemini 2.0 Flash Exp | 7.65 |
| 19 | Qwen 2.5 Max | 7.64 |
| 20 | DeepSeek-V3 | 7.62 |
| 21 | o1 | 7.57 |
| 22 | Mistral Large 2 | 7.54 |
| 23 | Gemma 2 27B | 7.49 |
| 24 | Qwen QwQ Preview | 7.44 |
| 25 | GPT-4o mini | 7.37 |
| 26 | GPT-4o 2024-08-06 | 7.36 |
| 27 | o1-mini | 7.30 |
| 28 | Claude 3 Opus | 7.17 |
| 29 | Qwen 2.5 72B | 7.00 |
| 30 | o3-mini-high | 6.99 |
| 31 | Grok 2 12-12 | 6.98 |
| 32 | o3-mini | 6.90 |
| 33 | Microsoft Phi-4 | 6.89 |
| 34 | Amazon Nova Pro | 6.70 |
| 35 | Llama 3.1 405B | 6.60 |
| 36 | Llama 3.3 70B | 5.95 |
| 37 | Claude 3 Haiku | 5.83 |

Qwen QwQ-32B joins DeepSeek R1 and Claude Sonnet as the clear overall winners. Notably, Claude 3.5 Haiku shows a large improvement over Claude 3 Haiku, and Gemma 3 a large improvement over Gemma 2. Gemini models perform well, while Llama models lag behind. Interestingly, larger, more expensive models did not outperform smaller models by as much as one might expect. o3-mini performs worse than expected.
### Overall Strip Plot of Questions
A strip plot illustrating distributions of scores (y-axis) by LLM (x-axis) across all stories, with Grader LLMs marked in different colors:

![Normalized scores strip chart](/images/normalized_scores_strip.png)

The plot reveals that Llama 3.1 405B occasionally, and DeepSeek-V3 sporadically, award a perfect 10 across the board, despite prompts explicitly asking them to be strict graders.
### LLM vs. Question (Detailed)
A heatmap showing each LLM's mean rating per question:

![LLM per question](/images/llm_vs_question_detailed.png)

Before DeepSeek R1's release, Claude 3.5 Sonnet ranked #1 on every single question.
### LLM #1 Finishes
Which LLM ranked #1 the most times across all stories? This pie chart shows the distribution of #1 finishes:

![#1 stories pie chart](/images/llm_best_pie.png)

Claude Sonnet's and R1's dominance is undeniable when analyzing the best scores by story. Qwen QwQ-32B and Gemma 3 27B get some victories.
### Grader vs. LLM Mean Heatmap
A heatmap of Grader (row) vs. LLM (column) average scores:

![Grader vs LLM normalized](/images/grader_vs_llm_normalized_means.png)

The chart highlights that grading LLMs do not disproportionately overrate their own stories. Llama 3.1 405B is impressed by o3-mini's stories, while the other grading LLMs dislike them.
### Grader-Grader Correlation
A correlation matrix (−1 to 1 scale) showing how strongly the grader LLMs agree with one another when scoring the same stories:

![Grader vs LLM correlation](/images/teacher_grader_correlation.png)

Llama 3.1 405B's grades show the least correlation with other LLMs.
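
A minimal sketch of how such a grader-agreement matrix can be computed with pandas, assuming a long-format table of grades with `grader`, `story_id`, and `score` columns (the file name and schema are assumptions, not the benchmark's actual layout):

```python
import pandas as pd

# One row per (grader, story_id, question) with a numeric score.
grades = pd.read_csv("grades.csv")  # assumed file and schema

# Collapse the 16 question scores into one score per (story, grader),
# then pivot into a stories-by-graders matrix.
per_story = (
    grades.groupby(["story_id", "grader"])["score"]
          .mean()
          .unstack("grader")
)

# Pairwise Pearson correlation between graders (-1 to 1).
print(per_story.corr().round(2))
```
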
A basic prompt asking LLMs to create a 400-500 word story resulted in an unacceptably wide range of story lengths.

Since the benchmark aims to evaluate how well LLMs write, not how well they count or follow prompts about the format, we adjusted the word counts in the prompt for different LLMs to approximately match the target story length - an approach similar to what someone dissatisfied with the initial story length might adopt. Qwen QwQ and Llama 3.x models required the most extensive prompt engineering to achieve the required word counts and to adhere to the proper output format across all 500 stories. Note that this did not require any evaluation of the story's content itself. These final stories were then graded, and they are available in [stories_wc/](stories_wc/).
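
A minimal sketch of the per-model adjustment described above: measure how long a model's stories actually come out when asked for the target length, then scale the word count the prompt asks for (the helper and the proportional rule are illustrative assumptions, not the benchmark's actual code):

```python
def word_count(text: str) -> int:
    return len(text.split())

def adjusted_prompt_target(sample_stories: list[str], target: int = 450) -> int:
    """Return the word count to *ask* for so stories *land* near `target` words.

    sample_stories: a few stories the model produced when asked for `target` words.
    """
    actual = sum(word_count(s) for s in sample_stories) / len(sample_stories)
    # If the model runs short, ask for proportionally more words (and vice versa).
    return round(target * target / actual)
```

Under this rule, a model that returns roughly 360 words when asked for 450 would instead be prompted for about 560 words.
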

![Word count distribution by model](/images/word_count_distribution_by_model.png)

This chart shows the correlations between each LLM's scores and their story lengths:

![Len vs score](/images/len_vs_score_overall_enhanced.png)

o3-mini and o1 seem to force too many of their stories to be exactly within the specified limits, which may hurt their grades.
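
A minimal sketch of the length-vs-score check behind these charts, assuming a per-story table with `llm`, `word_count`, and mean `score` columns (assumed names):

```python
import pandas as pd

stories = pd.read_csv("story_scores.csv")  # assumed file and schema

# Pearson correlation between story length and mean score, per writing LLM.
length_score_corr = stories.groupby("llm").apply(
    lambda g: g["word_count"].corr(g["score"])
)
print(length_score_corr.sort_values())
```
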
This chart shows the correlations between each Grader LLM's scores and the lengths of stories they graded:

![Length vs score by grader](/images/len_vs_score_grader_enhanced.png)

---
## Best and Worst Stories

A valid concern is whether LLM graders can accurately score questions 1 to 6.

### Questions 7A to 7J Only: Element Integration

![Element Integration](/images/llm_overall_bar_zoomed_7Ato7J.png)

Excluding the 10% worst stories per LLM does not significantly change the rankings:

| LLM | Rank (all) | Mean (all) | Rank (excl. worst 10%) | Mean (excl. worst 10%) |
| ------------------------------------| ----:| -----:| ----:| -----:|
| Claude 3.5 Sonnet 2024-10-22 | 3 | 8.47 | 3 | 8.54 |
| Claude 3.7 Sonnet | 4 | 8.39 | 4 | 8.45 |
| Qwen QwQ-32B 16K | 5 | 8.34 | 5 | 8.41 |
| Gemini 2.5 Pro Exp 03-25 | 6 | 8.30 | 6 | 8.36 |
| Gemma 3 27B | 7 | 8.22 | 7 | 8.29 |
| DeepSeek V3-0324 | 8 | 8.09 | 8 | 8.17 |
| Gemini 2.0 Pro Exp 02-05 | 9 | 8.08 | 9 | 8.16 |
| GPT-4.5 Preview | 10 | 8.07 | 10 | 8.16 |
| Claude 3.5 Haiku | 11 | 8.07 | 11 | 8.15 |
| Gemini 1.5 Pro (Sept) | 12 | 7.97 | 12 | 8.06 |
| GPT-4o Feb 2025 | 13 | 7.96 | 13 | 8.05 |
| Gemini 2.0 Flash Thinking Exp Old | 14 | 7.87 | 14 | 7.96 |
| GPT-4o 2024-11-20 | 15 | 7.87 | 15 | 7.95 |
| Gemini 2.0 Flash Thinking Exp 01-21 | 16 | 7.82 | 16 | 7.93 |
| o1-preview | 17 | 7.74 | 17 | 7.85 |
| Gemini 2.0 Flash Exp | 18 | 7.65 | 18 | 7.76 |
| DeepSeek-V3 | 20 | 7.62 | 19 | 7.74 |
| Qwen 2.5 Max | 19 | 7.64 | 20 | 7.74 |
| o1 | 21 | 7.57 | 21 | 7.68 |
| Mistral Large 2 | 22 | 7.54 | 22 | 7.65 |
| Gemma 2 27B | 23 | 7.49 | 23 | 7.60 |
| Qwen QwQ Preview | 24 | 7.44 | 24 | 7.55 |
| GPT-4o 2024-08-06 | 26 | 7.36 | 25 | 7.47 |
| GPT-4o mini | 25 | 7.37 | 26 | 7.46 |
| o1-mini | 27 | 7.30 | 27 | 7.44 |
| Claude 3 Opus | 28 | 7.17 | 28 | 7.30 |
| o3-mini-high | 30 | 6.99 | 29 | 7.12 |
| Grok 2 12-12 | 31 | 6.98 | 30 | 7.12 |
| Qwen 2.5 72B | 29 | 7.00 | 31 | 7.12 |
| o3-mini | 32 | 6.90 | 32 | 7.04 |
| Microsoft Phi-4 | 33 | 6.89 | 33 | 7.02 |
| Amazon Nova Pro | 34 | 6.70 | 34 | 6.84 |
| Llama 3.1 405B | 35 | 6.60 | 35 | 6.72 |
| Llama 3.3 70B | 36 | 5.95 | 36 | 6.08 |

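
The right-hand columns above amount to a per-LLM trimmed mean; a minimal sketch, assuming a DataFrame with one mean `score` per (`llm`, story) row (assumed names):

```python
import pandas as pd

def mean_excluding_worst(story_scores: pd.DataFrame, frac: float = 0.10) -> pd.Series:
    """Mean score per LLM after dropping each LLM's lowest-scoring `frac` of stories."""
    def trimmed(scores: pd.Series) -> float:
        keep = scores.sort_values(ascending=False)
        keep = keep.head(int(len(keep) * (1 - frac)))  # drop the worst 10%
        return keep.mean()

    return story_scores.groupby("llm")["score"].apply(trimmed).sort_values(ascending=False)
```
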
Excluding any one LLM from grading also does not significantly change the rankings. For example, here is what happens when Llama 3.1 405B is excluded:

| LLM | Rank (all graders) | Mean (all graders) | Rank (excl. Llama 3.1 405B) | Mean (excl. Llama 3.1 405B) |
| ------------------------------------| ----:| -----:| ----:| -----:|
| Claude 3.5 Sonnet 2024-10-22 | 3 | 8.47 | 3 | 8.25 |
| Qwen QwQ-32B 16K | 5 | 8.34 | 4 | 8.17 |
| Claude 3.7 Sonnet | 4 | 8.39 | 5 | 8.17 |
| Gemini 2.5 Pro Exp 03-25 | 6 | 8.30 | 6 | 8.13 |
| Gemma 3 27B | 7 | 8.22 | 7 | 8.05 |
| DeepSeek V3-0324 | 8 | 8.09 | 8 | 7.90 |
| Gemini 2.0 Pro Exp 02-05 | 9 | 8.08 | 9 | 7.87 |
| GPT-4.5 Preview | 10 | 8.07 | 10 | 7.80 |
| GPT-4o Feb 2025 | 13 | 7.96 | 11 | 7.78 |
| Claude 3.5 Haiku | 11 | 8.07 | 12 | 7.75 |
| Gemini 1.5 Pro (Sept) | 12 | 7.97 | 13 | 7.73 |
| GPT-4o 2024-11-20 | 15 | 7.87 | 14 | 7.69 |
| Gemini 2.0 Flash Thinking Exp Old | 14 | 7.87 | 15 | 7.64 |
| Gemini 2.0 Flash Thinking Exp 01-21 | 16 | 7.82 | 16 | 7.54 |
| o1-preview | 17 | 7.74 | 17 | 7.47 |
| Qwen 2.5 Max | 19 | 7.64 | 18 | 7.42 |
| DeepSeek-V3 | 20 | 7.62 | 19 | 7.36 |
| Gemini 2.0 Flash Exp | 18 | 7.65 | 20 | 7.36 |
| o1 | 21 | 7.57 | 21 | 7.29 |
| Gemma 2 27B | 23 | 7.49 | 22 | 7.29 |
| Mistral Large 2 | 22 | 7.54 | 23 | 7.24 |
| Qwen QwQ Preview | 24 | 7.44 | 24 | 7.18 |
| GPT-4o mini | 25 | 7.37 | 25 | 7.09 |
| GPT-4o 2024-08-06 | 26 | 7.36 | 26 | 7.03 |
| o1-mini | 27 | 7.30 | 27 | 6.91 |
| Claude 3 Opus | 28 | 7.17 | 28 | 6.84 |
| Qwen 2.5 72B | 29 | 7.00 | 29 | 6.66 |
| Grok 2 12-12 | 31 | 6.98 | 30 | 6.63 |
| o3-mini-high | 30 | 6.99 | 31 | 6.49 |
| Microsoft Phi-4 | 33 | 6.89 | 32 | 6.49 |
| o3-mini | 32 | 6.90 | 33 | 6.38 |
| Amazon Nova Pro | 34 | 6.70 | 34 | 6.34 |
| Llama 3.1 405B | 35 | 6.60 | 35 | 6.18 |
| Llama 3.3 70B | 36 | 5.95 | 36 | 5.41 |
| Claude 3 Haiku | 37 | 5.83 | 37 | 5.32 |

Normalizing each grader’s scores doesn’t significantly alter the rankings:
### Normalized Mean Leaderboard
| Rank | LLM | Normalized Mean |
| -----:| ------------------------| -----------------:|
| 1 | DeepSeek R1 | 0.935 |
| 2 | Claude 3.7 Sonnet Thinking 16K | 0.915 |
| 3 | Claude 3.5 Sonnet 2024-10-22 | 0.887 |
| 4 | Claude 3.7 Sonnet | 0.787 |
| 5 | Qwen QwQ-32B 16K | 0.715 |
| 6 | Gemini 2.5 Pro Exp 03-25 | 0.672 |
| 7 | Gemma 3 27B | 0.614 |
| 8 | GPT-4.5 Preview | 0.489 |
| 9 | Claude 3.5 Haiku | 0.485 |
| 10 | Gemini 2.0 Pro Exp 02-05 | 0.471 |
| 11 | DeepSeek V3-0324 | 0.452 |
| 12 | Gemini 1.5 Pro (Sept) | 0.386 |
| 13 | GPT-4o Feb 2025 | 0.348 |
| 14 | Gemini 2.0 Flash Thinking Exp Old | 0.290 |
| 15 | GPT-4o 2024-11-20 | 0.242 |
| 16 | Gemini 2.0 Flash Thinking Exp 01-21 | 0.241 |
| 17 | o1-preview | 0.174 |
| 18 | Gemini 2.0 Flash Exp | 0.097 |
| 19 | DeepSeek-V3 | 0.041 |
| 20 | Qwen 2.5 Max | 0.031 |
| 21 | o1 | -0.003 |
| 22 | Mistral Large 2 | -0.016 |
| 23 | Gemma 2 27B | -0.137 |
| 24 | Qwen QwQ Preview | -0.178 |
| 25 | GPT-4o 2024-08-06 | -0.197 |
| 26 | o1-mini | -0.204 |
| 27 | GPT-4o mini | -0.212 |
| 28 | Claude 3 Opus | -0.392 |
| 29 | o3-mini-high | -0.540 |
| 30 | Grok 2 12-12 | -0.549 |
| 31 | Qwen 2.5 72B | -0.561 |
| 32 | o3-mini | -0.602 |
| 33 | Microsoft Phi-4 | -0.655 |
| 34 | Amazon Nova Pro | -0.883 |
| 35 | Llama 3.1 405B | -0.912 |
| 36 | Llama 3.3 70B | -1.523 |
| 37 | Claude 3 Haiku | -1.706 |

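
The normalization rescales each grader's scores before averaging, so stricter and more lenient graders contribute on a comparable scale. A minimal sketch of one such scheme (a per-grader z-score; the exact formula and column names here are assumptions):

```python
import pandas as pd

# Long-format grades: one row per (grader, llm, story_id, question) with a score.
grades = pd.read_csv("grades.csv")  # assumed file and schema

# Rescale each grader's scores to zero mean and unit variance.
grades["z"] = grades.groupby("grader")["score"].transform(
    lambda s: (s - s.mean()) / s.std()
)

# Normalized mean per writing LLM.
print(grades.groupby("llm")["z"].mean().sort_values(ascending=False))
```
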
---
## Details
Full range of scores:

![Full range](/images/llm_overall_bar_start0_with_err.png)

---
## Limitations
It's important to note that each story is graded individually rather than as part of a set.

- [LLM Divergent Thinking Creativity Benchmark](https://github.com/lechmazur/divergent/)
---
## Updates

- Mar 26, 2025: Gemini 2.5 Pro Exp 03-25, DeepSeek V3-0324, o3-mini-high added.
- Mar 13, 2025: Gemma 3 27B added.
- Mar 10, 2025: Qwen QwQ-32B added.
- Feb 26, 2025: GPT-4.5 Preview added.