Commit 2b6d5ad ("Qwen 3")
Author: Lech
Parent: 1371b8b

1,012 files changed (+23063 / -126 lines)


README.md

Lines changed: 144 additions & 126 deletions
@@ -10,7 +10,7 @@ This benchmark tests how well large language models (LLMs) incorporate a set of
 Each LLM produces 500 short stories, each approximately 400–500 words long, that must organically incorporate all assigned random elements.
 In the updated April 2025 version of the benchmark, which uses newer grader LLMs, 27 of the latest models are evaluated. In the earlier version, 38 LLMs were assessed.
 
-Six LLMs grade each of these stories on 16 questions regarding:
+Seven LLMs grade each of these stories on 16 questions regarding:
 1. Character Development & Motivation
 2. Plot Structure & Coherence
 3. World & Atmosphere

@@ -26,6 +26,7 @@ The new grading LLMs are:
 4. DeepSeek V3-0324
 5. Grok 3 Beta (no reasoning)
 6. Gemini 2.5 Pro Exp
+7. Qwen 3 235B
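
The leaderboard below reports a single mean per writing LLM, built from the seven graders' answers to the 16 questions for each of the 500 stories. As a rough guide, here is a minimal sketch of one plausible aggregation; the file path and column names are assumptions for illustration, not the repository's actual data layout or grading script.

```python
# Illustrative sketch only: aggregate per-question grades into a per-LLM mean.
# Assumed columns: writer, story, grader, question, score.
import pandas as pd

def leaderboard_means(grades_csv: str) -> pd.Series:
    df = pd.read_csv(grades_csv)  # one row per (writer, story, grader, question)
    # Mean over the 16 questions for each (writer, story, grader) triple...
    per_grader = df.groupby(["writer", "story", "grader"])["score"].mean()
    # ...then over the seven graders for each story...
    per_story = per_grader.groupby(["writer", "story"]).mean()
    # ...then over each writer's 500 stories.
    return per_story.groupby("writer").mean().sort_values(ascending=False)
```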
@@ -36,33 +37,37 @@ The new grading LLMs are:
 **Leaderboard:**
 | Rank | LLM | Mean |
 |-----:|-------------------|------:|
-| 1 | o3 (medium reasoning) | 8.43 |
-| 2 | DeepSeek R1 | 8.34 |
-| 3 | GPT-4o Mar 2025 | 8.22 |
-| 4 | Claude 3.7 Sonnet Thinking 16K | 8.15 |
-| 5 | Gemini 2.5 Pro Exp 03-25 | 8.10 |
+| 1 | o3 (medium reasoning) | 8.39 |
+| 2 | Qwen 3 235B A22B | 8.30 |
+| 3 | DeepSeek R1 | 8.30 |
+| 4 | GPT-4o Mar 2025 | 8.18 |
+| 5 | Claude 3.7 Sonnet Thinking 16K | 8.11 |
 | 6 | Qwen QwQ-32B 16K | 8.07 |
-| 7 | Gemma 3 27B | 8.04 |
-| 8 | Claude 3.7 Sonnet | 8.00 |
-| 9 | DeepSeek V3-0324 | 7.78 |
-| 10 | Gemini 2.5 Flash Preview 24K | 7.72 |
-| 11 | Grok 3 Beta (no reasoning) | 7.71 |
-| 12 | GPT-4.5 Preview | 7.65 |
-| 13 | o4-mini (medium reasoning) | 7.60 |
-| 14 | Gemini 2.0 Flash Think Exp 01-21 | 7.49 |
-| 15 | Claude 3.5 Haiku | 7.49 |
-| 16 | Grok 3 Mini Beta (low) | 7.47 |
-| 17 | Qwen 2.5 Max | 7.42 |
-| 18 | Gemini 2.0 Flash Exp | 7.27 |
-| 19 | o1 (medium reasoning) | 7.15 |
-| 20 | Mistral Large 2 | 7.00 |
-| 21 | GPT-4o mini | 6.84 |
-| 22 | o1-mini | 6.64 |
-| 23 | Microsoft Phi-4 | 6.40 |
-| 24 | o3-mini (high reasoning) | 6.38 |
-| 25 | o3-mini (medium reasoning) | 6.36 |
-| 26 | Llama 4 Maverick | 6.35 |
-| 27 | Amazon Nova Pro | 6.22 |
+| 7 | Gemini 2.5 Pro Exp 03-25 | 8.05 |
+| 8 | Claude 3.5 Sonnet 2024-10-22 | 8.03 |
+| 9 | Gemma 3 27B | 7.99 |
+| 10 | Claude 3.7 Sonnet | 7.94 |
+| 11 | DeepSeek V3-0324 | 7.70 |
+| 12 | Gemini 2.5 Flash Preview 24K | 7.65 |
+| 13 | Grok 3 Beta (no reasoning) | 7.64 |
+| 14 | GPT-4.5 Preview | 7.56 |
+| 15 | Qwen 3 30B A3B | 7.53 |
+| 16 | o4-mini (medium reasoning) | 7.50 |
+| 17 | Gemini 2.0 Flash Think Exp 01-21 | 7.38 |
+| 18 | Claude 3.5 Haiku | 7.35 |
+| 19 | Grok 3 Mini Beta (low) | 7.35 |
+| 20 | Qwen 2.5 Max | 7.29 |
+| 21 | Gemini 2.0 Flash Exp | 7.15 |
+| 22 | o1 (medium reasoning) | 7.02 |
+| 23 | Mistral Large 2 | 6.90 |
+| 24 | GPT-4o mini | 6.72 |
+| 25 | o1-mini | 6.49 |
+| 26 | Grok 2 12-12 | 6.36 |
+| 27 | Microsoft Phi-4 | 6.26 |
+| 28 | Llama 4 Maverick | 6.20 |
+| 29 | o3-mini (high reasoning) | 6.17 |
+| 30 | o3-mini (medium reasoning) | 6.15 |
+| 31 | Amazon Nova Pro | 6.05 |
 ---
 
 ### Overall Strip Plot of Questions
@@ -121,7 +126,7 @@ Here, we list the top 3 and the bottom 3 individual stories (written by any LLM)
 ### Top 3 Individual Stories (All Graders)
 
 * **Story**: [story_403.txt](stories_wc/r1/story_403.txt) by DeepSeek R1
-  - Overall Mean (All Graders): 9.01
+  - Overall Mean (All Graders): 9.00
   - Grader Score Range: 7.76 (lowest: Grok 3 Beta (no reasoning)) .. 9.64 (highest: Llama 4 Maverick)
   - Required Elements:
     - Character: hope-worn knight
@@ -135,23 +140,8 @@ Here, we list the top 3 and the bottom 3 individual stories (written by any LLM)
     - Motivation: to escape the limitations of perception
     - Tone: joyful agony
 
-* **Story**: [story_364.txt](stories_wc/o3/story_364.txt) by o3 (medium reasoning)
-  - Overall Mean (All Graders): 8.98
-  - Grader Score Range: 6.30 (lowest: Grok 3 Beta (no reasoning)) .. 9.44 (highest: Claude 3.7 Sonnet)
-  - Required Elements:
-    - Character: skewed visionary
-    - Object: botanical sketches
-    - Core Concept: reexamining the familiar
-    - Attribute: cryptically clear
-    - Action: advise
-    - Method: by following smudged hieroglyphs on broken pottery
-    - Setting: temporal anomaly study
-    - Timeframe: across the hush of a silent revolution
-    - Motivation: to photograph vanishing trades
-    - Tone: mundane miracles
-
 * **Story**: [story_15.txt](stories_wc/r1/story_15.txt) by DeepSeek R1
-  - Overall Mean (All Graders): 8.96
+  - Overall Mean (All Graders): 8.99
   - Grader Score Range: 7.57 (lowest: Grok 3 Beta (no reasoning)) .. 9.50 (highest: Llama 4 Maverick)
   - Required Elements:
     - Character: gracious widow
@@ -165,12 +155,27 @@ Here, we list the top 3 and the bottom 3 individual stories (written by any LLM)
     - Motivation: to defy the gods
     - Tone: serious playfulness
 
+* **Story**: [story_360.txt](stories_wc/o3/story_360.txt) by o3 (medium reasoning)
+  - Overall Mean (All Graders): 8.93
+  - Grader Score Range: 6.97 (lowest: Grok 3 Beta (no reasoning)) .. 9.50 (highest: Claude 3.7 Sonnet)
+  - Required Elements:
+    - Character: meek necromancer
+    - Object: fountain pen with a broken nib
+    - Core Concept: the tangled tapestry
+    - Attribute: peculiarly sincere
+    - Action: regain
+    - Method: through the way light reflects off a dew drop
+    - Setting: underground city of the goblins
+    - Timeframe: amid playground sounds
+    - Motivation: to decode a universal riddle
+    - Tone: mystic simplicity
+
 
 ### Bottom 3 Individual Stories (All Graders)
 
 * **Story**: [story_150.txt](stories_wc/nova-pro/story_150.txt) by Amazon Nova Pro. 4.44
-* **Story**: [story_431.txt](stories_wc/nova-pro/story_431.txt) by Amazon Nova Pro. 4.86
-* **Story**: [story_412.txt](stories_wc/nova-pro/story_412.txt) by Amazon Nova Pro. 4.92
+* **Story**: [story_194.txt](stories_wc/o3-mini/story_194.txt) by o3-mini (medium reasoning). 4.59
+* **Story**: [story_431.txt](stories_wc/nova-pro/story_431.txt) by Amazon Nova Pro. 4.71
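
The per-story numbers above ("Overall Mean (All Graders)" and "Grader Score Range") are simple reductions over the graders' scores for a single story. A minimal sketch, assuming each grader's per-story mean over the 16 questions is already available:

```python
# Illustrative sketch: summarize one story from its per-grader mean scores.
# `grader_means` (grader name -> mean over the 16 questions) is an assumed structure.
def story_summary(grader_means: dict[str, float]) -> str:
    overall = sum(grader_means.values()) / len(grader_means)
    low = min(grader_means, key=grader_means.get)    # harshest grader for this story
    high = max(grader_means, key=grader_means.get)   # most generous grader
    return (f"Overall Mean (All Graders): {overall:.2f}; "
            f"Grader Score Range: {grader_means[low]:.2f} (lowest: {low}) "
            f".. {grader_means[high]:.2f} (highest: {high})")
```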
 
 
 ---
@@ -206,67 +211,75 @@ Excluding 10% worst stories per LLM does not significantly change the rankings:
 
 | LLM Full | Old Rank | Old Mean | New Rank | New Mean |
 |----------|---------:|---------:|---------:|---------:|
-| o3 (medium reasoning) | 1 | 8.43 | 1 | 8.47 |
-| DeepSeek R1 | 2 | 8.34 | 2 | 8.39 |
-| GPT-4o Mar 2025 | 3 | 8.22 | 3 | 8.27 |
-| Claude 3.7 Sonnet Thinking 16K | 4 | 8.15 | 4 | 8.20 |
-| Gemini 2.5 Pro Exp 03-25 | 5 | 8.10 | 5 | 8.16 |
+| o3 (medium reasoning) | 1 | 8.39 | 1 | 8.44 |
+| Qwen 3 235B A22B | 2 | 8.30 | 2 | 8.36 |
+| DeepSeek R1 | 3 | 8.30 | 3 | 8.36 |
+| GPT-4o Mar 2025 | 4 | 8.18 | 4 | 8.23 |
+| Claude 3.7 Sonnet Thinking 16K | 5 | 8.11 | 5 | 8.17 |
 | Qwen QwQ-32B 16K | 6 | 8.07 | 6 | 8.13 |
-| Gemma 3 27B | 7 | 8.04 | 7 | 8.10 |
-| Claude 3.7 Sonnet | 8 | 8.00 | 8 | 8.05 |
-| DeepSeek V3-0324 | 9 | 7.78 | 9 | 7.84 |
-| Gemini 2.5 Flash Preview 24K | 10 | 7.72 | 10 | 7.79 |
-| Grok 3 Beta (no reasoning) | 11 | 7.71 | 11 | 7.78 |
-| GPT-4.5 Preview | 12 | 7.65 | 12 | 7.72 |
-| o4-mini (medium reasoning) | 13 | 7.60 | 13 | 7.68 |
-| Gemini 2.0 Flash Think Exp 01-21 | 14 | 7.49 | 14 | 7.57 |
-| Claude 3.5 Haiku | 15 | 7.49 | 15 | 7.56 |
-| Grok 3 Mini Beta (low) | 16 | 7.47 | 16 | 7.54 |
-| Qwen 2.5 Max | 17 | 7.42 | 17 | 7.49 |
-| Gemini 2.0 Flash Exp | 18 | 7.27 | 18 | 7.35 |
-| o1 (medium reasoning) | 19 | 7.15 | 19 | 7.23 |
-| Mistral Large 2 | 20 | 7.01 | 20 | 7.10 |
-| GPT-4o mini | 21 | 6.84 | 21 | 6.92 |
-| o1-mini | 22 | 6.64 | 22 | 6.73 |
-| Microsoft Phi-4 | 23 | 6.40 | 23 | 6.50 |
-| o3-mini (high reasoning) | 24 | 6.38 | 24 | 6.47 |
-| o3-mini (medium reasoning) | 25 | 6.36 | 25 | 6.44 |
-| Llama 4 Maverick | 26 | 6.35 | 26 | 6.44 |
-| Amazon Nova Pro | 27 | 6.22 | 27 | 6.32 |
+| Gemini 2.5 Pro Exp 03-25 | 7 | 8.05 | 7 | 8.11 |
+| Claude 3.5 Sonnet 2024-10-22 | 8 | 8.03 | 8 | 8.09 |
+| Gemma 3 27B | 9 | 7.99 | 9 | 8.06 |
+| Claude 3.7 Sonnet | 10 | 7.94 | 10 | 8.00 |
+| DeepSeek V3-0324 | 11 | 7.69 | 11 | 7.77 |
+| Gemini 2.5 Flash Preview 24K | 12 | 7.65 | 12 | 7.73 |
+| Grok 3 Beta (no reasoning) | 13 | 7.64 | 13 | 7.70 |
+| GPT-4.5 Preview | 14 | 7.56 | 14 | 7.63 |
+| Qwen 3 30B A3B | 15 | 7.53 | 15 | 7.61 |
+| o4-mini (medium reasoning) | 16 | 7.50 | 16 | 7.58 |
+| Gemini 2.0 Flash Think Exp 01-21 | 17 | 7.38 | 17 | 7.47 |
+| Claude 3.5 Haiku | 18 | 7.35 | 18 | 7.43 |
+| Grok 3 Mini Beta (low) | 19 | 7.35 | 19 | 7.42 |
+| Qwen 2.5 Max | 20 | 7.29 | 20 | 7.37 |
+| Gemini 2.0 Flash Exp | 21 | 7.15 | 21 | 7.24 |
+| o1 (medium reasoning) | 22 | 7.02 | 22 | 7.11 |
+| Mistral Large 2 | 23 | 6.90 | 23 | 7.00 |
+| GPT-4o mini | 24 | 6.72 | 24 | 6.80 |
+| o1-mini | 25 | 6.49 | 25 | 6.58 |
+| Grok 2 12-12 | 26 | 6.36 | 26 | 6.46 |
+| Microsoft Phi-4 | 27 | 6.26 | 27 | 6.35 |
+| Llama 4 Maverick | 28 | 6.20 | 28 | 6.29 |
+| o3-mini (high reasoning) | 29 | 6.17 | 29 | 6.26 |
+| o3-mini (medium reasoning) | 30 | 6.15 | 30 | 6.24 |
+| Amazon Nova Pro | 31 | 6.05 | 31 | 6.15 |
 
 
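The robustness check above recomputes each writer's mean after discarding its 10% lowest-scoring stories. A minimal sketch of that trimming step, with the data layout assumed for illustration:

```python
# Illustrative sketch of the "exclude the 10% worst stories per LLM" check.
# `story_means` maps each writer LLM to its per-story mean scores (assumed structure).
import numpy as np

def trimmed_means(story_means: dict[str, list[float]], drop_frac: float = 0.10) -> dict[str, float]:
    trimmed = {}
    for llm, scores in story_means.items():
        ordered = np.sort(np.asarray(scores, dtype=float))  # ascending: worst stories first
        kept = ordered[int(len(ordered) * drop_frac):]      # drop the bottom 10%
        trimmed[llm] = float(kept.mean())
    return trimmed
```
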
 Excluding any one LLM from grading also does not significantly change the rankings. For example, here is what happens when Llama 4 Maverick is excluded:
 ### Ranking after Excluding Llama 4 Maverick from Grading
 
 | LLM | Old Rank | Old Mean | New Rank | New Mean |
 |--------------------|---------:|---------:|---------:|---------:|
-| o3 (medium reasoning) | 1 | 8.43 | 1 | 8.30 |
-| DeepSeek R1 | 2 | 8.34 | 2 | 8.21 |
-| GPT-4o Mar 2025 | 3 | 8.22 | 3 | 8.07 |
-| Claude 3.7 Sonnet Thinking 16K | 4 | 8.15 | 4 | 8.00 |
-| Gemini 2.5 Pro Exp 03-25 | 5 | 8.10 | 5 | 7.95 |
-| Qwen QwQ-32B 16K | 6 | 8.07 | 6 | 7.91 |
-| Gemma 3 27B | 7 | 8.04 | 7 | 7.89 |
-| Claude 3.7 Sonnet | 8 | 8.00 | 8 | 7.82 |
-| DeepSeek V3-0324 | 9 | 7.78 | 9 | 7.57 |
-| Gemini 2.5 Flash Preview 24K | 10 | 7.72 | 10 | 7.50 |
-| Grok 3 Beta (no reasoning) | 11 | 7.71 | 11 | 7.49 |
-| GPT-4.5 Preview | 12 | 7.65 | 12 | 7.43 |
-| o4-mini (medium reasoning) | 13 | 7.60 | 13 | 7.34 |
-| Gemini 2.0 Flash Think Exp 01-21 | 14 | 7.49 | 14 | 7.23 |
-| Claude 3.5 Haiku | 15 | 7.49 | 15 | 7.23 |
-| Grok 3 Mini Beta (low) | 16 | 7.47 | 16 | 7.20 |
-| Qwen 2.5 Max | 17 | 7.42 | 17 | 7.19 |
-| Gemini 2.0 Flash Exp | 18 | 7.27 | 18 | 6.98 |
-| o1 (medium reasoning) | 19 | 7.15 | 19 | 6.84 |
-| Mistral Large 2 | 20 | 7.00 | 20 | 6.70 |
-| GPT-4o mini | 21 | 6.84 | 21 | 6.52 |
-| o1-mini | 22 | 6.64 | 22 | 6.24 |
-| Microsoft Phi-4 | 23 | 6.40 | 23 | 6.00 |
-| Llama 4 Maverick | 26 | 6.35 | 24 | 5.94 |
-| o3-mini (high reasoning) | 24 | 6.38 | 25 | 5.93 |
-| o3-mini (medium reasoning) | 25 | 6.36 | 26 | 5.90 |
-| Amazon Nova Pro | 27 | 6.22 | 27 | 5.80 |
+| o3 (medium reasoning) | 1 | 8.39 | 1 | 8.27 |
+| DeepSeek R1 | 3 | 8.30 | 2 | 8.19 |
+| Qwen 3 235B A22B | 2 | 8.30 | 3 | 8.19 |
+| GPT-4o Mar 2025 | 4 | 8.18 | 4 | 8.04 |
+| Claude 3.7 Sonnet Thinking 16K | 5 | 8.11 | 5 | 7.98 |
+| Gemini 2.5 Pro Exp 03-25 | 7 | 8.05 | 6 | 7.92 |
+| Qwen QwQ-32B 16K | 6 | 8.07 | 7 | 7.91 |
+| Claude 3.5 Sonnet 2024-10-22 | 8 | 8.03 | 8 | 7.89 |
+| Gemma 3 27B | 9 | 7.99 | 9 | 7.85 |
+| Claude 3.7 Sonnet | 10 | 7.94 | 10 | 7.78 |
+| DeepSeek V3-0324 | 11 | 7.69 | 11 | 7.51 |
+| Gemini 2.5 Flash Preview 24K | 12 | 7.65 | 12 | 7.46 |
+| Grok 3 Beta (no reasoning) | 13 | 7.64 | 13 | 7.44 |
+| GPT-4.5 Preview | 14 | 7.56 | 14 | 7.36 |
+| Qwen 3 30B A3B | 15 | 7.53 | 15 | 7.32 |
+| o4-mini (medium reasoning) | 16 | 7.50 | 16 | 7.26 |
+| Gemini 2.0 Flash Think Exp 01-21 | 17 | 7.38 | 17 | 7.14 |
+| Claude 3.5 Haiku | 18 | 7.35 | 18 | 7.11 |
+| Grok 3 Mini Beta (low) | 19 | 7.35 | 19 | 7.10 |
+| Qwen 2.5 Max | 20 | 7.29 | 20 | 7.08 |
+| Gemini 2.0 Flash Exp | 21 | 7.15 | 21 | 6.89 |
+| o1 (medium reasoning) | 22 | 7.02 | 22 | 6.74 |
+| Mistral Large 2 | 23 | 6.90 | 23 | 6.63 |
+| GPT-4o mini | 24 | 6.72 | 24 | 6.43 |
+| o1-mini | 25 | 6.49 | 25 | 6.13 |
+| Grok 2 12-12 | 26 | 6.36 | 26 | 6.03 |
+| Microsoft Phi-4 | 27 | 6.26 | 27 | 5.90 |
+| Llama 4 Maverick | 28 | 6.20 | 28 | 5.83 |
+| o3-mini (high reasoning) | 29 | 6.17 | 29 | 5.76 |
+| o3-mini (medium reasoning) | 30 | 6.15 | 30 | 5.73 |
+| Amazon Nova Pro | 31 | 6.05 | 31 | 5.67 |
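
The table above re-ranks the writers after dropping a single grader and averaging over the remaining six. A minimal sketch of that leave-one-grader-out recomputation, with the data structure assumed:

```python
# Illustrative sketch of the "exclude one grader" check.
# `per_grader_means[writer][grader]` holds that grader's mean score for the
# writer (assumed structure, not the repository's actual code).
def means_without_grader(per_grader_means: dict[str, dict[str, float]],
                         excluded: str) -> dict[str, float]:
    result = {}
    for writer, by_grader in per_grader_means.items():
        kept = [score for grader, score in by_grader.items() if grader != excluded]
        result[writer] = sum(kept) / len(kept)
    return result

# e.g. means_without_grader(per_grader_means, "Llama 4 Maverick")
```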
 
 Normalizing each grader’s scores doesn’t significantly alter the rankings:
 
@@ -275,33 +288,37 @@ Normalizing each grader’s scores doesn’t significantly alter the rankings:
 
 | Rank | LLM | Normalized Mean |
 |-----:|------------------------|-----------------:|
-| 1 | o3 (medium reasoning) | 1.141 |
-| 2 | DeepSeek R1 | 1.029 |
-| 3 | GPT-4o Mar 2025 | 0.927 |
-| 4 | Claude 3.7 Sonnet Thinking 16K | 0.820 |
-| 5 | Qwen QwQ-32B 16K | 0.734 |
-| 6 | Gemini 2.5 Pro Exp 03-25 | 0.720 |
-| 7 | Gemma 3 27B | 0.663 |
-| 8 | Claude 3.7 Sonnet | 0.651 |
-| 9 | DeepSeek V3-0324 | 0.402 |
-| 10 | Gemini 2.5 Flash Preview 24K | 0.330 |
-| 11 | Grok 3 Beta (no reasoning) | 0.327 |
-| 12 | GPT-4.5 Preview | 0.286 |
-| 13 | o4-mini (medium reasoning) | 0.257 |
-| 14 | Grok 3 Mini Beta (low) | 0.112 |
-| 15 | Claude 3.5 Haiku | 0.105 |
-| 16 | Gemini 2.0 Flash Think Exp 01-21 | 0.096 |
-| 17 | Qwen 2.5 Max | -0.074 |
-| 18 | Gemini 2.0 Flash Exp | -0.138 |
-| 19 | o1 (medium reasoning) | -0.296 |
-| 20 | Mistral Large 2 | -0.503 |
-| 21 | GPT-4o mini | -0.727 |
-| 22 | o1-mini | -0.810 |
-| 23 | o3-mini (high reasoning) | -1.071 |
-| 24 | o3-mini (medium reasoning) | -1.088 |
-| 25 | Microsoft Phi-4 | -1.180 |
-| 26 | Llama 4 Maverick | -1.269 |
-| 27 | Amazon Nova Pro | -1.442 |
+| 1 | o3 (medium reasoning) | 1.098 |
+| 2 | DeepSeek R1 | 0.992 |
+| 3 | Qwen 3 235B A22B | 0.992 |
+| 4 | GPT-4o Mar 2025 | 0.889 |
+| 5 | Claude 3.7 Sonnet Thinking 16K | 0.796 |
+| 6 | Claude 3.5 Sonnet 2024-10-22 | 0.718 |
+| 7 | Gemini 2.5 Pro Exp 03-25 | 0.700 |
+| 8 | Qwen QwQ-32B 16K | 0.698 |
+| 9 | Gemma 3 27B | 0.641 |
+| 10 | Claude 3.7 Sonnet | 0.621 |
+| 11 | DeepSeek V3-0324 | 0.369 |
+| 12 | Gemini 2.5 Flash Preview 24K | 0.310 |
+| 13 | Grok 3 Beta (no reasoning) | 0.303 |
+| 14 | GPT-4.5 Preview | 0.249 |
+| 15 | Qwen 3 30B A3B | 0.224 |
+| 16 | o4-mini (medium reasoning) | 0.209 |
+| 17 | Grok 3 Mini Beta (low) | 0.058 |
+| 18 | Gemini 2.0 Flash Think Exp 01-21 | 0.054 |
+| 19 | Claude 3.5 Haiku | 0.044 |
+| 20 | Qwen 2.5 Max | -0.107 |
+| 21 | Gemini 2.0 Flash Exp | -0.175 |
+| 22 | o1 (medium reasoning) | -0.323 |
+| 23 | Mistral Large 2 | -0.496 |
+| 24 | GPT-4o mini | -0.717 |
+| 25 | o1-mini | -0.835 |
+| 26 | Grok 2 12-12 | -1.093 |
+| 27 | o3-mini (high reasoning) | -1.129 |
+| 28 | o3-mini (medium reasoning) | -1.144 |
+| 29 | Microsoft Phi-4 | -1.167 |
+| 30 | Llama 4 Maverick | -1.253 |
+| 31 | Amazon Nova Pro | -1.428 |
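
The normalized leaderboard above rescales each grader's scores before averaging, so a systematically harsh or lenient grader carries the same weight. A common way to do this is per-grader z-scoring; the sketch below illustrates that approach, and whether the benchmark uses exactly this scheme (and these column names) is an assumption.

```python
# Illustrative sketch of per-grader normalization via z-scores.
# Assumed layout: one row per observation with columns writer, grader, score.
import pandas as pd

def normalized_means(df: pd.DataFrame) -> pd.Series:
    # Rescale within each grader: subtract that grader's mean, divide by its std.
    z = df.groupby("grader")["score"].transform(lambda s: (s - s.mean()) / s.std())
    # Average the z-scores per writer to get a normalized leaderboard.
    return z.groupby(df["writer"]).mean().sort_values(ascending=False)
```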
 
 
 ---
@@ -383,6 +400,7 @@ It's important to note that each story is graded individually rather than as par
 - [LLM Divergent Thinking Creativity Benchmark](https://github.com/lechmazur/divergent/)
 ---
 ## Updates
+- May 1, 2025: Qwen 3 models added. Qwen 3 235B added as a grader.
 - Apr 24, 2025: Major update: grader LLMs replaced with newer versions, additional specific grading criteria, 0.1 grading granularity, summaries. Added: o3, o4-mini, Gemini 2.5 Flash Preview 16K.
 - Apr 11, 2025: Grok 3 added.
 - Apr 6, 2025: Llama 4 Maverick added. Some older models excluded from charts.
The commit also regenerates several chart images, including images/llm_best_pie.png and images/normalized_scores_strip.png (previews not shown).