@@ -63,29 +63,30 @@ The new grading LLMs are:
| 24 | Gemma 3 27B | 7.99 |
| 25 | Claude 3.7 Sonnet | 7.94 |
| 26 | Mistral Medium 3 | 7.73 |
- | 27 | DeepSeek V3-0324 | 7.70 |
- | 28 | Grok 4 | 7.69 |
- | 29 | Gemini 2.5 Flash | 7.65 |
- | 30 | Grok 3 Beta (no reasoning) | 7.64 |
- | 31 | GPT-4.5 Preview | 7.56 |
- | 32 | Qwen 3 30B A3B | 7.53 |
- | 33 | o4-mini (medium reasoning) | 7.50 |
- | 34 | Gemini 2.0 Flash Think Exp 01-21 | 7.38 |
- | 35 | Claude 3.5 Haiku | 7.35 |
- | 36 | Grok 3 Mini Beta (low) | 7.35 |
- | 37 | GLM-4.5 | 7.34 |
- | 38 | Qwen 2.5 Max | 7.29 |
- | 39 | Gemini 2.0 Flash Exp | 7.15 |
- | 40 | o1 (medium reasoning) | 7.02 |
- | 41 | Mistral Large 2 | 6.90 |
- | 42 | GPT-4o mini | 6.72 |
- | 43 | o1-mini | 6.49 |
- | 44 | Grok 2 12-12 | 6.36 |
- | 45 | Microsoft Phi-4 | 6.26 |
- | 46 | Llama 4 Maverick | 6.20 |
- | 47 | o3-mini (high reasoning) | 6.17 |
- | 48 | o3-mini (medium reasoning) | 6.15 |
- | 49 | Amazon Nova Pro | 6.05 |
+ | 27 | GPT-OSS-120B | 7.71 |
+ | 28 | DeepSeek V3-0324 | 7.70 |
+ | 29 | Grok 4 | 7.69 |
+ | 30 | Gemini 2.5 Flash | 7.65 |
+ | 31 | Grok 3 Beta (no reasoning) | 7.64 |
+ | 32 | GPT-4.5 Preview | 7.56 |
+ | 33 | Qwen 3 30B A3B | 7.53 |
+ | 34 | o4-mini (medium reasoning) | 7.50 |
+ | 35 | Gemini 2.0 Flash Think Exp 01-21 | 7.38 |
+ | 36 | Claude 3.5 Haiku | 7.35 |
+ | 37 | Grok 3 Mini Beta (low) | 7.35 |
+ | 38 | GLM-4.5 | 7.34 |
+ | 39 | Qwen 2.5 Max | 7.29 |
+ | 40 | Gemini 2.0 Flash Exp | 7.15 |
+ | 41 | o1 (medium reasoning) | 7.02 |
+ | 42 | Mistral Large 2 | 6.90 |
+ | 43 | GPT-4o mini | 6.72 |
+ | 44 | o1-mini | 6.49 |
+ | 45 | Grok 2 12-12 | 6.36 |
+ | 46 | Microsoft Phi-4 | 6.26 |
+ | 47 | Llama 4 Maverick | 6.20 |
+ | 48 | o3-mini (high reasoning) | 6.17 |
+ | 49 | o3-mini (medium reasoning) | 6.15 |
+ | 50 | Amazon Nova Pro | 6.05 |
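
As a reading aid, here is a minimal sketch of how a mean-score leaderboard like this can be computed. The long-format schema, the column names, and the toy scores are assumptions for illustration, not the benchmark's actual data layout:

```python
import pandas as pd

# Toy long-format grades: one row per (writer, grader) score.
# Writers, graders, and scores here are hypothetical.
grades = pd.DataFrame({
    "writer": ["Gemma 3 27B", "Gemma 3 27B", "Amazon Nova Pro", "Amazon Nova Pro"],
    "grader": ["judge-1", "judge-2", "judge-1", "judge-2"],
    "score":  [8.1, 7.9, 6.0, 6.1],
})

# Average every grade each writing LLM received, then sort best-first.
leaderboard = (
    grades.groupby("writer")["score"]
    .mean()
    .sort_values(ascending=False)
    .reset_index(name="mean_score")
)
leaderboard.index = leaderboard.index + 1  # 1-based ranks, as in the table
print(leaderboard)
```
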

---

### Overall Strip Plot of Questions
@@ -255,29 +256,30 @@ Excluding 10% worst stories per LLM does not significantly change the rankings:
| Gemma 3 27B | 24 | 7.99 | 24 | 8.06 |
| Claude 3.7 Sonnet | 25 | 7.94 | 25 | 8.00 |
| Mistral Medium 3 | 26 | 7.73 | 26 | 7.82 |
- | Grok 4 | 28 | 7.69 | 27 | 7.77 |
- | DeepSeek V3-0324 | 27 | 7.69 | 28 | 7.77 |
- | Gemini 2.5 Flash | 29 | 7.65 | 29 | 7.73 |
- | Grok 3 Beta (no reasoning) | 30 | 7.64 | 30 | 7.70 |
- | GPT-4.5 Preview | 31 | 7.56 | 31 | 7.63 |
- | Qwen 3 30B A3B | 32 | 7.53 | 32 | 7.61 |
- | o4-mini (medium reasoning) | 33 | 7.50 | 33 | 7.58 |
- | Gemini 2.0 Flash Think Exp 01-21 | 34 | 7.38 | 34 | 7.47 |
- | GLM-4.5 | 37 | 7.34 | 35 | 7.44 |
- | Claude 3.5 Haiku | 35 | 7.35 | 36 | 7.43 |
- | Grok 3 Mini Beta (low) | 36 | 7.35 | 37 | 7.42 |
- | Qwen 2.5 Max | 38 | 7.29 | 38 | 7.37 |
- | Gemini 2.0 Flash Exp | 39 | 7.15 | 39 | 7.24 |
- | o1 (medium reasoning) | 40 | 7.02 | 40 | 7.11 |
- | Mistral Large 2 | 41 | 6.90 | 41 | 7.00 |
- | GPT-4o mini | 42 | 6.72 | 42 | 6.80 |
- | o1-mini | 43 | 6.49 | 43 | 6.58 |
- | Grok 2 12-12 | 44 | 6.36 | 44 | 6.46 |
- | Microsoft Phi-4 | 45 | 6.26 | 45 | 6.35 |
- | Llama 4 Maverick | 46 | 6.20 | 46 | 6.29 |
- | o3-mini (high reasoning) | 47 | 6.17 | 47 | 6.26 |
- | o3-mini (medium reasoning) | 48 | 6.15 | 48 | 6.24 |
- | Amazon Nova Pro | 49 | 6.05 | 49 | 6.15 |
+ | GPT-OSS-120B | 27 | 7.71 | 27 | 7.79 |
+ | Grok 4 | 29 | 7.69 | 28 | 7.77 |
+ | DeepSeek V3-0324 | 28 | 7.69 | 29 | 7.77 |
+ | Gemini 2.5 Flash | 30 | 7.65 | 30 | 7.73 |
+ | Grok 3 Beta (no reasoning) | 31 | 7.64 | 31 | 7.70 |
+ | GPT-4.5 Preview | 32 | 7.56 | 32 | 7.63 |
+ | Qwen 3 30B A3B | 33 | 7.53 | 33 | 7.61 |
+ | o4-mini (medium reasoning) | 34 | 7.50 | 34 | 7.58 |
+ | Gemini 2.0 Flash Think Exp 01-21 | 35 | 7.38 | 35 | 7.47 |
+ | GLM-4.5 | 38 | 7.34 | 36 | 7.44 |
+ | Claude 3.5 Haiku | 36 | 7.35 | 37 | 7.43 |
+ | Grok 3 Mini Beta (low) | 37 | 7.35 | 38 | 7.42 |
+ | Qwen 2.5 Max | 39 | 7.29 | 39 | 7.37 |
+ | Gemini 2.0 Flash Exp | 40 | 7.15 | 40 | 7.24 |
+ | o1 (medium reasoning) | 41 | 7.02 | 41 | 7.11 |
+ | Mistral Large 2 | 42 | 6.90 | 42 | 7.00 |
+ | GPT-4o mini | 43 | 6.72 | 43 | 6.80 |
+ | o1-mini | 44 | 6.49 | 44 | 6.58 |
+ | Grok 2 12-12 | 45 | 6.36 | 45 | 6.46 |
+ | Microsoft Phi-4 | 46 | 6.26 | 46 | 6.35 |
+ | Llama 4 Maverick | 47 | 6.20 | 47 | 6.29 |
+ | o3-mini (high reasoning) | 48 | 6.17 | 48 | 6.26 |
+ | o3-mini (medium reasoning) | 49 | 6.15 | 49 | 6.24 |
+ | Amazon Nova Pro | 50 | 6.05 | 50 | 6.15 |
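
The 10%-worst exclusion amounts to a per-writer trimmed mean over story-level averages. A minimal sketch under the same assumed long-format schema; `story_id`, `drop_frac`, and the toy scores are illustrative:

```python
import pandas as pd

# Ten hypothetical stories by one writer; story 9 is an outlier flop.
grades = pd.DataFrame({
    "writer":   ["A"] * 10,
    "story_id": list(range(10)),
    "score":    [8.0, 8.1, 7.9, 8.2, 7.8, 8.0, 8.3, 7.7, 8.1, 3.0],
})

def trimmed_mean(df: pd.DataFrame, drop_frac: float = 0.10) -> float:
    """Average each story over its graders, drop the worst drop_frac
    fraction of stories, and average the rest."""
    story_means = df.groupby("story_id")["score"].mean().sort_values()
    n_drop = int(len(story_means) * drop_frac)
    return story_means.iloc[n_drop:].mean()

trimmed = pd.Series(
    {writer: trimmed_mean(g) for writer, g in grades.groupby("writer")}
).sort_values(ascending=False)
print(trimmed)  # story 9 (score 3.0) is excluded from writer A's mean
```
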
Excluding any one LLM from grading also does not significantly change the rankings. For example, here is what happens when Llama 4 Maverick is excluded:
@@ -311,29 +313,30 @@ Excluding any one LLM from grading also does not significantly change the rankin
| Gemma 3 27B | 24 | 7.99 | 24 | 7.85 |
| Claude 3.7 Sonnet | 25 | 7.94 | 25 | 7.78 |
| Mistral Medium 3 | 26 | 7.73 | 26 | 7.55 |
- | DeepSeek V3-0324 | 27 | 7.69 | 27 | 7.51 |
- | Grok 4 | 28 | 7.69 | 28 | 7.49 |
- | Gemini 2.5 Flash | 29 | 7.65 | 29 | 7.46 |
- | Grok 3 Beta (no reasoning) | 30 | 7.64 | 30 | 7.44 |
- | GPT-4.5 Preview | 31 | 7.56 | 31 | 7.36 |
- | Qwen 3 30B A3B | 32 | 7.53 | 32 | 7.32 |
- | o4-mini (medium reasoning) | 33 | 7.50 | 33 | 7.26 |
- | Gemini 2.0 Flash Think Exp 01-21 | 34 | 7.38 | 34 | 7.14 |
- | Claude 3.5 Haiku | 35 | 7.35 | 35 | 7.11 |
- | GLM-4.5 | 37 | 7.34 | 36 | 7.10 |
- | Grok 3 Mini Beta (low) | 36 | 7.35 | 37 | 7.10 |
- | Qwen 2.5 Max | 38 | 7.29 | 38 | 7.08 |
- | Gemini 2.0 Flash Exp | 39 | 7.15 | 39 | 6.89 |
- | o1 (medium reasoning) | 40 | 7.02 | 40 | 6.74 |
- | Mistral Large 2 | 41 | 6.90 | 41 | 6.63 |
- | GPT-4o mini | 42 | 6.72 | 42 | 6.43 |
- | o1-mini | 43 | 6.49 | 43 | 6.13 |
- | Grok 2 12-12 | 44 | 6.36 | 44 | 6.03 |
- | Microsoft Phi-4 | 45 | 6.26 | 45 | 5.90 |
- | Llama 4 Maverick | 46 | 6.20 | 46 | 5.83 |
- | o3-mini (high reasoning) | 47 | 6.17 | 47 | 5.76 |
- | o3-mini (medium reasoning) | 48 | 6.15 | 48 | 5.73 |
- | Amazon Nova Pro | 49 | 6.05 | 49 | 5.67 |
+ | DeepSeek V3-0324 | 28 | 7.69 | 27 | 7.51 |
+ | GPT-OSS-120B | 27 | 7.71 | 28 | 7.51 |
+ | Grok 4 | 29 | 7.69 | 29 | 7.49 |
+ | Gemini 2.5 Flash | 30 | 7.65 | 30 | 7.46 |
+ | Grok 3 Beta (no reasoning) | 31 | 7.64 | 31 | 7.44 |
+ | GPT-4.5 Preview | 32 | 7.56 | 32 | 7.36 |
+ | Qwen 3 30B A3B | 33 | 7.53 | 33 | 7.32 |
+ | o4-mini (medium reasoning) | 34 | 7.50 | 34 | 7.26 |
+ | Gemini 2.0 Flash Think Exp 01-21 | 35 | 7.38 | 35 | 7.14 |
+ | Claude 3.5 Haiku | 36 | 7.35 | 36 | 7.11 |
+ | GLM-4.5 | 38 | 7.34 | 37 | 7.10 |
+ | Grok 3 Mini Beta (low) | 37 | 7.35 | 38 | 7.10 |
+ | Qwen 2.5 Max | 39 | 7.29 | 39 | 7.08 |
+ | Gemini 2.0 Flash Exp | 40 | 7.15 | 40 | 6.89 |
+ | o1 (medium reasoning) | 41 | 7.02 | 41 | 6.74 |
+ | Mistral Large 2 | 42 | 6.90 | 42 | 6.63 |
+ | GPT-4o mini | 43 | 6.72 | 43 | 6.43 |
+ | o1-mini | 44 | 6.49 | 44 | 6.13 |
+ | Grok 2 12-12 | 45 | 6.36 | 45 | 6.03 |
+ | Microsoft Phi-4 | 46 | 6.26 | 46 | 5.90 |
+ | Llama 4 Maverick | 47 | 6.20 | 47 | 5.83 |
+ | o3-mini (high reasoning) | 48 | 6.17 | 48 | 5.76 |
+ | o3-mini (medium reasoning) | 49 | 6.15 | 49 | 5.73 |
+ | Amazon Nova Pro | 50 | 6.05 | 50 | 5.67 |
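
The leave-one-grader-out check is a loop: drop one grader's scores, recompute the means, and compare the resulting ranks. A sketch with toy data in which `judge-3` is deliberately out of line with the other two; a robust ranking should barely move between iterations:

```python
import pandas as pd

# Hypothetical grades from three graders for two writers.
grades = pd.DataFrame({
    "writer": ["A", "A", "A", "B", "B", "B"],
    "grader": ["judge-1", "judge-2", "judge-3"] * 2,
    "score":  [8.0, 7.8, 9.5, 6.5, 6.4, 9.0],
})

# Recompute the leaderboard once per held-out grader.
for held_out in grades["grader"].unique():
    subset = grades[grades["grader"] != held_out]
    ranking = subset.groupby("writer")["score"].mean().sort_values(ascending=False)
    print(f"without {held_out}:\n{ranking}\n")
```
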
Normalizing each grader’s scores doesn’t significantly alter the rankings:
@@ -342,55 +345,56 @@ Normalizing each grader’s scores doesn’t significantly alter the rankings:

| Rank | LLM | Normalized Mean |
| -----:| ------------------------| -----------------:|
- | 1 | GPT-5 (medium reasoning) | 0.965 |
- | 2 | Kimi K2 | 0.950 |
- | 3 | Claude Opus 4.1 (no reasoning) | 0.826 |
- | 4 | Claude Opus 4.1 Thinking 16K | 0.812 |
- | 5 | o3-pro (medium reasoning) | 0.805 |
- | 6 | o3 (medium reasoning) | 0.789 |
- | 7 | Claude Opus 4 Thinking 16K | 0.726 |
- | 8 | Gemini 2.5 Pro | 0.701 |
- | 9 | DeepSeek R1 | 0.687 |
- | 10 | Qwen 3 235B A22B | 0.684 |
- | 11 | Claude Opus 4 (no reasoning) | 0.670 |
- | 12 | GPT-5 mini (medium reasoning) | 0.650 |
- | 13 | Qwen 3 235B A22B 25-07 Think | 0.609 |
- | 14 | GPT-4o Mar 2025 | 0.587 |
- | 15 | DeepSeek R1 05/28 | 0.525 |
- | 16 | Claude 3.7 Sonnet Thinking 16K | 0.497 |
- | 17 | Claude Sonnet 4 Thinking 16K | 0.490 |
- | 18 | Claude Sonnet 4 (no reasoning) | 0.434 |
+ | 1 | GPT-5 (medium reasoning) | 0.969 |
+ | 2 | Kimi K2 | 0.954 |
+ | 3 | Claude Opus 4.1 (no reasoning) | 0.829 |
+ | 4 | Claude Opus 4.1 Thinking 16K | 0.815 |
+ | 5 | o3-pro (medium reasoning) | 0.808 |
+ | 6 | o3 (medium reasoning) | 0.792 |
+ | 7 | Claude Opus 4 Thinking 16K | 0.728 |
+ | 8 | Gemini 2.5 Pro | 0.703 |
+ | 9 | DeepSeek R1 | 0.690 |
+ | 10 | Qwen 3 235B A22B | 0.686 |
+ | 11 | Claude Opus 4 (no reasoning) | 0.672 |
+ | 12 | GPT-5 mini (medium reasoning) | 0.652 |
+ | 13 | Qwen 3 235B A22B 25-07 Think | 0.610 |
+ | 14 | GPT-4o Mar 2025 | 0.589 |
+ | 15 | DeepSeek R1 05/28 | 0.527 |
+ | 16 | Claude 3.7 Sonnet Thinking 16K | 0.498 |
+ | 17 | Claude Sonnet 4 Thinking 16K | 0.491 |
+ | 18 | Claude Sonnet 4 (no reasoning) | 0.435 |
| 19 | Claude 3.5 Sonnet 2024-10-22 | 0.412 |
| 20 | Qwen QwQ-32B 16K | 0.406 |
- | 21 | Gemini 2.5 Pro Exp 03-25 | 0.387 |
- | 22 | Gemini 2.5 Pro Preview 05-06 | 0.378 |
+ | 21 | Gemini 2.5 Pro Exp 03-25 | 0.388 |
+ | 22 | Gemini 2.5 Pro Preview 05-06 | 0.379 |
| 23 | Gemma 3 27B | 0.334 |
| 24 | Baidu Ernie 4.5 300B A47B | 0.330 |
| 25 | Claude 3.7 Sonnet | 0.322 |
- | 26 | DeepSeek V3-0324 | 0.066 |
- | 27 | Grok 4 | 0.060 |
- | 28 | Mistral Medium 3 | 0.044 |
- | 29 | Gemini 2.5 Flash | -0.004 |
- | 30 | Grok 3 Beta (no reasoning) | -0.008 |
- | 31 | GPT-4.5 Preview | -0.052 |
- | 32 | Qwen 3 30B A3B | -0.086 |
- | 33 | o4-mini (medium reasoning) | -0.100 |
- | 34 | Grok 3 Mini Beta (low) | -0.251 |
- | 35 | Gemini 2.0 Flash Think Exp 01-21 | -0.264 |
- | 36 | Claude 3.5 Haiku | -0.271 |
- | 37 | GLM-4.5 | -0.288 |
- | 38 | Qwen 2.5 Max | -0.418 |
- | 39 | Gemini 2.0 Flash Exp | -0.498 |
- | 40 | o1 (medium reasoning) | -0.640 |
- | 41 | Mistral Large 2 | -0.816 |
- | 42 | GPT-4o mini | -1.044 |
- | 43 | o1-mini | -1.163 |
- | 44 | Grok 2 12-12 | -1.428 |
- | 45 | o3-mini (high reasoning) | -1.464 |
- | 46 | o3-mini (medium reasoning) | -1.478 |
- | 47 | Microsoft Phi-4 | -1.502 |
- | 48 | Llama 4 Maverick | -1.598 |
- | 49 | Amazon Nova Pro | -1.773 |
+ | 26 | GPT-OSS-120B | 0.082 |
+ | 27 | DeepSeek V3-0324 | 0.065 |
+ | 28 | Grok 4 | 0.059 |
+ | 29 | Mistral Medium 3 | 0.043 |
+ | 30 | Gemini 2.5 Flash | -0.005 |
+ | 31 | Grok 3 Beta (no reasoning) | -0.009 |
+ | 32 | GPT-4.5 Preview | -0.054 |
+ | 33 | Qwen 3 30B A3B | -0.088 |
+ | 34 | o4-mini (medium reasoning) | -0.102 |
+ | 35 | Grok 3 Mini Beta (low) | -0.254 |
+ | 36 | Gemini 2.0 Flash Think Exp 01-21 | -0.267 |
+ | 37 | Claude 3.5 Haiku | -0.274 |
+ | 38 | GLM-4.5 | -0.291 |
+ | 39 | Qwen 2.5 Max | -0.422 |
+ | 40 | Gemini 2.0 Flash Exp | -0.502 |
+ | 41 | o1 (medium reasoning) | -0.646 |
+ | 42 | Mistral Large 2 | -0.822 |
+ | 43 | GPT-4o mini | -1.052 |
+ | 44 | o1-mini | -1.172 |
+ | 45 | Grok 2 12-12 | -1.438 |
+ | 46 | o3-mini (high reasoning) | -1.474 |
+ | 47 | o3-mini (medium reasoning) | -1.488 |
+ | 48 | Microsoft Phi-4 | -1.512 |
+ | 49 | Llama 4 Maverick | -1.609 |
+ | 50 | Amazon Nova Pro | -1.785 |
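
The exact normalization isn't spelled out here; a per-grader z-score is one natural reading and matches the scale of the values above, so the sketch below assumes it:

```python
import pandas as pd

# One lenient and one harsh hypothetical grader scoring the same two writers.
grades = pd.DataFrame({
    "writer": ["A", "B", "A", "B"],
    "grader": ["lenient", "lenient", "harsh", "harsh"],
    "score":  [9.0, 8.5, 6.0, 5.0],
})

# Z-score each grader's scores so both graders contribute on the same scale,
# then average per writer. Assumed method, not confirmed by the source.
grades["z"] = grades.groupby("grader")["score"].transform(
    lambda s: (s - s.mean()) / s.std()
)
normalized = grades.groupby("writer")["z"].mean().sort_values(ascending=False)
print(normalized)  # A ≈ +0.71, B ≈ -0.71 regardless of grader harshness
```
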
@@ -476,6 +480,7 @@ It's important to note that each story is graded individually rather than as par
---

## Updates
+ - Aug 8, 2025: gpt-oss-120b added.
- Aug 7, 2025: GPT-5, Claude Opus 4.1 added.
- Aug 1, 2025: Qwen 3 235B A22B 25-07 Thinking, GLM-4.5 added.
- July 14, 2025: Kimi K2, Baidu Ernie 4.5 300B A47B added.