
Commit 80b7f17

Author: Lech
Message: gpt-oss-120b
Parent: 14c1a35

4,026 files changed (+141,827 −113 lines)


README.md

Lines changed: 118 additions & 113 deletions
```diff
@@ -63,29 +63,30 @@ The new grading LLMs are:
 | 24 | Gemma 3 27B | 7.99 |
 | 25 | Claude 3.7 Sonnet | 7.94 |
 | 26 | Mistral Medium 3 | 7.73 |
-| 27 | DeepSeek V3-0324 | 7.70 |
-| 28 | Grok 4 | 7.69 |
-| 29 | Gemini 2.5 Flash | 7.65 |
-| 30 | Grok 3 Beta (no reasoning) | 7.64 |
-| 31 | GPT-4.5 Preview | 7.56 |
-| 32 | Qwen 3 30B A3B | 7.53 |
-| 33 | o4-mini (medium reasoning) | 7.50 |
-| 34 | Gemini 2.0 Flash Think Exp 01-21 | 7.38 |
-| 35 | Claude 3.5 Haiku | 7.35 |
-| 36 | Grok 3 Mini Beta (low) | 7.35 |
-| 37 | GLM-4.5 | 7.34 |
-| 38 | Qwen 2.5 Max | 7.29 |
-| 39 | Gemini 2.0 Flash Exp | 7.15 |
-| 40 | o1 (medium reasoning) | 7.02 |
-| 41 | Mistral Large 2 | 6.90 |
-| 42 | GPT-4o mini | 6.72 |
-| 43 | o1-mini | 6.49 |
-| 44 | Grok 2 12-12 | 6.36 |
-| 45 | Microsoft Phi-4 | 6.26 |
-| 46 | Llama 4 Maverick | 6.20 |
-| 47 | o3-mini (high reasoning) | 6.17 |
-| 48 | o3-mini (medium reasoning) | 6.15 |
-| 49 | Amazon Nova Pro | 6.05 |
+| 27 | GPT-OSS-120B | 7.71 |
+| 28 | DeepSeek V3-0324 | 7.70 |
+| 29 | Grok 4 | 7.69 |
+| 30 | Gemini 2.5 Flash | 7.65 |
+| 31 | Grok 3 Beta (no reasoning) | 7.64 |
+| 32 | GPT-4.5 Preview | 7.56 |
+| 33 | Qwen 3 30B A3B | 7.53 |
+| 34 | o4-mini (medium reasoning) | 7.50 |
+| 35 | Gemini 2.0 Flash Think Exp 01-21 | 7.38 |
+| 36 | Claude 3.5 Haiku | 7.35 |
+| 37 | Grok 3 Mini Beta (low) | 7.35 |
+| 38 | GLM-4.5 | 7.34 |
+| 39 | Qwen 2.5 Max | 7.29 |
+| 40 | Gemini 2.0 Flash Exp | 7.15 |
+| 41 | o1 (medium reasoning) | 7.02 |
+| 42 | Mistral Large 2 | 6.90 |
+| 43 | GPT-4o mini | 6.72 |
+| 44 | o1-mini | 6.49 |
+| 45 | Grok 2 12-12 | 6.36 |
+| 46 | Microsoft Phi-4 | 6.26 |
+| 47 | Llama 4 Maverick | 6.20 |
+| 48 | o3-mini (high reasoning) | 6.17 |
+| 49 | o3-mini (medium reasoning) | 6.15 |
+| 50 | Amazon Nova Pro | 6.05 |
 ---
 
 ### Overall Strip Plot of Questions
```
```diff
@@ -255,29 +256,30 @@ Excluding 10% worst stories per LLM does not significantly change the rankings:
 | Gemma 3 27B | 24 | 7.99 | 24 | 8.06 |
 | Claude 3.7 Sonnet | 25 | 7.94 | 25 | 8.00 |
 | Mistral Medium 3 | 26 | 7.73 | 26 | 7.82 |
-| Grok 4 | 28 | 7.69 | 27 | 7.77 |
-| DeepSeek V3-0324 | 27 | 7.69 | 28 | 7.77 |
-| Gemini 2.5 Flash | 29 | 7.65 | 29 | 7.73 |
-| Grok 3 Beta (no reasoning) | 30 | 7.64 | 30 | 7.70 |
-| GPT-4.5 Preview | 31 | 7.56 | 31 | 7.63 |
-| Qwen 3 30B A3B | 32 | 7.53 | 32 | 7.61 |
-| o4-mini (medium reasoning) | 33 | 7.50 | 33 | 7.58 |
-| Gemini 2.0 Flash Think Exp 01-21 | 34 | 7.38 | 34 | 7.47 |
-| GLM-4.5 | 37 | 7.34 | 35 | 7.44 |
-| Claude 3.5 Haiku | 35 | 7.35 | 36 | 7.43 |
-| Grok 3 Mini Beta (low) | 36 | 7.35 | 37 | 7.42 |
-| Qwen 2.5 Max | 38 | 7.29 | 38 | 7.37 |
-| Gemini 2.0 Flash Exp | 39 | 7.15 | 39 | 7.24 |
-| o1 (medium reasoning) | 40 | 7.02 | 40 | 7.11 |
-| Mistral Large 2 | 41 | 6.90 | 41 | 7.00 |
-| GPT-4o mini | 42 | 6.72 | 42 | 6.80 |
-| o1-mini | 43 | 6.49 | 43 | 6.58 |
-| Grok 2 12-12 | 44 | 6.36 | 44 | 6.46 |
-| Microsoft Phi-4 | 45 | 6.26 | 45 | 6.35 |
-| Llama 4 Maverick | 46 | 6.20 | 46 | 6.29 |
-| o3-mini (high reasoning) | 47 | 6.17 | 47 | 6.26 |
-| o3-mini (medium reasoning) | 48 | 6.15 | 48 | 6.24 |
-| Amazon Nova Pro | 49 | 6.05 | 49 | 6.15 |
+| GPT-OSS-120B | 27 | 7.71 | 27 | 7.79 |
+| Grok 4 | 29 | 7.69 | 28 | 7.77 |
+| DeepSeek V3-0324 | 28 | 7.69 | 29 | 7.77 |
+| Gemini 2.5 Flash | 30 | 7.65 | 30 | 7.73 |
+| Grok 3 Beta (no reasoning) | 31 | 7.64 | 31 | 7.70 |
+| GPT-4.5 Preview | 32 | 7.56 | 32 | 7.63 |
+| Qwen 3 30B A3B | 33 | 7.53 | 33 | 7.61 |
+| o4-mini (medium reasoning) | 34 | 7.50 | 34 | 7.58 |
+| Gemini 2.0 Flash Think Exp 01-21 | 35 | 7.38 | 35 | 7.47 |
+| GLM-4.5 | 38 | 7.34 | 36 | 7.44 |
+| Claude 3.5 Haiku | 36 | 7.35 | 37 | 7.43 |
+| Grok 3 Mini Beta (low) | 37 | 7.35 | 38 | 7.42 |
+| Qwen 2.5 Max | 39 | 7.29 | 39 | 7.37 |
+| Gemini 2.0 Flash Exp | 40 | 7.15 | 40 | 7.24 |
+| o1 (medium reasoning) | 41 | 7.02 | 41 | 7.11 |
+| Mistral Large 2 | 42 | 6.90 | 42 | 7.00 |
+| GPT-4o mini | 43 | 6.72 | 43 | 6.80 |
+| o1-mini | 44 | 6.49 | 44 | 6.58 |
+| Grok 2 12-12 | 45 | 6.36 | 45 | 6.46 |
+| Microsoft Phi-4 | 46 | 6.26 | 46 | 6.35 |
+| Llama 4 Maverick | 47 | 6.20 | 47 | 6.29 |
+| o3-mini (high reasoning) | 48 | 6.17 | 48 | 6.26 |
+| o3-mini (medium reasoning) | 49 | 6.15 | 49 | 6.24 |
+| Amazon Nova Pro | 50 | 6.05 | 50 | 6.15 |
```
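The 10%-worst-story exclusion in the table above can be sketched roughly as follows. This is a hypothetical reconstruction (the benchmark's actual analysis code is not part of this diff), assuming each LLM's grades are available as a flat list of per-story mean scores:

```python
def trimmed_mean(story_scores, drop_frac=0.10):
    """Mean score after dropping the worst `drop_frac` fraction of stories.

    Hypothetical sketch; the benchmark's exact exclusion rule may differ.
    """
    ranked = sorted(story_scores)            # ascending: worst stories first
    n_drop = int(len(ranked) * drop_frac)    # number of low scores to discard
    kept = ranked[n_drop:]
    return sum(kept) / len(kept)

# Toy data: ten per-story scores for one hypothetical model
scores = [6.1, 6.5, 7.0, 7.2, 7.4, 7.5, 7.6, 7.8, 8.0, 8.3]
plain = sum(scores) / len(scores)        # about 7.34
trimmed = trimmed_mean(scores)           # about 7.48 after dropping the 6.1
```

Dropping the low tail raises every model's mean slightly, which matches the table: the "excluded" column is uniformly a few hundredths higher, while the ordering barely moves.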

Excluding any one LLM from grading also does not significantly change the rankings. For example, here is what happens when Llama 4 Maverick is excluded:
```diff
@@ -311,29 +313,30 @@ Excluding any one LLM from grading also does not significantly change the rankin
 | Gemma 3 27B | 24 | 7.99 | 24 | 7.85 |
 | Claude 3.7 Sonnet | 25 | 7.94 | 25 | 7.78 |
 | Mistral Medium 3 | 26 | 7.73 | 26 | 7.55 |
-| DeepSeek V3-0324 | 27 | 7.69 | 27 | 7.51 |
-| Grok 4 | 28 | 7.69 | 28 | 7.49 |
-| Gemini 2.5 Flash | 29 | 7.65 | 29 | 7.46 |
-| Grok 3 Beta (no reasoning) | 30 | 7.64 | 30 | 7.44 |
-| GPT-4.5 Preview | 31 | 7.56 | 31 | 7.36 |
-| Qwen 3 30B A3B | 32 | 7.53 | 32 | 7.32 |
-| o4-mini (medium reasoning) | 33 | 7.50 | 33 | 7.26 |
-| Gemini 2.0 Flash Think Exp 01-21 | 34 | 7.38 | 34 | 7.14 |
-| Claude 3.5 Haiku | 35 | 7.35 | 35 | 7.11 |
-| GLM-4.5 | 37 | 7.34 | 36 | 7.10 |
-| Grok 3 Mini Beta (low) | 36 | 7.35 | 37 | 7.10 |
-| Qwen 2.5 Max | 38 | 7.29 | 38 | 7.08 |
-| Gemini 2.0 Flash Exp | 39 | 7.15 | 39 | 6.89 |
-| o1 (medium reasoning) | 40 | 7.02 | 40 | 6.74 |
-| Mistral Large 2 | 41 | 6.90 | 41 | 6.63 |
-| GPT-4o mini | 42 | 6.72 | 42 | 6.43 |
-| o1-mini | 43 | 6.49 | 43 | 6.13 |
-| Grok 2 12-12 | 44 | 6.36 | 44 | 6.03 |
-| Microsoft Phi-4 | 45 | 6.26 | 45 | 5.90 |
-| Llama 4 Maverick | 46 | 6.20 | 46 | 5.83 |
-| o3-mini (high reasoning) | 47 | 6.17 | 47 | 5.76 |
-| o3-mini (medium reasoning) | 48 | 6.15 | 48 | 5.73 |
-| Amazon Nova Pro | 49 | 6.05 | 49 | 5.67 |
+| DeepSeek V3-0324 | 28 | 7.69 | 27 | 7.51 |
+| GPT-OSS-120B | 27 | 7.71 | 28 | 7.51 |
+| Grok 4 | 29 | 7.69 | 29 | 7.49 |
+| Gemini 2.5 Flash | 30 | 7.65 | 30 | 7.46 |
+| Grok 3 Beta (no reasoning) | 31 | 7.64 | 31 | 7.44 |
+| GPT-4.5 Preview | 32 | 7.56 | 32 | 7.36 |
+| Qwen 3 30B A3B | 33 | 7.53 | 33 | 7.32 |
+| o4-mini (medium reasoning) | 34 | 7.50 | 34 | 7.26 |
+| Gemini 2.0 Flash Think Exp 01-21 | 35 | 7.38 | 35 | 7.14 |
+| Claude 3.5 Haiku | 36 | 7.35 | 36 | 7.11 |
+| GLM-4.5 | 38 | 7.34 | 37 | 7.10 |
+| Grok 3 Mini Beta (low) | 37 | 7.35 | 38 | 7.10 |
+| Qwen 2.5 Max | 39 | 7.29 | 39 | 7.08 |
+| Gemini 2.0 Flash Exp | 40 | 7.15 | 40 | 6.89 |
+| o1 (medium reasoning) | 41 | 7.02 | 41 | 6.74 |
+| Mistral Large 2 | 42 | 6.90 | 42 | 6.63 |
+| GPT-4o mini | 43 | 6.72 | 43 | 6.43 |
+| o1-mini | 44 | 6.49 | 44 | 6.13 |
+| Grok 2 12-12 | 45 | 6.36 | 45 | 6.03 |
+| Microsoft Phi-4 | 46 | 6.26 | 46 | 5.90 |
+| Llama 4 Maverick | 47 | 6.20 | 47 | 5.83 |
+| o3-mini (high reasoning) | 48 | 6.17 | 48 | 5.76 |
+| o3-mini (medium reasoning) | 49 | 6.15 | 49 | 5.73 |
+| Amazon Nova Pro | 50 | 6.05 | 50 | 5.67 |
```
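The leave-one-grader-out check behind this table can be sketched as below. This is a hypothetical reconstruction with made-up grader and model names, assuming grades are stored as a grader → model → mean-score mapping:

```python
def mean_scores_excluding(grades, excluded_grader):
    """Per-model mean over all graders except one (hypothetical sketch)."""
    kept = [g for g in grades if g != excluded_grader]
    models = grades[kept[0]].keys()
    return {m: sum(grades[g][m] for g in kept) / len(kept) for m in models}

# Toy data: three hypothetical graders scoring two hypothetical models
grades = {
    "grader_a": {"model_x": 8.0, "model_y": 6.0},
    "grader_b": {"model_x": 7.0, "model_y": 8.0},
    "grader_c": {"model_x": 5.0, "model_y": 9.0},
}
means = mean_scores_excluding(grades, "grader_c")
# model_x: (8.0 + 7.0) / 2 = 7.5;  model_y: (6.0 + 8.0) / 2 = 7.0
```

With a large grader panel, removing any single grader shifts every model's mean by a similar amount, which is why the ranks above are nearly unchanged even though the absolute scores drop.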

Normalizing each grader’s scores doesn’t significantly alter the rankings:

```diff
@@ -342,55 +345,56 @@ Normalizing each grader’s scores doesn’t significantly alter the rankings:
 
 | Rank | LLM | Normalized Mean |
 |-----:|------------------------|-----------------:|
-| 1 | GPT-5 (medium reasoning) | 0.965 |
-| 2 | Kimi K2 | 0.950 |
-| 3 | Claude Opus 4.1 (no reasoning) | 0.826 |
-| 4 | Claude Opus 4.1 Thinking 16K | 0.812 |
-| 5 | o3-pro (medium reasoning) | 0.805 |
-| 6 | o3 (medium reasoning) | 0.789 |
-| 7 | Claude Opus 4 Thinking 16K | 0.726 |
-| 8 | Gemini 2.5 Pro | 0.701 |
-| 9 | DeepSeek R1 | 0.687 |
-| 10 | Qwen 3 235B A22B | 0.684 |
-| 11 | Claude Opus 4 (no reasoning) | 0.670 |
-| 12 | GPT-5 mini (medium reasoning) | 0.650 |
-| 13 | Qwen 3 235B A22B 25-07 Think | 0.609 |
-| 14 | GPT-4o Mar 2025 | 0.587 |
-| 15 | DeepSeek R1 05/28 | 0.525 |
-| 16 | Claude 3.7 Sonnet Thinking 16K | 0.497 |
-| 17 | Claude Sonnet 4 Thinking 16K | 0.490 |
-| 18 | Claude Sonnet 4 (no reasoning) | 0.434 |
+| 1 | GPT-5 (medium reasoning) | 0.969 |
+| 2 | Kimi K2 | 0.954 |
+| 3 | Claude Opus 4.1 (no reasoning) | 0.829 |
+| 4 | Claude Opus 4.1 Thinking 16K | 0.815 |
+| 5 | o3-pro (medium reasoning) | 0.808 |
+| 6 | o3 (medium reasoning) | 0.792 |
+| 7 | Claude Opus 4 Thinking 16K | 0.728 |
+| 8 | Gemini 2.5 Pro | 0.703 |
+| 9 | DeepSeek R1 | 0.690 |
+| 10 | Qwen 3 235B A22B | 0.686 |
+| 11 | Claude Opus 4 (no reasoning) | 0.672 |
+| 12 | GPT-5 mini (medium reasoning) | 0.652 |
+| 13 | Qwen 3 235B A22B 25-07 Think | 0.610 |
+| 14 | GPT-4o Mar 2025 | 0.589 |
+| 15 | DeepSeek R1 05/28 | 0.527 |
+| 16 | Claude 3.7 Sonnet Thinking 16K | 0.498 |
+| 17 | Claude Sonnet 4 Thinking 16K | 0.491 |
+| 18 | Claude Sonnet 4 (no reasoning) | 0.435 |
 | 19 | Claude 3.5 Sonnet 2024-10-22 | 0.412 |
 | 20 | Qwen QwQ-32B 16K | 0.406 |
-| 21 | Gemini 2.5 Pro Exp 03-25 | 0.387 |
-| 22 | Gemini 2.5 Pro Preview 05-06 | 0.378 |
+| 21 | Gemini 2.5 Pro Exp 03-25 | 0.388 |
+| 22 | Gemini 2.5 Pro Preview 05-06 | 0.379 |
 | 23 | Gemma 3 27B | 0.334 |
 | 24 | Baidu Ernie 4.5 300B A47B | 0.330 |
 | 25 | Claude 3.7 Sonnet | 0.322 |
-| 26 | DeepSeek V3-0324 | 0.066 |
-| 27 | Grok 4 | 0.060 |
-| 28 | Mistral Medium 3 | 0.044 |
-| 29 | Gemini 2.5 Flash | -0.004 |
-| 30 | Grok 3 Beta (no reasoning) | -0.008 |
-| 31 | GPT-4.5 Preview | -0.052 |
-| 32 | Qwen 3 30B A3B | -0.086 |
-| 33 | o4-mini (medium reasoning) | -0.100 |
-| 34 | Grok 3 Mini Beta (low) | -0.251 |
-| 35 | Gemini 2.0 Flash Think Exp 01-21 | -0.264 |
-| 36 | Claude 3.5 Haiku | -0.271 |
-| 37 | GLM-4.5 | -0.288 |
-| 38 | Qwen 2.5 Max | -0.418 |
-| 39 | Gemini 2.0 Flash Exp | -0.498 |
-| 40 | o1 (medium reasoning) | -0.640 |
-| 41 | Mistral Large 2 | -0.816 |
-| 42 | GPT-4o mini | -1.044 |
-| 43 | o1-mini | -1.163 |
-| 44 | Grok 2 12-12 | -1.428 |
-| 45 | o3-mini (high reasoning) | -1.464 |
-| 46 | o3-mini (medium reasoning) | -1.478 |
-| 47 | Microsoft Phi-4 | -1.502 |
-| 48 | Llama 4 Maverick | -1.598 |
-| 49 | Amazon Nova Pro | -1.773 |
+| 26 | GPT-OSS-120B | 0.082 |
+| 27 | DeepSeek V3-0324 | 0.065 |
+| 28 | Grok 4 | 0.059 |
+| 29 | Mistral Medium 3 | 0.043 |
+| 30 | Gemini 2.5 Flash | -0.005 |
+| 31 | Grok 3 Beta (no reasoning) | -0.009 |
+| 32 | GPT-4.5 Preview | -0.054 |
+| 33 | Qwen 3 30B A3B | -0.088 |
+| 34 | o4-mini (medium reasoning) | -0.102 |
+| 35 | Grok 3 Mini Beta (low) | -0.254 |
+| 36 | Gemini 2.0 Flash Think Exp 01-21 | -0.267 |
+| 37 | Claude 3.5 Haiku | -0.274 |
+| 38 | GLM-4.5 | -0.291 |
+| 39 | Qwen 2.5 Max | -0.422 |
+| 40 | Gemini 2.0 Flash Exp | -0.502 |
+| 41 | o1 (medium reasoning) | -0.646 |
+| 42 | Mistral Large 2 | -0.822 |
+| 43 | GPT-4o mini | -1.052 |
+| 44 | o1-mini | -1.172 |
+| 45 | Grok 2 12-12 | -1.438 |
+| 46 | o3-mini (high reasoning) | -1.474 |
+| 47 | o3-mini (medium reasoning) | -1.488 |
+| 48 | Microsoft Phi-4 | -1.512 |
+| 49 | Llama 4 Maverick | -1.609 |
+| 50 | Amazon Nova Pro | -1.785 |
```
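The README does not spell out the normalization method behind the table above; a per-grader z-score (subtract that grader's mean, divide by its standard deviation) is one common choice, and a sketch under that assumption looks like:

```python
from statistics import mean, pstdev

def normalize_per_grader(grades):
    """Z-score each grader's scores (assumed method; the actual one may differ).

    grades: {grader: {model: raw_score}} -> same shape, with z-scores.
    """
    normalized = {}
    for grader, by_model in grades.items():
        mu = mean(by_model.values())       # this grader's average score
        sigma = pstdev(by_model.values())  # this grader's spread
        normalized[grader] = {m: (s - mu) / sigma for m, s in by_model.items()}
    return normalized

# Toy data: a lenient and a strict hypothetical grader
grades = {
    "lenient": {"model_x": 9.0, "model_y": 7.0},
    "strict":  {"model_x": 6.0, "model_y": 4.0},
}
z = normalize_per_grader(grades)
# Both graders now rate model_x at +1.0 and model_y at -1.0
```

Z-scoring removes each grader's leniency and scale differences before averaging, which is why the normalized means are centered near zero while the ranking stays close to the raw-score one.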
```diff
@@ -476,6 +480,7 @@ It's important to note that each story is graded individually rather than as par
 ---
 
 ## Updates
+- Aug 8, 2025: gpt-oss-120b added.
 - Aug 7, 2025: GPT-5, Claude Opus 4.1 added.
 - Aug 1, 2025: Qwen 3 235B A22B 25-07 Thinking, GLM-4.5 added.
 - July 14, 2025: Kimi K2, Baidu Ernie 4.5 300B A47B added.
```
