You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardexpand all lines: docs/source/grpo_trainer.md
+37-7
Original file line number
Diff line number
Diff line change
@@ -121,7 +121,12 @@ The GRPO Trainer logs the following metrics:
121
121
The [`GRPOTrainer`] supports using custom reward functions instead of dense reward models. To ensure compatibility, your reward function must satisfy the following requirements:
122
122
123
123
1.**Input arguments**:
124
-
- The function must accept two arguments: `prompts` and `completions`.
124
+
- The function must accept the following as keyword arguments:
125
+
-`prompts` (contains the prompts),
126
+
-`completions` (contains the generated completions),
127
+
- All columns names (but `prompt`) that the dataset may have. For example, if the dataset contains a column named `ground_truth`, the function will be called with `ground_truth` as a keyword argument.
128
+
129
+
The easiest way to comply with this requirement is to use `**kwargs` in the function signature.
125
130
- Depending on the dataset format, the input will vary:
126
131
- For [standard format](dataset_formats#standard), `prompts` and `completions` will be lists of strings.
127
132
- For [conversational format](dataset_formats#conversational), `prompts` and `completions` will be lists of message dictionaries.
@@ -133,7 +138,7 @@ The [`GRPOTrainer`] supports using custom reward functions instead of dense rewa
133
138
Below is an example of a reward function for a standard format that rewards longer completions:
134
139
135
140
```python
136
-
defreward_func(prompts, completions):
141
+
defreward_func(completions, **kwargs):
137
142
"""Reward function that gives higher scores to longer completions."""
138
143
return [float(len(completion)) for completion in completions]
#### Example 2: Reward completions with specific format
151
156
152
-
Below is an example of a reward function that checks if the completion has a specific format. This example is inspired by the reward function used in the paper [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://huggingface.co/papers/2501.12948).
157
+
Below is an example of a reward function that checks if the completion has a specific format. This example is inspired by the _format reward_ function used in the paper [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://huggingface.co/papers/2501.12948).
153
158
It is designed for conversational format, where prompts and completions consist of structured messages.
154
159
155
160
```python
156
161
import re
157
162
158
-
defformat_reward_func(prompts, completions):
163
+
defformat_reward_func(completions, **kwargs):
159
164
"""Reward function that checks if the completion has a specific format."""
#### Example 3: Reward completions based on a reference
187
+
188
+
Below is an example of a reward function that checks if the is correct. This example is inspired by the _accuracy reward_ function used in the paper [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://huggingface.co/papers/2501.12948).
189
+
This example is designed for [standard format](dataset_formats#standard), where the dataset contains a column named `ground_truth`.
0 commit comments