
Commit 8f137b2

bauwenst and ArthurZucker authored Feb 13, 2025

Move DataCollatorForMultipleChoice from the docs to the package (#34763)

* Add implementation for DataCollatorForMultipleChoice based on docs.
* Add DataCollatorForMultipleChoice to import structure.
* Remove custom DataCollatorForMultipleChoice implementations from example scripts.
* Remove custom implementations of DataCollatorForMultipleChoice from docs in English, Spanish, Japanese and Korean.
* Refactor torch version of DataCollatorForMultipleChoice to be more easily understandable.
* Apply suggested changes and run make fixup.
* fix copies, style and fixup
* add missing documentation
* nits
* fix docstring
* style
* nits
* isort

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: Arthur Zucker <arthur.zucker@gmail.com>

1 parent 35c1550 commit 8f137b2
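
In effect, the commit replaces a collator that readers previously had to copy-paste out of the task docs with a single import from the package root. A minimal sketch of the new usage, based on the added import lines in the diffs below (the checkpoint name is illustrative, not part of this commit):

```py
from transformers import AutoTokenizer, DataCollatorForMultipleChoice

# Any tokenizer with a pad token works here; this checkpoint is only an example.
tokenizer = AutoTokenizer.from_pretrained("google-bert/bert-base-uncased")

# Previously this class had to be copy-pasted from the docs;
# after this commit it ships with the package.
collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
```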

File tree

25 files changed: +361 −670 lines
 

docs/source/en/main_classes/data_collator.md (+3)

@@ -71,3 +71,6 @@ Examples of use can be found in the [example scripts](../examples) or [example n
 
 [[autodoc]] data.data_collator.DataCollatorWithFlattening
 
+# DataCollatorForMultipleChoice
+
+[[autodoc]] data.data_collator.DataCollatorForMultipleChoice

docs/source/en/tasks/multiple_choice.md (+5 −90)

@@ -109,99 +109,14 @@ The preprocessing function you want to create needs to:
 To apply the preprocessing function over the entire dataset, use 🤗 Datasets [`~datasets.Dataset.map`] method. You can speed up the `map` function by setting `batched=True` to process multiple elements of the dataset at once:
 
 ```py
-tokenized_swag = swag.map(preprocess_function, batched=True)
+>>> tokenized_swag = swag.map(preprocess_function, batched=True)
 ```
 
-🤗 Transformers doesn't have a data collator for multiple choice, so you'll need to adapt the [`DataCollatorWithPadding`] to create a batch of examples. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
-
-`DataCollatorForMultipleChoice` flattens all the model inputs, applies padding, and then unflattens the results:
-
-<frameworkcontent>
-<pt>
-```py
->>> from dataclasses import dataclass
->>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
->>> from typing import Optional, Union
->>> import torch
-
-
->>> @dataclass
-... class DataCollatorForMultipleChoice:
-...     """
-...     Data collator that will dynamically pad the inputs for multiple choice received.
-...     """
-
-...     tokenizer: PreTrainedTokenizerBase
-...     padding: Union[bool, str, PaddingStrategy] = True
-...     max_length: Optional[int] = None
-...     pad_to_multiple_of: Optional[int] = None
-
-...     def __call__(self, features):
-...         label_name = "label" if "label" in features[0].keys() else "labels"
-...         labels = [feature.pop(label_name) for feature in features]
-...         batch_size = len(features)
-...         num_choices = len(features[0]["input_ids"])
-...         flattened_features = [
-...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
-...         ]
-...         flattened_features = sum(flattened_features, [])
-
-...         batch = self.tokenizer.pad(
-...             flattened_features,
-...             padding=self.padding,
-...             max_length=self.max_length,
-...             pad_to_multiple_of=self.pad_to_multiple_of,
-...             return_tensors="pt",
-...         )
-
-...         batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
-...         batch["labels"] = torch.tensor(labels, dtype=torch.int64)
-...         return batch
-```
-</pt>
-<tf>
+To create a batch of examples, it's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length. [`DataCollatorForMultipleChoice`] flattens all the model inputs, applies padding, and then unflattens the results.
 ```py
->>> from dataclasses import dataclass
->>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
->>> from typing import Optional, Union
->>> import tensorflow as tf
-
-
->>> @dataclass
-... class DataCollatorForMultipleChoice:
-...     """
-...     Data collator that will dynamically pad the inputs for multiple choice received.
-...     """
-
-...     tokenizer: PreTrainedTokenizerBase
-...     padding: Union[bool, str, PaddingStrategy] = True
-...     max_length: Optional[int] = None
-...     pad_to_multiple_of: Optional[int] = None
-
-...     def __call__(self, features):
-...         label_name = "label" if "label" in features[0].keys() else "labels"
-...         labels = [feature.pop(label_name) for feature in features]
-...         batch_size = len(features)
-...         num_choices = len(features[0]["input_ids"])
-...         flattened_features = [
-...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
-...         ]
-...         flattened_features = sum(flattened_features, [])
-
-...         batch = self.tokenizer.pad(
-...             flattened_features,
-...             padding=self.padding,
-...             max_length=self.max_length,
-...             pad_to_multiple_of=self.pad_to_multiple_of,
-...             return_tensors="tf",
-...         )
-
-...         batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()}
-...         batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
-...         return batch
+>>> from transformers import DataCollatorForMultipleChoice
+>>> collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
 ```
-</tf>
-</frameworkcontent>
 
 ## Evaluate
 

@@ -271,7 +186,7 @@ At this point, only three steps remain:
 ...     train_dataset=tokenized_swag["train"],
 ...     eval_dataset=tokenized_swag["validation"],
 ...     processing_class=tokenizer,
-...     data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
+...     data_collator=collator,
 ...     compute_metrics=compute_metrics,
 ... )
 
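
For readers following the diff, here is a short sketch (not part of the commit) of what "flatten, pad, unflatten" means in practice. The feature dicts and token IDs are made up, and `tokenizer`/`collator` are assumed to be built as in the hunk above:

```py
# Two examples, each with two pre-tokenized candidate endings of different lengths.
features = [
    {
        "input_ids": [[101, 2023, 102], [101, 2023, 2003, 102]],
        "attention_mask": [[1, 1, 1], [1, 1, 1, 1]],
        "label": 1,
    },
    {
        "input_ids": [[101, 2178, 102], [101, 2178, 2742, 102]],
        "attention_mask": [[1, 1, 1], [1, 1, 1, 1]],
        "label": 0,
    },
]

batch = collator(features)
# Inputs come back padded to the longest sequence in the batch and reshaped to
# (batch_size, num_choices, seq_len), with labels as an int64 tensor -- the
# layout a multiple-choice model head expects.
print(batch["input_ids"].shape)  # torch.Size([2, 2, 4])
print(batch["labels"])           # tensor([1, 0])
```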

docs/source/es/tasks/multiple_choice.md (+5 −90)

(Spanish doc content translated to English below; code identifiers unchanged.)

@@ -91,99 +91,14 @@ Use the [`~datasets.Dataset.map`] function from 🤗 Datasets to apply the fun
 tokenized_swag = swag.map(preprocess_function, batched=True)
 ```
 
-🤗 Transformers doesn't have a data collator for the multiple choice task, so you'd need to create one. You can adapt [`DataCollatorWithPadding`] to create a batch of examples for multiple choice. It will also
-*dynamically pad* your text and labels to the length of the longest element in its batch, so that they have a uniform length. Although it's possible to pad the text in the `tokenizer` function by setting
+To create a batch of examples for multiple choice, the collator will also *dynamically pad* your text and labels to the length of the longest element in its batch, so that they have a uniform length. Although it's possible to pad the text in the `tokenizer` function by setting
 `padding=True`, dynamic padding is more efficient.
 
-`DataCollatorForMultipleChoice` will flatten all the model inputs, apply padding to them, and then unflatten the results:
-
-<frameworkcontent>
-<pt>
+[`DataCollatorForMultipleChoice`] will flatten all the model inputs, apply padding to them, and then unflatten the results.
 ```py
->>> from dataclasses import dataclass
->>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
->>> from typing import Optional, Union
->>> import torch
-
-
->>> @dataclass
-... class DataCollatorForMultipleChoice:
-...     """
-...     Data collator that will automatically pad the inputs received for
-...     a multiple choice task.
-...     """
-
-...     tokenizer: PreTrainedTokenizerBase
-...     padding: Union[bool, str, PaddingStrategy] = True
-...     max_length: Optional[int] = None
-...     pad_to_multiple_of: Optional[int] = None
-
-...     def __call__(self, features):
-...         label_name = "label" if "label" in features[0].keys() else "labels"
-...         labels = [feature.pop(label_name) for feature in features]
-...         batch_size = len(features)
-...         num_choices = len(features[0]["input_ids"])
-...         flattened_features = [
-...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
-...         ]
-...         flattened_features = sum(flattened_features, [])
-
-...         batch = self.tokenizer.pad(
-...             flattened_features,
-...             padding=self.padding,
-...             max_length=self.max_length,
-...             pad_to_multiple_of=self.pad_to_multiple_of,
-...             return_tensors="pt",
-...         )
-
-...         batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
-...         batch["labels"] = torch.tensor(labels, dtype=torch.int64)
-...         return batch
+>>> from transformers import DataCollatorForMultipleChoice
+>>> collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
 ```
-</pt>
-<tf>
-```py
->>> from dataclasses import dataclass
->>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
->>> from typing import Optional, Union
->>> import tensorflow as tf
-
-
->>> @dataclass
-... class DataCollatorForMultipleChoice:
-...     """
-...     Data collator that will dynamically pad the inputs for multiple choice received.
-...     """
-
-...     tokenizer: PreTrainedTokenizerBase
-...     padding: Union[bool, str, PaddingStrategy] = True
-...     max_length: Optional[int] = None
-...     pad_to_multiple_of: Optional[int] = None
-
-...     def __call__(self, features):
-...         label_name = "label" if "label" in features[0].keys() else "labels"
-...         labels = [feature.pop(label_name) for feature in features]
-...         batch_size = len(features)
-...         num_choices = len(features[0]["input_ids"])
-...         flattened_features = [
-...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
-...         ]
-...         flattened_features = sum(flattened_features, [])
-
-...         batch = self.tokenizer.pad(
-...             flattened_features,
-...             padding=self.padding,
-...             max_length=self.max_length,
-...             pad_to_multiple_of=self.pad_to_multiple_of,
-...             return_tensors="tf",
-...         )
-
-...         batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()}
-...         batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
-...         return batch
-```
-</tf>
-</frameworkcontent>
 
 ## Training
 

@@ -226,7 +141,7 @@ At this point, only three steps remain:
 ...     train_dataset=tokenized_swag["train"],
 ...     eval_dataset=tokenized_swag["validation"],
 ...     processing_class=tokenizer,
-...     data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
+...     data_collator=collator,
 ... )
 
 >>> trainer.train()
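
The hunks above only show the `Trainer` wiring. As a hypothetical aside (not in this commit), the same collator can serve as a `collate_fn` for a plain PyTorch `DataLoader`; the column names below follow the tutorial's preprocessing and are assumptions about the tokenized dataset:

```py
from torch.utils.data import DataLoader

# Keep only the columns the collator understands; the raw SWAG text columns
# would otherwise be passed to tokenizer.pad and break collation.
train_ds = tokenized_swag["train"].select_columns(["input_ids", "attention_mask", "label"])

loader = DataLoader(train_ds, batch_size=8, shuffle=True, collate_fn=collator)

batch = next(iter(loader))
print(batch["input_ids"].shape)  # (8, num_choices, longest_sequence_in_batch)
```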

docs/source/ja/tasks/multiple_choice.md (+4 −89)

(Japanese doc content translated to English below; code identifiers unchanged.)

@@ -113,96 +113,11 @@ pip install transformers datasets evaluate
 tokenized_swag = swag.map(preprocess_function, batched=True)
 ```
 
-🤗 Transformers doesn't have a data collator for multiple choice, so you'll need to adapt [`DataCollatorWithPadding`] to create a batch of examples. It's more efficient to *dynamically pad* the sentences to the longest length in a batch during collation, rather than padding the whole dataset to the maximum length.
-
-`DataCollatorForMultipleChoice` flattens all the model inputs, applies padding, and then unflattens the results.
-
-<frameworkcontent>
-<pt>
-```py
->>> from dataclasses import dataclass
->>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
->>> from typing import Optional, Union
->>> import torch
-
-
->>> @dataclass
-... class DataCollatorForMultipleChoice:
-...     """
-...     Data collator that will dynamically pad the inputs for multiple choice received.
-...     """
-
-...     tokenizer: PreTrainedTokenizerBase
-...     padding: Union[bool, str, PaddingStrategy] = True
-...     max_length: Optional[int] = None
-...     pad_to_multiple_of: Optional[int] = None
-
-...     def __call__(self, features):
-...         label_name = "label" if "label" in features[0].keys() else "labels"
-...         labels = [feature.pop(label_name) for feature in features]
-...         batch_size = len(features)
-...         num_choices = len(features[0]["input_ids"])
-...         flattened_features = [
-...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
-...         ]
-...         flattened_features = sum(flattened_features, [])
-
-...         batch = self.tokenizer.pad(
-...             flattened_features,
-...             padding=self.padding,
-...             max_length=self.max_length,
-...             pad_to_multiple_of=self.pad_to_multiple_of,
-...             return_tensors="pt",
-...         )
-
-...         batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
-...         batch["labels"] = torch.tensor(labels, dtype=torch.int64)
-...         return batch
-```
-</pt>
-<tf>
+[`DataCollatorForMultipleChoice`] flattens all the model inputs, applies padding, and then unflattens the results.
 ```py
->>> from dataclasses import dataclass
->>> from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
->>> from typing import Optional, Union
->>> import tensorflow as tf
-
-
->>> @dataclass
-... class DataCollatorForMultipleChoice:
-...     """
-...     Data collator that will dynamically pad the inputs for multiple choice received.
-...     """
-
-...     tokenizer: PreTrainedTokenizerBase
-...     padding: Union[bool, str, PaddingStrategy] = True
-...     max_length: Optional[int] = None
-...     pad_to_multiple_of: Optional[int] = None
-
-...     def __call__(self, features):
-...         label_name = "label" if "label" in features[0].keys() else "labels"
-...         labels = [feature.pop(label_name) for feature in features]
-...         batch_size = len(features)
-...         num_choices = len(features[0]["input_ids"])
-...         flattened_features = [
-...             [{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features
-...         ]
-...         flattened_features = sum(flattened_features, [])
-
-...         batch = self.tokenizer.pad(
-...             flattened_features,
-...             padding=self.padding,
-...             max_length=self.max_length,
-...             pad_to_multiple_of=self.pad_to_multiple_of,
-...             return_tensors="tf",
-...         )
-
-...         batch = {k: tf.reshape(v, (batch_size, num_choices, -1)) for k, v in batch.items()}
-...         batch["labels"] = tf.convert_to_tensor(labels, dtype=tf.int64)
-...         return batch
+>>> from transformers import DataCollatorForMultipleChoice
+>>> collator = DataCollatorForMultipleChoice(tokenizer=tokenizer)
 ```
-</tf>
-</frameworkcontent>
 
 ## Evaluate
 

@@ -272,7 +187,7 @@ tokenized_swag = swag.map(preprocess_function, batched=True)
 ...     train_dataset=tokenized_swag["train"],
 ...     eval_dataset=tokenized_swag["validation"],
 ...     processing_class=tokenizer,
-...     data_collator=DataCollatorForMultipleChoice(tokenizer=tokenizer),
+...     data_collator=collator,
 ...     compute_metrics=compute_metrics,
 ... )
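
The `Trainer` hunks pass `compute_metrics=compute_metrics` without showing its definition, which lives elsewhere in these tutorial pages. As a reminder of that function's shape, here is a hedged sketch of an accuracy-based version consistent with the tutorials (the exact definition is not part of this diff):

```py
import evaluate
import numpy as np

# Accuracy metric from the 🤗 Evaluate library, as used in the task tutorials.
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # eval_pred unpacks to (logits of shape (batch, num_choices), labels).
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
```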
