Commit 342cf2b

Updated Transformer doc notebooks with commit 44c7857b873e535d8a200000b1da2ec23cf74273

See: huggingface/transformers@44c7857
1 parent 64b51b8 commit 342cf2b

21 files changed (+746, -1218 lines)

transformers_doc/custom_datasets.ipynb

+16, -16
@@ -126,7 +126,7 @@
 "source": [
 "The next step is to tokenize the text into a readable format by the model. It is important to load the same tokenizer a\n",
 "model was trained with to ensure appropriately tokenized words. Load the DistilBERT tokenizer with the\n",
-"[AutoTokenizer](https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/auto#transformers.AutoTokenizer) because we will eventually train a classifier using a pretrained [DistilBERT](https://huggingface.co/distilbert-base-uncased) model:"
+"[AutoTokenizer](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoTokenizer) because we will eventually train a classifier using a pretrained [DistilBERT](https://huggingface.co/distilbert-base-uncased) model:"
 ]
 },
 {
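The code cell that follows this markdown cell is unchanged by the commit and so not shown in the diff. As a minimal sketch (assuming the `distilbert-base-uncased` checkpoint named in the prose), loading the tokenizer looks like:

```python
from transformers import AutoTokenizer

# Load the tokenizer matching the pretrained DistilBERT checkpoint so text is
# split and mapped to ids exactly as it was during pretraining.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```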
@@ -207,7 +207,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now load your model with the [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/auto#transformers.AutoModelForSequenceClassification) class along with the number of expected labels:"
+"Now load your model with the [AutoModelForSequenceClassification](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForSequenceClassification) class along with the number of expected labels:"
 ]
 },
 {
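For reference, the model-loading step described above might look like this (a sketch; `num_labels=2` assumes a binary task, which this diff does not specify):

```python
from transformers import AutoModelForSequenceClassification

# A fresh classification head with the expected number of labels is placed on
# top of the pretrained DistilBERT body; num_labels=2 is an assumed binary task.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
```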
@@ -227,8 +227,8 @@
 "source": [
 "At this point, only three steps remain:\n",
 "\n",
-"1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/trainer#transformers.TrainingArguments).\n",
-"2. Pass the training arguments to a [Trainer](https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, and data collator.\n",
+"1. Define your training hyperparameters in [TrainingArguments](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.TrainingArguments).\n",
+"2. Pass the training arguments to a [Trainer](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.Trainer) along with the model, dataset, tokenizer, and data collator.\n",
 "3. Call `Trainer.train()` to fine-tune your model."
 ]
 },
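A hedged sketch of those three steps (the hyperparameter values and the `tokenized_dataset` / `data_collator` names are illustrative assumptions, not taken from this diff):

```python
from transformers import Trainer, TrainingArguments

# 1. Define the training hyperparameters.
training_args = TrainingArguments(
    output_dir="./results",            # where checkpoints are written
    learning_rate=2e-5,                # illustrative values
    per_device_train_batch_size=16,
    num_train_epochs=2,
)

# 2. Hand everything to the Trainer.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],  # hypothetical preprocessed dataset
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# 3. Fine-tune.
trainer.train()
```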
@@ -274,7 +274,7 @@
 "source": [
 "Fine-tuning with TensorFlow is just as easy, with only a few differences.\n",
 "\n",
-"Start by batching the processed examples together with dynamic padding using the [DataCollatorWithPadding](https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/data_collator#transformers.DataCollatorWithPadding) function.\n",
+"Start by batching the processed examples together with dynamic padding using the [DataCollatorWithPadding](https://huggingface.co/docs/transformers/master/en/main_classes/data_collator#transformers.DataCollatorWithPadding) function.\n",
 "Make sure you set `return_tensors=\"tf\"` to return `tf.Tensor` outputs instead of PyTorch tensors!"
 ]
 },
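A one-line sketch of the collator setup described above (the `tokenizer` variable is assumed from an earlier cell):

```python
from transformers import DataCollatorWithPadding

# Pads each batch to the length of its longest example and, with
# return_tensors="tf", yields tf.Tensor batches instead of PyTorch tensors.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
```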
@@ -345,7 +345,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Load your model with the [TFAutoModelForSequenceClassification](https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/auto#transformers.TFAutoModelForSequenceClassification) class along with the number of expected labels:"
+"Load your model with the [TFAutoModelForSequenceClassification](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.TFAutoModelForSequenceClassification) class along with the number of expected labels:"
 ]
 },
 {
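A sketch of the TensorFlow counterpart, with the Keras compile/fit steps such notebooks typically pair with it (the checkpoint, label count, and `tf_train_dataset` name are assumptions):

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2  # assumed binary task
)

# Transformers models output raw logits, so the loss needs from_logits=True.
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
model.fit(tf_train_dataset, epochs=3)  # tf_train_dataset: a hypothetical tf.data.Dataset
```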
@@ -548,7 +548,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Now you need to tokenize the text. Load the DistilBERT tokenizer with an [AutoTokenizer](https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/auto#transformers.AutoTokenizer):"
+"Now you need to tokenize the text. Load the DistilBERT tokenizer with an [AutoTokenizer](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoTokenizer):"
 ]
 },
 {
@@ -680,7 +680,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Load your model with the [AutoModelForTokenClassification](https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/auto#transformers.AutoModelForTokenClassification) class along with the number of expected labels:"
+"Load your model with the [AutoModelForTokenClassification](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForTokenClassification) class along with the number of expected labels:"
 ]
 },
 {
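The same loading pattern applies for token classification; a sketch (the label count depends on your tag set and is an assumption here):

```python
from transformers import AutoModelForTokenClassification

# Predicts one logit per token per tag; num_labels must equal the size of
# your NER tag set (13 is an assumed example, not from this diff).
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=13
)
```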
@@ -698,7 +698,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Gather your training arguments in [TrainingArguments](https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/trainer#transformers.TrainingArguments):"
+"Gather your training arguments in [TrainingArguments](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.TrainingArguments):"
 ]
 },
 {
@@ -722,7 +722,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Collect your model, training arguments, dataset, data collator, and tokenizer in [Trainer](https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/trainer#transformers.Trainer):"
+"Collect your model, training arguments, dataset, data collator, and tokenizer in [Trainer](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.Trainer):"
 ]
 },
 {
@@ -814,7 +814,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Load the model with the [TFAutoModelForTokenClassification](https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/auto#transformers.TFAutoModelForTokenClassification) class along with the number of expected labels:"
+"Load the model with the [TFAutoModelForTokenClassification](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.TFAutoModelForTokenClassification) class along with the number of expected labels:"
 ]
 },
 {
@@ -990,7 +990,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Load the DistilBERT tokenizer with an [AutoTokenizer](https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/auto#transformers.AutoTokenizer):"
+"Load the DistilBERT tokenizer with an [AutoTokenizer](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoTokenizer):"
 ]
 },
 {
@@ -1123,7 +1123,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Load your model with the [AutoModelForQuestionAnswering](https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/auto#transformers.AutoModelForQuestionAnswering) class:"
+"Load your model with the [AutoModelForQuestionAnswering](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoModelForQuestionAnswering) class:"
 ]
 },
 {
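Question answering heads predict answer start/end positions rather than class labels, so no label count is passed; a sketch, assuming the same DistilBERT checkpoint as above:

```python
from transformers import AutoModelForQuestionAnswering

# Adds a span-prediction head (start/end logits) on top of DistilBERT.
model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
```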
@@ -1141,7 +1141,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Gather your training arguments in [TrainingArguments](https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/trainer#transformers.TrainingArguments):"
+"Gather your training arguments in [TrainingArguments](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.TrainingArguments):"
 ]
 },
 {
@@ -1165,7 +1165,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Collect your model, training arguments, dataset, data collator, and tokenizer in [Trainer](https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/trainer#transformers.Trainer):"
+"Collect your model, training arguments, dataset, data collator, and tokenizer in [Trainer](https://huggingface.co/docs/transformers/master/en/main_classes/trainer#transformers.Trainer):"
 ]
 },
 {
@@ -1284,7 +1284,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Load your model with the [TFAutoModelForQuestionAnswering](https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/auto#transformers.TFAutoModelForQuestionAnswering) class:"
+"Load your model with the [TFAutoModelForQuestionAnswering](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.TFAutoModelForQuestionAnswering) class:"
 ]
 },
 {

transformers_doc/perplexity.ipynb

+2, -2
@@ -25,7 +25,7 @@
 "source": [
 "Perplexity (PPL) is one of the most common metrics for evaluating language models. Before diving in, we should note\n",
 "that the metric applies specifically to classical language models (sometimes called autoregressive or causal language\n",
-"models) and is not well defined for masked language models like BERT (see [summary of the models](https://huggingface.co/docs/transformers/v4.16.1/en/model_summary)).\n",
+"models) and is not well defined for masked language models like BERT (see [summary of the models](https://huggingface.co/docs/transformers/master/en/model_summary)).\n",
 "\n",
 "Perplexity is defined as the exponentiated average negative log-likelihood of a sequence. If we have a tokenized\n",
 "sequence $X = (x_0, x_1, \\dots, x_t)$, then the perplexity of $X$ is,\n",
@@ -56,7 +56,7 @@
 "<img width=\"600\" alt=\"Full decomposition of a sequence with unlimited context length\" src=\"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/ppl_full.gif\"/>\n",
 "\n",
 "When working with approximate models, however, we typically have a constraint on the number of tokens the model can\n",
-"process. The largest version of [GPT-2](https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/gpt2), for example, has a fixed length of 1024 tokens, so we\n",
+"process. The largest version of [GPT-2](https://huggingface.co/docs/transformers/master/en/model_doc/gpt2), for example, has a fixed length of 1024 tokens, so we\n",
 "cannot calculate $p_\\theta(x_t|x_{<t})$ directly when $t$ is greater than 1024.\n",
 "\n",
 "Instead, the sequence is typically broken into subsequences equal to the model's maximum input size. If a model's max\n",

transformers_doc/preprocessing.ipynb

+10, -10
@@ -24,10 +24,10 @@
 "metadata": {},
 "source": [
 "In this tutorial, we'll explore how to preprocess your data using 🤗 Transformers. The main tool for this is what we\n",
-"call a [tokenizer](https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/tokenizer). You can build one using the tokenizer class associated to the model\n",
-"you would like to use, or directly with the [AutoTokenizer](https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/auto#transformers.AutoTokenizer) class.\n",
+"call a [tokenizer](https://huggingface.co/docs/transformers/master/en/main_classes/tokenizer). You can build one using the tokenizer class associated to the model\n",
+"you would like to use, or directly with the [AutoTokenizer](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoTokenizer) class.\n",
 "\n",
-"As we saw in the [quick tour](https://huggingface.co/docs/transformers/v4.16.1/en/quicktour), the tokenizer will first split a given text in words (or part of\n",
+"As we saw in the [quick tour](https://huggingface.co/docs/transformers/master/en/quicktour), the tokenizer will first split a given text in words (or part of\n",
 "words, punctuation symbols, etc.) usually called _tokens_. Then it will convert those _tokens_ into numbers, to be able\n",
 "to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect\n",
 "to work properly.\n",
@@ -41,7 +41,7 @@
 "</Tip>\n",
 "\n",
 "To automatically download the vocab used during pretraining or fine-tuning a given model, you can use the\n",
-"[AutoTokenizer.from_pretrained()](https://huggingface.co/docs/transformers/v4.16.1/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained) method:"
+"[AutoTokenizer.from_pretrained()](https://huggingface.co/docs/transformers/master/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained) method:"
 ]
 },
 {
@@ -95,7 +95,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"A [PreTrainedTokenizer](https://huggingface.co/docs/transformers/v4.16.1/en/main_classes/tokenizer#transformers.PreTrainedTokenizer) has many methods, but the only one you need to remember for preprocessing\n",
+"A [PreTrainedTokenizer](https://huggingface.co/docs/transformers/master/en/main_classes/tokenizer#transformers.PreTrainedTokenizer) has many methods, but the only one you need to remember for preprocessing\n",
 "is its `__call__`: you just need to feed your sentence to your tokenizer object."
 ]
 },
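For instance, calling the tokenizer directly on a string (a sketch; the example sentence is illustrative and the output values depend on the checkpoint):

```python
encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)
# A dict mapping strings to lists of ints, e.g.
# {'input_ids': [...], 'attention_mask': [...], ...}
```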
@@ -126,9 +126,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"This returns a dictionary string to list of ints. The [input_ids](https://huggingface.co/docs/transformers/v4.16.1/en/glossary#input-ids) are the indices corresponding\n",
-"to each token in our sentence. We will see below what the [attention_mask](https://huggingface.co/docs/transformers/v4.16.1/en/glossary#attention-mask) is used for and\n",
-"in [the next section](#preprocessing-pairs-of-sentences) the goal of [token_type_ids](https://huggingface.co/docs/transformers/v4.16.1/en/glossary#token-type-ids).\n",
+"This returns a dictionary string to list of ints. The [input_ids](https://huggingface.co/docs/transformers/master/en/glossary#input-ids) are the indices corresponding\n",
+"to each token in our sentence. We will see below what the [attention_mask](https://huggingface.co/docs/transformers/master/en/glossary#attention-mask) is used for and\n",
+"in [the next section](#preprocessing-pairs-of-sentences) the goal of [token_type_ids](https://huggingface.co/docs/transformers/master/en/glossary#token-type-ids).\n",
 "\n",
 "The tokenizer can decode a list of token ids in a proper sentence:"
 ]
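And the round trip back to text (a sketch; the exact output depends on the tokenizer, which for BERT-style models re-inserts special tokens like [CLS] and [SEP]):

```python
decoded = tokenizer.decode(encoded_input["input_ids"])
print(decoded)
# e.g. "[CLS] Hello, I'm a single sentence! [SEP]" for a BERT-style tokenizer
```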
@@ -274,7 +274,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"It returns a dictionary with string keys and tensor values. We can now see what the [attention_mask](https://huggingface.co/docs/transformers/v4.16.1/en/glossary#attention-mask) is all about: it points out which tokens the model should pay attention to and which ones\n",
+"It returns a dictionary with string keys and tensor values. We can now see what the [attention_mask](https://huggingface.co/docs/transformers/master/en/glossary#attention-mask) is all about: it points out which tokens the model should pay attention to and which ones\n",
 "it should not (because they represent padding in this case).\n",
 "\n",
 "\n",
@@ -360,7 +360,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"This shows us what the [token_type_ids](https://huggingface.co/docs/transformers/v4.16.1/en/glossary#token-type-ids) are for: they indicate to the model which part of\n",
+"This shows us what the [token_type_ids](https://huggingface.co/docs/transformers/master/en/glossary#token-type-ids) are for: they indicate to the model which part of\n",
 "the inputs correspond to the first sentence and which part corresponds to the second sentence. Note that\n",
 "_token_type_ids_ are not required or handled by all models. By default, a tokenizer will only return the inputs that\n",
 "its associated model expects. You can force the return (or the non-return) of any of those special arguments by using\n",
