12 | 12 | "# ! pip install git+https://github.com/huggingface/transformers.git\n"
13 | 13 | ]
14 | 14 | },
15 |    | - {
16 |    | - "cell_type": "markdown",
17 |    | - "metadata": {},
18 |    | - "source": [
19 |    | - ".. \n",
20 |    | - "    Copyright 2020 The HuggingFace Team. All rights reserved.\n",
21 |    | - "\n",
22 |    | - "    Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use this file except in compliance with\n",
23 |    | - "    the License. You may obtain a copy of the License at\n",
24 |    | - "\n",
25 |    | - "        http://www.apache.org/licenses/LICENSE-2.0\n",
26 |    | - "\n",
27 |    | - "    Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on\n",
28 |    | - "    an \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the\n",
29 |    | - "    specific language governing permissions and limitations under the License."
30 |    | - ]
31 |    | - },
32 | 15 | {
33 | 16 | "cell_type": "markdown",
34 | 17 | "metadata": {},

44 | 27 | "call a [tokenizer](https://huggingface.co/transformers/main_classes/tokenizer.html). You can build one using the tokenizer class associated to the model\n",
45 | 28 | "you would like to use, or directly with the `AutoTokenizer` class.\n",
46 | 29 | "\n",
47 |    | - "As we saw in the [quicktour](https://huggingface.co/transformers/quicktour.html), the tokenizer will first split a given text in words (or part of words,\n",
48 |    | - "punctuation symbols, etc.) usually called *tokens*. Then it will convert those *tokens* into numbers, to be able to\n",
49 |    | - "build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect to\n",
50 |    | - "work properly."
| 30 | + "As we saw in the [quick tour](https://huggingface.co/transformers/quicktour.html), the tokenizer will first split a given text in words (or part of\n", |
| 31 | + "words, punctuation symbols, etc.) usually called *tokens*. Then it will convert those *tokens* into numbers, to be able\n", |
| 32 | + "to build a tensor out of them and feed them to the model. It will also add any additional inputs the model might expect\n", |
| 33 | + "to work properly." |
51 | 34 | ]
52 | 35 | },
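The pipeline this markdown cell describes (split text into tokens, map tokens to numbers, add the extra inputs the model expects) can be sketched with a toy whitespace tokenizer. The vocabulary and special tokens below are invented for illustration only; they are not the internals of `AutoTokenizer`:

```python
# Toy sketch of a tokenizer's job; vocabulary and special tokens are made up.
vocab = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "hello": 3, "world": 4}

def toy_encode(text):
    # 1. Split the text into tokens (real tokenizers also handle
    #    punctuation and subwords; this only splits on whitespace).
    tokens = text.lower().split()
    # 2. Convert the tokens into numbers using the vocabulary.
    ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    # 3. Add the additional inputs the model expects, e.g. special tokens.
    return [vocab["[CLS]"]] + ids + [vocab["[SEP]"]]

print(toy_encode("hello world"))  # [0, 3, 4, 1]
```

A real tokenizer would also return an attention mask and build tensors out of these lists, but the three steps above are the core of what the notebook cell is describing.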
|
53 | 36 | {
|
|
276 | 259 | "\n",
277 | 260 | "\n",
278 | 261 | "Note that if your model does not have a maximum length associated to it, the command above will throw a warning. You\n",
279 |     | - "can safely ignore it. You can also pass `verbose=False` to stop the tokenizer to throw those kinds of warnings."
| 262 | + "can safely ignore it. You can also pass `verbose=False` to stop the tokenizer from throwing those kinds of warnings." |
280 | 263 | ]
281 | 264 | },
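The behaviour this hunk describes (warn when no maximum length is associated with the model, silence the warning with `verbose=False`) can be mimicked in plain Python with the `warnings` module. `toy_truncate` and its `verbose` flag are hypothetical stand-ins for illustration, not the transformers API:

```python
import warnings

def toy_truncate(ids, max_length=None, verbose=True):
    # Hypothetical stand-in: warn when no maximum length is known,
    # unless verbose=False suppresses the warning.
    if max_length is None:
        if verbose:
            warnings.warn("No maximum length set; returning input unchanged.")
        return ids
    return ids[:max_length]

print(toy_truncate([1, 2, 3, 4], max_length=2))   # [1, 2]
print(toy_truncate([1, 2, 3, 4], verbose=False))  # [1, 2, 3, 4], no warning
```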
|
282 | 265 | {
|
|
477 | 460 | "metadata": {},
478 | 461 | "source": [
479 | 462 | "We have seen the commands that will work for most cases (pad your batch to the length of the maximum sentence and\n",
480 |     | - "\n",
481 | 463 | "truncate to the maximum length the model can accept). However, the API supports more strategies if you need them. The\n",
482 | 464 | "three arguments you need to know for this are `padding`, `truncation` and `max_length`.\n",
483 | 465 | "\n",
|
|
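The two default strategies this cell names (pad a batch to its longest sentence, truncate to a maximum length) can be sketched in plain Python. The helper name and the pad id `0` are illustrative choices, not the actual tokenizer API:

```python
def toy_pad_and_truncate(batch, max_length=None, pad_id=0):
    # Truncate each sequence to max_length, when one is given.
    if max_length is not None:
        batch = [seq[:max_length] for seq in batch]
    # Pad every sequence to the length of the longest one in the batch.
    longest = max(len(seq) for seq in batch)
    return [seq + [pad_id] * (longest - len(seq)) for seq in batch]

batch = [[5, 6, 7, 8, 9], [5, 6]]
print(toy_pad_and_truncate(batch, max_length=4))  # [[5, 6, 7, 8], [5, 6, 0, 0]]
```

The real `padding`, `truncation` and `max_length` arguments offer more strategies than this sketch (e.g. padding to a fixed length, or pairwise truncation), which is what the rest of the section goes on to cover.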