
Commit d7b50ce

Add examples/run_ner_no_trainer.py (#10902)
* Add NER example with accelerate library

  This commit contains the first (yet really unfinished) version of a script showing how to train a HuggingFace model with their new accelerate library.

* Fix metric calculation
* make style quality
* mv ner_no_trainer to token-classification dir
* Delete --debug flag from running script
* hf_datasets -> raw_datasets
* Make a few slight adjustments
* Add an informative comment + rewrite a help comment
* Change header
* Fix a few things
* Enforce to use fast tokenizers only
* DataCollatorWithPadding -> DataCollatorForTokenClassification
* Change bash script: python3 -> accelerate launch
* make style
* Add a few missing things (see below)
* Add a max-length padding to predictions and labels to enable accelerate gather functionality
* Add PyTorch no trainer example to the example README.md
* Remove --do-train from args as being redundant for now
* DataCollatorWithPadding -> DataCollatorForTokenClassification
* Remove some obsolete args.do_train conditions from the script
* Delete --do_train from bash running script
* Delete use_slow_tokenizer from args
* Add unintentionally removed flag --label_all_tokens
* Delete --debug flag from running script
1 parent 06a6fea commit d7b50ce
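One item in the message above, adding max-length padding to predictions and labels to enable `accelerate` gather functionality, is worth a quick illustration. A minimal sketch, assuming `-100` is the ignored label id; the helper name `pad_to_max_len` and the stand-in tensors are illustrative, not the script's exact code:

```python
import torch
import torch.nn.functional as F

def pad_to_max_len(tensor: torch.Tensor, max_len: int, pad_value: int = -100) -> torch.Tensor:
    # accelerator.gather needs the same tensor shape on every process, so
    # variable-length batches are padded up to a fixed maximum length first.
    return F.pad(tensor, (0, max_len - tensor.shape[-1]), value=pad_value)

predictions = torch.tensor([[4, 2, 7]])   # stand-in for argmax of model logits
labels = torch.tensor([[4, 2, -100]])     # stand-in for one batch's labels
print(pad_to_max_len(predictions, 8, pad_value=0))
print(pad_to_max_len(labels, 8))
# In the evaluation loop, the padded tensors would then be passed through
# accelerator.gather(...) on all processes before computing metrics.
```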

File tree

3 files changed: +629 −3 lines changed


examples/token-classification/README.md

Lines changed: 73 additions & 3 deletions
@@ -14,10 +14,12 @@ See the License for the specific language governing permissions and
 limitations under the License.
 -->
 
-## Token classification
+# Token classification
 
-Fine-tuning the library models for token classification task such as Named Entity Recognition (NER) or Parts-of-speech
-tagging (POS). The main scrip `run_ner.py` leverages the 🤗 Datasets library and the Trainer API. You can easily
+## PyTorch version
+
+Fine-tuning the library models for token classification tasks such as Named Entity Recognition (NER), Parts-of-speech
+tagging (POS) or phrase extraction (CHUNKS). The main script `run_ner.py` leverages the 🤗 Datasets library and the Trainer API. You can easily
 customize it to your needs if you need extra processing on your datasets.
 
 It will either run on a dataset hosted on our [hub](https://huggingface.co/datasets) or with your own text files for
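As a quick aside on the 🤗 Datasets side of this: a minimal sketch of inspecting a hub-hosted NER dataset and the subword alignment these scripts have to handle. `conll2003` and `bert-base-cased` are only illustrative choices here:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("conll2003")
example = raw_datasets["train"][0]
print(example["tokens"])    # pre-split words of one sentence
print(example["ner_tags"])  # one integer label id per word

# Words may split into several subword tokens, so word-level labels must be
# realigned; word_ids() maps each token back to its source word, and is only
# available on fast tokenizers (one reason the example enforces them).
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased", use_fast=True)
encoded = tokenizer(example["tokens"], is_split_into_words=True)
print(encoded.word_ids())   # None marks special tokens like [CLS] and [SEP]
```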
@@ -57,6 +59,74 @@ of the script.
 
 You can find the old version of the PyTorch script [here](https://github.com/huggingface/transformers/blob/master/examples/legacy/token-classification/run_ner.py).
 
+## PyTorch version, no Trainer
+
+Based on the script [run_ner_no_trainer.py](https://github.com/huggingface/transformers/blob/master/examples/token-classification/run_ner_no_trainer.py).
+
+Like `run_ner.py`, this script allows you to fine-tune any of the models on the [hub](https://huggingface.co/models) on a
+token classification task (NER, POS or CHUNKS), or on your own data in a CSV or JSON file. The main difference is that this
+script exposes the bare training loop, to allow you to quickly experiment and add any customization you would like.
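Exposing the bare training loop means the forward/backward/step cycle is spelled out in the script instead of hidden inside `Trainer`. A minimal runnable sketch of that pattern with 🤗 `Accelerate`, using toy stand-ins for the model and data rather than the script's actual objects:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Toy stand-ins; run_ner_no_trainer.py builds these from a transformers model
# and a tokenized dataset instead.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
dataset = TensorDataset(torch.randn(32, 8), torch.randint(0, 2, (32,)))
train_dataloader = DataLoader(dataset, batch_size=8)
loss_fn = torch.nn.CrossEntropyLoss()

accelerator = Accelerator()  # handles device placement, DDP, TPU, mixed precision
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for inputs, targets in train_dataloader:
    loss = loss_fn(model(inputs), targets)
    accelerator.backward(loss)  # replaces the usual loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Everything else in the loop (schedulers, gradient accumulation, logging) can be edited directly, which is the point of the no-`Trainer` version.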
+
+It offers fewer options than the script with `Trainer` (for instance you can easily change the options for the optimizer
+or the dataloaders directly in the script), but it still runs in a distributed setup, on TPU, and supports mixed precision by
+means of the [🤗 `Accelerate`](https://github.com/huggingface/accelerate) library. You can use the script normally
+after installing it:
+
+```bash
+pip install accelerate
+```
+
+then
+
+```bash
+export TASK_NAME=ner
+
+python run_ner_no_trainer.py \
+  --model_name_or_path bert-base-cased \
+  --task_name $TASK_NAME \
+  --max_seq_length 128 \
+  --per_device_train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3 \
+  --output_dir /tmp/$TASK_NAME/
+```
+
+You can then use your usual launchers to run it in a distributed environment (a `torch.distributed.launch` sketch follows this section), but the easiest way is to run
+
+```bash
+accelerate config
+```
+
+and reply to the questions asked. Then run
+
+```bash
+accelerate test
+```
+
+which will check that everything is ready for training. Finally, you can launch training with
+
+```bash
+export TASK_NAME=ner
+
+accelerate launch run_ner_no_trainer.py \
+  --model_name_or_path bert-base-cased \
+  --task_name $TASK_NAME \
+  --max_seq_length 128 \
+  --per_device_train_batch_size 32 \
+  --learning_rate 2e-5 \
+  --num_train_epochs 3 \
+  --output_dir /tmp/$TASK_NAME/
+```
+
+This command is the same and will work for:
+
+- a CPU-only setup
+- a setup with one GPU
+- distributed training with several GPUs (single or multi node)
+- training on TPUs
+
+Note that this library is in alpha release, so your feedback is more than welcome if you encounter any problems using it.
+
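For the usual launchers mentioned above, a hedged sketch of running the same script on a single node with 2 GPUs through `torch.distributed.launch`; the `--use_env` flag makes the launcher expose ranks as environment variables, which is how `Accelerate` picks them up:

```bash
export TASK_NAME=ner

python -m torch.distributed.launch --nproc_per_node 2 --use_env \
  run_ner_no_trainer.py \
  --model_name_or_path bert-base-cased \
  --task_name $TASK_NAME \
  --output_dir /tmp/$TASK_NAME/
```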
 
 ### TensorFlow version
 
 The following examples are covered in this section:
