@@ -19,7 +20,8 @@ This is the repo for the Stanford Alpaca project, which aims to build and share
Note: We thank the community for feedback on Stanford-Alpaca and supporting our research. Our live demo is suspended until further notice.
-**Usage and License Notices**: Alpaca is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.
+**Usage and License Notices**: Alpaca is intended and licensed for research use only. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.
+The weight diff is also CC BY NC 4.0 (allowing only non-commercial use).
## Overview
@@ -116,16 +118,12 @@ We fine-tune LLaMA-7B and LLaMA-13B with the following hyperparameters:
| Max length | 512 | 512 |
| Weight decay | 0 | 0 |
-We have also fine-tuned larger variants of LLaMA and performed subsequent RLHF and are in the process of evaluating those models.
-
To reproduce our fine-tuning runs for LLaMA, first install the requirements
```bash
pip install -r requirements.txt
```
-Then, install the particular fork of Hugging Face's transformers library.
-
Below is a command that fine-tunes LLaMA-7B with our dataset on a machine with 4 A100 80G GPUs in FSDP `full_shard` mode.
We were able to reproduce a model of similar quality as the one we hosted in our demo with the following command using **Python 3.10**.
Replace `<your_random_port>` with a port of your own, `<your_path_to_hf_converted_llama_ckpt_and_tokenizer>` with the
@@ -189,8 +187,8 @@ To run on more gpus, you may prefer to turn down `gradient_accumulation_steps` t
Naively, fine-tuning a 7B model requires about 7 x 4 x 4 = 112 GB of VRAM. Commands given above enable parameter sharding, so no redundant model copy is stored on any GPU.
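For reference, a back-of-the-envelope sketch of that 7 x 4 x 4 estimate, assuming fp32 storage for the weights, the gradients, and AdamW's two moment buffers (four 4-byte values per parameter):

```python
# Rough VRAM estimate for naive (unsharded, non-offloaded) full fine-tuning.
# Assumes fp32 copies of weights, gradients, and AdamW's exp_avg / exp_avg_sq.
params_in_billions = 7   # LLaMA-7B
bytes_per_value = 4      # fp32
state_copies = 4         # weights + gradients + two Adam moment buffers

total_gb = params_in_billions * bytes_per_value * state_copies
print(f"~{total_gb} GB of model and optimizer state")  # ~112 GB, i.e. 7 x 4 x 4
```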
If you'd like to further reduce the memory footprint, here are some options:
-- Turn on CPU offload for FSDP with `--fsdp "full_shard auto_wrap offload"`. This saves VRAM at the cost longer runtime.
-- In our experience, DeepSpeed stage-3 (with offload) can at times be more memory efficient than FSDP. Here's an example to use DeepSpeed stage-3 with 4 GPUs with both parameter and optimizer offload:
+- Turn on CPU offload for FSDP with `--fsdp "full_shard auto_wrap offload"`. This saves VRAM at the cost of longer runtime.
+- In our experience, DeepSpeed stage-3 (with offload) can at times be more memory efficient than FSDP with offload. Here's an example to use DeepSpeed stage-3 with 4 GPUs with both parameter and optimizer offload:
@@ -213,7 +211,7 @@ If you'd like to further reduce the memory footprint, here are some options:
--tf32 True
```
- The DeepSpeed library also provides some [helpful functions](https://deepspeed.readthedocs.io/en/latest/memory.html) to estimate memory usage.
-- [LoRA](https://arxiv.org/abs/2106.09685) fine-tunes low-rank slices of the query, key, and value embeddings. This can reduce the total memory footprint from 112GB to about 7x4=28GB. We may release our re-implementation of this in the future, but for now the [peft](https://github.com/huggingface/peft) codebase can be a useful resource.
+- [LoRA](https://arxiv.org/abs/2106.09685) fine-tunes low-rank slices of the query, key, and value embedding heads. This can reduce the total memory footprint from 112GB to about 7x4=28GB. We may release our re-implementation of this in the future, but for now the [peft](https://github.com/huggingface/peft) codebase can be a useful resource.
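To make the DeepSpeed stage-3 option above concrete, here is a minimal sketch of a ZeRO-3 configuration with both parameter and optimizer state offloaded to CPU. It follows DeepSpeed's documented config schema rather than any file shipped with this repo, and the output file name is a placeholder:

```python
# Sketch of a DeepSpeed ZeRO stage-3 config with CPU offload for parameters
# and optimizer state. Keys follow DeepSpeed's public config schema; the
# file name below is a placeholder, not a file from this repository.
import json

ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("zero3_offload_example.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Such a file is typically passed to the Hugging Face `Trainer` through its `--deepspeed` argument; the memory page linked above also documents estimator helpers (for example, `estimate_zero3_model_states_mem_needs_all_live`) for predicting per-GPU requirements before launching.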
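Similarly, a minimal sketch of the LoRA route via the peft library mentioned above. This is not the authors' own re-implementation; the `q_proj`/`k_proj`/`v_proj` module names assume the Hugging Face LLaMA port, and the checkpoint path reuses the placeholder from the commands above:

```python
# Sketch: attach LoRA adapters to the query/key/value projections of a causal LM
# using the peft library. Only the low-rank adapter weights are trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "<your_path_to_hf_converted_llama_ckpt_and_tokenizer>"  # same placeholder as in the commands above
)

lora_config = LoraConfig(
    r=8,                          # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj"],  # LLaMA attention projections in HF naming
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters require gradients
```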