
Commit d2c0c1b

Nightly (#632)

Authored by: danielhanchen, shimmyshimmer, chrehall68, neph1, and xyangk

* Update llama.py
* offload
* Update llama.py
* Update llama.py
* Update llama.py
* Update llama.py
* Update llama.py
* Update llama.py
* Update llama.py
* continued pretraining trainer
* Update trainer.py
* Update trainer.py
* Update trainer.py
* Update trainer.py
* is_bfloat16_supported
* Update __init__.py
* Update README.md
* Update llama.py
* is_bfloat16_supported
* Update __init__.py
* Mistral v3
* Phi 3 medium
* Update chat_templates.py
* Update chat_templates.py
* Phi-3
* Update save.py
* Update README.md Mistral v3 to Mistral v0.3
* Untrained tokens
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update llama.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update save.py
* Update save.py
* Update save.py
* checkpoint
* Update _utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update tokenizer_utils.py
* Update llama.py
* accelerate
* Update _utils.py
* Update _utils.py
* Update _utils.py
* Update _utils.py
* Update _utils.py
* Update _utils.py
* Update _utils.py
* Update tokenizer_utils.py
* train_dataloader
* Update llama.py
* Update llama.py
* Update llama.py
* use_fast_convert
* Update save.py
* Update save.py
* Update save.py
* Update save.py
* remove_special_tokens
* Ollama
* Update chat_templates.py
* Update chat_templates.py
* Update chat_templates.py
* Update llama.py
* Update chat_templates.py
* Support bfloat16 GGUF
* Update save.py
* Update llama.py
* fast_forward_inference
* Update mapper.py
* Update loader.py
* Update llama.py
* Update tokenizer_utils.py
* info
* edits
* Create chat template
* Fix tokenizer
* Update tokenizer_utils.py
* fix case where gguf saving fails due to first_conversion dtype (#630)
* Support revision parameter in FastLanguageModel.from_pretrained (#629)
* support `revision` parameter
* match unsloth formatting of named parameters
* clears any selected_adapters before calling internal_model.save_pretrained (#609)
* Update __init__.py (#602): Check for incompatible modules before importing unsloth
* Fixed unsloth/tokenizer_utils.py for chat training (#604)
* Add GGML saving option to Unsloth for easier Ollama model creation and testing. (#345)
* Add save to llama.cpp GGML to save.py.
* Fix conversion command and path of convert to GGML function.
* Add autosaving lora to the GGML function
* Create lora save function for conversion to GGML
* Test fix #2 for saving lora
* Test fix #3 to save the lora adapters to convert to GGML
* Remove unwated tokenizer saving for conversion to ggml and added a few print statements.
* Needed tokenizer for saving, added it back, also made it more unslothy style by having positional arguments, and added a few messages.
* Positional arguments didn't work out, so reverted to older version of the code, and added a few comments.
* Test fix 1 for arch
* Test fix 2 new Mistral error.
* Test fix 3
* Revert to old version for testing.
* Upload issue test fix 1
* Fix 2 uploading ggml
* Positional ags added.
* Temporray remove positional args
* Fix upload again!!!
* Add print statements and fix link
* Make the calling name better
* Create local saving for GGML
* Add choosing directory to save local GGML.
* Fix lil variable error in the save_to_custom_dir func
* docs: Add LoraConfig parameters documentation (#619)
* llama.cpp failing (#371)

  llama.cpp is failing to generate quantize versions for the trained models. Error:

  ```bash
  You might have to compile llama.cpp yourself, then run this again.
  You do not need to close this Python program. Run the following commands in a new terminal:
  You must run this in the same folder as you're saving your model.
  git clone https://github.com/ggerganov/llama.cpp
  cd llama.cpp && make clean && LLAMA_CUDA=1 make all -j
  Once that's done, redo the quantization.
  ```

  But when i do clone this with recursive it works.

  Co-authored-by: Daniel Han <danielhanchen@gmail.com>

* fix libcuda_dirs import for triton 3.0 (#227)
* fix libcuda_dirs import for triton 3.0
* Update __init__.py
* Update __init__.py

  Co-authored-by: Daniel Han <danielhanchen@gmail.com>

* Update save.py
* Update __init__.py
* Update fast_lora.py
* Update save.py
* Update save.py
* Update save.py
* Update loader.py
* Update save.py
* Update save.py
* quantize now llama-quantize
* Update chat_templates.py
* Update loader.py
* Update mapper.py
* Update __init__.py
* embedding size

---------

Co-authored-by: Michael Han <107991372+shimmyshimmer@users.noreply.github.com>
Co-authored-by: Eliot Hall <60240707+chrehall68@users.noreply.github.com>
Co-authored-by: Rickard Edén <rickardeden@gmail.com>
Co-authored-by: XiaoYang <xyangk@gmail.com>
Co-authored-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
Co-authored-by: mahiatlinux <110882203+mahiatlinux@users.noreply.github.com>
Co-authored-by: Sébastien De Greef <sebdg@binarycompany.com>
Co-authored-by: Alberto Ferrer <albertof@barrahome.org>
Co-authored-by: Thomas Viehmann <tv.github-private@beamnet.de>
1 parent 8a9e24e commit d2c0c1b

9 files changed (+342 lines added, -159 lines removed)


PARAMETERS.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@

## LoraConfig Parameters

Adjusting the `LoraConfig` parameters allows you to balance model performance and computational efficiency in Low-Rank Adaptation (LoRA). Here’s a concise breakdown of key parameters:

**r**
- **Description**: Rank of the low-rank decomposition for factorizing weight matrices.
- **Impact**:
  - **Higher**: Retains more information, increases computational load.
  - **Lower**: Fewer parameters, more efficient training, potential performance drop if too small.

**lora_alpha**
- **Description**: Scaling factor for the low-rank matrices' contribution.
- **Impact**:
  - **Higher**: Increases influence, speeds up convergence, risks instability or overfitting.
  - **Lower**: Subtler effect, may require more training steps.

**lora_dropout**
- **Description**: Probability of zeroing out elements in low-rank matrices for regularization.
- **Impact**:
  - **Higher**: More regularization, prevents overfitting, may slow training and degrade performance.
  - **Lower**: Less regularization, may speed up training, risks overfitting.

**loftq_config**
- **Description**: Configuration for LoftQ, a quantization method for the backbone weights and initialization of LoRA layers.
- **Impact**:
  - **Not None**: If specified, LoftQ will quantize the backbone weights and initialize the LoRA layers. It requires setting `init_lora_weights='loftq'`.
  - **None**: LoftQ quantization is not applied.
- **Note**: Do not pass an already quantized model when using LoftQ as LoftQ handles the quantization process itself.

**use_rslora**
- **Description**: Enables Rank-Stabilized LoRA (RSLora).
- **Impact**:
  - **True**: Uses Rank-Stabilized LoRA, setting the adapter scaling factor to `lora_alpha/math.sqrt(r)`, which has been proven to work better as per the [Rank-Stabilized LoRA paper](https://doi.org/10.48550/arXiv.2312.03732).
  - **False**: Uses the original default scaling factor `lora_alpha/r`.

**gradient_accumulation_steps**
- **Default**: 1
- **Description**: The number of steps to accumulate gradients before performing a backpropagation update.
- **Impact**:
  - **Higher**: Accumulate gradients over multiple steps, effectively increasing the batch size without requiring additional memory. This can improve training stability and convergence, especially with large models and limited hardware.
  - **Lower**: Faster updates but may require more memory per step and can be less stable.

**weight_decay**
- **Default**: 0.01
- **Description**: Regularization technique that applies a small penalty to the weights during training.
- **Impact**:
  - **Non-zero Value (e.g., 0.01)**: Adds a penalty proportional to the magnitude of the weights to the loss function, helping to prevent overfitting by discouraging large weights.
  - **Zero**: No weight decay is applied, which can lead to overfitting, especially in large models or with small datasets.

**learning_rate**
- **Default**: 2e-4
- **Description**: The rate at which the model updates its parameters during training.
- **Impact**:
  - **Higher**: Faster convergence but risks overshooting optimal parameters and causing instability in training.
  - **Lower**: More stable and precise updates but may slow down convergence, requiring more training steps to achieve good performance.

## Target Modules

**q_proj (query projection)**
- **Description**: Part of the attention mechanism in transformer models, responsible for projecting the input into the query space.
- **Impact**: Transforms the input into query vectors that are used to compute attention scores.

**k_proj (key projection)**
- **Description**: Projects the input into the key space in the attention mechanism.
- **Impact**: Produces key vectors that are compared with query vectors to determine attention weights.

**v_proj (value projection)**
- **Description**: Projects the input into the value space in the attention mechanism.
- **Impact**: Produces value vectors that are weighted by the attention scores and combined to form the output.

**o_proj (output projection)**
- **Description**: Projects the output of the attention mechanism back into the original space.
- **Impact**: Transforms the combined weighted value vectors back to the input dimension, integrating attention results into the model.

**gate_proj (gate projection)**
- **Description**: Typically used in gated mechanisms within neural networks, such as gating units in gated recurrent units (GRUs) or other gating mechanisms.
- **Impact**: Controls the flow of information through the gate, allowing selective information passage based on learned weights.

**up_proj (up projection)**
- **Description**: Used for up-projection, typically increasing the dimensionality of the input.
- **Impact**: Expands the input to a higher-dimensional space, often used in feedforward layers or when transitioning between different layers with differing dimensionalities.

**down_proj (down projection)**
- **Description**: Used for down-projection, typically reducing the dimensionality of the input.
- **Impact**: Compresses the input to a lower-dimensional space, useful for reducing computational complexity and controlling the model size.
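
To see how these settings fit together, here is a minimal sketch that wires the parameters above into Unsloth's `FastLanguageModel.get_peft_model` and a TRL `SFTTrainer`. The tiny dataset, sequence length, and step count are placeholders rather than values from this commit, and the exact `SFTTrainer` keywords can vary with your `trl` version.

```python
from unsloth import FastLanguageModel, is_bfloat16_supported
from transformers import TrainingArguments
from trl import SFTTrainer
from datasets import Dataset

# Load a 4-bit base model (any supported model name works here).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# LoraConfig-style parameters described above.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,              # rank of the low-rank decomposition
    lora_alpha = 16,     # scaling factor for the LoRA update
    lora_dropout = 0.0,  # dropout on the low-rank matrices
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    use_rslora = False,  # True switches scaling to lora_alpha / sqrt(r)
    loftq_config = None, # or a LoftQConfig to quantize the backbone
)

# Placeholder dataset with a "text" column, just to make the sketch runnable.
dataset = Dataset.from_dict({"text": ["### Instruction:\nSay hi.\n\n### Response:\nHi!"]})

# Trainer-level knobs from the table: gradient accumulation, weight decay, learning rate.
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 1,
        learning_rate = 2e-4,
        weight_decay = 0.01,
        max_steps = 60,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        output_dir = "outputs",
    ),
)
trainer.train()
```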

unsloth/__init__.py

Lines changed: 25 additions & 4 deletions
```diff
@@ -14,8 +14,20 @@
 import os
 import warnings
 import importlib
+import sys
+from packaging.version import Version
 
-# Currently only supports 1 GPU, or else seg faults will occur.
+# Define a list of modules to check
+MODULES_TO_CHECK = ["peft", "bitsandbytes"]
+
+# Check if any of the modules in the list have been imported
+for module in MODULES_TO_CHECK:
+    if module in sys.modules:
+        raise ImportError(f"Unsloth: Please import Unsloth before {module}.")
+    pass
+pass
+
+# Currently only supports 1 GPU, or else seg faults will occur.
 if "CUDA_VISIBLE_DEVICES" in os.environ:
     os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
     devices = os.environ["CUDA_VISIBLE_DEVICES"]
@@ -66,8 +78,14 @@ def is_bf16_supported(): return SUPPORTS_BFLOAT16
 
 # Try loading bitsandbytes and triton
 import bitsandbytes as bnb
+
 import triton
-from triton.common.build import libcuda_dirs
+libcuda_dirs = lambda: None
+if Version(triton.__version__) >= Version("3.0.0"):
+    try: from triton.backends.nvidia.driver import libcuda_dirs
+    except: pass
+else: from triton.common.build import libcuda_dirs
+
 import os
 import re
 import numpy as np
@@ -103,8 +121,11 @@ def is_bf16_supported(): return SUPPORTS_BFLOAT16
     importlib.reload(bnb)
     importlib.reload(triton)
     try:
-        import bitsandbytes as bnb
-        from triton.common.build import libcuda_dirs
+        libcuda_dirs = lambda: None
+        if Version(triton.__version__) >= Version("3.0.0"):
+            try: from triton.backends.nvidia.driver import libcuda_dirs
+            except: pass
+        else: from triton.common.build import libcuda_dirs
         cdequantize_blockwise_fp32 = bnb.functional.lib.cdequantize_blockwise_fp32
         libcuda_dirs()
     except:
```
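
As a usage note, the module check above makes import order significant. A small sketch of the intended ordering; the error text comes from the diff, the surrounding code is illustrative:

```python
# Import Unsloth first, then the libraries it patches.
import unsloth            # must come before peft / bitsandbytes
import peft               # fine: Unsloth is already initialized
import bitsandbytes

# Reversing the order (importing peft first, then unsloth) would raise:
#   ImportError: Unsloth: Please import Unsloth before peft.
```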

unsloth/chat_templates.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -1286,7 +1286,7 @@ def test_hf_gguf_equivalence(tokenizer, gguf_model = "./model-unsloth.F16.gguf")
     pass
 
     for prompt in prompts:
-        command = f"./llama.cpp/main -m {gguf_model} -n 0 --temp 0.0 --verbose-prompt "\
+        command = f"./llama.cpp/llama-cli -m {gguf_model} -n 0 --temp 0.0 --verbose-prompt "\
                   f"--check-tensors -p '{prompt}'"
 
         datas = []
```
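
For context, recent llama.cpp builds renamed their binaries, so the equivalence test now shells out to `llama-cli` (and, per the commit message, `quantize` became `llama-quantize`). A rough sketch of the equivalent manual call; the real helper builds this string inside `test_hf_gguf_equivalence`, and the model path and prompt here are placeholders:

```python
import subprocess

gguf_model = "./model-unsloth.F16.gguf"   # default path used by the helper
prompt = "Hello!"                         # placeholder prompt

# Mirrors the command assembled in the diff above.
command = (
    f"./llama.cpp/llama-cli -m {gguf_model} -n 0 --temp 0.0 --verbose-prompt "
    f"--check-tensors -p '{prompt}'"
)
subprocess.run(command, shell = True, check = True)
```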

unsloth/kernels/__init__.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -24,6 +24,7 @@
 )
 from .fast_lora import (
     get_lora_parameters,
+    get_lora_parameters_bias,
     apply_lora_mlp_swiglu,
     apply_lora_mlp_geglu_exact,
     apply_lora_mlp_geglu_approx,
```

unsloth/kernels/fast_lora.py

Lines changed: 7 additions & 1 deletion
```diff
@@ -13,7 +13,13 @@
 # limitations under the License.
 
 import torch
-from .utils import fast_dequantize, QUANT_STATE, get_lora_parameters, matmul_lora
+from .utils import (
+    fast_dequantize,
+    QUANT_STATE,
+    get_lora_parameters,
+    get_lora_parameters_bias,
+    matmul_lora,
+)
 
 
 class LoRA_MLP(torch.autograd.Function):
```

unsloth/models/loader.py

Lines changed: 38 additions & 30 deletions
```diff
@@ -33,11 +33,8 @@
 
 def _get_model_name(model_name, load_in_4bit = True):
 
-    # First try replacing lowercase 'b' with uppercase 'B'
-    model_name = model_name.lower()
-
     if not SUPPORTS_FOURBIT and model_name in INT_TO_FLOAT_MAPPER:
-        model_name = INT_TO_FLOAT_MAPPER[model_name]
+        model_name = INT_TO_FLOAT_MAPPER[model_name.lower()]
         logger.warning_once(
             f"Unsloth: Your transformers version of {transformers_version} does not support native "\
             f"4bit loading.\nThe minimum required version is 4.37.\n"\
@@ -47,15 +44,15 @@ def _get_model_name(model_name, load_in_4bit = True):
         )
 
     elif not load_in_4bit and model_name in INT_TO_FLOAT_MAPPER:
-        new_model_name = INT_TO_FLOAT_MAPPER[model_name]
+        new_model_name = INT_TO_FLOAT_MAPPER[model_name.lower()]
         # logger.warning_once(
         #     f"Unsloth: You passed in `{model_name}` which is a 4bit model, yet you set\n"\
         #     f"`load_in_4bit = False`. We shall load `{new_model_name}` instead."
         # )
         model_name = new_model_name
 
     elif load_in_4bit and SUPPORTS_FOURBIT and model_name in FLOAT_TO_INT_MAPPER:
-        new_model_name = FLOAT_TO_INT_MAPPER[model_name]
+        new_model_name = FLOAT_TO_INT_MAPPER[model_name.lower()]
         # logger.warning_once(
         #     f"Unsloth: You passed in `{model_name}` and `load_in_4bit = True`.\n"\
         #     f"We shall load `{new_model_name}` for 4x faster loading."
@@ -70,17 +67,18 @@ def _get_model_name(model_name, load_in_4bit = True):
 class FastLanguageModel(FastLlamaModel):
     @staticmethod
     def from_pretrained(
-        model_name = "unsloth/llama-3-8b-bnb-4bit",
-        max_seq_length = None,
-        dtype = None,
-        load_in_4bit = True,
-        token = None,
-        device_map = "sequential",
-        rope_scaling = None,
-        fix_tokenizer = True,
-        trust_remote_code = False,
-        use_gradient_checkpointing = True,
-        resize_model_vocab = None,
+        model_name                 = "unsloth/llama-3-8b-bnb-4bit",
+        max_seq_length             = None,
+        dtype                      = None,
+        load_in_4bit               = True,
+        token                      = None,
+        device_map                 = "sequential",
+        rope_scaling               = None,
+        fix_tokenizer              = True,
+        trust_remote_code          = False,
+        use_gradient_checkpointing = "unsloth",
+        resize_model_vocab         = None,
+        revision                   = None,
         *args, **kwargs,
     ):
         if token is None and "HF_TOKEN" in os.environ:
@@ -95,12 +93,12 @@ def from_pretrained(
         # First check if it's a normal model via AutoConfig
         is_peft = False
         try:
-            model_config = AutoConfig.from_pretrained(model_name, token = token)
+            model_config = AutoConfig.from_pretrained(model_name, token = token, revision = revision)
             is_peft = False
         except:
             try:
                 # Most likely a PEFT model
-                peft_config = PeftConfig.from_pretrained(model_name, token = token)
+                peft_config = PeftConfig.from_pretrained(model_name, token = token, revision = revision)
             except:
                 raise RuntimeError(f"Unsloth: `{model_name}` is not a full model or a PEFT model.")
 
@@ -143,22 +141,24 @@ def from_pretrained(
         pass
 
         model, tokenizer = dispatch_model.from_pretrained(
-            model_name = model_name,
-            max_seq_length = max_seq_length,
-            dtype = dtype,
-            load_in_4bit = load_in_4bit,
-            token = token,
-            device_map = device_map,
-            rope_scaling = rope_scaling,
-            fix_tokenizer = fix_tokenizer,
-            model_patcher = dispatch_model,
-            tokenizer_name = tokenizer_name,
+            model_name        = model_name,
+            max_seq_length    = max_seq_length,
+            dtype             = dtype,
+            load_in_4bit      = load_in_4bit,
+            token             = token,
+            device_map        = device_map,
+            rope_scaling      = rope_scaling,
+            fix_tokenizer     = fix_tokenizer,
+            model_patcher     = dispatch_model,
+            tokenizer_name    = tokenizer_name,
             trust_remote_code = trust_remote_code,
+            revision          = revision if not is_peft else None,
             *args, **kwargs,
         )
 
         if resize_model_vocab is not None:
             model.resize_token_embeddings(resize_model_vocab)
+        pass
 
         # In case the model supports tagging, add the unsloth tag.
         if hasattr(model, "add_model_tags"):
@@ -188,8 +188,16 @@ def from_pretrained(
         pass
 
        if is_peft:
+            # From https://github.com/huggingface/peft/issues/184
             # Now add PEFT adapters
-            model = PeftModel.from_pretrained(model, old_model_name, token = token)
+            model.enable_input_require_grads()
+            model = PeftModel.from_pretrained(
+                model,
+                old_model_name,
+                token = token,
+                revision = revision,
+                is_trainable = True,
+            )
             # Patch it as well!
             model = dispatch_model.patch_peft_model(model, use_gradient_checkpointing)
         pass
```
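
A short usage sketch of the new `revision` argument added in this commit; the revision string below is a placeholder, not a value from the diff:

```python
from unsloth import FastLanguageModel

# Pin the download to a specific branch, tag, or commit SHA on the Hub.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
    revision = "main",   # placeholder: any branch, tag, or commit SHA
)
```

For PEFT repositories, the same `revision` is forwarded to `PeftModel.from_pretrained`, and the adapter is now loaded with `is_trainable = True` so it can be fine-tuned further.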

unsloth/models/mapper.py

Lines changed: 3 additions & 0 deletions
```diff
@@ -186,6 +186,9 @@
     "unsloth/Qwen2-70B-Instruct-bnb-4bit" : (
         "Qwen/Qwen2-70B-Instruct",
     ),
+    "mistralai/Codestral-22B-v0.1" : (
+        "mistral-community/Codestral-22B-v0.1",
+    ),
 }
 
 INT_TO_FLOAT_MAPPER = {}
```
