Adjusting the `LoraConfig` parameters allows you to balance model performance and computational efficiency in Low-Rank Adaptation (LoRA). Here’s a concise breakdown of key parameters:
**r**
- **Description**: Rank of the low-rank decomposition for factorizing weight matrices.
- **Impact**:
  - **Higher**: Retains more information, increases computational load.
  - **Lower**: Fewer parameters, more efficient training, potential performance drop if too small.
**lora_alpha**
- **Description**: Scaling factor for the low-rank matrices' contribution.
- **Impact**:
  - **Higher**: Increases influence, speeds up convergence, risks instability or overfitting.
  - **Lower**: Subtler effect, may require more training steps.
**lora_dropout**
- **Description**: Probability of zeroing out elements in the low-rank matrices for regularization.
- **Impact**:
  - **Higher**: More regularization, prevents overfitting, may slow training and degrade performance.
  - **Lower**: Less regularization, may speed up training, risks overfitting.
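A minimal sketch of how these three parameters fit together, assuming the Hugging Face `peft` library; the base model name and the specific values are illustrative, not recommendations:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative values: r sets adapter capacity, lora_alpha scales the adapter's
# contribution, and lora_dropout regularizes the low-rank matrices.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder base model
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports how few parameters LoRA actually trains
```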
**loftq_config**
- **Description**: Configuration for LoftQ, a quantization method for the backbone weights and initialization of LoRA layers.
- **Impact**:
  - **Not None**: If specified, LoftQ will quantize the backbone weights and initialize the LoRA layers. It requires setting `init_lora_weights='loftq'`.
  - **None**: LoftQ quantization is not applied.
  - **Note**: Do not pass an already-quantized model when using LoftQ, as LoftQ handles the quantization process itself.
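A hedged sketch of how this is typically wired up in `peft`, assuming your version ships `LoftQConfig`; note that the base model is loaded in full precision, since LoftQ performs the quantization itself. The model name and values are placeholders:

```python
from peft import LoftQConfig, LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the full-precision base model; do NOT pass a pre-quantized model.
base_model = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder name

loftq_config = LoftQConfig(loftq_bits=4)  # quantize the backbone weights to 4 bits

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    init_lora_weights="loftq",   # required when loftq_config is set
    loftq_config=loftq_config,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
```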
**use_rslora**
- **Description**: Whether to use Rank-Stabilized LoRA (rsLoRA), which changes how the adapter scaling factor is computed.
- **Impact**:
  - **True**: Uses Rank-Stabilized LoRA, setting the adapter scaling factor to `lora_alpha/math.sqrt(r)`, which has been shown to work better per the [Rank-Stabilized LoRA paper](https://doi.org/10.48550/arXiv.2312.03732).
  - **False**: Uses the original default scaling factor `lora_alpha/r`.
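To see what the two settings mean numerically, a quick sketch of the scaling factor each one produces for an example rank and alpha:

```python
import math

r, lora_alpha = 64, 16  # example values

default_scaling = lora_alpha / r            # use_rslora=False -> 0.25
rslora_scaling = lora_alpha / math.sqrt(r)  # use_rslora=True  -> 2.0

print(default_scaling, rslora_scaling)
```

Because the rsLoRA factor shrinks only with `sqrt(r)`, the adapter's contribution does not vanish as the rank grows, which is roughly the motivation for rank stabilization.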
**gradient_accumulation_steps**
- **Default**: 1
- **Description**: The number of steps to accumulate gradients before performing a backpropagation update.
- **Impact**:
  - **Higher**: Accumulates gradients over multiple steps, effectively increasing the batch size without requiring additional memory. This can improve training stability and convergence, especially with large models and limited hardware.
  - **Lower**: Faster, more frequent updates, but reaching the same effective batch size requires a larger per-device batch (more memory per step), and training can be less stable.
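A short sketch of how this is usually set with the Hugging Face `transformers` `TrainingArguments`; the output directory and values are illustrative:

```python
from transformers import TrainingArguments

# Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
#                      = 2 * 8 = 16, while only 2 samples are in memory at a time.
training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
)
```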
**weight_decay**
- **Default**: 0.01
- **Description**: Regularization technique that applies a small penalty to the weights during training.
- **Impact**:
  - **Non-zero value (e.g., 0.01)**: Adds a penalty proportional to the magnitude of the weights to the loss function, helping to prevent overfitting by discouraging large weights.
  - **Zero**: No weight decay is applied, which can lead to overfitting, especially in large models or with small datasets.
**learning_rate**
- **Default**: 2e-4
- **Description**: The rate at which the model updates its parameters during training.
- **Impact**:
  - **Higher**: Faster convergence, but risks overshooting optimal parameters and causing instability in training.
  - **Lower**: More stable and precise updates, but may slow down convergence, requiring more training steps to achieve good performance.
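Both of these settings map directly onto `TrainingArguments` in the Hugging Face trainer; a sketch using the values discussed above, which you would tune for your own setup:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    learning_rate=2e-4,  # lower this if training becomes unstable
    weight_decay=0.01,   # small penalty on weight magnitudes to curb overfitting
)
```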
## Target Modules

**q_proj (query projection)**
- **Description**: Part of the attention mechanism in transformer models, responsible for projecting the input into the query space.
- **Impact**: Transforms the input into query vectors that are used to compute attention scores.
**k_proj (key projection)**
- **Description**: Projects the input into the key space in the attention mechanism.
- **Impact**: Produces key vectors that are compared with query vectors to determine attention weights.
**v_proj (value projection)**
- **Description**: Projects the input into the value space in the attention mechanism.
- **Impact**: Produces value vectors that are weighted by the attention scores and combined to form the output.
**o_proj (output projection)**
- **Description**: Projects the output of the attention mechanism back into the original space.
- **Impact**: Transforms the combined weighted value vectors back to the input dimension, integrating attention results into the model.
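To make the four attention projections concrete, here is a stripped-down, single-head attention sketch in PyTorch; the class name is mine and it only illustrates where these module names live, not any particular model's implementation:

```python
import math
import torch
import torch.nn as nn

class TinyAttention(nn.Module):
    """Single-head attention showing where q_proj, k_proj, v_proj, and o_proj sit."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # input -> query space
        self.k_proj = nn.Linear(dim, dim)  # input -> key space
        self.v_proj = nn.Linear(dim, dim)  # input -> value space
        self.o_proj = nn.Linear(dim, dim)  # attention output -> model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
        weights = scores.softmax(dim=-1)
        return self.o_proj(weights @ v)
```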
**gate_proj (gate projection)**
- **Description**: Used in gated mechanisms within neural networks, most notably the gated feed-forward (SwiGLU-style) blocks of LLaMA-family transformer models.
- **Impact**: Controls the flow of information through the gate, allowing selective information passage based on learned weights.
**up_proj (up projection)**
- **Description**: Used for up-projection, typically increasing the dimensionality of the input.
- **Impact**: Expands the input to a higher-dimensional space, often used in feedforward layers or when transitioning between layers of differing dimensionality.
**down_proj (down projection)**
- **Description**: Used for down-projection, typically reducing the dimensionality of the input.
- **Impact**: Compresses the input to a lower-dimensional space, useful for reducing computational complexity and controlling the model size.
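The three MLP projections above follow the gated feed-forward pattern used by LLaMA-style models. A rough sketch of that block, followed by the way all seven module names from this section are commonly passed to `LoraConfig` via `target_modules` (class name and values are illustrative):

```python
import torch
import torch.nn as nn
from peft import LoraConfig

class GatedMLP(nn.Module):
    """LLaMA-style feed-forward block showing gate_proj, up_proj, and down_proj."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim)  # produces the gating signal
        self.up_proj = nn.Linear(dim, hidden_dim)    # up-projection to the hidden size
        self.down_proj = nn.Linear(hidden_dim, dim)  # back down to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gate the up-projected activations, then project back down.
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))

# Applying LoRA to all attention and MLP projections described in this section.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```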