LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) are techniques designed to fine-tune large language models efficiently, focusing on training fewer parameters while maintaining performance.
What does LoRA do?
- LoRA injects small, trainable rank-decomposition matrices into the weights of the original pre-trained model.
- During fine-tuning, only these low-rank matrices are updated, while the original model weights remain frozen.
- This drastically reduces the memory and compute required for fine-tuning, enabling adaptation on modest hardware.
- It keeps the original model intact, allowing efficient storage and parameter sharing.
What does QLoRA do?
- QLoRA is an extension that combines quantization with LoRA.
- It quantizes the base model weights (e.g., to 4-bit representations) to reduce model size and memory footprint.
- Fine-tuning is still done on the low-rank adaptation matrices, but now on top of a highly compressed model.
- This enables fine-tuning of very large models on consumer-grade GPUs (a minimal configuration sketch follows below).
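A minimal QLoRA configuration sketch, assuming the Hugging Face transformers, peft, and bitsandbytes libraries (the model name is a placeholder and exact argument names can vary by library version):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder model name, used for illustration only
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model", quantization_config=bnb_config
)

# Low-rank adapters attached to the attention projections; only these are trained
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # reports the small fraction of trainable weights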
Summary:
- LoRA fine-tunes a small fraction of parameters (low-rank adapters) added to a frozen base model.
- QLoRA adds quantization to LoRA, further lowering memory requirements while preserving accuracy.
- Both methods enable efficient, resource-friendly fine-tuning of large language models.
Neither LoRA nor QLoRA changes the base model weights directly; they learn small adapter updates, letting you adapt a model without full retraining, and both are commonly used in modern large-scale fine-tuning workflows.
Example of a Weight Matrix in an LLM
In a large language model (LLM) like a Transformer, the core computation in many layers involves multiplying input vectors by weight matrices.
- Suppose the input is a sequence of tokens, each represented as a vector of size d_emb = 512.
- The model has a linear layer with weight matrix W of shape 512 × 1536.
- This matrix W projects each input token vector into the concatenated query, key, and value vectors needed for multi-head attention:
P = XW
where X is a matrix whose rows are the token embeddings (sequence length × 512), and W maps it to a (sequence length × 1536) output (512 dimensions each for query, key, and value).
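A small NumPy sketch of this projection with toy values (the shapes mirror the 512 → 1536 mapping described above):

import numpy as np

seq_len, d_emb = 4, 512              # toy sequence of 4 tokens, embedding size 512
d_out = 3 * d_emb                    # 1536 = concatenated query, key, value projections

X = np.random.randn(seq_len, d_emb)  # token embeddings, one row per token
W = np.random.randn(d_emb, d_out)    # projection weight matrix (512 x 1536)

P = X @ W                            # (4 x 512) @ (512 x 1536) -> (4 x 1536)
Q, K, V = np.split(P, 3, axis=1)     # split back into query, key, value, each (4 x 512)
print(P.shape, Q.shape, K.shape, V.shape)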
Simple numeric example (smaller scale):
Input token embedding x:
x = [0.2, -0.3, 0.5]
Weight matrix W (3 x 2):
W = [
[ 0.1, 0.4 ],
[ -0.2, 0.3 ],
[ 0.7, -0.5 ]
]
Linear layer output:
y = xW = [
0.2*0.1 + (-0.3)*(-0.2) + 0.5*0.7,
0.2*0.4 + (-0.3)*0.3 + 0.5*(-0.5)
] = [0.43, -0.26]
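A quick NumPy check of this small example (same numbers as above):

import numpy as np

x = np.array([0.2, -0.3, 0.5])        # input token embedding
W = np.array([[ 0.1,  0.4],
              [-0.2,  0.3],
              [ 0.7, -0.5]])          # weight matrix (3 x 2)

y = x @ W                             # linear layer output
print(y)                              # -> [ 0.43 -0.26]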
This matrix multiplication is the fundamental calculation in layers, mapping input embeddings to query/key/value vectors and ultimately to predicted tokens.
The large weight matrices in LLMs are learned during pre-training and fine-tuning, transforming input sequences into meaningful contextual representations.
How would LoRA work on the above 3×2 weight matrix?
Let’s apply LoRA fine-tuning to the example 3×2 weight matrix W:
Original matrix W:
W = [
[0.1, 0.4],
[-0.2, 0.3],
[0.7, -0.5]
]
LoRA Approach
- Choose a low rank r, e.g., r = 1.
- Introduce two small matrices:
A = [
[a₁],
[a₂],
[a₃]
] (3×1),
B = [
[b₁, b₂]
] (1×2)
where A and B are trainable; in practice one is initialized randomly and the other to zeros, so the initial update ΔW is zero.
Compute LoRA update ΔW = A × B:
ΔW = [
[a₁b₁, a₁b₂],
[a₂b₁, a₂b₂],
[a₃b₁, a₃b₂]
]
Adapted weight matrix W’:
W' = W + α × ΔW = [
[0.1 + α a₁ b₁, 0.4 + α a₁ b₂],
[-0.2 + α a₂ b₁, 0.3 + α a₂ b₂],
[0.7 + α a₃ b₁, -0.5 + α a₃ b₂]
]
During training:
- W is frozen (does not change).
- Only A and B are updated via backpropagation.
- The combined W' represents the fine-tuned weights.
Intuition
A and B contain far fewer parameters than the original W. In this toy case the saving is small (3 + 2 = 5 values instead of 6), but for a real 512 × 1536 layer with rank r = 8 the adapters hold 8 × (512 + 1536) = 16,384 parameters instead of 786,432, making fine-tuning efficient while still adapting the model's behavior effectively.
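A minimal NumPy sketch of this rank-1 LoRA update on the same 3×2 matrix (the initialization and scaling choices here are illustrative assumptions):

import numpy as np

# Frozen original weight matrix W (3 x 2), as in the example above
W = np.array([[ 0.1,  0.4],
              [-0.2,  0.3],
              [ 0.7, -0.5]])

# Trainable rank-1 adapters: A is 3x1, B is 1x2
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(3, 1))   # initialized randomly
B = np.zeros((1, 2))                     # initialized to zeros, so delta_W starts at 0
alpha = 1.0                              # LoRA scaling factor

delta_W = A @ B                          # rank-1 update (3 x 2)
W_adapted = W + alpha * delta_W          # effective fine-tuned weights W'

x = np.array([0.2, -0.3, 0.5])
print(x @ W_adapted)                     # equals x @ W until A and B are trained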
What are the other methods besides LoRA and QLoRA?
Other efficient fine-tuning methods for large language models (LLMs) include:
- Adapter Layers
- Adapters insert small trainable layers into selected points of the original (frozen) network.
- Only the adapter parameters are updated during fine-tuning, keeping most of the model unchanged.
- This allows multi-task flexibility and efficient parameter storage (a minimal adapter sketch follows after this list).
- Prefix Tuning
- Learns a fixed-length “prefix” of continuous vectors that is prepended to the hidden activations (keys and values) at each layer, optimizing just these prefix parameters.
- All original model weights remain frozen.
- Useful for tasks where a small amount of trainable conditioning is sufficient.
- Prompt Tuning (Soft Prompts)
- Optimizes a set of learnable input prompts/embeddings while keeping the model frozen.
- Prompts are tuned to steer model behavior for new tasks, with very low parameter overhead.
- Full Fine-Tuning
- Updates every parameter in the model on new data.
- Offers maximal flexibility and accuracy, but is extremely resource- and data-intensive for LLMs.
- Feature Extraction / Head Tuning
- Freezes all model layers except the final head/classifier, which is fine-tuned.
- Often combined with using the LLM as a fixed “feature extractor” for various downstream classifiers.
- Knowledge Distillation
- Trains a smaller “student” model to mimic a larger pretrained “teacher,” using the teacher’s output as supervision.
- Retrieval-Augmented Fine-Tuning (RAFT) / RAG
- Combines LLMs with an external retriever that supplies relevant documents from a database or corpus during generation or training.
- Useful for domain adaptation and factuality improvements.
- Reinforcement Learning from Human Feedback (RLHF)
- Uses feedback from humans to optimize LLM responses via reinforcement learning, aligning outputs with user preferences or safety standards.
- Pattern-Based Fine-Tuning (PBFT) & Context Distillation
- Varies prompts, uses context patterns, or distills model context for efficient task adaptation.
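To make the adapter-layer idea above concrete, here is a minimal PyTorch sketch of a bottleneck adapter block; the class name, sizes, and initialization are illustrative assumptions rather than any specific library's API:

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual.
    Inserted after a frozen sub-layer; only these parameters are trained."""
    def __init__(self, d_model: int = 512, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # d_model -> small bottleneck
        self.up = nn.Linear(bottleneck, d_model)     # bottleneck -> d_model
        nn.init.zeros_(self.up.weight)               # start as a near-identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(torch.relu(self.down(hidden)))  # residual connection

# Usage: freeze the base model, then train only the adapter parameters
adapter = Adapter()
h = torch.randn(4, 512)        # hidden states for 4 tokens
print(adapter(h).shape)        # torch.Size([4, 512])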