LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) are techniques designed to fine-tune large language models efficiently, focusing on training fewer parameters while maintaining performance.
What does LoRA do?
- LoRA injects small, trainable rank-decomposition matrices into the weights of the original pre-trained model.
- During fine-tuning, only these low-rank matrices are updated, while the original model weights remain frozen.
- This drastically reduces the memory and compute required for fine-tuning, enabling adaptation on modest hardware.
- It keeps the original model intact, allowing efficient storage and parameter sharing.
What does QLoRA do?
- QLoRA is an extension that combines quantization with LoRA.
- It quantizes the base model weights (e.g., to 4-bit representations) to reduce model size and memory footprint.
- Fine-tuning is still done on the low-rank adaptation matrices, but now on top of a highly compressed model.
- This enables fine-tuning of very large models on consumer-grade GPUs (a minimal configuration sketch follows below).
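A minimal QLoRA configuration sketch, assuming the Hugging Face transformers, peft, and bitsandbytes libraries (the model name is a placeholder and exact argument names can vary by library version):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder model name, used for illustration only
model = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model", quantization_config=bnb_config
)

# Low-rank adapters attached to the attention projections; only these are trained
lora_config = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # reports the small fraction of trainable weights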
Summary:
- LoRA fine-tunes a small fraction of parameters (low-rank adapters) added to a frozen base model.
- QLoRA adds quantization to LoRA, further lowering memory requirements while preserving accuracy.
- Both methods enable efficient, resource-friendly fine-tuning of large language models.
Neither LoRA nor QLoRA changes the base model weights directly; they learn small adapter updates, letting you adapt a model without full retraining, and both are commonly used in modern large-scale fine-tuning workflows.
Example of a Weight Matrix in an LLM
In a large language model (LLM) like a Transformer, the core computation in many layers involves multiplying input vectors by weight matrices.
- Suppose the input is a sequence of tokens, each represented as a vector of size d_emb = 512.
- The model has a linear layer with weight matrix W of shape 512 × 1536.
- This matrix W projects each input token vector into the concatenated query, key, and value vectors needed for multi-head attention:
P = XW
where X is a matrix whose rows are the token embeddings (sequence length × 512), and W maps it to a (sequence length × 1536) output (512 dimensions each for query, key, and value).
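A small NumPy sketch of this projection with toy values (the shapes mirror the 512 → 1536 mapping described above):

import numpy as np

seq_len, d_emb = 4, 512              # toy sequence of 4 tokens, embedding size 512
d_out = 3 * d_emb                    # 1536 = concatenated query, key, value projections

X = np.random.randn(seq_len, d_emb)  # token embeddings, one row per token
W = np.random.randn(d_emb, d_out)    # projection weight matrix (512 x 1536)

P = X @ W                            # (4 x 512) @ (512 x 1536) -> (4 x 1536)
Q, K, V = np.split(P, 3, axis=1)     # split back into query, key, value, each (4 x 512)
print(P.shape, Q.shape, K.shape, V.shape)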
Simple numeric example (smaller scale):
Input token embedding x:
x = [0.2, -0.3, 0.5]
Weight matrix W (3 x 2):
W = [
[ 0.1, 0.4 ],
[ -0.2, 0.3 ],
[ 0.7, -0.5 ]
]
Linear layer output:
y = xW = [
0.2*0.1 + (-0.3)*(-0.2) + 0.5*0.7,
0.2*0.4 + (-0.3)*0.3 + 0.5*(-0.5)
] = [0.43, -0.26]
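A quick NumPy check of this small example (same numbers as above):

import numpy as np

x = np.array([0.2, -0.3, 0.5])        # input token embedding
W = np.array([[ 0.1,  0.4],
              [-0.2,  0.3],
              [ 0.7, -0.5]])          # weight matrix (3 x 2)

y = x @ W                             # linear layer output
print(y)                              # -> [ 0.43 -0.26]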
This matrix multiplication is the fundamental calculation in layers, mapping input embeddings to query/key/value vectors and ultimately to predicted tokens.
The large weight matrices in LLMs are learned during pre-training and fine-tuning, transforming input sequences into meaningful contextual representations.
How would LoRA work on the above 3×2 weight matrix?
Let’s apply LoRA fine-tuning to the example 3×2 weight matrix W:
Original matrix W:
W = [
[0.1, 0.4],
[-0.2, 0.3],
[0.7, -0.5]
]
LoRA Approach
- Choose a low rank r, e.g., r = 1.
- Introduce two small matrices:
A = [
[a₁],
[a₂],
[a₃]
] (3×1),
B = [
[b₁, b₂]
] (1×2)
where A and B are trainable; in practice one is initialized randomly and the other to zeros, so the initial update ΔW is zero.
Compute LoRA update ΔW = A × B:
ΔW = [
[a₁b₁, a₁b₂],
[a₂b₁, a₂b₂],
[a₃b₁, a₃b₂]
]
Adapted weight matrix W’:
W' = W + α × ΔW = [
[0.1 + α a₁ b₁, 0.4 + α a₁ b₂],
[-0.2 + α a₂ b₁, 0.3 + α a₂ b₂],
[0.7 + α a₃ b₁, -0.5 + α a₃ b₂]
]
During training:
- W is frozen (does not change).
- Only A and B are updated via backpropagation.
- The combined W' represents the fine-tuned weights.
Intuition
A and B contain far fewer parameters than the original W. In this toy case the saving is small (3 + 2 = 5 values instead of 6), but for a real 512 × 1536 layer with rank r = 8 the adapters hold 8 × (512 + 1536) = 16,384 parameters instead of 786,432, making fine-tuning efficient while still adapting the model's behavior effectively.
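A minimal NumPy sketch of this rank-1 LoRA update on the same 3×2 matrix (the initialization and scaling choices here are illustrative assumptions):

import numpy as np

# Frozen original weight matrix W (3 x 2), as in the example above
W = np.array([[ 0.1,  0.4],
              [-0.2,  0.3],
              [ 0.7, -0.5]])

# Trainable rank-1 adapters: A is 3x1, B is 1x2
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(3, 1))   # initialized randomly
B = np.zeros((1, 2))                     # initialized to zeros, so delta_W starts at 0
alpha = 1.0                              # LoRA scaling factor

delta_W = A @ B                          # rank-1 update (3 x 2)
W_adapted = W + alpha * delta_W          # effective fine-tuned weights W'

x = np.array([0.2, -0.3, 0.5])
print(x @ W_adapted)                     # equals x @ W until A and B are trained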
What are the other methods besides LoRA and QLoRA?
Other efficient fine-tuning methods for large language models (LLMs) include:
- Adapter Layers
- Adapters insert small trainable layers into selected points of the original (frozen) network.
- Only the adapter parameters are updated during fine-tuning, keeping most of the model unchanged.
- This allows multi-task flexibility and efficient parameter storage (a minimal adapter sketch follows after this list).
- Prefix Tuning
- Learns a fixed-length “prefix” of continuous vectors that is prepended to the hidden activations (keys and values) at each layer, optimizing just these prefix parameters.
- All original model weights remain frozen.
- Useful for tasks where a small amount of trainable conditioning is sufficient.
- Prompt Tuning (Soft Prompts)
- Optimizes a set of learnable input prompts/embeddings while keeping the model frozen.
- Prompts are tuned to steer model behavior for new tasks, with very low parameter overhead.
- Full Fine-Tuning
- Updates every parameter in the model on new data.
- Offers maximal flexibility and accuracy, but is extremely resource- and data-intensive for LLMs.
- Feature Extraction / Head Tuning
- Freezes all model layers except the final head/classifier, which is fine-tuned.
- Often combined with using the LLM as a fixed “feature extractor” for various downstream classifiers.
- Knowledge Distillation
- Trains a smaller “student” model to mimic a larger pretrained “teacher,” using the teacher’s output as supervision.
- Retrieval-Augmented Fine-Tuning (RAFT) / RAG
- Combines LLMs with an external retriever that supplies relevant documents from a database or corpus during generation or training.
- Useful for domain adaptation and factuality improvements.
- Reinforcement Learning from Human Feedback (RLHF)
- Uses feedback from humans to optimize LLM responses via reinforcement learning, aligning outputs with user preferences or safety standards.
- Pattern-Based Fine-Tuning (PBFT) & Context Distillation
- Varies prompts, uses context patterns, or distills model context for efficient task adaptation.
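To make the adapter-layer idea above concrete, here is a minimal PyTorch sketch of a bottleneck adapter block; the class name, sizes, and initialization are illustrative assumptions rather than any specific library's API:

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, plus a residual.
    Inserted after a frozen sub-layer; only these parameters are trained."""
    def __init__(self, d_model: int = 512, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # d_model -> small bottleneck
        self.up = nn.Linear(bottleneck, d_model)     # bottleneck -> d_model
        nn.init.zeros_(self.up.weight)               # start as a near-identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(torch.relu(self.down(hidden)))  # residual connection

# Usage: freeze the base model, then train only the adapter parameters
adapter = Adapter()
h = torch.randn(4, 512)        # hidden states for 4 tokens
print(adapter(h).shape)        # torch.Size([4, 512])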