AI fine-tuning – How LoRA works

LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) are techniques designed to fine-tune large language models efficiently, focusing on training fewer parameters while maintaining performance.


What does LoRA do?

  • LoRA injects small, trainable rank-decomposition matrices into the weights of the original pre-trained model.

  • During fine-tuning, only these low-rank matrices are updated, while the original model weights remain frozen.

  • This drastically reduces memory and compute required for fine-tuning, enabling adaptation on modest hardware.

  • It retains the original model intact, allowing efficient storage and parameter sharing.
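
As a rough illustration, here is a minimal sketch of a LoRA-style linear layer in PyTorch. The class name, initialization, and hyperparameters are illustrative assumptions, not any particular library's API.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update."""
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Pre-trained weight: frozen during fine-tuning.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)
        # Low-rank adapters: A starts random, B starts at zero, so the update is 0.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        base = x @ self.weight.T                      # frozen path
        update = (x @ self.lora_A.T) @ self.lora_B.T  # trainable low-rank path
        return base + self.scaling * update

Only lora_A and lora_B receive gradients; for a 512 × 1536 projection with r = 8 that is 512·8 + 8·1536 ≈ 16k trainable parameters instead of roughly 786k.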


What does QLoRA do?

  • QLoRA is an extension that combines quantization with LoRA.

  • It quantizes the model weights (e.g., to 4-bit representations) to reduce model size and memory footprint.

  • Fine-tuning is still done on the low-rank adaptation matrices but now on a highly compressed model.

  • Enables fine-tuning of very large models on consumer-grade GPUs.
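
In practice, a QLoRA setup typically uses the Hugging Face transformers, peft, and bitsandbytes libraries. The sketch below is illustrative only; the model name and hyperparameter values are assumptions, not recommendations.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base model with 4-bit (NF4) quantized weights.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # illustrative model name
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach trainable LoRA adapters on top of the frozen, quantized weights.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()     # typically well under 1% of all parameters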


Summary:

  • LoRA fine-tunes a small fraction of parameters (low-rank adapters) added to a frozen base model.

  • QLoRA adds quantization to LoRA, further lowering memory requirements while preserving accuracy.

  • Both methods enable efficient, resource-friendly fine-tuning of large language models.

Neither LoRA nor QLoRA changes the base model weights directly; both learn small adapter weights that let you adapt a model without full retraining, which is why they are so widely used in modern large-scale fine-tuning workflows.


 

Example of a Matrix of a LLM

In a large language model (LLM) like a Transformer, the core computation in many layers involves multiplying input vectors by weight matrices.

  • Suppose the input is a sequence of tokens, each represented as a vector of size d_emb = 512.
  • The model has a linear layer with weight matrix W of shape 512 × 1536.
  • This matrix W projects each input token vector into concatenated query, key, and value vectors needed for multi-head attention:

P = XW

  • X is a matrix with rows as token embeddings (sequence length × 512),
  • W maps X to a sequence length × 1536 output (512 columns each for query, key, and value).
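
To make the shapes concrete, here is a small PyTorch sketch; the sequence length of 10 is an arbitrary choice for illustration.

import torch

seq_len, d_emb = 10, 512
X = torch.randn(seq_len, d_emb)      # token embeddings, one row per token
W = torch.randn(d_emb, 3 * d_emb)    # fused Q/K/V projection, 512 x 1536

P = X @ W                            # shape (10, 1536)
Q, K, V = P.split(d_emb, dim=-1)     # three (10, 512) matrices
print(P.shape, Q.shape, K.shape, V.shape)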

Simple numeric example (smaller scale):

Input token embedding x:

x = [0.2, -0.3, 0.5]

Weight matrix W (3 x 2):

W = [
  [ 0.1,  0.4 ],
  [ -0.2, 0.3 ],
  [ 0.7, -0.5 ]
]

Linear layer output:

y = xW = [
  0.2*0.1 + (-0.3)*(-0.2) + 0.5*0.7,
  0.2*0.4 + (-0.3)*0.3 + 0.5*(-0.5)
] = [0.43, -0.26]
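
The same multiplication can be checked in a few lines of NumPy:

import numpy as np

x = np.array([0.2, -0.3, 0.5])
W = np.array([
    [ 0.1,  0.4],
    [-0.2,  0.3],
    [ 0.7, -0.5],
])

y = x @ W
print(y)   # approximately [0.43, -0.26]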

This matrix multiplication is the fundamental calculation in Transformer layers, mapping input embeddings to query/key/value vectors and ultimately to predicted tokens.

The large weight matrices in LLMs are learned during pre-training and fine-tuning, transforming input sequences into meaningful contextual representations.

How would LoRA work on the above 3×2 weight matrix?

Let’s apply LoRA fine-tuning to the example 3×2 weight matrix W:

Original matrix W:

W = [
  [0.1, 0.4],
  [-0.2, 0.3],
  [0.7, -0.5]
]

LoRA Approach

  • Choose a low-rank r, e.g., r=1.
  • Introduce two small matrices:
A = [
  [a₁],
  [a₂],
  [a₃]
] (3×1),
B = [
  [b₁, b₂]
] (1×2)

where A and B are trainable. In the standard LoRA setup, A is initialized randomly (e.g., from a small Gaussian) and B is initialized to zero, so ΔW starts at zero and the adapted model initially behaves exactly like the pre-trained one.

Compute LoRA update ΔW = A × B:

ΔW = [
  [a₁b₁, a₁b₂],
  [a₂b₁, a₂b₂],
  [a₃b₁, a₃b₂]
]

Adapted weight matrix W’, where α is a fixed scaling factor controlling how strongly the low-rank update is applied:

W' = W + α × ΔW = [
  [0.1 + α a₁ b₁, 0.4 + α a₁ b₂],
  [-0.2 + α a₂ b₁, 0.3 + α a₂ b₂],
  [0.7 + α a₃ b₁, -0.5 + α a₃ b₂]
]

During training:

  • W is frozen (does not change).
  • Only A and B are updated via backpropagation.
  • The combined W' represents the fine-tuned weights.
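
A minimal sketch of this setup in PyTorch, using the 3×2 matrix from above (the target values and learning rate are placeholders purely for illustration):

import torch

# Frozen pre-trained weight (3 x 2); requires_grad stays False.
W = torch.tensor([[ 0.1,  0.4],
                  [-0.2,  0.3],
                  [ 0.7, -0.5]])

# Trainable low-rank factors for r = 1: A is (3 x 1), B is (1 x 2).
A = torch.nn.Parameter(0.01 * torch.randn(3, 1))
B = torch.nn.Parameter(torch.zeros(1, 2))
alpha = 1.0
optimizer = torch.optim.SGD([A, B], lr=0.1)

x = torch.tensor([[0.2, -0.3, 0.5]])   # one input token embedding
target = torch.tensor([[0.5, -0.1]])   # placeholder training target

for _ in range(100):
    W_prime = W + alpha * (A @ B)      # adapted weights; W itself never changes
    y = x @ W_prime
    loss = ((y - target) ** 2).mean()  # placeholder MSE loss
    optimizer.zero_grad()
    loss.backward()                    # gradients flow only into A and B
    optimizer.step()

print(A @ B)                           # learned low-rank update ΔW
print(W + alpha * (A @ B))             # effective fine-tuned weights W'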

Intuition

A and B are much smaller (few parameters) than original W, making fine-tuning efficient while still adapting the model weights effectively.

What are the other methods besides LoRA and QLoRA?

Other efficient fine-tuning methods for large language models (LLMs) include:

  1. Adapter Layers
    • Adapters insert small trainable layers into selected points of the original (frozen) network.
    • Only the adapter parameters are updated during fine-tuning, keeping most of the model unchanged.
    • This allows multi-task flexibility and efficient parameter storage.
  2. Prefix Tuning
    • Learns a fixed-length “prefix” of continuous vectors prepended to the model’s input, optimizing just these prefix embeddings.
    • All original model weights remain frozen.
    • Useful for tasks where adding a small amount of trainable conditioning data is sufficient.
  3. Prompt Tuning (Soft Prompts)
    • Optimizes a set of learnable input prompts/embeddings while keeping the model frozen.
    • Prompts are tuned to steer model behavior for new tasks, with very low parameter overhead.
  4. Full Fine-Tuning
    • Updates every parameter in the model on new data.
    • Offers maximal flexibility and accuracy, but is extremely resource- and data-intensive for LLMs.
  5. Feature Extraction / Head Tuning
    • Freezes all model layers except the final head/classifier, which is fine-tuned.
    • Often combined with using the LLM as a fixed “feature extractor” for various downstream classifiers.
  6. Knowledge Distillation
    • Trains a smaller “student” model to mimic a larger pretrained “teacher,” using the teacher’s output as supervision.
  7. Retrieval-Augmented Fine-Tuning (RAFT) / RAG
    • Combines LLMs with an external retriever that provides relevant retrievals from a database or corpus during generation/training.
    • Useful for domain adaptation and factuality improvements.
  8. Reinforcement Learning from Human Feedback (RLHF)
    • Uses feedback from humans to optimize LLM responses via reinforcement learning, aligning outputs with user preferences or safety standards.
  9. Pattern-Based Fine-Tuning (PBFT) & Context Distillation
    • Varies prompts, uses context patterns, or distills model context for efficient task adaptation.
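
As an example of the first item in this list, an adapter layer can be sketched as a small bottleneck module inserted after a frozen sub-layer. The dimensions and residual placement below are illustrative; real adapter variants differ in detail.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, plus residual."""
    def __init__(self, d_model=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, x):
        # Only these small layers are trained; the surrounding model stays frozen.
        return x + self.up(self.act(self.down(x)))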