{"id":1124,"date":"2025-10-24T18:13:33","date_gmt":"2025-10-24T17:13:33","guid":{"rendered":"https:\/\/www.kolkataonweb.com\/code-bank\/?p=1124"},"modified":"2025-10-23T12:05:46","modified_gmt":"2025-10-23T11:05:46","slug":"ai-fine-tuning-how-lora-works","status":"publish","type":"post","link":"https:\/\/www.kolkataonweb.com\/code-bank\/ai\/ai-fine-tuning-how-lora-works\/","title":{"rendered":"AI fine tuning &#8211; How Lora works"},"content":{"rendered":"<div class=\"prose dark:prose-invert inline leading-relaxed break-words min-w-0 [word-break:break-word] prose-strong:font-medium\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\">LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) are techniques designed to fine-tune large language models efficiently, focusing on training fewer parameters while maintaining performance.<\/p>\n<hr class=\"bg-subtle h-px border-0\" \/>\n<h3 class=\"mb-2 mt-4 font-display font-semimedium text-base first:mt-0\">What does LoRA do?<\/h3>\n<ul class=\"marker:text-quiet list-disc\">\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\">LoRA injects small, trainable rank-decomposition matrices into the weights of the original pre-trained model.<\/p>\n<\/li>\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\">During fine-tuning, <strong>only these low-rank matrices are updated<\/strong>, while the original model weights remain frozen.<\/p>\n<\/li>\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block 
[&amp;_strong:has(+br)]:pb-2\">This drastically reduces memory and compute required for fine-tuning, enabling adaptation on modest hardware.<\/p>\n<\/li>\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\">It retains the original model intact, allowing efficient storage and parameter sharing.<\/p>\n<\/li>\n<\/ul>\n<hr class=\"bg-subtle h-px border-0\" \/>\n<h3 class=\"mb-2 mt-4 font-display font-semimedium text-base first:mt-0\">What does QLoRA do?<\/h3>\n<ul class=\"marker:text-quiet list-disc\">\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\">QLoRA is an extension that <strong>combines quantization with LoRA<\/strong>.<\/p>\n<\/li>\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\">It quantizes the model weights (e.g., to 4-bit representations) to reduce model size and memory footprint.<\/p>\n<\/li>\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\">Fine-tuning is still done on the low-rank adaptation matrices but now on a highly compressed model.<\/p>\n<\/li>\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\">Enables fine-tuning of very large models on consumer-grade GPUs.<\/p>\n<\/li>\n<\/ul>\n<hr class=\"bg-subtle h-px border-0\" 
\/>\n<h3 class=\"mb-2 mt-4 font-display font-semimedium text-base first:mt-0\">Summary:<\/h3>\n<ul class=\"marker:text-quiet list-disc\">\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\"><strong>LoRA<\/strong> fine-tunes a small fraction of parameters (low-rank adapters) added to a frozen base model.<\/p>\n<\/li>\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\"><strong>QLoRA<\/strong> adds quantization to LoRA, further lowering memory requirements while preserving accuracy.<\/p>\n<\/li>\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\">Both methods enable efficient, resource-friendly fine-tuning of large language models.<\/p>\n<\/li>\n<\/ul>\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\">Neither LoRA nor QLoRA changes the base model weights directly; they learn tweaks (adapters) allowing you to adapt models without full retraining, commonly used in modern large-scale fine-tuning workflows.LoRA (Low-Rank Adaptation) is a fine-tuning technique that updates a small set of added low-rank matrices while freezing the original model weights, greatly reducing training costs.<\/p>\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\">QLoRA is an extension of LoRA that applies <strong>quantization<\/strong> (e.g., 4-bit) to the base model weights and fine-tunes the low-rank adapters on this compressed model, enabling fine-tuning of very large models on limited 
hardware.<\/p>\n<ul class=\"marker:text-quiet list-disc\">\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\"><strong>LoRA<\/strong> fine-tunes by adding trainable low-rank matrices, keeping original weights fixed.<\/p>\n<\/li>\n<li class=\"py-0 my-0 prose-p:pt-0 prose-p:mb-2 prose-p:my-0 [&amp;&gt;p]:pt-0 [&amp;&gt;p]:mb-2 [&amp;&gt;p]:my-0\">\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\"><strong>QLoRA<\/strong> combines quantization + LoRA for memory-efficient fine-tuning.<\/p>\n<\/li>\n<\/ul>\n<p class=\"my-2 [&amp;+p]:mt-4 [&amp;_strong:has(+br)]:inline-block [&amp;_strong:has(+br)]:pb-2\">These methods enable efficient fine-tuning without modifying the entire model, making them popular for adapting large language models with smaller compute resources.LoRA (Low-Rank Adaptation) is a fine-tuning technique that trains only a small set of low-rank matrices added to the original model weights, freezing the main model, reducing memory and compute needs.<\/p>\n<p>&nbsp;<\/p>\n<\/div>\n<h3>Example of a Matrix of a LLM<\/h3>\n<p>In a large language model (LLM) like a Transformer, the core computation in many layers involves multiplying input vectors by weight matrices.<\/p>\n<ul>\n<li>Suppose the input is a sequence of tokens, each represented as a vector of size <code>d<sub>emb<\/sub> = 512<\/code>.<\/li>\n<li>The model has a linear layer with weight matrix <code>W<\/code> of shape <code>512 \u00d7 1536<\/code>.<\/li>\n<li>This matrix <code>W<\/code> projects each input token vector into concatenated query, key, and value vectors needed for multi-head attention:<\/li>\n<\/ul>\n<p><code>P = XW<\/code><\/p>\n<ul>\n<li><code>X<\/code> is a matrix with rows as token embeddings (<code>sequence length \u00d7 512<\/code>),<\/li>\n<li><code>W<\/code> maps to 
a <code>sequence length \u00d7 1536<\/code> output (512 per query\/key\/value \u00d7 3).<\/li>\n<\/ul>\n<h3>Simple numeric example (smaller scale):<\/h3>\n<p>Input token embedding <code>x<\/code>:<\/p>\n<pre class='wp-code-highlight prettyprint'><code>x = [0.2, -0.3, 0.5]\r\n<\/code><\/pre>\n<p>Weight matrix <code>W<\/code> (3 \u00d7 2):<\/p>\n<pre class='wp-code-highlight prettyprint'><code>W = [\r\n  [ 0.1,  0.4 ],\r\n  [ -0.2, 0.3 ],\r\n  [ 0.7, -0.5 ]\r\n]\r\n<\/code><\/pre>\n<p>Linear layer output:<\/p>\n<pre class='wp-code-highlight prettyprint'><code>y = xW = [\r\n  0.2*0.1 + (-0.3)*(-0.2) + 0.5*0.7,\r\n  0.2*0.4 + (-0.3)*0.3 + 0.5*(-0.5)\r\n] = [0.43, -0.26]\r\n<\/code><\/pre>\n<p>This matrix multiplication is the fundamental calculation in each layer, mapping input embeddings to query\/key\/value vectors and ultimately to predicted tokens.<\/p>\n<p>The large weight matrices in LLMs are learned during pre-training and fine-tuning, transforming input sequences into meaningful contextual representations.<\/p>\n<h3>How does LoRA work on the above 3&#215;2 weight matrix?<\/h3>\n<p>Let&#8217;s apply LoRA fine-tuning to the example 3&#215;2 weight matrix <code>W<\/code>:<\/p>\n<p><strong>Original matrix W:<\/strong><\/p>\n<pre class='wp-code-highlight prettyprint'><code>W = [\r\n  [0.1, 0.4],\r\n  [-0.2, 0.3],\r\n  [0.7, -0.5]\r\n]\r\n<\/code><\/pre>\n<p><strong>LoRA Approach<\/strong><\/p>\n<ul>\n<li>Choose a low rank <code>r<\/code>, e.g., <code>r=1<\/code>.<\/li>\n<li>Introduce two small matrices:<\/li>\n<\/ul>\n<pre class='wp-code-highlight prettyprint'><code>A = [\r\n  [a\u2081],\r\n  [a\u2082],\r\n  [a\u2083]\r\n] (3\u00d71),\r\nB = [\r\n  [b\u2081, b\u2082]\r\n] (1\u00d72)\r\n<\/code><\/pre>\n<p>where <code>A<\/code> and <code>B<\/code> are trainable; in practice <code>A<\/code> is usually initialized randomly and <code>B<\/code> to zero, so the update starts at zero and training begins from the unmodified base model.<\/p>\n<p><strong>Compute LoRA update \u0394W = A \u00d7 B:<\/strong><\/p>\n<pre class='wp-code-highlight prettyprint'><code>\u0394W = [\r\n  [a\u2081b\u2081, a\u2081b\u2082],\r\n  [a\u2082b\u2081, 
a\u2082b\u2082],\r\n  [a\u2083b\u2081, a\u2083b\u2082]\r\n]\r\n<\/code><\/pre>\n<p><strong>Adapted weight matrix W&#8217;:<\/strong><\/p>\n<pre class='wp-code-highlight prettyprint'><code>W' = W + \u03b1 \u00d7 \u0394W = [\r\n  [0.1 + \u03b1 a\u2081 b\u2081, 0.4 + \u03b1 a\u2081 b\u2082],\r\n  [-0.2 + \u03b1 a\u2082 b\u2081, 0.3 + \u03b1 a\u2082 b\u2082],\r\n  [0.7 + \u03b1 a\u2083 b\u2081, -0.5 + \u03b1 a\u2083 b\u2082]\r\n]\r\n<\/code><\/pre>\n<p>During training:<\/p>\n<ul>\n<li><code>W<\/code> is <strong>frozen<\/strong> (does not change).<\/li>\n<li>Only <code>A<\/code> and <code>B<\/code> are <strong>updated<\/strong> via backpropagation.<\/li>\n<li>The combined <code>W'<\/code> represents the fine-tuned weights.<\/li>\n<\/ul>\n<p><strong>Intuition<\/strong><\/p>\n<p><code>A<\/code> and <code>B<\/code> are much smaller (far fewer parameters) than the original <code>W<\/code>, making fine-tuning efficient while still adapting the model weights effectively.<\/p>\n<h3>What are the other methods besides LoRA and QLoRA?<\/h3>\n<p>Other efficient fine-tuning methods for large language models (LLMs) include:<\/p>\n<ol>\n<li><strong>Adapter Layers<\/strong>\n<ul>\n<li>Adapters insert small trainable layers into selected points of the original (frozen) network.<\/li>\n<li>Only the adapter parameters are updated during fine-tuning, keeping most of the model unchanged.<\/li>\n<li>This allows multi-task flexibility and efficient parameter storage.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Prefix Tuning<\/strong>\n<ul>\n<li>Learns a fixed-length \u201cprefix\u201d of continuous vectors prepended to the model\u2019s input, optimizing just these prefix embeddings.<\/li>\n<li>All original model weights remain frozen.<\/li>\n<li>Useful for tasks where adding a small amount of trainable conditioning data is sufficient.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Prompt Tuning (Soft Prompts)<\/strong>\n<ul>\n<li>Optimizes a set of learnable input prompts\/embeddings while keeping the model 
frozen.<\/li>\n<li>Prompts are tuned to steer model behavior for new tasks, with very low parameter overhead.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Full Fine-Tuning<\/strong>\n<ul>\n<li>Updates every parameter in the model on new data.<\/li>\n<li>Offers maximal flexibility and accuracy, but is extremely resource- and data-intensive for LLMs.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Feature Extraction \/ Head Tuning<\/strong>\n<ul>\n<li>Freezes all model layers except the final head\/classifier, which is fine-tuned.<\/li>\n<li>Often combined with using the LLM as a fixed \u201cfeature extractor\u201d for various downstream classifiers.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Knowledge Distillation<\/strong>\n<ul>\n<li>Trains a smaller \u201cstudent\u201d model to mimic a larger pretrained \u201cteacher,\u201d using the teacher\u2019s output as supervision.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Retrieval-Augmented Fine-Tuning (RAFT) \/ RAG<\/strong>\n<ul>\n<li>Combines LLMs with an external retriever that supplies relevant passages from a database or corpus during generation\/training.<\/li>\n<li>Useful for domain adaptation and factuality improvements.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Reinforcement Learning from Human Feedback (RLHF)<\/strong>\n<ul>\n<li>Uses feedback from humans to optimize LLM responses via reinforcement learning, aligning outputs with user preferences or safety standards.<\/li>\n<\/ul>\n<\/li>\n<li><strong>Pattern-Based Fine-Tuning (PBFT) &amp; Context Distillation<\/strong>\n<ul>\n<li>Varies prompts and context patterns, or distills in-context behavior into the model, for efficient task adaptation.<\/li>\n<\/ul>\n<\/li>\n<\/ol>\n","protected":false},"excerpt":{"rendered":"<p>LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) are techniques designed to fine-tune large language models efficiently, focusing on training fewer parameters while maintaining performance. What does LoRA do? 
LoRA injects small, trainable rank-decomposition matrices into the weights of the original pre-trained model. During fine-tuning, only these low-rank matrices are updated, while the original model weights&hellip; <a class=\"more-link\" href=\"https:\/\/www.kolkataonweb.com\/code-bank\/ai\/ai-fine-tuning-how-lora-works\/\">Continue reading <span class=\"screen-reader-text\">AI fine tuning &#8211; How Lora works<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[424],"tags":[],"class_list":["post-1124","post","type-post","status-publish","format-standard","hentry","category-ai","entry"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.kolkataonweb.com\/code-bank\/wp-json\/wp\/v2\/posts\/1124","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kolkataonweb.com\/code-bank\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kolkataonweb.com\/code-bank\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kolkataonweb.com\/code-bank\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kolkataonweb.com\/code-bank\/wp-json\/wp\/v2\/comments?post=1124"}],"version-history":[{"count":3,"href":"https:\/\/www.kolkataonweb.com\/code-bank\/wp-json\/wp\/v2\/posts\/1124\/revisions"}],"predecessor-version":[{"id":1129,"href":"https:\/\/www.kolkataonweb.com\/code-bank\/wp-json\/wp\/v2\/posts\/1124\/revisions\/1129"}],"wp:attachment":[{"href":"https:\/\/www.kolkataonweb.com\/code-bank\/wp-json\/wp\/v2\/media?parent=1124"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kolkataonweb.com\/code-bank\/wp-json\/wp\/v2\/categories?post=1124"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kolkataonweb.com\/code-bank\/wp-json\/wp\/v2\/tags?post=1124"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{r
el}","templated":true}]}}
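The numeric example in the post (y = xW on a 3×2 matrix, then the rank-1 LoRA update W' = W + α·AB) can be verified in a few lines of plain Python. The concrete values for A, B, and α below are illustrative assumptions standing in for learned adapter values; they do not come from the post.

```python
# Worked LoRA sketch on the post's 3x2 example matrix. Pure Python, no dependencies.

x = [0.2, -0.3, 0.5]            # input token embedding
W = [[0.1, 0.4],
     [-0.2, 0.3],
     [0.7, -0.5]]               # frozen base weight matrix (3x2)

def matvec(x, W):
    """Row vector times matrix: returns x @ W as a plain list."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

y = matvec(x, W)                # base-model output: approximately [0.43, -0.26]

# LoRA adapters with rank r = 1. A is 3x1, B is 1x2; their product is a rank-1
# 3x2 update. The numbers here are assumed "already trained" values.
A = [[0.05], [-0.10], [0.20]]
B = [[0.3, -0.4]]
alpha, r = 2.0, 1               # LoRA scaling: effective update is (alpha/r) * A @ B

# Delta W = A @ B
dW = [[A[i][0] * B[0][j] for j in range(2)] for i in range(3)]

# Adapted weights W' = W + (alpha/r) * dW; W itself is never modified.
W_adapted = [[W[i][j] + (alpha / r) * dW[i][j] for j in range(2)] for i in range(3)]

y_adapted = matvec(x, W_adapted)
```

Note that storing `A` and `B` takes 3 + 2 = 5 numbers versus 6 for the full `W`; at realistic sizes (e.g. 512 × 1536 with r = 8) the adapter is orders of magnitude smaller than the matrix it adapts.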
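A minimal sketch of the QLoRA idea, under simplifying assumptions: real QLoRA quantizes to the NF4 data type with double quantization, whereas here a naive absmax 4-bit scheme stands in to show the mechanics of quantize → dequantize → add full-precision LoRA adapters.

```python
# QLoRA-style sketch (illustrative absmax 4-bit quantization, not real NF4).

W = [[0.1, 0.4],
     [-0.2, 0.3],
     [0.7, -0.5]]               # frozen base weights

def quantize_absmax_4bit(W):
    """Map floats to signed integers in [-7, 7] plus one shared scale factor."""
    absmax = max(abs(v) for row in W for v in row)
    scale = absmax / 7.0
    q = [[round(v / scale) for v in row] for row in W]
    return q, scale

def dequantize(q, scale):
    """Reconstruct approximate float weights for the forward pass."""
    return [[v * scale for v in row] for row in q]

W_q, scale = quantize_absmax_4bit(W)   # stored compactly, 4 bits per weight
W_deq = dequantize(W_q, scale)         # close to W, up to quantization error

# LoRA adapters stay in full precision and are the only trained parameters.
# Initialized so A @ B = 0: the adapted model starts identical to the base.
A = [[0.0], [0.0], [0.0]]
B = [[0.0, 0.0]]
W_eff = [[W_deq[i][j] + A[i][0] * B[0][j] for j in range(2)] for i in range(3)]

x = [0.2, -0.3, 0.5]
y = [sum(x[i] * W_eff[i][j] for i in range(3)) for j in range(2)]
```

Only `A` and `B` receive gradients during fine-tuning; the quantized `W_q` never changes, which is what makes very large models trainable on a single consumer GPU.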