
OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models

Published on March 5, 2025

Authors

Kerim Büyükakyüz
CEO & President
Trylon AI

Abstract

The advent of large language models (LLMs) has revolutionized natural language processing, enabling unprecedented capabilities in understanding and generating human-like text. However, the computational cost and convergence times associated with fine-tuning these models remain significant challenges. Low-Rank Adaptation (LoRA) has emerged as a promising method to mitigate these issues by introducing efficient fine-tuning techniques with a reduced number of trainable parameters. In this paper, we present OLoRA, an enhancement to the LoRA method that leverages orthonormal matrix initialization through QR decomposition. OLoRA significantly accelerates the convergence of LLM training while preserving the efficiency benefits of LoRA, such as the number of trainable parameters and GPU memory footprint.

Technical Overview: What is OLoRA?

OLoRA (Orthonormal Low-Rank Adaptation) is an advancement over the standard LoRA (Low-Rank Adaptation) method for efficiently fine-tuning large language models. It addresses key limitations in convergence speed and optimization stability while maintaining LoRA's parameter efficiency benefits.

The Core Innovation

OLoRA leverages orthonormal matrix initialization through QR decomposition to create a more favorable optimization landscape. Unlike standard LoRA, which initializes its adaptation matrices with random values and zeros, OLoRA directly approximates the final weight matrix using orthonormal bases derived from the pre-trained weights.

Standard LoRA

Adapts pre-trained weight matrix W using:

W_adapted = W + BA

Where A is initialized randomly and B with zeros, so the update BA is zero at the start of training.
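
A minimal PyTorch sketch of this initialization (the dimensions and tensors below are illustrative, not taken from the paper):

import torch
import torch.nn as nn

# Illustrative shapes: W is an m x n pre-trained weight, r is the adapter rank
m, n, r = 512, 512, 8
W = torch.randn(m, n)                    # stands in for a frozen pre-trained weight

A = torch.empty(r, n)
nn.init.kaiming_uniform_(A, a=5 ** 0.5)  # A: random initialization, as in the reference LoRA implementation
B = torch.zeros(m, r)                    # B: zeros, so B @ A = 0 at the start

W_adapted = W + B @ A                    # identical to W before any training step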

OLoRA

Decomposes W using QR factorization:

W = QR
W_adapted = W + Q_r R_r

Where Q_r contains the first r columns of Q (orthonormal) and R_r contains the first r rows of R.

Mathematical Foundation

OLoRA's effectiveness stems from the preservation of spectral properties during adaptation. By using orthonormal bases derived from the original weights, OLoRA ensures that the adaptation stays within a well-conditioned subspace of the parameter space.

For a pre-trained weight matrix W ∈ ℝ^(m×n):

  1. QR Decomposition: W = QR, where Q ∈ ℝ^(m×m) is orthogonal and R ∈ ℝ^(m×n) is upper triangular
  2. Low-Rank Approximation: W_r = Q_r R_r, where Q_r ∈ ℝ^(m×r) contains the first r columns of Q and R_r ∈ ℝ^(r×n) contains the first r rows of R
  3. Adaptation: W_adapted = W + Q_r R_r, where Q_r and R_r are trainable (sketched in code below)
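
The three steps above can be sketched in a few lines of PyTorch (variable names are ours, and a toy square weight is used for illustration):

import torch

m, n, r = 512, 512, 8
W = torch.randn(m, n)                        # stands in for a pre-trained weight

# 1. QR decomposition of the pre-trained weight
Q, R = torch.linalg.qr(W, mode="complete")   # Q: (m, m) orthogonal, R: (m, n) upper triangular

# 2. Rank-r factors from the leading columns of Q and rows of R
Q_r = torch.nn.Parameter(Q[:, :r].clone())   # (m, r), orthonormal columns
R_r = torch.nn.Parameter(R[:r, :].clone())   # (r, n)

# The columns of Q_r are orthonormal: Q_r^T Q_r = I
assert torch.allclose(Q_r.T @ Q_r, torch.eye(r), atol=1e-4)

# 3. Adaptation: Q_r and R_r are the trainable low-rank factors
W_adapted = W + Q_r @ R_r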

Key Benefits

Faster Convergence

OLoRA demonstrates consistently faster training convergence across multiple model sizes and tasks.

Performance Gains

In the majority of test cases (53 out of 60), OLoRA achieves higher final performance than standard LoRA.

Minimal Overhead

The QR decomposition is a one-time operation per layer during initialization, with negligible computational cost compared to training.

Compatibility

Works with existing LoRA implementations with minimal changes—just changing the initialization method.
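
In PEFT, this is a single configuration argument; the adapter hyperparameters below are only illustrative:

from peft import LoraConfig

# Standard LoRA: default (random/zero) initialization
lora_config = LoraConfig(r=8, lora_alpha=16)

# OLoRA: same adapter shapes, only the initialization changes
olora_config = LoraConfig(r=8, lora_alpha=16, init_lora_weights="olora")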

Key Findings

Across five diverse LLMs (from 1.1B to 7B parameters) and six NLP benchmarks, OLoRA consistently outperformed standard LoRA in both convergence speed and final accuracy. The most significant improvements were observed on complex reasoning tasks such as ARC-Challenge and OpenBookQA.

The orthonormal initialization appears to guide the optimization process toward more favorable parameter regions, resulting in models that generalize better to unseen data while requiring no additional parameters compared to standard LoRA.

Implementation

This research has been implemented and is available for practical use. You can integrate it into your own projects using the resources below.

View the implementation in Hugging Face's PEFT library, where you can explore the source code and implementation details.

Implementation Examples

Quick Start

Basic example of how to use OLoRA with Hugging Face's PEFT library

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
dataset = load_dataset("imdb", split="train[:1%]")

# Just specify init_lora_weights="olora" to use OLoRA
lora_config = LoraConfig(
    init_lora_weights="olora"
)

peft_model = get_peft_model(model, lora_config)
training_args = SFTConfig(dataset_text_field="text", max_seq_length=128)
trainer = SFTTrainer(
    model=peft_model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
)

trainer.train()
peft_model.save_pretrained("olora-opt-350m")

Using the Model

Loading and using an OLoRA model after training

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

# Load the saved OLoRA model
olora_model = PeftModel.from_pretrained(model, "olora-opt-350m")

# Now you can use it for inference
inputs = tokenizer("Hello, I am a", return_tensors="pt")
outputs = olora_model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
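
If you prefer to deploy without the PEFT wrapper, the trained adapter can be folded into the base weights using PEFT's standard merge utility (not specific to OLoRA); the output path below is illustrative:

# Merge the adapter into the base model and save a standalone checkpoint
merged_model = olora_model.merge_and_unload()
merged_model.save_pretrained("olora-opt-350m-merged")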

Converting OLoRA to LoRA

For converting a trained OLoRA adapter into a conventional LoRA adapter, e.g., so that multiple adapters can be used simultaneously. Because OLoRA alters the initial state of the weights, the untrained adapter is saved first so that PEFT can perform the conversion against it.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

# Initialize with OLoRA
olora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    init_lora_weights="olora"  # Use OLoRA initialization
)
olora_model = get_peft_model(base_model, olora_config)

# Save the untrained model
init_path = "path/to/save/untrained/model"
olora_model.save_pretrained(init_path)

# Train the model
# ... your training code here ...

# Save and convert to conventional LoRA
olora_model.save_pretrained(
    "final_model_path",
    path_initial_model_for_weight_conversion=init_path
)
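
After conversion, the saved adapter behaves like a regular LoRA adapter and can be loaded onto an unmodified copy of the base model (reusing the paths from the example above):

from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
lora_model = PeftModel.from_pretrained(base, "final_model_path")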

With Quantization

Using OLoRA with 4-bit quantization

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Configure quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=quantization_config,
    device_map="auto"
)

# Prepare for training
model = prepare_model_for_kbit_training(model)

# Configure OLoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    init_lora_weights="olora",
    bias="none",
)

# Get PEFT model
peft_model = get_peft_model(model, lora_config)
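
As a quick sanity check, PEFT can report how few parameters the adapter adds on top of the quantized base model:

# Print the number of trainable (adapter) parameters vs. total parameters
peft_model.print_trainable_parameters()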

How to Cite

@misc{büyükakyüz2024olora,
      title={OLoRA: Orthonormal Low-Rank Adaptation of Large Language Models}, 
      author={Kerim Büyükakyüz},
      year={2024},
      eprint={2406.01775},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}