
How HyLoRA Squeezed TinyLlama by 51% Without Killing Performance.

Fixing SVD’s catastrophic failures via dynamic LoRA adapters: a 2X compression breakthrough.


HyLoRA is a novel technique for compressing large language models using a hybrid of SVD (for static compression) and LoRA (for dynamic fine-tuning). This project demonstrates a complete pipeline for compressing and fine-tuning a LLaMA-architecture model (TinyLlama, 1.1B parameters) in PyTorch with minimal performance loss.


🎯 Goal

Achieve significant compression while retaining downstream performance (perplexity) using a hybrid method.


🔬 The Engineering Journey

Phase 1: The “AutoRank” Paradox 🔍

The Initial Goal: Build an intelligent AutoRank optimizer that automatically determines the best SVD rank for compression.

The Investigation: Discovered that naive SVD compression often produces a larger model, because the two factor matrices of an SVD layer can together hold more parameters than the original weight. Without careful rank constraints, the "compressed" model ends up bigger than the original.
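To make the paradox concrete, here is a back-of-the-envelope check (an illustrative sketch, not the project's AutoRank code; the 2048-wide layer is simply a TinyLlama-sized example): a rank-r factorization of an in_f × out_f weight only saves parameters when r stays below the break-even rank in_f·out_f / (in_f + out_f).

```python
# Parameter count of a dense nn.Linear weight vs. its rank-r SVD factorization.
# Break-even rank: r_max = in_f * out_f / (in_f + out_f); above it, the
# "compressed" layer actually has MORE parameters than the original.
def svd_param_ratio(in_f: int, out_f: int, r: int) -> float:
    dense = in_f * out_f                  # original weight matrix
    factored = r * (in_f + out_f)         # U: (out_f x r) plus SVh: (r x in_f)
    return factored / dense

# Example: a 2048x2048 projection (TinyLlama's hidden size is 2048)
print(svd_param_ratio(2048, 2048, 1536))  # 1.50 -> 50% *larger* than the dense layer
print(svd_param_ratio(2048, 2048, 1024))  # 1.00 -> break-even, no savings
print(svd_param_ratio(2048, 2048, 512))   # 0.50 -> genuine 2x compression
```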

Phase 2: Perplexity Collapse 💥

Challenge: Even a mathematically correct compression of just 0.59% caused the model’s perplexity to explode from 18.77 → 244!

Insight: This provided empirical proof of cascading error collapse, where tiny, independent errors at each layer compound exponentially in deep architectures.

Discovery: Static, “greedy” compression is fundamentally insufficient for deep models. A mechanism for performance recovery is essential.
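A toy compounding calculation illustrates the scale of the effect (an illustration of the intuition, not a measurement from this project), assuming each of TinyLlama's 22 transformer blocks contributes a small, independent relative error ε:

```python
# Toy illustration of cascading error: a modest 5% relative error per layer,
# compounded over 22 transformer blocks, is no longer a modest error.
eps, num_layers = 0.05, 22
compounded = (1 + eps) ** num_layers - 1
print(f"{compounded:.0%}")  # ~193% cumulative relative error
```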


⚙️ Architectural Deep Dive: The SVD_LoRA_Linear Layer

This custom layer combines a frozen SVD base with trainable LoRA adapters. The forward pass is:

```python
y = ((x @ SVh.T) @ U.T) + b              # Frozen SVD Base
  + alpha * ((x @ A.T) @ B.T)            # Trainable LoRA Adapter
```

Where:

  • x: Input
  • SVh, U: Frozen SVD factors from the initial compression
  • b: Frozen bias term
  • A, B: Trainable LoRA adapters (learned during fine-tuning)
  • alpha: Scaling factor for the LoRA update

This allows SVD to compress and LoRA to adapt.
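A minimal PyTorch sketch of such a layer is shown below. It follows the forward pass above, but the project's actual svd_lora_layer.py may differ in details such as the LoRA initialization, dropout, and how the singular values are folded into the factors.

```python
import torch
import torch.nn as nn

class SVD_LoRA_Linear(nn.Module):
    """Frozen rank-r SVD base plus a trainable LoRA adapter (illustrative sketch)."""

    def __init__(self, linear: nn.Linear, svd_rank: int, lora_rank: int = 32, alpha: float = 1.0):
        super().__init__()
        out_f, in_f = linear.weight.shape
        # Frozen SVD base: W ≈ U @ diag(S) @ Vh, keeping the top `svd_rank` components
        U, S, Vh = torch.linalg.svd(linear.weight.data.float(), full_matrices=False)
        self.U = nn.Parameter(U[:, :svd_rank].contiguous(), requires_grad=False)                          # (out_f, r)
        self.SVh = nn.Parameter((S[:svd_rank, None] * Vh[:svd_rank]).contiguous(), requires_grad=False)   # (r, in_f)
        self.bias = (nn.Parameter(linear.bias.data.clone(), requires_grad=False)
                     if linear.bias is not None else None)
        # Trainable LoRA adapter: delta_W = B @ A, initialized so that delta_W = 0
        self.A = nn.Parameter(torch.randn(lora_rank, in_f) * 0.01)   # (r_lora, in_f)
        self.B = nn.Parameter(torch.zeros(out_f, lora_rank))         # (out_f, r_lora)
        self.alpha = alpha

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = (x @ self.SVh.T) @ self.U.T                       # frozen SVD base
        if self.bias is not None:
            y = y + self.bias
        return y + self.alpha * ((x @ self.A.T) @ self.B.T)   # trainable LoRA adapter
```

Replacing every eligible nn.Linear with this layer, then freezing everything except A and B, is what turns a static SVD compression into something recoverable.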


🎯 Performance Recovery via Fine-Tuning

The LoRA adapters are optimized using a standard causal language modeling objective:

```
L(θ_LoRA) = -∑ log p(w_i | w_<i ; θ_SVD, θ_LoRA)
```

Here, only the LoRA parameters θ_LoRA are updated while the SVD base θ_SVD remains frozen.

This lets the model recover the information lost during compression while updating only ~1.8% of the original parameter count (19.82M trainable parameters).
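A condensed sketch of what that optimization looks like, assuming a Hugging Face-style causal LM whose linear layers have already been replaced with SVD_LoRA_Linear (the real finetune.py may instead use the Trainer API and a streaming dataset):

```python
import torch

def finetune_lora(model, train_loader, lr=2e-4, steps=1000, device="cuda"):
    """Optimize only the LoRA adapters (A, B) against the causal-LM objective."""
    model.to(device).train()
    # Freeze the SVD base and everything else; keep only the LoRA adapters
    # trainable (names match the A/B attributes in the sketch above)
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith(".A") or name.endswith(".B")
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=lr)

    for _, batch in zip(range(steps), train_loader):
        input_ids = batch["input_ids"].to(device)
        # With labels=input_ids, HF causal-LM models return the mean of
        # -log p(w_i | w_<i) over the shifted tokens as `loss`
        loss = model(input_ids=input_ids, labels=input_ids).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```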


🏆 Breakthrough Results

Applied to TinyLlama-1.1B-Chat-v1.0, our method demonstrates a remarkable size-performance trade-off:

| Model Configuration | Parameters (M) | Size (MB) | Perplexity (PPL) |
|---|---|---|---|
| 1. Original Baseline | 1100.05 | 2098.18 | 18.77 |
| 2. SVD Compressed (α=0.7) | 529.07 | 1009.11 | 339.93 |
| 3. HyLoRA (SVD α=0.7 + LoRA r=32) | 538.98 | 1028.05 | 28.63 |

📊 Performance Analysis

The final model’s performance is highly sensitive to the initial SVD compression ratio (alpha) and the capacity of the LoRA adapter (rank). We performed a systematic sweep to find the optimal configuration.

| SVD Alpha (α) | LoRA Rank (r) | Final Size (MB) | Pre-Tune PPL | Post-Tune PPL |
|---|---|---|---|---|
| 0.5 | 16 | ~753 | 828.86 | 50.02 |
| 0.5 | 32 | ~772 | 828.86 | 49.84 |
| 0.7 | 16 | ~1009 | 339.93 | 29.14 |
| 0.7 | 32 | ~1028 | 339.93 | 28.63 |

Optimal Configuration: An SVD alpha of 0.7 combined with a LoRA rank of 32 provides the best balance, achieving the perplexity closest to the original baseline among the tested configurations while still delivering a ~51% reduction in the final model's size.
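The sweep itself reduces to a small grid loop; compress_model, finetune_lora, perplexity, and the data loaders below are hypothetical names standing in for this project's scripts:

```python
# Hypothetical grid sweep over (SVD alpha, LoRA rank); the helpers and loaders
# are placeholders for the project's compression, fine-tuning, and eval code.
results = []
for svd_alpha in (0.5, 0.7):
    for lora_rank in (16, 32):
        model = compress_model("TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                               svd_alpha=svd_alpha, lora_rank=lora_rank)
        pre_ppl = perplexity(model, eval_encodings)
        finetune_lora(model, train_loader)
        results.append((svd_alpha, lora_rank, pre_ppl, perplexity(model, eval_encodings)))
```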

🔧 Core Components

📁 svd_lora_layer.py

Defines SVD_LoRA_Linear, the hybrid layer replacing standard nn.Linear.

🧪 finetune.py

Performs LoRA-based fine-tuning on top of the frozen SVD-compressed model.

🧾 evaluate.py

Reports perplexity (PPL) and model size.
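For reference, a perplexity and size computation in the spirit of evaluate.py (a sketch assuming a Hugging Face causal LM and a list of pre-tokenized input_ids tensors; the actual script may stride over one long concatenated sequence instead):

```python
import math
import torch

@torch.no_grad()
def perplexity(model, encodings, device="cuda"):
    """Corpus perplexity = exp(total NLL / total predicted tokens)."""
    model.to(device).eval()
    total_nll, total_tokens = 0.0, 0
    for input_ids in encodings:                    # each tensor: (1, seq_len)
        input_ids = input_ids.to(device)
        out = model(input_ids=input_ids, labels=input_ids)
        n = input_ids.numel() - 1                  # tokens that actually get predicted
        total_nll += out.loss.item() * n           # loss is the mean NLL per token
        total_tokens += n
    return math.exp(total_nll / total_tokens)

def model_size_mb(model):
    """Report parameter memory in MB (dtype-aware, so fp16 counts 2 bytes/param)."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**2
```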


📉 Results Summary

HyLoRA achieves a powerful balance between size and performance:

  • 💾 Over 2x compression, reducing the final model from 1.1B to 539M parameters (51% parameter reduction).
  • 🎯 Recovers from a catastrophic ~1700% perplexity increase (from the SVD base) down to a final degradation of only +53% versus the original model.
  • 🔧 Achieves this recovery by fine-tuning only 19.82M parameters (just 1.8% of the original model's parameter count).

🧠 Insights

  • SVD-only can increase size without careful rank constraints.
  • LoRA can recover performance at extremely low cost.
  • The hybrid architecture provides a principled, extensible way to combine compression and adaptation.

🔮 Future Directions

  1. Architecturally-Aware Compression 🏗️

    graph LR
    A[MLP Layers] -->|Higher Compression| B[Optimal]
    C[Attention Layers] -->|Lower Compression| B
    
  2. Automated Compression Budgeting 🤖

    • Layer-adaptive alpha selection 🎛️
    • Perplexity-aware rank optimization 📊
  3. Hybrid Quantization

    • 4-bit quantization (bitsandbytes) 🔢
    • FP8 precision formats 🎯
  4. Hardware Optimization 🛡️

    • TensorRT deployment pipeline 🚀
    • CUDA-optimized kernels 🖥️

📚 Acknowledgements 🙏

  • TinyLlama Team for their powerful open-source model 🐑
  • Hugging Face for Transformers and Datasets libraries 🤗
  • LoRA authors for foundational adapter research 🧩
  • SVD pioneers for mathematical foundations of compression ➗

```
Original: 2098 MB, PPL 18.77 → Compressed: 1028 MB, PPL 28.63
2.04x smaller with only a 1.53x perplexity increase 🌟
```
This post is licensed under CC BY 4.0 by the author.