I Built "Legal Lens" — A Fine-Tuned AI That Translates Legal Jargon Into Plain English
Building a fine-tuned AI to translate legal jargon into plain English—from FLAN-T5 failures to Gemma-2B success using QLoRA on a free GPU, and the engineering lessons learned along the way.
The Problem
Legal contracts are intentionally confusing. Phrases like "In witness whereof, the parties hereto have executed this Agreement..." exist to protect lawyers, not people.
I set out to change that by fine-tuning an open-source LLM to simplify legal clauses into plain English.
This is the raw, unfiltered engineering journey. No shortcuts, real failures, and a final win.
Phase 1: The Dataset Problem
I started with the CUAD dataset (510+ real contracts from SEC filings). But here's the catch—CUAD only has complex clauses. There are no "simplified" versions.
You can't teach a model to simplify if you don't show it what "simple" looks like.
So I engineered a custom parallel corpus of 2,000 complex-to-simple legal clause pairs covering:
- Termination
- Indemnification
- Confidentiality
- Liability
- And more
This was the foundation. Data quality over data quantity.
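Each record in the corpus is just a complex clause paired with its plain-English rewrite, plus a category tag. A minimal sketch of how I'd expect such a record to look as JSONL (the field names here are illustrative assumptions, not the actual schema):

```python
import json

# One training record: a complex clause paired with its plain-English rewrite.
# Field names ("complex", "simple", "category") are illustrative assumptions.
pair = {
    "complex": (
        "In witness whereof, the parties hereto have executed this "
        "Agreement as of the date first above written."
    ),
    "simple": "The parties signed this agreement on the date at the top.",
    "category": "execution",
}

# Stored as JSONL: one pair per line, trivial to stream into a trainer.
line = json.dumps(pair)
restored = json.loads(line)
print(restored["category"])  # execution
```

JSONL keeps each pair independent, so you can grow the corpus category by category without rewriting the file.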
Phase 2: The Failures (This Is Where the Real Learning Happened)
My first attempt was with FLAN-T5. It was a disaster.
What Went Wrong:
NaN losses due to fp16 instability with the T5 architecture:

```
# Training would just explode
Step 50: loss = NaN
Step 51: loss = NaN
```
Mode collapse — the model just repeated "This clause applies only to the parties" for every input, no matter what I fed it.
A high eval_loss of 4.97 with complete gibberish output. The model wasn't learning anything meaningful.
I tried fixing:
- Learning rates
- Padding strategies
- Deprecation errors (`evaluation_strategy` → `eval_strategy`, `as_target_tokenizer` removed)
Nothing worked well enough. The T5 architecture simply wasn't the right tool for this task.
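For context on those NaN losses: fp16 tops out just above 65,504, and T5's activations are known to overflow that ceiling, after which every downstream value becomes NaN. A quick demonstration of the failure mode:

```python
import numpy as np

# fp16's max finite value is ~65504; T5 activations can exceed it.
x = np.float16(60000) * np.float16(2)
print(x)      # inf — the overflow itself
print(x - x)  # nan — inf propagates to NaN, and the loss never recovers
```

Once a single NaN enters the loss, gradients are NaN everywhere and training is effectively dead from that step onward.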
Phase 3: The Pivot That Changed Everything
I switched to Google's Gemma-2B with a completely different approach:
The Setup:
4-bit Quantization (QLoRA) via bitsandbytes
- Fit the entire 2.5B param model on a free Colab T4 GPU
- Memory efficient, fast inference
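The loading step looks roughly like this (a sketch of the standard bitsandbytes NF4 recipe, not the exact notebook code):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with fp16 compute: the usual QLoRA loading recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4, designed for QLoRA
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,       # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",
    quantization_config=bnb_config,
    device_map="auto",
)
```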
LoRA adapters on q_proj and v_proj
- Only 921,600 trainable params (0.037% of total!)
- Full model stays frozen, adapters do the learning
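That 921,600 figure checks out arithmetically. LoRA adds two low-rank matrices per target projection, A (d_in × r) and B (r × d_out), i.e. r·(d_in + d_out) parameters each. Assuming Gemma-2B's published dimensions (18 layers, hidden size 2048, multi-query attention so v_proj outputs only 256 dims) and a LoRA rank of 8:

```python
# Assumed Gemma-2B dims: 18 layers, hidden 2048, q_proj out 2048,
# v_proj out 256 (multi-query attention), LoRA rank r = 8.
r, layers, hidden = 8, 18, 2048
q_out, v_out = 2048, 256

# r * (d_in + d_out) params per adapted projection, two projections per layer.
per_layer = r * (hidden + q_out) + r * (hidden + v_out)
total = layers * per_layer
print(total)                           # 921600 trainable params
print(round(total / 2.5e9 * 100, 3))  # ~0.037 (% of the ~2.5B total)
```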
SFTTrainer from TRL library with Gemma's native chat template
- Clean training loop
- Proper instruction formatting
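Gemma's chat template wraps each side of the conversation in turn markers. In practice the tokenizer's `apply_chat_template` handles this, but the rendered text looks roughly like the hand-rolled illustration below (the instruction wording is my assumption, not the notebook's exact prompt):

```python
def format_pair(clause: str, simple: str) -> str:
    """Render one training pair with Gemma-style turn markers.
    The instruction wording is an illustrative assumption."""
    return (
        "<start_of_turn>user\n"
        f"Simplify this legal clause into plain English:\n{clause}<end_of_turn>\n"
        "<start_of_turn>model\n"
        f"{simple}<end_of_turn>\n"
    )

text = format_pair(
    "In witness whereof, the parties hereto have executed this Agreement...",
    "The parties signed this agreement.",
)
print(text)
```

Getting this formatting right matters: train on one template and prompt with another, and the model's outputs degrade sharply.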
Training config:

```python
training_args = SFTConfig(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=3,
    learning_rate=2e-4,
    fp16=True,
    optim="paged_adamw_8bit",
    logging_steps=25,
)
```
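With gradient accumulation, the effective batch size is 2 × 4 = 8. Assuming all 2,000 pairs go through training, the run's step budget works out to:

```python
pairs, per_device, accum, epochs = 2000, 2, 4, 3

effective_batch = per_device * accum        # 8 examples per optimizer step
steps_per_epoch = pairs // effective_batch  # 250 optimizer steps per epoch
total_steps = steps_per_epoch * epochs      # 750 steps across 3 epochs
print(effective_batch, steps_per_epoch, total_steps)
```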
The Results
Training loss dropped beautifully: 3.58 → 0.77 in just 25 minutes.
Before:
"In witness whereof, the parties hereto have executed this Agreement as of the date first above written."
After (Model Output):
"The parties that are mentioned in this clause have done everything that is in the next clause."
It works. The model actually understands legal structure and translates it.
Key Engineering Lessons
1. Data Quality > Data Quantity
2,000 well-crafted pairs beat 10,000 generic ones. Every pair I created taught the model something specific about legal-to-plain translation.
2. Model Choice Matters Enormously
FLAN-T5 failed at this task. Gemma-2B nailed it. Architecture isn't just a detail—it's the foundation.
3. QLoRA Is a Game-Changer
Fine-tuning a 2.5B model on a free GPU? That's democratization of AI. No expensive cloud credits. No fancy hardware. Just smart engineering.
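Rough back-of-the-envelope math shows why it fits (ignoring activations, the tiny adapter's optimizer state, and quantization overhead):

```python
params = 2.5e9

fp16_gb = params * 2 / 1e9   # ~5.0 GB just for fp16 weights
nf4_gb = params * 0.5 / 1e9  # ~1.25 GB at 4 bits per weight
t4_vram_gb = 16              # a free Colab T4's VRAM

print(fp16_gb, nf4_gb)  # 5.0 1.25 — 4-bit leaves plenty of the 16 GB free
```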
4. Fail Fast, Pivot Faster
The FLAN-T5 failures taught me more about NLP engineering than any tutorial. Understanding why something doesn't work is just as valuable as making something work.
Tech Stack
- Python
- Hugging Face Transformers
- PEFT/LoRA
- TRL (Transformer Reinforcement Learning)
- bitsandbytes (quantization)
- Google Colab (T4 GPU)
Try It Yourself
The full notebook is on GitHub: Legal Lens - Clause Simplifier
If you're working on LegalTech or domain-specific LLM fine-tuning, let's connect!
Legal jargon doesn't have to be a barrier. With the right approach, open-source models, and some persistence, we can make legal documents accessible to everyone.