
For a long time, many people assumed that Large Language Models were defined purely by their parameter size. As you go deeper into the field, you start to see that LLMs sit within a broader landscape of model architectures, each optimized for different tasks. Two major categories dominate modern NLP systems: representation models and generative models.
Representation models focus on understanding text. Generative models focus on producing it. Both are powerful, and both can be fine tuned for domain specific tasks. This post walks through what these models are, how they differ, and how to fine tune representation models using two practical strategies.
Representation Models (Encoder Only)
Representation models are built to capture meaning, structure, and context from text. Their main output is an embedding, which is a numerical representation of the input text. These embeddings can be used for search, classification, clustering, retrieval augmented generation, semantic similarity, and more.
If you give a representation model a sentence, it produces a vector that encodes the sentence's semantics. Different sentences with similar meaning should produce vectors close to each other in embedding space.
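As a quick intuition check, here is a minimal sketch using the sentence-transformers library (the model name is just an illustrative choice, not a requirement). It encodes three sentences and compares them with cosine similarity, so the two movie related sentences should score closer to each other than to the unrelated one.
from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; any sentence embedding model would work here
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = [
    "The movie was a delight from start to finish.",
    "I thoroughly enjoyed the film.",
    "The invoice is due at the end of the month."
]
embeddings = encoder.encode(sentences)

# Sentences with similar meaning should yield a higher cosine similarity
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low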
Common Representation Models
BERT family
These models introduced the encoder only architecture that powered much of early modern NLP.
- BERT base and BERT large
- DistilBERT
- ALBERT
- RoBERTa
- SpanBERT
- SciBERT, BioBERT, ClinicalBERT
- FinBERT
Modern embedding focused models
These models refine BERT like architectures for better semantic similarity performance, multilingual support, or efficiency.
- SBERT (Sentence BERT)
- MiniLM
- E5 models (e5 small, e5 base, e5 large, e5 mistral)
- GTE models (gte base, gte large, gte multilingual)
- BGE models (bge base, bge large, bge m3, bge reranker)
- MPNet
- SimCSE
Representation models are not designed to generate text. They excel at producing embeddings that downstream tasks can use.
Generative Models (Decoder Only)
Generative models do more than understand text. They can produce new content based on a prompt. These are the models behind modern chat assistants, image generators, and code generation tools.
Examples include:
- ChatGPT generating text completions
- Midjourney producing images
- Music generation models composing melodies
The text generators among these use decoder only transformer architectures, trained on massive corpora so they can predict the next token accurately; image and music generators rely on other architectures. Their strength lies in creativity, fluency, and instruction following.
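As a small sketch of what next token prediction looks like in code (GPT-2 is used here purely because it is a small, freely available decoder only model):
from transformers import pipeline

# Generate a short continuation of a prompt with a decoder-only model
generator = pipeline("text-generation", model="gpt2")
print(generator("Representation models are great at", max_new_tokens=20)[0]["generated_text"])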
Why Fine Tune Representation Models?
Most pretrained models are trained on publicly available data. They perform well across general domains, but may not capture the nuances of your specific field.
If you work in law, finance, medicine, manufacturing, or any domain with specialized vocabulary, fine tuning your representation model can dramatically improve downstream tasks such as:
- Text classification
- Intent detection
- Semantic search
- Information retrieval
- Document similarity
Representation models can be fine tuned in multiple ways. Here, we explore two practical methods: standard classification based fine tuning and SetFit.
Fine Tuning a Representation Model Using Classification
This is the classic approach. You start with a pretrained encoder model like BERT and fine tune it with labeled data. The model learns to adjust its internal embeddings so they better align with your task.
A key point is that fine tuning does not require training the entire model from scratch. You load a pretrained model, freeze most layers initially, and train the classification head.
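As a sketch of that idea, assuming model is the AutoModelForSequenceClassification instance loaded in Step 1 below, freezing the pretrained encoder so that only the new classification head is updated looks like this:
# Freeze the pretrained encoder; only the randomly initialized
# classification head will receive gradient updates
for param in model.base_model.parameters():
    param.requires_grad = False
In practice you can also unfreeze the encoder after a few epochs, or simply fine tune everything end to end as the walkthrough below does.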
Below is an example using the Rotten Tomatoes dataset for sentiment classification.
Step 1. Load Model and Tokenizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = 'google-bert/bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
Step 2. Load Dataset
from datasets import load_dataset
dataset_name = 'rotten_tomatoes'
data = load_dataset(dataset_name)
train = data['train']
test = data['test']
Step 3. Tokenize the Data
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True)

train_data_text = train.map(tokenize, batched=True)
test_data_text = test.map(tokenize, batched=True)
Step 4. Train the Model
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
args = TrainingArguments(
"model",
learning_rate=2e-5,
num_train_epochs=2,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
save_strategy="epoch",
report_to="none",
weight_decay=0.01
)
import evaluate
import numpy as np
def evaluate_performance(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    metric = evaluate.load("f1")
    return metric.compute(references=labels, predictions=predictions)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data_text,
    eval_dataset=test_data_text,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=evaluate_performance
)
trainer.train()
Step 5. Evaluate
print(trainer.evaluate())
This approach works well when you have a reasonably sized labeled dataset.
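To sanity check the fine tuned model on new text, you can wrap it in a text classification pipeline. The review sentence below is made up, and the label names default to LABEL_0 and LABEL_1 unless you configure id2label on the model.
from transformers import pipeline

# Reuse the fine-tuned model and tokenizer from the steps above
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("A funny, heartfelt film that never overstays its welcome."))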
Fine Tuning a Representation Model Using SetFit
Not all tasks have the luxury of large labeled datasets. Many real world scenarios involve domain specific sentences with limited annotations. SetFit was built to solve this.
SetFit uses contrastive learning to train embeddings using only a small number of labeled examples per class. Instead of training a full transformer end to end, it trains a lightweight head over embeddings from SentenceTransformer models. The result is fast training, small data requirements, and strong accuracy.
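Conceptually, the contrastive step turns a handful of labeled sentences into many training pairs: two sentences from the same class form a positive pair, two from different classes form a negative pair. A rough sketch of that pairing idea (a simplification, not SetFit's actual implementation):
from itertools import combinations

# Toy labeled examples: (text, label)
examples = [("great movie", 1), ("loved it", 1), ("terrible plot", 0), ("boring film", 0)]

pairs = []
for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
    # 1.0 marks a same-class (positive) pair, 0.0 a cross-class (negative) pair
    pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))

print(pairs)
The sentence transformer is then tuned so that positive pairs move closer together in embedding space and negative pairs move apart, after which a lightweight classification head is fit on the resulting embeddings.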
For this example, we will use the model sentence-transformers/all-mpnet-base-v2 along with a custom dataset containing sentences from multiple domains.
Load Dataset and Sample Small Training Data
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset
from datasets import load_dataset
data = load_dataset("kannanrbk/multidomain-sentences-dataset")
train = data['train']
test = data['test']
sample_data = sample_dataset(dataset=train, num_samples=16)
This yields 80 training examples: 16 sampled from each of the 5 classes.
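You can quickly sanity check the sampled split (this assumes the dataset exposes its classes in a label column, which is what sample_dataset looks for by default):
from collections import Counter

print(len(sample_data))               # 80
print(Counter(sample_data['label']))  # 16 examples for each of the 5 classes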
Fine Tune the Model
model = SetFitModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")

args = TrainingArguments(
    num_epochs=3,
    num_iterations=20
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=sample_data,
    eval_dataset=test,
    metric="f1",
    metric_kwargs={"average": "macro"}
)
trainer.train()
print(trainer.evaluate())
SetFit often performs impressively well with very small training sets. It is a great fit when you want quick iteration cycles or when collecting large datasets is costly.
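Once trained, the SetFit model can classify new sentences directly (the examples below are made up for illustration):
preds = model.predict([
    "The contract must be signed by both parties before closing.",
    "The patient responded well to the new treatment."
])
print(preds)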
Final Thoughts
Representation models and generative models serve different but complementary roles in modern NLP. Representation models excel at understanding text, generating embeddings, and powering downstream tasks like classification and retrieval. Generative models excel at producing text and reasoning over complex instructions.
Fine tuning your representation model can unlock significantly better performance for domain specific applications. Whether you use standard classification based fine tuning or a more data efficient approach like SetFit, the key is choosing the method that aligns with your available data and task requirements.
If you are building systems that rely on semantic understanding or embeddings, investing time in fine tuning can yield meaningful improvements in accuracy and reliability.