
For a long time, many people assumed that Large Language Models were defined purely by their parameter size. As you go deeper into the field, you start to see that LLMs sit within a broader landscape of model architectures, each optimized for different tasks. Two major categories dominate modern NLP systems: representation models and generative models.
Representation models focus on understanding text. Generative models focus on producing it. Both are powerful, and both can be fine tuned for domain specific tasks. This post walks through what these models are, how they differ, and how to fine tune representation models using two practical strategies.
Representation Models (Encoder Only)
Representation models are built to capture meaning, structure, and context from text. Their main output is an embedding, which is a numerical representation of the input text. These embeddings can be used for search, classification, clustering, retrieval augmented generation, semantic similarity, and more.
If you give a representation model a sentence, it produces a vector that encodes the sentence's semantics. Different sentences with similar meaning should produce vectors close to each other in embedding space.
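As a quick intuition check, here is a minimal sketch using the sentence-transformers library (the model name is just an illustrative choice, not a requirement). It encodes three sentences and compares them with cosine similarity, so the two movie related sentences should score closer to each other than to the unrelated one.
from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; any sentence embedding model would work here
encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

sentences = [
    "The movie was a delight from start to finish.",
    "I thoroughly enjoyed the film.",
    "The invoice is due at the end of the month."
]
embeddings = encoder.encode(sentences)

# Sentences with similar meaning should yield a higher cosine similarity
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # relatively low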
Common Representation Models
BERT family
These models introduced the encoder only architecture that powered much of early modern NLP.
- BERT base and BERT large
- DistilBERT
- ALBERT
- RoBERTa
- SpanBERT
- SciBERT, BioBERT, ClinicalBERT
- FinBERT
Modern embedding focused models
These models refine BERT like architectures for better semantic similarity performance, multilingual support, or efficiency.
- SBERT (Sentence BERT)
- MiniLM
- E5 models (e5 small, e5 base, e5 large, e5 mistral)
- GTE models (gte base, gte large, gte multilingual)
- BGE models (bge base, bge large, bge m3, bge reranker)
- MPNet
- SimCSE
Representation models are not designed to generate text. They excel at producing embeddings that downstream tasks can use.
Generative Models (Decoder Only)
Generative models do more than understand text. They can produce new content based on a prompt. These are the models behind modern chat assistants, image generators, and code generation tools.
Examples include:
- ChatGPT generating text completions
- Midjourney producing images
- Music generation models composing melodies
The text generators among these use decoder only transformer architectures, trained on massive corpora so they can predict the next token accurately; image and music generators rely on other architectures. Their strength lies in creativity, fluency, and instruction following.
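As a small sketch of what next token prediction looks like in code (GPT-2 is used here purely because it is a small, freely available decoder only model):
from transformers import pipeline

# Generate a short continuation of a prompt with a decoder-only model
generator = pipeline("text-generation", model="gpt2")
print(generator("Representation models are great at", max_new_tokens=20)[0]["generated_text"])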
Why Fine Tune Representation Models?
Most pretrained models are trained on publicly available data. They perform well across general domains, but may not capture the nuances of your specific field.
If you work in law, finance, medicine, manufacturing, or any domain with specialized vocabulary, fine tuning your representation model can dramatically improve downstream tasks such as:
- Text classification
- Intent detection
- Semantic search
- Information retrieval
- Document similarity
Representation models can be fine tuned in multiple ways. Here, we explore two practical methods: standard classification based fine tuning and SetFit.
Fine Tuning a Representation Model Using Classification
This is the classic approach. You start with a pretrained encoder model like BERT and fine tune it with labeled data. The model learns to adjust its internal embeddings so they better align with your task.
A key point is that fine tuning does not require training the entire model from scratch. You load a pretrained model, freeze most layers initially, and train the classification head.
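As a sketch of that idea, assuming model is the AutoModelForSequenceClassification instance loaded in Step 1 below, freezing the pretrained encoder so that only the new classification head is updated looks like this:
# Freeze the pretrained encoder; only the randomly initialized
# classification head will receive gradient updates
for param in model.base_model.parameters():
    param.requires_grad = False
In practice you can also unfreeze the encoder after a few epochs, or simply fine tune everything end to end as the walkthrough below does.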
Below is an example using the Rotten Tomatoes dataset for sentiment classification.
Step 1. Load Model and Tokenizer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = 'google-bert/bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
Step 2. Load Dataset
from datasets import load_dataset
dataset_name = 'rotten_tomatoes'
data = load_dataset(dataset_name)
train = data['train']
test = data['test']
Step 3. Tokenize the Data
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True)

train_data_text = train.map(tokenize, batched=True)
test_data_text = test.map(tokenize, batched=True)
Step 4. Train the Model
from transformers import Trainer, TrainingArguments, DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
args = TrainingArguments(
"model",
learning_rate=2e-5,
num_train_epochs=2,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
save_strategy="epoch",
report_to="none",
weight_decay=0.01
)
import evaluate
import numpy as np
def evaluate_performance(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    metric = evaluate.load("f1")
    return metric.compute(references=labels, predictions=predictions)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data_text,
    eval_dataset=test_data_text,
    processing_class=tokenizer,
    data_collator=data_collator,
    compute_metrics=evaluate_performance
)
trainer.train()
Step 5. Evaluate
print(trainer.evaluate())
This approach works well when you have a reasonably sized labeled dataset.
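To sanity check the fine tuned model on new text, you can wrap it in a text classification pipeline. The review sentence below is made up, and the label names default to LABEL_0 and LABEL_1 unless you configure id2label on the model.
from transformers import pipeline

# Reuse the fine-tuned model and tokenizer from the steps above
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("A funny, heartfelt film that never overstays its welcome."))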
Fine Tuning a Representation Model Using SetFit
Not all tasks have the luxury of large labeled datasets. Many real world scenarios involve domain specific sentences with limited annotations. SetFit was built to solve this.
SetFit uses contrastive learning to train embeddings using only a small number of labeled examples per class. Instead of training a full transformer end to end, it trains a lightweight head over embeddings from SentenceTransformer models. The result is fast training, small data requirements, and strong accuracy.
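Conceptually, the contrastive step turns a handful of labeled sentences into many training pairs: two sentences from the same class form a positive pair, two from different classes form a negative pair. A rough sketch of that pairing idea (a simplification, not SetFit's actual implementation):
from itertools import combinations

# Toy labeled examples: (text, label)
examples = [("great movie", 1), ("loved it", 1), ("terrible plot", 0), ("boring film", 0)]

pairs = []
for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
    # 1.0 marks a same-class (positive) pair, 0.0 a cross-class (negative) pair
    pairs.append((text_a, text_b, 1.0 if label_a == label_b else 0.0))

print(pairs)
The sentence transformer is then tuned so that positive pairs move closer together in embedding space and negative pairs move apart, after which a lightweight classification head is fit on the resulting embeddings.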
For this example, we will use the model sentence-transformers/all-mpnet-base-v2 along with a custom dataset containing sentences from multiple domains.
Load Dataset and Sample Small Training Data
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset
from datasets import load_dataset
data = load_dataset("kannanrbk/multidomain-sentences-dataset")
train = data['train']
test = data['test']
sample_data = sample_dataset(dataset=train, num_samples=16)
This yields 80 training examples: 16 sampled from each of the 5 classes.
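You can quickly sanity check the sampled split (this assumes the dataset exposes its classes in a label column, which is what sample_dataset looks for by default):
from collections import Counter

print(len(sample_data))               # 80
print(Counter(sample_data['label']))  # 16 examples for each of the 5 classes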
Fine Tune the Model
model = SetFitModel.from_pretrained("sentence-transformers/all-mpnet-base-v2")

args = TrainingArguments(
    num_epochs=3,
    num_iterations=20
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=sample_data,
    eval_dataset=test,
    metric="f1",
    metric_kwargs={"average": "macro"}
)
trainer.train()
print(trainer.evaluate())
SetFit often performs impressively well with very small training sets. It is a great fit when you want quick iteration cycles or when collecting large datasets is costly.
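Once trained, the SetFit model can classify new sentences directly (the examples below are made up for illustration):
preds = model.predict([
    "The contract must be signed by both parties before closing.",
    "The patient responded well to the new treatment."
])
print(preds)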
Final Thoughts
Representation models and generative models serve different but complementary roles in modern NLP. Representation models excel at understanding text, generating embeddings, and powering downstream tasks like classification and retrieval. Generative models excel at producing text and reasoning over complex instructions.
Fine tuning your representation model can unlock significantly better performance for domain specific applications. Whether you use standard classification based fine tuning or a more data efficient approach like SetFit, the key is choosing the method that aligns with your available data and task requirements.
If you are building systems that rely on semantic understanding or embeddings, investing time in fine tuning can yield meaningful improvements in accuracy and reliability.