Text Classification: Simplified

Text classification is the process of assigning predefined categories or labels to text data. A common example is sentiment analysis, where we determine whether a piece of text expresses a positive or negative emotion.

In this post, we’ll explore three different methods for performing sentiment analysis on text data:

Task-Specific Models (Pretrained Models)
Embedding Models + Regression
Generative Models

We’ll use the sealuzh/app_reviews dataset from Hugging Face, which contains around 40K training samples and 10K test samples of app reviews.
For simplicity and faster processing, we’ll use a smaller subset of the test data.

1. Using a Task-Specific (Pretrained) Model

Task-specific models are representation models—encoder-only transformers like BERT or RoBERTa—that are pretrained for classification tasks.

A popular example for sentiment analysis is the nlptown/bert-base-multilingual-uncased-sentiment model from Hugging Face.
Though originally trained on Twitter data, it generalizes well to other review data.

Steps:

Load the dataset
Load the pretrained model
Predict using the model
Evaluate performance

Load the dataset

from datasets import load_dataset 

data = load_dataset('sealuzh/app_reviews')

overall_training_data = data['train'].select(range(10000))
training_data = overall_training_data.select(range(9000))
test_data = overall_training_data.select(range(9000, 10000))

Load the model

from transformers import pipeline

pipe = pipeline(
    "text-classification",
    model="nlptown/bert-base-multilingual-uncased-sentiment",
    truncation=True,
    return_all_scores=True,
    max_length=512
)

Note: Since some reviews exceed 512 characters, we truncate them to fit the model’s maximum input length.

Predict and evaluate

from transformers.pipelines.pt_utils import KeyDataset
from sklearn.metrics import classification_report

y_pred = []
for result in pipe(KeyDataset(test_data, 'review')):
    y_pred.append(int(result[0]['label'].split(' ')[0]))

def evaluate_performance(y_true, y_pred):
    return classification_report(y_true=y_true, y_pred=y_pred)

print(evaluate_performance(test_data['rating'], y_pred))

The model performs reasonably well for clearly negative and positive reviews, but tends to struggle distinguishing between mid-range ratings (2–4). This is a known limitation when mapping multi-class sentiment to binary outcomes.

2. Using an Embedding Model + Regression

The second approach uses text embeddings as features and a regression model for prediction.
This method is often more flexible and interpretable than a single fine-tuned transformer.

We’ll use:

Embedding model: sentence-transformers/all-mpnet-base-v2
Regression model: LogisticRegression from scikit-learn

Steps:

Generate embeddings for the reviews
Train a regression model
Predict and evaluate

Encode reviews and train the regression model

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

embedding_model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

training_embeddings = embedding_model.encode(training_data['review'])

regression_model = LogisticRegression(random_state=42)
regression_model.fit(training_embeddings, training_data['rating'])

Evaluate on test data

test_embeddings = embedding_model.encode(test_data['review'])
y_pred = regression_model.predict(test_embeddings)

print(evaluate_performance(test_data['rating'], y_pred))

The embedding + regression approach generally yields more stable predictions for medium-range ratings and can be retrained easily on new data. However, it requires more setup and training time.

3. Using Generative Models (LLMs)

Generative models like GPT can perform zero-shot or few-shot sentiment analysis by reasoning over natural language instructions.

We’ll use OpenAI’s GPT model to predict sentiment ratings between 1 and 5 directly from text.

Example Codex

from openai import OpenAI
client = OpenAI()

prompt = """
You are an expert in understanding user sentiment from app reviews.
Your task is to predict how satisfied the user is with the app, on a scale of 1 to 5, based on the tone, wording, and emotion expressed in their review.

1 → Very Negative
2 → Negative
3 → Neutral
4 → Positive
5 → Very Positive

Here are a few examples:

Review: Great app! Works well on my Galaxy Note 5.
Rating: 5

Review: Crash when set a demodulation! :(
Rating: 1

Review: The app is okay but could use some improvements.
Rating: 3

Now, predict the rating for the following review.

Review:
<REVIEW_TEXT>

Answer format:
Rating: <number between 1 and 5>
"""

def predict_rating(review):
    completion = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that rates app reviews."},
            {"role": "user", "content": prompt.replace("<REVIEW_TEXT>", review)},
        ],
    )
    return completion.choices[0].message.content.strip()

if __name__ == "__main__":
    test_review = "Crash when set a demodulation! :("
    rating = predict_rating(test_review)
    print(f"Predicted Rating: {rating}")

This approach is powerful because it requires no training—you simply describe the task.
However, it may be slower and more expensive for large datasets, depending on the model used.

Conclusion

Method	Type	Pros	Cons
Task-Specific Model (BERT)	Encoder-based	Fast, pretrained, good for polarity (positive/negative)	Limited to fixed labels
Embedding + Regression	Hybrid	Flexible, retrainable, interpretable	Requires training & tuning
Generative Model (GPT)	LLM-based	Zero-shot, adaptable, intuitive	Slower and costlier

Each method has trade-offs, and the right choice depends on your data size, latency requirements, and flexibility needs.

If you’re working with large-scale structured datasets and need reproducibility, go with embedding models.
If you’re exploring or need to handle complex, nuanced sentiments, generative models are incredibly effective.