Tokenizers Explained — The Building Blocks of Language Models

When we interact with any Large Language Model (LLM) — like ChatGPT, Gemini, or Claude — we type in text and receive text back.

But under the hood, these models don’t “see” words.
They process tokens — small chunks of text — one at a time.

If you’ve ever noticed that ChatGPT seems to “think” and type word by word, that’s because it’s generating one token at a time.


What Are Tokens?

When we input a text like:

“What is a large language model (LLM)?”

Before the model can process it, the text must be broken down into tokens — the smallest units the model can understand.

A simple way to tokenize would be splitting by spaces:

["What", "is", "a", "large", "language", "model", "(LLM)?"]

However, this simple approach doesn’t always work — especially for languages without spaces (like Chinese or Japanese) or for words with punctuation and variations.
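Here is a minimal sketch of that naive approach in Python (whitespace_tokenize is just a hypothetical helper for illustration). Notice how the punctuation stays glued to the last word:

# A naive whitespace tokenizer (hypothetical helper, for illustration only)
def whitespace_tokenize(text: str) -> list[str]:
    return text.split()

print(whitespace_tokenize("What is a large language model (LLM)?"))
# ['What', 'is', 'a', 'large', 'language', 'model', '(LLM)?']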

That’s where advanced tokenizers come in.


Common Types of Tokenizers

1. Word-Based Tokenizers

These split the text by words.
For example:

Input: "AI is amazing"
Tokens: ["AI", "is", "amazing"
]

Pros: Simple and intuitive
Cons: Cannot handle new or rare words (e.g., “ChatGPTified”) because they weren’t seen during training.


2. Subword Tokenizers

Subword tokenizers split words into smaller chunks that can still carry meaning.

For example:

Input: "apologized"
Tokens: ["apolog", "ized"
]

Here, both “apology” and “apologized” share the root “apolog”.
If a new word like “apologizer” appears, the model can still understand it because it’s built from known subwords.

Pros: Handles new words effectively
Cons: Slightly more complex to train
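To see subword splitting in practice, you can inspect how an existing subword tokenizer handles an unfamiliar word. The sketch below uses GPT-2's tokenizer from Hugging Face as an example; the exact split is not guaranteed and depends on its learned vocabulary:

from transformers import AutoTokenizer

# GPT-2 ships a byte-level BPE (subword) tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# An invented word is still representable, built from smaller known pieces;
# the exact split depends on the vocabulary learned during training.
print(tokenizer.tokenize("apologizer"))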


3. Character Tokenizers

This tokenizer breaks everything down into individual characters.

Input: "AI"
Tokens: ["A", "I"
]

Pros: Works for any language or unseen text
Cons: Creates too many tokens — slower and less efficient.
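In Python terms, character tokenization is essentially just splitting a string into its individual characters:

# Character-level tokenization: every single character becomes a token
print(list("AI is amazing"))
# ['A', 'I', ' ', 'i', 's', ' ', 'a', 'm', 'a', 'z', 'i', 'n', 'g']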


4. Byte Tokenizers

Similar to character tokenizers, but they operate at the byte level: on the raw bytes of the text's encoding (such as UTF-8) rather than on the characters themselves.

This is great for handling Unicode characters, emojis, and non-Latin scripts.

Pros: Universal — can represent any text
Cons: More tokens = more computation
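A quick sketch of what "byte level" means, using Python's built-in UTF-8 encoding: even a single emoji expands into several bytes.

# Byte-level view of text: each character maps to one or more UTF-8 bytes
print(list("AI 🙂".encode("utf-8")))
# [65, 73, 32, 240, 159, 153, 130]  (the emoji alone takes four bytes)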


Now that we understand the types, let’s explore three widely used tokenization algorithms behind modern LLMs.


1. Byte Pair Encoding (BPE)

Used in: GPT Models (OpenAI)

BPE starts with single characters and then repeatedly merges the most frequent pairs to form new tokens.

Example:

Initial tokens: ["l", "o", "w", "e", "r"]
Frequent pair: ("l", "o") → merge → "lo"
Next: ("lo", "w") → merge → "low"

Over time, “low”, “lowest”, and “lower” might share base tokens like “low”.

✅ Great at compressing common patterns
✅ Efficient for English and similar languages
❌ Not ideal for languages with complex scripts (like Chinese)
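To make the merge loop concrete, here is a toy BPE trainer, an illustrative sketch rather than OpenAI's actual implementation: it counts adjacent symbol pairs across a tiny corpus and repeatedly merges the most frequent pair.

from collections import Counter

def train_bpe(words, num_merges):
    # Each word starts as a sequence of individual characters
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols in the corpus
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Pick the most frequent pair and record the merge
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere in the corpus
        merged_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged_corpus.append(out)
        corpus = merged_corpus
    return merges, corpus

merges, corpus = train_bpe(["lower", "lowest", "low"], num_merges=3)
print(merges)
print(corpus)

On this toy corpus, the first merges are ('l', 'o') and then ('lo', 'w'), matching the steps shown above.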


2. WordPiece

Used in: BERT and related models (e.g., DistilBERT, ELECTRA)

WordPiece is similar to BPE, but it selects merges based on likelihood rather than raw frequency.

At each step, it merges the pair of subwords that most increases the probability of the training corpus, so the resulting vocabulary is the one that best explains the training text.

Example:

"playing" → ["play", "##ing"]

Here “##” indicates a continuation of a word — a common convention in WordPiece tokenizers.

✅ Captures semantic meaning
✅ Efficient vocabulary size
❌ Slightly slower training than BPE
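You can see the "##" convention directly with BERT's tokenizer on Hugging Face; the exact splits depend on its learned vocabulary:

from transformers import AutoTokenizer

# bert-base-uncased ships a WordPiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Pieces that continue a word are prefixed with '##';
# the exact split depends on the vocabulary learned during training.
print(tokenizer.tokenize("tokenization"))   # e.g. ['token', '##ization']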


3. SentencePiece

Used in: LLaMA, ALBERT, T5, and other multilingual models

SentencePiece doesn’t rely on whitespace at all — it treats the entire text as a continuous stream of characters.

It can tokenize any language, making it perfect for multilingual models.

Example:

Input: "I love AI!"
Output: ["▁I", "▁love", "▁A", "I", "!"
]

The "▁" symbol (a special underscore-like character) marks the beginning of a new word, i.e., where a space appeared in the original text.

✅ Handles multiple languages and punctuation
✅ Doesn’t depend on pre-tokenized input
❌ Token IDs can be less intuitive to interpret
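To try a SentencePiece tokenizer yourself, ALBERT's tokenizer on Hugging Face is one option (loading "albert-base-v2" is an illustrative choice here, and it additionally requires the sentencepiece package):

from transformers import AutoTokenizer

# albert-base-v2 uses a SentencePiece tokenizer
tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")

# '▁' marks the start of a new word; the exact pieces depend on the learned vocabulary
print(tokenizer.tokenize("I love AI!"))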


Try It Yourself — Tokenization in Action

You can use Hugging Face’s transformers library to explore tokenization in different models:

from transformers import AutoTokenizer

def tokenize(model_name, text):
    # Load the tokenizer that was paired with this model during training
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Encode the text into token IDs (a PyTorch tensor of shape [1, sequence_length])
    token_ids = tokenizer(text, return_tensors="pt").input_ids

    # Decode each ID back to its text piece so you can see the individual tokens
    for token in token_ids[0]:
        print(">>> " + tokenizer.decode(token))

tokenize("microsoft/Phi-3-mini-4k-instruct", "What is a Large Language Model?")

Try swapping in other models (like "gpt2" or "bert-base-uncased") to see how each tokenizer breaks the same text into different pieces!


Choosing the Right Tokenizer

The tokenizer is not something you can pick freely at inference time.
It is fixed during the model’s training phase, because:

  • The tokenizer determines how words map to token IDs.
  • These IDs correspond to the model’s learned vocabulary.
  • Using a different tokenizer than the one used for training will break the model’s understanding, as the sketch after this list shows.
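A quick way to see why: the same sentence maps to completely different token IDs under different tokenizers, so a model trained on one vocabulary cannot make sense of IDs produced by another. The sketch below compares two public tokenizers from Hugging Face:

from transformers import AutoTokenizer

gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "Tokenizers matter!"

# The same text becomes different ID sequences, because each model
# learned its own vocabulary (and ID mapping) during training.
print(gpt2_tokenizer(text).input_ids)
print(bert_tokenizer(text).input_ids)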

💡 TL;DR

Type          | Example               | Used In        | Strength
Word-Based    | “AI”, “is”, “amazing” | Early models   | Simple
Subword       | “apolog”, “ized”      | WordPiece, BPE | Handles new words
Character     | “A”, “I”              | Rare           | Works on any text
Byte          | Binary encoding       | GPT, LLaMA     | Universal
SentencePiece | “▁AI”, “▁rocks”       | LLaMA          | Language-agnostic

Final Thoughts

A tokenizer is like a translator between humans and machines.
It breaks your sentence into tokens, assigns each a numeric ID, and helps the LLM understand what you mean.

In short:

Tokenizer = Text → Tokens → Token IDs → Understanding

Once you grasp this, you’re one step closer to understanding how language models think.