Getting Started with Language AI — How Machines Understand Text

Generative AI has been the talk of the town for the last few years.
But before diving deep into Generative AI and Large Language Models (LLMs), it’s essential to understand one foundational concept — how machines represent language.

Computers don’t understand text, emotions, or meaning the way humans do.
They only understand numbers — 0s and 1s.
So, for machines to “understand” text, we must convert words into numerical representations — a process called text vectorization.


Bag of Words (2000)

One of the earliest and simplest ways to represent text numerically is the Bag of Words (BoW) model.

Let’s consider two example sentences:

  • Sentence 1:
    "OpenAI is one of the leading artificial intelligence research and deployment companies"
  • Sentence 2:
    "Anthropic is one of the leading artificial intelligence (AI) research and development companies in the world"

Step 1. Build the Vocabulary

We create a combined list of unique words from both sentences.
That’s our vocabulary.

Vocabulary = ["OpenAI", "Anthropic", "is", "one", "of", "the", "leading",
"artificial", "intelligence", "AI", "research", "and", "deployment",
"development", "companies", "in", "world"]

Each sentence is then represented as a vector with one position per vocabulary word.
In the simplest (binary) form, a position holds 1 if the word appears in that sentence and 0 if it does not.

Example (simplified):

Sentence 1 → [1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0]
Sentence 2 → [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
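
Before reaching for a library, a minimal plain-Python sketch of the same idea looks like this (the parentheses around "AI" are dropped so that simple whitespace splitting works; variable names are illustrative):

sentences = [
    "OpenAI is one of the leading artificial intelligence research and deployment companies",
    "Anthropic is one of the leading artificial intelligence AI research and development companies in the world",
]

# Split on whitespace and collect the unique words (sorted alphabetically here)
tokenized = [sentence.split() for sentence in sentences]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Binary bag of words: 1 if the vocabulary word appears in the sentence, else 0
vectors = [[1 if word in tokens else 0 for word in vocabulary] for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)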

Example: Bag of Words using scikit-learn

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "OpenAI is one of the leading artificial intelligence research and deployment companies",
    "Anthropic is one of the leading artificial intelligence (AI) research and development companies in the world"
]

# Build the vocabulary and count how often each word occurs in each sentence
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one count vector per sentence

This prints the learned vocabulary and one count vector per sentence. Note that CountVectorizer lowercases tokens by default and counts occurrences rather than just flagging presence, so "the" gets a value of 2 in the second sentence.
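
Once sentences are vectors, they can be compared numerically. As an illustrative follow-up (not part of the original example), cosine similarity from scikit-learn measures how much the two count vectors overlap:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "OpenAI is one of the leading artificial intelligence research and deployment companies",
    "Anthropic is one of the leading artificial intelligence (AI) research and development companies in the world",
]

# Re-fit the vectorizer from the example above
X = CountVectorizer().fit_transform(sentences)

# Cosine similarity between the two count vectors: closer to 1 means more shared words
print(cosine_similarity(X[0], X[1]))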


⚠️ Limitations of Bag of Words

  1. No context awareness: It doesn’t understand meaning — it just counts words.
    For example, “Apple” (fruit) and “Apple” (company) are treated the same.
  2. Different languages: Some languages (like Japanese or Chinese) don’t put spaces between words.
    Example: “インドへようこそ” (“Welcome to India” in Japanese) can’t be tokenized by splitting on spaces (see the quick check after this list).
  3. High dimensionality: As vocabulary grows, vectors become large and sparse.
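
As a quick, toy illustration of the whitespace-tokenization problem (not from the original article):

# Whitespace tokenization works for English but not for Japanese,
# which does not separate words with spaces.
print("Welcome to India".split())  # ['Welcome', 'to', 'India']
print("インドへようこそ".split())       # ['インドへようこそ']: the whole phrase stays as one token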

Word2Vec (2013)

The next major leap in text representation came in 2013 with Word2Vec, introduced by Google researchers.

Unlike Bag of Words, Word2Vec uses neural networks to capture meaning and context.
It learns how words relate to each other by looking at nearby words using a technique called the sliding window.

🔍 Example Concept

Words that appear in similar contexts will have similar vector representations.

"car" → [0.8, 0.45, -0.22, ...]
"petrol" → [0.78, 0.50, -0.18, ...]

These vectors are close to each other in vector space because they’re semantically related.
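
To make “close in vector space” concrete, here is a tiny numpy sketch of cosine similarity using made-up, truncated 3-dimensional toy vectors (real Word2Vec vectors have far more dimensions and different values):

import numpy as np

# Toy, made-up 3-dimensional vectors (real embeddings typically have 100+ dimensions)
car = np.array([0.8, 0.45, -0.22])
petrol = np.array([0.78, 0.50, -0.18])

# Cosine similarity: 1.0 means identical direction, 0 means unrelated
similarity = np.dot(car, petrol) / (np.linalg.norm(car) * np.linalg.norm(petrol))
print(similarity)  # close to 1, so these toy vectors point in nearly the same direction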

🧪 Example: Word2Vec using gensim

from gensim.models import Word2Vec

sentences = [
    ["car", "needs", "petrol"],
    ["electric", "car", "uses", "battery"],
    ["bats", "are", "nocturnal", "animals"],
    ["cricket", "bats", "are", "made", "of", "willow"]
]

# vector_size: embedding dimension, window: context window size,
# min_count=1: keep even rare words, sg=1: use the skip-gram architecture
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

print(model.wv.most_similar("car"))  # words closest to "car" in vector space
print(model.wv["bats"])              # the 50-dimensional vector for "bats"

This trains a small skip-gram model that assigns a 50-dimensional vector to each word and lists the words most similar to “car”. With such a tiny toy corpus the similarities are noisy; meaningful results need much more training text.
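
As a follow-up usage sketch, continuing the toy model trained above (the exact scores will vary and are noisy on such a tiny corpus):

# Cosine similarity between two word vectors from the toy model above
print(model.wv.similarity("car", "petrol"))
print(model.wv.similarity("car", "bats"))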


Contextual Limitation of Word2Vec

While Word2Vec captures relationships between words, it still struggles with polysemy:
words that have multiple meanings depending on context.

Example:

  1. A bat is a nocturnal animal.
  2. Cricket bats are made from willow.

Here, “bat” means two different things.
Word2Vec can’t differentiate these because each word has a single static embedding.
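
Continuing the gensim sketch above, a quick check makes the limitation visible: the vector for “bats” is looked up from a single table, so it is identical no matter which sentence the word came from (variable names are illustrative):

import numpy as np

# Word2Vec keeps exactly one vector per word type, so the lookup ignores context
vector_in_animal_sentence = model.wv["bats"]   # from "bats are nocturnal animals"
vector_in_cricket_sentence = model.wv["bats"]  # from "cricket bats are made of willow"
print(np.array_equal(vector_in_animal_sentence, vector_in_cricket_sentence))  # True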


Transformers: Attention Is All You Need (2017)

In 2017, researchers at Google published a groundbreaking paper:
👉 “Attention Is All You Need”

This introduced the Transformer architecture, which revolutionized AI.

Transformer Architecture Components

  1. Encoder – Processes input text
  2. Decoder – Generates output (e.g., translated sentence)
  3. Self-Attention Mechanism – Understands context across the entire sequence (see the sketch after this list)
  4. Softmax Layer – Converts numerical scores into probabilities
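
To make the self-attention and softmax components concrete, here is a minimal numpy sketch of scaled dot-product attention, the core operation from the paper; the matrix sizes and values are illustrative only:

import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how strongly each token attends to every other token
    weights = softmax(scores, axis=-1)   # softmax turns the scores into probabilities per row
    return weights @ V                   # weighted mix of value vectors

# Toy example: 4 tokens, each represented by an 8-dimensional vector
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
# In a real Transformer, Q, K and V come from learned linear projections of x;
# here x is reused directly to keep the sketch short.
output = scaled_dot_product_attention(x, x, x)
print(output.shape)  # (4, 8): one context-aware vector per token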

Why It Mattered

Transformers solved the context problem with attention,
a mechanism that lets the model “focus” on the relevant parts of a sentence while processing it.

For instance, in the sentence:

“The animal didn’t cross the street because it was too tired.”

The model learns that “it” refers to “animal,” not “street.”

This architecture paved the way for modern models like GPT, BERT, Claude, and LLaMA — the foundation of today’s Language AI revolution.
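
As a hedged illustration of what “context-aware” means in practice, the sketch below uses the Hugging Face transformers library and the bert-base-uncased model (neither is named in this article): the same word “bat” gets a different vector in each sentence, unlike the single static Word2Vec embedding.

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_vector(sentence, word):
    # Run the sentence through BERT and return the hidden state of the target word
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return outputs.last_hidden_state[0, tokens.index(word)]

v_animal = contextual_vector("a bat is a nocturnal animal", "bat")
v_cricket = contextual_vector("he hit the ball with a wooden bat", "bat")

# The two vectors differ because each depends on the surrounding words
print(torch.cosine_similarity(v_animal, v_cricket, dim=0))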


Summary

Era   | Method        | Key Idea                 | Limitation
2000  | Bag of Words  | Count word occurrences   | No meaning/context
2013  | Word2Vec      | Semantic relationships   | Static embeddings
2017  | Transformers  | Context-aware attention  | Compute intensive

Final Thoughts

Language AI has evolved from counting words to understanding them.
Each generation of models — from Bag of Words to Word2Vec, and now Transformers — has brought us closer to enabling machines to comprehend human language.