Building a RAG (Retrieval Augmented Generation) System from Scratch
Large Language Models (LLMs) like ChatGPT have taken the world by storm, with ChatGPT alone surpassing 800 million users. It’s no surprise that people now ask LLMs for everything from trivia to technical explanations.
But there’s a catch.
Even though these models are incredibly capable, they are trained on static datasets. Ask something recent or domain-specific, and you’ll often get a confident but outdated answer.
So how do LLM-powered assistants still manage to answer current, factual questions?
That’s where RAG (Retrieval Augmented Generation) comes in.
RAG bridges the gap between static model knowledge and live, factual data, letting LLMs retrieve relevant information and then generate accurate responses based on it.

Why RAG Matters
LLMs can hallucinate, which means they sometimes invent facts.
For companies that need reliable answers from internal data, that’s unacceptable.
RAG solves this by grounding the model’s responses in retrieved content.
It’s the best of both worlds: factual accuracy and generative fluency.
For example:
- “What’s our 2024 refund policy?” → pulled from your internal docs.
- “Which product feature caused the most support tickets?” → retrieved from your logs.
The RAG Workflow
Here’s what a simple RAG pipeline looks like:
- Extract text from sources (PDF, DOCX, HTML, etc.)
- Chunk the text into smaller parts
- Embed those chunks as numerical vectors
- Store embeddings in a vector database (for example, FAISS or Pinecone)
- Search the store using a query embedding
- Rerank or ground the results using an LLM
Let’s walk through each step.
1. Extracting Text
Embedding models and LLMs work with plain text (and ultimately numbers), not raw file formats.
So the first step is extracting text from your documents.
Libraries like LangChain, BeautifulSoup, pypandoc, or LlamaIndex make this easy.
For this demo, let’s use the Wikipedia API.
import requests
def fetch_wikipedia_page(title):
    title = requests.utils.quote(title)
    url = f"https://en.wikipedia.org/w/api.php?action=query&prop=extracts&explaintext=true&titles={title}&format=json"
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    data = response.json()
    text = ""
    for page_id in data["query"]["pages"]:
        text += "\n" + data["query"]["pages"][page_id]["extract"]
    return text

Example:
text = fetch_wikipedia_page("OpenAI")
print(text[:500])
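If your sources are local HTML files instead of Wikipedia, a minimal sketch with BeautifulSoup might look like the following (the helper name and file path are illustrative, not part of the original demo):

from bs4 import BeautifulSoup

def extract_html_text(path):
    # Parse the HTML and return only its visible text content.
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    return soup.get_text(separator="\n", strip=True)

# text = extract_html_text("docs/refund_policy.html")  # hypothetical file path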
2. Chunking the Text
Documents can be huge, far larger than an embedding model can handle.
Chunking splits long text into smaller, semantically meaningful parts.
Common strategies:
- Fixed-length chunking (by token or character count)
- Paragraph chunking
- Semantic chunking (based on meaning boundaries)
👉 Read more: Weaviate’s Context Engineering Guide
Example (fixed-length chunking by word count, with overlap):
def chunk_text(text, chunk_size=500, overlap=50):
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks
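If your documents have natural paragraph breaks, a paragraph-based variant is just as easy. Here is a rough sketch, assuming blank lines separate paragraphs (the helper name and minimum size are illustrative):

def chunk_by_paragraph(text, min_chars=200):
    # Split on blank lines and merge short paragraphs until each chunk
    # has a reasonable amount of text.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        current = f"{current}\n\n{para}".strip()
        if len(current) >= min_chars:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks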
3. Creating Embeddings
Each chunk is converted into an embedding, a vector that represents meaning.
You can use open-source models like SentenceTransformers or APIs like Cohere.
from sentence_transformers import SentenceTransformer
EMBEDDING_MODEL = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
def get_embeddings(texts):
    model = SentenceTransformer(EMBEDDING_MODEL)
    return model.encode(texts, convert_to_numpy=True).tolist()

Example:
embeddings = get_embeddings(["OpenAI develops AI models.", "FAISS enables similarity search."])
print(len(embeddings), len(embeddings[0]))
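To see that these vectors really do capture meaning, you can compare a few of them directly. A quick sketch, assuming the get_embeddings helper above (the sentences are made up, and this particular model is trained for dot-product similarity):

import numpy as np

vecs = np.array(get_embeddings([
    "OpenAI develops AI models.",
    "OpenAI builds large language models.",
    "FAISS enables similarity search.",
]))

# Related sentences should score higher than unrelated ones.
print(np.dot(vecs[0], vecs[1]))  # related pair -> higher score
print(np.dot(vecs[0], vecs[2]))  # unrelated pair -> lower score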
4. Storing Embeddings in a Vector Database
Embeddings are numeric arrays. To search efficiently, store them in a vector database.
Options include FAISS, Pinecone, Weaviate, or Milvus.
Here’s a minimal FAISS example:
import faiss
import numpy as np
def create_faiss_index(embeddings, index_name):
    dim = len(embeddings[0])
    index = faiss.IndexFlatL2(dim)
    index.add(np.array(embeddings).astype("float32"))
    faiss.write_index(index, index_name)
    return index

def load_index(index_name):
    return faiss.read_index(index_name)
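Usage might look like this, assuming the embeddings from the previous step (the index file name is just a local path of your choosing):

index = create_faiss_index(embeddings, "wiki_index.faiss")
print(index.ntotal)  # number of vectors stored in the index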
5. Searching the Index
Now embed the user’s query and find similar chunks.
import faiss
from sentence_transformers import SentenceTransformer
EMBEDDING_MODEL = "sentence-transformers/multi-qa-mpnet-base-dot-v1"
def search(query, index_name, top_k=5):
    index = faiss.read_index(index_name)
    model = SentenceTransformer(EMBEDDING_MODEL)
    query_emb = model.encode([query], convert_to_numpy=True).astype("float32")
    distances, indices = index.search(query_emb, top_k)
    return [(int(indices[0][i]), float(distances[0][i])) for i in range(top_k)]

Example:
results = search("Who founded OpenAI?", "wiki_index.faiss")
print(results)

Each result is an index into your chunk list plus an L2 distance, so keep the chunks in memory (or in a store) to map hits back to their text.

6. Reranking and Response Grounding
Even relevant-looking results can be off.
We can improve accuracy by reranking or by having the LLM ground its answer in the retrieved context.
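One common reranking approach is a cross-encoder that scores each query-chunk pair jointly. A rough sketch, assuming sentence-transformers and a public cross-encoder checkpoint (neither is part of the original pipeline):

from sentence_transformers import CrossEncoder

def rerank(query, candidate_chunks, top_k=3):
    # Cross-encoders read the query and chunk together, so they are slower
    # but more precise than the bi-encoder used for retrieval.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, chunk) for chunk in candidate_chunks])
    ranked = sorted(zip(candidate_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

For grounding, we pass the retrieved chunks to an LLM along with the question: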
from openai import OpenAI
OPENAI_MODEL = "gpt-4o-mini"
OPENAI_API_KEY = "your_api_key"
prompt = """
You are a helpful assistant. Answer the question based on the context below.
Question: {question}
Context:
{context}
Provide a concise, accurate answer.
"""
client = OpenAI(api_key=OPENAI_API_KEY)
def answer_question(question, context):
    formatted = prompt.format(question=question, context=context)
    response = client.responses.create(input=formatted, model=OPENAI_MODEL)
    return response.output_text

Putting It All Together
You now have a simple but complete RAG pipeline:
- Fetch text
- Chunk it
- Generate embeddings
- Store them in FAISS
- Search by similarity
- Use an LLM to generate grounded answers
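Wired together, an end-to-end run might look like this sketch, using the functions defined above (the page title, index file name, and question are just the examples from earlier):

# Hypothetical end-to-end run of the pipeline defined in this post.
text = fetch_wikipedia_page("OpenAI")
chunks = chunk_text(text)
embeddings = get_embeddings(chunks)
create_faiss_index(embeddings, "wiki_index.faiss")

question = "Who founded OpenAI?"
results = search(question, "wiki_index.faiss")

# Map the returned indices back to chunk text and use it as grounding context.
context = "\n\n".join(chunks[i] for i, _ in results)
print(answer_question(question, context))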
Full code here → github.com/iam-bk/rag
Closing Thoughts
RAG is one of the most practical ways to make LLMs truly useful in the real world.
By combining retrieval and generation, you can build systems that are:
- Factually accurate
- Context-aware
- Up to date, as current as the data you index
Whether you are building an internal chatbot, a knowledge assistant, or a custom AI search engine, RAG is the foundation for reliable and intelligent AI applications.