Building Better RAG Systems: Extracting Clean Website Data

Large language models are trained on enormous amounts of public text, but they cannot access real-time information unless we provide it. This is why many modern AI systems rely on external data sources such as search engines, APIs, and internal company knowledge bases. Retrieval Augmented Generation (RAG) is one of the most practical ways to connect an LLM with current or domain-specific information.

RAG allows you to fetch relevant documents from an external source, index them, and at query time feed the most relevant results back into the LLM so that the final answer is grounded and accurate.

To build an effective AI agent, you need a few foundational components:

  1. Knowledge base
  2. Tools
  3. Context
  4. Memory

There are other pieces in a full agent architecture, but mastering these four gives you a strong starting point.

What is RAG?
RAG works by retrieving the right information at the right moment. The typical workflow looks like this:

  1. Retrieve documents from an external system.
  2. Chunk them if they are large.
  3. Convert the chunks into embeddings and store them in a vector database.
  4. When a query arrives, embed the query, search the vector database, and fetch relevant documents.
  5. Feed the retrieved context into the LLM to generate a grounded response.
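The steps above can be sketched end to end. The snippet below is a minimal illustration, not a production pipeline: it uses a toy bag-of-words "embedding" and an in-memory list in place of a real embedding model and vector database, and the sample documents are made up.

```python
# Toy sketch of the RAG retrieval loop: chunk, "embed", store, search.
# Real systems swap in an embedding model and a vector database.
from collections import Counter
import math

def embed(text):
    # Stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def chunk(text, size=50):
    # Split long documents into fixed-size word chunks.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

documents = ["Our product supports CSV export and JSON import.",
             "Pricing starts at 10 dollars per month."]

# Indexing: chunk each document and store (chunk, embedding) pairs.
index = [(c, embed(c)) for doc in documents for c in chunk(doc)]

# Query time: embed the query and rank stored chunks by similarity.
query = "pricing per month"
ranked = sorted(index, key=lambda p: cosine(embed(query), p[1]), reverse=True)
context = ranked[0][0]  # the top chunk would be passed to the LLM as context
print(context)
```

The ranking step is where retrieval quality is won or lost; in practice a learned embedding model captures semantic similarity far better than word overlap.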

Why websites matter in RAG
Almost every company has a website, and websites tend to contain a huge amount of useful information. Examples include product pages, documentation, pricing, FAQs, support content, and company details.

If you can retrieve and store this content, you can build a powerful RAG system for tasks such as customer support automation, internal search, product assistants, or competitor analysis.

How to retrieve pages from a website
There are several approaches to discovering and collecting website content:

  1. Crawling
  2. Using sitemaps
  3. Reading feeds such as RSS or Atom

My initial approach was to write custom crawling logic, traverse the site manually, fetch every reachable page, and extract text using BeautifulSoup. While this works, it raises several practical questions:

How do I convert HTML into clean text, Markdown, or structured formats?
How do I skip comments, scripts, and other unnecessary elements?
How deep should the crawler go?
How do I reliably discover the sitemap for a domain?
How do I extract feeds?
How do I avoid crawling thousands of unnecessary links?

This is where the library Trafilatura becomes extremely helpful.

Introducing Trafilatura
Trafilatura is a Python library designed for web scraping, text extraction, and site discovery. It dramatically simplifies the process of converting web pages into clean, machine-ready formats.

Key features include:

  1. Extracting readable content from HTML
  2. Converting pages into XML, JSON, Markdown, CSV, or plain text
  3. Discovering sitemaps automatically
  4. Detecting feeds such as RSS and Atom
  5. Crawling websites with control over depth, speed, and number of pages
  6. Skipping boilerplate, comments, ads, and unnecessary elements

Example: Extracting clean text

from trafilatura import fetch_url, extract

url = "https://example.com"
html = fetch_url(url)  # returns None if the download fails
text = extract(html)   # returns None if no main content is found
print(text)

Example: Converting a page to markdown

from trafilatura import fetch_url, extract

html = fetch_url("https://example.com")
markdown = extract(html, output_format="markdown")
print(markdown)

Example: Finding a sitemap

from trafilatura.sitemaps import sitemap_search

urls = sitemap_search("https://example.com")
print(urls)

Why Trafilatura is valuable for RAG
If you are building a RAG system, you need reliable content ingestion. Clean, consistent text extraction makes downstream chunking and embedding significantly easier.

Trafilatura gives you:

  1. A repeatable and maintainable pipeline for page extraction
  2. Automatic detection of structured site resources
  3. A way to avoid writing and maintaining your own custom scraper for every website
  4. Output formats that are ready for indexing in a vector database
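As one example of that handoff, here is a short illustrative helper (not part of Trafilatura) that splits extracted text into overlapping word-based chunks sized for an embedding model; the chunk size and overlap values are arbitrary defaults you would tune for your model.

```python
# Illustrative helper (not part of Trafilatura): split extracted text
# into overlapping chunks so no sentence is cut off at a hard boundary.
def chunk_text(text, chunk_size=200, overlap=40):
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

# 500 dummy words -> chunks of up to 200 words, overlapping by 40.
sample = " ".join(f"word{i}" for i in range(500))
chunks = chunk_text(sample)
print(len(chunks), "chunks")
```

The overlap means each chunk repeats the tail of the previous one, which helps retrieval when the answer to a query straddles a chunk boundary.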

If your goal is to build an enterprise or product-grade RAG system, adding Trafilatura to your toolkit is a smart choice.

You can find complete examples and code in this repository:
https://github.com/iam-bk/website-crawling