
Have you ever asked a chatbot a question and received an answer that seemed to come out of left field?
Maybe the machine cited a policy from the wrong company, combined unrelated pieces of information into a bizarrely confident but incorrect response, or straight-up started rickrolling your customers.
If so, you've experienced an AI hallucination — false information presented with unwarranted certainty.
This AI bait-and-switch is frustrating, but here's what might surprise you: often the problem isn't the AI model itself, but how the underlying information was processed and stored.
This is where Retrieval Augmented Generation (RAG) systems enter the picture. RAG systems promise to ground AI responses in reliable source material rather than letting the model generate answers purely from its training data.
Think of it as giving your AI access to a carefully curated library of verified information.
When a customer asks about shipping times, for instance, a RAG-enabled system doesn't just make an educated guess: it actively searches its knowledge base for relevant shipping policy documents and uses that specific information to construct its response.
Trouble is, RAG systems are only as good as their underlying data preparation. Part of the recent AI revolution comes from improved handling of unstructured data — PDFs, Word documents, PowerPoint presentations, images and videos.
But "improved" doesn't mean "perfect," and "handling" means transforming diverse formats into information nuggets that Large Language Models (LLMs) can effectively digest.
Preprocessing Makes the System Work
Of course, you can't just dump raw documents into a RAG and expect magic.
Think of preprocessing like preparing ingredients before cooking a multi-course meal. A chef doesn't just throw whole vegetables and raw meat into a pot; they carefully wash, cut and portion everything first.
This is the mise en place phase of AI building: getting your information in order.
Raw documents rarely arrive in an ideal format. They might contain irrelevant boilerplate text, inconsistent formatting, mixed content types, complex cross-references and implicit context that humans take for granted.
Consider a technical manual that includes both standard operating procedures and emergency protocols. Without proper preprocessing, a RAG system might struggle to distinguish between routine maintenance steps and critical safety procedures — a potentially dangerous confusion.
Modern preprocessing begins with thorough document cleaning. This initial cleaning ensures all documents follow consistent formatting patterns, making them easier for the system to process and compare.
But cleaning is just the beginning.
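To make this concrete, here is a minimal cleaning sketch. The rules (stripping page-number lines, collapsing blank runs) are illustrative assumptions; a real pipeline would tailor them to its own document set.

```python
import re

def clean_document(text: str) -> str:
    """Normalize a raw document before chunking (illustrative rules only)."""
    # Normalize Windows/Mac line endings to '\n'
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop common boilerplate lines such as page numbers ("Page 3 of 12")
    text = re.sub(r"(?m)^\s*Page \d+ of \d+\s*$", "", text)
    # Remove trailing spaces and collapse runs of blank lines
    text = re.sub(r"[ \t]+\n", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

raw = "Safety brief\r\n\r\n\r\nPage 3 of 12\r\nDo not exceed 75°C.   \r\n"
print(clean_document(raw))
```

Even rules this simple pay off downstream: consistent line endings and whitespace mean identical passages from different sources produce identical chunks.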
The Art of Text Chunking
One of the most crucial — and trickiest — aspects of preprocessing is deciding how to divide documents into chunks. Split your content into pieces that are too small, and you lose crucial context. Make them too large, and your retrieval becomes inefficient and expensive.
Consider two approaches to chunking a safety procedure:
Simple character-based chunking:
Chunk 1: "The temperature should never exceed 75°C. If the pressure"
Chunk 2: "reaches 2 bar, immediately shut down the system."
Semantic chunking:
Chunk: "The temperature should never exceed 75°C. If the pressure reaches 2 bar, immediately shut down the system."
The semantic approach keeps related information together, making it more likely that the system will retrieve complete, meaningful instructions. This becomes especially important when dealing with technical documentation, where breaking apart related information could lead to dangerous misunderstandings.
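The two strategies above can be sketched in a few lines. This is a deliberately crude illustration: real semantic chunkers typically use embeddings or discourse structure rather than the simple sentence-boundary heuristic assumed here.

```python
def chunk_by_characters(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking: splits mid-sentence, blind to meaning."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def chunk_by_sentences(text: str, max_chars: int) -> list[str]:
    """Greedy sentence-level chunking: a crude stand-in for semantic chunking."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk only when adding this sentence would overflow
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = (current + " " + sentence).strip()
    if current:
        chunks.append(current)
    return chunks

procedure = ("The temperature should never exceed 75°C. "
             "If the pressure reaches 2 bar, immediately shut down the system.")
print(chunk_by_characters(procedure, 60))   # splits mid-instruction
print(chunk_by_sentences(procedure, 120))   # keeps both sentences together
```

Note how the character-based version severs the condition from its consequence, exactly the failure mode described above.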
Modern systems go even further by adapting their chunking strategy based on the type of content they're processing.
API documentation might keep method signatures together with their examples, while legal documents preserve complete clauses and definitions. Technical manuals maintain procedure steps together, and narrative content preserves paragraph and section coherence.
Enriching Content with Metadata
Beyond cleaning and chunking, effective preprocessing adds rich contextual metadata to each piece of information.
This isn't just about tagging content—it's about creating a web of relationships that helps the system understand how different pieces of information connect. Well-structured metadata transforms a simple text retrieval system into an intelligent knowledge navigator.
Consider how humans understand documents: we pick up on cues about document type, importance, relationships to other documents and when information needs updating.
A technical manual section about emergency shutdown procedures feels different from a routine maintenance checklist, even if both mention similar equipment.
Good metadata captures these nuances that humans intuitively understand.
Here's what metadata might look like for a tech doc chunk:
{
  "text": "The temperature should never exceed 75°C...",
  "metadata": {
    "document_title": "Pressure vessel - safety briefs",
    "section": "Running system",
    "sub_section": "Enclosure checks",
    "document_type": "safety_procedure",
    "equipment_type": "pressure_vessel",
    "criticality": "high",
    "last_updated": "2024-01-15",
    "related_procedures": ["emergency_shutdown", "pressure_monitoring"],
    "prerequisites": ["system_startup_check"],
    "audience": ["operators", "maintenance_staff"],
    "certification_required": ["pressure_vessel_operator"],
    "regulatory_standards": ["ISO_17025", "ASME_BPVC"],
    "key_concepts": ["temperature_control", "pressure_safety", "emergency_response"],
    "verification_status": {
      "last_verified": "2024-01-10",
      "verified_by": "senior_engineer",
      "next_review": "2024-07-10"
    }
  }
}
The Future: GraphRAG and Beyond
One particularly exciting development in RAG preprocessing is GraphRAG, which represents a significant evolution in how we structure and utilize information.
Unlike traditional approaches that treat documents as collections of independent chunks, GraphRAG creates a rich, interconnected knowledge structure.
It's the difference between having a stack of notecards versus a mind map — while both contain the same information, the mind map captures the relationships and hierarchies that make the information more meaningful and accessible.
Here's how a simple graph might represent related procedures:
graph TD
    A[Pressure Vessel] --> B[Operating Procedures]
    A --> C[Safety Protocols]
    B --> D[Startup Sequence]
    B --> E[Shutdown Sequence]
    C --> F[Emergency Procedures]
    C --> G[Safety Limits]
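In code, even a plain adjacency list captures the essential idea: from any topic you can walk outward to everything connected to it. This is a minimal sketch; production GraphRAG systems use dedicated graph stores and typed, weighted edges.

```python
# Knowledge graph as a simple adjacency list (hypothetical example data).
graph = {
    "Pressure Vessel": ["Operating Procedures", "Safety Protocols"],
    "Operating Procedures": ["Startup Sequence", "Shutdown Sequence"],
    "Safety Protocols": ["Emergency Procedures", "Safety Limits"],
}

def related_topics(graph, start):
    """Breadth-first walk collecting every node reachable from `start`."""
    seen, queue = [], [start]
    while queue:
        node = queue.pop(0)
        if node in seen:
            continue
        seen.append(node)
        queue.extend(graph.get(node, []))
    return seen[1:]  # everything connected to the starting topic

print(related_topics(graph, "Pressure Vessel"))
```

Given a question about the pressure vessel, a GraphRAG retriever can pull in safety limits and emergency procedures even when the query never mentions them, because the graph records that they belong together.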
As we look to the future, preprocessing will become increasingly sophisticated, handling multiple types of data and automatically generating rich metadata.
We'll see systems that can understand and preserve complex relationships between different pieces of information, making our AI assistants more knowledgeable and reliable.
Building Your Preprocessing Pipeline
When building your own preprocessing pipeline, start by analyzing your document set to understand its characteristics: what types of content you're dealing with, how documents are typically structured, and what critical information must be preserved.
Monitor your preprocessing quality through metrics like chunk coherence scores and retrieval accuracy on test queries.
Build your pipeline in stages, starting with basic cleaning and working up to more sophisticated processing. Test thoroughly at each stage, and be prepared to adjust your approach based on real-world performance.
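A staged pipeline can start as small as this. The clean/chunk/enrich helpers here are placeholder assumptions, meant only to show the shape of the composition; each stage would grow into the techniques discussed above.

```python
# Placeholder stages; replace each with your real cleaning, chunking
# and metadata-enrichment logic as the pipeline matures.
def clean(doc):
    return doc.strip()

def chunk(doc):
    return [p for p in doc.split("\n\n") if p]

def enrich(piece, source):
    return {"text": piece, "metadata": {"source": source}}

def preprocess(doc, source):
    """Run the stages in order: clean, then chunk, then attach metadata."""
    return [enrich(piece, source) for piece in chunk(clean(doc))]

records = preprocess("  Intro paragraph.\n\nSafety limits: 75°C max.  ", "manual.pdf")
print(records)
```

Because each stage is a separate function, you can test and upgrade them independently, swapping the naive paragraph splitter for a semantic chunker without touching the rest of the pipeline.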
Remember: the quality of your RAG system's responses will never exceed the quality of its preprocessed knowledge base.
Toward Human Understanding
As our understanding of language structure improves and our tools become more sophisticated, we're moving toward preprocessing systems that can increasingly mirror human-like understanding of context and relationships in text.
Whether through GraphRAG or other emerging approaches, the goal remains the same: to transform raw information into knowledge that AI systems can meaningfully access and utilize.
The key takeaway? It's all about digestibility, for humans as well as machines.
Preprocessing isn't just about cleaning and organizing data — it's about creating structures that enable AI systems to understand and use information more like humans do.
By investing time and effort in sophisticated preprocessing pipelines, we can build AI systems that provide more accurate, contextual and reliable responses to our questions.