RAG Pipelines in Practice: What the Tutorials Don't Tell You
Every RAG tutorial follows the same script: chunk your documents, embed them, store them in a vector database, retrieve the top-k results, and pass them to an LLM. It works in the demo. It breaks in production.
Here is what I learned building RAG pipelines for real South African business documents — and what the architecture landscape looks like in early 2026.
The Chunking Problem
Most tutorials use fixed-size chunking — 500 tokens with a 50-token overlap. This works for blog posts. It fails for legal documents, tender specifications, and financial reports. These documents have structure. Headings, clauses, tables, appendices. Fixed-size chunking destroys that structure.
What worked: Semantic chunking based on document structure. Parse headings and subheadings. Keep clauses intact. Tables get their own chunks with the column headers preserved. The chunk boundaries follow the document's logic, not an arbitrary token count.
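A minimal sketch of structure-aware chunking in Python. The heading pattern and the sample document are illustrative — real tender documents need a pattern tuned to their conventions:

```python
import re

# Matches numbered headings like "1. Payment Terms" or "2.3 Penalties".
# Adjust this pattern to your documents' conventions.
HEADING = re.compile(r"^(\d+(?:\.\d+)*\.?\s+.+)$", re.MULTILINE)

def chunk_by_structure(text):
    """Split at heading boundaries; each chunk keeps its section title."""
    chunks = []
    matches = list(HEADING.finditer(text))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[m.end():end].strip()
        if body:
            chunks.append({"section": m.group(1).strip(), "text": body})
    return chunks

doc = """1. Payment Terms
Payment is due within 30 days of invoice.
2. Penalties
Late delivery incurs a 2% weekly penalty."""

chunks = chunk_by_structure(doc)
```

Each chunk carries its section heading as metadata, which the retrieval layer can use later for section-based boosting.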
The industry is converging on this. Newer chunking strategies in 2026 increasingly use LLMs themselves to determine optimal chunk boundaries, a technique called agentic chunking. The model reads the document, identifies semantic boundaries, and produces chunks that preserve meaning. It is slower and more expensive than regex-based splitting, but the retrieval quality improvement is substantial.
The Embedding Choice
OpenAI's text-embedding-3-small is the default recommendation. It is good enough for English documents. It struggles with South African business terminology, Afrikaans legal terms, and the code-switching that happens in local business communication.
What worked: Fine-tuning was not practical for our scale. Instead, we pre-process documents to normalise terminology before embedding. A lightweight translation layer maps local terms to their standard English equivalents. The embeddings are better because the input is cleaner.
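The normalisation layer can be as simple as a dictionary of substitutions. The term map below is a made-up sample for illustration, not our production dictionary:

```python
import re

# Illustrative sample only: map local/Afrikaans terms to standard English.
TERM_MAP = {
    "vennootskap": "partnership",  # Afrikaans legal term
    "BTW": "VAT",                  # Belasting op Toegevoegde Waarde
}

def normalise(text):
    """Replace local terminology with standard English before embedding.

    Word boundaries prevent partial-word replacements.
    """
    for local, standard in TERM_MAP.items():
        text = re.sub(rf"\b{re.escape(local)}\b", standard, text)
    return text

result = normalise("Die vennootskap must register for BTW.")
```

The point is that the embedding model sees cleaner, more consistent input; the mapping itself is domain knowledge that no off-the-shelf model supplies.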
For multilingual RAG in 2026, the emerging pattern is to embed in the original language but query with cross-lingual models. Cohere's Embed v3 and multilingual-e5-large handle this natively. The alternative — translating everything to English before embedding — loses nuance that matters in legal and commercial contexts.
The Retrieval Problem Nobody Warns You About
Top-k retrieval assumes that the most semantically similar chunks are the most relevant. This is often wrong. A user asking about payment terms needs the payment clause — but the most similar chunk might be a general overview paragraph that mentions payments in passing.
What worked: Hybrid retrieval. Combine semantic search with keyword matching (BM25). Use the document structure metadata to boost chunks from relevant sections. If the user asks about "payment terms," chunks from sections titled "Payment" or "Commercial Terms" get a relevance boost regardless of their embedding similarity.
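The fusion step can be sketched as a weighted blend plus a section-title boost. The weights, scores, and chunk data below are illustrative; in practice the semantic and BM25 scores come from your vector index and keyword index:

```python
def hybrid_score(chunk, semantic, bm25, query_terms,
                 w_sem=0.6, w_kw=0.4, section_boost=0.2):
    """Blend semantic and keyword scores; boost chunks whose section
    title matches the query. Weights are illustrative -- tune per corpus."""
    score = w_sem * semantic + w_kw * bm25
    title = chunk["section"].lower()
    if any(term in title for term in query_terms):
        score += section_boost
    return score

chunks = [
    {"id": "overview", "section": "Introduction"},
    {"id": "pay",      "section": "Payment and Commercial Terms"},
]
# Pretend scores from a vector index and BM25, normalised to 0-1.
sem = {"overview": 0.82, "pay": 0.74}
kw  = {"overview": 0.10, "pay": 0.65}

query_terms = ["payment", "terms"]
ranked = sorted(
    chunks,
    key=lambda c: hybrid_score(c, sem[c["id"]], kw[c["id"]], query_terms),
    reverse=True,
)
```

Here the payment chunk wins despite a lower raw embedding similarity, which is exactly the behaviour top-k semantic search alone gets wrong.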
This is now considered baseline. The 2026 RAG stack looks like: hybrid retrieval (vector plus keyword), a reranker (Cohere Rerank or a cross-encoder) to sort the merged results, and then the LLM sees only the top reranked passages. The reranking step alone can improve answer quality by twenty to thirty per cent over raw top-k retrieval.
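The reranking stage reduces to: score every merged candidate against the query, sort, truncate. A sketch with a stand-in scorer; in production `score_fn` would wrap Cohere Rerank or a cross-encoder:

```python
def rerank(query, passages, score_fn, top_n=3):
    """Sort merged hybrid-retrieval results by a reranker score and keep
    only the best few for the LLM context window."""
    scored = [(score_fn(query, p), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:top_n]]

# Stand-in scorer: query-token overlap. Swap in a real cross-encoder.
def overlap_score(query, passage):
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q)

top = rerank(
    "payment terms",
    ["The weather is fine today",
     "Payment terms are net 30",
     "Delivery schedule for Q2"],
    overlap_score,
    top_n=1,
)
```

The design point: the reranker sees full query-passage pairs, so it can judge relevance far more precisely than a single-vector similarity, at the cost of scoring each candidate individually.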
Beyond Vanilla RAG: What Changed in 2025-2026
Agentic RAG is the most significant architectural shift. Instead of a single retrieve-then-generate pipeline, an agent decides what to retrieve, evaluates the results, and iterates. If the first retrieval is insufficient, the agent reformulates the query, searches a different index, or decomposes the question into sub-questions. LangGraph and CrewAI have made this pattern accessible. The trade-off is latency — agentic RAG adds seconds to response time, but the answer quality for complex queries is dramatically better.
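The retrieve-evaluate-iterate loop can be sketched generically. The four callables are placeholders for your index search, a relevance judge, a query rewriter, and the final LLM call; the toy index below exists only to make the example runnable:

```python
def agentic_answer(question, retrieve, is_sufficient, reformulate,
                   generate, max_hops=3):
    """Iterate: retrieve, judge the results, and reformulate the query
    until the retrieved context is sufficient or the hop budget runs out."""
    query = question
    for _ in range(max_hops):
        passages = retrieve(query)
        if is_sufficient(question, passages):
            return generate(question, passages)
        query = reformulate(question, passages)  # try a new angle
    # Hop budget spent: answer with whatever we have.
    return generate(question, passages)

# Toy stand-ins so the loop is demonstrable.
index = {"payment terms": ["Clause 4: payment due in 30 days"]}
retrieve = lambda q: index.get(q, [])
is_sufficient = lambda q, ps: bool(ps)
reformulate = lambda q, ps: "payment terms"
generate = lambda q, ps: ps[0] if ps else "not found"

answer = agentic_answer("when do we get paid?", retrieve,
                        is_sufficient, reformulate, generate)
```

The first retrieval misses, the agent reformulates, and the second hop finds the clause — the same shape LangGraph or CrewAI pipelines implement with real models behind each callable.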
GraphRAG — pioneered by Microsoft Research — structures documents as knowledge graphs instead of flat chunk stores. Entities and relationships are extracted, and retrieval traverses the graph. For our South African business documents, this is relevant because tender specifications reference multiple parties, conditions, and cross-referenced clauses. A graph captures these relationships in a way that flat chunks cannot.
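A toy version of graph-based retrieval, with a hand-built graph standing in for the entity and relationship extraction that GraphRAG performs at index time:

```python
# Toy knowledge graph for a tender: nodes are entities or clauses,
# edges are typed relationships extracted at indexing time.
graph = {
    "Tender 42": [("issued_by", "Dept of Transport"),
                  ("has_clause", "Clause 7")],
    "Clause 7":  [("references", "Clause 12"),
                  ("binds", "Supplier A")],
    "Clause 12": [("defines", "penalty schedule")],
}

def traverse(start, depth=2):
    """Breadth-first walk collecting facts within `depth` hops of start."""
    facts, frontier, seen = [], [(start, 0)], {start}
    while frontier:
        node, d = frontier.pop(0)
        if d == depth:
            continue
        for rel, target in graph.get(node, []):
            facts.append((node, rel, target))
            if target not in seen:
                seen.add(target)
                frontier.append((target, d + 1))
    return facts

facts = traverse("Tender 42")
```

Starting from the tender, the traversal surfaces the cross-referenced Clause 12 — a connection a flat chunk store would miss unless both clauses happened to land in the same chunk.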
Corrective RAG (CRAG) adds a self-evaluation step: after retrieval, a lightweight model scores each chunk for relevance. Irrelevant chunks are discarded before the generation step. This reduces hallucination from low-quality retrievals — one of the most common production failure modes.
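The filtering step itself is small. The overlap-based judge below is a stand-in; CRAG proper uses a lightweight fine-tuned evaluator model, and the threshold is something to tune:

```python
def corrective_filter(question, chunks, score_fn, threshold=0.4):
    """Score each retrieved chunk and drop those below the threshold so
    low-relevance retrievals never reach the generator."""
    return [c for c in chunks if score_fn(question, c) >= threshold]

# Stand-in relevance judge: fraction of question words in the chunk.
def judge(question, chunk):
    q = set(question.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

chunks = [
    "Payment is due within 30 days of the invoice date.",
    "The company picnic is scheduled for March.",
]
kept = corrective_filter("when is payment due", chunks, judge)
```

If everything is filtered out, CRAG falls back to re-retrieval or an external search rather than letting the LLM generate from nothing.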
The South African Context
Building AI products in South Africa means working with constraints that Silicon Valley tutorials do not address. Our documents mix English, Afrikaans, and occasionally Zulu or Sotho. Government tender documents follow formatting conventions that no open-source parser handles well. Internet latency means that every round-trip to an embedding API matters.
The local AI ecosystem is growing. South African startups and research groups are increasingly contributing to open-source LLM tooling, and awareness of RAG patterns is spreading through communities in Cape Town and Johannesburg. But the production deployments are still few. Most organisations are stuck in proof-of-concept mode.
The Lesson
RAG is not a technology problem. It is a domain understanding problem. The pipeline is only as good as your understanding of the documents it processes and the questions users will ask. The architectures are getting more sophisticated — agentic retrieval, graph-based indexing, self-correcting pipelines — but the fundamentals have not changed: understand your data, understand your users, and build the retrieval strategy around both.