1) Architecture at a glance
[Docs/PDFs/Markdown]
        │ ingest + chunk
        ▼
[Embeddings Model] ──► [Vector Index (FAISS/PgVector)]
                                  ▲
                                  │ top-k by query embedding
User Question ──embed──► [Retriever] ──► [Prompt Builder] ──► [LLM]
                                                                │
                                                          Answer + Sources
2) Minimal stack (why these)
- Chunking: 500–800 tokens with 50–100 overlap → balances recall & context.
- Embeddings: sentence-transformers (local) or OpenAI text-embedding-3-large (managed).
- Index: FAISS for single-machine; switch to pgvector for multi-service.
- LLM: Any function-calling or plain chat model; keep it swappable.
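Because the whole point is swappability, I find it helps to pin these choices in one place. A minimal sketch; the field names and defaults are my own and simply mirror the list above:

# config.py -- a sketch; names and defaults are assumptions mirroring the choices above
from dataclasses import dataclass

@dataclass
class RagConfig:
    chunk_size: int = 800          # ~tokens per chunk (500-800 works well)
    chunk_overlap: int = 120       # overlap between neighbouring chunks
    embedding_model: str = "sentence-transformers/all-MiniLM-L6-v2"
    index_backend: str = "faiss"   # swap to "pgvector" when you go multi-service
    llm_model: str = "gpt-4o-mini" # any chat model; keep it behind one call site
    top_k: int = 5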
3) The tiniest working example (Python)
Works on a laptop. Replace the model names/keys as you like.
Install
pip install sentence-transformers faiss-cpu pypdf tiktoken openai
3.1 Ingest & chunk
# ingest.py
import os, glob, re
from pypdf import PdfReader

def read_txt(path):
    return open(path, "r", encoding="utf-8", errors="ignore").read()

def read_pdf(path):
    r = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in r.pages)

def load_corpus(folder="docs"):
    texts = []
    for f in glob.glob(os.path.join(folder, "**/*"), recursive=True):
        if os.path.isdir(f):
            continue
        if f.lower().endswith((".md", ".txt", ".markdown")):
            texts.append((f, read_txt(f)))
        elif f.lower().endswith(".pdf"):
            texts.append((f, read_pdf(f)))
    return texts

def chunk(text, size=800, overlap=120):
    # rough token-ish splitter on sentences
    sents = re.split(r'(?<=[.?!])\s+', text)
    chunks, cur = [], []
    cur_len = 0
    for s in sents:
        cur.append(s)
        cur_len += len(s.split())
        if cur_len >= size:
            chunks.append(" ".join(cur))
            # overlap: carry the tail of this chunk into the next one
            back = " ".join(" ".join(cur).split()[-overlap:])
            cur = [back]
            cur_len = len(back.split())
    if cur:
        chunks.append(" ".join(cur))
    return chunks
3.2 Build embeddings & FAISS index
# index.py
import os
import pickle

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

from ingest import load_corpus, chunk

EMB_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # small & fast

def build_index(folder="docs", out_dir="rag_store"):
    model = SentenceTransformer(EMB_NAME)
    records = []   # [(doc_id, chunk_text, metadata)]
    vectors = []   # list of embeddings
    for path, text in load_corpus(folder):
        for i, ch in enumerate(chunk(text)):
            emb = model.encode(ch, normalize_embeddings=True)
            vectors.append(emb)
            records.append((f"{path}#chunk{i}", ch, {"source": path, "chunk": i}))
    dim = len(vectors[0])
    index = faiss.IndexFlatIP(dim)  # cosine with normalized vectors ~ inner product
    mat = np.vstack(vectors).astype("float32")
    index.add(mat)
    os.makedirs(out_dir, exist_ok=True)
    faiss.write_index(index, f"{out_dir}/index.faiss")
    with open(f"{out_dir}/records.pkl", "wb") as f:
        pickle.dump(records, f)

if __name__ == "__main__":
    build_index()
3.3 Query → retrieve → augment → generate
# query.py
import faiss, pickle, numpy as np, os
from sentence_transformers import SentenceTransformer
from openai import OpenAI

EMB_NAME = "sentence-transformers/all-MiniLM-L6-v2"
STORE = "rag_store"
K = 5

def retrieve(question):
    model = SentenceTransformer(EMB_NAME)
    q = model.encode(question, normalize_embeddings=True).astype("float32").reshape(1, -1)
    index = faiss.read_index(f"{STORE}/index.faiss")
    with open(f"{STORE}/records.pkl", "rb") as f:
        records = pickle.load(f)
    D, I = index.search(q, K)
    hits = [records[i] for i in I[0]]
    return hits  # [(id, text, meta), ...]

def build_prompt(question, hits):
    context = "\n\n".join(
        [f"[{i+1}] Source: {h[2]['source']}\n{h[1][:1200]}" for i, h in enumerate(hits)]
    )
    return f"""You are a precise assistant. Use ONLY the context to answer.
If the answer isn't in the context, say "I don't know" and suggest where to look.
Cite sources like [1], [2] that map to the snippets below.

Question: {question}

Context:
{context}
"""

def answer(question):
    hits = retrieve(question)
    prompt = build_prompt(question, hits)
    # Choose your LLM. Here: OpenAI as an example; swap with a local server if needed.
    client = OpenAI()  # requires OPENAI_API_KEY
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(answer("What are the refund steps for premium auctions?"))
What this gives you
- Local embedding + FAISS speed.
- Pluggable LLM.
- Answers that only use retrieved chunks and cite sources.
4) Measuring quality
- Retrieval hit-rate: % of test questions where the gold answer appears in the top-k chunks (see the sketch after this section).
- Answer accuracy: exact-match / semantic similarity (e.g., BLEURT/BERTScore) against gold answers.
- Hallucination rate: manual spot checks + “I don’t know” rate (should go up slightly when you tighten guardrails).
- Latency/cost: p50/p95 query time, tokens per answer.
Create a small eval set (20–50 Q/A pairs) from your docs. Run it after any change (new chunking, new embedding model).
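A minimal sketch of the retrieval hit-rate check against such an eval set, assuming the qa.jsonl layout from the repo structure in section 10 (fields q, a, ids):

# evals/hit_rate.py -- a sketch; assumes the qa.jsonl fields shown in section 10
import json
from query import retrieve, K

def hit_rate(path="evals/qa.jsonl"):
    rows = [json.loads(line) for line in open(path, encoding="utf-8")]
    hits = 0
    for row in rows:
        retrieved_ids = {h[0] for h in retrieve(row["q"])}   # ids look like "path#chunkN"
        if retrieved_ids & set(row["ids"]):                  # any gold chunk in the top-k counts
            hits += 1
    return hits / len(rows)

if __name__ == "__main__":
    print(f"top-{K} hit rate: {hit_rate():.0%}")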
5) Guardrails I actually used
- Strict prompt: “Use ONLY the context; else say I don’t know.”
- Chunk citations: attach file & chunk IDs; show them in UI.
- Post-validation: regex/JSON checks for structured answers (IDs, amounts, dates); a sketch follows this list.
- Source diversity: prefer hits from different files to avoid redundancy.
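To illustrate the post-validation step, a small sketch; the patterns are placeholders for whatever structure your answers are supposed to contain:

# validate.py -- a sketch of post-validation; patterns are illustrative, not the real rules
import re

CITATION = re.compile(r"\[\d+\]")                 # [1], [2], ...
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # 2024-05-01
AMOUNT   = re.compile(r"\$\d+(?:\.\d{2})?")       # $19.99

def validate_answer(text: str, expects_date=False, expects_amount=False) -> list[str]:
    problems = []
    if "I don't know" not in text and not CITATION.search(text):
        problems.append("missing citation")
    if expects_date and not ISO_DATE.search(text):
        problems.append("missing date")
    if expects_amount and not AMOUNT.search(text):
        problems.append("missing amount")
    return problems   # empty list means the answer passes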
6) Common pitfalls (and fixes)
- Over-chunking (too small): loses semantics → go back up to 500–800 tokens with ~80 overlap.
- Wrong embeddings: domain jargon suffers → try larger models (e5-large, text-embedding-3-large) for critical domains.
- PDF extraction mess: run a cleanup step (collapse hyphens, fix Unicode), or pre-convert to Markdown with a good parser; a cleanup sketch follows this list.
- Query drift: add a quick classifier: “Is this answerable from internal docs?” → if not, escalate or say “I don’t know.”
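For the PDF cleanup step, something in this direction works; the exact rules are assumptions about typical extraction noise:

# cleanup.py -- a sketch of the cleanup pass; rules are assumptions about typical PDF noise
import re
import unicodedata

def clean_pdf_text(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)   # normalize ligatures and odd Unicode forms
    text = re.sub(r"-\n(?=\w)", "", text)        # re-join words hyphenated across line breaks
    text = re.sub(r"[ \t]+", " ", text)          # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)       # collapse runs of blank lines
    return text.strip()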
7) From laptop to production
Option A: Keep FAISS, wrap as a service
- Tiny FastAPI server with /search and /answer (sketch below).
- Nightly cron to re-ingest.
- Cache by normalized question → reuse the same top-k for 24h.
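A minimal sketch of that service; the route shapes and the in-process cache are my own choices (swap the dict for Redis with a TTL if you want the 24h expiry):

# server.py -- a sketch of Option A; route names and the in-process cache are assumptions
from fastapi import FastAPI
from pydantic import BaseModel

from query import answer, retrieve   # reuses the functions from query.py

app = FastAPI()
_cache: dict[str, str] = {}          # normalized question -> answer; swap for Redis + 24h TTL

class Ask(BaseModel):
    question: str

@app.post("/search")
def search(req: Ask):
    hits = retrieve(req.question)
    return [{"id": h[0], "source": h[2]["source"], "chunk": h[2]["chunk"]} for h in hits]

@app.post("/answer")
def answer_endpoint(req: Ask):
    key = " ".join(req.question.lower().split())   # cheap normalization for the cache key
    if key not in _cache:
        _cache[key] = answer(req.question)
    return {"answer": _cache[key]}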
Option B: Move to Postgres + pgvector
- Pros: transactions, backups, horizontal scaling.
- Schema example:
CREATE TABLE rag_chunks (
  id bigserial PRIMARY KEY,
  source text,
  chunk_index int,
  content text,
  embedding vector(1536)   -- match your embedding dim
);

-- Vector search
SELECT id, source, chunk_index, content
FROM rag_chunks
ORDER BY embedding <=> $1   -- <=> is cosine distance in pgvector (<#> is negative inner product)
LIMIT 5;
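Querying that table from Python might look like the sketch below; it assumes psycopg 3 plus the pgvector helper package, and a connection string of your own:

# a sketch of the Python side, assuming psycopg 3 and the pgvector package
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

def search_pg(question_embedding: np.ndarray, k: int = 5):
    with psycopg.connect("dbname=rag") as conn:      # your connection string here
        register_vector(conn)                        # lets psycopg pass numpy arrays as vectors
        rows = conn.execute(
            """SELECT id, source, chunk_index, content
               FROM rag_chunks
               ORDER BY embedding <=> %s
               LIMIT %s""",
            (question_embedding, k),
        ).fetchall()
    return rows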
Option C: Orchestrate with Temporal (reliable pipelines)
- Workflow: ingest → chunk → embed (batched) → upsert index → smoke test → publish.
- Activity retries, idempotent upserts, metrics on every step.
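A sketch of what that pipeline looks like with Temporal's Python SDK (temporalio); the activity bodies are stubs and the timeouts are placeholders:

# pipeline.py -- a sketch using the temporalio SDK; activity bodies and timeouts are placeholders
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def ingest_and_embed(folder: str) -> None:
    # chunk + embed + upsert into the index; must be idempotent so retries are safe
    ...

@activity.defn
async def smoke_test() -> None:
    # run a handful of known questions and assert the expected sources come back
    ...

@workflow.defn
class IngestPipeline:
    @workflow.run
    async def run(self, folder: str) -> None:
        retry = RetryPolicy(maximum_attempts=3)
        await workflow.execute_activity(
            ingest_and_embed, folder,
            start_to_close_timeout=timedelta(minutes=30), retry_policy=retry,
        )
        await workflow.execute_activity(
            smoke_test,
            start_to_close_timeout=timedelta(minutes=5), retry_policy=retry,
        )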
8) Prompt I ship with (copy/paste)
System:
You answer only from the supplied CONTEXT. If the answer is missing, reply:
"I don't know based on the provided documents."
Always include citations like [1], [2] mapping to sources below. Be concise and exact.

User:
Question: {{question}}

CONTEXT SNIPPETS:
{{#each snippets}}
[{{@index+1}}] Source: {{this.source}}
{{this.text}}
{{/each}}
9) What I’d add next (nice upgrades)
- Hybrid retrieval: BM25 (keyword) + vectors → better on numbers/code.
- Reranking: a small cross-encoder re-ranks the top-50 down to top-5 (big relevance win); sketch after this list.
- Multi-tenant: per-team namespaces, per-doc ACLs.
- Inline quotes: highlight matched spans in each chunk.
- Evals dashboard: store runs in SQLite/Parquet and chart trends.
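For the reranking item, a sketch with sentence-transformers' CrossEncoder; the model name is one common public checkpoint, and you would retrieve a larger K before keeping the best few:

# rerank.py -- a sketch of cross-encoder reranking; the model name is one common choice
from sentence_transformers import CrossEncoder

def rerank(question, hits, keep=5):
    # hits are (id, text, meta) tuples from retrieve(); call retrieve() with a larger K (e.g. 50) first
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(question, h[1]) for h in hits])
    order = sorted(range(len(hits)), key=lambda i: scores[i], reverse=True)
    return [hits[i] for i in order[:keep]]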
10) Repo structure (starter)
rag/
  docs/              # your source files
  rag_store/         # generated index + records
  ingest.py
  index.py
  query.py
  evals/
    qa.jsonl         # [{"q":"..","a":"..","ids":[...]}]
  server.py          # optional FastAPI
  requirements.txt
Final thought
RAG works best when it’s boringly deterministic around a very flexible LLM. Keep the moving parts few, measure retrieval first, and make “I don’t know” an acceptable, logged outcome. Ship tiny, improve weekly.