{"id":3439,"date":"2025-09-26T08:32:29","date_gmt":"2025-09-26T05:32:29","guid":{"rendered":"https:\/\/blog.wasalt.sa\/en\/?p=3439"},"modified":"2025-10-02T13:00:38","modified_gmt":"2025-10-02T10:00:38","slug":"build-a-minimal-rag-pipeline-python-faiss-pgvector","status":"publish","type":"post","link":"https:\/\/blog.wasalt.sa\/en\/build-a-minimal-rag-pipeline-python-faiss-pgvector\/","title":{"rendered":"Build a Minimal RAG Pipeline from Scratch (Python, FAISS, pgvector)"},"content":{"rendered":"<h3 id=\"ember65\" class=\"ember-view reader-text-block__heading-3\">1) Architecture at a glance<\/h3>\n<pre class=\"brush: plain; title: ; notranslate\" title=\"\">\r\n&#x5B;Docs\/PDFs\/Markdown] \r\n      \u2502 ingest + chunk\r\n      \u25bc\r\n&#x5B;Embeddings Model]  \u2500\u2500\u25ba  &#x5B;Vector Index (FAISS\/PgVector)]\r\n                             \u25b2\r\n                             \u2502 top-k by query embedding\r\nUser Question \u2500\u2500embed\u2500\u2500\u25ba &#x5B;Retriever] \u2500\u2500\u25ba &#x5B;Prompt Builder] \u2500\u2500\u25ba &#x5B;LLM]\r\n                                               \u2502\r\n                                           Answer + Sources\r\n<\/pre>\n<h3 id=\"ember66\" class=\"ember-view reader-text-block__heading-3\">2) Minimal stack (why these)<\/h3>\n<ul>\n<li><strong>Chunking:<\/strong> 500\u2013800 tokens with 50\u2013100 overlap \u2192 balances recall &amp; context.<\/li>\n<li><strong>Embeddings:<\/strong> sentence-transformers (local) or OpenAI text-embedding-3-large (managed).<\/li>\n<li><strong>Index:<\/strong> <strong>FAISS<\/strong> for single-machine; switch to <strong>pgvector<\/strong> for multi-service.<\/li>\n<li><strong>LLM:<\/strong> Any function-calling or plain chat model; keep it swappable.<\/li>\n<\/ul>\n<h3 id=\"ember68\" class=\"ember-view reader-text-block__heading-3\">3) The tiniest working example (Python)<\/h3>\n<blockquote id=\"ember69\" class=\"ember-view reader-text-block__blockquote\"><p>Works on a laptop. 
<h3>3) The tiniest working example (Python)</h3>
<blockquote><p>Works on a laptop. Replace the model names/keys as you like.</p></blockquote>
<h3>Install</h3>
<pre class="brush: bash">
pip install sentence-transformers faiss-cpu pypdf tiktoken openai
</pre>
<h3>3.1 Ingest &amp; chunk</h3>
<pre class="brush: python">
# ingest.py
import os, glob, re
from pypdf import PdfReader

def read_txt(path):
    with open(path, "r", encoding="utf-8", errors="ignore") as f:
        return f.read()

def read_pdf(path):
    r = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in r.pages)

def load_corpus(folder="docs"):
    texts = []
    for f in glob.glob(os.path.join(folder, "**/*"), recursive=True):
        if os.path.isdir(f):
            continue
        if f.lower().endswith((".md", ".txt", ".markdown")):
            texts.append((f, read_txt(f)))
        elif f.lower().endswith(".pdf"):
            texts.append((f, read_pdf(f)))
    return texts

def chunk(text, size=800, overlap=120):
    # rough token-ish splitter on sentence boundaries (word counts stand in for tokens)
    sents = re.split(r'(?&lt;=[.?!])\s+', text)
    chunks, cur, cur_len = [], [], 0
    for s in sents:
        cur.append(s)
        cur_len += len(s.split())
        if cur_len &gt;= size:
            chunks.append(" ".join(cur))
            # keep the last `overlap` words as the start of the next chunk
            back = " ".join(" ".join(cur).split()[-overlap:])
            cur = [back]
            cur_len = len(back.split())
    if cur:
        chunks.append(" ".join(cur))
    return chunks
</pre>
<h3>3.2 Build embeddings &amp; FAISS index</h3>
<pre class="brush: python">
# index.py
import os, pickle
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from ingest import load_corpus, chunk

EMB_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # small &amp; fast

def build_index(folder="docs", out_dir="rag_store"):
    model = SentenceTransformer(EMB_NAME)
    records = []   # [(doc_id, chunk_text, metadata)]
    vectors = []   # list of embeddings

    for path, text in load_corpus(folder):
        for i, ch in enumerate(chunk(text)):
            emb = model.encode(ch, normalize_embeddings=True)
            vectors.append(emb)
            records.append((f"{path}#chunk{i}", ch, {"source": path, "chunk": i}))

    dim = len(vectors[0])
    index = faiss.IndexFlatIP(dim)  # cosine with normalized vectors ~ inner product
    mat = np.vstack(vectors).astype("float32")
    index.add(mat)

    os.makedirs(out_dir, exist_ok=True)
    faiss.write_index(index, f"{out_dir}/index.faiss")
    with open(f"{out_dir}/records.pkl", "wb") as f:
        pickle.dump(records, f)

if __name__ == "__main__":
    build_index()
</pre>
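<p>One practical note: build_index() embeds one chunk at a time, and that loop is the slow part of the build. model.encode also accepts a list of texts, so a batched variant (a sketch under that assumption; the function name and batch size below are mine, tune them to your hardware) usually cuts build time substantially without changing the output:</p>
<pre class="brush: python">
# index_batched.py — sketch of a batched variant of build_index(); same inputs/outputs as index.py
import os, pickle
import faiss
from sentence_transformers import SentenceTransformer
from ingest import load_corpus, chunk

EMB_NAME = "sentence-transformers/all-MiniLM-L6-v2"

def build_index_batched(folder="docs", out_dir="rag_store", batch_size=64):
    model = SentenceTransformer(EMB_NAME)
    texts, records = [], []
    for path, text in load_corpus(folder):
        for i, ch in enumerate(chunk(text)):
            texts.append(ch)
            records.append((f"{path}#chunk{i}", ch, {"source": path, "chunk": i}))

    # one call, many texts: the model batches internally, which is the big speed win
    mat = model.encode(texts, batch_size=batch_size,
                       normalize_embeddings=True, show_progress_bar=True).astype("float32")

    index = faiss.IndexFlatIP(mat.shape[1])
    index.add(mat)

    os.makedirs(out_dir, exist_ok=True)
    faiss.write_index(index, f"{out_dir}/index.faiss")
    with open(f"{out_dir}/records.pkl", "wb") as f:
        pickle.dump(records, f)
</pre>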
<h3>3.3 Query → retrieve → augment → generate</h3>
<pre class="brush: python">
# query.py
import os, pickle
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
from openai import OpenAI

EMB_NAME = "sentence-transformers/all-MiniLM-L6-v2"
STORE = "rag_store"
K = 5

def retrieve(question):
    model = SentenceTransformer(EMB_NAME)
    q = model.encode(question, normalize_embeddings=True).astype("float32").reshape(1, -1)
    index = faiss.read_index(f"{STORE}/index.faiss")
    with open(f"{STORE}/records.pkl", "rb") as f:
        records = pickle.load(f)
    D, I = index.search(q, K)
    hits = [records[i] for i in I[0]]
    return hits  # [(id, text, meta), ...]

def build_prompt(question, hits):
    context = "\n\n".join(
        [f"[{i+1}] Source: {h[2]['source']}\n{h[1][:1200]}" for i, h in enumerate(hits)]
    )
    return f"""You are a precise assistant. Use ONLY the context to answer.
If the answer isn't in the context, say "I don't know" and suggest where to look.
Cite sources like [1], [2] that map to the snippets below.

Question: {question}

Context:
{context}
"""

def answer(question):
    hits = retrieve(question)
    prompt = build_prompt(question, hits)

    # Choose your LLM. Here: OpenAI as an example; swap in a local server if needed.
    client = OpenAI()  # requires OPENAI_API_KEY
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    print(answer("What are the refund steps for premium auctions?"))
</pre>
<p><strong>What this gives you</strong></p>
<ul>
<li>Local embedding + FAISS speed.</li>
<li>Pluggable LLM.</li>
<li>Answers that <strong>only</strong> use retrieved chunks and <strong>cite sources</strong>.</li>
</ul>
<h3>4) Measuring quality</h3>
<ul>
<li><strong>Retrieval hit-rate:</strong> % of test questions where the gold answer appears in the top-k chunks.</li>
<li><strong>Answer accuracy:</strong> exact match / semantic similarity (e.g., BLEURT/BERTScore) against gold answers.</li>
<li><strong>Hallucination rate:</strong> manual spot checks + the "I don't know" rate (it should go <strong>up</strong> slightly when you tighten guardrails).</li>
<li><strong>Latency/cost:</strong> p50/p95 query time, tokens per answer.</li>
</ul>
<p>Create a small <strong>eval set</strong> (20–50 Q/A pairs) from your docs and run it after any change (new chunking, new embedding model); a minimal hit-rate script is sketched below.</p>
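<p>Here is a minimal sketch of that hit-rate check. It assumes the evals/qa.jsonl format from the repo layout in section 10, where each line carries a question and the ids of the chunks that should come back; the file name and field names are placeholders, adjust them to whatever you actually store:</p>
<pre class="brush: python">
# evals/hit_rate.py — sketch: fraction of questions whose gold chunk ids appear in the top-k hits
import json
from query import retrieve  # reuses the retriever from query.py

def hit_rate(path="evals/qa.jsonl"):
    total, hit = 0, 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)           # {"q": "...", "a": "...", "ids": ["docs/x.md#chunk3", ...]}
            gold = set(ex["ids"])
            got = {h[0] for h in retrieve(ex["q"])}   # retrieve() returns [(id, text, meta), ...]
            total += 1
            hit += bool(gold &amp; got)
    return hit / max(total, 1)

if __name__ == "__main__":
    print(f"retrieval hit-rate: {hit_rate():.2%}")
</pre>
<p>Log the number per run (SQLite or even a CSV is enough) so a chunking or embedding change that quietly hurts retrieval shows up immediately.</p>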
<h3>5) Guardrails I actually used</h3>
<ul>
<li><strong>Strict prompt:</strong> "Use ONLY the context; otherwise say I don't know."</li>
<li><strong>Chunk citations:</strong> attach file &amp; chunk IDs; show them in the UI.</li>
<li><strong>Post-validation:</strong> regex/JSON checks for structured answers (IDs, amounts, dates).</li>
<li><strong>Source diversity:</strong> prefer hits from <strong>different files</strong> to avoid redundancy.</li>
</ul>
<h3>6) Common pitfalls (and fixes)</h3>
<ul>
<li><strong>Over-chunking (too small):</strong> loses semantics → go back up to 500–800 tokens with ~80 overlap.</li>
<li><strong>Wrong embeddings:</strong> domain jargon suffers → try larger models (e5-large, text-embedding-3-large) for critical domains.</li>
<li><strong>PDF extraction mess:</strong> run a cleanup step (collapse hyphens, fix Unicode), or pre-convert to Markdown with a good parser.</li>
<li><strong>Query drift:</strong> add a quick classifier ("Is this answerable from internal docs?") → if not, escalate or say "I don't know."</li>
</ul>
<h3>7) From laptop to production</h3>
<p><strong>Option A: Keep FAISS, wrap it as a service</strong></p>
<ul>
<li>Tiny FastAPI server with /search and /answer (sketched after this section).</li>
<li>Nightly cron to re-ingest.</li>
<li>Cache by normalized question → reuse the same top-k for 24h.</li>
</ul>
<p><strong>Option B: Move to Postgres + pgvector</strong></p>
<ul>
<li>Pros: transactions, backups, and one store shared by multiple services.</li>
<li>Schema example:</li>
</ul>
<pre class="brush: sql">
CREATE TABLE rag_chunks (
  id bigserial PRIMARY KEY,
  source text,
  chunk_index int,
  content text,
  embedding vector(384)  -- match your embedding dim (384 for all-MiniLM-L6-v2, 1536/3072 for OpenAI models)
);
-- Vector search
SELECT id, source, chunk_index, content
FROM rag_chunks
ORDER BY embedding &lt;#&gt; $1  -- negative inner product; ranks like cosine when vectors are normalized
LIMIT 5;
</pre>
<p><strong>Option C: Orchestrate with Temporal (reliable pipelines)</strong></p>
<ul>
<li>Workflow: ingest → chunk → embed (batched) → upsert index → smoke test → publish.</li>
<li>Activity retries, idempotent upserts, metrics on every step.</li>
</ul>
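<p>For Option A, the whole service can stay this small. The sketch below reuses retrieve() and answer() from query.py; the route names match the bullets above, but the payload shape and file name (server.py) are just placeholders:</p>
<pre class="brush: python">
# server.py — minimal FastAPI wrapper around the FAISS pipeline (sketch, not hardened)
from fastapi import FastAPI
from pydantic import BaseModel
from query import retrieve, answer

app = FastAPI(title="mini-rag")

class Ask(BaseModel):
    question: str

@app.post("/search")
def search(req: Ask):
    hits = retrieve(req.question)
    return [{"id": h[0], "source": h[2]["source"], "preview": h[1][:300]} for h in hits]

@app.post("/answer")
def ask(req: Ask):
    return {"answer": answer(req.question)}

# run with: uvicorn server:app --reload
</pre>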
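<p>And for Option B, querying the same schema from Python looks roughly like this. It assumes the pgvector Python package's psycopg2 adapter; the DSN, table name, and function name are assumptions that should match whatever you created with the schema above:</p>
<pre class="brush: python">
# pg_search.py — sketch: top-k search against rag_chunks (assumes `pip install pgvector psycopg2-binary`)
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def pg_search(question, k=5, dsn="dbname=rag user=rag"):
    q = model.encode(question, normalize_embeddings=True)
    conn = psycopg2.connect(dsn)
    register_vector(conn)  # lets numpy arrays bind to vector columns
    with conn.cursor() as cur:
        cur.execute(
            """SELECT id, source, chunk_index, content
               FROM rag_chunks
               ORDER BY embedding &lt;#&gt; %s
               LIMIT %s""",
            (np.asarray(q), k),
        )
        rows = cur.fetchall()
    conn.close()
    return rows
</pre>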
<h3>8) Prompt I ship with (copy/paste)</h3>
<pre class="brush: plain">
System:
You answer only from the supplied CONTEXT.
If the answer is missing, reply: "I don't know based on the provided documents."
Always include citations like [1], [2] mapping to the sources below.
Be concise and exact.

User:
Question: {{question}}

CONTEXT SNIPPETS:
{{#each snippets}}
[{{@index+1}}] Source: {{this.source}}
{{this.text}}
{{/each}}
</pre>
<h3>9) What I'd add next (nice upgrades)</h3>
<ul>
<li><strong>Hybrid retrieval:</strong> BM25 (keyword) + vectors → better on numbers/code.</li>
<li><strong>Reranking:</strong> a small cross-encoder re-ranks the top-50 down to the top-5 (big relevance win).</li>
<li><strong>Multi-tenant:</strong> per-team namespaces, per-doc ACLs.</li>
<li><strong>Inline quotes:</strong> highlight matched spans in each chunk.</li>
<li><strong>Evals dashboard:</strong> store runs in SQLite/Parquet and chart trends.</li>
</ul>
<h3>10) Repo structure (starter)</h3>
<pre class="brush: plain">
rag/
  docs/                   # your source files
  rag_store/              # generated index + records
  ingest.py
  index.py
  query.py
  evals/
    qa.jsonl              # [{"q":"..","a":"..","ids":[...]}]
  server.py               # optional FastAPI
  requirements.txt
</pre>
<h3>Final thought</h3>
<p>RAG works best when it's <strong>boringly deterministic</strong> around a very flexible LLM. Keep the moving parts few, measure retrieval first, and make "I don't know" an acceptable, logged outcome. Ship tiny, improve weekly.</p>