Building a Production-Ready RAG Pipeline with Supacrawler and pgvector
Retrieval-Augmented Generation (RAG) is a powerful technique that enhances Large Language Models (LLMs) by providing them with up-to-date, external knowledge. Instead of relying solely on the model's training data, a RAG system retrieves relevant documents from a knowledge base to help generate more accurate and context-aware answers.
Building a robust RAG pipeline starts with high-quality, structured content. This guide provides a complete, end-to-end walkthrough of how to:
- Crawl an entire website to build a knowledge base using Supacrawler's Crawl API.
- Chunk the crawled content into effective segments for retrieval.
- Embed the chunks into vectors.
- Store and Query those vectors in a PostgreSQL database using the `pgvector` extension.
The Anatomy of a RAG Pipeline
Before diving into code, it's helpful to understand the flow of data. A user's query initiates a process where the system retrieves relevant information from your database, combines it with the original query, and feeds it to an LLM to generate a final, context-enriched answer.
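At query time, that loop looks roughly like the sketch below. It is a minimal illustration only: `search_chunks` is a hypothetical stand-in for whichever vector-store query you wire up in Step 4, and the chat model name is just an example.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer(question: str) -> str:
    # 1. Retrieve: fetch the most relevant chunks for the question
    chunks = search_chunks(question, limit=3)  # hypothetical retriever from Step 4

    # 2. Augment: combine the retrieved context with the original query
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"

    # 3. Generate: let the LLM answer using the supplied context
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # example model
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content
```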

This guide focuses on the critical data ingestion part of this flow: getting from a live website to a populated vector database. We include three storage options so you can pick your preferred Python stack:
- Supabase Vecs
- LangChain with PGVector
- LlamaIndex with PGVector
Prerequisites: If you’re using the SDKs (recommended), first see Install the SDKs.
Step 1: Set Up Your Vector Database
First, ensure the `pgvector` extension is enabled in your PostgreSQL database.
- Supabase: Navigate to Database → Extensions and enable `pgvector`. See: pgvector extension.
- Self-hosted Postgres: Connect to your database and run:
```sql
create extension if not exists vector;
```
For production environments, it is crucial to create an index (like HNSW or IVFFlat) on your vector column for efficient similarity searches.
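As a concrete sketch, assuming a hypothetical `documents` table with an `embedding vector(1536)` column (adjust the names to whatever your chosen store creates), an HNSW index for cosine similarity can be added from Python like this:

```python
import os
import psycopg  # psycopg 3

# Note: vecs, LangChain, and LlamaIndex manage their own tables and offer their
# own index helpers (e.g. vecs' collection.create_index()); this is the raw SQL route.
with psycopg.connect(os.environ["DATABASE_URL"]) as conn:
    conn.execute(
        "CREATE INDEX IF NOT EXISTS documents_embedding_hnsw "
        "ON documents USING hnsw (embedding vector_cosine_ops);"
    )
```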
Step 2: Crawl a Website with the Crawl API
The foundation of any good RAG system is a comprehensive, clean knowledge base. We’ll crawl a documentation site to create ours, using URL patterns to keep the scope focused.
Create a Crawl Job
```bash
curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.supacrawler.com",
    "format": "markdown",
    "depth": 3,
    "link_limit": 100,
    "include_patterns": ["/docs/*", "/api/*"],
    "render_js": true
  }'
```
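If you prefer to stay in Python, the same request can be made over plain HTTP; a minimal sketch using `requests` (the env var name is just an example, and you will still need to poll the job until it reports `completed`, either via the SDK or the jobs endpoint in the API reference):

```python
import os
import requests

resp = requests.post(
    "https://api.supacrawler.com/api/v1/crawl",
    headers={"Authorization": f"Bearer {os.environ['SUPACRAWLER_API_KEY']}"},  # example env var
    json={
        "url": "https://docs.supacrawler.com",
        "format": "markdown",
        "depth": 3,
        "link_limit": 100,
        "include_patterns": ["/docs/*", "/api/*"],
        "render_js": True,
    },
)
resp.raise_for_status()
print(resp.json())  # includes the job you will poll until completion
```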
When the job `status` is `completed`, the result will contain `data.crawl_data`, a map of each URL to its clean markdown content and metadata.
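The code in Step 4 only relies on the `markdown` field and the `title` inside `metadata`, so conceptually `crawl_data` looks something like this (illustrative values only):

```python
crawl_data = {
    "https://docs.supacrawler.com/docs/getting-started": {
        "markdown": "# Getting Started\n\n...",
        "metadata": {"title": "Getting Started"},
    },
    "https://docs.supacrawler.com/api/crawl": {
        "markdown": "# Crawl API\n\n...",
        "metadata": {"title": "Crawl API"},
    },
}
```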
Step 3: Choose Your Chunking Strategy
This is one of the most critical steps for RAG performance. Chunking is the process of breaking down large documents into smaller, semantically meaningful pieces. If chunks are too large, they can introduce noise; if they're too small, they may lack sufficient context.
Here are a few common strategies:
- RecursiveCharacterTextSplitter (Recommended Start): This method, popular in frameworks like LangChain, recursively splits text by a prioritized list of separators (e.g., `\n\n`, `\n`, then spaces), keeping paragraphs and sentences intact wherever possible.
- Token-Based Splitting: This approach splits text based on the token count that the LLM's embedding model uses. It's more precise but requires a tokenizer for your specific model.
- Semantic Chunking: More advanced techniques use NLP libraries or embedding models to split text based on semantic shifts in meaning, creating the most contextually relevant chunks.
For this guide, we'll use `RecursiveCharacterTextSplitter`, as it provides a great balance of simplicity and effectiveness.
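Before wiring the splitter into the pipeline, it can help to see how it behaves on its own; a quick sketch:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# Any long markdown or text will do; here we fabricate a few paragraphs.
sample = "\n\n".join(f"Paragraph {i}: " + "lorem ipsum dolor sit amet. " * 30 for i in range(8))

chunks = splitter.split_text(sample)
print(f"{len(chunks)} chunks")
print([len(c) for c in chunks])  # each chunk stays under ~1,000 characters
print(chunks[0][:120])           # splits prefer paragraph and sentence boundaries
```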
Step 4: Embed and Store Vectors (Choose One Option)
Now, we'll process the crawled data. For each page, we'll chunk its markdown, create a vector embedding for each chunk, and store it in our database.
Option A: Supabase Vecs (Python)
```python
import os
import vecs
from openai import OpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Assumes 'result' is the completed crawl job from Step 2
crawl_data = result.data.get('crawl_data', {})

# 1. Initialize Clients
DB_URL = os.environ['DATABASE_URL']
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
vx = vecs.create_client(DB_URL)
col = vx.get_or_create_collection(name='documents', dimension=1536)  # `text-embedding-3-small` uses 1536 dimensions
openai_client = OpenAI(api_key=OPENAI_API_KEY)

# 2. Chunk Documents
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
all_chunks = []
for url, page in crawl_data.items():
    content = page.get('markdown', '')
    if not content:
        continue
    chunks = splitter.split_text(content)
    for i, chunk_text in enumerate(chunks):
        all_chunks.append({
            "id": f"{url}#{i}",
            "text": chunk_text,
            "metadata": {
                'url': url,
                'title': page.get('metadata', {}).get('title', ''),
            },
        })

# 3. Embed and Upsert in Batches
records_to_upsert = []
for chunk in all_chunks:
    emb = openai_client.embeddings.create(model='text-embedding-3-small', input=chunk["text"])
    vector = emb.data[0].embedding
    records_to_upsert.append((chunk["id"], vector, chunk["metadata"]))

if records_to_upsert:
    col.upsert(records=records_to_upsert)
    print(f'Upserted {len(records_to_upsert)} chunks')

# 4. Query for Similar Chunks
col.create_index()
query_text = "What are the API endpoints?"
q_emb = openai_client.embeddings.create(model='text-embedding-3-small', input=query_text)
q_vector = q_emb.data[0].embedding
matches = col.query(data=q_vector, limit=3, include_metadata=True)
for match in matches:
    print(match)
```
Option B: LangChain PGVector (Python)
```python
import os
from sqlalchemy import create_engine
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings

# Assumes 'result' is the completed crawl job from Step 2
crawl_data = result.data.get('crawl_data', {})

# 1. Prepare LangChain Documents
docs = []
for url, page in crawl_data.items():
    content = page.get('markdown', '')
    if content:
        docs.append(Document(
            page_content=content,
            metadata={
                'url': url,
                'title': page.get('metadata', {}).get('title', ''),
            },
        ))

# 2. Chunk Documents
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)

# 3. Embed and Store
# Note: langchain_postgres expects the psycopg 3 driver, i.e. a URL of the form
# postgresql+psycopg://user:password@host:5432/dbname
DATABASE_URL = os.environ['DATABASE_URL']
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
embeddings = OpenAIEmbeddings(model='text-embedding-3-small', openai_api_key=OPENAI_API_KEY)
engine = create_engine(DATABASE_URL)
store = PGVector(connection=engine, collection_name='lc_documents', embeddings=embeddings, use_jsonb=True)
store.add_documents(chunks)
print(f'Added {len(chunks)} chunks to the store.')

# 4. Query
results = store.similarity_search('What are the possible auth methods?', k=3)
for doc in results:
    print(f"- {doc.metadata.get('title')}: {doc.metadata.get('url')}")
```
Option C: LlamaIndex + PGVector (Python)
```python
import os
from sqlalchemy import make_url
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.postgres import PGVectorStore
from llama_index.core import Document, VectorStoreIndex, StorageContext

# Assumes 'result' is the completed crawl job from Step 2
crawl_data = result.data.get('crawl_data', {})

# 1. Prepare LlamaIndex Documents
docs = []
for url, page in crawl_data.items():
    content = page.get('markdown', '')
    if content:
        docs.append(Document(
            text=content,
            metadata={
                'url': url,
                'title': page.get('metadata', {}).get('title', ''),
            },
        ))

# 2. Initialize Storage and Embedding Model
DB_URL = os.environ['DATABASE_URL']
OPENAI_API_KEY = os.environ['OPENAI_API_KEY']
embed_model = OpenAIEmbedding(model='text-embedding-3-small', api_key=OPENAI_API_KEY)
parsed = make_url(DB_URL)  # PGVectorStore takes the connection details as separate params
store = PGVectorStore.from_params(
    database=parsed.database,
    host=parsed.host,
    port=parsed.port,
    user=parsed.username,
    password=parsed.password,
    table_name='li_documents',
    embed_dim=1536,
)
ctx = StorageContext.from_defaults(vector_store=store)

# 3. Build the Index (this chunks, embeds, and stores)
index = VectorStoreIndex.from_documents(docs, storage_context=ctx, embed_model=embed_model)
print("Index built and stored successfully.")

# 4. Query
query_engine = index.as_query_engine()
response = query_engine.query('What is this page about?')
print(response)
```
Step 5: Evaluate and Iterate
Building a RAG system is an iterative process. Once your pipeline is running, evaluate its performance by asking a set of test questions (a quick retrieval check is sketched after this list). If the answers are not accurate, consider:
- Refining the Crawl Scope: Are your `include_patterns` too broad or too narrow?
- Adjusting Chunking Strategy: Experiment with different chunk sizes and overlaps.
- Improving Retrieval: You may need to add more specific metadata to your chunks to help the retriever find the best possible context.
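One lightweight way to run those test questions is to script them against your store. Here is a minimal sketch against the vecs collection from Option A (it reuses `col` and `openai_client` from that snippet; adapt the query call if you chose LangChain or LlamaIndex):

```python
test_questions = [
    "What are the API endpoints?",
    "How do I authenticate requests?",
    "What formats does the Crawl API return?",
]

for question in test_questions:
    emb = openai_client.embeddings.create(model="text-embedding-3-small", input=question)
    matches = col.query(data=emb.data[0].embedding, limit=3, include_metadata=True)
    print(f"\nQ: {question}")
    for match in matches:
        print("  ", match)  # record id plus metadata (url, title) for each hit
```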
Conclusion: A Powerful Foundation for AI
By combining Supacrawler's powerful crawling capabilities with the efficiency of `pgvector`, you can build a robust and scalable data ingestion pipeline for any RAG application. This process of Crawl, Chunk, Embed, and Store provides the foundation for creating intelligent AI agents that can reason about and answer questions on any web-based knowledge base.