
Integrations: Build a simple RAG System with Supacrawler and Supabase pgvector using OpenAI embeddings

Retrieval-Augmented Generation (RAG) has revolutionized how we build AI applications by combining the power of large language models with real-time access to external knowledge bases. This comprehensive guide shows you how to build a production-ready RAG system using Supacrawler for web data extraction, Supabase pgvector for vector storage, and OpenAI embeddings for high-quality semantic search.

By the end of this tutorial, you'll have a complete RAG pipeline that can crawl any website, convert content into searchable vectors, and provide intelligent question-answering capabilities.

If you'd like to try it yourself, you can follow along in the Supabase Vectors notebook.

Table of Contents

  • Understanding the RAG Architecture
  • Setting Up Your Environment
  • Configuring Supabase pgvector
  • Web Crawling with Supacrawler
  • Content Processing and Chunking
  • Generating OpenAI Embeddings
  • Vector Storage in Supabase
  • Building the Query Engine
  • Complete RAG Implementation
  • Performance Optimization
  • Production Deployment
  • Scale Beyond Local Development with Supacrawler
  • Conclusion

Understanding the RAG Architecture

Our RAG system follows a four-stage pipeline that transforms web content into intelligent, searchable knowledge:

Stage                | Component         | Purpose                                             | Technology
---------------------|-------------------|-----------------------------------------------------|---------------------------------
Data Extraction      | Supacrawler       | Crawl and extract clean content from websites       | Supacrawler Crawl API
Content Processing   | Text Chunking     | Split content into meaningful, searchable segments  | Python text splitters
Embedding Generation | OpenAI API        | Convert text chunks into high-dimensional vectors   | OpenAI text-embedding-3-small
Vector Storage       | Supabase pgvector | Store and search vectors with metadata              | PostgreSQL + pgvector extension

This architecture provides several key advantages:

  • Scalable Data Ingestion: Supacrawler handles JavaScript rendering, rate limiting, and large-scale crawling
  • High-Quality Embeddings: OpenAI's embeddings provide superior semantic understanding
  • Production-Ready Storage: Supabase offers managed PostgreSQL with built-in vector operations
  • Real-Time Updates: Easy to refresh content and maintain current knowledge
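
Concretely, each stage above maps to a helper built later in this guide. Stitched together, the pipeline looks roughly like the sketch below (placeholder credentials; the class and function names are the ones defined in the following sections):

# High-level sketch of the four stages (each helper is implemented later in this post)
pages = crawl_documentation("https://supabase.com/docs")        # 1. Data extraction (Supacrawler)
chunks = DocumentChunker().process_crawled_content(pages)       # 2. Content processing
embedded = OpenAIEmbedder(api_key="...").embed_chunks(chunks)   # 3. Embedding generation (OpenAI)
store = SupabaseVectorStore(db_url="...")                       # 4. Vector storage (Supabase pgvector)
store.create_collection(dimension=1536)
store.upsert_chunks(embedded)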

Setting Up Your Environment

First, install the required dependencies and set up your development environment:

# Install core dependencies
pip install supacrawler vecs openai python-dotenv
# Install text processing libraries
pip install beautifulsoup4 markdownify

Create a .env file with your API credentials:

# .env
SUPACRAWLER_API_KEY=your_supacrawler_api_key
OPENAI_API_KEY=your_openai_api_key
SUPABASE_URL=your_supabase_project_url
SUPABASE_KEY=your_supabase_anon_key
DATABASE_URL=postgresql://postgres:[password]@db.[project].supabase.co:5432/postgres

Configuring Supabase pgvector

Before building our RAG system, we need to enable the pgvector extension in Supabase:

  1. Enable pgvector Extension:

    • Go to your Supabase dashboard
    • Navigate to Database → Extensions
    • Search for and enable pgvector
  2. Verify Installation:

import os
import vecs
from dotenv import load_dotenv
load_dotenv()
# Connect to Supabase
DB_URL = os.getenv('DATABASE_URL')
vx = vecs.create_client(DB_URL)
# Test connection
print("✅ Connected to Supabase successfully!")
  3. Create Vector Collection:
# Create collection for our RAG system
# OpenAI text-embedding-3-small uses 1536 dimensions
collection = vx.get_or_create_collection(
    name="knowledge_base",
    dimension=1536
)
print(f"📦 Collection created: {collection.name}")

Web Crawling with Supacrawler

Supacrawler excels at extracting clean, structured content from websites. Let's crawl a documentation site to build our knowledge base:

import os
from supacrawler import SupacrawlerClient
from dotenv import load_dotenv
load_dotenv()
def crawl_documentation(base_url: str, patterns: list = None) -> dict:
    """
    Crawl a documentation site and return clean content

    Args:
        base_url: The starting URL to crawl
        patterns: URL patterns to include (e.g., ['/docs/*', '/api/*'])

    Returns:
        Dictionary mapping URLs to their content and metadata
    """
    client = SupacrawlerClient(api_key=os.getenv('SUPACRAWLER_API_KEY'))

    # Configure crawl parameters for documentation
    crawl_config = {
        'url': base_url,
        'format': 'markdown',    # Get clean markdown content
        'depth': 3,              # Crawl up to 3 levels deep
        'link_limit': 200,       # Limit total pages crawled
        'render_js': True,       # Handle JavaScript-rendered content
        'include_patterns': patterns or ['/docs/*', '/api/*', '/guides/*'],
        'exclude_patterns': ['/blog/*', '/changelog/*'],  # Skip non-documentation
        'timeout': 30000,        # 30 second timeout per page
    }

    print(f"🚀 Starting crawl of {base_url}")

    # Create and wait for crawl job
    job = client.create_crawl_job(**crawl_config)
    result = client.wait_for_crawl(job.job_id)

    if result.status == 'completed':
        crawl_data = result.data.get('crawl_data', {})
        print(f"✅ Crawl completed! Found {len(crawl_data)} pages")
        return crawl_data
    else:
        raise Exception(f"Crawl failed with status: {result.status}")

# Example: Crawl Supabase documentation
crawled_content = crawl_documentation(
    'https://supabase.com/docs',
    patterns=['/docs/*']
)

# Display crawl results
for url, page_data in list(crawled_content.items())[:3]:
    content = page_data.get('markdown', '')
    title = page_data.get('metadata', {}).get('title', 'No title')
    print(f"\n📄 {title}")
    print(f"🔗 {url}")
    print(f"📝 Content length: {len(content)} characters")
    print(f"📋 Preview: {content[:200]}...")

Content Processing and Chunking

Effective chunking is crucial for RAG performance. We need to split large documents into meaningful, searchable segments:

import re
from typing import List, Dict
class DocumentChunker:
    """
    Intelligent document chunker optimized for RAG systems
    """
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def clean_content(self, content: str) -> str:
        """Clean and normalize content"""
        # Remove excessive whitespace
        content = re.sub(r'\n\s*\n\s*\n', '\n\n', content)
        # Remove markdown artifacts that don't add value
        content = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', content)  # Convert links to text
        content = re.sub(r'```[\s\S]*?```', '', content)            # Remove code blocks
        content = re.sub(r'`([^`]+)`', r'\1', content)              # Remove inline code formatting
        return content.strip()

    def split_by_headers(self, content: str) -> List[str]:
        """Split content by markdown headers for semantic boundaries"""
        # Split on headers (H1-H6)
        sections = re.split(r'\n(?=#{1,6}\s)', content)
        return [section.strip() for section in sections if section.strip()]

    def chunk_text(self, text: str, max_size: int = None) -> List[str]:
        """Split text into chunks with overlap"""
        max_size = max_size or self.chunk_size
        if len(text) <= max_size:
            return [text]
        chunks = []
        start = 0
        while start < len(text):
            # Find end position
            end = start + max_size
            if end >= len(text):
                chunks.append(text[start:])
                break
            # Try to end at a sentence boundary
            sentence_end = text.rfind('.', start, end)
            if sentence_end > start + max_size // 2:
                end = sentence_end + 1
            else:
                # Try to end at a word boundary
                word_end = text.rfind(' ', start, end)
                if word_end > start + max_size // 2:
                    end = word_end
            chunks.append(text[start:end])
            start = end - self.chunk_overlap
        return chunks

    def process_crawled_content(self, crawled_data: Dict) -> List[Dict]:
        """
        Process crawled content into chunks with metadata

        Returns:
            List of chunk dictionaries with text, metadata, and unique IDs
        """
        all_chunks = []
        for url, page_data in crawled_data.items():
            content = page_data.get('markdown', '')
            metadata = page_data.get('metadata', {})
            if not content:
                continue
            # Clean content
            cleaned_content = self.clean_content(content)
            # Split by headers first for semantic boundaries
            sections = self.split_by_headers(cleaned_content)
            chunk_index = 0
            for section in sections:
                # Further chunk large sections
                section_chunks = self.chunk_text(section)
                for chunk_text in section_chunks:
                    if len(chunk_text.strip()) < 100:  # Skip very small chunks
                        continue
                    chunk = {
                        'id': f"{url}#{chunk_index}",
                        'text': chunk_text.strip(),
                        'metadata': {
                            'url': url,
                            'title': metadata.get('title', ''),
                            'description': metadata.get('description', ''),
                            'chunk_index': chunk_index,
                            'source': 'supacrawler'
                        }
                    }
                    all_chunks.append(chunk)
                    chunk_index += 1
        print(f"📊 Created {len(all_chunks)} chunks from {len(crawled_data)} pages")
        return all_chunks

# Process our crawled content
chunker = DocumentChunker(chunk_size=800, chunk_overlap=150)
chunks = chunker.process_crawled_content(crawled_content)

# Display chunking results
print(f"\n📈 Chunking Statistics:")
print(f"Total chunks: {len(chunks)}")
print(f"Average chunk size: {sum(len(c['text']) for c in chunks) // len(chunks)} characters")

# Show sample chunks
print("\n📋 Sample chunks:")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i+1}:")
    print(f"ID: {chunk['id']}")
    print(f"Title: {chunk['metadata']['title']}")
    print(f"Text: {chunk['text'][:200]}...")

Generating OpenAI Embeddings

OpenAI's embeddings provide superior semantic understanding. Let's generate embeddings for our chunks:

import openai
from typing import List, Dict
import time
class OpenAIEmbedder:
    """
    Generate embeddings using OpenAI's text-embedding-3-small model
    """
    def __init__(self, api_key: str, model: str = "text-embedding-3-small"):
        self.client = openai.OpenAI(api_key=api_key)
        self.model = model
        self.dimension = 1536  # text-embedding-3-small dimension

    def embed_text(self, text: str) -> List[float]:
        """Generate embedding for a single text"""
        try:
            response = self.client.embeddings.create(
                model=self.model,
                input=text
            )
            return response.data[0].embedding
        except Exception as e:
            print(f"❌ Error generating embedding: {e}")
            return None

    def embed_batch(self, texts: List[str], batch_size: int = 50) -> List[List[float]]:
        """
        Generate embeddings for multiple texts with batching and rate limiting
        """
        embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            try:
                print(f"🔄 Processing batch {i//batch_size + 1}/{(len(texts)-1)//batch_size + 1}")
                response = self.client.embeddings.create(
                    model=self.model,
                    input=batch
                )
                batch_embeddings = [item.embedding for item in response.data]
                embeddings.extend(batch_embeddings)
                # Rate limiting: small delay between batches
                time.sleep(0.1)
            except Exception as e:
                print(f"❌ Error in batch {i//batch_size + 1}: {e}")
                # Add None for failed embeddings
                embeddings.extend([None] * len(batch))
        return embeddings

    def embed_chunks(self, chunks: List[Dict]) -> List[Dict]:
        """
        Add embeddings to chunk data
        """
        print(f"🧠 Generating embeddings for {len(chunks)} chunks...")
        # Extract texts for embedding
        texts = [chunk['text'] for chunk in chunks]
        # Generate embeddings
        embeddings = self.embed_batch(texts)
        # Add embeddings to chunks
        embedded_chunks = []
        for chunk, embedding in zip(chunks, embeddings):
            if embedding is not None:
                chunk['embedding'] = embedding
                embedded_chunks.append(chunk)
            else:
                print(f"⚠️ Skipping chunk {chunk['id']} due to embedding failure")
        print(f"✅ Successfully embedded {len(embedded_chunks)} chunks")
        return embedded_chunks

# Generate embeddings
embedder = OpenAIEmbedder(api_key=os.getenv('OPENAI_API_KEY'))
embedded_chunks = embedder.embed_chunks(chunks)

print(f"\n📊 Embedding Statistics:")
print(f"Successful embeddings: {len(embedded_chunks)}")
print(f"Embedding dimension: {len(embedded_chunks[0]['embedding']) if embedded_chunks else 'N/A'}")

Vector Storage in Supabase

Now let's store our embedded chunks in Supabase pgvector for efficient similarity search:

import vecs
from typing import List, Dict
class SupabaseVectorStore:
    """
    Manage vector storage and retrieval in Supabase pgvector
    """
    def __init__(self, db_url: str, collection_name: str = "knowledge_base"):
        self.client = vecs.create_client(db_url)
        self.collection_name = collection_name
        self.collection = None

    def create_collection(self, dimension: int = 1536):
        """Create or get vector collection"""
        self.collection = self.client.get_or_create_collection(
            name=self.collection_name,
            dimension=dimension
        )
        print(f"📦 Collection '{self.collection_name}' ready")

    def upsert_chunks(self, embedded_chunks: List[Dict], batch_size: int = 100):
        """
        Store embedded chunks in the vector database
        """
        if not self.collection:
            raise ValueError("Collection not created. Call create_collection() first.")
        print(f"💾 Storing {len(embedded_chunks)} chunks in Supabase...")

        # Prepare records for upsert: (id, vector, metadata)
        records = []
        for chunk in embedded_chunks:
            record = (
                chunk['id'],         # Unique ID
                chunk['embedding'],  # Vector embedding
                {                    # Metadata
                    'text': chunk['text'],
                    'url': chunk['metadata']['url'],
                    'title': chunk['metadata']['title'],
                    'description': chunk['metadata']['description'],
                    'chunk_index': chunk['metadata']['chunk_index'],
                    'source': chunk['metadata']['source']
                }
            )
            records.append(record)

        # Upsert in batches
        for i in range(0, len(records), batch_size):
            batch = records[i:i + batch_size]
            try:
                self.collection.upsert(records=batch)
                print(f"✅ Stored batch {i//batch_size + 1}/{(len(records)-1)//batch_size + 1}")
            except Exception as e:
                print(f"❌ Error storing batch {i//batch_size + 1}: {e}")
        print(f"🎉 Successfully stored {len(embedded_chunks)} chunks!")

    def create_index(self):
        """Create HNSW index for fast similarity search"""
        if not self.collection:
            raise ValueError("Collection not created.")
        print("🔍 Creating HNSW index for fast search...")
        self.collection.create_index()
        print("✅ Index created successfully!")

    def similarity_search(self, query_embedding: List[float], limit: int = 5) -> List[Dict]:
        """
        Search for similar chunks using vector similarity
        """
        if not self.collection:
            raise ValueError("Collection not created.")
        results = self.collection.query(
            data=query_embedding,
            limit=limit,
            include_value=True,     # include the distance so each row is (id, distance, metadata)
            include_metadata=True
        )
        # Format results: each row is (id, distance, metadata)
        formatted_results = []
        for result in results:
            formatted_results.append({
                'id': result[0],
                'distance': result[1],   # cosine distance (lower = more similar)
                'metadata': result[2]
            })
        return formatted_results

# Store vectors in Supabase
vector_store = SupabaseVectorStore(
    db_url=os.getenv('DATABASE_URL'),
    collection_name="supacrawler_rag_demo"
)

# Create collection and store chunks
vector_store.create_collection(dimension=1536)
vector_store.upsert_chunks(embedded_chunks)
vector_store.create_index()

print("\n🎯 Vector storage complete! Ready for queries.")

Building the Query Engine

Now let's create a query engine that can answer questions using our knowledge base:

class RAGQueryEngine:
    """
    Complete RAG query engine with embedding search and response generation
    """
    def __init__(self, vector_store: SupabaseVectorStore, embedder: OpenAIEmbedder):
        self.vector_store = vector_store
        self.embedder = embedder
        self.openai_client = embedder.client

    def search_knowledge_base(self, query: str, top_k: int = 5) -> List[Dict]:
        """
        Search the knowledge base for relevant chunks
        """
        print(f"🔍 Searching for: '{query}'")
        # Generate query embedding
        query_embedding = self.embedder.embed_text(query)
        if not query_embedding:
            raise ValueError("Failed to generate query embedding")
        # Search vector database
        results = self.vector_store.similarity_search(query_embedding, limit=top_k)
        print(f"📋 Found {len(results)} relevant chunks")
        return results

    def generate_response(self, query: str, context_chunks: List[Dict], model: str = "gpt-3.5-turbo") -> str:
        """
        Generate a response using retrieved context
        """
        # Build context from retrieved chunks
        context_texts = []
        for chunk in context_chunks:
            metadata = chunk['metadata']
            text = metadata['text']
            url = metadata['url']
            title = metadata['title']
            context_texts.append(f"Source: {title} ({url})\n{text}")
        context = "\n\n---\n\n".join(context_texts)

        # Create prompt
        prompt = f"""You are a helpful assistant that answers questions based on the provided context.
Use only the information from the context to answer the question. If the context doesn't contain
enough information to answer the question, say so clearly.

Context:
{context}

Question: {query}

Answer:"""
        try:
            response = self.openai_client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
                    {"role": "user", "content": prompt}
                ],
                max_tokens=500,
                temperature=0.1
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Error generating response: {e}"

    def ask(self, question: str, top_k: int = 5) -> Dict:
        """
        Complete RAG pipeline: search + generate
        """
        print(f"\n💬 Question: {question}")
        # Search knowledge base
        relevant_chunks = self.search_knowledge_base(question, top_k=top_k)
        if not relevant_chunks:
            return {
                'question': question,
                'answer': "I couldn't find relevant information to answer your question.",
                'sources': []
            }
        # Generate response
        answer = self.generate_response(question, relevant_chunks)
        # Extract source URLs
        sources = list(set([
            chunk['metadata']['url']
            for chunk in relevant_chunks
        ]))
        result = {
            'question': question,
            'answer': answer,
            'sources': sources,
            'relevant_chunks': len(relevant_chunks)
        }
        print(f"✅ Answer generated using {len(relevant_chunks)} chunks from {len(sources)} sources")
        return result

# Create query engine
query_engine = RAGQueryEngine(vector_store, embedder)

# Test queries
test_questions = [
    "How do I set up pgvector in Supabase?",
    "What are the main features of Supabase?",
    "How do I query vectors in Supabase?",
]

print("\n🧪 Testing RAG System:")
print("=" * 50)
for question in test_questions:
    result = query_engine.ask(question)
    print(f"\n❓ {result['question']}")
    print(f"💡 {result['answer']}")
    print(f"📚 Sources: {', '.join(result['sources'])}")
    print("-" * 50)

Complete RAG Implementation

Here's the complete implementation that ties everything together:

import os
from dotenv import load_dotenv
from supacrawler import SupacrawlerClient
import vecs
import openai
from typing import List, Dict, Any
class SupacrawlerRAGSystem:
    """
    Complete RAG system using Supacrawler, OpenAI, and Supabase
    """
    def __init__(self):
        load_dotenv()
        # Initialize clients
        self.supacrawler_client = SupacrawlerClient(
            api_key=os.getenv('SUPACRAWLER_API_KEY')
        )
        self.openai_client = openai.OpenAI(
            api_key=os.getenv('OPENAI_API_KEY')
        )
        self.vector_client = vecs.create_client(os.getenv('DATABASE_URL'))
        # Initialize components
        self.chunker = DocumentChunker()
        self.collection = None
        print("🚀 RAG System initialized!")

    def build_knowledge_base(self, url: str, collection_name: str = "rag_knowledge"):
        """
        Complete pipeline: crawl → chunk → embed → store
        """
        print(f"🏗️ Building knowledge base from {url}")
        # Step 1: Crawl website
        crawled_data = self._crawl_website(url)
        # Step 2: Process and chunk content
        chunks = self.chunker.process_crawled_content(crawled_data)
        # Step 3: Generate embeddings
        embedded_chunks = self._embed_chunks(chunks)
        # Step 4: Store in vector database
        self._store_vectors(embedded_chunks, collection_name)
        print("✅ Knowledge base built successfully!")
        return len(embedded_chunks)

    def _crawl_website(self, url: str) -> Dict:
        """Crawl website and return content"""
        job = self.supacrawler_client.create_crawl_job(
            url=url,
            format='markdown',
            depth=3,
            link_limit=100,
            render_js=True,
            include_patterns=['/docs/*', '/api/*', '/guides/*']
        )
        result = self.supacrawler_client.wait_for_crawl(job.job_id)
        if result.status == 'completed':
            return result.data.get('crawl_data', {})
        else:
            raise Exception(f"Crawl failed: {result.status}")

    def _embed_chunks(self, chunks: List[Dict]) -> List[Dict]:
        """Generate OpenAI embeddings for chunks"""
        embedded_chunks = []
        for chunk in chunks:
            try:
                response = self.openai_client.embeddings.create(
                    model="text-embedding-3-small",
                    input=chunk['text']
                )
                chunk['embedding'] = response.data[0].embedding
                embedded_chunks.append(chunk)
            except Exception as e:
                print(f"⚠️ Failed to embed chunk {chunk['id']}: {e}")
        return embedded_chunks

    def _store_vectors(self, embedded_chunks: List[Dict], collection_name: str):
        """Store vectors in Supabase"""
        self.collection = self.vector_client.get_or_create_collection(
            name=collection_name,
            dimension=1536
        )
        # Prepare records; include the chunk text in the metadata so ask() can build context from it
        records = [
            (chunk['id'], chunk['embedding'], {**chunk['metadata'], 'text': chunk['text']})
            for chunk in embedded_chunks
        ]
        # Upsert and create index
        self.collection.upsert(records=records)
        self.collection.create_index()

    def ask(self, question: str, top_k: int = 5) -> Dict:
        """Ask a question and get an answer"""
        if not self.collection:
            raise ValueError("No knowledge base loaded. Call build_knowledge_base() first.")
        # Generate query embedding
        query_response = self.openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=question
        )
        query_embedding = query_response.data[0].embedding
        # Search vectors
        results = self.collection.query(
            data=query_embedding,
            limit=top_k,
            include_value=True,      # return distance too, so rows are (id, distance, metadata)
            include_metadata=True
        )
        # Build context
        context_parts = []
        sources = []
        for result in results:
            metadata = result[2]  # Metadata is the third element
            context_parts.append(f"Source: {metadata['title']}\n{metadata['text']}")
            sources.append(metadata['url'])
        context = "\n\n---\n\n".join(context_parts)
        # Generate answer
        response = self.openai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Answer questions based only on the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ],
            max_tokens=500,
            temperature=0.1
        )
        return {
            'question': question,
            'answer': response.choices[0].message.content,
            'sources': list(set(sources))
        }

# Usage example
if __name__ == "__main__":
    # Initialize RAG system
    rag = SupacrawlerRAGSystem()
    # Build knowledge base
    num_chunks = rag.build_knowledge_base("https://supabase.com/docs")
    print(f"📊 Knowledge base contains {num_chunks} chunks")
    # Ask questions
    questions = [
        "How do I enable pgvector in Supabase?",
        "What authentication methods does Supabase support?",
        "How do I create a vector search in Supabase?"
    ]
    for question in questions:
        result = rag.ask(question)
        print(f"\n❓ {result['question']}")
        print(f"💡 {result['answer']}")
        print(f"📚 Sources: {result['sources']}")

Performance Optimization

To optimize your RAG system for production use:

Chunking Optimization

# Experiment with different chunk sizes
chunk_configs = [
    {'size': 500, 'overlap': 100},
    {'size': 1000, 'overlap': 200},
    {'size': 1500, 'overlap': 300}
]

for config in chunk_configs:
    chunker = DocumentChunker(
        chunk_size=config['size'],
        chunk_overlap=config['overlap']
    )
    # Test and measure retrieval performance for each configuration
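
One way to compare these configurations is a small retrieval check: re-ingest the content with each setting, then measure how often a question's known source page shows up in the top-k results. The helper and labeled examples below are illustrative placeholders rather than part of any library API:

def retrieval_hit_rate(query_engine, labeled_queries, top_k: int = 5) -> float:
    """Fraction of questions whose expected URL appears among the top-k retrieved sources."""
    hits = 0
    for question, expected_url in labeled_queries:
        results = query_engine.search_knowledge_base(question, top_k=top_k)
        retrieved_urls = {r['metadata']['url'] for r in results}
        if expected_url in retrieved_urls:
            hits += 1
    return hits / len(labeled_queries)

# Hypothetical labeled set: pair each question with the page expected to answer it
labeled_queries = [
    ("How do I enable pgvector?", "https://supabase.com/docs/guides/database/extensions/pgvector"),
    ("How do I create a vector index?", "https://supabase.com/docs/guides/ai/vector-indexes"),
]
print(f"Hit rate@5: {retrieval_hit_rate(query_engine, labeled_queries):.2f}")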

Embedding Batch Processing

# Process embeddings in batches for better throughput
def batch_embed(texts: List[str], batch_size: int = 100):
    # Assumes an initialized client: openai_client = openai.OpenAI()
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        embeddings.extend([item.embedding for item in response.data])
    return embeddings

Database Indexing

-- Optimize vector search performance
CREATE INDEX CONCURRENTLY ON your_collection
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
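
Recall can also be tuned at query time: pgvector's HNSW index reads the hnsw.ef_search setting (default 40), and raising it scans more of the graph for better recall at the cost of latency:

-- Per-session tuning: higher ef_search = better recall, slower queries
SET hnsw.ef_search = 100;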

Production Deployment

For production deployment, consider these enhancements:

Environment Configuration

# config.py
from pydantic import BaseSettings  # With Pydantic v2, import BaseSettings from pydantic_settings instead

class Settings(BaseSettings):
    supacrawler_api_key: str
    openai_api_key: str
    database_url: str
    supabase_url: str
    supabase_key: str
    # Performance settings
    embedding_batch_size: int = 50
    vector_search_limit: int = 10
    chunk_size: int = 1000
    chunk_overlap: int = 200

    class Config:
        env_file = ".env"

settings = Settings()

Error Handling and Monitoring

import logging
from functools import wraps
def with_retry(max_retries: int = 3):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        logging.error(f"Function {func.__name__} failed after {max_retries} attempts: {e}")
                        raise
                    logging.warning(f"Attempt {attempt + 1} failed: {e}, retrying...")
        return wrapper
    return decorator

@with_retry(max_retries=3)
def embed_with_retry(text: str):
    # Assumes an initialized client: openai_client = openai.OpenAI()
    return openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )

Scale Beyond Local Development with Supacrawler

While this tutorial demonstrates building a RAG system locally, production deployments require handling scale, reliability, and performance optimization:

  • Large-Scale Crawling: Managing thousands of pages with rate limiting and error handling
  • Content Updates: Keeping knowledge bases current with incremental updates
  • Vector Management: Optimizing storage and search performance across millions of vectors
  • Cost Optimization: Balancing embedding quality with API costs

Supacrawler's Crawl API handles these production challenges automatically:

import { SupacrawlerClient } from '@supacrawler/js'
const client = new SupacrawlerClient({ apiKey: process.env.SUPACRAWLER_API_KEY })
// Production-scale RAG data pipeline
async function buildProductionRAG() {
  const job = await client.createCrawlJob({
    url: 'https://docs.company.com',
    format: 'markdown',
    depth: 5,              // Deep crawling
    link_limit: 10000,     // Large scale
    render_js: true,       // Full JavaScript support
    // Production optimizations
    include_patterns: ['/docs/*', '/api/*', '/guides/*'],
    exclude_patterns: ['/blog/*', '/changelog/*'],
    timeout: 30000,
    concurrent_limit: 10,  // Parallel processing
    // Content quality
    remove_selectors: ['.sidebar', '.nav', '.footer'],
    wait_for: '.main-content',
    block_ads: true
  })

  const result = await client.waitForCrawl(job.job_id)
  return result.data.crawl_data
}

Key Production Advantages:

  • Automatic Scale Management: Handle 10,000+ pages without infrastructure complexity
  • Content Quality: Clean, structured content optimized for RAG systems
  • JavaScript Rendering: Full SPA and dynamic content support
  • Rate Limiting: Built-in respect for robots.txt and site limits
  • Error Recovery: Automatic retries and failure handling
  • Incremental Updates: Efficient re-crawling for content freshness


Conclusion

You've built a complete RAG system that combines the best of modern AI technologies:

  • Supacrawler for high-quality web data extraction
  • OpenAI embeddings for superior semantic understanding
  • Supabase pgvector for scalable vector storage and search

This foundation provides everything needed for intelligent question-answering systems, chatbots, and knowledge management applications. The modular design makes it easy to swap components, optimize performance, and scale to production requirements.

Whether you're building internal documentation search, customer support automation, or research assistants, this RAG architecture provides a robust, production-ready foundation for AI-powered applications.

By Supacrawler Team
Published on September 7, 2025