
Integrations: Build a simple RAG System with Supacrawler and Supabase pgvector using OpenAI embeddings

Retrieval-Augmented Generation (RAG) has revolutionized how we build AI applications by combining the power of large language models with real-time access to external knowledge bases. This comprehensive guide shows you how to build a production-ready RAG system using Supacrawler for web data extraction, Supabase pgvector for vector storage, and OpenAI embeddings for high-quality semantic search.

By the end of this tutorial, you'll have a complete RAG pipeline that can crawl any website, convert content into searchable vectors, and provide intelligent question-answering capabilities.

If you'd like to try it yourself, you can follow along in the Supabase Vectors notebook.

Table of Contents

  • Understanding the RAG Architecture
  • Setting Up Your Environment
  • Configuring Supabase pgvector
  • Web Crawling with Supacrawler
  • Content Processing and Chunking
  • Generating OpenAI Embeddings
  • Vector Storage in Supabase
  • Building the Query Engine
  • Complete RAG Implementation
  • Performance Optimization
  • Production Deployment
  • Scale Beyond Local Development with Supacrawler
  • Conclusion

Understanding the RAG Architecture

Our RAG system follows a four-stage pipeline that transforms web content into intelligent, searchable knowledge:

Stage                | Component         | Purpose                                             | Technology
---------------------|-------------------|-----------------------------------------------------|---------------------------------
Data Extraction      | Supacrawler       | Crawl and extract clean content from websites       | Supacrawler Crawl API
Content Processing   | Text Chunking     | Split content into meaningful, searchable segments  | Python text splitters
Embedding Generation | OpenAI API        | Convert text chunks into high-dimensional vectors   | OpenAI text-embedding-3-small
Vector Storage       | Supabase pgvector | Store and search vectors with metadata              | PostgreSQL + pgvector extension

This architecture provides several key advantages:

  • Scalable Data Ingestion: Supacrawler handles JavaScript rendering, rate limiting, and large-scale crawling
  • High-Quality Embeddings: OpenAI's embeddings provide superior semantic understanding
  • Production-Ready Storage: Supabase offers managed PostgreSQL with built-in vector operations
  • Real-Time Updates: Easy to refresh content and maintain current knowledge
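
Concretely, each stage above maps to a helper built later in this guide. Stitched together, the pipeline looks roughly like the sketch below (placeholder credentials; the class and function names are the ones defined in the following sections):

# High-level sketch of the four stages (each helper is implemented later in this post)
pages = crawl_documentation("https://supabase.com/docs")        # 1. Data extraction (Supacrawler)
chunks = DocumentChunker().process_crawled_content(pages)       # 2. Content processing
embedded = OpenAIEmbedder(api_key="...").embed_chunks(chunks)   # 3. Embedding generation (OpenAI)
store = SupabaseVectorStore(db_url="...")                       # 4. Vector storage (Supabase pgvector)
store.create_collection(dimension=1536)
store.upsert_chunks(embedded)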

Setting Up Your Environment

First, install the required dependencies and set up your development environment:

# Install core dependencies
pip install supacrawler vecs openai python-dotenv
# Install text processing libraries
pip install beautifulsoup4 markdownify

Create a .env file with your API credentials:

# .env
SUPACRAWLER_API_KEY=your_supacrawler_api_key
OPENAI_API_KEY=your_openai_api_key
SUPABASE_URL=your_supabase_project_url
SUPABASE_KEY=your_supabase_anon_key
DATABASE_URL=postgresql://postgres:[password]@db.[project].supabase.co:5432/postgres

Configuring Supabase pgvector

Before building our RAG system, we need to enable the pgvector extension in Supabase:

  1. Enable pgvector Extension:

    • Go to your Supabase dashboard
    • Navigate to Database → Extensions
    • Search for and enable pgvector
  2. Verify Installation:

import os
import vecs
from dotenv import load_dotenv
load_dotenv()
# Connect to Supabase
DB_URL = os.getenv('DATABASE_URL')
vx = vecs.create_client(DB_URL)
# Test connection
print("✅ Connected to Supabase successfully!")
  3. Create Vector Collection:
# Create collection for our RAG system
# OpenAI text-embedding-3-small uses 1536 dimensions
collection = vx.get_or_create_collection(
    name="knowledge_base",
    dimension=1536
)
print(f"📦 Collection created: {collection.name}")

Web Crawling with Supacrawler

Supacrawler excels at extracting clean, structured content from websites. Let's crawl a documentation site to build our knowledge base:

import os
from supacrawler import SupacrawlerClient
from dotenv import load_dotenv
load_dotenv()
def crawl_documentation(base_url: str, patterns: list = None) -> dict:
    """
    Crawl a documentation site and return clean content

    Args:
        base_url: The starting URL to crawl
        patterns: URL patterns to include (e.g., ['/docs/*', '/api/*'])

    Returns:
        Dictionary mapping URLs to their content and metadata
    """
    client = SupacrawlerClient(api_key=os.getenv('SUPACRAWLER_API_KEY'))

    # Configure crawl parameters for documentation
    crawl_config = {
        'url': base_url,
        'format': 'markdown',    # Get clean markdown content
        'depth': 3,              # Crawl up to 3 levels deep
        'link_limit': 200,       # Limit total pages crawled
        'render_js': True,       # Handle JavaScript-rendered content
        'include_patterns': patterns or ['/docs/*', '/api/*', '/guides/*'],
        'exclude_patterns': ['/blog/*', '/changelog/*'],  # Skip non-documentation
        'timeout': 30000,        # 30 second timeout per page
    }

    print(f"🚀 Starting crawl of {base_url}")

    # Create and wait for crawl job
    job = client.create_crawl_job(**crawl_config)
    result = client.wait_for_crawl(job.job_id)

    if result.status == 'completed':
        crawl_data = result.data.get('crawl_data', {})
        print(f"✅ Crawl completed! Found {len(crawl_data)} pages")
        return crawl_data
    else:
        raise Exception(f"Crawl failed with status: {result.status}")

# Example: Crawl Supabase documentation
crawled_content = crawl_documentation(
    'https://supabase.com/docs',
    patterns=['/docs/*']
)

# Display crawl results
for url, page_data in list(crawled_content.items())[:3]:
    content = page_data.get('markdown', '')
    title = page_data.get('metadata', {}).get('title', 'No title')
    print(f"\n📄 {title}")
    print(f"🔗 {url}")
    print(f"📝 Content length: {len(content)} characters")
    print(f"📋 Preview: {content[:200]}...")

Content Processing and Chunking

Effective chunking is crucial for RAG performance. We need to split large documents into meaningful, searchable segments:

import re
from typing import List, Dict
class DocumentChunker:
    """
    Intelligent document chunker optimized for RAG systems
    """
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap

    def clean_content(self, content: str) -> str:
        """Clean and normalize content"""
        # Remove excessive whitespace
        content = re.sub(r'\n\s*\n\s*\n', '\n\n', content)
        # Remove markdown artifacts that don't add value
        content = re.sub(r'\[([^\]]+)\]\([^)]+\)', r'\1', content)  # Convert links to text
        content = re.sub(r'```[\s\S]*?```', '', content)            # Remove code blocks
        content = re.sub(r'`([^`]+)`', r'\1', content)              # Remove inline code formatting
        return content.strip()

    def split_by_headers(self, content: str) -> List[str]:
        """Split content by markdown headers for semantic boundaries"""
        # Split on headers (H1-H6)
        sections = re.split(r'\n(?=#{1,6}\s)', content)
        return [section.strip() for section in sections if section.strip()]

    def chunk_text(self, text: str, max_size: int = None) -> List[str]:
        """Split text into chunks with overlap"""
        max_size = max_size or self.chunk_size
        if len(text) <= max_size:
            return [text]
        chunks = []
        start = 0
        while start < len(text):
            # Find end position
            end = start + max_size
            if end >= len(text):
                chunks.append(text[start:])
                break
            # Try to end at a sentence boundary
            sentence_end = text.rfind('.', start, end)
            if sentence_end > start + max_size // 2:
                end = sentence_end + 1
            else:
                # Try to end at a word boundary
                word_end = text.rfind(' ', start, end)
                if word_end > start + max_size // 2:
                    end = word_end
            chunks.append(text[start:end])
            start = end - self.chunk_overlap
        return chunks

    def process_crawled_content(self, crawled_data: Dict) -> List[Dict]:
        """
        Process crawled content into chunks with metadata

        Returns:
            List of chunk dictionaries with text, metadata, and unique IDs
        """
        all_chunks = []
        for url, page_data in crawled_data.items():
            content = page_data.get('markdown', '')
            metadata = page_data.get('metadata', {})
            if not content:
                continue
            # Clean content
            cleaned_content = self.clean_content(content)
            # Split by headers first for semantic boundaries
            sections = self.split_by_headers(cleaned_content)
            chunk_index = 0
            for section in sections:
                # Further chunk large sections
                section_chunks = self.chunk_text(section)
                for chunk_text in section_chunks:
                    if len(chunk_text.strip()) < 100:  # Skip very small chunks
                        continue
                    chunk = {
                        'id': f"{url}#{chunk_index}",
                        'text': chunk_text.strip(),
                        'metadata': {
                            'url': url,
                            'title': metadata.get('title', ''),
                            'description': metadata.get('description', ''),
                            'chunk_index': chunk_index,
                            'source': 'supacrawler'
                        }
                    }
                    all_chunks.append(chunk)
                    chunk_index += 1
        print(f"📊 Created {len(all_chunks)} chunks from {len(crawled_data)} pages")
        return all_chunks

# Process our crawled content
chunker = DocumentChunker(chunk_size=800, chunk_overlap=150)
chunks = chunker.process_crawled_content(crawled_content)

# Display chunking results
print(f"\n📈 Chunking Statistics:")
print(f"Total chunks: {len(chunks)}")
print(f"Average chunk size: {sum(len(c['text']) for c in chunks) // len(chunks)} characters")

# Show sample chunks
print("\n📋 Sample chunks:")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i+1}:")
    print(f"ID: {chunk['id']}")
    print(f"Title: {chunk['metadata']['title']}")
    print(f"Text: {chunk['text'][:200]}...")

Generating OpenAI Embeddings

OpenAI's embeddings provide superior semantic understanding. Let's generate embeddings for our chunks:

import openai
from typing import List, Dict
import time
class OpenAIEmbedder:
    """
    Generate embeddings using OpenAI's text-embedding-3-small model
    """
    def __init__(self, api_key: str, model: str = "text-embedding-3-small"):
        self.client = openai.OpenAI(api_key=api_key)
        self.model = model
        self.dimension = 1536  # text-embedding-3-small dimension

    def embed_text(self, text: str) -> List[float]:
        """Generate embedding for a single text"""
        try:
            response = self.client.embeddings.create(
                model=self.model,
                input=text
            )
            return response.data[0].embedding
        except Exception as e:
            print(f"❌ Error generating embedding: {e}")
            return None

    def embed_batch(self, texts: List[str], batch_size: int = 50) -> List[List[float]]:
        """
        Generate embeddings for multiple texts with batching and rate limiting
        """
        embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            try:
                print(f"🔄 Processing batch {i//batch_size + 1}/{(len(texts)-1)//batch_size + 1}")
                response = self.client.embeddings.create(
                    model=self.model,
                    input=batch
                )
                batch_embeddings = [item.embedding for item in response.data]
                embeddings.extend(batch_embeddings)
                # Rate limiting: small delay between batches
                time.sleep(0.1)
            except Exception as e:
                print(f"❌ Error in batch {i//batch_size + 1}: {e}")
                # Add None for failed embeddings
                embeddings.extend([None] * len(batch))
        return embeddings

    def embed_chunks(self, chunks: List[Dict]) -> List[Dict]:
        """
        Add embeddings to chunk data
        """
        print(f"🧠 Generating embeddings for {len(chunks)} chunks...")
        # Extract texts for embedding
        texts = [chunk['text'] for chunk in chunks]
        # Generate embeddings
        embeddings = self.embed_batch(texts)
        # Add embeddings to chunks
        embedded_chunks = []
        for chunk, embedding in zip(chunks, embeddings):
            if embedding is not None:
                chunk['embedding'] = embedding
                embedded_chunks.append(chunk)
            else:
                print(f"⚠️ Skipping chunk {chunk['id']} due to embedding failure")
        print(f"✅ Successfully embedded {len(embedded_chunks)} chunks")
        return embedded_chunks

# Generate embeddings
embedder = OpenAIEmbedder(api_key=os.getenv('OPENAI_API_KEY'))
embedded_chunks = embedder.embed_chunks(chunks)

print(f"\n📊 Embedding Statistics:")
print(f"Successful embeddings: {len(embedded_chunks)}")
print(f"Embedding dimension: {len(embedded_chunks[0]['embedding']) if embedded_chunks else 'N/A'}")

Vector Storage in Supabase

Now let's store our embedded chunks in Supabase pgvector for efficient similarity search:

import vecs
from typing import List, Dict
class SupabaseVectorStore:
    """
    Manage vector storage and retrieval in Supabase pgvector
    """
    def __init__(self, db_url: str, collection_name: str = "knowledge_base"):
        self.client = vecs.create_client(db_url)
        self.collection_name = collection_name
        self.collection = None

    def create_collection(self, dimension: int = 1536):
        """Create or get vector collection"""
        self.collection = self.client.get_or_create_collection(
            name=self.collection_name,
            dimension=dimension
        )
        print(f"📦 Collection '{self.collection_name}' ready")

    def upsert_chunks(self, embedded_chunks: List[Dict], batch_size: int = 100):
        """
        Store embedded chunks in the vector database
        """
        if not self.collection:
            raise ValueError("Collection not created. Call create_collection() first.")
        print(f"💾 Storing {len(embedded_chunks)} chunks in Supabase...")

        # Prepare records for upsert: (id, vector, metadata)
        records = []
        for chunk in embedded_chunks:
            record = (
                chunk['id'],         # Unique ID
                chunk['embedding'],  # Vector embedding
                {                    # Metadata
                    'text': chunk['text'],
                    'url': chunk['metadata']['url'],
                    'title': chunk['metadata']['title'],
                    'description': chunk['metadata']['description'],
                    'chunk_index': chunk['metadata']['chunk_index'],
                    'source': chunk['metadata']['source']
                }
            )
            records.append(record)

        # Upsert in batches
        for i in range(0, len(records), batch_size):
            batch = records[i:i + batch_size]
            try:
                self.collection.upsert(records=batch)
                print(f"✅ Stored batch {i//batch_size + 1}/{(len(records)-1)//batch_size + 1}")
            except Exception as e:
                print(f"❌ Error storing batch {i//batch_size + 1}: {e}")
        print(f"🎉 Successfully stored {len(embedded_chunks)} chunks!")

    def create_index(self):
        """Create HNSW index for fast similarity search"""
        if not self.collection:
            raise ValueError("Collection not created.")
        print("🔍 Creating HNSW index for fast search...")
        self.collection.create_index()
        print("✅ Index created successfully!")

    def similarity_search(self, query_embedding: List[float], limit: int = 5) -> List[Dict]:
        """
        Search for similar chunks using vector similarity
        """
        if not self.collection:
            raise ValueError("Collection not created.")
        results = self.collection.query(
            data=query_embedding,
            limit=limit,
            include_value=True,     # include the distance so each row is (id, distance, metadata)
            include_metadata=True
        )
        # Format results: each row is (id, distance, metadata)
        formatted_results = []
        for result in results:
            formatted_results.append({
                'id': result[0],
                'distance': result[1],   # cosine distance (lower = more similar)
                'metadata': result[2]
            })
        return formatted_results

# Store vectors in Supabase
vector_store = SupabaseVectorStore(
    db_url=os.getenv('DATABASE_URL'),
    collection_name="supacrawler_rag_demo"
)

# Create collection and store chunks
vector_store.create_collection(dimension=1536)
vector_store.upsert_chunks(embedded_chunks)
vector_store.create_index()

print("\n🎯 Vector storage complete! Ready for queries.")

Building the Query Engine

Now let's create a query engine that can answer questions using our knowledge base:

class RAGQueryEngine:
    """
    Complete RAG query engine with embedding search and response generation
    """
    def __init__(self, vector_store: SupabaseVectorStore, embedder: OpenAIEmbedder):
        self.vector_store = vector_store
        self.embedder = embedder
        self.openai_client = embedder.client

    def search_knowledge_base(self, query: str, top_k: int = 5) -> List[Dict]:
        """
        Search the knowledge base for relevant chunks
        """
        print(f"🔍 Searching for: '{query}'")
        # Generate query embedding
        query_embedding = self.embedder.embed_text(query)
        if not query_embedding:
            raise ValueError("Failed to generate query embedding")
        # Search vector database
        results = self.vector_store.similarity_search(query_embedding, limit=top_k)
        print(f"📋 Found {len(results)} relevant chunks")
        return results

    def generate_response(self, query: str, context_chunks: List[Dict], model: str = "gpt-3.5-turbo") -> str:
        """
        Generate a response using retrieved context
        """
        # Build context from retrieved chunks
        context_texts = []
        for chunk in context_chunks:
            metadata = chunk['metadata']
            text = metadata['text']
            url = metadata['url']
            title = metadata['title']
            context_texts.append(f"Source: {title} ({url})\n{text}")
        context = "\n\n---\n\n".join(context_texts)

        # Create prompt
        prompt = f"""You are a helpful assistant that answers questions based on the provided context.
Use only the information from the context to answer the question. If the context doesn't contain
enough information to answer the question, say so clearly.

Context:
{context}

Question: {query}

Answer:"""
        try:
            response = self.openai_client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant that answers questions based on provided context."},
                    {"role": "user", "content": prompt}
                ],
                max_tokens=500,
                temperature=0.1
            )
            return response.choices[0].message.content
        except Exception as e:
            return f"Error generating response: {e}"

    def ask(self, question: str, top_k: int = 5) -> Dict:
        """
        Complete RAG pipeline: search + generate
        """
        print(f"\n💬 Question: {question}")
        # Search knowledge base
        relevant_chunks = self.search_knowledge_base(question, top_k=top_k)
        if not relevant_chunks:
            return {
                'question': question,
                'answer': "I couldn't find relevant information to answer your question.",
                'sources': []
            }
        # Generate response
        answer = self.generate_response(question, relevant_chunks)
        # Extract source URLs
        sources = list(set([
            chunk['metadata']['url']
            for chunk in relevant_chunks
        ]))
        result = {
            'question': question,
            'answer': answer,
            'sources': sources,
            'relevant_chunks': len(relevant_chunks)
        }
        print(f"✅ Answer generated using {len(relevant_chunks)} chunks from {len(sources)} sources")
        return result

# Create query engine
query_engine = RAGQueryEngine(vector_store, embedder)

# Test queries
test_questions = [
    "How do I set up pgvector in Supabase?",
    "What are the main features of Supabase?",
    "How do I query vectors in Supabase?",
]

print("\n🧪 Testing RAG System:")
print("=" * 50)
for question in test_questions:
    result = query_engine.ask(question)
    print(f"\n❓ {result['question']}")
    print(f"💡 {result['answer']}")
    print(f"📚 Sources: {', '.join(result['sources'])}")
    print("-" * 50)

Complete RAG Implementation

Here's the complete implementation that ties everything together:

import os
from dotenv import load_dotenv
from supacrawler import SupacrawlerClient
import vecs
import openai
from typing import List, Dict, Any
class SupacrawlerRAGSystem:
    """
    Complete RAG system using Supacrawler, OpenAI, and Supabase
    """
    def __init__(self):
        load_dotenv()
        # Initialize clients
        self.supacrawler_client = SupacrawlerClient(
            api_key=os.getenv('SUPACRAWLER_API_KEY')
        )
        self.openai_client = openai.OpenAI(
            api_key=os.getenv('OPENAI_API_KEY')
        )
        self.vector_client = vecs.create_client(os.getenv('DATABASE_URL'))
        # Initialize components
        self.chunker = DocumentChunker()
        self.collection = None
        print("🚀 RAG System initialized!")

    def build_knowledge_base(self, url: str, collection_name: str = "rag_knowledge"):
        """
        Complete pipeline: crawl → chunk → embed → store
        """
        print(f"🏗️ Building knowledge base from {url}")
        # Step 1: Crawl website
        crawled_data = self._crawl_website(url)
        # Step 2: Process and chunk content
        chunks = self.chunker.process_crawled_content(crawled_data)
        # Step 3: Generate embeddings
        embedded_chunks = self._embed_chunks(chunks)
        # Step 4: Store in vector database
        self._store_vectors(embedded_chunks, collection_name)
        print("✅ Knowledge base built successfully!")
        return len(embedded_chunks)

    def _crawl_website(self, url: str) -> Dict:
        """Crawl website and return content"""
        job = self.supacrawler_client.create_crawl_job(
            url=url,
            format='markdown',
            depth=3,
            link_limit=100,
            render_js=True,
            include_patterns=['/docs/*', '/api/*', '/guides/*']
        )
        result = self.supacrawler_client.wait_for_crawl(job.job_id)
        if result.status == 'completed':
            return result.data.get('crawl_data', {})
        else:
            raise Exception(f"Crawl failed: {result.status}")

    def _embed_chunks(self, chunks: List[Dict]) -> List[Dict]:
        """Generate OpenAI embeddings for chunks"""
        embedded_chunks = []
        for chunk in chunks:
            try:
                response = self.openai_client.embeddings.create(
                    model="text-embedding-3-small",
                    input=chunk['text']
                )
                chunk['embedding'] = response.data[0].embedding
                embedded_chunks.append(chunk)
            except Exception as e:
                print(f"⚠️ Failed to embed chunk {chunk['id']}: {e}")
        return embedded_chunks

    def _store_vectors(self, embedded_chunks: List[Dict], collection_name: str):
        """Store vectors in Supabase"""
        self.collection = self.vector_client.get_or_create_collection(
            name=collection_name,
            dimension=1536
        )
        # Prepare records; include the chunk text in the metadata so ask() can build context from it
        records = [
            (chunk['id'], chunk['embedding'], {**chunk['metadata'], 'text': chunk['text']})
            for chunk in embedded_chunks
        ]
        # Upsert and create index
        self.collection.upsert(records=records)
        self.collection.create_index()

    def ask(self, question: str, top_k: int = 5) -> Dict:
        """Ask a question and get an answer"""
        if not self.collection:
            raise ValueError("No knowledge base loaded. Call build_knowledge_base() first.")
        # Generate query embedding
        query_response = self.openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=question
        )
        query_embedding = query_response.data[0].embedding
        # Search vectors
        results = self.collection.query(
            data=query_embedding,
            limit=top_k,
            include_value=True,      # return distance too, so rows are (id, distance, metadata)
            include_metadata=True
        )
        # Build context
        context_parts = []
        sources = []
        for result in results:
            metadata = result[2]  # Metadata is the third element
            context_parts.append(f"Source: {metadata['title']}\n{metadata['text']}")
            sources.append(metadata['url'])
        context = "\n\n---\n\n".join(context_parts)
        # Generate answer
        response = self.openai_client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Answer questions based only on the provided context."},
                {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
            ],
            max_tokens=500,
            temperature=0.1
        )
        return {
            'question': question,
            'answer': response.choices[0].message.content,
            'sources': list(set(sources))
        }

# Usage example
if __name__ == "__main__":
    # Initialize RAG system
    rag = SupacrawlerRAGSystem()
    # Build knowledge base
    num_chunks = rag.build_knowledge_base("https://supabase.com/docs")
    print(f"📊 Knowledge base contains {num_chunks} chunks")
    # Ask questions
    questions = [
        "How do I enable pgvector in Supabase?",
        "What authentication methods does Supabase support?",
        "How do I create a vector search in Supabase?"
    ]
    for question in questions:
        result = rag.ask(question)
        print(f"\n❓ {result['question']}")
        print(f"💡 {result['answer']}")
        print(f"📚 Sources: {result['sources']}")

Performance Optimization

To optimize your RAG system for production use:

Chunking Optimization

# Experiment with different chunk sizes
chunk_configs = [
    {'size': 500, 'overlap': 100},
    {'size': 1000, 'overlap': 200},
    {'size': 1500, 'overlap': 300}
]

for config in chunk_configs:
    chunker = DocumentChunker(
        chunk_size=config['size'],
        chunk_overlap=config['overlap']
    )
    # Test and measure retrieval performance for each configuration
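
One way to compare these configurations is a small retrieval check: re-ingest the content with each setting, then measure how often a question's known source page shows up in the top-k results. The helper and labeled examples below are illustrative placeholders rather than part of any library API:

def retrieval_hit_rate(query_engine, labeled_queries, top_k: int = 5) -> float:
    """Fraction of questions whose expected URL appears among the top-k retrieved sources."""
    hits = 0
    for question, expected_url in labeled_queries:
        results = query_engine.search_knowledge_base(question, top_k=top_k)
        retrieved_urls = {r['metadata']['url'] for r in results}
        if expected_url in retrieved_urls:
            hits += 1
    return hits / len(labeled_queries)

# Hypothetical labeled set: pair each question with the page expected to answer it
labeled_queries = [
    ("How do I enable pgvector?", "https://supabase.com/docs/guides/database/extensions/pgvector"),
    ("How do I create a vector index?", "https://supabase.com/docs/guides/ai/vector-indexes"),
]
print(f"Hit rate@5: {retrieval_hit_rate(query_engine, labeled_queries):.2f}")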

Embedding Batch Processing

# Process embeddings in batches for better throughput
def batch_embed(texts: List[str], batch_size: int = 100):
    # Assumes an initialized client: openai_client = openai.OpenAI()
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        embeddings.extend([item.embedding for item in response.data])
    return embeddings

Database Indexing

-- Optimize vector search performance
CREATE INDEX CONCURRENTLY ON your_collection
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
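
Recall can also be tuned at query time: pgvector's HNSW index reads the hnsw.ef_search setting (default 40), and raising it scans more of the graph for better recall at the cost of latency:

-- Per-session tuning: higher ef_search = better recall, slower queries
SET hnsw.ef_search = 100;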

Production Deployment

For production deployment, consider these enhancements:

Environment Configuration

# config.py
from pydantic import BaseSettings  # With Pydantic v2, import BaseSettings from pydantic_settings instead

class Settings(BaseSettings):
    supacrawler_api_key: str
    openai_api_key: str
    database_url: str
    supabase_url: str
    supabase_key: str
    # Performance settings
    embedding_batch_size: int = 50
    vector_search_limit: int = 10
    chunk_size: int = 1000
    chunk_overlap: int = 200

    class Config:
        env_file = ".env"

settings = Settings()

Error Handling and Monitoring

import logging
from functools import wraps
def with_retry(max_retries: int = 3):
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        logging.error(f"Function {func.__name__} failed after {max_retries} attempts: {e}")
                        raise
                    logging.warning(f"Attempt {attempt + 1} failed: {e}, retrying...")
        return wrapper
    return decorator

@with_retry(max_retries=3)
def embed_with_retry(text: str):
    # Assumes an initialized client: openai_client = openai.OpenAI()
    return openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )

Scale Beyond Local Development with Supacrawler

While this tutorial demonstrates building a RAG system locally, production deployments require handling scale, reliability, and performance optimization:

  • Large-Scale Crawling: Managing thousands of pages with rate limiting and error handling
  • Content Updates: Keeping knowledge bases current with incremental updates
  • Vector Management: Optimizing storage and search performance across millions of vectors
  • Cost Optimization: Balancing embedding quality with API costs

Supacrawler's Crawl API handles these production challenges automatically:

import { SupacrawlerClient } from '@supacrawler/js'
const client = new SupacrawlerClient({ apiKey: process.env.SUPACRAWLER_API_KEY })
// Production-scale RAG data pipeline
async function buildProductionRAG() {
  const job = await client.createCrawlJob({
    url: 'https://docs.company.com',
    format: 'markdown',
    depth: 5,              // Deep crawling
    link_limit: 10000,     // Large scale
    render_js: true,       // Full JavaScript support
    // Production optimizations
    include_patterns: ['/docs/*', '/api/*', '/guides/*'],
    exclude_patterns: ['/blog/*', '/changelog/*'],
    timeout: 30000,
    concurrent_limit: 10,  // Parallel processing
    // Content quality
    remove_selectors: ['.sidebar', '.nav', '.footer'],
    wait_for: '.main-content',
    block_ads: true
  })

  const result = await client.waitForCrawl(job.job_id)
  return result.data.crawl_data
}

Key Production Advantages:

  • Automatic Scale Management: Handle 10,000+ pages without infrastructure complexity
  • Content Quality: Clean, structured content optimized for RAG systems
  • JavaScript Rendering: Full SPA and dynamic content support
  • Rate Limiting: Built-in respect for robots.txt and site limits
  • Error Recovery: Automatic retries and failure handling
  • Incremental Updates: Efficient re-crawling for content freshness


Conclusion

You've built a complete RAG system that combines the best of modern AI technologies:

  • Supacrawler for high-quality web data extraction
  • OpenAI embeddings for superior semantic understanding
  • Supabase pgvector for scalable vector storage and search

This foundation provides everything needed for intelligent question-answering systems, chatbots, and knowledge management applications. The modular design makes it easy to swap components, optimize performance, and scale to production requirements.

Whether you're building internal documentation search, customer support automation, or research assistants, this RAG architecture provides a robust, production-ready foundation for AI-powered applications.

By Supacrawler Team
Published on September 7, 2025