A Developer's Guide to the Supacrawler Crawl API

Building a production-grade web crawler is a massive engineering task: managing infrastructure, rendering JavaScript, and navigating anti-bot measures. The Supacrawler Crawl API handles this complexity for you.

By the end of this guide, you will understand how to use the Crawl API to solve common, high-value problems, with complete code recipes for:

  1. Building a RAG knowledge base from a documentation site.
  2. Performing a site-wide SEO audit and exporting the results to a CSV.

Core Concept: Asynchronous Crawl Jobs

The Crawl API is asynchronous. You submit a URL and a set of rules (such as depth and URL patterns), which creates a job. The job runs in the background; you can poll for its status or use the SDKs' built-in waitForCrawl helpers (wait_for_crawl in the Python SDK used throughout this guide).

graph TD
A["1. API Request (URL + Rules)"] --> B{"Cloud Crawler Job"};
B --> C["2. Map & Process Pages"];
C --> D["3. Extract Structured Data"];
D --> E["4. Return Complete JSON"];
style B fill:#222,stroke:#333,stroke-width:2px;
style E fill:#0c0,stroke:#333,stroke-width:2px;
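
In code, the lifecycle is just two calls with the Python SDK used throughout this guide: create the job, then wait for it to finish. A minimal sketch (https://example.com is a placeholder, and the small depth and link_limit are deliberately conservative for a first test):

from supacrawler import SupacrawlerClient

client = SupacrawlerClient(api_key='YOUR_API_KEY')

# 1. Submit the URL and rules. This returns immediately with a job ID.
job = client.create_crawl_job(url='https://example.com', depth=2, link_limit=20)

# 2. Block until the background job finishes (the SDK handles the polling).
result = client.wait_for_crawl(job.job_id)
print(result.status)  # 'completed' on success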

Key Parameters for Precise Crawling

You control the crawler's behavior with a few key parameters:

  • url: The starting point for the crawl.
  • depth: How many links deep to follow from the url. A depth of 1 gets the start page and any pages it links to directly.
  • link_limit: A hard cap on the total number of pages to crawl. This is a crucial safety mechanism.
  • include_patterns: An array of regex strings. The crawler will only follow and process URLs matching these patterns. Use this for precise targeting (see the quick pattern check after this list).
  • render_js: Set to true for Single-Page Applications (SPAs) or any site that loads content dynamically with JavaScript.
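
Because include_patterns are plain regex strings, you can sanity-check them locally with Python's re module before launching a crawl. A quick local preview (server-side matching may differ in small ways, e.g. full URL vs. path, so treat this as a rough check rather than the API's exact behavior):

import re

patterns = ['/docs/.*', '/api/.*']
urls = [
    'https://docs.example-framework.com/docs/getting-started',
    'https://docs.example-framework.com/blog/release-notes',
    'https://docs.example-framework.com/api/reference',
]

# Keep only URLs that match at least one include pattern.
matched = [u for u in urls if any(re.search(p, u) for p in patterns)]
print(matched)  # the /blog/ URL is filtered out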

Recipe 1: Building a Knowledge Base for a RAG System

Goal: Crawl a documentation site and extract clean Markdown, ready for chunking and embedding.

This configuration targets a docs site, crawls up to 4 links deep with a max of 500 pages, focuses only on relevant content paths, and enables JS rendering.

Crawl a Documentation Site for RAG

from supacrawler import SupacrawlerClient
import json

client = SupacrawlerClient(api_key='YOUR_API_KEY')

# Create a comprehensive crawl of the documentation
job = client.create_crawl_job(
    url='https://docs.example-framework.com',
    format='markdown',
    depth=4,
    link_limit=500,
    include_patterns=['/docs/.*', '/api/.*', '/tutorials/.*'],
    render_js=True
)
print(f"Crawl job started with ID: {job.job_id}")

result = client.wait_for_crawl(job.job_id)

# Process the results into a clean list of documents
knowledge_base = []
if result.status == 'completed':
    crawl_data = result.data.get('crawl_data', {})
    for url, page_data in crawl_data.items():
        if page_data.get('status_code') == 200 and page_data.get('markdown'):
            knowledge_base.append({
                'url': url,
                'title': page_data.get('metadata', {}).get('title', ''),
                'content': page_data.get('markdown', '')
            })

    # Save the knowledge base to a JSON file
    with open('knowledge_base.json', 'w') as f:
        json.dump(knowledge_base, f, indent=2)
    print(f"Knowledge base created with {len(knowledge_base)} documents.")
else:
    print(f"Crawl failed with status: {result.status}")
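
From here, the usual next step for RAG is to split each document into chunks before embedding. A minimal sketch using fixed-size character windows with overlap (the sizes are illustrative defaults, not anything the API prescribes; swap in your preferred splitter):

import json

def chunk_text(text, size=1000, overlap=200):
    # Slide a fixed-size window over the text with some overlap,
    # so sentences straddling a boundary appear in both chunks.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

with open('knowledge_base.json') as f:
    docs = json.load(f)

chunks = [
    {'url': doc['url'], 'title': doc['title'], 'text': piece}
    for doc in docs
    for piece in chunk_text(doc['content'])
]
print(f"{len(chunks)} chunks ready for embedding.")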

Recipe 2: Performing a Site-Wide SEO Audit

Goal: Crawl a website, extract the URL, title, meta description, and H1 tag for every page, and save it to a CSV.

This script sets include_html=True to get the raw HTML needed for H1 extraction.

Perform a Site-Wide SEO Audit

from supacrawler import SupacrawlerClient
import csv
import re

client = SupacrawlerClient(api_key='YOUR_API_KEY')

def run_seo_audit(start_url):
    job = client.create_crawl_job(
        url=start_url,
        depth=5,
        link_limit=1000,
        include_html=True  # We need HTML to extract H1 tags
    )
    print(f"SEO Audit crawl started: {job.job_id}")

    result = client.wait_for_crawl(job.job_id)
    if result.status != 'completed':
        print(f"Crawl failed: {result.status}")
        return

    seo_data = []
    crawl_data = result.data.get('crawl_data', {})
    for url, page in crawl_data.items():
        if page.get('status_code') == 200 and page.get('metadata'):
            # Basic H1 extraction from the raw HTML.
            h1_match = re.search(r'<h1[^>]*>([\s\S]*?)</h1>', page.get('html', ''), re.IGNORECASE)
            seo_data.append({
                'url': url,
                'title': page.get('metadata', {}).get('title', ''),
                'meta_description': page.get('metadata', {}).get('meta_description', ''),
                'h1': h1_match.group(1).strip() if h1_match else ''
            })

    if not seo_data:
        print("No data to write to CSV.")
        return

    # Convert to CSV and save
    with open('seo_audit.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=seo_data[0].keys())
        writer.writeheader()
        writer.writerows(seo_data)
    print(f"SEO Audit complete! {len(seo_data)} pages saved to seo_audit.csv")

run_seo_audit('https://example.com')
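
With seo_audit.csv on disk, flagging the most common on-page issues takes only the standard library. A small follow-up sketch (the 160-character description threshold is a conventional guideline, not something the API reports):

import csv

with open('seo_audit.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

# Flag pages with missing or weak on-page elements.
issues = {
    'missing_title': [r['url'] for r in rows if not r['title']],
    'missing_description': [r['url'] for r in rows if not r['meta_description']],
    'missing_h1': [r['url'] for r in rows if not r['h1']],
    'long_description': [r['url'] for r in rows if len(r['meta_description']) > 160],
}

for issue, urls in issues.items():
    print(f"{issue}: {len(urls)} pages")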

Production Best Practices

  • Start Small: Test your configuration with a small depth (e.g., 2) and link_limit (e.g., 20) before launching a large crawl.
  • Use Specific Patterns: The more specific your include_patterns, the faster and more efficient your crawl will be.
  • Handle Failures: A job's status can be failed, and the statistics object in a completed job reports failed_pages. Build retry logic for URLs that return non-200 status codes (see the sketch after this list).
  • Save Job IDs: Store the job_id so you can retrieve results later without re-crawling.
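
For the failure-handling point above, one simple pattern is to collect the non-200 URLs from a finished job and re-crawl each one as its own tiny job. A sketch reusing only the SDK calls shown in the recipes (link_limit=1 keeps each retry to a single page; adjust to however you prefer to re-fetch individual URLs):

from supacrawler import SupacrawlerClient

client = SupacrawlerClient(api_key='YOUR_API_KEY')

def retry_failed_pages(crawl_data):
    # crawl_data is the url -> page dict from a completed job, as in the recipes.
    failed_urls = [
        url for url, page in crawl_data.items()
        if page.get('status_code') != 200
    ]
    for url in failed_urls:
        # Re-crawl just this page: link_limit=1 caps the job at one page.
        retry_job = client.create_crawl_job(url=url, depth=1, link_limit=1)
        retry_result = client.wait_for_crawl(retry_job.job_id)
        print(f"Retry of {url}: {retry_result.status}")
    return failed_urls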

Next Steps

This guide covers the core functionality of the Crawl API. To explore all available parameters and options, check out the full API documentation.

By Supacrawler Team
Published on August 28, 2025