A Developer's Guide to the Supacrawler Crawl API

Building a production-grade web crawler is a massive engineering task: managing infrastructure, rendering JavaScript, and navigating anti-bot measures. The Supacrawler Crawl API handles this complexity for you.

By the end of this guide, you will understand how to use the Crawl API to solve common, high-value problems, with complete code recipes for:

  1. Building a RAG knowledge base from a documentation site.
  2. Performing a site-wide SEO audit and exporting the results to a CSV.

Core Concept: Asynchronous Crawl Jobs

The Crawl API is asynchronous. You submit a URL and a set of rules (such as depth and URL patterns), which creates a job. The job runs in the background; you can poll for its status or use the SDKs' built-in waitForCrawl helpers (wait_for_crawl in the Python SDK used throughout this guide).

graph TD
A["1. API Request (URL + Rules)"] --> B{"Cloud Crawler Job"};
B --> C["2. Map & Process Pages"];
C --> D["3. Extract Structured Data"];
D --> E["4. Return Complete JSON"];
style B fill:#222,stroke:#333,stroke-width:2px;
style E fill:#0c0,stroke:#333,stroke-width:2px;
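
In code, the lifecycle is just two calls with the Python SDK used throughout this guide: create the job, then wait for it to finish. A minimal sketch (https://example.com is a placeholder, and the small depth and link_limit are deliberately conservative for a first test):

from supacrawler import SupacrawlerClient

client = SupacrawlerClient(api_key='YOUR_API_KEY')

# 1. Submit the URL and rules. This returns immediately with a job ID.
job = client.create_crawl_job(url='https://example.com', depth=2, link_limit=20)

# 2. Block until the background job finishes (the SDK handles the polling).
result = client.wait_for_crawl(job.job_id)
print(result.status)  # 'completed' on success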

Key Parameters for Precise Crawling

You control the crawler's behavior with a few key parameters:

  • url: The starting point for the crawl.
  • depth: How many links deep to follow from the url. A depth of 1 gets the start page and any pages it links to directly.
  • link_limit: A hard cap on the total number of pages to crawl. This is a crucial safety mechanism.
  • include_patterns: An array of regex strings. The crawler will only follow and process URLs matching these patterns. Use this for precise targeting (see the quick pattern check after this list).
  • render_js: Set to true for Single-Page Applications (SPAs) or any site that loads content dynamically with JavaScript.
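
Because include_patterns are plain regex strings, you can sanity-check them locally with Python's re module before launching a crawl. A quick local preview (server-side matching may differ in small ways, e.g. full URL vs. path, so treat this as a rough check rather than the API's exact behavior):

import re

patterns = ['/docs/.*', '/api/.*']
urls = [
    'https://docs.example-framework.com/docs/getting-started',
    'https://docs.example-framework.com/blog/release-notes',
    'https://docs.example-framework.com/api/reference',
]

# Keep only URLs that match at least one include pattern.
matched = [u for u in urls if any(re.search(p, u) for p in patterns)]
print(matched)  # the /blog/ URL is filtered out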

Recipe 1: Building a Knowledge Base for a RAG System

Goal: Crawl a documentation site and extract clean Markdown, ready for chunking and embedding.

This configuration targets a docs site, crawls up to 4 links deep with a max of 500 pages, focuses only on relevant content paths, and enables JS rendering.

Crawl a Documentation Site for RAG

from supacrawler import SupacrawlerClient
import json

client = SupacrawlerClient(api_key='YOUR_API_KEY')

# Create a comprehensive crawl of the documentation
job = client.create_crawl_job(
    url='https://docs.example-framework.com',
    format='markdown',
    depth=4,
    link_limit=500,
    include_patterns=['/docs/.*', '/api/.*', '/tutorials/.*'],
    render_js=True
)
print(f"Crawl job started with ID: {job.job_id}")

result = client.wait_for_crawl(job.job_id)

# Process the results into a clean list of documents
knowledge_base = []
if result.status == 'completed':
    crawl_data = result.data.get('crawl_data', {})
    for url, page_data in crawl_data.items():
        if page_data.get('status_code') == 200 and page_data.get('markdown'):
            knowledge_base.append({
                'url': url,
                'title': page_data.get('metadata', {}).get('title', ''),
                'content': page_data.get('markdown', '')
            })

    # Save the knowledge base to a JSON file
    with open('knowledge_base.json', 'w') as f:
        json.dump(knowledge_base, f, indent=2)
    print(f"Knowledge base created with {len(knowledge_base)} documents.")
else:
    print(f"Crawl failed with status: {result.status}")
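
From here, the usual next step for RAG is to split each document into chunks before embedding. A minimal sketch using fixed-size character windows with overlap (the sizes are illustrative defaults, not anything the API prescribes; swap in your preferred splitter):

import json

def chunk_text(text, size=1000, overlap=200):
    # Slide a fixed-size window over the text with some overlap,
    # so sentences straddling a boundary appear in both chunks.
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

with open('knowledge_base.json') as f:
    docs = json.load(f)

chunks = [
    {'url': doc['url'], 'title': doc['title'], 'text': piece}
    for doc in docs
    for piece in chunk_text(doc['content'])
]
print(f"{len(chunks)} chunks ready for embedding.")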

Recipe 2: Performing a Site-Wide SEO Audit

Goal: Crawl a website, extract the URL, title, meta description, and H1 tag for every page, and save it to a CSV.

This script sets include_html=True to get the raw HTML needed for H1 extraction.

Perform a Site-Wide SEO Audit

from supacrawler import SupacrawlerClient
import csv
import re

client = SupacrawlerClient(api_key='YOUR_API_KEY')

def run_seo_audit(start_url):
    job = client.create_crawl_job(
        url=start_url,
        depth=5,
        link_limit=1000,
        include_html=True  # We need HTML to extract H1 tags
    )
    print(f"SEO Audit crawl started: {job.job_id}")

    result = client.wait_for_crawl(job.job_id)
    if result.status != 'completed':
        print(f"Crawl failed: {result.status}")
        return

    seo_data = []
    crawl_data = result.data.get('crawl_data', {})
    for url, page in crawl_data.items():
        if page.get('status_code') == 200 and page.get('metadata'):
            # Basic H1 extraction from the raw HTML.
            h1_match = re.search(r'<h1[^>]*>([\s\S]*?)</h1>', page.get('html', ''), re.IGNORECASE)
            seo_data.append({
                'url': url,
                'title': page.get('metadata', {}).get('title', ''),
                'meta_description': page.get('metadata', {}).get('meta_description', ''),
                'h1': h1_match.group(1).strip() if h1_match else ''
            })

    if not seo_data:
        print("No data to write to CSV.")
        return

    # Convert to CSV and save
    with open('seo_audit.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=seo_data[0].keys())
        writer.writeheader()
        writer.writerows(seo_data)
    print(f"SEO Audit complete! {len(seo_data)} pages saved to seo_audit.csv")

run_seo_audit('https://example.com')
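
With seo_audit.csv on disk, flagging the most common on-page issues takes only the standard library. A small follow-up sketch (the 160-character description threshold is a conventional guideline, not something the API reports):

import csv

with open('seo_audit.csv', newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

# Flag pages with missing or weak on-page elements.
issues = {
    'missing_title': [r['url'] for r in rows if not r['title']],
    'missing_description': [r['url'] for r in rows if not r['meta_description']],
    'missing_h1': [r['url'] for r in rows if not r['h1']],
    'long_description': [r['url'] for r in rows if len(r['meta_description']) > 160],
}

for issue, urls in issues.items():
    print(f"{issue}: {len(urls)} pages")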

Production Best Practices

  • Start Small: Test your configuration with a small depth (e.g., 2) and link_limit (e.g., 20) before launching a large crawl.
  • Use Specific Patterns: The more specific your include_patterns, the faster and more efficient your crawl will be.
  • Handle Failures: A job's status can be failed, and the statistics object in a completed job reports failed_pages. Build retry logic for URLs that return non-200 status codes (see the sketch after this list).
  • Save Job IDs: Store the job_id so you can retrieve results later without re-crawling.
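
For the failure-handling point above, one simple pattern is to collect the non-200 URLs from a finished job and re-crawl each one as its own tiny job. A sketch reusing only the SDK calls shown in the recipes (link_limit=1 keeps each retry to a single page; adjust to however you prefer to re-fetch individual URLs):

from supacrawler import SupacrawlerClient

client = SupacrawlerClient(api_key='YOUR_API_KEY')

def retry_failed_pages(crawl_data):
    # crawl_data is the url -> page dict from a completed job, as in the recipes.
    failed_urls = [
        url for url, page in crawl_data.items()
        if page.get('status_code') != 200
    ]
    for url in failed_urls:
        # Re-crawl just this page: link_limit=1 caps the job at one page.
        retry_job = client.create_crawl_job(url=url, depth=1, link_limit=1)
        retry_result = client.wait_for_crawl(retry_job.job_id)
        print(f"Retry of {url}: {retry_result.status}")
    return failed_urls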

Next Steps

This guide covers the core functionality of the Crawl API. To explore all available parameters and options, check out the full API documentation.

By Supacrawler Team
Published on August 28, 2025