A Developer's Guide to the Supacrawler Crawl API
Building a production-grade web crawler is a massive engineering task involving infrastructure management, JavaScript rendering, and anti-bot navigation. The Supacrawler Crawl API handles this complexity for you.
By the end of this guide, you will understand how to use the Crawl API to solve common, high-value problems, with complete code recipes for:
- Building a RAG knowledge base from a documentation site.
- Performing a site-wide SEO audit and exporting the results to a CSV.
Core Concept: Asynchronous Crawl Jobs
The Crawl API is asynchronous. You submit a URL and a set of rules (like depth and URL patterns), which creates a job. This job runs in the background, and you can poll for its status or use our SDKs' built-in `waitForCrawl` helpers.
```mermaid
graph TD
  A["1. API Request (URL + Rules)"] --> B{"Cloud Crawler Job"}
  B --> C["2. Map & Process Pages"]
  C --> D["3. Extract Structured Data"]
  D --> E["4. Return Complete JSON"]
  style B fill:#222,stroke:#333,stroke-width:2px
  style E fill:#0c0,stroke:#333,stroke-width:2px
```
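In code, that lifecycle is just two calls. Below is a minimal sketch using the Python SDK method names that appear in the recipes later in this guide (`create_crawl_job` and `wait_for_crawl`); the start URL and limits are placeholder values.

```python
from supacrawler import SupacrawlerClient

client = SupacrawlerClient(api_key='YOUR_API_KEY')

# 1. Submit the crawl. This returns immediately with a job ID
#    while the crawl runs in the background.
job = client.create_crawl_job(
    url='https://example.com',  # placeholder start URL
    depth=2,
    link_limit=20,
)
print(f"Job queued: {job.job_id}")

# 2. Wait for completion. The SDK helper polls the job status for you;
#    you can also store job.job_id and check on the job later.
result = client.wait_for_crawl(job.job_id)
print(result.status)  # 'completed' or 'failed'
```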
Key Parameters for Precise Crawling
You control the crawler's behavior with a few key parameters (combined in the example after this list):
- `url`: The starting point for the crawl.
- `depth`: How many links deep to follow from the `url`. A `depth` of 1 gets the start page and any pages it links to directly.
- `link_limit`: A hard cap on the total number of pages to crawl. This is a crucial safety mechanism.
- `include_patterns`: An array of regex strings. The crawler will only follow and process URLs matching these patterns. Use this for precise targeting.
- `render_js`: Set to `true` for Single-Page Applications (SPAs) or any site that loads content dynamically with JavaScript.
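Putting these together, a typical request might look like the sketch below (the URL and pattern are placeholders; adjust the values for your own site):

```python
from supacrawler import SupacrawlerClient

client = SupacrawlerClient(api_key='YOUR_API_KEY')

job = client.create_crawl_job(
    url='https://example.com/blog',   # starting point for the crawl
    depth=2,                          # follow links up to two levels from the start URL
    link_limit=50,                    # hard safety cap on total pages
    include_patterns=['/blog/.*'],    # only follow URLs under /blog/
    render_js=True,                   # render JavaScript before extracting content
)
```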
Recipe 1: Building a Knowledge Base for a RAG System
Goal: Crawl a documentation site and extract clean Markdown, ready for chunking and embedding.
This configuration targets a docs site, crawls up to 4 links deep with a max of 500 pages, focuses only on relevant content paths, and enables JS rendering.
Crawl a Documentation Site for RAG
```python
from supacrawler import SupacrawlerClient
import json

client = SupacrawlerClient(api_key='YOUR_API_KEY')

# Create a comprehensive crawl of the documentation
job = client.create_crawl_job(
    url='https://docs.example-framework.com',
    format='markdown',
    depth=4,
    link_limit=500,
    include_patterns=['/docs/.*', '/api/.*', '/tutorials/.*'],
    render_js=True
)
print(f"Crawl job started with ID: {job.job_id}")

result = client.wait_for_crawl(job.job_id)

# Process the results into a clean list of documents
knowledge_base = []
if result.status == 'completed':
    crawl_data = result.data.get('crawl_data', {})
    for url, page_data in crawl_data.items():
        if page_data.get('status_code') == 200 and page_data.get('markdown'):
            knowledge_base.append({
                'url': url,
                'title': page_data.get('metadata', {}).get('title', ''),
                'content': page_data.get('markdown', '')
            })

    # Save the knowledge base to a JSON file
    with open('knowledge_base.json', 'w') as f:
        json.dump(knowledge_base, f, indent=2)

    print(f"Knowledge base created with {len(knowledge_base)} documents.")
else:
    print(f"Crawl failed with status: {result.status}")
```
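The Markdown in `knowledge_base.json` still needs to be split into chunks before embedding. As a rough, library-free illustration (fixed-size character windows with overlap; in practice you may prefer a structure-aware splitter), you could do something like:

```python
import json

def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping fixed-size character windows."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

with open('knowledge_base.json') as f:
    documents = json.load(f)

chunks = []
for doc in documents:
    for i, piece in enumerate(chunk_text(doc['content'])):
        chunks.append({
            'url': doc['url'],        # keep the source URL for citations
            'title': doc['title'],
            'chunk_index': i,
            'text': piece,
        })

print(f"Produced {len(chunks)} chunks from {len(documents)} documents.")
```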
Recipe 2: Performing a Site-Wide SEO Audit
Goal: Crawl a website, extract the URL, title, meta description, and H1 tag for every page, and save it to a CSV.
This script uses `include_html: true` to get the raw HTML needed for H1 extraction.
Perform a Site-Wide SEO Audit
```python
from supacrawler import SupacrawlerClient
import csv
import re

client = SupacrawlerClient(api_key='YOUR_API_KEY')

def run_seo_audit(start_url):
    job = client.create_crawl_job(
        url=start_url,
        depth=5,
        link_limit=1000,
        include_html=True  # We need HTML to extract H1 tags
    )
    print(f"SEO Audit crawl started: {job.job_id}")

    result = client.wait_for_crawl(job.job_id)
    if result.status != 'completed':
        print(f"Crawl failed: {result.status}")
        return

    seo_data = []
    crawl_data = result.data.get('crawl_data', {})
    for url, page in crawl_data.items():
        if page.get('status_code') == 200 and page.get('metadata'):
            # Basic H1 extraction.
            h1_match = re.search(r'<h1[^>]*>([\s\S]*?)<\/h1>', page.get('html', ''), re.IGNORECASE)
            seo_data.append({
                'url': url,
                'title': page.get('metadata', {}).get('title', ''),
                'meta_description': page.get('metadata', {}).get('meta_description', ''),
                'h1': h1_match.group(1).strip() if h1_match else ''
            })

    # Convert to CSV and save
    with open('seo_audit.csv', 'w', newline='', encoding='utf-8') as f:
        if not seo_data:
            print("No data to write to CSV.")
            return
        writer = csv.DictWriter(f, fieldnames=seo_data[0].keys())
        writer.writeheader()
        writer.writerows(seo_data)

    print(f"SEO Audit complete! {len(seo_data)} pages saved to seo_audit.csv")

run_seo_audit('https://example.com')
```
Production Best Practices
- Start Small: Test your configuration with a small `depth` (e.g., 2) and `link_limit` (e.g., 20) before launching a large crawl.
- Use Specific Patterns: The more specific your `include_patterns`, the faster and more efficient your crawl will be.
- Handle Failures: The job `status` can be `failed`. The `statistics` object in a `completed` job will show `failed_pages`. Build retry logic for URLs that return non-200 status codes (see the sketch after this list).
- Save Job IDs: Store the `job_id` so you can retrieve results later without re-crawling.
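For example, continuing from either recipe above, a completed job can be inspected for pages that need a retry. This is a sketch only: it reuses the `client` and `job` objects and result shape from the recipes, and the exact layout of the `statistics` object is an assumption here, so check it against the API documentation.

```python
result = client.wait_for_crawl(job.job_id)

if result.status == 'completed':
    crawl_data = result.data.get('crawl_data', {})
    stats = result.data.get('statistics', {})  # assumed location/shape of the statistics object
    print(f"Failed pages reported: {stats.get('failed_pages', 0)}")

    # Collect URLs that did not return 200 so they can be retried later,
    # e.g. with a fresh, narrowly scoped crawl job targeting just those paths.
    failed_urls = [
        url for url, page in crawl_data.items()
        if page.get('status_code') != 200
    ]
    print(f"{len(failed_urls)} URLs to retry")
elif result.status == 'failed':
    print(f"Job {job.job_id} failed; retry with a smaller depth or link_limit.")
```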
Next Steps
This guide covers the core functionality of the Crawl API. To explore all available parameters and options, check out the full API documentation.