Advanced Crawling Techniques: Beyond Basic Web Scraping
Web scraping a single page is straightforward. But what happens when you need to extract content from hundreds or thousands of pages across an entire website? That's where advanced crawling techniques come in.
In this comprehensive guide, we'll explore how to use Supacrawler's Crawl API to go beyond basic web scraping and implement sophisticated crawling strategies for deep site mapping, content discovery, and building comprehensive knowledge bases.
Understanding the Crawl API Architecture
Before diving into advanced techniques, it's important to understand how Supacrawler's Crawl API works under the hood. The API uses an asynchronous job-based architecture that handles the complex process of discovering, mapping, and scraping content from multiple pages in a coordinated way.
When you create a crawl job, Supacrawler:
- Maps the website structure - Discovers links from the starting URL based on your depth and pattern configurations
- Processes pages in parallel - Uses a worker pool to efficiently scrape content from discovered URLs
- Enforces limits - Respects your link limits and depth parameters to prevent runaway crawls
- Provides structured results - Returns a comprehensive dataset with content and metadata for each page
This architecture allows you to extract content from hundreds of pages with a single API call, while maintaining control over the scope and focus of your crawl.
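A minimal sketch of that lifecycle, using the Python client from the examples later in this guide (create the job, wait for it to finish, then iterate over the per-page results; replace YOUR_API_KEY with your own key):

```python
from supacrawler import SupacrawlerClient

client = SupacrawlerClient(api_key='YOUR_API_KEY')

# 1. Create the crawl job (mapping and scraping happen asynchronously server-side)
job = client.create_crawl_job(
    url='https://example.com',
    depth=2,
    link_limit=50
)

# 2. Poll until the job completes
result = client.wait_for_crawl(job.job_id, interval_seconds=5, timeout_seconds=300)

# 3. Iterate over the structured per-page results
for url, page in result.data.get('crawl_data', {}).items():
    print(url, (page.get('metadata') or {}).get('title', ''))
```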
Deep Site Mapping and Content Discovery
Technique #1: Strategic Depth Control
The `depth` parameter is one of your most powerful controls when crawling a website. It determines how many "hops" away from the starting URL the crawler will travel.
```javascript
const job = await client.createCrawlJob({
  url: 'https://example.com/docs',
  depth: 3, // Crawl the starting page, pages linked from it, and pages linked from those
  link_limit: 200
});
```
For different scenarios, consider these depth strategies:
- Depth 1: Extract just the starting page (useful for testing)
- Depth 2: Get the starting page and direct links (good for specific sections)
- Depth 3-4: Comprehensive crawl of a website section (ideal for documentation sites)
- Depth 5+: Deep exploration of complex sites (use with caution and appropriate link limits)
Technique #2: Pattern-Based Crawling
Rather than crawling an entire domain, use the `include_patterns` parameter to focus on specific sections or content types. This is particularly valuable for large sites where you only need certain areas:
```python
job = client.create_crawl_job(
    url='https://example.com',
    depth=4,
    link_limit=500,
    include_patterns=[
        '/blog/*',        # Crawl all blog posts
        '/docs/api/*',    # Crawl API documentation
        '/guides/*.html'  # Crawl HTML guides
    ],
    render_js=True
)
```
You can combine multiple patterns to create sophisticated crawling rules that precisely target the content you need while ignoring everything else.
Technique #3: Subdomain Navigation Control
Many websites spread their content across multiple subdomains. The `include_subdomains` parameter lets you control whether the crawler should follow links to subdomains of the starting URL:
```bash
curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://main.example.com",
    "depth": 3,
    "link_limit": 300,
    "include_subdomains": true,
    "include_patterns": ["/docs/*", "/blog/*"]
  }'
```
This is particularly useful for sites that use different subdomains for different content types (e.g., blog.example.com, docs.example.com).
Building Comprehensive Knowledge Bases
Technique #4: Structured Content Extraction with Format Control
When building knowledge bases or RAG systems, the format of extracted content is crucial. Supacrawler's `format` parameter gives you control over how content is processed:
```python
# Extract clean markdown (ideal for RAG systems)
job = client.create_crawl_job(
    url='https://example.com/docs',
    format='markdown',
    depth=3,
    link_limit=200
)

# Extract both markdown and HTML for maximum flexibility
job = client.create_crawl_job(
    url='https://example.com/docs',
    format='markdown',
    include_html=True,
    depth=3,
    link_limit=200
)
```
The `markdown` format is particularly valuable for knowledge bases as it:
- Preserves the semantic structure of content
- Removes unnecessary styling and scripts
- Maintains headings, lists, and other important formatting
- Works well with most vector embedding systems
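To illustrate that last point, here is a minimal, SDK-independent sketch of splitting crawled markdown into heading-delimited chunks before embedding. The `chunk_markdown` helper and the 2,000-character cap are illustrative choices, not part of the Crawl API:

```python
import re

def chunk_markdown(markdown_text, max_chars=2000):
    """Split markdown on level-1/2 headings, then cap the size of each chunk."""
    sections = re.split(r'\n(?=#{1,2} )', markdown_text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # Fall back to a simple character cap for very long sections
        for start in range(0, len(section), max_chars):
            chunks.append(section[start:start + max_chars])
    return chunks

# Example usage with a page from a crawl result:
# page_markdown = result.data['crawl_data']['https://example.com/docs']['markdown']
# chunks = chunk_markdown(page_markdown)
```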
Technique #5: JavaScript Rendering for Dynamic Content
Many modern websites load their content dynamically using JavaScript. For these sites, enable the `render_js` parameter to ensure you capture the fully rendered content:
```javascript
const job = await client.createCrawlJob({
  url: 'https://spa-example.com',
  format: 'markdown',
  depth: 2,
  link_limit: 50,
  render_js: true // Enable JavaScript rendering
});
```
This technique is essential for single-page applications (SPAs), documentation sites built with frameworks like Docusaurus or VuePress, and other JavaScript-heavy websites.
Technique #6: Streaming Processing for Large Sites
For very large sites, Supacrawler uses an efficient streaming approach that processes pages as they're discovered rather than waiting for the entire site map to be built first. This provides several benefits:
- Faster time to first results - You get content from the first pages while others are still being processed
- Better memory efficiency - The system doesn't need to hold the entire site map in memory
- More precise limit control - The crawler can stop exactly at your specified link limit
This streaming architecture is automatically used for all crawl jobs, making it possible to efficiently process large sites without overwhelming your systems.
Real-World Applications
Building a Documentation Knowledge Base
Let's look at a complete example of building a knowledge base from a documentation site:
```python
from supacrawler import SupacrawlerClient
import json

client = SupacrawlerClient(api_key='YOUR_API_KEY')

# Create a comprehensive crawl of a documentation site
job = client.create_crawl_job(
    url='https://framework-docs.example.com',
    format='markdown',
    depth=4,
    link_limit=500,
    include_patterns=['/docs/*', '/api/*', '/tutorials/*'],
    render_js=True,
    include_subdomains=False
)

# Wait for the job to complete
result = client.wait_for_crawl(job.job_id, interval_seconds=5, timeout_seconds=600)

# Process the results into a knowledge base format
knowledge_base = []
for url, page_data in result.data.get('crawl_data', {}).items():
    knowledge_base.append({
        'url': url,
        'title': (page_data.get('metadata') or {}).get('title', ''),
        'content': page_data.get('markdown', ''),
        'status_code': (page_data.get('metadata') or {}).get('status_code')
    })

# Save the knowledge base to a JSON file
with open('documentation_knowledge_base.json', 'w') as f:
    json.dump(knowledge_base, f, indent=2)

print(f"Knowledge base created with {len(knowledge_base)} documents")
```
This script creates a complete knowledge base from a documentation site, ready to be imported into a vector database or RAG system.
Competitive Analysis Dataset
Another powerful application is creating a comprehensive dataset of competitor content:
```javascript
import { SupacrawlerClient } from '@supacrawler/js';
import fs from 'fs';

const client = new SupacrawlerClient({ apiKey: process.env.SUPACRAWLER_API_KEY });

async function createCompetitorDataset(competitorUrl, outputFilename) {
  // Create a focused crawl of competitor content
  const job = await client.createCrawlJob({
    url: competitorUrl,
    format: 'markdown',
    depth: 3,
    link_limit: 200,
    include_patterns: ['/features/*', '/solutions/*', '/case-studies/*'],
    render_js: true
  });
  console.log(`Crawl job created: ${job.job_id}`);

  // Wait for completion
  const result = await client.waitForCrawl(job.job_id, {
    intervalMs: 5000,
    timeoutMs: 600000
  });

  // Extract statistics
  const stats = result.data?.statistics || {};
  console.log(`Crawl completed: ${stats.successful_pages || 0} pages successful, ${stats.failed_pages || 0} failed`);

  // Save the complete dataset
  fs.writeFileSync(outputFilename, JSON.stringify(result, null, 2));
  console.log(`Dataset saved to ${outputFilename}`);

  return result;
}

// Create datasets for multiple competitors
async function analyzeCompetitors() {
  await createCompetitorDataset('https://competitor1.com', 'competitor1_dataset.json');
  await createCompetitorDataset('https://competitor2.com', 'competitor2_dataset.json');
  await createCompetitorDataset('https://competitor3.com', 'competitor3_dataset.json');
}

analyzeCompetitors().catch(console.error);
```
This script creates comprehensive datasets of competitor content, which can be used for feature comparison, messaging alignment, or gap analysis.
Advanced Configuration Techniques
Optimizing Worker Pools
Supacrawler automatically adjusts its worker pool based on your crawl configuration. For standard crawls, it uses up to 10 parallel workers to process pages efficiently. However, when JavaScript rendering is enabled, it reduces to 2 workers to prevent overloading browser instances.
This automatic optimization ensures efficient processing while maintaining system stability. For most use cases, you don't need to worry about this detail - the system handles it for you.
Handling Timeouts and Large Crawls
For very large crawls, Supacrawler implements a safety timeout of 2 minutes to prevent runaway processes. If you need to crawl extremely large sites, consider breaking your crawl into multiple jobs focused on different sections of the site.
For example, instead of crawling an entire documentation site at once, you might create separate crawl jobs for each major section:
```python
# Crawl API documentation
api_job = client.create_crawl_job(
    url='https://example.com/docs/api',
    depth=3,
    link_limit=200,
    include_patterns=['/docs/api/*']
)

# Crawl tutorials separately
tutorials_job = client.create_crawl_job(
    url='https://example.com/docs/tutorials',
    depth=3,
    link_limit=200,
    include_patterns=['/docs/tutorials/*']
)

# Crawl guides separately
guides_job = client.create_crawl_job(
    url='https://example.com/docs/guides',
    depth=3,
    link_limit=200,
    include_patterns=['/docs/guides/*']
)
```
This approach not only helps manage timeouts but also gives you more granular control over what content is included in each dataset.
Best Practices for Production Crawls
Based on our experience running millions of crawl jobs, here are some best practices to ensure reliable and efficient crawling:
- Start small and scale up - Begin with a small `depth` and `link_limit` to test your configuration before launching larger crawls
- Use specific patterns - The more specific your `include_patterns`, the more focused and efficient your crawl will be
- Enable JavaScript rendering selectively - Only use `render_js: true` when you know the target site requires it, as it's more resource-intensive
- Implement exponential backoff for polling - When checking job status, start with short intervals but increase them if the job is taking longer (see the sketch after this list)
- Save job IDs for important crawls - Store job IDs in your system so you can retrieve results later without re-crawling
- Respect robots.txt - Supacrawler automatically respects robots.txt directives, but be mindful of crawl frequency on production sites
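Here is one way to sketch that backoff pattern in Python. It assumes a status-check call (shown here as `client.get_crawl`) comparable to the wait helpers used in the examples above, so adjust the method name and status fields to whatever your client actually exposes:

```python
import time

def poll_with_backoff(client, job_id, initial_interval=2, max_interval=30, timeout=600):
    """Poll a crawl job, doubling the wait between checks up to max_interval seconds."""
    interval = initial_interval
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = client.get_crawl(job_id)  # assumed status-check method; adjust to your SDK
        if getattr(status, 'status', None) in ('completed', 'failed'):
            return status
        time.sleep(interval)
        interval = min(interval * 2, max_interval)  # exponential backoff with a ceiling
    raise TimeoutError(f"Crawl job {job_id} did not finish within {timeout} seconds")
```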
Conclusion: Beyond Basic Scraping
The techniques covered in this guide demonstrate how Supacrawler's Crawl API goes far beyond basic web scraping. By leveraging these advanced crawling capabilities, you can:
- Map entire websites and discover hidden content
- Build comprehensive knowledge bases for RAG systems
- Create structured datasets for competitive analysis
- Extract content from dynamic, JavaScript-heavy sites
- Process hundreds of pages with a single API call
Whether you're building an AI agent that needs web data, creating a knowledge base for your organization, or analyzing competitor content, these advanced crawling techniques provide the foundation for sophisticated data extraction at scale.
Ready to start crawling? Sign up for a Supacrawler account and try these techniques with our generous free tier, or check out our documentation for more details on the Crawl API.