Advanced Crawling Techniques: Beyond Basic Web Scraping

Web scraping a single page is straightforward. But what happens when you need to extract content from hundreds or thousands of pages across an entire website? That's where advanced crawling techniques come in.

In this comprehensive guide, we'll explore how to use Supacrawler's Crawl API to go beyond basic web scraping and implement sophisticated crawling strategies for deep site mapping, content discovery, and building comprehensive knowledge bases.

Understanding the Crawl API Architecture

Before diving into advanced techniques, it's important to understand how Supacrawler's Crawl API works under the hood. The API uses an asynchronous job-based architecture that handles the complex process of discovering, mapping, and scraping content from multiple pages in a coordinated way.

When you create a crawl job, Supacrawler:

  1. Maps the website structure - Discovers links from the starting URL based on your depth and pattern configurations
  2. Processes pages in parallel - Uses a worker pool to efficiently scrape content from discovered URLs
  3. Enforces limits - Respects your link limits and depth parameters to prevent runaway crawls
  4. Provides structured results - Returns a comprehensive dataset with content and metadata for each page

This architecture allows you to extract content from hundreds of pages with a single API call, while maintaining control over the scope and focus of your crawl.
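
In practice, the flow is: create a job, wait for it to finish, then read the per-page results. Here is a minimal sketch using the Python client calls shown later in this guide (create_crawl_job, wait_for_crawl); the URL, depth, and limits are placeholders.

from supacrawler import SupacrawlerClient

client = SupacrawlerClient(api_key='YOUR_API_KEY')

# 1. Create an asynchronous crawl job (placeholder URL and limits)
job = client.create_crawl_job(
    url='https://example.com/docs',
    format='markdown',
    depth=2,
    link_limit=50
)

# 2. Poll until the job completes (or the timeout is reached)
result = client.wait_for_crawl(job.job_id, interval_seconds=5, timeout_seconds=300)

# 3. Iterate over the structured per-page results
for url, page_data in result.data.get('crawl_data', {}).items():
    print(url, len(page_data.get('markdown', '')))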

Deep Site Mapping and Content Discovery

Technique #1: Strategic Depth Control

The depth parameter is one of your most powerful controls when crawling a website. It determines how many "hops" away from the starting URL the crawler will travel.

const job = await client.createCrawlJob({
  url: 'https://example.com/docs',
  depth: 3, // Crawl the starting page, pages linked from it, and pages linked from those
  link_limit: 200
});

For different scenarios, consider these depth strategies:

  • Depth 1: Extract just the starting page (useful for testing; see the sketch after this list)
  • Depth 2: Get the starting page and direct links (good for specific sections)
  • Depth 3-4: Comprehensive crawl of a website section (ideal for documentation sites)
  • Depth 5+: Deep exploration of complex sites (use with caution and appropriate link limits)
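
A common pattern, echoing the "start small and scale up" advice later in this guide, is to validate a configuration with a shallow crawl before committing to a deeper one. This is a minimal sketch, reusing the client from the earlier example and assuming the statistics field matches the shape shown in the JavaScript example further down.

# Validate the configuration with a shallow, cheap crawl first
test_job = client.create_crawl_job(
    url='https://example.com/docs',
    depth=1,
    link_limit=10
)
test_result = client.wait_for_crawl(test_job.job_id, interval_seconds=5, timeout_seconds=120)
print(test_result.data.get('statistics', {}))  # sanity-check page counts before scaling up

# Once the test looks right, launch the full crawl with a larger depth and limit
full_job = client.create_crawl_job(
    url='https://example.com/docs',
    depth=3,
    link_limit=200
)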

Technique #2: Pattern-Based Crawling

Rather than crawling an entire domain, use the include_patterns parameter to focus on specific sections or content types. This is particularly valuable for large sites where you only need certain areas:

job = client.create_crawl_job(
    url='https://example.com',
    depth=4,
    link_limit=500,
    include_patterns=[
        '/blog/*',        # Crawl all blog posts
        '/docs/api/*',    # Crawl API documentation
        '/guides/*.html'  # Crawl HTML guides
    ],
    render_js=True
)

You can combine multiple patterns to create sophisticated crawling rules that precisely target the content you need while ignoring everything else.

Technique #3: Subdomain Navigation Control

Many websites spread their content across multiple subdomains. The include_subdomains parameter lets you control whether the crawler should follow links to subdomains of the starting URL:

curl https://api.supacrawler.com/api/v1/crawl \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://main.example.com",
    "depth": 3,
    "link_limit": 300,
    "include_subdomains": true,
    "include_patterns": ["/docs/*", "/blog/*"]
  }'

This is particularly useful for sites that use different subdomains for different content types (e.g., blog.example.com, docs.example.com).

Building Comprehensive Knowledge Bases

Technique #4: Structured Content Extraction with Format Control

When building knowledge bases or RAG systems, the format of extracted content is crucial. Supacrawler's format parameter gives you control over how content is processed:

# Extract clean markdown (ideal for RAG systems)
job = client.create_crawl_job(
    url='https://example.com/docs',
    format='markdown',
    depth=3,
    link_limit=200
)

# Extract both markdown and HTML for maximum flexibility
job = client.create_crawl_job(
    url='https://example.com/docs',
    format='markdown',
    include_html=True,
    depth=3,
    link_limit=200
)

The markdown format is particularly valuable for knowledge bases as it:

  • Preserves the semantic structure of content
  • Removes unnecessary styling and scripts
  • Maintains headings, lists, and other important formatting
  • Works well with most vector embedding systems (a chunking sketch follows this list)
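
As a concrete illustration, crawled markdown can be split on headings into chunks before embedding. Below is a minimal sketch with a naive heading-based splitter; it assumes a result returned by wait_for_crawl with the crawl_data shape used in the knowledge-base example later in this guide, and the chunking rules are placeholders you would tune for your own pipeline.

import re

def chunk_markdown(markdown_text):
    """Naively split a markdown document into chunks at H1/H2 headings."""
    chunks = re.split(r'\n(?=#{1,2} )', markdown_text)
    return [chunk.strip() for chunk in chunks if chunk.strip()]

# Build embedding-ready chunks from a completed crawl result
documents = []
for url, page_data in result.data.get('crawl_data', {}).items():
    for chunk in chunk_markdown(page_data.get('markdown', '')):
        documents.append({'url': url, 'text': chunk})

print(f"Prepared {len(documents)} chunks for embedding")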

Technique #5: JavaScript Rendering for Dynamic Content

Many modern websites load their content dynamically using JavaScript. For these sites, enable the render_js parameter to ensure you capture the fully rendered content:

const job = await client.createCrawlJob({
  url: 'https://spa-example.com',
  format: 'markdown',
  depth: 2,
  link_limit: 50,
  render_js: true // Enable JavaScript rendering
});

This technique is essential for single-page applications (SPAs), documentation sites built with frameworks like Docusaurus or VuePress, and other JavaScript-heavy websites.

Technique #6: Streaming Processing for Large Sites

For very large sites, Supacrawler uses an efficient streaming approach that processes pages as they're discovered rather than waiting for the entire site map to be built first. This provides several benefits:

  1. Faster time to first results - You get content from the first pages while others are still being processed
  2. Better memory efficiency - The system doesn't need to hold the entire site map in memory
  3. More precise limit control - The crawler can stop exactly at your specified link limit

This streaming architecture is automatically used for all crawl jobs, making it possible to efficiently process large sites without overwhelming your systems.

Real-World Applications

Building a Documentation Knowledge Base

Let's look at a complete example of building a knowledge base from a documentation site:

from supacrawler import SupacrawlerClient
import json

client = SupacrawlerClient(api_key='YOUR_API_KEY')

# Create a comprehensive crawl of a documentation site
job = client.create_crawl_job(
    url='https://framework-docs.example.com',
    format='markdown',
    depth=4,
    link_limit=500,
    include_patterns=['/docs/*', '/api/*', '/tutorials/*'],
    render_js=True,
    include_subdomains=False
)

# Wait for the job to complete
result = client.wait_for_crawl(job.job_id, interval_seconds=5, timeout_seconds=600)

# Process the results into a knowledge base format
knowledge_base = []
for url, page_data in result.data.get('crawl_data', {}).items():
    knowledge_base.append({
        'url': url,
        'title': (page_data.get('metadata') or {}).get('title', ''),
        'content': page_data.get('markdown', ''),
        'status_code': (page_data.get('metadata') or {}).get('status_code')
    })

# Save the knowledge base to a JSON file
with open('documentation_knowledge_base.json', 'w') as f:
    json.dump(knowledge_base, f, indent=2)

print(f"Knowledge base created with {len(knowledge_base)} documents")

This script creates a complete knowledge base from a documentation site, ready to be imported into a vector database or RAG system.

Competitive Analysis Dataset

Another powerful application is creating a comprehensive dataset of competitor content:

import { SupacrawlerClient } from '@supacrawler/js';
import fs from 'fs';

const client = new SupacrawlerClient({ apiKey: process.env.SUPACRAWLER_API_KEY });

async function createCompetitorDataset(competitorUrl, outputFilename) {
  // Create a focused crawl of competitor content
  const job = await client.createCrawlJob({
    url: competitorUrl,
    format: 'markdown',
    depth: 3,
    link_limit: 200,
    include_patterns: ['/features/*', '/solutions/*', '/case-studies/*'],
    render_js: true
  });
  console.log(`Crawl job created: ${job.job_id}`);

  // Wait for completion
  const result = await client.waitForCrawl(job.job_id, {
    intervalMs: 5000,
    timeoutMs: 600000
  });

  // Extract statistics
  const stats = result.data?.statistics || {};
  console.log(`Crawl completed: ${stats.successful_pages || 0} pages successful, ${stats.failed_pages || 0} failed`);

  // Save the complete dataset
  fs.writeFileSync(outputFilename, JSON.stringify(result, null, 2));
  console.log(`Dataset saved to ${outputFilename}`);
  return result;
}

// Create datasets for multiple competitors
async function analyzeCompetitors() {
  await createCompetitorDataset('https://competitor1.com', 'competitor1_dataset.json');
  await createCompetitorDataset('https://competitor2.com', 'competitor2_dataset.json');
  await createCompetitorDataset('https://competitor3.com', 'competitor3_dataset.json');
}

analyzeCompetitors().catch(console.error);

This script creates comprehensive datasets of competitor content, which can be used for feature comparison, messaging alignment, or gap analysis.

Advanced Configuration Techniques

Optimizing Worker Pools

Supacrawler automatically adjusts its worker pool based on your crawl configuration. For standard crawls, it uses up to 10 parallel workers to process pages efficiently. However, when JavaScript rendering is enabled, it reduces to 2 workers to prevent overloading browser instances.

This automatic optimization ensures efficient processing while maintaining system stability. For most use cases, you don't need to worry about this detail - the system handles it for you.

Handling Timeouts and Large Crawls

For very large crawls, Supacrawler implements a safety timeout of 2 minutes to prevent runaway processes. If you need to crawl extremely large sites, consider breaking your crawl into multiple jobs focused on different sections of the site.

For example, instead of crawling an entire documentation site at once, you might create separate crawl jobs for each major section:

# Crawl API documentation
api_job = client.create_crawl_job(
    url='https://example.com/docs/api',
    depth=3,
    link_limit=200,
    include_patterns=['/docs/api/*']
)

# Crawl tutorials separately
tutorials_job = client.create_crawl_job(
    url='https://example.com/docs/tutorials',
    depth=3,
    link_limit=200,
    include_patterns=['/docs/tutorials/*']
)

# Crawl guides separately
guides_job = client.create_crawl_job(
    url='https://example.com/docs/guides',
    depth=3,
    link_limit=200,
    include_patterns=['/docs/guides/*']
)

This approach not only helps manage timeouts but also gives you more granular control over what content is included in each dataset.
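
If each section runs as its own job, the per-section results can be merged into a single dataset afterwards. A minimal sketch, assuming the same wait_for_crawl helper and crawl_data structure used in the knowledge-base example above:

# Wait for the three section jobs and merge their results into one dataset
section_jobs = {'api': api_job, 'tutorials': tutorials_job, 'guides': guides_job}
merged_dataset = []

for section, job in section_jobs.items():
    result = client.wait_for_crawl(job.job_id, interval_seconds=5, timeout_seconds=600)
    for url, page_data in result.data.get('crawl_data', {}).items():
        merged_dataset.append({
            'section': section,
            'url': url,
            'content': page_data.get('markdown', '')
        })

print(f"Merged {len(merged_dataset)} pages across {len(section_jobs)} sections")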

Best Practices for Production Crawls

Based on our experience running millions of crawl jobs, here are some best practices to ensure reliable and efficient crawling:

  1. Start small and scale up - Begin with a small depth and link_limit to test your configuration before launching larger crawls
  2. Use specific patterns - The more specific your include_patterns, the more focused and efficient your crawl will be
  3. Enable JavaScript rendering selectively - Only use render_js: true when you know the target site requires it, as it's more resource-intensive
  4. Implement exponential backoff for polling - When checking job status, start with short intervals and increase them if the job is taking longer (see the sketch after this list)
  5. Save job IDs for important crawls - Store job IDs in your system so you can retrieve results later without re-crawling
  6. Respect robots.txt - Supacrawler automatically respects robots.txt directives, but be mindful of crawl frequency on production sites
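
For the backoff point above, here is a minimal sketch. It assumes a hypothetical client.get_crawl_job(job_id) status call that returns an object with a status field; adapt it to whichever status method your client actually exposes.

import time

def wait_with_backoff(client, job_id, initial_interval=2, max_interval=60, timeout=600):
    """Poll a crawl job with exponentially increasing intervals until it finishes."""
    interval = initial_interval
    deadline = time.time() + timeout
    while time.time() < deadline:
        # Hypothetical status call; swap in your client's real status method
        job = client.get_crawl_job(job_id)
        if job.status in ('completed', 'failed'):
            return job
        time.sleep(interval)
        interval = min(interval * 2, max_interval)  # exponential backoff, capped
    raise TimeoutError(f"Crawl job {job_id} did not finish within {timeout} seconds")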

Conclusion: Beyond Basic Scraping

The techniques covered in this guide demonstrate how Supacrawler's Crawl API goes far beyond basic web scraping. By leveraging these advanced crawling capabilities, you can:

  • Map entire websites and discover hidden content
  • Build comprehensive knowledge bases for RAG systems
  • Create structured datasets for competitive analysis
  • Extract content from dynamic, JavaScript-heavy sites
  • Process hundreds of pages with a single API call

Whether you're building an AI agent that needs web data, creating a knowledge base for your organization, or analyzing competitor content, these advanced crawling techniques provide the foundation for sophisticated data extraction at scale.

Ready to start crawling? Sign up for a Supacrawler account and try these techniques with our generous free tier, or check out our documentation for more details on the Crawl API.

By Supacrawler Team
Published on August 27, 2025