
Web Scraping Best Practices: How to Avoid Getting Blocked in 2025

Getting blocked while web scraping is one of the most frustrating experiences for developers. You've written the perfect scraper, tested it locally, and then... 403 Forbidden. Your IP is banned, and hours of work are suddenly worthless.

But here's the thing: most blocking happens because developers unknowingly violate basic web scraping etiquette. Websites don't want to block legitimate users - they want to block malicious bots and resource-heavy scrapers that slow down their servers.

This guide will teach you how to scrape websites responsibly, effectively, and without getting blocked. You'll learn the techniques that separate professional scrapers from amateur scripts that get shut down immediately.

Why Websites Block Scrapers

Before diving into solutions, let's understand why websites implement blocking mechanisms:

1. Server Resource Protection

Websites need to serve real users. A scraper making 100 requests per second can slow down the entire site for legitimate visitors.

2. Data Protection

Some data is valuable intellectual property. Companies invest in creating content and want to control how it's accessed.

3. Business Model Protection

Ad-supported sites lose revenue when scrapers bypass ads. E-commerce sites don't want competitors easily copying their pricing.

4. Legal Compliance

Some blocking is required by law or terms of service agreements.

5. Bandwidth Costs

Every request costs money in server resources and bandwidth. Excessive scraping can significantly impact hosting costs.

Understanding these motivations helps us scrape more ethically and effectively.

The Most Common Blocking Techniques

Websites use several methods to detect and block scrapers:

Common blocking mechanisms

# These are examples of what NOT to do - behaviors that get you blocked
import requests
import time

# ❌ BAD: This will get you blocked quickly
def bad_scraping_example():
    """
    This demonstrates common mistakes that lead to blocking
    """
    # Mistake 1: No delays between requests
    urls = [f"https://example.com/page{i}" for i in range(100)]
    for url in urls:
        response = requests.get(url)  # Sending requests as fast as possible
        print(response.status_code)

    # Mistake 2: Obviously bot-like user agent
    headers = {'User-Agent': 'Python-requests/2.28.0'}  # Screams "I'm a bot!"

    # Mistake 3: Same exact timing pattern
    for url in urls:
        response = requests.get(url, headers=headers)
        time.sleep(1)  # Exactly 1 second every time - robotic behavior

    # Mistake 4: Ignoring errors and retrying immediately
    for url in urls:
        try:
            response = requests.get(url)
            if response.status_code == 429:  # Rate limited
                # Immediately try again - bad!
                response = requests.get(url)
        except:
            pass

# Don't run this! It's an example of what gets blocked

Detection Methods Websites Use:

  1. Rate Analysis: Too many requests too quickly
  2. User Agent Detection: Non-browser user agents
  3. Behavior Patterns: Robotic, predictable behavior
  4. IP Reputation: Known proxy/datacenter IPs
  5. JavaScript Challenges: Pages that require JS execution
  6. CAPTCHA Systems: Human verification challenges
  7. Honeypot Traps: Hidden links that only bots would follow
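
A practical consequence of item 7: before following links on a page, it helps to drop anchors that no human could ever see. The snippet below is a minimal sketch using BeautifulSoup (the function name visible_links and the specific attribute checks are illustrative); it only catches links hidden with inline styles or hidden attributes, while links hidden via external CSS would need a rendered DOM to detect.

# Skip likely honeypot links that are hidden from human visitors
from bs4 import BeautifulSoup

def visible_links(html):
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").replace(" ", "").lower()
        if "display:none" in style or "visibility:hidden" in style:
            continue  # Hidden via inline CSS - likely a trap for bots
        if a.get("hidden") is not None or a.get("aria-hidden") == "true":
            continue  # Hidden via HTML attributes
        links.append(a["href"])
    return links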

Best Practice #1: Respect Rate Limits

The foundation of ethical scraping is controlling your request rate.

Implementing proper rate limiting

import time
import random
import logging
from datetime import datetime, timedelta

import requests

# Set up logging to track your scraping behavior
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class RespectfulScraper:
    def __init__(self, min_delay=1, max_delay=3, requests_per_minute=20):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.requests_per_minute = requests_per_minute
        self.request_times = []

    def smart_delay(self):
        """
        Implement intelligent delays between requests
        """
        # Clean old requests (older than 1 minute)
        now = datetime.now()
        self.request_times = [
            t for t in self.request_times
            if now - t < timedelta(minutes=1)
        ]

        # Check if we're at our rate limit
        if len(self.request_times) >= self.requests_per_minute:
            # Wait until the oldest request is over a minute old
            oldest_request = min(self.request_times)
            wait_time = 60 - (now - oldest_request).seconds
            logger.info(f"Rate limit reached. Waiting {wait_time} seconds.")
            time.sleep(wait_time)

        # Add random delay to avoid predictable patterns
        delay = random.uniform(self.min_delay, self.max_delay)
        logger.info(f"Waiting {delay:.2f} seconds before next request")
        time.sleep(delay)

        # Record this request time
        self.request_times.append(datetime.now())

    def make_request(self, url, **kwargs):
        """
        Make a request with proper rate limiting
        """
        self.smart_delay()

        # Add some randomization to request timing
        if random.random() < 0.1:  # 10% chance
            extra_delay = random.uniform(2, 5)
            logger.info(f"Random extra delay: {extra_delay:.2f} seconds")
            time.sleep(extra_delay)

        return requests.get(url, **kwargs)

# Example usage
scraper = RespectfulScraper(min_delay=1, max_delay=3, requests_per_minute=15)
urls = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/json",
    "https://httpbin.org/user-agent"
]
for url in urls:
    response = scraper.make_request(url)
    logger.info(f"Status: {response.status_code} for {url}")

Key Rate Limiting Principles:

  1. Start Conservative: Begin with 1 request per 2-3 seconds
  2. Add Randomness: Vary delays to avoid robotic patterns
  3. Respect HTTP Status Codes:
    • 429 Too Many Requests: Back off exponentially
    • 503 Service Unavailable: Server overloaded, wait longer
  4. Monitor Your Impact: Track response times - if they're increasing, slow down
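
To make principle #4 concrete, here is a minimal sketch of adaptive throttling: the delay widens when responses slow down and eases back toward the baseline when the server recovers. The threshold and multipliers are illustrative assumptions, not tuned values, and the 429 handling from Best Practice #3 would sit on top of this.

import time
import requests

def adaptive_scrape(urls, base_delay=2.0, slow_threshold=3.0):
    delay = base_delay
    for url in urls:
        start = time.time()
        response = requests.get(url, timeout=30)
        elapsed = time.time() - start
        if elapsed > slow_threshold:
            delay = min(delay * 1.5, 30)          # Server feels slow - back off
        elif elapsed < slow_threshold / 2:
            delay = max(delay * 0.9, base_delay)  # Recovering - ease back toward baseline
        print(f"{url}: {response.status_code} in {elapsed:.2f}s, next delay {delay:.1f}s")
        time.sleep(delay)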

Best Practice #2: Use Realistic User Agents

Your user agent string is like your ID card to websites. Make it realistic.

Proper user agent management

import random
import requests

class UserAgentManager:
    def __init__(self):
        # Real user agents from popular browsers
        self.user_agents = [
            # Chrome on Windows
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            # Chrome on macOS
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            # Firefox on Windows
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            # Firefox on macOS
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
            # Safari on macOS
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
            # Edge on Windows
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
        ]
        self.current_agent = random.choice(self.user_agents)

    def get_headers(self):
        """
        Get realistic browser headers
        """
        return {
            'User-Agent': self.current_agent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'DNT': '1',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Cache-Control': 'max-age=0'
        }

    def rotate_agent(self):
        """
        Switch to a different user agent
        """
        old_agent = self.current_agent
        self.current_agent = random.choice(self.user_agents)
        # Make sure we actually changed
        while self.current_agent == old_agent and len(self.user_agents) > 1:
            self.current_agent = random.choice(self.user_agents)
        return self.current_agent

# Example usage
ua_manager = UserAgentManager()

def make_realistic_request(url):
    """
    Make a request that looks like it's from a real browser
    """
    headers = ua_manager.get_headers()
    # Occasionally rotate user agent (like switching browsers)
    if random.random() < 0.05:  # 5% chance
        ua_manager.rotate_agent()
        headers = ua_manager.get_headers()
        print("Rotated to new user agent")
    return requests.get(url, headers=headers)

# Test it
response = make_realistic_request("https://httpbin.org/headers")
print(response.json())

User Agent Best Practices:

  1. Use Real Browser Strings: Copy from actual browsers, not generic ones
  2. Rotate Occasionally: Don't use the same agent for thousands of requests
  3. Match Headers: Include realistic Accept, Accept-Language headers
  4. Stay Current: Update user agents periodically as browsers update

Best Practice #3: Handle Errors Gracefully

How you handle errors can mean the difference between getting blocked and staying undetected.

Proper error handling

import requests
import time
import random
from functools import wraps

def retry_with_backoff(max_retries=3, backoff_base=2, max_backoff=60):
    """
    Decorator for implementing exponential backoff on failed requests
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    response = func(*args, **kwargs)

                    # Handle different HTTP status codes
                    if response.status_code == 200:
                        return response
                    elif response.status_code == 429:  # Rate limited
                        # Check if server provided Retry-After header
                        retry_after = response.headers.get('Retry-After')
                        if retry_after:
                            wait_time = int(retry_after)
                        else:
                            # Exponential backoff
                            wait_time = min(backoff_base ** attempt, max_backoff)
                        print(f"Rate limited. Waiting {wait_time} seconds.")
                        time.sleep(wait_time)
                        if attempt == max_retries:
                            print("Max retries reached for rate limiting")
                            return response
                    elif response.status_code in [502, 503, 504]:  # Server errors
                        wait_time = min(backoff_base ** attempt, max_backoff)
                        print(f"Server error {response.status_code}. Waiting {wait_time} seconds.")
                        time.sleep(wait_time)
                        if attempt == max_retries:
                            return response
                    elif response.status_code == 403:  # Forbidden
                        print("403 Forbidden - you might be blocked")
                        # Don't retry immediately for 403s
                        if attempt < max_retries:
                            wait_time = 30 + random.uniform(10, 20)  # Wait longer
                            print(f"Waiting {wait_time:.1f} seconds before retry")
                            time.sleep(wait_time)
                        else:
                            return response
                    else:
                        # For other status codes, return immediately
                        return response
                except requests.exceptions.ConnectionError:
                    wait_time = min(backoff_base ** attempt, max_backoff)
                    print(f"Connection error. Waiting {wait_time} seconds.")
                    time.sleep(wait_time)
                    if attempt == max_retries:
                        raise
                except requests.exceptions.Timeout:
                    wait_time = min(backoff_base ** attempt, max_backoff)
                    print(f"Timeout. Waiting {wait_time} seconds.")
                    time.sleep(wait_time)
                    if attempt == max_retries:
                        raise
            return None
        return wrapper
    return decorator

class SmartScraper:
    def __init__(self):
        self.session = requests.Session()
        self.consecutive_errors = 0
        self.max_consecutive_errors = 5

    @retry_with_backoff(max_retries=3)
    def get_page(self, url, **kwargs):
        """
        Make a request with intelligent error handling
        """
        response = self.session.get(url, timeout=30, **kwargs)

        # Track error patterns
        if response.status_code >= 400:
            self.consecutive_errors += 1
            if self.consecutive_errors >= self.max_consecutive_errors:
                print(f"Too many consecutive errors ({self.consecutive_errors})")
                print("Taking a longer break...")
                time.sleep(300)  # 5 minute break
                self.consecutive_errors = 0
        else:
            self.consecutive_errors = 0

        return response

    def scrape_with_circuit_breaker(self, urls):
        """
        Scrape multiple URLs with circuit breaker pattern
        """
        results = []
        for i, url in enumerate(urls):
            print(f"Scraping {i+1}/{len(urls)}: {url}")
            try:
                response = self.get_page(url)
                if response and response.status_code == 200:
                    results.append({
                        'url': url,
                        'status': 'success',
                        'content_length': len(response.text)
                    })
                else:
                    results.append({
                        'url': url,
                        'status': 'failed',
                        'status_code': response.status_code if response else 'No response'
                    })
            except Exception as e:
                print(f"Error scraping {url}: {e}")
                results.append({
                    'url': url,
                    'status': 'error',
                    'error': str(e)
                })

            # Progressive delays - slow down if we're having issues
            if self.consecutive_errors > 0:
                extra_delay = self.consecutive_errors * 2
                print(f"Adding extra delay: {extra_delay} seconds")
                time.sleep(extra_delay)
        return results

# Example usage
scraper = SmartScraper()
test_urls = [
    "https://httpbin.org/status/200",  # Success
    "https://httpbin.org/status/429",  # Rate limited
    "https://httpbin.org/status/503",  # Server error
]
results = scraper.scrape_with_circuit_breaker(test_urls)
for result in results:
    print(f"URL: {result['url']}, Status: {result['status']}")

Error Handling Principles:

  1. Exponential Backoff: Wait longer after each failure
  2. Respect Retry-After Headers: When servers tell you when to retry, listen (see the sketch after this list)
  3. Circuit Breaker Pattern: Stop trying if you're getting too many errors
  4. Different Strategies for Different Errors: 429 vs 503 vs 403 need different approaches
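
One gap in the decorator above: it assumes Retry-After is an integer number of seconds, but the header may also carry an HTTP date. A small helper like this (a hedged sketch; parse_retry_after is not part of any library) covers both forms and can be dropped into the 429 branch:

from email.utils import parsedate_to_datetime
from datetime import datetime, timezone

def parse_retry_after(header_value, default=5):
    """Return seconds to wait from a Retry-After header, or a default."""
    if not header_value:
        return default
    if header_value.strip().isdigit():
        return int(header_value)          # Delta-seconds form, e.g. "120"
    try:
        retry_at = parsedate_to_datetime(header_value)  # HTTP-date form
        return max(0, (retry_at - datetime.now(timezone.utc)).total_seconds())
    except (TypeError, ValueError):
        return default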

Best Practice #4: Respect robots.txt

The robots.txt file is a website's polite way of saying "please don't scrape these areas."

Respecting robots.txt

import urllib.robotparser
import time
import requests
from urllib.parse import urljoin, urlparse

class RobotsRespectfulScraper:
    def __init__(self, user_agent="*"):
        self.user_agent = user_agent
        self.robots_cache = {}

    def can_fetch(self, url):
        """
        Check if we're allowed to scrape this URL according to robots.txt
        """
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"

        # Check if we've already fetched robots.txt for this domain
        if base_url not in self.robots_cache:
            self.robots_cache[base_url] = self._fetch_robots_txt(base_url)

        robots_parser = self.robots_cache[base_url]
        if robots_parser is None:
            # If we can't fetch robots.txt, assume scraping is allowed
            return True
        return robots_parser.can_fetch(self.user_agent, url)

    def _fetch_robots_txt(self, base_url):
        """
        Fetch and parse robots.txt for a domain
        """
        robots_url = urljoin(base_url, "/robots.txt")
        try:
            print(f"Fetching robots.txt from {robots_url}")
            robots_parser = urllib.robotparser.RobotFileParser()
            robots_parser.set_url(robots_url)
            robots_parser.read()
            return robots_parser
        except Exception as e:
            print(f"Could not fetch robots.txt from {robots_url}: {e}")
            return None

    def get_crawl_delay(self, url):
        """
        Get the recommended crawl delay for this domain
        """
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"
        if base_url not in self.robots_cache:
            self.robots_cache[base_url] = self._fetch_robots_txt(base_url)

        robots_parser = self.robots_cache[base_url]
        if robots_parser is None:
            return 1  # Default 1 second delay

        crawl_delay = robots_parser.crawl_delay(self.user_agent)
        return crawl_delay if crawl_delay is not None else 1

    def scrape_url(self, url):
        """
        Scrape a URL only if allowed by robots.txt
        """
        if not self.can_fetch(url):
            print(f"❌ Robots.txt disallows scraping {url}")
            return None

        print(f"✅ Robots.txt allows scraping {url}")

        # Get recommended delay
        delay = self.get_crawl_delay(url)
        print(f"Recommended crawl delay: {delay} seconds")

        # Make the request
        time.sleep(delay)
        try:
            response = requests.get(url, timeout=30)
            return response
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return None

# Example usage
scraper = RobotsRespectfulScraper(user_agent="*")
test_urls = [
    "https://example.com/",
    "https://example.com/page1",
    "https://httpbin.org/robots.txt",  # This will show you the robots.txt content
]
for url in test_urls:
    print(f"\n--- Checking {url} ---")
    response = scraper.scrape_url(url)
    if response:
        print(f"Successfully scraped: {response.status_code}")
    else:
        print("Could not scrape (blocked by robots.txt or error)")

robots.txt Best Practices:

  1. Always Check First: Before scraping a new domain
  2. Respect Crawl-Delay: If specified, use it as your minimum delay
  3. Cache Results: Don't fetch robots.txt for every request
  4. Handle Missing Files: Assume allowed if robots.txt doesn't exist

Best Practice #5: Use Sessions and Connection Pooling

Reusing connections makes your scraping more efficient and less detectable.

Efficient session management

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import time

class EfficientScraper:
    def __init__(self):
        self.session = requests.Session()
        # requests.Session has no global timeout setting, so keep one here
        # and pass it explicitly on every request
        self.timeout = 30
        self._setup_session()

    def _setup_session(self):
        """
        Configure session with retry strategy and connection pooling
        """
        # Configure retry strategy
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "OPTIONS"]
        )

        # Set up HTTP adapter with retry strategy
        adapter = HTTPAdapter(
            max_retries=retry_strategy,
            pool_connections=10,  # Number of connection pools
            pool_maxsize=20,      # Max connections per pool
            pool_block=False
        )

        # Mount adapter for both HTTP and HTTPS
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)

        # Set default headers
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0'
        })

    def scrape_multiple_pages(self, urls, delay=2):
        """
        Efficiently scrape multiple pages using session
        """
        results = []
        for i, url in enumerate(urls):
            print(f"Scraping {i+1}/{len(urls)}: {url}")
            try:
                start_time = time.time()
                response = self.session.get(url, timeout=self.timeout)
                request_time = time.time() - start_time

                result = {
                    'url': url,
                    'status_code': response.status_code,
                    'response_time': round(request_time, 2),
                    'content_length': len(response.content),
                    'content_type': response.headers.get('content-type', 'unknown')
                }

                # Check if we're being throttled (slow responses might indicate rate limiting)
                if request_time > 10:
                    print(f"⚠️ Slow response ({request_time:.1f}s) - might be rate limited")
                    delay *= 1.5  # Increase delay for subsequent requests

                results.append(result)
            except requests.exceptions.RequestException as e:
                print(f"Error scraping {url}: {e}")
                results.append({
                    'url': url,
                    'error': str(e)
                })

            # Respectful delay
            if i < len(urls) - 1:  # Don't wait after the last URL
                time.sleep(delay)
        return results

    def close(self):
        """
        Clean up session resources
        """
        self.session.close()

# Example usage with connection reuse benefits
scraper = EfficientScraper()

# These requests will reuse the same TCP connection
same_domain_urls = [
    "https://httpbin.org/json",
    "https://httpbin.org/headers",
    "https://httpbin.org/user-agent",
    "https://httpbin.org/ip"
]

print("Scraping same domain (connection reuse):")
results = scraper.scrape_multiple_pages(same_domain_urls, delay=1)
for result in results:
    if 'error' not in result:
        print(f" {result['url']}: {result['status_code']} ({result['response_time']}s)")
scraper.close()

Session Management Benefits:

  1. Connection Reuse: Faster subsequent requests to same domain
  2. Cookie Persistence: Maintains session state across requests (see the sketch after this list)
  3. Connection Pooling: Better resource utilization
  4. Automatic Retries: Built-in handling of temporary failures
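
To see the cookie persistence point in action, here is a short sketch against httpbin.org (the output comments assume httpbin's documented behavior): the Session carries the cookie into the second request, while a bare requests.get() starts from a clean slate.

import requests

with requests.Session() as session:
    # Set a cookie, then read it back on the same session
    session.get("https://httpbin.org/cookies/set?visited=true", timeout=30)
    reply = session.get("https://httpbin.org/cookies", timeout=30)
    print(reply.json())   # {'cookies': {'visited': 'true'}}

# A standalone request has no memory of the cookie
print(requests.get("https://httpbin.org/cookies", timeout=30).json())  # {'cookies': {}}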

Best Practice #6: Modern Solution - Supacrawler API

While all these techniques are important to understand, modern web scraping APIs like Supacrawler handle these complexities automatically.

Supacrawler: Best practices built-in

from supacrawler import SupacrawlerClient
import os
import time

# Supacrawler automatically handles:
# ✅ Rate limiting
# ✅ User agent rotation
# ✅ JavaScript rendering
# ✅ Error handling with retries
# ✅ IP rotation
# ✅ CAPTCHA solving
# ✅ Connection pooling

client = SupacrawlerClient(api_key=os.environ.get('SUPACRAWLER_API_KEY'))

def scrape_with_built_in_best_practices():
    """
    Supacrawler handles all the best practices automatically
    """
    # Example 1: Simple scraping with automatic best practices
    response = client.scrape(
        url="https://example.com",
        render_js=True,  # Handles JavaScript automatically
        # Rate limiting, user agents, retries all handled automatically
    )
    if response.success:
        print("✅ Scraped successfully")
        print(f"Title: {response.metadata.title}")
        print(f"Content length: {len(response.markdown)}")

    # Example 2: Bulk scraping with automatic optimization
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]
    results = []
    for url in urls:
        response = client.scrape(url=url)
        results.append({
            'url': url,
            'success': response.success,
            'title': response.metadata.title if response.metadata else None
        })
        # No need to add delays - Supacrawler handles rate limiting
    return results

def advanced_scraping_with_supacrawler():
    """
    Advanced features that would be complex to implement manually
    """
    # Scrape JavaScript-heavy site with custom wait conditions
    response = client.scrape(
        url="https://spa-example.com",
        render_js=True,
        wait_for_selector=".content-loaded",      # Wait for specific element
        scroll_to_bottom=True,                    # Handle infinite scroll
        block_resources=["image", "stylesheet"]   # Optimize for speed
    )

    # Extract structured data with selectors
    response = client.scrape(
        url="https://news-site.com",
        selectors={
            "articles": {
                "selector": ".article",
                "multiple": True,
                "fields": {
                    "title": "h2",
                    "summary": ".summary",
                    "author": ".author",
                    "publish_date": ".date"
                }
            }
        }
    )
    return response.data

def compare_traditional_vs_modern():
    """
    Compare traditional scraping complexity vs Supacrawler simplicity
    """
    print("Traditional approach would require:")
    print(" ❌ 50+ lines of rate limiting code")
    print(" ❌ User agent management")
    print(" ❌ Proxy rotation setup")
    print(" ❌ JavaScript rendering with Selenium/Playwright")
    print(" ❌ Error handling and retries")
    print(" ❌ CAPTCHA solving")
    print(" ❌ Connection pooling")
    print(" ❌ Ongoing maintenance as sites change")
    print("\nSupacrawler approach:")
    print(" ✅ 3 lines of code")
    print(" ✅ All best practices built-in")
    print(" ✅ Zero maintenance")
    print(" ✅ Better success rates")

# Example usage
if __name__ == "__main__":
    print("=== Supacrawler Best Practices Demo ===")
    try:
        results = scrape_with_built_in_best_practices()
        print(f"Scraped {len(results)} URLs successfully")
        advanced_data = advanced_scraping_with_supacrawler()
        print("Advanced scraping completed")
        compare_traditional_vs_modern()
    except Exception as e:
        print(f"Error: {e}")
        print("Make sure to set SUPACRAWLER_API_KEY environment variable")

Advanced Anti-Detection Techniques

For sites with sophisticated blocking mechanisms, here are advanced techniques:

1. Browser Fingerprint Randomization

Advanced fingerprint management

import random
import requests

class AdvancedScraper:
    def __init__(self):
        self.session = requests.Session()
        # Randomize the header fingerprint (note: requests cannot change the
        # TLS fingerprint itself - that requires a real browser or specialized tooling)
        self.session.headers.update(self._get_random_headers())

    def _get_random_headers(self):
        """Generate realistic, randomized headers"""
        # Common screen resolutions and timezone offsets - relevant when driving
        # a real browser; listed here for reference, they are not sent as HTTP headers
        resolutions = [
            "1920x1080", "1366x768", "1536x864", "1440x900",
            "1280x720", "1600x900", "2560x1440"
        ]
        timezones = ["-480", "-420", "-360", "-300", "-240", "-180", "0", "60", "120"]

        # Random language preferences
        languages = [
            "en-US,en;q=0.9",
            "en-GB,en;q=0.9",
            "en-US,en;q=0.9,es;q=0.8",
            "en-US,en;q=0.9,fr;q=0.8"
        ]

        headers = {
            'User-Agent': self._get_random_user_agent(),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': random.choice(languages),
            'Accept-Encoding': 'gzip, deflate, br',
            'DNT': random.choice(['1', '0']),
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': random.choice(['none', 'same-origin', 'cross-site']),
            'Cache-Control': random.choice(['no-cache', 'max-age=0']),
        }
        return headers

    def _get_random_user_agent(self):
        """Get a realistic, current user agent"""
        # Up-to-date user agents (as of 2025)
        user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
        ]
        return random.choice(user_agents)

2. Behavioral Mimicking

Human-like behavior patterns

import random
import time
import requests

class HumanLikeScraper:
    def __init__(self):
        self.page_view_times = []
        self.last_request_time = None

    def human_like_delay(self):
        """
        Implement human-like delays between page views
        """
        # Humans don't visit pages at exact intervals
        # They have patterns: quick browsing, detailed reading, pauses
        behavior_type = random.choices(
            ['quick_browse', 'normal_read', 'detailed_study', 'distracted'],
            weights=[30, 50, 15, 5]
        )[0]

        if behavior_type == 'quick_browse':
            delay = random.uniform(2, 8)
        elif behavior_type == 'normal_read':
            delay = random.uniform(10, 30)
        elif behavior_type == 'detailed_study':
            delay = random.uniform(45, 120)
        else:  # distracted
            delay = random.uniform(180, 600)  # 3-10 minutes

        # Add some randomness with normal distribution
        delay += random.gauss(0, delay * 0.1)
        delay = max(1, delay)  # Ensure minimum 1 second

        print(f"Human-like delay: {delay:.1f} seconds ({behavior_type})")
        time.sleep(delay)

    def simulate_mouse_movement(self):
        """
        Simulate realistic mouse movement patterns
        (This would be used with Selenium, shown for concept)
        """
        # Humans don't move mouse in straight lines
        # They have curves, pauses, micro-movements
        movements = []
        current_x, current_y = 0, 0
        target_x, target_y = random.randint(100, 800), random.randint(100, 600)

        # Create curved path with noise
        steps = random.randint(10, 30)
        for i in range(steps):
            progress = i / steps
            # Bezier curve-like movement
            x = current_x + (target_x - current_x) * progress
            y = current_y + (target_y - current_y) * progress
            # Add human-like noise
            x += random.gauss(0, 5)
            y += random.gauss(0, 5)
            # Vary speed (humans slow down near target)
            speed_factor = 1 - (progress * 0.5)
            delay = random.uniform(0.01, 0.05) / speed_factor
            movements.append((x, y, delay))
        return movements

    def realistic_reading_pattern(self, content_length):
        """
        Calculate realistic reading time based on content
        """
        # Average reading speed: 200-300 words per minute
        words_per_minute = random.uniform(200, 300)
        # Estimate word count (rough: content_length / 5)
        estimated_words = content_length / 5
        # Calculate reading time in seconds
        reading_time = (estimated_words / words_per_minute) * 60
        # Add scanning time (people don't read everything)
        scanning_factor = random.uniform(0.3, 0.8)
        actual_time = reading_time * scanning_factor
        # Add think time
        think_time = random.uniform(2, 10)
        total_time = actual_time + think_time
        return max(5, total_time)  # Minimum 5 seconds

# Example usage
scraper = HumanLikeScraper()

def scrape_like_human(urls):
    """
    Scrape URLs with human-like behavior
    """
    for i, url in enumerate(urls):
        print(f"\nVisiting page {i+1}: {url}")
        # Make request (would use your preferred method)
        response = requests.get(url)
        if response.status_code == 200:
            # Simulate reading the content
            content_length = len(response.text)
            reading_time = scraper.realistic_reading_pattern(content_length)
            print(f"Simulating reading for {reading_time:.1f} seconds")
            time.sleep(reading_time)
        # Human-like delay before next page
        if i < len(urls) - 1:
            scraper.human_like_delay()

Legal and Ethical Considerations

Before implementing any scraping strategy, consider the legal and ethical implications:

Good Practices:

  1. Read Terms of Service: Understand what's allowed
  2. Check robots.txt: Respect website preferences
  3. Use Official APIs First: Always prefer APIs when available
  4. Minimize Server Load: Don't overwhelm servers
  5. Respect Copyright: Don't republish copyrighted content
  6. Add Value: Use scraped data to create something useful

Avoid These:

  1. Ignoring robots.txt: Clearly stated preferences
  2. Overwhelming Servers: Excessive request rates
  3. Scraping Personal Data: Privacy violations
  4. Commercial Redistribution: Without permission
  5. Circumventing Paywalls: Violates business models
  6. Aggressive Automation: Impacts legitimate users

Monitoring Your Scraping Health

Keep track of how your scraping is performing:

Scraping health monitoring

import time
import json
import requests
from datetime import datetime
from collections import defaultdict

class ScrapingMonitor:
    def __init__(self):
        self.stats = {
            'requests_made': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'blocked_requests': 0,
            'average_response_time': 0,
            'response_times': [],
            'status_codes': defaultdict(int),
            'start_time': datetime.now()
        }

    def record_request(self, url, status_code, response_time, success=True):
        """Record metrics for a request"""
        self.stats['requests_made'] += 1
        self.stats['response_times'].append(response_time)
        self.stats['status_codes'][status_code] += 1

        if success and status_code == 200:
            self.stats['successful_requests'] += 1
        elif status_code in [403, 429]:
            self.stats['blocked_requests'] += 1
        else:
            self.stats['failed_requests'] += 1

        # Update average response time
        self.stats['average_response_time'] = sum(self.stats['response_times']) / len(self.stats['response_times'])

    def get_health_report(self):
        """Generate a health report"""
        runtime = datetime.now() - self.stats['start_time']
        success_rate = (self.stats['successful_requests'] / self.stats['requests_made']) * 100 if self.stats['requests_made'] > 0 else 0
        blocked_rate = (self.stats['blocked_requests'] / self.stats['requests_made']) * 100 if self.stats['requests_made'] > 0 else 0

        report = {
            'summary': {
                'runtime_minutes': round(runtime.total_seconds() / 60, 2),
                'total_requests': self.stats['requests_made'],
                'success_rate': round(success_rate, 2),
                'blocked_rate': round(blocked_rate, 2),
                'average_response_time': round(self.stats['average_response_time'], 2)
            },
            'status_codes': dict(self.stats['status_codes']),
            'health_indicators': self._get_health_indicators()
        }
        return report

    def _get_health_indicators(self):
        """Get health indicators and warnings"""
        indicators = []
        success_rate = (self.stats['successful_requests'] / self.stats['requests_made']) * 100 if self.stats['requests_made'] > 0 else 0
        blocked_rate = (self.stats['blocked_requests'] / self.stats['requests_made']) * 100 if self.stats['requests_made'] > 0 else 0
        avg_response_time = self.stats['average_response_time']

        if success_rate < 80:
            indicators.append("⚠️ Low success rate - consider slowing down")
        if blocked_rate > 10:
            indicators.append("🚫 High block rate - you're being detected")
        if avg_response_time > 5:
            indicators.append("🐌 Slow responses - server may be throttling you")
        if success_rate > 95 and blocked_rate < 1:
            indicators.append("✅ Healthy scraping - good practices being followed")
        return indicators

# Example usage with monitoring
monitor = ScrapingMonitor()

def monitored_scrape(urls):
    """Scrape with health monitoring"""
    for url in urls:
        start_time = time.time()
        try:
            response = requests.get(url, timeout=30)
            response_time = time.time() - start_time
            monitor.record_request(
                url=url,
                status_code=response.status_code,
                response_time=response_time,
                success=(response.status_code == 200)
            )
            print(f"Scraped {url}: {response.status_code} ({response_time:.2f}s)")
        except Exception as e:
            response_time = time.time() - start_time
            monitor.record_request(
                url=url,
                status_code=0,  # Network error
                response_time=response_time,
                success=False
            )
            print(f"Error scraping {url}: {e}")
        time.sleep(2)  # Respectful delay

    # Print health report
    report = monitor.get_health_report()
    print("\n" + "="*50)
    print("SCRAPING HEALTH REPORT")
    print("="*50)
    print(json.dumps(report, indent=2))

# Test monitoring
test_urls = [
    "https://httpbin.org/status/200",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/status/429",
]
monitored_scrape(test_urls)

When to Use Supacrawler vs DIY

Here's a decision framework:

Scenario               | Recommendation               | Reason
Learning web scraping  | DIY with Python              | Understand fundamentals
Simple static sites    | DIY or Supacrawler           | Both work well
JavaScript-heavy sites | Supacrawler                  | Complex to handle manually
High-volume scraping   | Supacrawler                  | Built-in optimization
Production systems     | Supacrawler                  | Reliability and maintenance
Complex anti-bot sites | Supacrawler                  | Advanced countermeasures
Budget constraints     | DIY first, then Supacrawler  | Start free, scale up
Time constraints       | Supacrawler                  | Faster development

Conclusion: Building Sustainable Scrapers

The key to successful web scraping isn't just avoiding detection—it's building sustainable, respectful systems that can run reliably over time.

Remember these principles:

  1. Start Respectfully: Begin with conservative rates and realistic behavior
  2. Monitor Continuously: Track your success rates and response times
  3. Adapt Quickly: When you see signs of blocking, adjust immediately
  4. Choose the Right Tool: Match complexity to your needs
  5. Stay Updated: Anti-bot measures evolve, so should your techniques

For most production use cases, Supacrawler eliminates these concerns entirely:

  • Built-in best practices: Rate limiting, user agent rotation, error handling
  • Automatic adaptation: Adjusts to site-specific anti-bot measures
  • No maintenance: Updates automatically as sites change
  • Higher success rates: Professional-grade infrastructure
  • Focus on value: Spend time using data, not fighting blocks

Whether you choose DIY or Supacrawler, the goal is the same: extract valuable data while being a good citizen of the web.

Ready to start scraping responsibly?

Happy (and respectful) scraping! 🤖✨

By Supacrawler Team
Published on June 5, 2025