Web Scraping Best Practices: How to Avoid Getting Blocked in 2025
Getting blocked while web scraping is one of the most frustrating experiences for developers. You've written the perfect scraper, tested it locally, and then... 403 Forbidden. Your IP is banned. Your hours of work are useless.
But here's the thing: most blocking happens because developers unknowingly violate basic web scraping etiquette. Websites don't want to block legitimate users - they want to block malicious bots and resource-heavy scrapers that slow down their servers.
This guide will teach you how to scrape websites responsibly, effectively, and without getting blocked. You'll learn the techniques that separate professional scrapers from amateur scripts that get shut down immediately.
Why Websites Block Scrapers
Before diving into solutions, let's understand why websites implement blocking mechanisms:
1. Server Resource Protection
Websites need to serve real users. A scraper making 100 requests per second can slow down the entire site for legitimate visitors.
2. Data Protection
Some data is valuable intellectual property. Companies invest in creating content and want to control how it's accessed.
3. Business Model Protection
Ad-supported sites lose revenue when scrapers bypass ads. E-commerce sites don't want competitors easily copying their pricing.
4. Legal Compliance
Some blocking is required by law or terms of service agreements.
5. Bandwidth Costs
Every request costs money in server resources and bandwidth. Excessive scraping can significantly impact hosting costs.
Understanding these motivations helps us scrape more ethically and effectively.
The Most Common Blocking Techniques
Websites use several methods to detect and block scrapers:
Common blocking mechanisms
```python
# These are examples of what NOT to do - behaviors that get you blocked
import requests
import time

# ❌ BAD: This will get you blocked quickly
def bad_scraping_example():
    """This demonstrates common mistakes that lead to blocking"""

    # Mistake 1: No delays between requests
    urls = [f"https://example.com/page{i}" for i in range(100)]
    for url in urls:
        response = requests.get(url)  # Sending requests as fast as possible
        print(response.status_code)

    # Mistake 2: Obviously bot-like user agent
    headers = {'User-Agent': 'Python-requests/2.28.0'}  # Screams "I'm a bot!"

    # Mistake 3: Same exact timing pattern
    for url in urls:
        response = requests.get(url, headers=headers)
        time.sleep(1)  # Exactly 1 second every time - robotic behavior

    # Mistake 4: Ignoring errors and retrying immediately
    for url in urls:
        try:
            response = requests.get(url)
            if response.status_code == 429:  # Rate limited
                # Immediately try again - bad!
                response = requests.get(url)
        except:
            pass

# Don't run this! It's an example of what gets blocked
```
Detection Methods Websites Use:
- Rate Analysis: Too many requests too quickly
- User Agent Detection: Non-browser user agents
- Behavior Patterns: Robotic, predictable behavior
- IP Reputation: Known proxy/datacenter IPs
- JavaScript Challenges: Pages that require JS execution
- CAPTCHA Systems: Human verification challenges
- Honeypot Traps: Hidden links that only bots would follow (see the sketch below)
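
That last point deserves a closer look, because honeypots are easy to fall into if you blindly follow every link a parser finds. Here is a minimal, hypothetical sketch of filtering out links a human couldn't see; it assumes BeautifulSoup is installed, and the CSS heuristics are illustrative rather than exhaustive:

```python
# Hypothetical sketch: skip likely honeypot links while parsing a page.
# Assumes beautifulsoup4 is installed; heuristics are illustrative only.
from bs4 import BeautifulSoup

def visible_links(html):
    """Return hrefs that a human could plausibly see and click."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").replace(" ", "").lower()
        # Common honeypot tells: hidden via inline CSS or hidden attributes
        if "display:none" in style or "visibility:hidden" in style:
            continue
        if a.get("aria-hidden") == "true" or a.has_attr("hidden"):
            continue
        links.append(a["href"])
    return links
```

Real honeypots can also be positioned off-screen or hidden via external stylesheets, so treat this as a starting point, not a guarantee.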
Best Practice #1: Respect Rate Limits
The foundation of ethical scraping is controlling your request rate.
Implementing proper rate limiting
```python
import time
import random
import requests
import logging
from datetime import datetime, timedelta

# Set up logging to track your scraping behavior
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class RespectfulScraper:
    def __init__(self, min_delay=1, max_delay=3, requests_per_minute=20):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.requests_per_minute = requests_per_minute
        self.request_times = []

    def smart_delay(self):
        """Implement intelligent delays between requests"""
        # Clean old requests (older than 1 minute)
        now = datetime.now()
        self.request_times = [t for t in self.request_times
                              if now - t < timedelta(minutes=1)]

        # Check if we're at our rate limit
        if len(self.request_times) >= self.requests_per_minute:
            # Wait until the oldest request is over a minute old
            oldest_request = min(self.request_times)
            wait_time = 60 - (now - oldest_request).seconds
            logger.info(f"Rate limit reached. Waiting {wait_time} seconds.")
            time.sleep(wait_time)

        # Add random delay to avoid predictable patterns
        delay = random.uniform(self.min_delay, self.max_delay)
        logger.info(f"Waiting {delay:.2f} seconds before next request")
        time.sleep(delay)

        # Record this request time
        self.request_times.append(datetime.now())

    def make_request(self, url, **kwargs):
        """Make a request with proper rate limiting"""
        self.smart_delay()

        # Add some randomization to request timing
        if random.random() < 0.1:  # 10% chance
            extra_delay = random.uniform(2, 5)
            logger.info(f"Random extra delay: {extra_delay:.2f} seconds")
            time.sleep(extra_delay)

        return requests.get(url, **kwargs)


# Example usage
scraper = RespectfulScraper(min_delay=1, max_delay=3, requests_per_minute=15)

urls = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/json",
    "https://httpbin.org/user-agent"
]

for url in urls:
    response = scraper.make_request(url)
    logger.info(f"Status: {response.status_code} for {url}")
```
Key Rate Limiting Principles:
- Start Conservative: Begin with 1 request per 2-3 seconds
- Add Randomness: Vary delays to avoid robotic patterns
- Respect HTTP Status Codes:
  - 429 Too Many Requests: back off exponentially
  - 503 Service Unavailable: server overloaded, wait longer
- Monitor Your Impact: Track response times - if they're increasing, slow down (a short sketch follows this list)
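
To make that last principle concrete, here is a minimal sketch of response-time-based throttling. The function name, window size, and thresholds are assumptions for illustration, not tuned values:

```python
# Minimal sketch of the "monitor your impact" principle: if responses are
# trending slower, widen the delay. Thresholds below are arbitrary examples.
import time
import requests

def adaptive_delay_scrape(urls, base_delay=2.0, max_delay=30.0):
    delay = base_delay
    recent_times = []  # rolling window of recent response times

    for url in urls:
        start = time.time()
        response = requests.get(url, timeout=30)
        elapsed = time.time() - start
        recent_times = (recent_times + [elapsed])[-5:]

        # Compare the latest response against the oldest in the window;
        # a 2x slowdown is treated as a sign we're straining the server.
        if len(recent_times) == 5 and elapsed > 2 * recent_times[0]:
            delay = min(delay * 1.5, max_delay)
            print(f"Responses slowing down ({elapsed:.1f}s); delay now {delay:.1f}s")

        print(f"{url}: {response.status_code} in {elapsed:.2f}s")
        time.sleep(delay)
```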
Best Practice #2: Use Realistic User Agents
Your user agent string is like your ID card to websites. Make it realistic.
Proper user agent management
```python
import random
import requests


class UserAgentManager:
    def __init__(self):
        # Real user agents from popular browsers
        self.user_agents = [
            # Chrome on Windows
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            # Chrome on macOS
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            # Firefox on Windows
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            # Firefox on macOS
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
            # Safari on macOS
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
            # Edge on Windows
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
        ]
        self.current_agent = random.choice(self.user_agents)

    def get_headers(self):
        """Get realistic browser headers"""
        return {
            'User-Agent': self.current_agent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'DNT': '1',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Cache-Control': 'max-age=0'
        }

    def rotate_agent(self):
        """Switch to a different user agent"""
        old_agent = self.current_agent
        self.current_agent = random.choice(self.user_agents)
        # Make sure we actually changed
        while self.current_agent == old_agent and len(self.user_agents) > 1:
            self.current_agent = random.choice(self.user_agents)
        return self.current_agent


# Example usage
ua_manager = UserAgentManager()

def make_realistic_request(url):
    """Make a request that looks like it's from a real browser"""
    headers = ua_manager.get_headers()

    # Occasionally rotate user agent (like switching browsers)
    if random.random() < 0.05:  # 5% chance
        ua_manager.rotate_agent()
        headers = ua_manager.get_headers()
        print("Rotated to new user agent")

    return requests.get(url, headers=headers)

# Test it
response = make_realistic_request("https://httpbin.org/headers")
print(response.json())
```
User Agent Best Practices:
- Use Real Browser Strings: Copy from actual browsers, not generic ones
- Rotate Occasionally: Don't use the same agent for thousands of requests
- Match Headers: Include realistic Accept, Accept-Language headers
- Stay Current: Update user agents periodically as browsers update
Best Practice #3: Handle Errors Gracefully
How you handle errors can mean the difference between getting blocked and staying undetected.
Proper error handling
```python
import requests
import time
import random
from functools import wraps


def retry_with_backoff(max_retries=3, backoff_base=2, max_backoff=60):
    """Decorator for implementing exponential backoff on failed requests"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    response = func(*args, **kwargs)

                    # Handle different HTTP status codes
                    if response.status_code == 200:
                        return response

                    elif response.status_code == 429:  # Rate limited
                        # Check if server provided Retry-After header
                        retry_after = response.headers.get('Retry-After')
                        if retry_after:
                            wait_time = int(retry_after)
                        else:
                            # Exponential backoff
                            wait_time = min(backoff_base ** attempt, max_backoff)
                        print(f"Rate limited. Waiting {wait_time} seconds.")
                        time.sleep(wait_time)
                        if attempt == max_retries:
                            print("Max retries reached for rate limiting")
                            return response

                    elif response.status_code in [502, 503, 504]:  # Server errors
                        wait_time = min(backoff_base ** attempt, max_backoff)
                        print(f"Server error {response.status_code}. Waiting {wait_time} seconds.")
                        time.sleep(wait_time)
                        if attempt == max_retries:
                            return response

                    elif response.status_code == 403:  # Forbidden
                        print("403 Forbidden - you might be blocked")
                        # Don't retry immediately for 403s
                        if attempt < max_retries:
                            wait_time = 30 + random.uniform(10, 20)  # Wait longer
                            print(f"Waiting {wait_time:.1f} seconds before retry")
                            time.sleep(wait_time)
                        else:
                            return response

                    else:
                        # For other status codes, return immediately
                        return response

                except requests.exceptions.ConnectionError:
                    wait_time = min(backoff_base ** attempt, max_backoff)
                    print(f"Connection error. Waiting {wait_time} seconds.")
                    time.sleep(wait_time)
                    if attempt == max_retries:
                        raise
                except requests.exceptions.Timeout:
                    wait_time = min(backoff_base ** attempt, max_backoff)
                    print(f"Timeout. Waiting {wait_time} seconds.")
                    time.sleep(wait_time)
                    if attempt == max_retries:
                        raise
            return None
        return wrapper
    return decorator


class SmartScraper:
    def __init__(self):
        self.session = requests.Session()
        self.consecutive_errors = 0
        self.max_consecutive_errors = 5

    @retry_with_backoff(max_retries=3)
    def get_page(self, url, **kwargs):
        """Make a request with intelligent error handling"""
        response = self.session.get(url, timeout=30, **kwargs)

        # Track error patterns
        if response.status_code >= 400:
            self.consecutive_errors += 1
            if self.consecutive_errors >= self.max_consecutive_errors:
                print(f"Too many consecutive errors ({self.consecutive_errors})")
                print("Taking a longer break...")
                time.sleep(300)  # 5 minute break
                self.consecutive_errors = 0
        else:
            self.consecutive_errors = 0

        return response

    def scrape_with_circuit_breaker(self, urls):
        """Scrape multiple URLs with circuit breaker pattern"""
        results = []
        for i, url in enumerate(urls):
            print(f"Scraping {i+1}/{len(urls)}: {url}")
            try:
                response = self.get_page(url)
                if response and response.status_code == 200:
                    results.append({
                        'url': url,
                        'status': 'success',
                        'content_length': len(response.text)
                    })
                else:
                    results.append({
                        'url': url,
                        'status': 'failed',
                        'status_code': response.status_code if response else 'No response'
                    })
            except Exception as e:
                print(f"Error scraping {url}: {e}")
                results.append({
                    'url': url,
                    'status': 'error',
                    'error': str(e)
                })

            # Progressive delays - slow down if we're having issues
            if self.consecutive_errors > 0:
                extra_delay = self.consecutive_errors * 2
                print(f"Adding extra delay: {extra_delay} seconds")
                time.sleep(extra_delay)

        return results


# Example usage
scraper = SmartScraper()

test_urls = [
    "https://httpbin.org/status/200",  # Success
    "https://httpbin.org/status/429",  # Rate limited
    "https://httpbin.org/status/503",  # Server error
]

results = scraper.scrape_with_circuit_breaker(test_urls)
for result in results:
    print(f"URL: {result['url']}, Status: {result['status']}")
```
Error Handling Principles:
- Exponential Backoff: Wait longer after each failure
- Respect Retry-After Headers: When servers tell you when to retry, listen
- Circuit Breaker Pattern: Stop trying if you're getting too many errors
- Different Strategies for Different Errors: 429 vs 503 vs 403 need different approaches
Best Practice #4: Respect robots.txt
The robots.txt file is a website's polite way of saying "please don't scrape these areas."
Respecting robots.txt
```python
import urllib.robotparser
import requests
import time
from urllib.parse import urljoin, urlparse


class RobotsRespectfulScraper:
    def __init__(self, user_agent="*"):
        self.user_agent = user_agent
        self.robots_cache = {}

    def can_fetch(self, url):
        """Check if we're allowed to scrape this URL according to robots.txt"""
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"

        # Check if we've already fetched robots.txt for this domain
        if base_url not in self.robots_cache:
            self.robots_cache[base_url] = self._fetch_robots_txt(base_url)

        robots_parser = self.robots_cache[base_url]
        if robots_parser is None:
            # If we can't fetch robots.txt, assume scraping is allowed
            return True

        return robots_parser.can_fetch(self.user_agent, url)

    def _fetch_robots_txt(self, base_url):
        """Fetch and parse robots.txt for a domain"""
        robots_url = urljoin(base_url, "/robots.txt")
        try:
            print(f"Fetching robots.txt from {robots_url}")
            robots_parser = urllib.robotparser.RobotFileParser()
            robots_parser.set_url(robots_url)
            robots_parser.read()
            return robots_parser
        except Exception as e:
            print(f"Could not fetch robots.txt from {robots_url}: {e}")
            return None

    def get_crawl_delay(self, url):
        """Get the recommended crawl delay for this domain"""
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"

        if base_url not in self.robots_cache:
            self.robots_cache[base_url] = self._fetch_robots_txt(base_url)

        robots_parser = self.robots_cache[base_url]
        if robots_parser is None:
            return 1  # Default 1 second delay

        crawl_delay = robots_parser.crawl_delay(self.user_agent)
        return crawl_delay if crawl_delay is not None else 1

    def scrape_url(self, url):
        """Scrape a URL only if allowed by robots.txt"""
        if not self.can_fetch(url):
            print(f"❌ Robots.txt disallows scraping {url}")
            return None

        print(f"✅ Robots.txt allows scraping {url}")

        # Get recommended delay
        delay = self.get_crawl_delay(url)
        print(f"Recommended crawl delay: {delay} seconds")

        # Make the request
        time.sleep(delay)
        try:
            response = requests.get(url, timeout=30)
            return response
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return None


# Example usage
scraper = RobotsRespectfulScraper(user_agent="*")

test_urls = [
    "https://example.com/",
    "https://example.com/page1",
    "https://httpbin.org/robots.txt",  # This will show you the robots.txt content
]

for url in test_urls:
    print(f"\n--- Checking {url} ---")
    response = scraper.scrape_url(url)
    if response:
        print(f"Successfully scraped: {response.status_code}")
    else:
        print("Could not scrape (blocked by robots.txt or error)")
```
robots.txt Best Practices:
- Always Check First: Before scraping a new domain
- Respect Crawl-Delay: If specified, use it as your minimum delay
- Cache Results: Don't fetch robots.txt for every request
- Handle Missing Files: Assume allowed if robots.txt doesn't exist
Best Practice #5: Use Sessions and Connection Pooling
Reusing connections makes your scraping more efficient and less detectable.
Efficient session management
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import time


class EfficientScraper:
    def __init__(self):
        self.session = requests.Session()
        self._setup_session()

    def _setup_session(self):
        """Configure session with retry strategy and connection pooling"""
        # Configure retry strategy
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "OPTIONS"]
        )

        # Set up HTTP adapter with retry strategy
        adapter = HTTPAdapter(
            max_retries=retry_strategy,
            pool_connections=10,  # Number of connection pools
            pool_maxsize=20,      # Max connections per pool
            pool_block=False
        )

        # Mount adapter for both HTTP and HTTPS
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)

        # Set default headers
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0'
        })

    def scrape_multiple_pages(self, urls, delay=2):
        """Efficiently scrape multiple pages using session"""
        results = []
        for i, url in enumerate(urls):
            print(f"Scraping {i+1}/{len(urls)}: {url}")
            try:
                start_time = time.time()
                # Note: requests sessions have no global timeout, so pass it per request
                response = self.session.get(url, timeout=30)
                request_time = time.time() - start_time

                result = {
                    'url': url,
                    'status_code': response.status_code,
                    'response_time': round(request_time, 2),
                    'content_length': len(response.content),
                    'content_type': response.headers.get('content-type', 'unknown')
                }

                # Check if we're being throttled (slow responses might indicate rate limiting)
                if request_time > 10:
                    print(f"⚠️ Slow response ({request_time:.1f}s) - might be rate limited")
                    delay *= 1.5  # Increase delay for subsequent requests

                results.append(result)

            except requests.exceptions.RequestException as e:
                print(f"Error scraping {url}: {e}")
                results.append({'url': url, 'error': str(e)})

            # Respectful delay
            if i < len(urls) - 1:  # Don't wait after the last URL
                time.sleep(delay)

        return results

    def close(self):
        """Clean up session resources"""
        self.session.close()


# Example usage with connection reuse benefits
scraper = EfficientScraper()

# These requests will reuse the same TCP connection
same_domain_urls = [
    "https://httpbin.org/json",
    "https://httpbin.org/headers",
    "https://httpbin.org/user-agent",
    "https://httpbin.org/ip"
]

print("Scraping same domain (connection reuse):")
results = scraper.scrape_multiple_pages(same_domain_urls, delay=1)
for result in results:
    if 'error' not in result:
        print(f"  {result['url']}: {result['status_code']} ({result['response_time']}s)")

scraper.close()
```
Session Management Benefits:
- Connection Reuse: Faster subsequent requests to same domain
- Cookie Persistence: Maintains session state across requests
- Connection Pooling: Better resource utilization
- Automatic Retries: Built-in handling of temporary failures
Best Practice #6: Modern Solution - Supacrawler API
While all these techniques are important to understand, modern web scraping APIs like Supacrawler handle these complexities automatically.
Supacrawler: Best practices built-in
```python
from supacrawler import SupacrawlerClient
import os

# Supacrawler automatically handles:
# ✅ Rate limiting
# ✅ User agent rotation
# ✅ JavaScript rendering
# ✅ Error handling with retries
# ✅ IP rotation
# ✅ CAPTCHA solving
# ✅ Connection pooling

client = SupacrawlerClient(api_key=os.environ.get('SUPACRAWLER_API_KEY'))


def scrape_with_built_in_best_practices():
    """Supacrawler handles all the best practices automatically"""

    # Example 1: Simple scraping with automatic best practices
    response = client.scrape(
        url="https://example.com",
        render_js=True,  # Handles JavaScript automatically
        # Rate limiting, user agents, retries all handled automatically
    )

    if response.success:
        print("✅ Scraped successfully")
        print(f"Title: {response.metadata.title}")
        print(f"Content length: {len(response.markdown)}")

    # Example 2: Bulk scraping with automatic optimization
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]

    results = []
    for url in urls:
        response = client.scrape(url=url)
        results.append({
            'url': url,
            'success': response.success,
            'title': response.metadata.title if response.metadata else None
        })
        # No need to add delays - Supacrawler handles rate limiting

    return results


def advanced_scraping_with_supacrawler():
    """Advanced features that would be complex to implement manually"""

    # Scrape JavaScript-heavy site with custom wait conditions
    response = client.scrape(
        url="https://spa-example.com",
        render_js=True,
        wait_for_selector=".content-loaded",      # Wait for specific element
        scroll_to_bottom=True,                    # Handle infinite scroll
        block_resources=["image", "stylesheet"]   # Optimize for speed
    )

    # Extract structured data with selectors
    response = client.scrape(
        url="https://news-site.com",
        selectors={
            "articles": {
                "selector": ".article",
                "multiple": True,
                "fields": {
                    "title": "h2",
                    "summary": ".summary",
                    "author": ".author",
                    "publish_date": ".date"
                }
            }
        }
    )

    return response.data


def compare_traditional_vs_modern():
    """Compare traditional scraping complexity vs Supacrawler simplicity"""
    print("Traditional approach would require:")
    print("  ❌ 50+ lines of rate limiting code")
    print("  ❌ User agent management")
    print("  ❌ Proxy rotation setup")
    print("  ❌ JavaScript rendering with Selenium/Playwright")
    print("  ❌ Error handling and retries")
    print("  ❌ CAPTCHA solving")
    print("  ❌ Connection pooling")
    print("  ❌ Ongoing maintenance as sites change")

    print("\nSupacrawler approach:")
    print("  ✅ 3 lines of code")
    print("  ✅ All best practices built-in")
    print("  ✅ Zero maintenance")
    print("  ✅ Better success rates")


# Example usage
if __name__ == "__main__":
    print("=== Supacrawler Best Practices Demo ===")
    try:
        results = scrape_with_built_in_best_practices()
        print(f"Scraped {len(results)} URLs successfully")

        advanced_data = advanced_scraping_with_supacrawler()
        print("Advanced scraping completed")

        compare_traditional_vs_modern()
    except Exception as e:
        print(f"Error: {e}")
        print("Make sure to set SUPACRAWLER_API_KEY environment variable")
```
Advanced Anti-Detection Techniques
For sites with sophisticated blocking mechanisms, here are advanced techniques:
1. Browser Fingerprint Randomization
Advanced fingerprint management
```python
import random
import requests


class AdvancedScraper:
    def __init__(self):
        self.session = requests.Session()
        # Randomize the request header fingerprint
        self.session.headers.update(self._get_random_headers())

    def _get_random_headers(self):
        """Generate realistic, randomized headers"""
        # Random screen resolutions (common ones)
        # (Not sent as headers here; these matter for browser-based fingerprinting.)
        resolutions = [
            "1920x1080", "1366x768", "1536x864", "1440x900",
            "1280x720", "1600x900", "2560x1440"
        ]

        # Random timezone offsets
        timezones = ["-480", "-420", "-360", "-300", "-240", "-180", "0", "60", "120"]

        # Random language preferences
        languages = [
            "en-US,en;q=0.9",
            "en-GB,en;q=0.9",
            "en-US,en;q=0.9,es;q=0.8",
            "en-US,en;q=0.9,fr;q=0.8"
        ]

        headers = {
            'User-Agent': self._get_random_user_agent(),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': random.choice(languages),
            'Accept-Encoding': 'gzip, deflate, br',
            'DNT': random.choice(['1', '0']),
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': random.choice(['none', 'same-origin', 'cross-site']),
            'Cache-Control': random.choice(['no-cache', 'max-age=0']),
        }
        return headers

    def _get_random_user_agent(self):
        """Get a realistic, current user agent"""
        # Up-to-date user agents (as of 2025)
        user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
        ]
        return random.choice(user_agents)
```
2. Behavioral Mimicking
Human-like behavior patterns
```python
import random
import time
import requests


class HumanLikeScraper:
    def __init__(self):
        self.page_view_times = []
        self.last_request_time = None

    def human_like_delay(self):
        """Implement human-like delays between page views"""
        # Humans don't visit pages at exact intervals.
        # They have patterns: quick browsing, detailed reading, pauses.
        behavior_type = random.choices(
            ['quick_browse', 'normal_read', 'detailed_study', 'distracted'],
            weights=[30, 50, 15, 5]
        )[0]

        if behavior_type == 'quick_browse':
            delay = random.uniform(2, 8)
        elif behavior_type == 'normal_read':
            delay = random.uniform(10, 30)
        elif behavior_type == 'detailed_study':
            delay = random.uniform(45, 120)
        else:  # distracted
            delay = random.uniform(180, 600)  # 3-10 minutes

        # Add some randomness with normal distribution
        delay += random.gauss(0, delay * 0.1)
        delay = max(1, delay)  # Ensure minimum 1 second

        print(f"Human-like delay: {delay:.1f} seconds ({behavior_type})")
        time.sleep(delay)

    def simulate_mouse_movement(self):
        """Simulate realistic mouse movement patterns
        (This would be used with Selenium, shown for concept)"""
        # Humans don't move the mouse in straight lines.
        # They have curves, pauses, micro-movements.
        movements = []
        current_x, current_y = 0, 0
        target_x, target_y = random.randint(100, 800), random.randint(100, 600)

        # Create curved path with noise
        steps = random.randint(10, 30)
        for i in range(steps):
            progress = i / steps
            # Bezier curve-like movement
            x = current_x + (target_x - current_x) * progress
            y = current_y + (target_y - current_y) * progress

            # Add human-like noise
            x += random.gauss(0, 5)
            y += random.gauss(0, 5)

            # Vary speed (humans slow down near target)
            speed_factor = 1 - (progress * 0.5)
            delay = random.uniform(0.01, 0.05) / speed_factor

            movements.append((x, y, delay))

        return movements

    def realistic_reading_pattern(self, content_length):
        """Calculate realistic reading time based on content"""
        # Average reading speed: 200-300 words per minute
        words_per_minute = random.uniform(200, 300)

        # Estimate word count (rough: content_length / 5)
        estimated_words = content_length / 5

        # Calculate reading time in seconds
        reading_time = (estimated_words / words_per_minute) * 60

        # Add scanning time (people don't read everything)
        scanning_factor = random.uniform(0.3, 0.8)
        actual_time = reading_time * scanning_factor

        # Add think time
        think_time = random.uniform(2, 10)

        total_time = actual_time + think_time
        return max(5, total_time)  # Minimum 5 seconds


# Example usage
scraper = HumanLikeScraper()

def scrape_like_human(urls):
    """Scrape URLs with human-like behavior"""
    for i, url in enumerate(urls):
        print(f"\nVisiting page {i+1}: {url}")

        # Make request (would use your preferred method)
        response = requests.get(url)

        if response.status_code == 200:
            # Simulate reading the content
            content_length = len(response.text)
            reading_time = scraper.realistic_reading_pattern(content_length)
            print(f"Simulating reading for {reading_time:.1f} seconds")
            time.sleep(reading_time)

        # Human-like delay before next page
        if i < len(urls) - 1:
            scraper.human_like_delay()
```
Legal and Ethical Considerations
Before implementing any scraping strategy, consider the legal and ethical implications:
✅ Good Practices:
- Read Terms of Service: Understand what's allowed
- Check robots.txt: Respect website preferences
- Use Official APIs First: Always prefer APIs when available
- Minimize Server Load: Don't overwhelm servers
- Respect Copyright: Don't republish copyrighted content
- Add Value: Use scraped data to create something useful
❌ Avoid These:
- Ignoring robots.txt: Clearly stated preferences
- Overwhelming Servers: Excessive request rates
- Scraping Personal Data: Privacy violations
- Commercial Redistribution: Without permission
- Circumventing Paywalls: Violates business models
- Aggressive Automation: Impacts legitimate users
Monitoring Your Scraping Health
Keep track of how your scraping is performing:
Scraping health monitoring
```python
import time
import json
import requests
from datetime import datetime
from collections import defaultdict


class ScrapingMonitor:
    def __init__(self):
        self.stats = {
            'requests_made': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'blocked_requests': 0,
            'average_response_time': 0,
            'response_times': [],
            'status_codes': defaultdict(int),
            'start_time': datetime.now()
        }

    def record_request(self, url, status_code, response_time, success=True):
        """Record metrics for a request"""
        self.stats['requests_made'] += 1
        self.stats['response_times'].append(response_time)
        self.stats['status_codes'][status_code] += 1

        if success and status_code == 200:
            self.stats['successful_requests'] += 1
        elif status_code in [403, 429]:
            self.stats['blocked_requests'] += 1
        else:
            self.stats['failed_requests'] += 1

        # Update average response time
        self.stats['average_response_time'] = sum(self.stats['response_times']) / len(self.stats['response_times'])

    def get_health_report(self):
        """Generate a health report"""
        runtime = datetime.now() - self.stats['start_time']
        success_rate = (self.stats['successful_requests'] / self.stats['requests_made']) * 100 if self.stats['requests_made'] > 0 else 0
        blocked_rate = (self.stats['blocked_requests'] / self.stats['requests_made']) * 100 if self.stats['requests_made'] > 0 else 0

        report = {
            'summary': {
                'runtime_minutes': round(runtime.total_seconds() / 60, 2),
                'total_requests': self.stats['requests_made'],
                'success_rate': round(success_rate, 2),
                'blocked_rate': round(blocked_rate, 2),
                'average_response_time': round(self.stats['average_response_time'], 2)
            },
            'status_codes': dict(self.stats['status_codes']),
            'health_indicators': self._get_health_indicators()
        }
        return report

    def _get_health_indicators(self):
        """Get health indicators and warnings"""
        indicators = []
        success_rate = (self.stats['successful_requests'] / self.stats['requests_made']) * 100 if self.stats['requests_made'] > 0 else 0
        blocked_rate = (self.stats['blocked_requests'] / self.stats['requests_made']) * 100 if self.stats['requests_made'] > 0 else 0
        avg_response_time = self.stats['average_response_time']

        if success_rate < 80:
            indicators.append("⚠️ Low success rate - consider slowing down")
        if blocked_rate > 10:
            indicators.append("🚫 High block rate - you're being detected")
        if avg_response_time > 5:
            indicators.append("🐌 Slow responses - server may be throttling you")
        if success_rate > 95 and blocked_rate < 1:
            indicators.append("✅ Healthy scraping - good practices being followed")

        return indicators


# Example usage with monitoring
monitor = ScrapingMonitor()

def monitored_scrape(urls):
    """Scrape with health monitoring"""
    for url in urls:
        start_time = time.time()
        try:
            response = requests.get(url, timeout=30)
            response_time = time.time() - start_time

            monitor.record_request(
                url=url,
                status_code=response.status_code,
                response_time=response_time,
                success=(response.status_code == 200)
            )
            print(f"Scraped {url}: {response.status_code} ({response_time:.2f}s)")

        except Exception as e:
            response_time = time.time() - start_time
            monitor.record_request(
                url=url,
                status_code=0,  # Network error
                response_time=response_time,
                success=False
            )
            print(f"Error scraping {url}: {e}")

        time.sleep(2)  # Respectful delay

    # Print health report
    report = monitor.get_health_report()
    print("\n" + "=" * 50)
    print("SCRAPING HEALTH REPORT")
    print("=" * 50)
    print(json.dumps(report, indent=2))

# Test monitoring
test_urls = [
    "https://httpbin.org/status/200",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/status/429",
]
monitored_scrape(test_urls)
```
When to Use Supacrawler vs DIY
Here's a decision framework:
| Scenario | Recommendation | Reason |
|---|---|---|
| Learning web scraping | DIY with Python | Understand fundamentals |
| Simple static sites | DIY or Supacrawler | Both work well |
| JavaScript-heavy sites | Supacrawler | Complex to handle manually |
| High-volume scraping | Supacrawler | Built-in optimization |
| Production systems | Supacrawler | Reliability and maintenance |
| Complex anti-bot sites | Supacrawler | Advanced countermeasures |
| Budget constraints | DIY first, then Supacrawler | Start free, scale up |
| Time constraints | Supacrawler | Faster development |
Conclusion: Building Sustainable Scrapers
The key to successful web scraping isn't just avoiding detection—it's building sustainable, respectful systems that can run reliably over time.
Remember these principles:
- Start Respectfully: Begin with conservative rates and realistic behavior
- Monitor Continuously: Track your success rates and response times
- Adapt Quickly: When you see signs of blocking, adjust immediately
- Choose the Right Tool: Match complexity to your needs
- Stay Updated: Anti-bot measures evolve, so should your techniques
For most production use cases, Supacrawler eliminates these concerns entirely:
- ✅ Built-in best practices: Rate limiting, user agent rotation, error handling
- ✅ Automatic adaptation: Adjusts to site-specific anti-bot measures
- ✅ No maintenance: Updates automatically as sites change
- ✅ Higher success rates: Professional-grade infrastructure
- ✅ Focus on value: Spend time using data, not fighting blocks
Whether you choose DIY or Supacrawler, the goal is the same: extract valuable data while being a good citizen of the web.
Ready to start scraping responsibly?
- For learning: Try the Python examples above
- For production: Get started with Supacrawler - 1,000 free requests to test
- For complex sites: Check our advanced documentation
Happy (and respectful) scraping! 🤖✨