Web Scraping Best Practices: How to Avoid Getting Blocked in 2025
Getting blocked while web scraping is one of the most frustrating experiences for developers. You've written the perfect scraper, tested it locally, and then... 403 Forbidden. Your IP is banned. Your hours of work are useless.
But here's the thing: most blocking happens because developers unknowingly violate basic web scraping etiquette. Websites don't want to block legitimate users - they want to block malicious bots and resource-heavy scrapers that slow down their servers.
This guide will teach you how to scrape websites responsibly, effectively, and without getting blocked. You'll learn the techniques that separate professional scrapers from amateur scripts that get shut down immediately.
Why Websites Block Scrapers
Before diving into solutions, let's understand why websites implement blocking mechanisms:
1. Server Resource Protection
Websites need to serve real users. A scraper making 100 requests per second can slow down the entire site for legitimate visitors.
2. Data Protection
Some data is valuable intellectual property. Companies invest in creating content and want to control how it's accessed.
3. Business Model Protection
Ad-supported sites lose revenue when scrapers bypass ads. E-commerce sites don't want competitors easily copying their pricing.
4. Legal Compliance
Some blocking is required by law or terms of service agreements.
5. Bandwidth Costs
Every request costs money in server resources and bandwidth. Excessive scraping can significantly impact hosting costs.
Understanding these motivations helps us scrape more ethically and effectively.
The Most Common Blocking Techniques
Websites use several methods to detect and block scrapers:
Common blocking mechanisms
```python
# These are examples of what NOT to do - behaviors that get you blocked
import requests
import time

# ❌ BAD: This will get you blocked quickly
def bad_scraping_example():
    """This demonstrates common mistakes that lead to blocking"""

    # Mistake 1: No delays between requests
    urls = [f"https://example.com/page{i}" for i in range(100)]
    for url in urls:
        response = requests.get(url)  # Sending requests as fast as possible
        print(response.status_code)

    # Mistake 2: Obviously bot-like user agent
    headers = {'User-Agent': 'Python-requests/2.28.0'}  # Screams "I'm a bot!"

    # Mistake 3: Same exact timing pattern
    for url in urls:
        response = requests.get(url, headers=headers)
        time.sleep(1)  # Exactly 1 second every time - robotic behavior

    # Mistake 4: Ignoring errors and retrying immediately
    for url in urls:
        try:
            response = requests.get(url)
            if response.status_code == 429:  # Rate limited
                # Immediately try again - bad!
                response = requests.get(url)
        except:
            pass

# Don't run this! It's an example of what gets blocked
```
Detection Methods Websites Use:
- Rate Analysis: Too many requests too quickly
- User Agent Detection: Non-browser user agents
- Behavior Patterns: Robotic, predictable behavior
- IP Reputation: Known proxy/datacenter IPs
- JavaScript Challenges: Pages that require JS execution
- CAPTCHA Systems: Human verification challenges
- Honeypot Traps: Hidden links that only bots would follow (see the sketch below)
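
That last point deserves a closer look, because honeypots are easy to fall into if you blindly follow every link a parser finds. Here is a minimal, hypothetical sketch of filtering out links a human couldn't see; it assumes BeautifulSoup is installed, and the CSS heuristics are illustrative rather than exhaustive:

```python
# Hypothetical sketch: skip likely honeypot links while parsing a page.
# Assumes beautifulsoup4 is installed; heuristics are illustrative only.
from bs4 import BeautifulSoup

def visible_links(html):
    """Return hrefs that a human could plausibly see and click."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        style = (a.get("style") or "").replace(" ", "").lower()
        # Common honeypot tells: hidden via inline CSS or hidden attributes
        if "display:none" in style or "visibility:hidden" in style:
            continue
        if a.get("aria-hidden") == "true" or a.has_attr("hidden"):
            continue
        links.append(a["href"])
    return links
```

Real honeypots can also be positioned off-screen or hidden via external stylesheets, so treat this as a starting point, not a guarantee.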
Best Practice #1: Respect Rate Limits
The foundation of ethical scraping is controlling your request rate.
Implementing proper rate limiting
```python
import time
import random
import requests
import logging
from datetime import datetime, timedelta

# Set up logging to track your scraping behavior
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class RespectfulScraper:
    def __init__(self, min_delay=1, max_delay=3, requests_per_minute=20):
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.requests_per_minute = requests_per_minute
        self.request_times = []

    def smart_delay(self):
        """Implement intelligent delays between requests"""
        # Clean old requests (older than 1 minute)
        now = datetime.now()
        self.request_times = [t for t in self.request_times
                              if now - t < timedelta(minutes=1)]

        # Check if we're at our rate limit
        if len(self.request_times) >= self.requests_per_minute:
            # Wait until the oldest request is over a minute old
            oldest_request = min(self.request_times)
            wait_time = 60 - (now - oldest_request).seconds
            logger.info(f"Rate limit reached. Waiting {wait_time} seconds.")
            time.sleep(wait_time)

        # Add random delay to avoid predictable patterns
        delay = random.uniform(self.min_delay, self.max_delay)
        logger.info(f"Waiting {delay:.2f} seconds before next request")
        time.sleep(delay)

        # Record this request time
        self.request_times.append(datetime.now())

    def make_request(self, url, **kwargs):
        """Make a request with proper rate limiting"""
        self.smart_delay()

        # Add some randomization to request timing
        if random.random() < 0.1:  # 10% chance
            extra_delay = random.uniform(2, 5)
            logger.info(f"Random extra delay: {extra_delay:.2f} seconds")
            time.sleep(extra_delay)

        return requests.get(url, **kwargs)


# Example usage
scraper = RespectfulScraper(min_delay=1, max_delay=3, requests_per_minute=15)

urls = [
    "https://httpbin.org/delay/1",
    "https://httpbin.org/json",
    "https://httpbin.org/user-agent"
]

for url in urls:
    response = scraper.make_request(url)
    logger.info(f"Status: {response.status_code} for {url}")
```
Key Rate Limiting Principles:
- Start Conservative: Begin with 1 request per 2-3 seconds
- Add Randomness: Vary delays to avoid robotic patterns
- Respect HTTP Status Codes:
  - 429 Too Many Requests: back off exponentially
  - 503 Service Unavailable: server overloaded, wait longer
- Monitor Your Impact: Track response times - if they're increasing, slow down (a short sketch follows this list)
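
To make that last principle concrete, here is a minimal sketch of response-time-based throttling. The function name, window size, and thresholds are assumptions for illustration, not tuned values:

```python
# Minimal sketch of the "monitor your impact" principle: if responses are
# trending slower, widen the delay. Thresholds below are arbitrary examples.
import time
import requests

def adaptive_delay_scrape(urls, base_delay=2.0, max_delay=30.0):
    delay = base_delay
    recent_times = []  # rolling window of recent response times

    for url in urls:
        start = time.time()
        response = requests.get(url, timeout=30)
        elapsed = time.time() - start
        recent_times = (recent_times + [elapsed])[-5:]

        # Compare the latest response against the oldest in the window;
        # a 2x slowdown is treated as a sign we're straining the server.
        if len(recent_times) == 5 and elapsed > 2 * recent_times[0]:
            delay = min(delay * 1.5, max_delay)
            print(f"Responses slowing down ({elapsed:.1f}s); delay now {delay:.1f}s")

        print(f"{url}: {response.status_code} in {elapsed:.2f}s")
        time.sleep(delay)
```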
Best Practice #2: Use Realistic User Agents
Your user agent string is like your ID card to websites. Make it realistic.
Proper user agent management
```python
import random
import requests


class UserAgentManager:
    def __init__(self):
        # Real user agents from popular browsers
        self.user_agents = [
            # Chrome on Windows
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            # Chrome on macOS
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            # Firefox on Windows
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            # Firefox on macOS
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
            # Safari on macOS
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15',
            # Edge on Windows
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
        ]
        self.current_agent = random.choice(self.user_agents)

    def get_headers(self):
        """Get realistic browser headers"""
        return {
            'User-Agent': self.current_agent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'DNT': '1',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Cache-Control': 'max-age=0'
        }

    def rotate_agent(self):
        """Switch to a different user agent"""
        old_agent = self.current_agent
        self.current_agent = random.choice(self.user_agents)
        # Make sure we actually changed
        while self.current_agent == old_agent and len(self.user_agents) > 1:
            self.current_agent = random.choice(self.user_agents)
        return self.current_agent


# Example usage
ua_manager = UserAgentManager()

def make_realistic_request(url):
    """Make a request that looks like it's from a real browser"""
    headers = ua_manager.get_headers()

    # Occasionally rotate user agent (like switching browsers)
    if random.random() < 0.05:  # 5% chance
        ua_manager.rotate_agent()
        headers = ua_manager.get_headers()
        print("Rotated to new user agent")

    return requests.get(url, headers=headers)

# Test it
response = make_realistic_request("https://httpbin.org/headers")
print(response.json())
```
User Agent Best Practices:
- Use Real Browser Strings: Copy from actual browsers, not generic ones
- Rotate Occasionally: Don't use the same agent for thousands of requests
- Match Headers: Include realistic Accept, Accept-Language headers
- Stay Current: Update user agents periodically as browsers update
Best Practice #3: Handle Errors Gracefully
How you handle errors can mean the difference between getting blocked and staying undetected.
Proper error handling
```python
import requests
import time
import random
from functools import wraps


def retry_with_backoff(max_retries=3, backoff_base=2, max_backoff=60):
    """Decorator for implementing exponential backoff on failed requests"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    response = func(*args, **kwargs)

                    # Handle different HTTP status codes
                    if response.status_code == 200:
                        return response

                    elif response.status_code == 429:  # Rate limited
                        # Check if server provided Retry-After header
                        retry_after = response.headers.get('Retry-After')
                        if retry_after:
                            wait_time = int(retry_after)
                        else:
                            # Exponential backoff
                            wait_time = min(backoff_base ** attempt, max_backoff)
                        print(f"Rate limited. Waiting {wait_time} seconds.")
                        time.sleep(wait_time)
                        if attempt == max_retries:
                            print("Max retries reached for rate limiting")
                            return response

                    elif response.status_code in [502, 503, 504]:  # Server errors
                        wait_time = min(backoff_base ** attempt, max_backoff)
                        print(f"Server error {response.status_code}. Waiting {wait_time} seconds.")
                        time.sleep(wait_time)
                        if attempt == max_retries:
                            return response

                    elif response.status_code == 403:  # Forbidden
                        print("403 Forbidden - you might be blocked")
                        # Don't retry immediately for 403s
                        if attempt < max_retries:
                            wait_time = 30 + random.uniform(10, 20)  # Wait longer
                            print(f"Waiting {wait_time:.1f} seconds before retry")
                            time.sleep(wait_time)
                        else:
                            return response

                    else:
                        # For other status codes, return immediately
                        return response

                except requests.exceptions.ConnectionError:
                    wait_time = min(backoff_base ** attempt, max_backoff)
                    print(f"Connection error. Waiting {wait_time} seconds.")
                    time.sleep(wait_time)
                    if attempt == max_retries:
                        raise
                except requests.exceptions.Timeout:
                    wait_time = min(backoff_base ** attempt, max_backoff)
                    print(f"Timeout. Waiting {wait_time} seconds.")
                    time.sleep(wait_time)
                    if attempt == max_retries:
                        raise
            return None
        return wrapper
    return decorator


class SmartScraper:
    def __init__(self):
        self.session = requests.Session()
        self.consecutive_errors = 0
        self.max_consecutive_errors = 5

    @retry_with_backoff(max_retries=3)
    def get_page(self, url, **kwargs):
        """Make a request with intelligent error handling"""
        response = self.session.get(url, timeout=30, **kwargs)

        # Track error patterns
        if response.status_code >= 400:
            self.consecutive_errors += 1
            if self.consecutive_errors >= self.max_consecutive_errors:
                print(f"Too many consecutive errors ({self.consecutive_errors})")
                print("Taking a longer break...")
                time.sleep(300)  # 5 minute break
                self.consecutive_errors = 0
        else:
            self.consecutive_errors = 0

        return response

    def scrape_with_circuit_breaker(self, urls):
        """Scrape multiple URLs with circuit breaker pattern"""
        results = []
        for i, url in enumerate(urls):
            print(f"Scraping {i+1}/{len(urls)}: {url}")
            try:
                response = self.get_page(url)
                if response and response.status_code == 200:
                    results.append({
                        'url': url,
                        'status': 'success',
                        'content_length': len(response.text)
                    })
                else:
                    results.append({
                        'url': url,
                        'status': 'failed',
                        'status_code': response.status_code if response else 'No response'
                    })
            except Exception as e:
                print(f"Error scraping {url}: {e}")
                results.append({
                    'url': url,
                    'status': 'error',
                    'error': str(e)
                })

            # Progressive delays - slow down if we're having issues
            if self.consecutive_errors > 0:
                extra_delay = self.consecutive_errors * 2
                print(f"Adding extra delay: {extra_delay} seconds")
                time.sleep(extra_delay)

        return results


# Example usage
scraper = SmartScraper()

test_urls = [
    "https://httpbin.org/status/200",  # Success
    "https://httpbin.org/status/429",  # Rate limited
    "https://httpbin.org/status/503",  # Server error
]

results = scraper.scrape_with_circuit_breaker(test_urls)
for result in results:
    print(f"URL: {result['url']}, Status: {result['status']}")
```
Error Handling Principles:
- Exponential Backoff: Wait longer after each failure
- Respect Retry-After Headers: When servers tell you when to retry, listen
- Circuit Breaker Pattern: Stop trying if you're getting too many errors
- Different Strategies for Different Errors: 429 vs 503 vs 403 need different approaches
Best Practice #4: Respect robots.txt
The robots.txt file is a website's polite way of saying "please don't scrape these areas."
Respecting robots.txt
```python
import urllib.robotparser
import requests
import time
from urllib.parse import urljoin, urlparse


class RobotsRespectfulScraper:
    def __init__(self, user_agent="*"):
        self.user_agent = user_agent
        self.robots_cache = {}

    def can_fetch(self, url):
        """Check if we're allowed to scrape this URL according to robots.txt"""
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"

        # Check if we've already fetched robots.txt for this domain
        if base_url not in self.robots_cache:
            self.robots_cache[base_url] = self._fetch_robots_txt(base_url)

        robots_parser = self.robots_cache[base_url]
        if robots_parser is None:
            # If we can't fetch robots.txt, assume scraping is allowed
            return True

        return robots_parser.can_fetch(self.user_agent, url)

    def _fetch_robots_txt(self, base_url):
        """Fetch and parse robots.txt for a domain"""
        robots_url = urljoin(base_url, "/robots.txt")
        try:
            print(f"Fetching robots.txt from {robots_url}")
            robots_parser = urllib.robotparser.RobotFileParser()
            robots_parser.set_url(robots_url)
            robots_parser.read()
            return robots_parser
        except Exception as e:
            print(f"Could not fetch robots.txt from {robots_url}: {e}")
            return None

    def get_crawl_delay(self, url):
        """Get the recommended crawl delay for this domain"""
        parsed_url = urlparse(url)
        base_url = f"{parsed_url.scheme}://{parsed_url.netloc}"

        if base_url not in self.robots_cache:
            self.robots_cache[base_url] = self._fetch_robots_txt(base_url)

        robots_parser = self.robots_cache[base_url]
        if robots_parser is None:
            return 1  # Default 1 second delay

        crawl_delay = robots_parser.crawl_delay(self.user_agent)
        return crawl_delay if crawl_delay is not None else 1

    def scrape_url(self, url):
        """Scrape a URL only if allowed by robots.txt"""
        if not self.can_fetch(url):
            print(f"❌ Robots.txt disallows scraping {url}")
            return None

        print(f"✅ Robots.txt allows scraping {url}")

        # Get recommended delay
        delay = self.get_crawl_delay(url)
        print(f"Recommended crawl delay: {delay} seconds")

        # Make the request
        time.sleep(delay)
        try:
            response = requests.get(url, timeout=30)
            return response
        except Exception as e:
            print(f"Error scraping {url}: {e}")
            return None


# Example usage
scraper = RobotsRespectfulScraper(user_agent="*")

test_urls = [
    "https://example.com/",
    "https://example.com/page1",
    "https://httpbin.org/robots.txt",  # This will show you the robots.txt content
]

for url in test_urls:
    print(f"\n--- Checking {url} ---")
    response = scraper.scrape_url(url)
    if response:
        print(f"Successfully scraped: {response.status_code}")
    else:
        print("Could not scrape (blocked by robots.txt or error)")
```
robots.txt Best Practices:
- Always Check First: Before scraping a new domain
- Respect Crawl-Delay: If specified, use it as your minimum delay
- Cache Results: Don't fetch robots.txt for every request
- Handle Missing Files: Assume allowed if robots.txt doesn't exist
Best Practice #5: Use Sessions and Connection Pooling
Reusing connections makes your scraping more efficient and less detectable.
Efficient session management
```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import time


class EfficientScraper:
    def __init__(self):
        self.session = requests.Session()
        self._setup_session()

    def _setup_session(self):
        """Configure session with retry strategy and connection pooling"""
        # Configure retry strategy
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["HEAD", "GET", "OPTIONS"]
        )

        # Set up HTTP adapter with retry strategy
        adapter = HTTPAdapter(
            max_retries=retry_strategy,
            pool_connections=10,  # Number of connection pools
            pool_maxsize=20,      # Max connections per pool
            pool_block=False
        )

        # Mount adapter for both HTTP and HTTPS
        self.session.mount("http://", adapter)
        self.session.mount("https://", adapter)

        # Set default headers
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0'
        })

    def scrape_multiple_pages(self, urls, delay=2):
        """Efficiently scrape multiple pages using session"""
        results = []
        for i, url in enumerate(urls):
            print(f"Scraping {i+1}/{len(urls)}: {url}")
            try:
                start_time = time.time()
                # Note: requests sessions have no global timeout, so pass it per request
                response = self.session.get(url, timeout=30)
                request_time = time.time() - start_time

                result = {
                    'url': url,
                    'status_code': response.status_code,
                    'response_time': round(request_time, 2),
                    'content_length': len(response.content),
                    'content_type': response.headers.get('content-type', 'unknown')
                }

                # Check if we're being throttled (slow responses might indicate rate limiting)
                if request_time > 10:
                    print(f"⚠️ Slow response ({request_time:.1f}s) - might be rate limited")
                    delay *= 1.5  # Increase delay for subsequent requests

                results.append(result)

            except requests.exceptions.RequestException as e:
                print(f"Error scraping {url}: {e}")
                results.append({'url': url, 'error': str(e)})

            # Respectful delay
            if i < len(urls) - 1:  # Don't wait after the last URL
                time.sleep(delay)

        return results

    def close(self):
        """Clean up session resources"""
        self.session.close()


# Example usage with connection reuse benefits
scraper = EfficientScraper()

# These requests will reuse the same TCP connection
same_domain_urls = [
    "https://httpbin.org/json",
    "https://httpbin.org/headers",
    "https://httpbin.org/user-agent",
    "https://httpbin.org/ip"
]

print("Scraping same domain (connection reuse):")
results = scraper.scrape_multiple_pages(same_domain_urls, delay=1)
for result in results:
    if 'error' not in result:
        print(f"  {result['url']}: {result['status_code']} ({result['response_time']}s)")

scraper.close()
```
Session Management Benefits:
- Connection Reuse: Faster subsequent requests to same domain
- Cookie Persistence: Maintains session state across requests
- Connection Pooling: Better resource utilization
- Automatic Retries: Built-in handling of temporary failures
Best Practice #6: Modern Solution - Supacrawler API
While all these techniques are important to understand, modern web scraping APIs like Supacrawler handle these complexities automatically.
Supacrawler: Best practices built-in
```python
from supacrawler import SupacrawlerClient
import os

# Supacrawler automatically handles:
# ✅ Rate limiting
# ✅ User agent rotation
# ✅ JavaScript rendering
# ✅ Error handling with retries
# ✅ IP rotation
# ✅ CAPTCHA solving
# ✅ Connection pooling

client = SupacrawlerClient(api_key=os.environ.get('SUPACRAWLER_API_KEY'))


def scrape_with_built_in_best_practices():
    """Supacrawler handles all the best practices automatically"""

    # Example 1: Simple scraping with automatic best practices
    response = client.scrape(
        url="https://example.com",
        render_js=True,  # Handles JavaScript automatically
        # Rate limiting, user agents, retries all handled automatically
    )

    if response.success:
        print("✅ Scraped successfully")
        print(f"Title: {response.metadata.title}")
        print(f"Content length: {len(response.markdown)}")

    # Example 2: Bulk scraping with automatic optimization
    urls = [
        "https://example.com/page1",
        "https://example.com/page2",
        "https://example.com/page3"
    ]

    results = []
    for url in urls:
        response = client.scrape(url=url)
        results.append({
            'url': url,
            'success': response.success,
            'title': response.metadata.title if response.metadata else None
        })
        # No need to add delays - Supacrawler handles rate limiting

    return results


def advanced_scraping_with_supacrawler():
    """Advanced features that would be complex to implement manually"""

    # Scrape JavaScript-heavy site with custom wait conditions
    response = client.scrape(
        url="https://spa-example.com",
        render_js=True,
        wait_for_selector=".content-loaded",      # Wait for specific element
        scroll_to_bottom=True,                    # Handle infinite scroll
        block_resources=["image", "stylesheet"]   # Optimize for speed
    )

    # Extract structured data with selectors
    response = client.scrape(
        url="https://news-site.com",
        selectors={
            "articles": {
                "selector": ".article",
                "multiple": True,
                "fields": {
                    "title": "h2",
                    "summary": ".summary",
                    "author": ".author",
                    "publish_date": ".date"
                }
            }
        }
    )

    return response.data


def compare_traditional_vs_modern():
    """Compare traditional scraping complexity vs Supacrawler simplicity"""
    print("Traditional approach would require:")
    print("  ❌ 50+ lines of rate limiting code")
    print("  ❌ User agent management")
    print("  ❌ Proxy rotation setup")
    print("  ❌ JavaScript rendering with Selenium/Playwright")
    print("  ❌ Error handling and retries")
    print("  ❌ CAPTCHA solving")
    print("  ❌ Connection pooling")
    print("  ❌ Ongoing maintenance as sites change")

    print("\nSupacrawler approach:")
    print("  ✅ 3 lines of code")
    print("  ✅ All best practices built-in")
    print("  ✅ Zero maintenance")
    print("  ✅ Better success rates")


# Example usage
if __name__ == "__main__":
    print("=== Supacrawler Best Practices Demo ===")
    try:
        results = scrape_with_built_in_best_practices()
        print(f"Scraped {len(results)} URLs successfully")

        advanced_data = advanced_scraping_with_supacrawler()
        print("Advanced scraping completed")

        compare_traditional_vs_modern()
    except Exception as e:
        print(f"Error: {e}")
        print("Make sure to set SUPACRAWLER_API_KEY environment variable")
```
Advanced Anti-Detection Techniques
For sites with sophisticated blocking mechanisms, here are advanced techniques:
1. Browser Fingerprint Randomization
Advanced fingerprint management
```python
import random
import requests


class AdvancedScraper:
    def __init__(self):
        self.session = requests.Session()
        # Randomize the request header fingerprint
        self.session.headers.update(self._get_random_headers())

    def _get_random_headers(self):
        """Generate realistic, randomized headers"""
        # Random screen resolutions (common ones)
        # (Not sent as headers here; these matter for browser-based fingerprinting.)
        resolutions = [
            "1920x1080", "1366x768", "1536x864", "1440x900",
            "1280x720", "1600x900", "2560x1440"
        ]

        # Random timezone offsets
        timezones = ["-480", "-420", "-360", "-300", "-240", "-180", "0", "60", "120"]

        # Random language preferences
        languages = [
            "en-US,en;q=0.9",
            "en-GB,en;q=0.9",
            "en-US,en;q=0.9,es;q=0.8",
            "en-US,en;q=0.9,fr;q=0.8"
        ]

        headers = {
            'User-Agent': self._get_random_user_agent(),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': random.choice(languages),
            'Accept-Encoding': 'gzip, deflate, br',
            'DNT': random.choice(['1', '0']),
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': random.choice(['none', 'same-origin', 'cross-site']),
            'Cache-Control': random.choice(['no-cache', 'max-age=0']),
        }
        return headers

    def _get_random_user_agent(self):
        """Get a realistic, current user agent"""
        # Up-to-date user agents (as of 2025)
        user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0'
        ]
        return random.choice(user_agents)
```
2. Behavioral Mimicking
Human-like behavior patterns
```python
import random
import time
import requests


class HumanLikeScraper:
    def __init__(self):
        self.page_view_times = []
        self.last_request_time = None

    def human_like_delay(self):
        """Implement human-like delays between page views"""
        # Humans don't visit pages at exact intervals.
        # They have patterns: quick browsing, detailed reading, pauses.
        behavior_type = random.choices(
            ['quick_browse', 'normal_read', 'detailed_study', 'distracted'],
            weights=[30, 50, 15, 5]
        )[0]

        if behavior_type == 'quick_browse':
            delay = random.uniform(2, 8)
        elif behavior_type == 'normal_read':
            delay = random.uniform(10, 30)
        elif behavior_type == 'detailed_study':
            delay = random.uniform(45, 120)
        else:  # distracted
            delay = random.uniform(180, 600)  # 3-10 minutes

        # Add some randomness with normal distribution
        delay += random.gauss(0, delay * 0.1)
        delay = max(1, delay)  # Ensure minimum 1 second

        print(f"Human-like delay: {delay:.1f} seconds ({behavior_type})")
        time.sleep(delay)

    def simulate_mouse_movement(self):
        """Simulate realistic mouse movement patterns
        (This would be used with Selenium, shown for concept)"""
        # Humans don't move the mouse in straight lines.
        # They have curves, pauses, micro-movements.
        movements = []
        current_x, current_y = 0, 0
        target_x, target_y = random.randint(100, 800), random.randint(100, 600)

        # Create curved path with noise
        steps = random.randint(10, 30)
        for i in range(steps):
            progress = i / steps
            # Bezier curve-like movement
            x = current_x + (target_x - current_x) * progress
            y = current_y + (target_y - current_y) * progress

            # Add human-like noise
            x += random.gauss(0, 5)
            y += random.gauss(0, 5)

            # Vary speed (humans slow down near target)
            speed_factor = 1 - (progress * 0.5)
            delay = random.uniform(0.01, 0.05) / speed_factor

            movements.append((x, y, delay))

        return movements

    def realistic_reading_pattern(self, content_length):
        """Calculate realistic reading time based on content"""
        # Average reading speed: 200-300 words per minute
        words_per_minute = random.uniform(200, 300)

        # Estimate word count (rough: content_length / 5)
        estimated_words = content_length / 5

        # Calculate reading time in seconds
        reading_time = (estimated_words / words_per_minute) * 60

        # Add scanning time (people don't read everything)
        scanning_factor = random.uniform(0.3, 0.8)
        actual_time = reading_time * scanning_factor

        # Add think time
        think_time = random.uniform(2, 10)

        total_time = actual_time + think_time
        return max(5, total_time)  # Minimum 5 seconds


# Example usage
scraper = HumanLikeScraper()

def scrape_like_human(urls):
    """Scrape URLs with human-like behavior"""
    for i, url in enumerate(urls):
        print(f"\nVisiting page {i+1}: {url}")

        # Make request (would use your preferred method)
        response = requests.get(url)

        if response.status_code == 200:
            # Simulate reading the content
            content_length = len(response.text)
            reading_time = scraper.realistic_reading_pattern(content_length)
            print(f"Simulating reading for {reading_time:.1f} seconds")
            time.sleep(reading_time)

        # Human-like delay before next page
        if i < len(urls) - 1:
            scraper.human_like_delay()
```
Legal and Ethical Considerations
Before implementing any scraping strategy, consider the legal and ethical implications:
✅ Good Practices:
- Read Terms of Service: Understand what's allowed
- Check robots.txt: Respect website preferences
- Use Official APIs First: Always prefer APIs when available
- Minimize Server Load: Don't overwhelm servers
- Respect Copyright: Don't republish copyrighted content
- Add Value: Use scraped data to create something useful
❌ Avoid These:
- Ignoring robots.txt: Clearly stated preferences
- Overwhelming Servers: Excessive request rates
- Scraping Personal Data: Privacy violations
- Commercial Redistribution: Without permission
- Circumventing Paywalls: Violates business models
- Aggressive Automation: Impacts legitimate users
Monitoring Your Scraping Health
Keep track of how your scraping is performing:
Scraping health monitoring
```python
import time
import json
import requests
from datetime import datetime
from collections import defaultdict


class ScrapingMonitor:
    def __init__(self):
        self.stats = {
            'requests_made': 0,
            'successful_requests': 0,
            'failed_requests': 0,
            'blocked_requests': 0,
            'average_response_time': 0,
            'response_times': [],
            'status_codes': defaultdict(int),
            'start_time': datetime.now()
        }

    def record_request(self, url, status_code, response_time, success=True):
        """Record metrics for a request"""
        self.stats['requests_made'] += 1
        self.stats['response_times'].append(response_time)
        self.stats['status_codes'][status_code] += 1

        if success and status_code == 200:
            self.stats['successful_requests'] += 1
        elif status_code in [403, 429]:
            self.stats['blocked_requests'] += 1
        else:
            self.stats['failed_requests'] += 1

        # Update average response time
        self.stats['average_response_time'] = sum(self.stats['response_times']) / len(self.stats['response_times'])

    def get_health_report(self):
        """Generate a health report"""
        runtime = datetime.now() - self.stats['start_time']
        success_rate = (self.stats['successful_requests'] / self.stats['requests_made']) * 100 if self.stats['requests_made'] > 0 else 0
        blocked_rate = (self.stats['blocked_requests'] / self.stats['requests_made']) * 100 if self.stats['requests_made'] > 0 else 0

        report = {
            'summary': {
                'runtime_minutes': round(runtime.total_seconds() / 60, 2),
                'total_requests': self.stats['requests_made'],
                'success_rate': round(success_rate, 2),
                'blocked_rate': round(blocked_rate, 2),
                'average_response_time': round(self.stats['average_response_time'], 2)
            },
            'status_codes': dict(self.stats['status_codes']),
            'health_indicators': self._get_health_indicators()
        }
        return report

    def _get_health_indicators(self):
        """Get health indicators and warnings"""
        indicators = []
        success_rate = (self.stats['successful_requests'] / self.stats['requests_made']) * 100 if self.stats['requests_made'] > 0 else 0
        blocked_rate = (self.stats['blocked_requests'] / self.stats['requests_made']) * 100 if self.stats['requests_made'] > 0 else 0
        avg_response_time = self.stats['average_response_time']

        if success_rate < 80:
            indicators.append("⚠️ Low success rate - consider slowing down")
        if blocked_rate > 10:
            indicators.append("🚫 High block rate - you're being detected")
        if avg_response_time > 5:
            indicators.append("🐌 Slow responses - server may be throttling you")
        if success_rate > 95 and blocked_rate < 1:
            indicators.append("✅ Healthy scraping - good practices being followed")

        return indicators


# Example usage with monitoring
monitor = ScrapingMonitor()

def monitored_scrape(urls):
    """Scrape with health monitoring"""
    for url in urls:
        start_time = time.time()
        try:
            response = requests.get(url, timeout=30)
            response_time = time.time() - start_time

            monitor.record_request(
                url=url,
                status_code=response.status_code,
                response_time=response_time,
                success=(response.status_code == 200)
            )
            print(f"Scraped {url}: {response.status_code} ({response_time:.2f}s)")

        except Exception as e:
            response_time = time.time() - start_time
            monitor.record_request(
                url=url,
                status_code=0,  # Network error
                response_time=response_time,
                success=False
            )
            print(f"Error scraping {url}: {e}")

        time.sleep(2)  # Respectful delay

    # Print health report
    report = monitor.get_health_report()
    print("\n" + "=" * 50)
    print("SCRAPING HEALTH REPORT")
    print("=" * 50)
    print(json.dumps(report, indent=2))

# Test monitoring
test_urls = [
    "https://httpbin.org/status/200",
    "https://httpbin.org/delay/1",
    "https://httpbin.org/status/429",
]
monitored_scrape(test_urls)
```
When to Use Supacrawler vs DIY
Here's a decision framework:
| Scenario | Recommendation | Reason |
|---|---|---|
| Learning web scraping | DIY with Python | Understand fundamentals |
| Simple static sites | DIY or Supacrawler | Both work well |
| JavaScript-heavy sites | Supacrawler | Complex to handle manually |
| High-volume scraping | Supacrawler | Built-in optimization |
| Production systems | Supacrawler | Reliability and maintenance |
| Complex anti-bot sites | Supacrawler | Advanced countermeasures |
| Budget constraints | DIY first, then Supacrawler | Start free, scale up |
| Time constraints | Supacrawler | Faster development |
Conclusion: Building Sustainable Scrapers
The key to successful web scraping isn't just avoiding detection—it's building sustainable, respectful systems that can run reliably over time.
Remember these principles:
- Start Respectfully: Begin with conservative rates and realistic behavior
- Monitor Continuously: Track your success rates and response times
- Adapt Quickly: When you see signs of blocking, adjust immediately
- Choose the Right Tool: Match complexity to your needs
- Stay Updated: Anti-bot measures evolve, so should your techniques
For most production use cases, Supacrawler eliminates these concerns entirely:
- ✅ Built-in best practices: Rate limiting, user agent rotation, error handling
- ✅ Automatic adaptation: Adjusts to site-specific anti-bot measures
- ✅ No maintenance: Updates automatically as sites change
- ✅ Higher success rates: Professional-grade infrastructure
- ✅ Focus on value: Spend time using data, not fighting blocks
Whether you choose DIY or Supacrawler, the goal is the same: extract valuable data while being a good citizen of the web.
Ready to start scraping responsibly?
- For learning: Try the Python examples above
- For production: Get started with Supacrawler - 1,000 free requests to test
- For complex sites: Check our advanced documentation
Happy (and respectful) scraping! 🤖✨