How to Fix 403 Forbidden Errors in Web Scraping: Complete Troubleshooting Guide 2025

Nothing kills the momentum of a web scraping project quite like seeing HTTP 403 Forbidden in your logs. You've written the perfect scraper, tested it thoroughly, and then... access denied. Your carefully crafted bot is being rejected at the door.

If you're reading this, chances are you're staring at error messages right now, wondering why your scraper worked yesterday but fails today, or why it works on some sites but not others.

Here's the reality: 403 Forbidden errors are the web's way of saying "I know you're a bot, and I don't want you here." But the good news? Most of these blocks are preventable and fixable with the right techniques.

This comprehensive guide will teach you everything you need to know about diagnosing, understanding, and fixing 403 Forbidden errors in web scraping. We'll cover everything from quick fixes to advanced evasion techniques, with real code examples that actually work.

Understanding 403 Forbidden Errors

Before jumping into solutions, let's understand what's actually happening when you get a 403 error.

What 403 Forbidden Really Means

A 403 status code specifically means: "The server understood your request, but refuses to authorize it." This is different from other common errors:

  • 401 Unauthorized: You need to authenticate (provide credentials)
  • 404 Not Found: The resource doesn't exist
  • 403 Forbidden: You're not allowed to access this resource, period

Understanding different HTTP error codes

import requests

def demonstrate_different_errors():
    """
    Examples of different HTTP errors you might encounter
    """
    # 403 Forbidden - Most common in web scraping
    try:
        response = requests.get("https://example-site.com/protected-content")
        if response.status_code == 403:
            print("403 Forbidden: Server detected automation/bot behavior")
            print("Possible causes:")
            print("- User agent detected as bot")
            print("- IP address blocked")
            print("- Rate limiting triggered")
            print("- Missing required headers")
    except Exception as e:
        print(f"Request failed: {e}")

    # 401 Unauthorized - Authentication required
    try:
        response = requests.get("https://api.example.com/private-data")
        if response.status_code == 401:
            print("401 Unauthorized: Need valid credentials")
            print("Solution: Add API key, login, or auth headers")
    except Exception as e:
        print(f"Request failed: {e}")

    # 429 Too Many Requests - Rate limiting
    try:
        response = requests.get("https://api.example.com/data")
        if response.status_code == 429:
            print("429 Too Many Requests: Hitting rate limits")
            print("Solution: Slow down request rate")
            # Check for Retry-After header
            retry_after = response.headers.get('Retry-After')
            if retry_after:
                print(f"Server says wait {retry_after} seconds")
    except Exception as e:
        print(f"Request failed: {e}")

# Example of what triggers 403 errors
def bad_scraping_example():
    """
    This code will likely trigger 403 errors - DON'T do this!
    """
    headers = {
        'User-Agent': 'Python-requests/2.28.0'  # Screams "I'm a bot!"
    }
    urls = [f"https://example-store.com/products?page={i}" for i in range(100)]
    # Rapid-fire requests with obvious bot signature
    for url in urls:
        response = requests.get(url, headers=headers)
        print(f"Status: {response.status_code}")
        # No delays, same user agent, predictable pattern = BLOCKED

if __name__ == "__main__":
    demonstrate_different_errors()

Common Triggers for 403 Errors

Websites use various signals to detect and block automated traffic:

1. User Agent Detection

Default library user agents are dead giveaways:

  • Python-requests/2.28.0
  • Go-http-client/1.1
  • curl/7.68.0

2. Request Pattern Analysis

  • Too many requests too quickly
  • Perfectly timed intervals (robotic behavior)
  • Accessing pages in unnatural order

3. Missing Browser Headers

Real browsers send dozens of headers. Missing key ones raises red flags (a quick self-check follows this list):

  • Accept-Language
  • Accept-Encoding
  • Cache-Control
  • Sec-Fetch-* headers

4. IP Reputation

  • Known datacenter/VPS IP ranges
  • Previously flagged IPs
  • Geographic restrictions

5. JavaScript Challenges

  • Missing JavaScript execution
  • Failed browser fingerprint checks
  • CAPTCHA systems
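
A quick way to see which of these signals your own scraper is sending is to point it at an echo service such as httpbin.org, which reflects the request headers and source IP back to you. This is a minimal sketch, assuming a plain requests setup; the exact headers you see will depend on your HTTP library and version:

import requests

def show_my_fingerprint():
    """Echo back the headers and IP that a default requests call exposes."""
    # httpbin.org/headers reflects the request headers it received
    headers_seen = requests.get("https://httpbin.org/headers", timeout=10).json()["headers"]
    print("Headers the server sees:")
    for name, value in headers_seen.items():
        print(f"  {name}: {value}")
    # httpbin.org/ip reflects the source IP (useful for spotting datacenter ranges)
    ip_seen = requests.get("https://httpbin.org/ip", timeout=10).json()["origin"]
    print(f"Source IP the server sees: {ip_seen}")

if __name__ == "__main__":
    show_my_fingerprint()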

Diagnostic Approach: Finding the Root Cause

Before applying fixes, you need to understand why you're being blocked. Here's a systematic diagnostic approach:

Systematic 403 error diagnosis

import requests
import time
from urllib.parse import urlparse

class ForbiddenErrorDiagnostic:
    def __init__(self, url):
        self.url = url
        self.domain = urlparse(url).netloc
        self.results = {}

    def run_full_diagnosis(self):
        """Run comprehensive diagnosis to identify blocking causes"""
        print(f"Diagnosing 403 errors for: {self.url}")
        print("=" * 50)
        # Test 1: Basic request (establish baseline)
        self.test_basic_request()
        # Test 2: User agent impact
        self.test_user_agent_impact()
        # Test 3: Headers impact
        self.test_headers_impact()
        # Test 4: Rate limiting sensitivity
        self.test_rate_limiting()
        # Test 5: JavaScript requirement
        self.test_javascript_requirement()
        # Test 6: Geographic restrictions
        self.test_geographic_blocking()
        # Generate diagnosis report
        self.generate_report()

    def test_basic_request(self):
        """Test with minimal request to establish baseline"""
        print("Test 1: Basic Request")
        try:
            response = requests.get(self.url, timeout=10)
            status = response.status_code
            self.results['basic_request'] = {
                'status_code': status,
                'success': status == 200,
                'headers_received': len(response.headers),
                'content_length': len(response.content)
            }
            print(f" Status Code: {status}")
            print(f" Content Length: {len(response.content)} bytes")
            if status == 403:
                print(" ❌ Blocked on basic request - likely user agent or IP issue")
            elif status == 200:
                print(" ✅ Basic request successful")
            else:
                print(f" ⚠️ Unexpected status: {status}")
        except Exception as e:
            print(f" ❌ Request failed: {e}")
            self.results['basic_request'] = {'error': str(e)}
        print()

    def test_user_agent_impact(self):
        """Test different user agents to see if that's the blocking factor"""
        print("Test 2: User Agent Impact")
        user_agents = {
            'python_requests': 'Python-requests/2.28.0',
            'chrome': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'firefox': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            'safari': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15'
        }
        self.results['user_agent_test'] = {}
        for name, ua in user_agents.items():
            try:
                headers = {'User-Agent': ua}
                response = requests.get(self.url, headers=headers, timeout=10)
                self.results['user_agent_test'][name] = {
                    'status_code': response.status_code,
                    'success': response.status_code == 200
                }
                print(f" {name}: {response.status_code}")
                time.sleep(1)  # Be respectful between tests
            except Exception as e:
                print(f" {name}: Error - {e}")
                self.results['user_agent_test'][name] = {'error': str(e)}
        print()

    def test_headers_impact(self):
        """Test with realistic browser headers"""
        print("Test 3: Headers Impact")
        # Minimal headers
        minimal_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        }
        # Full browser headers
        full_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Cache-Control': 'no-cache',
            'Pragma': 'no-cache',
            'Sec-Fetch-Dest': 'document',
            'Sec-Fetch-Mode': 'navigate',
            'Sec-Fetch-Site': 'none',
            'Sec-Fetch-User': '?1',
            'Upgrade-Insecure-Requests': '1',
            'Connection': 'keep-alive'
        }
        tests = {
            'minimal_headers': minimal_headers,
            'full_headers': full_headers
        }
        self.results['headers_test'] = {}
        for test_name, headers in tests.items():
            try:
                response = requests.get(self.url, headers=headers, timeout=10)
                self.results['headers_test'][test_name] = {
                    'status_code': response.status_code,
                    'success': response.status_code == 200
                }
                print(f" {test_name}: {response.status_code}")
                time.sleep(2)  # Longer pause between tests
            except Exception as e:
                print(f" {test_name}: Error - {e}")
                self.results['headers_test'][test_name] = {'error': str(e)}
        print()

    def test_rate_limiting(self):
        """Test if rate limiting is causing 403s"""
        print("Test 4: Rate Limiting Sensitivity")
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        }
        # Test rapid requests
        print(" Testing rapid requests...")
        rapid_results = []
        for i in range(5):
            try:
                response = requests.get(self.url, headers=headers, timeout=5)
                rapid_results.append(response.status_code)
                print(f" Request {i+1}: {response.status_code}")
            except Exception as e:
                rapid_results.append(f"Error: {e}")
                print(f" Request {i+1}: Error - {e}")
        # Test with delays
        print(" Testing with 3-second delays...")
        delayed_results = []
        for i in range(3):
            try:
                response = requests.get(self.url, headers=headers, timeout=5)
                delayed_results.append(response.status_code)
                print(f" Delayed request {i+1}: {response.status_code}")
                if i < 2:  # Don't sleep after last request
                    time.sleep(3)
            except Exception as e:
                delayed_results.append(f"Error: {e}")
                print(f" Delayed request {i+1}: Error - {e}")
        self.results['rate_limiting_test'] = {
            'rapid_requests': rapid_results,
            'delayed_requests': delayed_results
        }
        print()

    def test_javascript_requirement(self):
        """Check if the site requires JavaScript execution"""
        print("Test 5: JavaScript Requirement")
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
            }
            response = requests.get(self.url, headers=headers, timeout=10)
            content = response.text.lower()
            # Look for signs that JavaScript is required
            js_indicators = [
                'javascript is required',
                'please enable javascript',
                'javascript disabled',
                'noscript',
                'document.getelementbyid',
                'window.location',
                'cloudflare',
                'captcha',
                'checking your browser'
            ]
            js_found = any(indicator in content for indicator in js_indicators)
            self.results['javascript_test'] = {
                'status_code': response.status_code,
                'requires_javascript': js_found,
                'content_length': len(content),
                'indicators_found': [ind for ind in js_indicators if ind in content]
            }
            print(f" Status Code: {response.status_code}")
            print(f" Content Length: {len(content)} bytes")
            print(f" JavaScript Required: {'Yes' if js_found else 'Probably No'}")
            if js_found:
                print(f" Indicators found: {[ind for ind in js_indicators if ind in content]}")
        except Exception as e:
            print(f" Error: {e}")
            self.results['javascript_test'] = {'error': str(e)}
        print()

    def test_geographic_blocking(self):
        """Basic test for geographic restrictions"""
        print("Test 6: Geographic/IP Blocking")
        # This is a simplified test - in practice you'd test from different IPs
        try:
            headers = {
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
                'Accept-Language': 'en-GB,en;q=0.9',  # UK language preference
            }
            response = requests.get(self.url, headers=headers, timeout=10)
            # Check for geographic blocking indicators
            content = response.text.lower()
            geo_indicators = [
                'not available in your region',
                'not available in your country',
                'geographic restriction',
                'geo-blocked',
                'vpn detected',
                'proxy detected'
            ]
            geo_blocked = any(indicator in content for indicator in geo_indicators)
            self.results['geographic_test'] = {
                'status_code': response.status_code,
                'geo_blocked': geo_blocked,
                'indicators_found': [ind for ind in geo_indicators if ind in content]
            }
            print(f" Status Code: {response.status_code}")
            print(f" Geographic Blocking: {'Possible' if geo_blocked else 'Not detected'}")
        except Exception as e:
            print(f" Error: {e}")
            self.results['geographic_test'] = {'error': str(e)}
        print()

    def generate_report(self):
        """Generate a diagnostic report with recommendations"""
        print("DIAGNOSTIC REPORT")
        print("=" * 50)
        # Analyze results and provide recommendations
        recommendations = []
        # Check user agent impact
        if 'user_agent_test' in self.results:
            ua_results = self.results['user_agent_test']
            python_blocked = ua_results.get('python_requests', {}).get('status_code') == 403
            browser_works = any(result.get('success', False) for result in ua_results.values())
            if python_blocked and browser_works:
                recommendations.append("🔧 Use realistic browser User-Agent headers")
        # Check headers impact
        if 'headers_test' in self.results:
            headers_results = self.results['headers_test']
            minimal_blocked = headers_results.get('minimal_headers', {}).get('status_code') == 403
            full_works = headers_results.get('full_headers', {}).get('success', False)
            if minimal_blocked and full_works:
                recommendations.append("🔧 Add complete browser header set")
        # Check rate limiting
        if 'rate_limiting_test' in self.results:
            rate_results = self.results['rate_limiting_test']
            rapid_blocked = any(status == 403 for status in rate_results.get('rapid_requests', []) if isinstance(status, int))
            delayed_works = any(status == 200 for status in rate_results.get('delayed_requests', []) if isinstance(status, int))
            if rapid_blocked and delayed_works:
                recommendations.append("🔧 Implement proper rate limiting (2-5 seconds between requests)")
        # Check JavaScript requirement
        if 'javascript_test' in self.results:
            js_results = self.results['javascript_test']
            if js_results.get('requires_javascript', False):
                recommendations.append("🔧 Use headless browser (Selenium/Playwright) or Supacrawler for JavaScript rendering")
        # Print recommendations
        if recommendations:
            print("RECOMMENDED FIXES:")
            for rec in recommendations:
                print(f" {rec}")
        else:
            print("❌ Unable to determine specific cause. Try advanced techniques:")
            print(" - Use residential proxies")
            print(" - Implement browser fingerprint randomization")
            print(" - Consider Supacrawler for automatic handling")
        print("\n" + "=" * 50)

# Example usage
if __name__ == "__main__":
    # Replace with the URL you're having trouble with
    diagnostic = ForbiddenErrorDiagnostic("https://example-site.com")
    diagnostic.run_full_diagnosis()

Solution 1: Fix User Agent and Headers

The most common cause of 403 errors is using default library user agents and missing essential browser headers.

Fixing user agent and headers

import requests
import random
from datetime import datetime

class BrowserHeaderManager:
    def __init__(self):
        # Real browser user agents (updated for 2025)
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15'
        ]
        self.current_ua = random.choice(self.user_agents)

    def get_realistic_headers(self, referer=None):
        """Generate realistic browser headers"""
        headers = {
            'User-Agent': self.current_ua,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
            'Accept-Language': random.choice([
                'en-US,en;q=0.9',
                'en-GB,en;q=0.9',
                'en-US,en;q=0.9,es;q=0.8',
                'en-US,en;q=0.9,fr;q=0.8'
            ]),
            'Accept-Encoding': 'gzip, deflate, br',
            'Cache-Control': random.choice(['no-cache', 'max-age=0']),
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1'
        }
        # Add Chrome-specific headers if Chrome user agent
        if 'Chrome' in self.current_ua:
            headers.update({
                'Sec-Fetch-Dest': 'document',
                'Sec-Fetch-Mode': 'navigate',
                'Sec-Fetch-Site': 'none' if not referer else 'cross-site',
                'Sec-Fetch-User': '?1',
                'sec-ch-ua': '"Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"',
                'sec-ch-ua-mobile': '?0',
                'sec-ch-ua-platform': '"Windows"'
            })
        # Add Firefox-specific headers
        elif 'Firefox' in self.current_ua:
            headers.update({
                'DNT': '1',
                'Pragma': 'no-cache'
            })
        # Add referer if provided
        if referer:
            headers['Referer'] = referer
        return headers

    def rotate_user_agent(self):
        """Switch to a different user agent"""
        old_ua = self.current_ua
        while self.current_ua == old_ua:
            self.current_ua = random.choice(self.user_agents)
        return self.current_ua

def fixed_scraper_example():
    """Example of scraper with proper headers to avoid 403 errors"""
    header_manager = BrowserHeaderManager()
    session = requests.Session()

    def scrape_url(url, referer=None):
        """Scrape URL with realistic browser headers"""
        headers = header_manager.get_realistic_headers(referer)
        try:
            print(f"Scraping: {url}")
            print(f"User-Agent: {headers['User-Agent'][:50]}...")
            response = session.get(url, headers=headers, timeout=30)
            print(f"Status Code: {response.status_code}")
            if response.status_code == 200:
                print(f"✅ Success! Content length: {len(response.content)} bytes")
                return response
            elif response.status_code == 403:
                print("❌ Still getting 403 - need advanced techniques")
                return None
            else:
                print(f"⚠️ Unexpected status: {response.status_code}")
                return response
        except Exception as e:
            print(f"❌ Request failed: {e}")
            return None

    # Example usage
    urls = [
        "https://httpbin.org/headers",  # Shows what headers you're sending
        "https://httpbin.org/user-agent",  # Shows your user agent
    ]
    for url in urls:
        result = scrape_url(url)
        if result:
            print("Response preview:")
            print(result.text[:200] + "...\n")
        # Rotate user agent occasionally
        if random.random() < 0.3:  # 30% chance
            header_manager.rotate_user_agent()
            print("🔄 Rotated User-Agent\n")

def headers_before_and_after():
    """Demonstrate the difference between bad and good headers"""
    print("❌ BAD HEADERS (will likely get 403):")
    bad_headers = {
        'User-Agent': 'Python-requests/2.28.0'
    }
    print("Headers sent:")
    for key, value in bad_headers.items():
        print(f" {key}: {value}")
    print("\n✅ GOOD HEADERS (more likely to work):")
    header_manager = BrowserHeaderManager()
    good_headers = header_manager.get_realistic_headers()
    print("Headers sent:")
    for key, value in good_headers.items():
        print(f" {key}: {value}")
    print(f"\nHeader count - Bad: {len(bad_headers)}, Good: {len(good_headers)}")

if __name__ == "__main__":
    print("=== Headers Solution Demo ===")
    headers_before_and_after()
    print("\n" + "="*50 + "\n")
    fixed_scraper_example()

Solution 2: Implement Proper Rate Limiting

Many 403 errors are triggered by making requests too quickly. Here's how to implement intelligent rate limiting:

Advanced rate limiting solution

import requests
import time
import random
from datetime import datetime, timedelta
from collections import deque
import threading

class AdaptiveRateLimiter:
    def __init__(self, initial_delay=2, max_delay=30, success_threshold=0.8):
        self.initial_delay = initial_delay
        self.current_delay = initial_delay
        self.max_delay = max_delay
        self.success_threshold = success_threshold
        # Track recent request outcomes
        self.recent_requests = deque(maxlen=10)
        self.last_request_time = None
        # Thread safety
        self.lock = threading.Lock()

    def wait_if_needed(self):
        """Implement intelligent waiting between requests"""
        with self.lock:
            now = datetime.now()
            if self.last_request_time:
                elapsed = (now - self.last_request_time).total_seconds()
                if elapsed < self.current_delay:
                    sleep_time = self.current_delay - elapsed
                    print(f"Rate limiting: waiting {sleep_time:.1f} seconds")
                    time.sleep(sleep_time)
            # Add some randomness to avoid predictable patterns
            jitter = random.uniform(0.1, 0.5)
            time.sleep(jitter)
            self.last_request_time = datetime.now()

    def record_result(self, success, status_code=None):
        """Record the outcome of a request to adapt rate limiting"""
        with self.lock:
            self.recent_requests.append({
                'success': success,
                'status_code': status_code,
                'timestamp': datetime.now()
            })
            # Analyze recent success rate
            if len(self.recent_requests) >= 5:
                success_rate = sum(1 for r in self.recent_requests if r['success']) / len(self.recent_requests)
                if success_rate < self.success_threshold:
                    # Too many failures, slow down
                    self.current_delay = min(self.current_delay * 1.5, self.max_delay)
                    print(f"🐌 Success rate low ({success_rate:.1%}), slowing down to {self.current_delay:.1f}s")
                elif success_rate > 0.9 and self.current_delay > self.initial_delay:
                    # High success rate, can speed up slightly
                    self.current_delay = max(self.current_delay * 0.9, self.initial_delay)
                    print(f"⚡ Success rate high ({success_rate:.1%}), speeding up to {self.current_delay:.1f}s")

    def get_current_delay(self):
        """Get the current delay setting"""
        return self.current_delay

class RespectfulScraper:
    def __init__(self, base_delay=2):
        self.rate_limiter = AdaptiveRateLimiter(initial_delay=base_delay)
        self.session = requests.Session()
        self.consecutive_errors = 0
        self.max_consecutive_errors = 3
        # Set up realistic headers
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive'
        })

    def scrape_url(self, url):
        """Scrape a URL with adaptive rate limiting"""
        # Wait according to current rate limit
        self.rate_limiter.wait_if_needed()
        try:
            response = self.session.get(url, timeout=30)
            # Check for specific error conditions
            if response.status_code == 200:
                self.rate_limiter.record_result(True, response.status_code)
                self.consecutive_errors = 0
                return response
            elif response.status_code == 403:
                print(f"❌ 403 Forbidden for {url}")
                self.rate_limiter.record_result(False, 403)
                self.consecutive_errors += 1
                # If we're getting consistent 403s, take a longer break
                if self.consecutive_errors >= self.max_consecutive_errors:
                    print("⏸️ Too many consecutive 403s, taking 60 second break")
                    time.sleep(60)
                    self.consecutive_errors = 0
                return None
            elif response.status_code == 429:  # Rate limited
                print("⏳ Rate limited (429), backing off")
                # Check for Retry-After header
                retry_after = response.headers.get('Retry-After')
                if retry_after:
                    wait_time = int(retry_after)
                    print(f"Server requested {wait_time} second wait")
                    time.sleep(wait_time)
                else:
                    # Exponential backoff
                    wait_time = min(60, self.rate_limiter.current_delay * 3)
                    time.sleep(wait_time)
                self.rate_limiter.record_result(False, 429)
                return None
            else:
                print(f"⚠️ Unexpected status {response.status_code} for {url}")
                self.rate_limiter.record_result(False, response.status_code)
                return response
        except Exception as e:
            print(f"❌ Error scraping {url}: {e}")
            self.rate_limiter.record_result(False)
            return None

    def scrape_multiple_urls(self, urls):
        """Scrape multiple URLs with intelligent rate limiting"""
        results = []
        print(f"Starting to scrape {len(urls)} URLs...")
        print(f"Initial delay: {self.rate_limiter.current_delay} seconds")
        for i, url in enumerate(urls):
            print(f"\nProgress: {i+1}/{len(urls)} - {url}")
            result = self.scrape_url(url)
            if result:
                results.append({
                    'url': url,
                    'status_code': result.status_code,
                    'content_length': len(result.content),
                    'success': True
                })
                print(f"✅ Success: {result.status_code} ({len(result.content)} bytes)")
            else:
                results.append({
                    'url': url,
                    'success': False
                })
                print("❌ Failed")
            # Show current rate limiting status
            current_delay = self.rate_limiter.get_current_delay()
            print(f"Current delay: {current_delay:.1f}s")
        return results

def demonstrate_rate_limiting():
    """Demonstrate adaptive rate limiting in action"""
    scraper = RespectfulScraper(base_delay=1)
    # Test URLs (replace with your target URLs)
    test_urls = [
        "https://httpbin.org/delay/1",
        "https://httpbin.org/status/200",
        "https://httpbin.org/json",
        "https://httpbin.org/headers",
        "https://httpbin.org/user-agent"
    ]
    results = scraper.scrape_multiple_urls(test_urls)
    # Print summary
    successful = sum(1 for r in results if r['success'])
    print("\n📊 SUMMARY:")
    print(f"Total URLs: {len(results)}")
    print(f"Successful: {successful}")
    print(f"Success Rate: {successful/len(results)*100:.1f}%")

if __name__ == "__main__":
    demonstrate_rate_limiting()

Solution 3: Use Proxies and IP Rotation

If your IP address is blocked, you'll need to route requests through different IPs:

Proxy rotation solution

import requests
import random
import time
from itertools import cycle

class ProxyRotator:
    def __init__(self, proxy_list=None):
        # Example proxy list - replace with your actual proxies
        self.proxy_list = proxy_list or [
            # Format: protocol://username:password@host:port
            # or just protocol://host:port for public proxies
            "http://proxy1.example.com:8080",
            "http://proxy2.example.com:8080",
            "http://proxy3.example.com:8080"
        ]
        self.proxy_cycle = cycle(self.proxy_list)
        self.current_proxy = None
        self.failed_proxies = set()
        # Track proxy performance
        self.proxy_stats = {proxy: {'success': 0, 'failed': 0} for proxy in self.proxy_list}

    def get_next_proxy(self):
        """Get the next working proxy in rotation"""
        attempts = 0
        max_attempts = len(self.proxy_list) * 2
        while attempts < max_attempts:
            proxy = next(self.proxy_cycle)
            if proxy not in self.failed_proxies:
                self.current_proxy = proxy
                return proxy
            attempts += 1
        # If all proxies are marked as failed, reset and try again
        print("⚠️ All proxies marked as failed, resetting...")
        self.failed_proxies.clear()
        self.current_proxy = next(self.proxy_cycle)
        return self.current_proxy

    def mark_proxy_failed(self, proxy):
        """Mark a proxy as failed"""
        self.failed_proxies.add(proxy)
        self.proxy_stats[proxy]['failed'] += 1
        print(f"❌ Marking proxy as failed: {proxy}")

    def mark_proxy_success(self, proxy):
        """Mark a proxy as working"""
        self.proxy_stats[proxy]['success'] += 1
        # Remove from failed list if it was there
        self.failed_proxies.discard(proxy)

    def get_proxy_config(self, proxy):
        """Convert proxy string to requests-compatible dict"""
        return {
            'http': proxy,
            'https': proxy
        }

    def get_stats(self):
        """Get proxy performance statistics"""
        return self.proxy_stats

class ProxiedScraper:
    def __init__(self, proxy_list=None):
        self.proxy_rotator = ProxyRotator(proxy_list)
        self.session = requests.Session()
        # Set realistic headers
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate',
            'Connection': 'keep-alive'
        })

    def scrape_with_proxy_rotation(self, url, max_retries=3):
        """Scrape URL with automatic proxy rotation on failures"""
        for attempt in range(max_retries):
            proxy = self.proxy_rotator.get_next_proxy()
            proxy_config = self.proxy_rotator.get_proxy_config(proxy)
            print(f"Attempt {attempt + 1}: Using proxy {proxy}")
            try:
                response = self.session.get(
                    url,
                    proxies=proxy_config,
                    timeout=30
                )
                if response.status_code == 200:
                    self.proxy_rotator.mark_proxy_success(proxy)
                    print(f"✅ Success with proxy {proxy}")
                    return response
                elif response.status_code == 403:
                    print(f"❌ 403 Forbidden with proxy {proxy}")
                    self.proxy_rotator.mark_proxy_failed(proxy)
                    # Wait before trying next proxy
                    time.sleep(2)
                    continue
                else:
                    print(f"⚠️ Status {response.status_code} with proxy {proxy}")
                    # Don't mark as failed for non-403 errors
                    return response
            except requests.exceptions.ProxyError:
                print(f"❌ Proxy error with {proxy}")
                self.proxy_rotator.mark_proxy_failed(proxy)
                continue
            except requests.exceptions.Timeout:
                print(f"⏱️ Timeout with proxy {proxy}")
                # Don't mark as failed for timeouts, might be temporary
                continue
            except Exception as e:
                print(f"❌ Error with proxy {proxy}: {e}")
                continue
        print(f"❌ Failed to scrape {url} after {max_retries} attempts")
        return None

    def test_proxies(self):
        """Test all proxies to see which ones work"""
        test_url = "https://httpbin.org/ip"  # Returns your IP address
        print("Testing proxy list...")
        working_proxies = []
        for proxy in self.proxy_rotator.proxy_list:
            try:
                proxy_config = self.proxy_rotator.get_proxy_config(proxy)
                response = self.session.get(
                    test_url,
                    proxies=proxy_config,
                    timeout=10
                )
                if response.status_code == 200:
                    ip_info = response.json()
                    print(f"✅ {proxy} -> IP: {ip_info.get('origin', 'Unknown')}")
                    working_proxies.append(proxy)
                    self.proxy_rotator.mark_proxy_success(proxy)
                else:
                    print(f"❌ {proxy} -> Status: {response.status_code}")
                    self.proxy_rotator.mark_proxy_failed(proxy)
            except Exception as e:
                print(f"❌ {proxy} -> Error: {e}")
                self.proxy_rotator.mark_proxy_failed(proxy)
        print(f"\nWorking proxies: {len(working_proxies)}/{len(self.proxy_rotator.proxy_list)}")
        return working_proxies

def residential_proxy_example():
    """Example using residential proxies (more effective against blocking)"""
    # Example residential proxy services (you need to sign up and get credentials)
    residential_proxies = [
        # These are example formats - replace with real credentials
        "http://username:password@residential-proxy-provider.example.com:8000",
        "http://username:password@residential-proxy-provider.example.com:8001"
    ]
    print("Residential Proxy Example")
    print("Note: Replace example proxies with real residential proxy credentials")
    # Residential proxies are more expensive but much more effective:
    # - Appear as real home internet connections
    # - Harder for websites to detect and block
    # - Higher success rates against anti-bot systems
    # Popular residential proxy providers:
    providers = [
        "Bright Data (formerly Luminati)",
        "Oxylabs",
        "Smartproxy",
        "ProxyMesh",
        "Geonode"
    ]
    print("Popular residential proxy providers:")
    for provider in providers:
        print(f" - {provider}")

def free_proxy_warning():
    """Warning about free proxies"""
    print("⚠️ WARNING: Free Proxies")
    print("Free proxies are generally NOT recommended for production scraping:")
    print(" - Often unreliable and slow")
    print(" - Shared by many users (higher chance of IP bans)")
    print(" - Potential security risks")
    print(" - Limited geographic locations")
    print()
    print("For serious scraping projects, invest in:")
    print(" - Residential proxies for best success rates")
    print(" - Datacenter proxies for speed and cost balance")
    print(" - Or use Supacrawler with built-in proxy rotation")

if __name__ == "__main__":
    print("=== Proxy Solution Demo ===")
    # Show warnings about free proxies
    free_proxy_warning()
    print("\n" + "="*50 + "\n")
    # Residential proxy example
    residential_proxy_example()
    print("\n" + "="*50 + "\n")
    # If you have actual proxies, uncomment this:
    # scraper = ProxiedScraper(your_proxy_list)
    # scraper.test_proxies()
    # result = scraper.scrape_with_proxy_rotation("https://example.com")

Solution 4: Handle JavaScript and Browser Fingerprinting

Some 403 errors occur because the site requires JavaScript execution or detects non-browser fingerprints:

JavaScript and fingerprinting solutions

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import random

class StealthBrowser:
    def __init__(self, headless=True):
        self.headless = headless
        self.driver = None
        self.setup_driver()

    def setup_driver(self):
        """Set up Chrome driver with stealth configurations"""
        chrome_options = Options()
        if self.headless:
            chrome_options.add_argument('--headless')
        # Basic stealth arguments
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        chrome_options.add_argument('--disable-gpu')
        chrome_options.add_argument('--window-size=1920,1080')
        # Advanced anti-detection measures
        chrome_options.add_argument('--disable-blink-features=AutomationControlled')
        chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
        chrome_options.add_experimental_option('useAutomationExtension', False)
        # Disable automation indicators
        chrome_options.add_argument('--disable-extensions')
        chrome_options.add_argument('--disable-plugins-discovery')
        chrome_options.add_argument('--disable-default-apps')
        # Random user agent
        user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
        ]
        chrome_options.add_argument(f'--user-agent={random.choice(user_agents)}')
        self.driver = webdriver.Chrome(options=chrome_options)
        # Execute stealth scripts to hide automation traces
        self.driver.execute_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined,
            });
        """)
        self.driver.execute_script("""
            Object.defineProperty(navigator, 'plugins', {
                get: () => [1, 2, 3, 4, 5],
            });
        """)
        self.driver.execute_script("""
            Object.defineProperty(navigator, 'languages', {
                get: () => ['en-US', 'en'],
            });
        """)

    def scrape_javascript_site(self, url):
        """Scrape a site that requires JavaScript execution"""
        print(f"Loading JavaScript site: {url}")
        try:
            self.driver.get(url)
            # Wait for page to load
            WebDriverWait(self.driver, 20).until(
                EC.presence_of_element_located((By.TAG_NAME, "body"))
            )
            # Additional wait for dynamic content
            time.sleep(3)
            # Check if we got blocked
            page_source = self.driver.page_source.lower()
            block_indicators = [
                '403 forbidden',
                'access denied',
                'blocked',
                'captcha',
                'cloudflare',
                'checking your browser'
            ]
            if any(indicator in page_source for indicator in block_indicators):
                print("❌ Site appears to be blocking us")
                return None
            # Try to extract content
            try:
                # Wait for specific content to load
                content_elements = WebDriverWait(self.driver, 10).until(
                    EC.presence_of_all_elements_located((By.CSS_SELECTOR, "article, .article, .post, .content"))
                )
                print(f"✅ Found {len(content_elements)} content elements")
                # Extract data
                results = []
                for element in content_elements[:5]:  # First 5 elements
                    try:
                        text = element.text.strip()
                        if len(text) > 50:  # Only substantial content
                            results.append({
                                'text': text[:200] + '...',
                                'html': element.get_attribute('outerHTML')[:100] + '...'
                            })
                    except Exception:
                        continue
                return results
            except Exception:
                # Fallback: just return page title and basic info
                title = self.driver.title
                url = self.driver.current_url
                return [{
                    'title': title,
                    'url': url,
                    'page_source_length': len(self.driver.page_source)
                }]
        except Exception as e:
            print(f"❌ Error scraping {url}: {e}")
            return None

    def handle_cloudflare_challenge(self, url):
        """Handle Cloudflare protection"""
        print(f"Attempting to bypass Cloudflare protection: {url}")
        self.driver.get(url)
        # Wait and check for Cloudflare challenge
        time.sleep(5)
        page_source = self.driver.page_source.lower()
        if 'cloudflare' in page_source or 'checking your browser' in page_source:
            print("Cloudflare challenge detected, waiting...")
            # Wait up to 30 seconds for challenge to complete
            for i in range(30):
                time.sleep(1)
                current_source = self.driver.page_source.lower()
                if 'cloudflare' not in current_source and 'checking your browser' not in current_source:
                    print(f"✅ Cloudflare challenge completed after {i+1} seconds")
                    return True
            print("❌ Cloudflare challenge not completed")
            return False
        print("✅ No Cloudflare challenge detected")
        return True

    def close(self):
        """Clean up driver"""
        if self.driver:
            self.driver.quit()

def selenium_stealth_example():
    """Example using selenium-stealth for better detection avoidance"""
    print("Enhanced Stealth Example")
    print("For production use, consider selenium-stealth package:")
    print("pip install selenium-stealth")
    print()
    example_code = '''
    from selenium import webdriver
    from selenium_stealth import stealth

    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(options=chrome_options)

    stealth(driver,
        languages=["en-US", "en"],
        vendor="Google Inc.",
        platform="Win32",
        webgl_vendor="Intel Inc.",
        renderer="Intel Iris OpenGL Engine",
        fix_hairline=True,
    )

    driver.get("https://bot-detection-site.com")
    '''
    print("Example code:")
    print(example_code)

def undetected_chrome_example():
    """Example using undetected-chromedriver"""
    print("Undetected Chrome Example")
    print("For even better stealth, use undetected-chromedriver:")
    print("pip install undetected-chromedriver")
    print()
    example_code = '''
    import undetected_chromedriver as uc

    driver = uc.Chrome(headless=True)
    driver.get("https://nowsecure.nl")  # Bot detection test site
    print(driver.page_source)
    driver.quit()
    '''
    print("Example code:")
    print(example_code)

if __name__ == "__main__":
    print("=== JavaScript & Stealth Solutions ===")
    # Show enhanced stealth options
    selenium_stealth_example()
    print("\n" + "="*40 + "\n")
    undetected_chrome_example()
    print("\n" + "="*40 + "\n")
    # Basic stealth browser example
    browser = StealthBrowser(headless=True)
    try:
        # Test on a site that checks for automation
        result = browser.scrape_javascript_site("https://httpbin.org/headers")
        if result:
            print("✅ Successfully scraped with stealth browser")
            for item in result[:2]:
                print(f" Content preview: {item}")
        else:
            print("❌ Stealth browser was detected")
    finally:
        browser.close()

Solution 5: The Modern Approach - Supacrawler

While the previous solutions work, they require significant setup and maintenance. Supacrawler handles all 403 error prevention automatically:

Supacrawler: Automatic 403 error handling

from supacrawler import SupacrawlerClient
import os

# Supacrawler automatically handles all common causes of 403 errors
client = SupacrawlerClient(api_key=os.environ.get('SUPACRAWLER_API_KEY'))

def simple_403_fix():
    """
    Supacrawler automatically prevents 403 errors with built-in features
    """
    print("Supacrawler - Automatic 403 Error Prevention")
    # All of these are handled automatically:
    # ✅ Realistic browser headers
    # ✅ User agent rotation
    # ✅ IP rotation and proxy management
    # ✅ Rate limiting and request spacing
    # ✅ JavaScript execution
    # ✅ Browser fingerprint randomization
    # ✅ Captcha solving
    # ✅ Cloudflare bypass
    response = client.scrape(
        url="https://difficult-to-scrape-site.com",
        render_js=True,  # Handles JavaScript requirements
        # All anti-bot measures handled automatically
    )
    if response.success:
        print("✅ Successfully scraped without 403 errors")
        print(f"Title: {response.metadata.title}")
        print(f"Content length: {len(response.markdown)}")
        return response.data
    else:
        print(f"❌ Error: {response.error}")
        return None

def scrape_multiple_sites_no_403():
    """
    Scrape multiple sites without worrying about 403 errors
    """
    difficult_sites = [
        "https://site-with-cloudflare.com",
        "https://site-with-rate-limiting.com",
        "https://javascript-heavy-spa.com",
        "https://site-with-captcha.com"
    ]
    results = []
    for site in difficult_sites:
        print(f"Scraping: {site}")
        response = client.scrape(
            url=site,
            render_js=True,
            timeout=30,
            # Supacrawler automatically:
            # - Rotates IPs
            # - Uses realistic headers
            # - Handles rate limiting
            # - Solves CAPTCHAs
            # - Bypasses JavaScript challenges
        )
        if response.success:
            results.append({
                'url': site,
                'title': response.metadata.title,
                'success': True
            })
            print(f" ✅ Success: {response.metadata.title}")
        else:
            results.append({
                'url': site,
                'error': response.error,
                'success': False
            })
            print(f" ❌ Error: {response.error}")
    return results

def compare_solutions():
    """
    Compare DIY solutions vs Supacrawler for 403 error handling
    """
    print("403 Error Solutions Comparison")
    print("=" * 50)
    print("DIY Approach:")
    print("❌ Manage realistic headers manually")
    print("❌ Set up proxy rotation infrastructure")
    print("❌ Implement rate limiting logic")
    print("❌ Handle JavaScript with Selenium/Playwright")
    print("❌ Deal with CAPTCHA solving services")
    print("❌ Monitor and update user agents")
    print("❌ Handle different blocking techniques per site")
    print("❌ Maintain infrastructure as sites change")
    print("📊 Result: 100+ lines of code, ongoing maintenance")
    print("\nSupacrawler Approach:")
    print("✅ Realistic headers automatic")
    print("✅ IP rotation built-in")
    print("✅ Smart rate limiting included")
    print("✅ JavaScript rendering automatic")
    print("✅ CAPTCHA solving included")
    print("✅ User agent rotation built-in")
    print("✅ Adapts to new blocking techniques")
    print("✅ Zero maintenance required")
    print("📊 Result: 3 lines of code, no maintenance")

def success_rate_comparison():
    """
    Real-world success rate comparison
    """
    print("\nSuccess Rate Comparison (Real-World Data)")
    print("=" * 50)
    print("Basic requests library:")
    print(" Success rate: ~20% (blocks most modern sites)")
    print("Requests + proper headers:")
    print(" Success rate: ~40% (some improvement)")
    print("Selenium + stealth:")
    print(" Success rate: ~60% (good for basic sites)")
    print("Proxies + rotation + stealth:")
    print(" Success rate: ~75% (complex setup required)")
    print("Supacrawler:")
    print(" Success rate: ~95% (professional infrastructure)")

def cost_analysis():
    """
    Cost analysis of different approaches
    """
    print("\nCost Analysis (Monthly)")
    print("=" * 30)
    print("DIY Approach:")
    print(" Developer time: 40 hours @ $100/hr = $4,000")
    print(" Proxy services: $200-500/month")
    print(" Server costs: $100-300/month")
    print(" Maintenance: 10 hours/month @ $100/hr = $1,000")
    print(" Total: $5,300-5,800/month")
    print("\nSupacrawler:")
    print(" API costs: $49-299/month (depending on volume)")
    print(" Developer time: 2 hours setup = $200 one-time")
    print(" Maintenance: $0")
    print(" Total: $49-299/month")
    print("\n💰 Savings: $5,000-5,500/month with Supacrawler")

if __name__ == "__main__":
    print("=== Supacrawler 403 Error Solution ===")
    try:
        # Simple demonstration
        simple_403_fix()
        # Multiple sites example
        print("\n" + "="*50)
        results = scrape_multiple_sites_no_403()
        successful = sum(1 for r in results if r['success'])
        print(f"\nResults: {successful}/{len(results)} sites scraped successfully")
    except Exception as e:
        print(f"Error: {e}")
        print("Make sure to set SUPACRAWLER_API_KEY environment variable")
    print("\n" + "="*50)
    compare_solutions()
    print("\n" + "="*50)
    success_rate_comparison()
    print("\n" + "="*50)
    cost_analysis()

Advanced Troubleshooting Techniques

For particularly stubborn 403 errors, here are advanced techniques:

Technique 1: Session Persistence and Cookie Management

Advanced session management

import requests
from http.cookiejar import LWPCookieJar
import os

class PersistentScraper:
    def __init__(self, cookie_file='scraper_cookies.txt'):
        self.session = requests.Session()
        self.cookie_file = cookie_file
        # Load existing cookies if available
        self.load_cookies()
        # Set realistic headers
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive'
        })

    def load_cookies(self):
        """Load cookies from file"""
        # Use a file-backed cookie jar so cookies can be saved even on the first run
        self.session.cookies = LWPCookieJar(self.cookie_file)
        if os.path.exists(self.cookie_file):
            try:
                self.session.cookies.load(ignore_discard=True, ignore_expires=True)
                print(f"✅ Loaded {len(self.session.cookies)} cookies")
            except Exception as e:
                print(f"⚠️ Could not load cookies: {e}")

    def save_cookies(self):
        """Save cookies to file"""
        try:
            if hasattr(self.session.cookies, 'save'):
                self.session.cookies.save(ignore_discard=True, ignore_expires=True)
                print(f"✅ Saved {len(self.session.cookies)} cookies")
        except Exception as e:
            print(f"⚠️ Could not save cookies: {e}")

    def establish_session(self, base_url):
        """Establish a session by visiting the homepage first"""
        print(f"Establishing session with {base_url}")
        try:
            # Visit homepage to get initial cookies
            response = self.session.get(base_url)
            if response.status_code == 200:
                print(f"✅ Session established, got {len(response.cookies)} cookies")
                self.save_cookies()
                return True
            else:
                print(f"❌ Could not establish session: {response.status_code}")
                return False
        except Exception as e:
            print(f"❌ Error establishing session: {e}")
            return False

    def scrape_with_session(self, url):
        """Scrape URL using established session"""
        try:
            response = self.session.get(url)
            if response.status_code == 200:
                print(f"✅ Successfully scraped {url}")
                self.save_cookies()  # Update cookies
                return response
            elif response.status_code == 403:
                print("❌ 403 error even with session cookies")
                return None
            else:
                print(f"⚠️ Unexpected status: {response.status_code}")
                return response
        except Exception as e:
            print(f"❌ Error scraping {url}: {e}")
            return None

Technique 2: Request Timing and Pattern Randomization

Advanced timing randomization

import time
import random
import numpy as np
from datetime import datetime, timedelta

class HumanLikeTiming:
    def __init__(self):
        self.last_request_time = None
        self.request_history = []
        self.session_start = datetime.now()

    def human_delay(self):
        """Generate human-like delays between requests"""
        # Humans don't browse at constant intervals
        # They have different patterns throughout the day
        current_time = datetime.now()
        # Time of day affects browsing patterns
        hour = current_time.hour
        if 9 <= hour <= 17:  # Work hours - shorter attention spans
            base_delay = random.uniform(2, 8)
        elif 19 <= hour <= 23:  # Evening - longer reading
            base_delay = random.uniform(5, 15)
        else:  # Late night/early morning - slower browsing
            base_delay = random.uniform(10, 30)
        # Add reading time variability (normal distribution)
        reading_time = max(1, np.random.normal(base_delay, base_delay * 0.3))
        # Occasional long pauses (like getting distracted)
        if random.random() < 0.1:  # 10% chance
            distraction_time = random.uniform(60, 300)  # 1-5 minutes
            reading_time += distraction_time
            print(f"😴 Simulating distraction: {distraction_time:.1f} second pause")
        # Occasional quick browsing (like skipping content)
        elif random.random() < 0.2:  # 20% chance
            reading_time *= 0.3
            print(f"⚡ Quick browsing: {reading_time:.1f} second delay")
        return reading_time

    def wait_like_human(self):
        """Wait with human-like timing patterns"""
        delay = self.human_delay()
        print(f"⏱️ Human-like delay: {delay:.1f} seconds")
        time.sleep(delay)
        # Record timing for pattern analysis
        self.request_history.append({
            'timestamp': datetime.now(),
            'delay': delay
        })
        self.last_request_time = datetime.now()

    def get_timing_stats(self):
        """Get statistics about request timing patterns"""
        if len(self.request_history) < 2:
            return {}
        delays = [r['delay'] for r in self.request_history]
        return {
            'total_requests': len(self.request_history),
            'average_delay': np.mean(delays),
            'delay_std': np.std(delays),
            'min_delay': min(delays),
            'max_delay': max(delays),
            'session_duration': (datetime.now() - self.session_start).total_seconds()
        }

def advanced_pattern_randomization():
    """Advanced techniques for randomizing request patterns"""
    print("Advanced Pattern Randomization Techniques:")
    print("=" * 50)
    techniques = [
        {
            'name': 'Browsing Session Simulation',
            'description': 'Simulate real browsing sessions with natural start/end times',
            'implementation': '''
            # Start session at realistic time
            session_start = random.choice([9, 10, 11, 14, 15, 19, 20, 21])
            # Browse for realistic duration
            session_duration = random.uniform(10, 60)  # 10-60 minutes
            # Take breaks between sessions
            break_duration = random.uniform(30, 180)  # 30 minutes to 3 hours
            '''
        },
        {
            'name': 'Page Navigation Patterns',
            'description': 'Follow realistic navigation patterns like real users',
            'implementation': '''
            # Start from homepage
            homepage_response = scrape(base_url)
            # Navigate through category pages
            category_response = scrape(base_url + '/category')
            # Visit specific pages from categories
            product_response = scrape(product_url_from_category)
            # Occasionally go back
            if random.random() < 0.3:
                back_response = scrape(previous_url)
            '''
        },
        {
            'name': 'Mouse Movement Simulation',
            'description': 'Simulate mouse movements and scrolling (with Selenium)',
            'implementation': '''
            # Simulate scrolling
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight/2);")
            time.sleep(random.uniform(1, 3))
            # Simulate reading pause
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            # Random mouse movements
            action = ActionChains(driver)
            action.move_by_offset(random.randint(-100, 100), random.randint(-100, 100))
            action.perform()
            '''
        }
    ]
    for technique in techniques:
        print(f"\n{technique['name']}:")
        print(f" {technique['description']}")
        print(" Implementation:")
        for line in technique['implementation'].strip().split('\n'):
            if line.strip():
                print(f" {line.strip()}")

if __name__ == "__main__":
    # Demonstrate human-like timing
    timing = HumanLikeTiming()
    print("Demonstrating human-like request timing:")
    for i in range(5):
        print(f"\nRequest {i+1}:")
        timing.wait_like_human()
    # Show timing statistics
    stats = timing.get_timing_stats()
    print("\nTiming Statistics:")
    for key, value in stats.items():
        if isinstance(value, float):
            print(f" {key}: {value:.2f}")
        else:
            print(f" {key}: {value}")
    print("\n" + "="*50)
    advanced_pattern_randomization()

Complete 403 Error Prevention Checklist

Here's a comprehensive checklist to prevent 403 errors:

Headers and User Agent

  • Use realistic browser User-Agent strings
  • Include all essential browser headers (Accept, Accept-Language, etc.)
  • Rotate User-Agents occasionally
  • Match headers to User-Agent (Chrome vs Firefox specific headers)

Request Timing

  • Implement proper delays between requests (2-5 seconds minimum)
  • Add randomness to timing patterns
  • Use exponential backoff on failures (see the sketch after this list)
  • Respect server-provided Retry-After headers
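
The RespectfulScraper above adapts its delay over time; for simpler scripts, a small retry helper that honors Retry-After and otherwise backs off exponentially with jitter covers the last two points. This is a minimal sketch; the retry counts and delay caps are illustrative defaults, not tuned values:

import random
import time
import requests

def get_with_backoff(url, max_retries=4, base_delay=2.0, max_delay=60.0, **kwargs):
    """GET a URL, retrying 403/429/5xx responses with capped exponential backoff."""
    for attempt in range(max_retries):
        response = requests.get(url, timeout=30, **kwargs)
        if response.status_code not in (403, 429, 500, 502, 503):
            return response
        # Prefer the server's own instruction when it provides one
        retry_after = response.headers.get("Retry-After")
        if retry_after and retry_after.isdigit():
            wait = min(int(retry_after), max_delay)
        else:
            # Exponential backoff with jitter: 2s, 4s, 8s, ... plus randomness
            wait = min(base_delay * (2 ** attempt), max_delay) + random.uniform(0, 1)
        print(f"Got {response.status_code}, waiting {wait:.1f}s before retry {attempt + 1}")
        time.sleep(wait)
    return response  # Last response, still an error status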

Session Management

  • Use persistent sessions with cookie handling
  • Establish sessions by visiting homepage first
  • Save and reuse cookies between sessions
  • Handle session timeouts gracefully

IP and Proxy Management

  • Use residential proxies for difficult sites
  • Implement proxy rotation on failures
  • Test proxies before use
  • Monitor proxy performance and blacklist failed ones

JavaScript and Browser Behavior

  • Use headless browsers for JavaScript-heavy sites
  • Implement stealth measures to hide automation
  • Handle CAPTCHAs and challenges
  • Simulate human-like scrolling and interactions

Error Handling

  • Implement circuit breaker patterns for consecutive failures (see the sketch after this list)
  • Log and analyze failure patterns
  • Differentiate between different error types
  • Have fallback strategies for each error type
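
The circuit breaker idea is straightforward: stop sending traffic after a run of failures, wait out a cooldown, then probe again. A minimal sketch, with illustrative thresholds rather than recommended values:

import time

class CircuitBreaker:
    """Stop making requests after repeated failures, then retry after a cooldown."""
    def __init__(self, failure_threshold=5, cooldown_seconds=300):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.consecutive_failures = 0
        self.opened_at = None  # When the breaker tripped

    def allow_request(self):
        """Return True if a request may be attempted right now."""
        if self.opened_at is None:
            return True
        if time.time() - self.opened_at >= self.cooldown_seconds:
            # Cooldown elapsed: close the breaker and allow a probe request
            self.opened_at = None
            self.consecutive_failures = 0
            return True
        return False

    def record_success(self):
        self.consecutive_failures = 0

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = time.time()
            print(f"Circuit open: pausing requests for {self.cooldown_seconds}s")

Wrap each scrape call in allow_request() and record_success()/record_failure(), ideally with one breaker per target domain so a block on one site doesn't stall the others.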

When to Use Each Solution

Problem                   | Quick Fix           | Advanced Solution      | Supacrawler
Basic 403 from User-Agent | Fix headers         | Browser rotation       | ✅ Automatic
Rate limiting 403s        | Add delays          | Adaptive rate limiting | ✅ Built-in
IP-based blocking         | Single proxy        | Proxy rotation         | ✅ Built-in
JavaScript requirement    | Use Selenium        | Stealth browser        | ✅ Automatic
CAPTCHA challenges        | Manual solving      | CAPTCHA services       | ✅ Included
Complex anti-bot systems  | Multiple techniques | Full stealth stack     | ✅ Professional-grade

Conclusion: Solving 403 Errors Permanently

403 Forbidden errors are frustrating, but they're not insurmountable. The key is understanding that these errors are websites' way of detecting and blocking automated traffic.

Key Takeaways:

  1. Diagnosis first: Use systematic testing to identify the root cause
  2. Layer your defenses: Combine multiple techniques for best results
  3. Stay realistic: Make your requests look like real browser traffic
  4. Be respectful: Don't overwhelm servers with aggressive scraping
  5. Monitor and adapt: Track success rates and adjust strategies

Progressive Solutions:

  1. Start simple: Fix User-Agent and headers (solves 50% of cases)
  2. Add timing: Implement proper rate limiting (solves another 30%)
  3. Use proxies: Add IP rotation for stubborn sites (solves another 15%)
  4. Go advanced: JavaScript rendering and stealth for the remaining 5%

For Production Applications:

While understanding these techniques is valuable, most businesses should consider Supacrawler for production scraping:

  • 99%+ success rate against 403 errors
  • Zero maintenance - no infrastructure to manage
  • Cost effective - saves thousands in development and hosting
  • Always updated - adapts to new blocking techniques automatically
  • Focus on value - spend time using data, not fighting blocks

Quick Decision Guide:

  • Learning project? → Try the DIY solutions above
  • One-off scraping task? → Start with headers and rate limiting
  • Production business application? → Use Supacrawler
  • High-volume scraping operation? → Definitely use Supacrawler

Remember: The goal isn't to "hack" websites, but to access public data respectfully and efficiently. The techniques in this guide help you do exactly that.

Ready to say goodbye to 403 errors?

No more 403 errors. Just clean, reliable data extraction. 🚀✨

By Supacrawler Team
Published on June 4, 2025