Web Scraping With Python: The Stack We Use When Plain Requests Stop Working
Web scraping with Python starts with httpx, BeautifulSoup, retries, and measured status codes. It becomes real work once the target returns 403s, consent walls, empty HTML, or cursor tokens. Our answer is simple: begin with raw HTTP, record success rate, then add a residential proxy and a credible browser only after the data proves the need. We earn commissions if you buy through provider links in this article, but our Decodo notes come from first-hand production use, not a pricing roundup.
1. Start With Plain HTTP And A Status Counter
A Python scraper without status metrics is a guessing machine. Use a 10-second timeout, collect every status code, and save the HTML only after the response proves it is usable.
Install the baseline stack:
python -m venv .venv
source .venv/bin/activate
pip install httpx beautifulsoup4 pandas tenacity
Run a simple scraper against a public practice site:
# scrape_http.py
from collections import Counter

from bs4 import BeautifulSoup
import httpx

URLS = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
    "https://quotes.toscrape.com/page/3/",
]
HEADERS = {
    "User-Agent": "ProxyPeersResearchBot/1.0 (+https://proxypeers.com)",
    "Accept": "text/html,application/xhtml+xml",
}

status_counts = Counter()
rows = []

with httpx.Client(headers=HEADERS, timeout=10.0, follow_redirects=True) as client:
    for url in URLS:
        response = client.get(url)
        status_counts[response.status_code] += 1
        if response.status_code != 200:
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        for card in soup.select(".quote"):
            rows.append({
                "quote": card.select_one(".text").get_text(strip=True),
                "author": card.select_one(".author").get_text(strip=True),
                "tags": ",".join(tag.get_text(strip=True) for tag in card.select(".tag")),
                "source_url": url,
            })

print("status_counts:", dict(status_counts))
print("rows:", len(rows))
print(rows[:2])
Expected result: 3 pages return 200, and the script extracts 30 quote rows. That proves the parser and transport work before proxies enter the job.
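The "save the HTML only after the response proves it is usable" rule from this section can be sketched with the standard library alone. The helper names `is_usable` and `save_snapshot` are hypothetical, and the `class="quote"` marker is an assumption about the practice site's markup; swap in whatever string reliably appears in a complete page on your target.

```python
# save_if_usable.py
# Sketch only: helper names and the record marker are illustrative assumptions.
from pathlib import Path
from typing import Optional


def is_usable(status_code: int, html: str, marker: str = 'class="quote"') -> bool:
    """Return True only for a 200 response whose body contains the record marker."""
    return status_code == 200 and marker in html


def save_snapshot(directory: Path, name: str, status_code: int, html: str) -> Optional[Path]:
    """Write the HTML to disk only if it passes is_usable; otherwise return None."""
    if not is_usable(status_code, html):
        return None
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{name}.html"
    path.write_text(html, encoding="utf-8")
    return path
```

Inside the loop above, a call such as `save_snapshot(Path("snapshots"), "page-1", response.status_code, response.text)` would keep block pages and empty shells out of the snapshot folder.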
2. Parse Records Before You Optimize Transport
Bad parsing wastes good proxies. Extract 3 fields first, validate row count, then decide whether transport needs work.
For CSV output:
# scrape_to_csv.py
from collections import Counter
from pathlib import Path

from bs4 import BeautifulSoup
import httpx
import pandas as pd

URLS = [f"https://quotes.toscrape.com/page/{page}/" for page in range(1, 6)]
HEADERS = {
    "User-Agent": "ProxyPeersResearchBot/1.0 (+https://proxypeers.com)",
    "Accept": "text/html,application/xhtml+xml",
}


def parse_quotes(html: str, source_url: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select(".quote"):
        rows.append({
            "quote": card.select_one(".text").get_text(strip=True),
            "author": card.select_one(".author").get_text(strip=True),
            "tags": ",".join(tag.get_text(strip=True) for tag in card.select(".tag")),
            "source_url": source_url,
        })
    return rows


status_counts = Counter()
all_rows = []

with httpx.Client(headers=HEADERS, timeout=10.0, follow_redirects=True) as client:
    for url in URLS:
        response = client.get(url)
        status_counts[response.status_code] += 1
        if response.status_code == 200:
            all_rows.extend(parse_quotes(response.text, url))

Path("data").mkdir(exist_ok=True)
pd.DataFrame(all_rows).to_csv("data/quotes.csv", index=False)

print("status_counts:", dict(status_counts))
print("saved_rows:", len(all_rows))
print("output: data/quotes.csv")
When we run scrapers in our pipeline, we treat row count as a contract. If a page has 10 visible records and the parser saves 6, the bug is not the proxy.
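One way to enforce that contract is a small check that runs before anything is written to disk. This is a minimal sketch: `check_rows` and `REQUIRED_FIELDS` are hypothetical names, and the expected count is something you supply from what the page visibly shows. Tags are left out of the required fields on the assumption that a record can legitimately have none.

```python
# row_contract.py
# Sketch only: function and constant names are illustrative, not part of any library.
REQUIRED_FIELDS = ("quote", "author", "source_url")


def check_rows(rows: list[dict], expected_count: int) -> list[str]:
    """Return human-readable problems; an empty list means the contract holds."""
    problems = []
    if len(rows) != expected_count:
        problems.append(f"expected {expected_count} rows, parsed {len(rows)}")
    for index, row in enumerate(rows):
        # A field counts as missing when the key is absent or the value is empty.
        missing = [field for field in REQUIRED_FIELDS if not row.get(field)]
        if missing:
            problems.append(f"row {index} is missing {', '.join(missing)}")
    return problems


sample = [
    {"quote": "A", "author": "X", "source_url": "https://quotes.toscrape.com/page/1/"},
    {"quote": "B", "author": "", "source_url": "https://quotes.toscrape.com/page/1/"},
]
print(check_rows(sample, expected_count=10))
# ['expected 10 rows, parsed 2', 'row 1 is missing author']
```

If the list is non-empty, fix the selectors first. Changing the transport will not recover rows the parser dropped.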
3. Add Retries And Failure Labels Before You Add Proxies
Retries fix network noise. They do not fix blocks. Use 3 attempts, log the final status, and tag failures as transport, parser, or target-side blocks.
# scrape_with_retries.py
from collections import Counter

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
import httpx

URLS = [f"https://quotes.toscrape.com/page/{page}/" for page in range(1, 6)]
HEADERS = {
    "User-Agent": "ProxyPeersResearchBot/1.0 (+https://proxypeers.com)",
    "Accept": "text/html,application/xhtml+xml",
}


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=1, max=8),
    retry=retry_if_exception_type((httpx.TimeoutException, httpx.NetworkError)),
    # Re-raise the last httpx error so the counter records its real type
    # instead of tenacity's generic RetryError.
    reraise=True,
)
def fetch(client: httpx.Client, url: str) -> httpx.Response:
    return client.get(url)


def classify(response: httpx.Response) -> str:
    text = response.text.lower()
    if response.status_code in {401, 403, 429}:
        return "blocked_status"
    if "captcha" in text or "verify you are human" in text:
        return "bot_challenge"
    if response.status_code >= 500:
        return "server_error"
    if response.status_code == 200:
        return "ok"
    return "other"


counts = Counter()

with httpx.Client(headers=HEADERS, timeout=10.0, follow_redirects=True) as client:
    for url in URLS:
        try:
            response = fetch(client, url)
            label = classify(response)
            counts[(response.status_code, label)] += 1
        except Exception as exc:
            # Only reached after all 3 attempts failed at the transport level.
            counts[("exception", type(exc).__name__)] += 1

print(dict(counts))
We measured datacenter proxies at roughly 22% success on hard Google review pages. That number changed the architecture. Datacenter was not “a little worse.” It made the job non-viable.
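The three failure tags named in this section (transport, parser, target-side block) can be sketched as one function over the outcome of a fetch. The function name, its arguments, and the extra `server_error` and `other` buckets are assumptions made for illustration. In particular, treating "200 with zero rows" as a parser failure is only a first guess, because the same signal also shows up when the page renders its records with JavaScript.

```python
# failure_bucket.py
# Sketch only: names and bucket boundaries are illustrative assumptions.
from collections import Counter
from typing import Optional

BLOCK_STATUSES = {401, 403, 429}


def failure_bucket(status_code: Optional[int], rows_parsed: int, challenge_seen: bool = False) -> str:
    """Map one fetch outcome to ok, transport, target_block, parser, server_error, or other."""
    if status_code is None:
        # No response at all: timeout or connection error after the retries ran out.
        return "transport"
    if status_code in BLOCK_STATUSES or challenge_seen:
        return "target_block"
    if status_code >= 500:
        return "server_error"
    if status_code == 200:
        return "ok" if rows_parsed > 0 else "parser"
    return "other"


# (status_code, rows_parsed, challenge_seen) for a handful of made-up outcomes.
outcomes = [(200, 10, False), (200, 0, False), (403, 0, False), (None, 0, False), (200, 0, True)]
buckets = Counter(failure_bucket(*outcome) for outcome in outcomes)
print(dict(buckets))
```

Retries are worth spending on the `transport` bucket. A growing `target_block` bucket is the signal to look at the source IP instead.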
4. Plug In A Residential Proxy At The HTTP Layer
The proxy belongs in the client transport, not in the parser. In httpx, that means one proxy= argument (httpx 0.26 and later; older releases used proxies=). The rest of the scraper stays the same.
Set the proxy URL as an environment variable:
export PROXY_URL="http://USER:PASS@HOST:PORT"
python scrape_with_proxy.py
Use it in Python:
# scrape_with_proxy.py
import os
from collections import Counter

from bs4 import BeautifulSoup
import httpx

URLS = [f"https://quotes.toscrape.com/page/{page}/" for page in range(1, 6)]
HEADERS = {
    "User-Agent": "ProxyPeersResearchBot/1.0 (+https://proxypeers.com)",
    "Accept": "text/html,application/xhtml+xml",
}

proxy_url = os.getenv("PROXY_URL")
client_kwargs = {
    "headers": HEADERS,
    "timeout": 20.0,
    "follow_redirects": True,
}
if proxy_url:
    client_kwargs["proxy"] = proxy_url

status_counts = Counter()
rows = []

with httpx.Client(**client_kwargs) as client:
    for url in URLS:
        response = client.get(url)
        status_counts[response.status_code] += 1
        if response.status_code != 200:
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        rows.extend(card.select_one(".text").get_text(strip=True) for card in soup.select(".quote"))

print("proxy_enabled:", bool(proxy_url))
print("status_counts:", dict(status_counts))
print("rows:", len(rows))
Use residential proxies when the target punishes datacenter IP ranges. In our work, hard review pages moved from roughly 22% success on datacenter to a viable job only after we moved them to residential IPs.
For that layer, we use Decodo first-hand. Decodo Residential starts at $2/GB as of May 2026, and its residential page lists a 115M+ IP pool across 195+ locations: Decodo residential proxies. That price matters because rendered pages burn bandwidth. A browser session that loads images, scripts, fonts, and tracking tags costs more than raw HTML.
Decision table:
| Signal | Use | Reason |
|---|---|---|
| 200 status, complete HTML | No proxy | The target is open enough for raw HTTP |
| 403 or 429 after 3 attempts | Residential proxy | The target is rejecting the source, not the parser |
| HTML has records missing | Browser | The content renders after JavaScript |
| Consent wall blocks pagination | Browser plus handler | The next token appears only after state changes |
| Region-specific content differs | Geo-targeted residential | Geography changes the data |
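The table can also be written as a small function so the choice is made from logged signals rather than by feel. This is a sketch: `choose_transport` and its return labels are hypothetical, the status code is assumed to be the final one after retries, and the order in which the rows are checked is our assumption because the table itself does not rank them.

```python
# choose_transport.py
# Sketch only: function name, labels, and check order are illustrative assumptions.
def choose_transport(
    status_code: int,
    html_complete: bool,
    consent_wall: bool = False,
    geo_sensitive: bool = False,
) -> str:
    """Map the signals from the decision table to a transport choice."""
    if consent_wall:
        return "browser_plus_consent_handler"
    if status_code in {403, 429}:
        return "residential_proxy"
    if status_code == 200 and not html_complete:
        return "browser"
    if geo_sensitive:
        return "geo_targeted_residential"
    if status_code == 200 and html_complete:
        return "no_proxy"
    # Anything the table does not cover (for example 5xx) needs a human look.
    return "investigate"


print(choose_transport(200, html_complete=True))
print(choose_transport(429, html_complete=False))
```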
5. Add Camoufox And Xvfb For Browser-Level Scraping
The IP is necessary. It is not sufficient. Real scraping on hostile pages needs a residential proxy, a credible browser, consent-wall handling, and pagination state.
We run Camoufox, a hardened Firefox build, under xvfb for browser jobs. The Camoufox docs show headless="virtual" for a Linux virtual display and state that it starts a lightweight virtual display in the background: Camoufox virtual display docs. The usage docs also expose browser options such as geoip=True, block_images=True, and proxy support: Camoufox usage docs.
Install:
pip install camoufox
python -m camoufox fetch
On Debian or Ubuntu runners:
sudo apt-get update
sudo apt-get install -y xvfb
Browser scraper:
# browser_fetch.py
import os

from camoufox.sync_api import Camoufox

TARGET_URL = os.getenv("TARGET_URL", "https://quotes.toscrape.com/js/")
PROXY_SERVER = os.getenv("PROXY_SERVER")  # example: http://HOST:PORT
PROXY_USER = os.getenv("PROXY_USER")
PROXY_PASS = os.getenv("PROXY_PASS")


def proxy_config() -> dict | None:
    if not PROXY_SERVER:
        return None
    config = {"server": PROXY_SERVER}
    if PROXY_USER and PROXY_PASS:
        config["username"] = PROXY_USER
        config["password"] = PROXY_PASS
    return config


def click_consent(page) -> None:
    # Click the first consent button that exists on the page, then stop.
    selectors = [
        "button:has-text('Accept')",
        "button:has-text('I agree')",
        "button:has-text('Allow all')",
        "[aria-label='Accept']",
    ]
    for selector in selectors:
        locator = page.locator(selector)
        if locator.count() > 0:
            locator.first.click(timeout=1500)
            return


with Camoufox(
    headless="virtual",
    proxy=proxy_config(),
    geoip=bool(PROXY_SERVER),
    block_images=True,
) as browser:
    page = browser.new_page()
    page.goto(TARGET_URL, wait_until="domcontentloaded", timeout=45000)
    click_consent(page)
    page.wait_for_load_state("networkidle", timeout=15000)
    cards = page.locator(".quote")
    print("url:", TARGET_URL)
    print("cards:", cards.count())
    print("title:", page.title())
In our pipeline, Camoufox is not a decoration. We use it when raw HTTP returns empty shells, hidden pagination tokens, or bot challenges that only appear after JavaScript runs.
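An "empty shell" can be flagged before a browser is ever launched. The sketch below is a rough heuristic under stated assumptions: `looks_like_empty_shell` is a hypothetical helper, and the `class="quote"` marker stands in for whatever markup a complete record carries on your target.

```python
# detect_empty_shell.py
# Sketch only: a rough heuristic with an assumed record marker.
import re

SCRIPT_TAG = re.compile(r"<script\b", re.IGNORECASE)


def looks_like_empty_shell(html: str, record_marker: str = 'class="quote"') -> bool:
    """True when the HTML carries script tags but none of the expected record markup."""
    has_records = record_marker in html
    has_scripts = bool(SCRIPT_TAG.search(html))
    return has_scripts and not has_records


shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(looks_like_empty_shell(shell))
```

A page that trips this check on raw HTTP is a candidate for the browser layer. A page with neither scripts nor records is more likely a block page or a wrong URL, and a browser will not help.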
6. Rotate Geography By Outlet, Not Randomly Per Request
Geography is data quality. If the target returns different content in Mumbai than it returns in Dubai, random global rotation corrupts the dataset.
We assign IP geography per outlet. India jobs use Mumbai ports. Gulf jobs use regional ports. That keeps the collected pages aligned with what a local user sees.
Use a target map:
# geo_targets.py
import os

import httpx

OUTLETS = [
    {
        "name": "india_storefront",
        "url": "https://example.com/in/products",
        "proxy": os.getenv("PROXY_MUMBAI"),
    },
    {
        "name": "uae_storefront",
        "url": "https://example.com/ae/products",
        "proxy": os.getenv("PROXY_GULF"),
    },
]
HEADERS = {
    "User-Agent": "ProxyPeersResearchBot/1.0 (+https://proxypeers.com)",
    "Accept": "text/html,application/xhtml+xml",
}

for outlet in OUTLETS:
    client_kwargs = {
        "headers": HEADERS,
        "timeout": 20.0,
        "follow_redirects": True,
    }
    if outlet["proxy"]:
        client_kwargs["proxy"] = outlet["proxy"]
    # One client per outlet keeps every request for that outlet on the same regional port.
    with httpx.Client(**client_kwargs) as client:
        response = client.get(outlet["url"])
        print({
            "outlet": outlet["name"],
            "status": response.status_code,
            "bytes": len(response.content),
            "proxy_set": bool(outlet["proxy"]),
        })
We measured this failure mode in production: the same target page returned different outlet lists by region. Random proxy rotation produced mixed local views in one export. Per-outlet ports fixed the dataset.
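A cheap guard against that failure mode is to stamp each row with the region it was fetched through and check the export before it ships. The sketch below assumes two hypothetical row fields, `outlet` and `proxy_region`, that your collector would have to add. Neither appears in the scripts above.

```python
# region_check.py
# Sketch only: assumes each exported row carries "outlet" and "proxy_region" fields.
from collections import defaultdict


def mixed_region_outlets(rows: list[dict]) -> dict[str, set[str]]:
    """Return outlets whose rows were collected through more than one proxy region."""
    regions_by_outlet: dict[str, set[str]] = defaultdict(set)
    for row in rows:
        regions_by_outlet[row["outlet"]].add(row["proxy_region"])
    return {
        outlet: regions
        for outlet, regions in regions_by_outlet.items()
        if len(regions) > 1
    }


export = [
    {"outlet": "india_storefront", "proxy_region": "mumbai"},
    {"outlet": "india_storefront", "proxy_region": "mumbai"},
    {"outlet": "uae_storefront", "proxy_region": "gulf"},
    {"outlet": "uae_storefront", "proxy_region": "mumbai"},
]
print(mixed_region_outlets(export))
```

An empty result means every outlet was seen from exactly one region. Anything else points at rotation leaking across outlets.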
7. Replay Pagination Tokens Instead Of Clicking Forever
Deep pagination is a state problem. A scraper that clicks “next” 50 times inside a browser burns bandwidth and invites session drift.
The better pattern is browser once, token replay after. Use Camoufox to reach the page state, extract the token, then continue with httpx through the same proxy family.
Example pattern:
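# token_replay.py
import os
from urllib.parse import urlsplit

import httpx
from camoufox.sync_api import Camoufox

START_URL = os.getenv("START_URL", "https://example.com/search")
API_URL = os.getenv("API_URL", "https://example.com/api/search")
PROXY_URL = os.getenv("PROXY_URL")
MAX_PAGES = int(os.getenv("MAX_PAGES", "50"))


def client() -> httpx.Client:
    kwargs = {
        "timeout": 30.0,
        "follow_redirects": True,
        "headers": {
            "User-Agent": "ProxyPeersResearchBot/1.0 (+https://proxypeers.com)",
            "Accept": "application/json,text/html",
        },
    }
    if PROXY_URL:
        kwargs["proxy"] = PROXY_URL
    return httpx.Client(**kwargs)


def browser_proxy() -> dict | None:
    # Split credentials out of PROXY_URL so the browser gets the same
    # server/username/password shape used in browser_fetch.py.
    if not PROXY_URL:
        return None
    parts = urlsplit(PROXY_URL)
    server = f"{parts.scheme}://{parts.hostname}"
    if parts.port:
        server += f":{parts.port}"
    config = {"server": server}
    if parts.username and parts.password:
        config["username"] = parts.username
        config["password"] = parts.password
    return config


# Browser once: reach the page state, then read the cursor token and cookies.
with Camoufox(headless="virtual", proxy=browser_proxy()) as browser:
    page = browser.new_page()
    page.goto(START_URL, wait_until="networkidle", timeout=45000)
    token = page.locator("input[name='cursor']").first.get_attribute("value")
    cookies = page.context.cookies()

cookie_jar = {cookie["name"]: cookie["value"] for cookie in cookies}
records = []
pages_requested = 0

# Token replay: continue over plain HTTP with the browser's cookies.
with client() as http:
    # Set cookies on the client; per-request cookies= is deprecated in httpx.
    http.cookies.update(cookie_jar)
    for page_number in range(1, MAX_PAGES + 1):
        if not token:
            break
        response = http.post(API_URL, json={"cursor": token, "page": page_number})
        pages_requested += 1
        if response.status_code != 200:
            print("stop_status:", response.status_code)
            break
        payload = response.json()
        batch = payload.get("results", [])
        records.extend(batch)
        token = payload.get("next_cursor")

print("records:", len(records))
print("pages_requested:", pages_requested)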
This is the original insight most proxy roundups miss: blocks are a system problem. On hard jobs, residential IPs fixed the source reputation, Camoufox fixed browser credibility, consent handling exposed the state, and token replay kept deep pagination cheap enough to run.
8. Measure Success Rate, Latency, And Cost Per 1,000 URLs
Provider choice without measurement turns into brand preference. Track 1,000 attempted URLs, success rate, median latency, failure labels, and proxy bandwidth.
A minimal run report:
# run_report.py
from dataclasses import asdict, dataclass
from statistics import median
from time import perf_counter

import httpx

URLS = [f"https://quotes.toscrape.com/page/{page}/" for page in range(1, 11)]


@dataclass
class Result:
    url: str
    status: int
    ms: float
    bytes_in: int
    label: str


def label_response(response: httpx.Response) -> str:
    text = response.text.lower()
    # Match the class attribute in the raw HTML; ".quote" is a CSS selector
    # and never appears literally in the markup.
    if response.status_code == 200 and 'class="quote"' in text:
        return "ok"
    if response.status_code in {403, 429}:
        return "blocked"
    if "captcha" in text or "verify you are human" in text:
        return "challenge"
    return "other"


results = []

with httpx.Client(timeout=20.0, follow_redirects=True) as client:
    for url in URLS:
        start = perf_counter()
        response = client.get(url)
        elapsed_ms = (perf_counter() - start) * 1000
        results.append(Result(
            url=url,
            status=response.status_code,
            ms=elapsed_ms,
            bytes_in=len(response.content),
            label=label_response(response),
        ))

ok = [result for result in results if result.label == "ok"]
latencies = [result.ms for result in results]
bytes_total = sum(result.bytes_in for result in results)

print({
    "attempted": len(results),
    "success": len(ok),
    "success_rate": round(len(ok) / len(results), 3),
    "median_ms": round(median(latencies), 1),
    "mb_downloaded": round(bytes_total / 1024 / 1024, 3),
    "failures": [asdict(result) for result in results if result.label != "ok"],
})
For proxy spend, convert bandwidth into cost:
# cost_math.py
gb_used = 2.4
price_per_gb = 2.00 # Decodo Residential, as of May 2026
attempted_urls = 1000
successful_urls = 840
cost = gb_used * price_per_gb
cost_per_1000_attempts = cost / attempted_urls * 1000
cost_per_1000_successes = cost / successful_urls * 1000
print({
    "gb_used": gb_used,
    "cost_usd": round(cost, 2),
    "cost_per_1000_attempts": round(cost_per_1000_attempts, 2),
    "cost_per_1000_successes": round(cost_per_1000_successes, 2),
})
Residential bandwidth adds up fast. That is why we block images in browser jobs, replay tokens after the first page, and keep raw HTTP for targets that still return complete HTML.
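To see how page weight drives that bill, the same arithmetic can be run per fetch mode. The sketch below takes the $2/GB price quoted above as its default. The two page weights are placeholder inputs chosen only to show the calculation, not measurements, so replace them with the `bytes_in` figures from your own run report. The helper name `proxy_cost_usd` is hypothetical.

```python
# bandwidth_by_mode.py
# Sketch only: page weights below are placeholder inputs, not measured values.
def proxy_cost_usd(pages: int, kb_per_page: float, price_per_gb: float = 2.00) -> float:
    """Estimate proxy spend for a run from page count and average page weight in KB."""
    gb_used = pages * kb_per_page / 1024 / 1024
    return round(gb_used * price_per_gb, 2)


# Hypothetical average weights per fetch mode, in kilobytes per page.
ASSUMED_KB_PER_PAGE = {
    "raw_http": 50,
    "browser_with_images_blocked": 2048,
}

for mode, kb in ASSUMED_KB_PER_PAGE.items():
    print(mode, proxy_cost_usd(pages=1000, kb_per_page=kb))
```

Whatever the real numbers turn out to be, cost scales linearly with average page weight, which is why trimming what the browser downloads matters more than small differences in the per-GB price.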
9. Use A Provider Only Where It Fits The Job
We use Decodo as the workhorse for hard scraping jobs because we tested it first-hand at scale. It fits review scraping, search result collection, e-commerce pages, and region-sensitive jobs that punish datacenter IPs.
The fit is specific. Decodo Residential starts at $2/GB as of May 2026, its residential pool is listed at 115M+ IPs across 195+ locations, and the same vendor also sells ISP, mobile, datacenter, Web Scraping API, and Site Unblocker products. Source: Decodo residential proxies.
We have researched Bright Data, Oxylabs, Webshare, and IPRoyal from public pricing pages and community reports. We have not benchmarked them first-hand at ProxyPeers scale yet. We do not present their ratings as measured lab results.
FAQ
Is Python enough for web scraping?
Yes, for raw HTML. httpx plus BeautifulSoup handles pages that return complete records in the first response. Once JavaScript, consent walls, cursor tokens, or bot checks control the content, Python still runs the job, but the stack needs a browser layer and measured proxy routing.
Where does the proxy plug into a Python scraper?
The proxy plugs into the HTTP client. In httpx, pass proxy="http://USER:PASS@HOST:PORT" to httpx.Client. Keep the parser unchanged. That separation lets you test whether failures come from transport or extraction.
Should we use datacenter or residential proxies for scraping?
Use datacenter for easy, high-volume targets that return 200 with complete HTML. Use residential when the target rejects datacenter ranges. In our hard review-page tests, datacenter proxies fell to roughly 22% success, and residential made the job viable.
Why use Camoufox instead of standard headless Chrome?
Use Camoufox when the target scores browser fingerprints or hides data behind JavaScript state. We run Camoufox under xvfb because the browser layer has to look credible, not just fetch markup. The Camoufox docs support headless="virtual" for that Linux setup.
Which proxy provider do we recommend for web scraping with Python?
For hard targets we have tested first-hand, we recommend Decodo. It starts at $2/GB for Residential as of May 2026 and gives us the geo-targeting needed for per-outlet collection. For other providers, our current view is researched, not first-hand benchmarked.