How To Scrape Google
Scrape Google with a browser collector, not a naked HTTP loop. In our pipeline, datacenter proxies fell to roughly 22% success on hard Google review-page jobs, so the working pattern is residential IP + credible browser + geo-correct rotation + consent-wall handling + retry logs. We use Decodo residential for this class of job because we tested it at scale; Decodo lists residential proxies from $2/GB and Site Unblocker from $0.95/1K requests as of May 2026 on its pricing page. Some links are affiliate links; ProxyPeers earns a commission on qualifying purchases at no extra cost to you.
1. Define The Google Job Before Sending 1 Request
Do not “scrape Google” as one task. Split it into a surface, a country, a result depth, and a retry budget.
For SERPs, we usually collect the first 10 organic results, the query, the gl country, the hl language, the timestamp, and the proxy exit region. That is enough for rank tracking and SEO monitoring without turning one keyword into 100 wasteful browser sessions.
For review pages and local results, the IP location changes the data. We measured this in our own jobs: Mumbai ports for India and regional Gulf ports returned pages that matched local user views. A generic US exit produced the wrong page state.
Use this run contract:
| Field | Example |
|---|---|
| Query | best crm software |
| Country | IN |
| Language | en |
| Depth | 10 results |
| Retry cap | 2 retries |
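The run contract can also live in code, so a malformed job fails before any request is sent. This is a minimal sketch: the `RunContract` class and its field names are hypothetical and simply mirror the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunContract:
    """One Google job. Hypothetical names that mirror the table fields."""
    query: str
    country: str        # two-letter code sent as gl, e.g. "IN"
    language: str       # language code sent as hl, e.g. "en"
    depth: int = 10     # number of organic results to keep
    retry_cap: int = 2  # retries allowed after the first attempt

    def __post_init__(self):
        # Reject malformed contracts before any browser session starts.
        if not self.query.strip():
            raise ValueError("query must not be empty")
        if len(self.country) != 2 or not self.country.isalpha():
            raise ValueError("country must be a two-letter code")
        if self.depth < 1 or self.retry_cap < 0:
            raise ValueError("depth must be >= 1 and retry_cap >= 0")


contract = RunContract(query="best crm software", country="IN", language="en")
print(contract)
```

Freezing the dataclass keeps a contract from being changed mid-run, which makes the later run log easier to trust.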
2. Install Camoufox And Xvfb In 4 Commands
Google detects more than IPs. The browser layer matters. We run Camoufox, a hardened Firefox build with Playwright compatibility, under xvfb so Linux workers get a real display path instead of brittle headless behavior. Camoufox documents the Python API and proxy support in its usage docs and the xvfb install path in its installation docs.
python -m venv .venv
source .venv/bin/activate
pip install -U "camoufox[geoip]" playwright beautifulsoup4
python -m camoufox fetch
On Debian or Ubuntu workers:
sudo apt-get update
sudo apt-get install -y xvfb libgtk-3-0 libx11-xcb1 libasound2
The geoip extra matters with proxies. It lets Camoufox align browser geography with the proxy exit instead of sending a Prague timezone through a Mumbai IP.
3. Run A 10-Result Google SERP Collector
This script opens Google, handles common consent buttons, extracts organic links, and stops at 10 results. It does not pretend CAPTCHA pages are success. It exits hard when Google returns “unusual traffic,” because retrying that same fingerprint wastes bandwidth.
Save this as google_serp_camoufox.py.
import argparse
import json
import os
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlencode, urlparse

from camoufox.sync_api import Camoufox
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError


def proxy_config():
    # Read the proxy from the environment; None means "run without a proxy".
    server = os.getenv("PROXY_SERVER")
    if not server:
        return None
    cfg = {"server": server}
    username = os.getenv("PROXY_USERNAME")
    password = os.getenv("PROXY_PASSWORD")
    if username and password:
        cfg["username"] = username
        cfg["password"] = password
    return cfg


def google_url(query, country, language, count):
    params = {
        "q": query,
        "gl": country,
        "hl": language,
        "num": str(count),
        "pws": "0",
    }
    return "https://www.google.com/search?" + urlencode(params)


def accept_consent(page):
    # Try each known consent button; click the first one that is visible.
    selectors = [
        "button:has-text('Accept all')",
        "button:has-text('I agree')",
        "button:has-text('Accept')",
        "form[action*='consent'] button",
    ]
    for selector in selectors:
        try:
            button = page.locator(selector).first
            if button.is_visible(timeout=1500):
                button.click(timeout=3000)
                page.wait_for_load_state("domcontentloaded", timeout=10000)
                return True
        except PlaywrightTimeoutError:
            continue
    return False


def normalize_google_href(href):
    if not href:
        return None
    parsed = urlparse(href)
    # Unwrap Google redirect links of the form /url?q=<target>.
    if parsed.netloc.endswith("google.com") and parsed.path == "/url":
        target = parse_qs(parsed.query).get("q", [None])[0]
        return target
    # Keep external http(s) links; drop Google's own navigation links.
    if parsed.scheme in ("http", "https") and not parsed.netloc.endswith("google.com"):
        return href
    return None


def extract_results(page, limit):
    anchors = page.locator("a").evaluate_all(
        """els => els.map(a => ({
            text: (a.innerText || '').trim(),
            href: a.href
        }))"""
    )
    results = []
    seen = set()
    for anchor in anchors:
        url = normalize_google_href(anchor["href"])
        title = " ".join(anchor["text"].split())
        if not url or not title:
            continue
        if url in seen:
            continue
        # Very short anchor text is usually a UI control, not a result title.
        if len(title) < 8:
            continue
        seen.add(url)
        results.append({"rank": len(results) + 1, "title": title, "url": url})
        if len(results) >= limit:
            break
    return results


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("query")
    parser.add_argument("--country", default="US")
    parser.add_argument("--language", default="en")
    parser.add_argument("--count", type=int, default=10)
    args = parser.parse_args()

    url = google_url(args.query, args.country, args.language, args.count)
    with Camoufox(
        headless=False,
        geoip=True,
        locale=f"{args.language}-{args.country}",
        block_images=True,
        proxy=proxy_config(),
        window=(1366, 768),
    ) as browser:
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded", timeout=45000)
        accept_consent(page)
        html = page.content().lower()
        # Treat a block page as a failure instead of parsing it as results.
        if "unusual traffic" in html or "/sorry/" in page.url:
            raise RuntimeError("Google returned an anti-automation block page")
        results = extract_results(page, args.count)

    payload = {
        "query": args.query,
        "country": args.country,
        "language": args.language,
        "count": len(results),
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "results": results,
    }
    print(json.dumps(payload, indent=2))


if __name__ == "__main__":
    main()
Run it through xvfb:
xvfb-run -a python google_serp_camoufox.py "how to scrape google" --country US --language en --count 10
4. Plug In A Residential Proxy At 1 Line
The proxy plugs into the browser launch, not into the parser. That distinction matters because Google scores the whole session: IP, TLS path, browser fingerprint, cookies, consent flow, timing, and pagination.
Set the proxy as environment variables:
export PROXY_SERVER="http://gate.example-proxy.com:7000"
export PROXY_USERNAME="YOUR_USERNAME"
export PROXY_PASSWORD="YOUR_PASSWORD"
xvfb-run -a python google_serp_camoufox.py "site:proxypeers.com proxy" --country US --language en --count 10
Use curl only to test credentials. A 200 response from curl proves the proxy works. It does not prove the setup survives Google.
curl -x "http://YOUR_USERNAME:YOUR_PASSWORD@gate.example-proxy.com:7000" \
  "https://www.google.com/search?q=proxy+testing&gl=US&hl=en&num=10"
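One practical snag with the single-URL form used by curl is that a password containing `@`, `:`, or `/` breaks the URL unless it is percent-encoded. The sketch below builds that URL from the same three environment variables. The helper name `proxy_url_from_env` is hypothetical, and it assumes `PROXY_SERVER` includes a scheme such as `http://`.

```python
import os
from urllib.parse import quote, urlsplit


def proxy_url_from_env(env=os.environ):
    """Build scheme://user:pass@host:port from the PROXY_* variables."""
    server = env.get("PROXY_SERVER")
    if not server:
        return None
    # Assumes PROXY_SERVER carries a scheme, e.g. "http://host:port".
    parts = urlsplit(server)
    username = env.get("PROXY_USERNAME")
    password = env.get("PROXY_PASSWORD")
    if not (username and password):
        return f"{parts.scheme}://{parts.netloc}"
    # safe="" forces characters such as "@", ":" and "/" to be encoded.
    creds = f"{quote(username, safe='')}:{quote(password, safe='')}"
    return f"{parts.scheme}://{creds}@{parts.netloc}"


example = proxy_url_from_env({
    "PROXY_SERVER": "http://gate.example-proxy.com:7000",
    "PROXY_USERNAME": "user",
    "PROXY_PASSWORD": "p@ss:word",  # placeholder value for illustration
})
print(example)
```

The browser launch does not need this, because the collector passes the username and password as separate fields.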
When we ran hard Google review-page jobs through datacenter proxies, success landed near 22%. Moving those jobs to residential made them viable because the IP reputation matched a real user path. The implication is simple: use datacenter only for easy targets and health checks. Use residential for Google surfaces that fight back.
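To see why a 22% success rate hurts, it helps to turn it into attempts per successful page. The sketch below assumes each attempt succeeds independently with the same probability and that retries are unlimited, which is a simplification. The function names and the `mb_per_attempt` input are hypothetical; plug in your own measured page weight and pricing.

```python
def expected_attempts(success_rate):
    """Mean attempts per successful page under independent attempts."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


def cost_per_success(price_per_gb, mb_per_attempt, success_rate):
    """Bandwidth cost of one successful page, failed attempts included."""
    gb_per_attempt = mb_per_attempt / 1000
    return price_per_gb * gb_per_attempt * expected_attempts(success_rate)


# At 22% success, each collected page costs about 4.5 attempts on average.
print(round(expected_attempts(0.22), 2))  # 4.55
```

Comparing `cost_per_success` for two proxy types only makes sense once you have a measured success rate for each of them on the same job.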
5. Make Geography A Data-Quality Control, Not 1 Checkbox
Google results are local data. gl=IN with a US proxy is not the same as gl=IN with an India exit. The request parameter asks for India. The proxy exit proves the session belongs there.
For India jobs, we use Mumbai exits. For Gulf jobs, we use regional ports. In our pipeline, that changed the collected page state enough to affect downstream data, especially local packs, reviews, and availability-like surfaces.
Run the same query through 2 countries and store both outputs:
export PROXY_SERVER="http://YOUR_MUMBAI_PORT"
xvfb-run -a python google_serp_camoufox.py "iphone repair near me" --country IN --language en --count 10 > serp-in.json
export PROXY_SERVER="http://YOUR_US_PORT"
xvfb-run -a python google_serp_camoufox.py "iphone repair near me" --country US --language en --count 10 > serp-us.json
Compare URLs and titles:
python - <<'PY'
import json

for name in ["serp-in.json", "serp-us.json"]:
    with open(name) as f:
        data = json.load(f)
    print("\n" + name)
    for r in data["results"][:5]:
        print(r["rank"], r["title"][:80], r["url"])
PY
If the top 5 results are identical across countries for a local query, inspect the proxy exit. The common failure is not parsing. It is bad geography.
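The identical-top-5 check can be automated rather than eyeballed. This is a rough sketch, assuming the JSON shape printed by the collector above; the helper names are hypothetical, and the overlap measure (Jaccard similarity) is our choice for illustration, not something Google or the proxy provider defines.

```python
import json


def top_urls(path, n=5):
    """Return the first n result URLs from a collector JSON file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [r["url"] for r in data["results"][:n]]


def overlap_ratio(urls_a, urls_b):
    """Share of distinct URLs present in both lists (Jaccard similarity)."""
    a, b = set(urls_a), set(urls_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def looks_like_bad_geo(urls_a, urls_b):
    # Same URLs in the same order for a local query is the warning sign.
    return bool(urls_a) and urls_a == urls_b


# Usage, once both files exist:
#   in_urls, us_urls = top_urls("serp-in.json"), top_urls("serp-us.json")
#   print(overlap_ratio(in_urls, us_urls), looks_like_bad_geo(in_urls, us_urls))
```

A high overlap on a generic query is normal. The check is only meaningful for queries with local intent.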
6. Scale With A 5-Column Run Log
Do not scale Google scraping until every request writes a run log. We use 5 fields at minimum: query, country, proxy exit, status, and block reason.
Create a CSV log wrapper:
import csv
import os
import subprocess
import sys
from datetime import datetime, timezone

queries = [
    ("how to scrape google", "US"),
    ("best residential proxies", "US"),
    ("iphone repair near me", "IN"),
]

# The collector reads PROXY_SERVER from the environment, so log the same value.
proxy_exit = os.getenv("PROXY_SERVER", "direct")

with open("google-runs.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f,
        fieldnames=["time", "query", "country", "proxy_exit", "status", "detail"],
    )
    writer.writeheader()
    for query, country in queries:
        cmd = [
            "xvfb-run",
            "-a",
            sys.executable,  # run the collector with the current interpreter
            "google_serp_camoufox.py",
            query,
            "--country",
            country,
            "--language",
            "en",
            "--count",
            "10",
        ]
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=90)
            status = "ok" if result.returncode == 0 else "fail"
            # A traceback ends with the exception message, so keep the tail of stderr.
            detail = result.stdout[:300] if status == "ok" else result.stderr[-300:]
        except subprocess.TimeoutExpired:
            status = "timeout"
            detail = "90 second timeout"
        writer.writerow({
            "time": datetime.now(timezone.utc).isoformat(),
            "query": query,
            "country": country,
            "proxy_exit": proxy_exit,
            "status": status,
            "detail": detail.replace("\n", " "),
        })
Scale by sessions, not raw requests. A practical first production cap is 1 browser session per proxy port, 10 results per query, and 2 retries per blocked query. More retries against the same block pattern burn residential GB without adding data.
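The retry cap is easy to enforce in a small loop. The sketch below is one way to do it, not the only one: `fetch` is a hypothetical callable that returns an `(ok, block_reason)` pair, and in a real pipeline it would wrap one collector run.

```python
def run_with_retry_cap(fetch, retry_cap=2):
    """Call fetch() until it succeeds or the retry cap is used up."""
    reasons = []
    for _ in range(1 + retry_cap):  # first attempt plus retry_cap retries
        ok, reason = fetch()
        if ok:
            return {
                "status": "ok",
                "attempts": len(reasons) + 1,
                "block_reasons": reasons,
            }
        reasons.append(reason)
    return {
        "status": "blocked",
        "attempts": len(reasons),
        "block_reasons": reasons,
    }


def always_blocked():
    # Stand-in for a collector run that keeps hitting the block page.
    return False, "unusual traffic"


result = run_with_retry_cap(always_blocked, retry_cap=2)
print(result["status"], result["attempts"])  # blocked 3
```

Keeping every block reason in the result makes it obvious when all retries failed the same way, which is the signal to change the session rather than retry again.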
7. Provider Choice: 1 Tested Pick
For Google scraping, our tested pick is Decodo. We have used it first-hand at scale. We have not benchmarked Bright Data, Oxylabs, Webshare, or IPRoyal first-hand for this Google workflow, so we treat those as researched options from public pricing pages, not measured ProxyPeers results.
Decodo fits this job because it combines residential, ISP, mobile, and datacenter proxies with scraping APIs in one account. Its residential pool is listed as 115M+ IPs across 195+ locations, and the public residential price starts at $2/GB as of May 2026 on Decodo’s pricing page.
Webshare is the budget researched option. As of May 2026, its listed pricing puts rotating residential at $1.40/GB and datacenter at $0.018/IP. We would use it for easy targets and cost tests, not as our first pick for hard Google surfaces.
Bright Data and Oxylabs are researched enterprise options. Bright Data lists a 400M+ residential network on its residential pricing page. Oxylabs lists 175M+ residential IPs on its residential pricing page. Both deserve a paid benchmark before we call them better for our Google pipeline.
IPRoyal is the researched mid-market option. Its pricing page lists residential proxies from $1.75/GB as of May 2026. The non-expiring traffic model is useful for bursty workloads, but we have not run it through our hard Google review-page jobs yet.
FAQ: 5 Questions
Is scraping Google legal?
Scrape only public data you have a right to collect, and get legal review for commercial use. A 10-result SERP rank tracker has a different risk profile than collecting personal data from profiles or reviews.
Can I scrape Google with Python requests?
You can fetch a page with requests, but it breaks fast on real Google surfaces. Our measured datacenter success on hard Google review-page jobs was about 22%, and naked HTTP gives Google even fewer browser signals to trust.
Why use Camoufox instead of normal Playwright?
Normal Playwright solves automation. Camoufox targets browser fingerprint consistency. In our pipeline, the working unit is not “proxy.” It is residential proxy plus Camoufox under xvfb plus consent handling plus logged retries.
Which proxy type should I use for Google?
Use residential for hard Google targets. Use datacenter for cheap tests and easy pages. Decodo residential starts at $2/GB as of May 2026, and that cost is justified when failed datacenter retries destroy throughput.
How many results should I scrape per query?
Start with 10. Add depth only after block rate, latency, and duplicate rate are logged. Deep pagination changes the anti-bot profile, so treat page 2 and beyond as a separate job with its own retry budget.
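Two of those numbers can be computed from data the pipeline already writes. The sketch below assumes rows shaped like the CSV run log, with `country` and `status` columns, and it counts every non-`ok` status as a failure, which is a proxy for block rate rather than an exact measure. The function names are hypothetical, and latency is left out because the log above does not record it.

```python
import csv
from collections import Counter


def failure_rate_by_country(rows):
    """Fraction of runs per country whose status is anything but "ok"."""
    totals, failures = Counter(), Counter()
    for row in rows:
        totals[row["country"]] += 1
        if row["status"] != "ok":
            failures[row["country"]] += 1
    return {country: failures[country] / totals[country] for country in totals}


def duplicate_rate(urls):
    """Fraction of collected URLs that repeat an earlier URL."""
    if not urls:
        return 0.0
    return 1 - len(set(urls)) / len(urls)


def load_rows(path="google-runs.csv"):
    # Reads the run log written by the CSV wrapper.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


sample = [
    {"country": "US", "status": "ok"},
    {"country": "US", "status": "fail"},
    {"country": "IN", "status": "ok"},
]
print(failure_rate_by_country(sample))
```

If the failure rate climbs when you add depth, that supports treating deeper pages as their own job.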