Proxies for Web Scraping
Web scraping works best when the proxy setup matches the target site's sensitivity, request volume, and session behavior. The wrong proxy type leads to early CAPTCHAs, rate limits, or unstable results long before the scraper logic itself fails.
Guardrail: keep the exact endpoint from your portal and change only the username controls when you need rotation, sticky sessions, or country targeting. Public docs use placeholders like <ROTATING_HTTP_ENDPOINT> because live host and port values are account-specific.
What web scraping needs from a proxy
- Route rotation so repeated requests do not all come from one IP.
- Sticky session controls for pagination, carts, or multi-request flows that must stay consistent.
- Residential or mobile identity when strict targets challenge datacenter traffic.
- Predictable auth and retry behavior that works cleanly with common Python and Node.js clients.
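The rotation and sticky-session needs above come down to how the proxy username is assembled. A minimal sketch of that assembly, assuming the `--session-...` and `--duration-...` control names shown in the working example below (the `build_proxy_url` helper itself is hypothetical, not part of any SDK):

```python
from typing import Optional
from urllib.parse import quote


def build_proxy_url(username: str, api_key: str, endpoint: str,
                    session: Optional[str] = None,
                    duration: Optional[int] = None) -> str:
    """Assemble a proxy URL, appending optional username controls."""
    user = username
    if session is not None:
        # A named session pins one route for the session's lifetime.
        user += f"--session-{session}"
    if duration is not None:
        # Duration (seconds) bounds how long the pinned route is held.
        user += f"--duration-{duration}"
    # Percent-encode credentials so reserved characters survive the URL.
    return f"http://{quote(user, safe='-')}:{quote(api_key, safe='')}@{endpoint}"


# No session token: each request may take a fresh route.
rotating = build_proxy_url("<USERNAME>", "<API_KEY>", "<ROTATING_HTTP_ENDPOINT>")

# Sticky: one identity held for a bounded 120-second window.
sticky = build_proxy_url("<USERNAME>", "<API_KEY>", "<ROTATING_HTTP_ENDPOINT>",
                         session="scrape-batch-01", duration=120)
```

Omitting the session token is what makes the rotating variant rotate; the endpoint itself stays the same in both cases.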
Which NinjaProxy product fits and why
| Proxy setup | Best for | Why |
|---|---|---|
| Rotating residential | General-purpose scraping on stricter sites | Rotating residential traffic is the best default when a target blocks obvious datacenter requests. |
| Sticky residential session | Pagination, search flows, and scraping that must preserve one route briefly | Username session controls keep one identity stable while the scraper completes a bounded sequence. |
| Datacenter | Fast, lower-cost scraping on lenient targets | Datacenter routes are useful when the site is tolerant and throughput matters more than stealth. |
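The sticky-session row maps to flows like pagination, where every page must leave from the same route. A sketch of that pattern, reusing the session-pinned proxy URL from the working example below; the `?page=N` URL pattern and the `scrape_listing` helper are hypothetical:

```python
import requests

# Session token, duration, and provider follow the username format from the
# working example on this page; placeholders come from your portal.
PROXY = ("http://<USERNAME>--session-scrape-batch-01--duration-120"
         "--provider-res:<API_KEY>@<ROTATING_HTTP_ENDPOINT>")


def page_urls(base: str, pages: int) -> list:
    """Build the bounded page sequence one sticky session will walk."""
    return [f"{base}?page={n}" for n in range(1, pages + 1)]


def scrape_listing(base: str, pages: int) -> list:
    """Fetch each page through the same pinned residential route."""
    bodies = []
    with requests.Session() as s:
        s.proxies.update({"http": PROXY, "https": PROXY})
        for url in page_urls(base, pages):
            resp = s.get(url, timeout=20)
            resp.raise_for_status()
            bodies.append(resp.text)
    return bodies
```

Because the session token stays constant for the whole batch, the target sees one consistent identity until the duration window expires.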
Working code example
This Python example uses a rotating gateway endpoint and pins a residential route long enough to finish one scraping batch without guessing hostnames, ports, or undocumented parameters.
```python
import requests

TARGET_URL = "https://example.com/category/widgets"

# Username controls pin one residential route ("scrape-batch-01") for 120 s.
PROXY = "http://<USERNAME>--session-scrape-batch-01--duration-120--provider-res:<API_KEY>@<ROTATING_HTTP_ENDPOINT>"

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (compatible; NinjaProxyDocsExample/1.0)",
})

# Route both schemes through the same proxy URL.
response = session.get(
    TARGET_URL,
    proxies={"http": PROXY, "https": PROXY},
    timeout=20,
)
response.raise_for_status()
print(response.text[:500])
```
Common failure modes
| Failure | Likely cause | Fix |
|---|---|---|
| CAPTCHAs or block pages | Datacenter traffic is too easy for the target to detect. | Move the workflow to residential traffic and reduce burstiness per route. |
| Repeated 429 or 403 responses | Too many requests are landing from one session or IP. | Rotate sessions more often, lower concurrency, and stagger retries. |
| Same IP for every request | You are using an assigned/static endpoint or reusing one session token. | Use a rotating gateway and remove or vary the --session-... token when you need fresh routes. |
| 407 authentication errors | Malformed credentials or missing URL encoding. | Re-copy the username and API key, and percent-encode reserved characters if needed. |
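The 429/403 and 407 fixes above can be sketched together: percent-encode credentials before building the proxy URL, stagger retries with exponential backoff, and check distinct exit IPs to confirm rotation. The helper names (`backoff_delays`, `rotation_ok`, `get_with_retries`) are hypothetical, not part of any SDK:

```python
import time
from urllib.parse import quote

import requests

# Percent-encode reserved characters in credentials before embedding them in
# the proxy URL; a raw "@", ":" or "/" in the API key is a common 407 cause.
USERNAME = quote("<USERNAME>", safe="")
API_KEY = quote("<API_KEY>", safe="")
PROXY = f"http://{USERNAME}:{API_KEY}@<ROTATING_HTTP_ENDPOINT>"


def backoff_delays(attempts: int, base: float = 1.0) -> list:
    """Exponential backoff schedule (1 s, 2 s, 4 s, ...) between retries."""
    return [base * (2 ** i) for i in range(attempts)]


def rotation_ok(origins) -> bool:
    """More than one distinct exit IP means the gateway is rotating."""
    return len(set(origins)) > 1


def get_with_retries(url: str, attempts: int = 4):
    """Retry 429/403 responses with staggered delays instead of hammering."""
    for delay in backoff_delays(attempts):
        resp = requests.get(url, proxies={"http": PROXY, "https": PROXY},
                            timeout=20)
        if resp.status_code not in (429, 403):
            return resp
        time.sleep(delay)
    return resp
```

To use `rotation_ok`, collect the origin IPs reported by an IP-echo endpoint across several requests and pass them in; a single repeated IP means a session token or static endpoint is still pinning the route.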
Related docs
- Python integration for `requests`, `httpx`, `aiohttp`, Scrapy, and Playwright examples.
- Authentication for username + API key, whitelist mode, and rotating username controls.
- Troubleshooting for 407s, timeouts, sticky-session mistakes, and block-response triage.
- Rotating proxies for session controls, provider selection, and route overrides.
- Rate Limits for concurrency and retry guidance.
