Protocol-aware scraping that holds up in the wild

Modern sites are overwhelmingly encrypted, with well over 90% of page loads using HTTPS. Most requests are served over HTTP/2, and HTTP/3 already accounts for more than a quarter of traffic in many datasets. That reality changes how scrapers must behave. Clients need to negotiate ALPN cleanly, present a credible TLS fingerprint, and handle multiplexing. If your stack still forces HTTP/1.1 everywhere, you invite head-of-line blocking, odd server heuristics, and avoidable throttling.
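As a concrete starting point, the sketch below uses Python's httpx (installed with the http2 extra) to negotiate HTTP/2 over ALPN and report what was actually agreed. This is a minimal illustration, not a recommended fingerprint: the URL and header values are placeholders, and httpx does not speak HTTP/3.

```python
# Sketch: negotiating HTTP/2 via ALPN with httpx (assumes `pip install httpx[http2]`).
# The target URL and header values are placeholders, not a vetted browser profile.
import httpx

headers = {
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) ...",  # illustrative only
}

with httpx.Client(http2=True, headers=headers, timeout=10.0) as client:
    resp = client.get("https://example.com/")
    # http_version reports what ALPN actually negotiated: "HTTP/1.1" or "HTTP/2".
    print(resp.http_version, resp.status_code, len(resp.content))
```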

IPv6 also matters. Global adoption hovers around 40% of user requests, and many edge providers apply different rate limits and reputational scoring across IPv4 and IPv6. If you never test v6 paths, you are blind to a substantial slice of the network.
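A quick way to start testing v6 paths is a standard-library reachability probe like the one below; the hostname and port are placeholders.

```python
# Sketch: check whether a host resolves and accepts TCP connections over IPv6,
# using only the standard library.
import socket

def ipv6_reachable(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    try:
        infos = socket.getaddrinfo(host, port, socket.AF_INET6, socket.SOCK_STREAM)
    except socket.gaierror:
        return False  # no AAAA record resolved
    for family, socktype, proto, _, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
                return True  # TCP handshake over v6 succeeded
        except OSError:
            continue
    return False

print(ipv6_reachable("example.com"))
```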

Compression is another quiet breaker. Brotli now serves the majority of compressed HTTPS responses, especially for text assets. Fetchers that only advertise gzip will work, but they won’t look like a real browser and they waste bandwidth.
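The short sketch below, assuming httpx plus the brotli (or brotlicffi) package so responses decode transparently, advertises br and then checks which encoding the server actually used and how many bytes crossed the wire.

```python
# Sketch: confirm the server used Brotli when we advertise it, and compare wire
# size against decompressed size. Assumes httpx + the `brotli` package installed.
import httpx

with httpx.Client(headers={"accept-encoding": "gzip, deflate, br"}) as client:
    resp = client.get("https://example.com/")
    encoding = resp.headers.get("content-encoding", "identity")
    # num_bytes_downloaded is the wire size; len(resp.content) is after decoding.
    print(encoding, resp.num_bytes_downloaded, len(resp.content))
```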

Bot defense is no longer an edge case

Automated traffic is not a rounding error. Recent industry reporting places bad bots at roughly 32% of all traffic. That number explains why ordinary scraping patterns draw scrutiny faster than they used to. Expect rate limits, JavaScript computation gates, and behavior analysis that flags offbeat protocol use or timing.

Plan for two parallel streams. One handles straightforward HTML and API endpoints with clean concurrency and caching. The other uses full browser automation for pages that hinge on client-side rendering or challenge flows. Switch only when metrics say you need it, because headless browsers are expensive and noisy if misconfigured.
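One way to wire that up, sketched below with httpx and Playwright as stand-ins, is an escalation path that launches a browser only when the cheap fetch looks blocked. The challenge heuristic here is deliberately crude and would need tuning per target.

```python
# Sketch: escalate from a plain HTTP client to a headless browser only when needed.
# Assumes httpx and Playwright (`pip install playwright && playwright install chromium`).
import httpx
from playwright.sync_api import sync_playwright

def fetch(url: str) -> str:
    resp = httpx.get(url, follow_redirects=True, timeout=15.0)
    # Crude block detection: hard status codes or a challenge marker in the body.
    looks_blocked = resp.status_code in (403, 429) or "challenge" in resp.text.lower()
    if not looks_blocked:
        return resp.text

    # Fall back to full rendering for pages gated behind client-side JavaScript.
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
        return html
```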

Page weight and JavaScript shape your budget

The median web page is around 2 MB, and JavaScript alone often lands near the half-megabyte mark. That payload drives up transfer time, increases memory pressure in headless contexts, and magnifies the cost of retries. Treat bytes as part of the reliability budget. If you do not block analytics, video, and third-party widgets during scripted runs, you pay for work that does not move the extraction forward.

Prioritize network-level filtering. For many sites, you can cut bytes transferred by more than a third simply by refusing media and tracking domains. Pair that with connection optimizations such as preconnect hints and TLS early data when your client supports them.
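If you render with Playwright, request interception is one place to do that filtering. The sketch below blocks media, fonts, and a few tracking hosts; the blocklist is illustrative, not a vetted list.

```python
# Sketch: refuse heavy and tracking resources at the network layer in Playwright.
from playwright.sync_api import sync_playwright

BLOCKED_TYPES = {"image", "media", "font"}
BLOCKED_HOSTS = ("google-analytics.com", "doubleclick.net", "facebook.net")  # illustrative

def route_filter(route):
    req = route.request
    if req.resource_type in BLOCKED_TYPES or any(h in req.url for h in BLOCKED_HOSTS):
        return route.abort()
    return route.continue_()

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.route("**/*", route_filter)  # intercept every request before it leaves
    page.goto("https://example.com/")
    print(len(page.content()))
    browser.close()
```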

Operational metrics that actually predict success

You cannot manage what you do not measure. The following counters and distributions help isolate problems quickly and correlate them with infrastructure changes (a minimal sketch for tracking a few of them follows the list):

  • Resolved protocol mix: share of HTTP/2 and HTTP/3 by domain, plus fallback counts
  • TLS handshake failure rate and cipher/extension coverage
  • Status code distribution over time with 403 and 429 split out
  • Median and p90 TTFB by route, with retries counted separately
  • Challenge incidence per 1,000 requests and solve rate when using headless
  • IP pool health: ASN diversity, IPv4 vs IPv6 share, and per-IP request spacing
  • Brotli vs gzip response share and compression savings
  • Cache hit ratio for immutable assets fetched during scripted runs
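Here is a minimal, standard-library sketch of tracking a few of these in process; a real deployment would export them to whatever metrics stack you already run.

```python
# Sketch: in-process tallies for status codes, protocol mix, and TTFB percentiles.
from collections import Counter, defaultdict
from statistics import median, quantiles

status_codes = Counter()           # e.g. {200: 941, 403: 12, 429: 3}
protocol_mix = Counter()           # e.g. {"HTTP/2": 800, "HTTP/1.1": 150}
ttfb_by_route = defaultdict(list)  # seconds, keyed by route

def record(route: str, status: int, http_version: str, ttfb: float) -> None:
    status_codes[status] += 1
    protocol_mix[http_version] += 1
    ttfb_by_route[route].append(ttfb)

def summarize(route: str) -> tuple[float, float]:
    samples = ttfb_by_route[route]
    # Median and p90 (ninth decile cut point); needs a handful of samples to mean much.
    return median(samples), quantiles(samples, n=10)[8]
```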

One practical step saves hours of guessing: validate proxy liveness, protocol support, and latency before a job starts. A simple way to do that is a proxy checker that records HTTP/2 negotiation, TLS parameters, IPv6 reachability, and response timing. Run it continuously and quarantine outliers.
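A pre-flight check can be as small as the sketch below, which assumes httpx with HTTP/2 enabled; note that the proxy keyword and URL format vary by httpx version and provider, and the probe URL is a placeholder.

```python
# Sketch: a pre-flight proxy check recording negotiated protocol and response time.
# Recent httpx versions take `proxy=`; older ones use `proxies=`.
import time
import httpx

def check_proxy(proxy_url: str, probe_url: str = "https://example.com/") -> dict:
    start = time.monotonic()
    try:
        with httpx.Client(http2=True, proxy=proxy_url, timeout=10.0) as client:
            resp = client.get(probe_url)
        return {
            "alive": True,
            "http_version": resp.http_version,  # "HTTP/2" if negotiation succeeded
            "status": resp.status_code,
            "latency_s": round(time.monotonic() - start, 3),
        }
    except httpx.HTTPError as exc:
        return {"alive": False, "error": repr(exc)}
```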

Closing the loop with controlled experiments

Treat your scraper like a networked application, not a script. When block rates rise, change one variable at a time and rerun the same route set. Switch only the transport from HTTP/1.1 to HTTP/2 and re-measure. Toggle Brotli advertising. Swap a v4-only pool for dual-stack. Keep short, fixed test corpora so that improvements are attributable to the change you made, not to drift in site content.
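The experiment harness can stay tiny. The sketch below, again assuming httpx, reruns a fixed route set under two configurations that differ only in transport and compares block rates; the corpus URLs are placeholders.

```python
# Sketch: change exactly one variable (HTTP/1.1 vs HTTP/2) over a fixed corpus.
import httpx

ROUTES = ["https://example.com/a", "https://example.com/b"]  # fixed test corpus

def block_rate(http2: bool) -> float:
    blocked = 0
    with httpx.Client(http2=http2, timeout=10.0) as client:
        for url in ROUTES:
            try:
                resp = client.get(url)
                blocked += resp.status_code in (403, 429)
            except httpx.HTTPError:
                blocked += 1
    return blocked / len(ROUTES)

print("h1:", block_rate(http2=False), "h2:", block_rate(http2=True))
```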

The web you are harvesting is encrypted, compressed, multiplexed, and guarded by systems that assume a third of traffic is hostile. Align your clients with the way the modern stack actually works and measure the pieces that matter. That is how you keep extraction reliable without an arms race you cannot win.

Sofía Morales
