The modern way to scrape Cloudflare sites easily
Explains why having a browser for scraping is important when sites use anti-bot measures and how a browser can obtain the clearance cookie.
A practical look at using a browser-based approach to bypass Cloudflare anti-bot tests with the Pydoll library for smoother, cookie-backed scraping.
Summary
John Watson Rooney walks through a modern strategy for scraping Cloudflare-protected sites by leveraging a browser to obtain the cf_clearance cookie. He explains how the cookie is generated after a JavaScript challenge, and how you can reuse it in plain HTTP requests to avoid repetitive browser work. The video highlights Pydoll's browser-context HTTP requests as a streamlined way to capture cookies, manage sessions, and feed them into your HTTP client automatically. Rooney also notes complementary tools like Playwright, nodriver, and Camoufox for stealthy scraping, plus useful insights on network event monitoring, XHR discovery, and fingerprinting avoidance. He demonstrates a simple setup where Chromium is loaded, the Cloudflare test is passed, and subsequent HTTP requests return page data rather than a "Just a moment" page. The takeaway is a faster, cost-efficient scraping workflow that scales by persisting cookies and session details across requests. Rooney invites feedback on the Pydoll approach and asks viewers to share their experiences and improvements.
Key Takeaways
- Using a browser to obtain the CF clearance cookie avoids repeated JavaScript challenges and turnstiles on subsequent requests.
- Pydoll’s browser-context HTTP requests automate cookie extraction and formatting for your HTTP client, removing manual steps.
- Rooney demonstrates configuring Chromium binary paths and CDP connections to bypass traditional WebDriver constraints.
- Monitoring network events and XHR responses helps identify the right API endpoints and payloads for scraping.
- The approach emphasizes consistent browser fingerprinting and IP reputation handling to stay under anti-bot radars.
- A practical flow: load the page, pass the Cloudflare test, then reuse the same cookies for multiple HTTP requests.
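The practical flow in the takeaways boils down to converting whatever cookie structure the browser exports into something an HTTP client accepts. A minimal sketch, assuming cookies arrive as a list of dicts with `name`/`value` keys (the shape Playwright and similar tools emit); the helper name and sample values are illustrative, not Pydoll's actual API:

```python
# Sketch: flatten browser-exported cookies into a jar an HTTP client can reuse.
# Assumes each cookie is a dict with at least "name" and "value" keys,
# which is how most browser-automation tools export them.

def cookies_to_jar(browser_cookies):
    """Map browser cookie dicts to a simple name -> value mapping."""
    return {c["name"]: c["value"] for c in browser_cookies}

# After the browser has passed the Cloudflare challenge, cf_clearance
# appears in the exported list (values here are made up):
exported = [
    {"name": "cf_clearance", "value": "abc123", "domain": ".example.com"},
    {"name": "session_id", "value": "xyz789", "domain": ".example.com"},
]
jar = cookies_to_jar(exported)
# jar can now be passed straight to an HTTP client,
# e.g. requests.get(url, cookies=jar) or httpx.get(url, cookies=jar).
```

The point is that the browser only runs once; everything after that is plain, cheap HTTP.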
Who Is This For?
This is essential viewing for Python developers and data engineers who scrape Cloudflare-protected sites and want a scalable, cost-effective workflow that minimizes browser overhead.
Notable Quotes
"Having a browser to scrape data these days is extremely important when we're looking at sites that are protected by any kind of anti-bot vendor."
—Rooney sets up the problem: anti-bot protections require browser-based solutions.
"The specific thing... I like the most in this so far is this browser context HTTP requests, which does everything that I just talked about for you."
—Introduction of Pydoll’s core feature for handling cookies and sessions.
"This code is going to load up the browser, go to the page, pass all of the tests, the Cloudflare test, and then it's going to just request the same page and give me the text back."
—Demo flow: pass CF test, then reuse the session for actual data retrieval.
"Disable notifications and then, more importantly, disable these Blink features of automation controlled. There's a lot of giveaways and telltale signs when you're controlling a browser automatically."
—Tip about reducing automation fingerprints to avoid detection.
"If you've watched any of my last videos, you'll know that a lot of the things I do is going through the network tab, finding the XHR requests, finding which ones have the right JSON, and then manipulating those going forward."
—Rooney’s usual workflow for identifying API endpoints to scrape.
Questions This Video Answers
- How can I bypass Cloudflare’s anti-bot protection safely with browser automation?
- What is Cloudflare CF clearance and how does it help scraping at scale?
- What is Pydoll and how does its browser-context HTTP requests feature work?
- Which browser automation tools are best for avoiding anti-bot detection when scraping?
- How do I reuse Cloudflare clearance cookies across multiple HTTP requests in Python?
Cloudflare · CF clearance · Pydoll · Playwright · nodriver · Camoufox · Browser automation · CDP · Network monitoring · XHR · Web scraping
Full Transcript
Having a browser to scrape data these days is extremely important when we're looking at sites that are protected by any kind of anti-bot vendor, specifically because of the usage of a JavaScript test which is executed on your browser when you make your initial request. Upon passing this test, you get some kind of clearance cookie. In this instance, it's cf_clearance. And then you can actually access the site and click around as normal without being shown that turnstile over and over again. So this site that I'm looking at here, if I come to inspect and go to Application, we can see that we have the cookies here, specifically this cf_clearance.
So if I clear and get rid of all of these and then refresh the page, we're going to get this turnstile that we are very familiar with. This is the JavaScript test running. It's checking our IP, our fingerprint, our browser, and it's encrypting it and then sending it off to an endpoint which is going to give us that cookie back. So, what we want to be able to do is utilize a browser to load up this initial page here, take the cookies, which is our clearance, and then utilize all of that within the HTTP requests afterwards.
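When you start reusing cookies in plain HTTP requests, you need a quick way to tell whether a response is real page data or the interstitial described above. A simple heuristic sketch (the marker strings are my assumption based on the "Just a moment" page the video shows, not an official Cloudflare signal):

```python
# Heuristic: detect Cloudflare's JS-challenge interstitial in a response body.
# If this returns True, the cf_clearance cookie is missing, expired, or
# rejected, and the browser step needs to run again.

def looks_like_cf_challenge(html: str) -> bool:
    """Check a response body for common Cloudflare challenge markers."""
    markers = ("just a moment", "cf-challenge", "turnstile")
    text = html.lower()
    return any(m in text for m in markers)

# A blocked response typically carries the interstitial title:
blocked = "<html><title>Just a moment...</title></html>"
ok = "<html><title>Products - Example Shop</title></html>"
```

Checking each response this way lets a scraper fall back to the browser only when the clearance actually lapses.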
The reason why we want to do that is because, when we start to scrape at scale, we don't want to have to mess around with lots of browser usage; we want to save costs and be more efficient and quicker making these requests. And when you start to get further into it, things like browser profiles and sessions become much more important, and managing all of that within your scraper therefore becomes a bit easier. So what can we do then? Well, we can use some of the more popular browser automation libraries within Python, specifically Playwright, or, if you want to be more stealthy, things like nodriver and Camoufox, to get around the anti-bot protection and get those cookies that we can then utilize.
But what I want to show you today is a relatively new one called Pydoll that I've just started checking out. I haven't got really far into this yet, but so far it looks pretty cool. And the specific thing that I like the most in this so far is this browser-context HTTP requests feature, which does everything that I just talked about for you. So you don't need to worry about getting the right cookies, finding them, extracting them, making sure the formats are correct to load into your specific HTTP client.
It's all handled automatically for you, and it talks about it down here. Now, I put together a working example of this, which I'll show you in just a second. But there were a couple of other things that I thought were pretty interesting, which are kind of standard on the other ones, but possibly not as easy to do. Here we have monitoring network events, which means we could start to build up a tool where you could just give it a URL and get a lot of information very, very quickly about the site.
So, you can save time when you're figuring out how you want to scrape it. If you've watched any of my last videos, you'll know that a lot of what I do is going through the network tab, finding the XHR requests, finding which ones have the right JSON, and then manipulating those going forward. I could see the use case for this of how you can build something up that does that for you very, very quickly and easily. Another thing to note is the Cloudflare Turnstile interaction. There's a lot of warnings on this page about this, and I haven't tried this yet.
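The network-tab workflow Rooney describes can be automated with any CDP-capable tool. A hedged sketch using Playwright's response event (Pydoll exposes a similar network-events feature, but its exact method names aren't shown in the video; the URL and helper names here are illustrative):

```python
# Sketch: while a page loads, collect the XHR/fetch responses that return
# JSON, since those usually point at the site's real API endpoints.

def is_json_api_response(resource_type: str, content_type: str) -> bool:
    """Keep only XHR/fetch responses whose body is JSON."""
    return resource_type in ("xhr", "fetch") and "application/json" in content_type

def log_api_calls(url: str) -> list:
    """Load a page and return the URLs of JSON API responses it fired."""
    # Deferred import so the pure helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    found = []
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on(
            "response",
            lambda r: found.append(r.url)
            if is_json_api_response(
                r.request.resource_type, r.headers.get("content-type", "")
            )
            else None,
        )
        page.goto(url, wait_until="networkidle")
        browser.close()
    return found

if __name__ == "__main__":
    for endpoint in log_api_calls("https://example.com"):
        print(endpoint)
```

This is the "give it a URL, get the endpoints back" tool idea from the video in its simplest form.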
I haven't had this issue so far. It does just move the mouse for you and click, but again, it's a nice built-in thing to have, saving you having to do it yourself. And it talks about browser fingerprinting and behavioral patterns and IP reputation, which is very, very important. That's sort of what I alluded to at the beginning. So, let's go ahead and look at the code that I've got here. Very simple. Pulled it mostly from the example with a couple of notable inclusions. For me, I needed to specify where my Chromium or Chrome browser binary was because of the way it's all installed on my system.
It didn't automatically find it. Although, it's worth noting that this /usr/bin/chromium-browser is exactly this browser; it's running from the same binary. So that's kind of cool, which is the way that this all connects through CDP rather than a WebDriver or any other method, which is the more modern way to do so. Disable notifications and then, more importantly, disable these Blink features of automation controlled. There's a lot of giveaways and a lot of telltale signs when you're controlling a browser automatically, either through WebDriver or through the CDP connection.
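The launch tweaks described here map to standard Chromium command-line flags, whichever automation library passes them. A minimal sketch (the binary path is the one mentioned in the video; the helper name and the Playwright usage note are illustrative, not Pydoll's own options object):

```python
# Chromium flags that remove some obvious automation giveaways.
# --disable-blink-features=AutomationControlled stops Blink from setting
# navigator.webdriver, one of the most common telltale signs.
STEALTH_ARGS = [
    "--disable-blink-features=AutomationControlled",
    "--disable-notifications",
]

def build_launch_kwargs(binary_path=None):
    """Assemble launch options; binary_path overrides auto-detection
    (e.g. /usr/bin/chromium-browser on many Linux installs)."""
    kwargs = {"args": list(STEALTH_ARGS)}
    if binary_path:
        kwargs["executable_path"] = binary_path
    return kwargs

# With Playwright this would be used roughly as:
#   p.chromium.launch(**build_launch_kwargs("/usr/bin/chromium-browser"))
```

Pydoll takes the same Chromium flags through its own options mechanism; the flags, not the wrapper, are what matter.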
And this helps you just remove some of those more obvious ones. Then this code is going to load up the browser, go to the page, pass all of the tests, the Cloudflare test, and then it's going to just request the same page and give me the text back. So if we close this up and run it with uv, we should see it open up here, and hopefully this time we get through. We do; the page is loaded, and here is all the information from a subsequent HTTP request. We can see that we got actual page data here rather than another page saying, hey, just a moment, that we're used to seeing.
So what could we actually do with this then? Well, if we start to think in the ways of sessions and IPs and everything like that, we could quite straightforwardly go ahead and make one request and then make maybe 10, 15, 20 requests, depending on the site, utilizing the HTTP client and the same cookies and the same kind of session. This is going to make your life much, much easier. And whilst this wasn't a new concept, this particular library, Pydoll, I thought handled it very neatly, and it was really cool to see it being addressed as something that would be useful and wanted by us.
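The one-browser-pass-then-many-requests pattern can be sketched with the `requests` library; the cookie value and User-Agent string are placeholders, and matching the UA to the browser that earned the clearance is an assumption based on the fingerprint-consistency point made in the video:

```python
import requests

def make_session(cookies: dict, user_agent: str) -> requests.Session:
    """Build an HTTP session that mirrors the browser's identity.
    The User-Agent should match the browser that earned cf_clearance,
    since Cloudflare ties the cookie to the client fingerprint."""
    s = requests.Session()
    s.cookies.update(cookies)
    s.headers["User-Agent"] = user_agent
    return s

# One browser pass yields the cookies; then many cheap requests reuse them:
session = make_session(
    {"cf_clearance": "abc123"},  # from the browser step (placeholder value)
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",  # placeholder UA
)
# for url in product_urls:
#     response = session.get(url)
```

A `Session` persists cookies and headers across every call, which is exactly the "same cookies, same kind of session" reuse described above.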
So that's what I wanted to show you in this one. Let me know if you've tried this library and what you think of it. Are there obvious things that I'm missing? I'm really interested to hear your feedback. And yeah, thanks a lot for watching, and I'll see you again soon.