Scraping these sites is easy (once you know how)
Use browser inspect tools to monitor network activity, find the vehicle data API and confirm a JSON response that contains the car information.
Master the basics of scraping a protected API by using network sniffing, CSRF/referrer handling, and lightweight Python tooling to grab car data efficiently.
Summary
John Watson Rooney walks through a practical guide to scraping a site with an API-driven car listing. He starts in the browser, using Inspect/Network to spot a vehicle API with a list endpoint and a quotes endpoint, then explains why the CSRF token and a proper Referer header matter for requests. By experimenting with page and size parameters, he shows how to verify that the API returns the expected vehicle data. He demonstrates how removing certain headers affects responses and why cookies alone aren't sufficient without a valid referrer. The core insight is that you can programmatically regenerate the necessary CSRF token and cookie via curl_cffi (impersonating Firefox) to access any page or size, while keeping requests lightweight and proxied if desired. Rooney finally distills the approach into a compact Python loop (roughly 26 lines) that iterates through pages until all 13,000 cars are fetched, noting that most sites fall into an 80/20 split of scraping difficulty. This makes the technique accessible for daily data pulls without heavy browser automation.
Key Takeaways
- Identify the vehicle API endpoint (the 'list' response) in the network tab to target the data you need.
- CSRF tokens are required in headers, not just cookies, and you must include the correct Referer URL to succeed.
- Changing query parameters like page, size, and payment type helps map the API's capabilities and pagination.
- Using curl_cffi to impersonate Firefox can generate the necessary tokens and cookies without a full browser session.
- A compact Python loop (about 26 lines) can iterate over all pages (e.g., 261 requests for 13,000 cars) with proper token handling.
- In practice, a simple proxy and refreshing the cookies/CSRF token on each request minimizes complexity while preserving access.
- The site’s protection hinges on a token/referrer combo, which Rooney notes is often weaker than expected, enabling straightforward scraping.
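The token/referrer combo above boils down to sending the CSRF token both as a cookie and as its own header, alongside a Referer. A minimal sketch; the header and cookie names here are assumptions for illustration, so verify them against the actual request in your browser's network tab:

```python
# Sketch: build the header set a CSRF-protected API typically expects.
# "X-CSRF-Token" and "csrf_token" are assumed names, not the real site's.

def build_headers(csrf_token: str, referer: str) -> dict:
    """The token must appear as a dedicated header AND inside the cookie,
    and the Referer must point at a valid listing page."""
    return {
        "X-CSRF-Token": csrf_token,            # header copy of the token
        "Cookie": f"csrf_token={csrf_token}",  # cookie copy of the same value
        "Referer": referer,                    # without this, page > 1 fails
    }

headers = build_headers("abc123", "https://example.com/vehicles/search")
```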
Who Is This For?
Essential viewing for developers experimenting with API-based scraping, especially those dealing with CSRF protections and simple pagination. Great for Python scraping teams and anyone curious about how lightweight token handling unlocks protected data.
Notable Quotes
"there is a vehicle API which is exactly what we were expecting to see."
—Rooney identifies the API endpoint in the network tab.
"the most interesting an X CSRF token as a specific header."
—Highlighting that CSRF must be sent in headers, not just cookies.
"it's expecting a referral."
—Explains why the Referer header is needed for token validation.
"just using curl cfi impersonating Firefox to get access does actually generate that data for us."
—Shows a browser-free way to obtain tokens/cookies.
"this the protection on this API is almost non-existent."
—Rooney notes the API’s relatively easy defenses in this example.
Questions This Video Answers
- How do you bypass CSRF protection for a legitimate test scrape?
- What headers are required to access a CSRF-protected API?
- How can curl impersonate a browser to retrieve CSRF tokens?
- How do you paginate through a JSON API with page and size parameters?
- What are practical tips for scraping car listing APIs without a full browser?
Web Scraping · CSRF Token · HTTP Headers · Referrer Header · API Pagination · XHR · Browser DevTools · Python Requests · curl_cffi · Browser Impersonation
Full Transcript
So, in this video, I want to show you how I scrape the data from this page. It's another one where we need to utilize the network request, but there's a couple of interesting things along the way that might be beneficial to you and a few things that I found interesting on the way that the site worked. So, as always, the first thing we need to do is open up our inspect element and head over to network. And I'm just going to hit search on these vehicles. I have an XHR filter and I also have persist logs on, which means, you know, these will stay whenever you change the page.
So we can see right away there is a vehicle API, which is exactly what we were expecting to see. There's a quotes and a list, the list being the one that we are going to be interested in because that's got the car information. If you did want the quotes details, that should be fairly straightforward to get as well. It looks like it's organized by the vehicle reg, and in the request there is a JSON response that seems to have that information in. I haven't interrogated this much; I was only really interested here.
So if we have a look at this request, we can see under the request headers down here that there is this one which is the most interesting: an X-CSRF-Token as a specific header. And also within the cookie this information is the same. And this is kind of an interesting thing for me, and I'll show you as we go forward why that actually matters. But the next thing that I always like to do, you know, is just flick through a few pages and make sure that the information I want is there, and I expect more requests when I do this, just to make sure we kind of get a few more bits of information going.
And we can see right away here we are: page two and page three with that size here. Interestingly, the first one has this payment type. So, if that was interesting to you, it's possible that we would be able to find out what other parameters this API would take. Given the fact that it's page and size and payment type, it's probably fairly easy to figure out any other ones that may work, too. So we just check the extra ones. We can see our response looks good: a different set of vehicles and exactly what we were expecting to find.
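The page/size experiment above maps the API's pagination. A small sketch of what that looks like in code; the `page` and `size` parameter names come from the video, but the base URL and total item count are placeholders:

```python
import math
from urllib.parse import urlencode

# Sketch of mapping the pagination parameters spotted in the network tab.
# The base URL and the total count are placeholders, not the real site's.
BASE = "https://example.com/api/vehicles/list"

def list_url(page: int, size: int) -> str:
    """Build the list-endpoint URL for a given page and page size."""
    return f"{BASE}?{urlencode({'page': page, 'size': size})}"

def pages_needed(total_items: int, size: int) -> int:
    """How many requests a full crawl takes at a given page size."""
    return math.ceil(total_items / size)

print(list_url(2, 50))            # second page, 50 vehicles per page
print(pages_needed(13_000, 50))   # 260 requests at size=50
```

Rooney counts 261 pages, which suggests the real total is slightly over 13,000; the helper above shows the arithmetic either way.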
So from here, what I'll always do is I'll take this request out and I'll put it into Bruno or, you know, whatever API client that you use, and we start to mess around with it. Now, the first thing that I always do in the parameters is set them as low as possible and run it and just make sure that that actually works. And we can see then that when we start to change these params, we are getting back what we were expecting. And I'm just double checking here to make sure that the information about the vehicle that I would be after is in here.
And it is. So from here we want to go over and have a look at the headers. Now you can see right away I've removed all of these ones which are just not relevant. I also removed the user agent, and that didn't have any effect on the results or the outcome. And in fact, if you remove these as well and just leave the cookie, you might think that that's it. That works perfectly: we got the same response back, the same thing we were after. We just need this cookie. But in fact, if you were to change this to page two and, let's say, add in a different size, we get this "a valid CSRF token was not provided".
And that's interesting, because you would have thought, well, here's our CSRF token from the other request and it's in here in the cookie. But we need it in here, so let's add it back in. And we still get the error. That's because when we are going through these pages, it's expecting a referrer. So we need to add this Referer in here, this URL, and then that will let us change it. So you can see I've got page 23 and the pagination here. Now, initially after this, I thought, okay, I'm going to need to use a browser, I think, to be able to generate these tokens, and I want to know how long they would last for.
And the easiest way to check that is to come back to your browser session and go over to storage and find the cookies and the headers. And we can see here's this one here. And if we look over here, it actually tells us that this is going to last for a year. So, you know, once a year I'll have to spin up a browser and generate a new one, perhaps. Maybe. So that's not such a bad thing for me. That looks pretty good here. So, we just kind of need to build something that's going to get this token, find the referrer, and then generate the cookie and then send it over.
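The expiry check he does in devtools Storage can also be done in code: cookies in Python's `http.cookiejar` (which most HTTP clients build on) carry `expires` as a Unix timestamp. A small sketch with fixed example dates:

```python
import time
from datetime import datetime, timezone

# Sketch: compute a cookie's remaining lifetime from its "expires"
# attribute (a Unix timestamp on http.cookiejar Cookie objects) instead
# of eyeballing it in devtools Storage.

def days_until_expiry(expires_epoch, now=None):
    """Days left before a cookie expires; `now` can be fixed for testing."""
    now = time.time() if now is None else now
    return (expires_epoch - now) / 86_400

# Example with fixed dates: a cookie set 2024-01-01, expiring 2025-01-01.
set_at = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()
expires = datetime(2025, 1, 1, tzinfo=timezone.utc).timestamp()
print(round(days_until_expiry(expires, set_at)))  # 366 (2024 is a leap year)
```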
So, the next thing that I would do is export this to some kind of code. I would generally use Python for this, with requests. And we can have a look here, and this is exactly everything that's in here. Now, I've done this already, and this is what I've come up with. So, pretty straightforward. Let me just tidy this up from where I must have deleted a load of stuff while I was messing around. So, we can see here that I am going to use a proxy, although, you know, I guess with the pages and the different sizes that we're going to be working with, for example, here if I look, let's change this up to 50.
So, this is only 261 pages, right? So, assuming that we want to do this every day, that's only 261 requests per day from us, even though I'm going to use a proxy IP. I've got my URL here, and then I'm going to use curl_cffi to generate a session that impersonates Firefox. Now, I think this is where a lot of people might stumble, and they might think that you do need to use a browser to generate that cookie and the CSRF header token. However, in this instance, just using curl_cffi impersonating Firefox to get access does actually generate that data for us.
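The curl_cffi step he describes might look something like this. Assumptions: the package is `curl_cffi` (`pip install curl_cffi`), Firefox impersonation targets are only available in newer releases, and the landing URL and cookie name are placeholders:

```python
# Sketch: obtain the CSRF cookie without a real browser, using curl_cffi's
# browser-impersonating session. All names and URLs are placeholders.

def fetch_csrf_session(landing_url):
    # Imported inside the function so the sketch can be read and imported
    # even where curl_cffi is not installed.
    from curl_cffi import requests as cffi_requests

    session = cffi_requests.Session(impersonate="firefox")
    session.get(landing_url)                   # landing page sets the cookies
    token = session.cookies.get("csrf_token")  # cookie name is an assumption
    return session, token

# Usage (placeholder URL):
# session, token = fetch_csrf_session("https://example.com/vehicles/search")
```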
So, we can make a request here. We can then get the cookies from the session. We can print them out, we can see them, and then we can update the header with that new cookie, which would have happened automatically in your browser. Then we can add in our referer. And that's going to then allow us access to any page and any size of information that we're actually after. So if we come out of this and we run it, we're going to be able to see what comes back. So we can see here, I think I left this on one.
So, we should be able to see one item, but we should still see everything. Yeah. So, here are the headers that were generated, here's our token and our referrer that we added in, and then the results that come back for us here with the car information. We would want to, you know, just go through every single one of these. And this is just the standard kind of JSON response, and then keep going until there's no more pages. And the way that I coded this is that this is going to actually generate a new set of X-CSRF token and everything like that, and cookies, every time we go and make these requests.
So the protection on this API is almost non-existent. They're just relying on this here, which has to be generated. You'd be surprised how often it's like this. I think generally speaking, it's kind of like an 80/20 split. So, you know, 80% would be pretty straightforward: general sites that you want to scrape from, and this is e-commerce, which is kind of like my area of expertise. So 80% is going to be very easy and 20% is actually going to be very, very hard. So it's worth finding out where you stand with the site so you can actually build your code out and then, like, temper your expectations as well.
So, in this instance, to get the information for all of these 13,000 cars, you know, I've got 26 lines of code here. All we need is a simple loop, and we should be golden. 261 requests, nice and easy.
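A loop along the lines he describes, regenerating the session, cookie, and token for every request and walking pages until the API runs dry, might be sketched like this. It is not his exact code: the endpoint, the JSON key, and the cookie/header names are all assumptions to be replaced with what your devtools show:

```python
# Sketch of the ~26-line daily pull: fresh impersonated session per request,
# token copied from the cookie into a header, Referer attached, pages walked
# until an empty batch comes back. Names and URLs are placeholders.

def scrape_all(list_api, landing_url, size=50):
    from curl_cffi import requests as cffi_requests  # optional dependency

    cars, page = [], 1
    while True:
        s = cffi_requests.Session(impersonate="firefox")
        s.get(landing_url)                           # fresh cookies + token
        token = s.cookies.get("csrf_token")          # assumed cookie name
        s.headers.update({
            "X-CSRF-Token": token,                   # token as a header too
            "Referer": landing_url,                  # required for page > 1
        })
        resp = s.get(list_api, params={"page": page, "size": size})
        batch = resp.json().get("vehicles", [])      # assumed JSON key
        if not batch:
            break                                    # no more pages
        cars.extend(batch)
        page += 1
    return cars

# Usage (placeholder URLs):
# cars = scrape_all("https://example.com/api/vehicles/list",
#                   "https://example.com/vehicles/search")
```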