How I Use Scrapy and Proxies to Scrape Geo-Restricted Sites
Demonstrates inspecting the page source and network data to find a JSON schema that contains vehicle details.
John shows a lean, script-based Scrapy approach using US proxies and JSON-LD extraction to scrape geo-restricted car data quickly and efficiently.
Summary
John Watson Rooney demonstrates a practical, script-first Scrapy workflow for scraping US-only data from a geo-restricted car site. He starts by inspecting the page source and uncovering a JSON-LD payload that includes a vehicle URL and a rich set of vehicle details. Rather than spin up a full Scrapy project, he builds a one-off crawler using Scrapy's CrawlerProcess and a custom settings dictionary, leveraging scrapy-impersonate to mimic a Firefox TLS fingerprint and its download handlers for robust downloading. He pulls US proxies from his environment and uses a JSON-LD extractor library to pull data from the script tag with the next-data ID. The spider constructs the starting URLs, fetches the embedded JSON, extracts the vehicle URLs, and then visits those pages to harvest the detailed vehicle data. The result is exported as JSON Lines for easy post-processing. He notes the run produced 63 requests, 63 responses, and 60 scraped items in about 17.5 seconds, highlighting Scrapy's asynchronous efficiency. John hints at expanding this concept in a future video and teases a larger Scrapy project next. Overall, this is a compact, real-world example of combining JSON-LD parsing with a lightweight Scrapy script to handle geo-restricted scraping tasks.
Key Takeaways
- Using JSON-LD within the page source is a reliable starting point for extracting structured data from dynamic pages.
- Scrapy can be run as a one-off script with a crawler process, avoiding the overhead of a full Scrapy project.
- Scrapy Impersonate with Firefox can improve reliability when facing TLS fingerprinting and anti-bot measures.
- Pulling US proxies from the local environment ensures Geo-restricted sites are accessed from compliant IPs.
- JSON Lines export makes downstream data processing simpler than a single large JSON document.
- Leveraging the JSON-LD extractor library simplifies pulling structured data from embedded JSON (JSON-LD) blocks.
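The JSON Lines point from the takeaways can be illustrated with a small stdlib-only sketch (file name and record fields are illustrative, not from the video):

```python
import json

# JSON Lines: one JSON object per line, so each record can be
# read, filtered, or streamed independently of the others.
records = [
    {"vin": "1ABC", "model": "Truck A"},
    {"vin": "2DEF", "model": "Truck B"},
]

# Write: one json.dumps per line.
with open("cars.jl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read back: parse line by line, no giant array to load at once.
with open("cars.jl") as f:
    loaded = [json.loads(line) for line in f]
```

Because each line stands alone, a downstream script can also process a huge file without ever holding all of it in memory.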
Who Is This For?
This is essential viewing for developers who want a fast, script-based Scrapy workflow to handle geo-restricted sites, especially those who prefer lightweight scripts over full Scrapy projects.
Notable Quotes
"There’s a load of information here, the most important one being the vehicle URL, the URL to the page that the rest of that information is on."
—Shows how the initial JSON data provides the core URL needed to access detailed data.
"I’m going to open it up here… there’s this load of JSON data tucked away under this script ID tag next data."
—Describes how the JSON data is found in the page source for extraction.
"We know that this is going to have a load of information in, probably more than we actually need."
—Notes that the embedded JSON contains rich data, not all of which is required for the scrape.
Questions This Video Answers
- How can I scrape geo-restricted sites with Scrapy using proxies?
- What is JSON-LD and how do you extract data from JSON-LD blocks in web pages?
- Can Scrapy be run as a standalone script without a full project, and how?
- What are the advantages of exporting Scrapy data as JSON Lines?
- Which proxies and TLS fingerprint methods work best for scraping car listing sites?
Tags: Scrapy, Scrapy Impersonate, JSON-LD, JSON-LD extractor, proxy rotation, geo-restricted scraping, Python scripting, web scraping best practices
Full Transcript
There were a few interesting things about scraping this site that I wanted to share with you, and how I went about it, a slightly different method to my usual. Now, the first thing was that this is a US site, so we needed to use US IPs only. And the second was that although there's a lot of information available for each of these products, obviously, because they are cars, there's lots of data, it was actually very easy to extract, and I want to show you how I did that. This is kind of a combination of a few different methods: we're working with Scrapy, but inside a script using the CrawlerProcess rather than a project.
So let's have a look at the site initially. I've pulled up the trucks page. Now, the first thing I normally do is go to inspect, check the network tab and see what's going on there, or check out view source, and that's where I'm going to go first. And there's a reason for that: I was pretty confident about what I was going to find when I looked, and it is indeed there. We have this schema for all of the information for these trucks, well, for the first part. So if I copy this one out, for example, and we put it into a JSON parser, we can see this is of course valid JSON.
And there's a lot of information here, the most important one being the vehicle URL, the URL to the page that the rest of that information is on. So let's go and have a look at that URL. I'm going to open it up here. And because I found a load of information in that view source, that's where I'm going straight away. Now, obviously there's no schema on this page, but what there is, all the way down here, is this load of JSON data tucked away under this script tag with the ID of next data.
And this is relatively common and is one of the places I always look first when I'm trying to decide how I'm going to scrape a site. So I always check the network tab and then I check here. We know that this is going to have a load of information in it, probably more than we actually need. But if we go ahead and copy this all out and put it in our JSON parser, we'll see, again, valid JSON. And down here somewhere, if we look closely, we'll find that there is a vehicle details object, and it has a load of information, all these features, beyond what you could even want to know about this vehicle, I suppose.
But it's all there, and that's because, you know, it's all on the site. It's got to be there; it's got to be put there somehow. And this is how it's done in this instance. So now we know how the site behaves, let's go ahead and create a new project. What we're going to do is use, as I said, Scrapy, but we're not going to use a Scrapy project. We're going to use the CrawlerProcess and make it a one-off script. Now, this is very similar to writing a standard spider, but there are a couple of differences.
The first one being that we need to add any settings that we want directly to our spider class. So I'm going to have this custom settings dictionary here. The first setting is user agent set to None, and that's because I'm using scrapy-impersonate to give me a browser-like TLS fingerprint, which means I don't want to be sending a different user agent with the requests and, you know, confusing it all. I want to use the one that comes with the curl_cffi impersonate request. The next is the download handlers, and this tells the spider that we want to use the scrapy-impersonate handlers to do the downloading, which means we can have that meta dictionary, the meta information, added to each request.
And the third one is the Twisted reactor: make sure we're using the asyncio one. I think this is the default these days, but, you know, I'm going to leave this in because that's what the scrapy-impersonate docs said to do. So now we've got that set up, we just need to write the rest of the spider as we would do normally. And this is where Scrapy becomes really powerful, because we have access to a load of methods, already written, that do exactly what we would want them to do, because this is a web scraping library.
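The three settings he describes map onto a spider-level dict roughly like this; the handler and reactor paths follow the scrapy-impersonate README, but treat the exact values as an assumption rather than a transcript of his code:

```python
# Spider-level settings for the one-off script described above.
# USER_AGENT is None so curl_cffi's impersonated headers are used
# instead of Scrapy's default; the download handlers route all
# requests through scrapy-impersonate.
custom_settings = {
    "USER_AGENT": None,
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_impersonate.ImpersonateDownloadHandler",
        "https": "scrapy_impersonate.ImpersonateDownloadHandler",
    },
    # Explicit asyncio reactor, as the scrapy-impersonate docs suggest.
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
}
```

In a one-off script this dict sits directly on the spider class as `custom_settings`, which is exactly the difference from a project, where these would live in settings.py.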
So we have start_requests. What I chose to do here was create a simple range to build the URLs to start with. This meant I could control how many pages we went through when I was looking at pulling the data. And if we wanted to, you know, expand it to every single page, we could also do that just by looking at the pagination. So I'm going to do one to four. We're going to yield our Scrapy request out, and this is going to have a few bits of extra information in it.
There's the URL, obviously, which is going to hold that page number. dont_filter is equal to true; I don't know if this is totally necessary, but I'm having it in there. But this is what we really need to look at: the meta dictionary that gets added along, which tells the request a little bit more about what it needs to do. The first is the impersonate part, which is Firefox 135 in this instance. If you're using an impersonate library like this, I would recommend using Firefox. I seem to be having a few more issues with Chrome being recognized and blocked.
I think Firefox gets through a little bit better, probably due to popularity. Either way, then we have our proxy. And obviously I pull all my proxies from my current environment, so I'm using os.environ.get and the US proxy variable, because, you know, that's stored on my machine as US IPs, which is what we want. Then we need to do some parsing. We need to actually pull the information from the schema inside the script tags on this page, which is going to involve a different library, a JSON-LD extractor, which I've got up here.
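A sketch of how those request arguments might be assembled; the env var name US_PROXY, the fallback proxy string, and the page-URL pattern are all assumptions for illustration, since the video doesn't show them:

```python
import os

def build_request_args(pages):
    """Build kwargs for one listing-page request per page number."""
    # Proxy pulled from the environment, as in the video; the
    # variable name and fallback value here are hypothetical.
    proxy = os.environ.get("US_PROXY", "http://user:pass@us-proxy.example:8000")
    args = []
    for page in range(1, pages + 1):
        args.append({
            "url": f"https://dealer.example/trucks?page={page}",  # placeholder URL
            "dont_filter": True,
            "meta": {
                "impersonate": "firefox135",  # TLS fingerprint via scrapy-impersonate
                "proxy": proxy,               # US exit IP for the geo restriction
            },
        })
    return args
```

In the actual spider each dict would become `yield scrapy.Request(**kwargs, callback=self.parse)` inside start_requests.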
This is another one by the same people that wrote Scrapy. It just gives you a really easy way to pull the information from the schema, the application/ld+json script tag rather. It just makes your life much easier, so I recommend using it. We just pass it the text, and then we have all of those schemas stored in this data variable. So I want to go through and pull out the car URL for each one of those, which is where I'm just using some list comprehension in Python against the dictionary, the URL here in the data.
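He hands this off to a JSON-LD extractor library; a dependency-free sketch of the same idea, with a toy HTML snippet and an assumed vehicleUrl key name (the real schema's key isn't shown):

```python
import json
import re

# Toy page source with two JSON-LD schema blocks, like the trucks
# listing page in the video.
HTML = """
<script type="application/ld+json">
{"@type": "Vehicle", "vehicleUrl": "https://dealer.example/truck/1"}
</script>
<script type="application/ld+json">
{"@type": "Vehicle", "vehicleUrl": "https://dealer.example/truck/2"}
</script>
"""

# Pull every application/ld+json block and parse it into a dict.
pattern = r'<script type="application/ld\+json">(.*?)</script>'
data = [json.loads(m) for m in re.findall(pattern, HTML, re.DOTALL)]

# List comprehension against the parsed dicts, as in the video.
car_urls = [item["vehicleUrl"] for item in data]
```

A dedicated extractor library does the same job more robustly (handling attribute order, nesting, and invalid blocks), which is why he recommends it over hand-rolled parsing like this.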
Once I've got that, I can loop through those URLs and create a request for each one. And again, this is, you know, fairly standard, just Scrapy 101, but we're going to give it the URL, and then we're going to basically do the same thing: we want the same meta here as we had before. Now, obviously, this is repeating ourselves. But if this was within a Scrapy project, something that was, you know, probably going to be reused more often or expanded upon, we would do this slightly differently, because we'd want to reuse it.
But in this case, this is just a simple script; we're only writing 50 or 60 lines of code to pull as much data as we can. Now, with this one, we do want a different callback: we want to call back not to the parse function, but to a new parse function so we can get the car data out. I'm going to write that down here. You know, all I'm doing is asking the HTML parser to find the script tag with the ID of next data and then get me the text for it.
And that had all the vehicle details in it. Then I'm just going to yield that out as a new dictionary. This is the information that we're going to yield out, and you'll see that once we actually save it. But that leads us on to the next thing: we need to be able to, you know, run this Scrapy spider. And because we don't have a project, we can't just do scrapy crawl; we're running as a script. So we want to use the CrawlerProcess to do so, and this is just going to have a few settings in it.
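The second callback boils down to finding the next-data script tag and parsing its JSON. A minimal stdlib sketch, assuming the conventional __NEXT_DATA__ id and a hypothetical path to the vehicle details (the real key names under props aren't shown in the video):

```python
import json
import re

# Toy detail-page source with the embedded next-data payload.
PAGE = """
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"vehicleDetails": {"vin": "ABC123", "engine": "6.4L V8"}}}}
</script>
"""

def parse_car(html):
    """Find the next-data script tag, parse it, and return the vehicle details."""
    m = re.search(r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL)
    payload = json.loads(m.group(1))
    # Path into the payload is an assumption; on the real site you'd
    # inspect the parsed JSON to find where vehicle details live.
    return payload["props"]["pageProps"]["vehicleDetails"]
```

In the spider this would be the new callback, yielding the returned dict (plus the response URL or anything else useful) as the scraped item.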
Mainly about the feed exporting. We want to basically tell it to export to this file as JSON Lines, which I think is generally better to use than JSON itself; it's easier to work with because everything is separate. And that's it. So once I've got that set up, all we need to do is call process.crawl and process.start for our spider, and then this will run as a script. We don't need to worry about a full project. So now if we run this, we're going to get our full Scrapy output here.
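The runner he describes looks roughly like this; the FEEDS keys follow Scrapy's feed-export settings, while the output filename and spider name are placeholders:

```python
# Feed-export settings: write scraped items as JSON Lines.
settings = {
    "FEEDS": {
        "cars.jl": {"format": "jsonlines", "overwrite": True},
    },
}

# With Scrapy installed, the end of the script would look like:
#
#   from scrapy.crawler import CrawlerProcess
#
#   process = CrawlerProcess(settings=settings)
#   process.crawl(CarSpider)  # CarSpider: the spider class defined above
#   process.start()
#
# which replaces the usual `scrapy crawl` command a project would use.
```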
It looks like any other Scrapy project that's going on; we can see the data coming by. But we are just running it from that main.py script. It gives you extra flexibility when you want to, you know, just write something small and quick to get some data rather than having a whole project. And as long as you include the settings that you need, and whatever else, it works just fine. So let's have a look here. We had 63 requests and 63 responses in 17 and a half seconds; again, asynchronous, given to us essentially for free. Brilliant. And we had 60 items scraped.
So let's have a look at those. And we can see here there's loads and loads of data. This has all of this device stuff, which I didn't filter out, but we do have the vehicle details here. So if I just look at this portion, we would probably want to chop the rest out really, but we can have all of this information, and there's even the VIN numbers and everything. So everything that is on that page that you could possibly want to see is down here, and it tells you everything about it.
Okay, I definitely need to get a truck with a V8 Hemi 6.4 L. I guess that would probably bankrupt us in the UK or in Europe, let alone fit on our roads. Cool nonetheless. So, you get the idea. Now, what I want to do next in an upcoming video is take this concept but actually expand it and make it run better and more often; we're going to look at scraping more data. If you want to see me do something like that now without waiting, which you would just subscribe for, by the way, you want to watch this video here next, where I use Scrapy for a bigger