How I found the easiest way to scrape this site
A hands-on walkthrough of bypassing basic anti-bot signals and using an Algolia-backed API to scrape product data efficiently.
Summary
John Watson Rooney walks through a practical scraping workflow for an e-commerce site. He starts by inspecting network traffic to identify anti-bot measures, cookies, and potential blockers. He discovers that Cloudflare cookies reappear after deletion but that the data source for products is loaded via a JavaScript request, not directly from the HTML. The key breakthrough comes from spotting an offsite request that carries the product IDs (PIDs) and enables direct API access. By extracting the request body, Rooney shows how to use Bruno to test parameters and then switch to a coded approach that hits an Algolia-backed API endpoint with a search-only API key. He demonstrates adjusting the hits per page and page number to pull a full category (96 results per page, 13 pages total) and confirms the data fields—name, price, ID, and image—for all 96 items in each batch. Finally, he highlights the scalability takeaway: many sites rely on similar backend indices, and with careful testing and rotation of IPs, data extraction can be done quickly and at low cost. Rooney stresses that understanding the backend often reveals a fast path to scraping when protections are lightweight or misconfigured.
Key Takeaways
- Inspecting network traffic and cookies is essential to identify anti-bot measures and where data actually comes from on an e-commerce site.
- The product data is loaded via an offsite JSON/API call rather than being embedded in the HTML, indicated by a placeholder div and a PID (product identifier).
- An exposed API key in the page’s requests can reveal a working search API; Rooney demonstrates using the Algolia-based endpoint with a 'search only' API key.
- Using Bruno to model requests, Rooney shows that 13 pages at 96 hits per page cover the category's 1,226 products, with fields like name, price, and ID available in the response.
- Scaling the approach involves testing across categories, adjusting the API request parameters, and rotating IPs to minimize blocking while fetching large datasets.
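The request counts above follow directly from the totals read off the API response: a category of 1,226 items at 96 hits per page needs only 13 requests. A minimal sketch of that arithmetic:

```python
import math

def pages_needed(total_hits: int, hits_per_page: int) -> int:
    """Number of paginated API requests required to cover every hit."""
    return math.ceil(total_hits / hits_per_page)

# The category in the video reports 1,226 items; at 96 hits per page
# that works out to 13 requests.
print(pages_needed(1226, 96))  # 13
```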
Who Is This For?
Essential viewing for developers exploring web scraping of e-commerce sites, especially those who want to understand how backend indices and lightweight protections can be leveraged for fast data extraction.
Notable Quotes
"There should be some information coming back here. So, we don't need to use a browser because that will be our last resort."
—Identifies that direct browser scraping is a last resort once JSON data loads via API.
"The P here is kind of important because that is like the skew or the UPC. So it's the product identifier for that specific site."
—Explains the role of the PID in constructing API requests.
"This is actually more common than you might think, and it brings up this kind of conversation of protection against scraping..."
—Rooney notes that backend protections vary and expose opportunities when understood.
"If we change the hits per page to 96 and iterate across 13 pages, we can pull all 1,200+ products from a category."
—Demonstrates the concrete data pull plan and expected total items.
"The API key is exposed in a search-only API key, which could let third parties scrape everything from your indices."
—Highlights a security find about potentially exposed API keys.
Questions This Video Answers
- How can I locate the data source behind an e-commerce page's product listings?
- What is an Algolia search-only API key and why does it matter for scraping?
- How do I determine if product data is loaded via JavaScript rather than embedded HTML?
- What are practical steps to scale scraping across multiple categories with minimal blocking?
- What role do IP rotation and request parameters play in scraping large product catalogs?
Web Scraping, Algolia API, Cloudflare Cookies, Bruno (tool), API endpoints, Product IDs (PIDs), Offsite data loading, JSON API responses, IP rotation, E-commerce data extraction
Full Transcript
Okay, so I want to show you my method for scraping this ecom site. There were a few cool things going on, something that I learned as well. So, let me just show you my process. The first thing I'll always do is open up inspect and hit network because I want to know what uh anti-bot protection or anything's going on here. Uh I'll do persist logs as well. And we'll hit reload. Um there's a lot of things kind of came up that I was seeing here. Uh this was some kind of like CDN performance thing, so kind of irrelevant.
Um, but I didn't see any Cloudflare offsite or anything like that for challenges. Some like location stuff, etc., etc., like you would expect. And then some other things that, uh, got blocked. Um, not a whole lot going on really, uh, which was interesting for me. Um, because, you know, when I went to storage and had a look at the cookies, we can see we have the standard kind of Cloudflare cf_clearance. And if I delete all of these and all of them and refresh the page, we'll see them all come back again. Um, so I had to keep that in the back of my mind.
Although, as you'll see, not totally relevant. The next thing I like to do is find out where the data is actually coming from. And to do that on an ecom site, I usually go for a category. Um, there's generally more available. There's lots of different ones so you can kind of understand what's working behind the scenes and what's actually going on. uh and it gives you a little bit more of a depth into the hierarchy of the site uh as opposed to just trying to find everything or narrowing down straight on to one single product.
So I'm under this category here. We can see 1,200 and something items. Um what I'll do now is I would always have a look at the source because I want to know what that's looking like should I need to request this page directly through uh requests or Scrapy or arnet or something like that. So I'm going to search for this product and we can see it's bumper fender. So if we come over here and search, we don't have anything. Um, one thing you'll see when we go to inspect um is that it shows us that um there is somewhere up here uh data P.
So that suggests to me um that this is being requested by JavaScript offsite somewhere. And in fact, if we come over here and search for PID, we can see that we have a placeholder uh with a div class of product. So requesting this page directly won't work. Um but there should be some information coming back here. So we don't need to use a browser because that will be our last resort. So the P here is kind of important because that is like the SKU or the UPC. So it's the product identifier for that specific site.
It's quite universal. So, it's always a good thing to search for and as we can see when we look at this uh they're everywhere in here and that's how we will know and I'll be able to identify the different products. So, from here I'll go to the network tab again and I'm going to refresh and load up this uh page here with the with these on. Um and I want to really look through here and kind of understand now there's a couple of requests that jump out directly to me straight away. We can see them here.
Um this was interesting. I don't know why it had placeholder amongst the actual products and I don't know why there's two of them that are seemingly almost exactly the same. Um it was interesting. I'm not entirely sure. Uh but it is what it is. And in here we can see here's where the placeholders were. We can see we do have product information that's come back and it'll be this that is being loaded into those uh divs when the page gets rendered in a browser. So we know that this is here. So, first I went down the route of copying this out into uh Bruno.
And here you can see um I actually changed the uh PIDs. Um we can see that you know there was all of these different ones. I just set it to one to see if it would be responsive to me. Uh and you can see that it is. We get one product back which is a really good sign which means you know it's working how kind of like I would expect it to. There's no body. And we can see here that I've already removed all the headers including the cookie and it still works.
So despite being, you know, here's some Cloudflare cookies in here. Despite these being here, they don't seem to affect this endpoint on the API. Um, which was interesting. So this was a really good sign for me. It kind of like to me it meant, hey, look, I can find the product IDs and I can load them in here and I can just get the data back this way. This was really interesting to me because not only were there Cloudflare cookies, but we didn't actually need them to access the API that the page was using to call to fill out the product information.
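The "remove headers one by one" workflow described here can be sketched as a small helper that strips cookies and other browser-only headers from a captured request before replaying it, so you can see which ones the endpoint actually requires. The exact set of headers to drop is an assumption for illustration:

```python
def minimal_headers(captured: dict) -> dict:
    """Drop cookies and browser-fingerprint headers from a captured request,
    to test whether the API endpoint actually needs them (often it doesn't)."""
    drop = {"cookie", "referer", "sec-ch-ua", "sec-fetch-site", "sec-fetch-mode"}
    return {k: v for k, v in captured.items() if k.lower() not in drop}

# Example: the Cloudflare cf_clearance cookie is removed, content headers stay.
headers = minimal_headers({
    "Cookie": "cf_clearance=abc123",
    "Accept": "application/json",
    "Content-Type": "application/json",
})
```

Replay the request with the reduced header set; if the response is unchanged, the protection does not apply to that endpoint.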
Now, this is actually more common than you might think, and it brings up this kind of conversation of, you know, the protection against scraping and bots on some sites is very, very strong, but on others it's much less. And I think this kind of shows when you start to scale out and when you look at things at a bigger picture that a lot of sites like specifically this one, once you understand what's going on, you'll be able to access the data very quickly and easily at very low cost. But unfortunately, I didn't find a really good way to actually find those product IDs.
I went through and I had a look at uh inside the page source again to see if I could find any, but they weren't in here. So uh back to inspect, and this is where I realized um that I actually missed something. So uh let's just start going through different pages and you can see we get two requests every time. Let's do it again. Go to page three. Two requests. Now, I actually initially overlooked this one because I'm looking at the domain and it's something offsite, which to me suggests, you know, I thought maybe it was um like uh advertising or something else like, you know, some kind of analytics or something.
If we click on it here and we go to the response, we have three things. Interestingly enough, these top two seem to be the same and they have the products in, including the PIDs, but it also has all of the product information in as well. Um, what I did here is I saw that there was a form data request. So, I just copied this out and pasted it in here. And we can see that we do have, you know, three uh bits of JSON that are being sent under the request here, which is what shows here.
And that determines what comes back in the response. Uh you might have noticed this before and I didn't realize this right away, but under the post request, we actually have this here, the API key. Now, I wasn't sure whether this was supposed to be here or not. So, I did a bit of googling and I found this on their website and it says here's a search-only API key, um, which lets your application search through its indices, and it says it's safe to use in your production front end code. And I was like, okay, maybe that means somebody has accidentally exposed it.
Uh but no, you can see down here when using this your search only API key is exposed and third parties could use it to uh scrape everything from your indices. Um, so I would oblige. Uh, this was interesting. So that kind of like led me away from the website itself and just onto this and I can look at the actual response and everything. So I took this request and put it into um, Bruno itself. We can see I've removed the agent params because it was long and horrible. But we need these other two. And the body here.
All I did was mess around with the parameters and remove the ones that I didn't need. Uh, and we can see here hits per page. I changed this to 96 instead of the default. And here is the page number. Uh, this also has the category. So, you know, if you wanted to do a different category, it would just be a case of finding this request there and we'd be able to change this up. Um, I looked at the headers as well and again I don't need them because I have the API key. So it knows it's an authenticated request.
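The parameter tweaks described here — bumping hits per page to 96, stepping the page number, and swapping the category — can be sketched as a helper that builds an Algolia-style multi-query request body. The index name ("products") and the category filter format are assumptions for illustration; the real values come from the captured request in Bruno:

```python
import json
from urllib.parse import urlencode

def build_algolia_body(category: str, page: int, hits_per_page: int = 96) -> str:
    """Body for an Algolia-style search POST. Index and filter names are
    placeholders; copy the real ones from the captured browser request."""
    params = urlencode({
        "hitsPerPage": hits_per_page,   # default is lower; 96 cuts request count
        "page": page,                   # zero-based page index
        "facetFilters": json.dumps([[f"category:{category}"]]),
    })
    return json.dumps({"requests": [{"indexName": "products", "params": params}]})

body = build_algolia_body("bumpers", page=0)
```

POSTing that body with the search-only API key (as a query parameter or header, matching the captured request) is all the endpoint needs; no cookies required.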
If I just change this to the preview, we can see we have 96 hits as they're called here with all the information, you know, price, price data, all this stuff here. Uh, and there's images and the name, etc. Um, which was good to see. And I think one of these is probably our Oh, I thought it might have been our bumper fender, but it's not. Doesn't matter. Uh, but we can see, you know, 1226, which was the number that we were expecting over here. 1226. So, to me, this was exactly what we were after.
All I did from here was just change it into uh some code just to get a proof of concept working. I sent this through a residential proxy just because and gone through hole here and I've changed the page number and I put it into an F string over here somewhere. Yeah. And it was just a case of running this. I left the headers in here, but as we saw that they weren't actually necessary. And I'm going to do is just run through all of these and print the name, the ID, and the price just to check that it works.
And this is what I got back. And we can see, you know, name, price, ID, etc., etc. for all of the 96 products. So all we would need to do is check out the actual um results returned and we can see number of pages is 13. So 13 times 96, uh, 1,200 and something total. 13 requests to the website to get all of that product information back. That was pretty much it for this one. It was quite an interesting find for me understanding the Algolia backend thing with the API key, which I'm pretty sure is probably more common than you think.
But um from here to scale this up I would just need to keep testing to understand, you know, finding all from different categories: how many requests are we going to need to make to get the actual data that we want, given the fact that we can up that number in the API request that we're making, and then just, you know, maybe rotate through some IPs to see where it's at and then try and find its failure points. But as I said earlier, sites like this, once you get to this point and understand what you're looking at, they can be scraped pretty quickly and pretty easily at very minimal cost.
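The proof-of-concept loop described in the transcript — read the page count from the first response, then walk every page pulling name, price, and ID — can be sketched with the HTTP call injected as a callable, so the pagination logic stays separate from whatever client, proxy, or IP rotation you use. `nbPages`, `hits`, and `objectID` are standard Algolia response fields; `name` and `price` are the fields Rooney prints and may differ per index:

```python
from typing import Callable, Dict, Iterator, List

def iter_products(fetch_page: Callable[[int], dict]) -> Iterator[Dict]:
    """Yield name/price/id for every hit across all pages.

    `fetch_page(page)` performs the actual POST (e.g. via requests through a
    residential proxy, as in the video) and returns the decoded JSON response.
    """
    first = fetch_page(0)
    for page in range(first["nbPages"]):          # 13 pages in the video's category
        data = first if page == 0 else fetch_page(page)
        for hit in data["hits"]:                  # 96 hits per page
            yield {"name": hit["name"], "price": hit["price"], "id": hit["objectID"]}

def scrape_category(fetch_page: Callable[[int], dict]) -> List[Dict]:
    return list(iter_products(fetch_page))
```

Because the fetcher is injected, swapping in a proxied session or a rate-limited client later doesn't touch the extraction logic.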