“Requests + BS4” but for 2025

John Watson Rooney | 00:08:53 | Feb 18, 2026
Chapters: 11
Explains why requests/bs4 are no longer sufficient and introduces a move to newer tools and techniques.

Skip old-school requests + BeautifulSoup for modern scraping; switch to async tools, TLS impersonation, and proxies to reduce blocks with 2025-ready Python.

Summary

John Watson Rooney makes a strong case for moving beyond requests and Beautiful Soup for web scraping in 2025. He explains that while requests is fine for APIs, it quickly gets blocked on web pages, and that modern scraping benefits from async IO, proxies, TLS impersonation, and more capable libraries. He walks through a practical example: scraping a shop page, extracting a JSON-LD payload, and discovering product URLs that lead to richer data pages. The video emphasizes building a reusable client with proxies and TLS fingerprint impersonation (using Arnet) to improve access and reduce 403s. Rooney demonstrates async fetching of multiple URLs with a semaphore-based rate limiter, then parsing JSON data from both listing and product pages. He highlights how JSON-LD (ld+json) often holds the data you want, and how to use Select for HTML-to-JSON extraction. The takeaway is that modern Python tooling (async, robust HTTP clients, fingerprinting) can outperform the traditional requests + Beautiful Soup combo in real-world web scraping. If you want a scalable, future-proof approach, this video outlines the workflow and the rationale behind each choice. Rooney teases a deeper, more data-heavy example at the end, inviting viewers to watch another video for a full, real-world scrape.

Key Takeaways

  • Using a reusable HTTP client with built-in proxies and TLS fingerprint impersonation (via Arnet) dramatically improves access to sites behind anti-bot protections.
  • Async IO enables concurrent fetching of dozens of URLs (the project is run with uv, Astral's fast Python package manager and runner), cutting total scraping time to about a minute for the example.
  • JSON-LD (LD+JSON) on product listing pages often contains the data you need; parsing this payload is central to collecting product URLs efficiently.
  • A simple rate limiter based on semaphores helps you throttle requests to avoid blocks without sacrificing throughput.
  • Creating a separate URL extraction flow from listing pages to product pages clarifies responsibilities and makes the code easier to maintain.
  • Select remains a trusted tool for extracting the JSON payload from HTML, with the added benefit of easily locating the LD+JSON script tag.
  • Rooney argues that modern tooling (async, proxies, fingerprinting) should supersede the old Requests + Beautiful Soup approach in many scraping scenarios.
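The semaphore-based throttling mentioned in the takeaways can be sketched with nothing but asyncio. This is not the video's code: the `fetch` coroutine, concurrency limit, and example URLs are illustrative, and `asyncio.sleep` stands in for a real HTTP call.

```python
import asyncio

async def fetch(url: str, limiter: asyncio.Semaphore) -> str:
    # The semaphore caps how many requests are in flight at once;
    # everything else still runs concurrently.
    async with limiter:
        await asyncio.sleep(0.01)  # stand-in for the real awaited HTTP call
        return f"fetched {url}"

async def main() -> list:
    limiter = asyncio.Semaphore(5)  # at most 5 requests in flight
    urls = [f"https://example.com/p/{i}" for i in range(20)]
    # gather preserves input order in its results, even with the throttle.
    return await asyncio.gather(*(fetch(u, limiter) for u in urls))

results = asyncio.run(main())
```

The same pattern scales to hundreds of URLs: throughput stays high because tasks only queue at the semaphore, not behind each other.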

Who Is This For?

Essential viewing for Python developers who want to level up web scraping in 2025, especially those transitioning from requests/Beautiful Soup to async workflows with fingerprinting, proxies, and JSON-LD parsing.

Notable Quotes

"Now, this is run a lot through many different CloudFare sites. So what I'm doing here is I'm finding my proxy in my virtual environment and then I'm adding it and I'm creating a client with arnet."
Introducing Arnet as the chosen client and the role of proxies in bypassing protections.
"If you're not using UV yet, you really should be. It's really good."
Advocating for using asynchronous execution with UV for new projects.
"The impersonate is going to be the TLS fingerprint that we're going to send with all of our requests."
Explains the core concept of TLS fingerprint impersonation to improve access.
"ARNET is my favorite at the moment. It's built in Rust and it's written by a guy who commutes a lot to the Rust community."
Rooney shares his recommendation and rationale for Arnet.
"This is the kind of data you want to get from the JSON-LD tag, and Select has been my go-to passer for a long time."
Describes the data source (JSON-LD) and the parsing tool (Select).

Questions This Video Answers

  • How can I bypass Cloudflare blocks in Python scraping with modern tools?
  • Why should I use async IO for web scraping instead of requests?
  • What is TLS fingerprint impersonation and how does it help scraping in 2025?
  • How do I extract data from JSON-LD in web pages efficiently?
  • What are the best Python libraries for rate-limited parallel scraping in 2025?
Tags: Python Async IO · Web Scraping · Arnet · TLS Fingerprint Impersonation · Proxies · JSON-LD (LD+JSON) · Select (parsing) · Rate Limiting · Cloudflare Avoidance · httpx/async clients
Full Transcript
In the past, it was so common to get recommended requests and Beautiful Soup as a good solution for web scraping. But unfortunately, that just hasn't moved with the times, and it's not the same thing it used to be. Now, don't get me wrong, there's nothing wrong with requests. In fact, if I'm working with an actual structured API that I have a key for, I usually use it. But for web scraping, you're quickly going to get blocked by very simple things which are very easily avoidable if you use these other tools I'm going to talk about. In this video, I'm going to show you what they are, how you can use them, and what they look like with more modern Python, so you can improve your web scrapers and get blocked a lot less. So, I'm John. I've been web scraping for five or six years now. I've scraped millions of rows of data for various people all over the world, and this is one of the methods I would use in the right situation, for the right circumstance. If we take a look at this website here, we can see we have a shop page, and then there's a product page that goes with it. One thing I want to note before we look at the actual code is that within this there is an application/ld+json script which holds a load of the information I'd be very interested in having. What's important about this is that if we go to a JSON parser and paste it in, we see it has a list of elements: a list of products, each with a URL. So now that we know that, we can start to think about how we want to get this data. Because we're going to use proxies and modern Python, we can lean into async, which is something you couldn't do with requests. So if I come over here and activate my virtual environment, and we have a look at this code here, this is a simple, requests-style piece of code.
All I'm doing here is creating a client, creating a session. A lot of people don't do this, and if you're in one of those camps where you don't create clients or sessions, you need to start doing it, because it's so much better and so much easier to use and manage. Anyway, all I'm going to do is try to get a couple of URLs from this site. So I come out of this and run uv run request main. Now, if you're not using uv yet, you really should be. It's really good. uv init for a new project. Fantastic. You can see right over here, we've got 403s: blocked, can't get through. And that's with a good user agent; it's with good proxies. So, it's not going to give. But if I run the main.py file instead, I've got a few shop URLs in here, and we'll see that I'm getting a load of URLs. I'm now going to reach out and get all of these asynchronously, and we're going to get all the data back. I think it's something like 80-something products we're going to get. Now, what I have done here is put a limiter in, because I don't want to do too many requests in one go. But this will run through all of these, and in about a minute or so we'd end up with the data we wanted, which would look something like this. So, we can see that we've scraped all of the information from those products that were on that shop page. This is very useful to us. So let's have a look at the code, and we'll go through and talk about the main points that are the most useful. The first thing we're going to do is create our client. This is a little bit more important than it was before, because we want to bake our proxies in and we want to bake in the impersonate. Now, the impersonate is going to be the TLS fingerprint that we're going to send with all of our requests.
And this is going to give us that little bit of extra credibility when we're making those requests to the site, and it's going to allow us access. Without this TLS fingerprint, we're really going to struggle to get past and avoid those 403s that we saw with requests, because of the way the fingerprinting works. And this is run a lot through many different Cloudflare sites. So what I'm doing here is finding my proxy in my virtual environment, then adding it and creating a client with Arnet. Arnet is my favorite at the moment. It's built in Rust, and it's written by a guy who contributes a lot to the Rust community, so I'm hoping it stays updated and relevant. Plus, I've found that the impersonation with this one works very well, and it's very easy to use, including asynchronously. So this is our create-client function. The next thing we want to move on to is actually fetching the URLs from the page. This is why it was important that we looked at this, because we know we're going to have a large chunk of URLs in a list that we can get from accessing just one page. I'm going to do this asynchronously. So I have this function here, which is going to use the client that I've created. It's going to take in a list, it's going to use this limiter, which is what I mentioned just briefly (it's up to you whether you want to use it or not), and it's going to return a list of responses. And we can do this all asynchronously: we can just create these tasks with our client for each URL. From here, I wanted to think about parsing the responses. Now, the response on the list page is going to be one chunk of JSON data that I need to pull the URLs out of. But the response on the actual product page, which has most of the information, looks a little bit like this.
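The video's client factory isn't shown verbatim, and Arnet's exact API isn't reproduced here. This sketch only shows the surrounding pattern described above: read a proxy from the environment and bundle it with a browser-fingerprint choice before constructing the client. `build_client_config` and the `PROXY_URL` variable name are hypothetical.

```python
import os

def build_client_config(impersonate: str = "firefox") -> dict:
    # Hypothetical helper: collect the options a TLS-impersonating
    # client (Arnet in the video) would be constructed with.
    config = {"impersonate": impersonate}  # browser TLS fingerprint to mimic
    proxy = os.environ.get("PROXY_URL")    # hypothetical env var name
    if proxy:
        # Route both schemes through the same proxy endpoint.
        config["proxies"] = {"http": proxy, "https": proxy}
    return config
```

Keeping this in one factory function means every request made through the resulting client carries the same fingerprint and proxy, which is the "bake it in" point Rooney makes.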
And if I copy this over here and put it in, we can see that there's a load more information: it's got images, it's got everything. So we want to determine which kind of page we're on, to send it in the right direction. Now, I put this all together in one function, but in hindsight I think I should have split it up, so let me know if you think that's the case too. I'm going to create a blank URLs list, and I'm using Select to parse. We only need to parse the HTML to grab one small script tag, and Select has been my go-to parser for a long time. If you check out the GitHub page, he's put a link to one of my videos on there, which I'm secretly really proud of. So, appreciate that. We're going to find the JSON in this ld+json tag. Very common schema; lots of websites use this. It should be one of the first places you look to see if it has the data that you need. Now, I'm going to loop through all of these, because there are going to be multiple on the page. I'm looking for certain things in the text; this is what I mentioned just a minute ago. With this item list, I know that I have a list of items, so this is a store page, not a product page, and we can work through it like this: find the elements, append the URLs to our list, and return that list of URLs out of this function. If it's a product page instead, we're going to find the word "product" in the JSON data, and then I can just return the whole thing, because I know this is a product and I want everything. I'm going to return None if none of these cases happen. And then we're going to start on our main function, where everything is going to happen. Now, this is the rate limiter that I use. It's an async rate limiter based on semaphores. It's just an easier way for me to write it, rather than having to do it myself.
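The routing logic described in this section (find the ld+json script tags, return a URL list for an item-list page or the whole payload for a product page) can be sketched with just the standard library. The video uses Select; this stand-in uses `html.parser`, and the function and sample schema fields are the generic schema.org names, not code from the video.

```python
import json
from html.parser import HTMLParser

class LDJSONExtractor(HTMLParser):
    """Collect every <script type="application/ld+json"> payload on a page."""

    def __init__(self):
        super().__init__()
        self._buf = None       # accumulates text while inside a matching tag
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and ("type", "application/ld+json") in attrs:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buf is not None:
            self.payloads.append(json.loads("".join(self._buf)))
            self._buf = None

def urls_or_product(html: str):
    # Listing pages carry an ItemList: return just the product URLs.
    # Product pages carry a Product: return the whole payload.
    parser = LDJSONExtractor()
    parser.feed(html)
    for obj in parser.payloads:
        if obj.get("@type") == "ItemList":
            return [item["url"] for item in obj.get("itemListElement", [])]
        if obj.get("@type") == "Product":
            return obj
    return None  # neither schema found on this page
```

As Rooney notes, splitting the two branches into separate functions would arguably be cleaner; they are kept together here only to mirror the video's single-function shape.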
I can just do this and wrap the function, and it works just fine. So, I'm creating my rate limiter here, and you can see that I use limiter.wrap on this client.get, just to manage the number of requests that I make. Then these are the shop URLs; I've got some commented out because I was just working on this. And then I have a results list. Now we're going to create our client and ask for our responses, again asynchronously, and then go through them one by one and check the status code, because I want to know that these are working. Then, for product URLs, we're going to find any that are there. If there are none (you can see this function here: if it returns no URLs, we got nothing), we move on. Otherwise, we find the URLs, get the responses for those URLs, all asynchronously, limited again, then get the product data from them and append it to our list of product data. All I'm going to do now is quickly save the results into a JSON file, which is what I showed you earlier. Now, if we zoom out a little and take more of an overview, this is going to look quite different from the standard style of requests and Beautiful Soup code you've probably seen written a lot in older-style tutorials. There's nothing wrong with that way of writing code, but you're really going to struggle practicing those methods if you aren't using these modern tools and modern Python. I really do think it's something we as a community need to push forward and say: look, requests and Beautiful Soup have their place, but these tools are outdated in the world of web scraping.
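Put together, the overall flow described above (fetch listing pages, pull out product URLs, fetch each product concurrently, collect the results) looks roughly like this. The in-memory `FAKE_PAGES` dictionary is a made-up stand-in for real HTTP responses so the sketch runs standalone; in real code, `fetch` would await the impersonating client's `get` and the responses would go through the JSON-LD parsing step.

```python
import asyncio

# Stand-in for real pages so the sketch runs without a network:
# the listing page maps to a URL list, product pages to product data.
FAKE_PAGES = {
    "https://example.com/shop": ["https://example.com/p/1", "https://example.com/p/2"],
    "https://example.com/p/1": {"@type": "Product", "name": "Widget"},
    "https://example.com/p/2": {"@type": "Product", "name": "Gadget"},
}

async def fetch(url, limiter):
    async with limiter:         # throttle, as with the video's rate limiter
        await asyncio.sleep(0)  # stand-in for the real awaited HTTP call
        return FAKE_PAGES[url]

async def scrape(shop_urls):
    limiter = asyncio.Semaphore(5)
    products = []
    # Fetch all listing pages concurrently.
    listings = await asyncio.gather(*(fetch(u, limiter) for u in shop_urls))
    for product_urls in listings:
        if not product_urls:
            continue            # nothing extracted from this page, move on
        # Fetch every product page from this listing concurrently.
        pages = await asyncio.gather(*(fetch(u, limiter) for u in product_urls))
        products.extend(pages)
    return products             # in the video, these are saved to a JSON file

products = asyncio.run(scrape(["https://example.com/shop"]))
```

The two-level gather mirrors the listing-then-product structure of the video's main function while one shared semaphore limits total request concurrency across both levels.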
So I just wanted to show you that this is what I would do for a simple script, and having the impersonate here (whichever browser you choose; I chose Firefox) is definitely going to give you a better chance of getting through and getting access to the data you want. Now, this is an overview, but if you want to see me scrape more data like this and show you how I can actually work with it in more of a real-world example, then you want to click this video right here.
