How I Write Web Scrapers w/ Python

John Watson Rooney | 00:10:06 | Feb 18, 2026
The video opens by showing how the initial code is straightforward but not modular, doesn't handle errors, and is difficult to upgrade or change.

Build a modular, reusable Python scraper as an actual application (ETL-style) instead of one-off scripts, using classes for extract, transform, and load.

Summary

John Watson Rooney walks through why tiny, throwaway scrapers quickly derail: they're hard to upgrade, ignore errors, and aren't reusable. He argues for a proper scraping system organized around ETL (extract, transform, load) to improve modularity and maintainability. The video presents a concrete project structure with separate Extract, Transform, and Load components, plus a pipeline-like approach that can swap in new clients, proxies, or storage without rewriting the whole thing. Rooney demonstrates an extractor class that supports sync and async fetching, proxy handling, and a retry mechanism using Tenacity. The transform step uses a simple CSS-selector-based HTML parser, while the load step stores results to JSON (with easy hooks to swap in CSV or a database). He highlights using uv for project scaffolding and logging to trace progress and failures. Finally, Rooney runs the pipeline with uv run main.py and notes how this approach scales better as sites change or new features are added. The takeaway is clear: treat scrapers as real applications, not ad-hoc scripts, to reduce maintenance headaches and speed up improvements.

Key Takeaways

  • Treat scraping as an ETL problem: extract data, transform it, then load it into a target store, not just fetch and print.
  • Create modular components (Extract, Transform, Load) with clear responsibilities so you can swap in new data sources, parsers, or storage backends.
  • Use a retry mechanism like Tenacity with exponential backoff (e.g., stop after 5 attempts) to handle flaky sites; a minimal retry sketch follows this list.
  • Implement both sync and async data fetching within the extractor, including a shared proxy setup and logging for observability.
  • Store results in JSON by default, with a straightforward path to switch to CSV or a database later.
  • Rooney demonstrates a practical project layout (main.py, extract.py, transform.py, load.py, logging) that you can reuse across projects.
  • A well-structured scraper reduces maintenance pain when sites change or when you add more targets.
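
A minimal sketch of the kind of retry setup described in the takeaways above, using Tenacity's stop_after_attempt and wait_exponential helpers; the function name, the choice of httpx as the client, and the timeout are illustrative assumptions rather than the video's exact code:

```python
# Hedged retry sketch with Tenacity: stop after 5 attempts, exponential backoff.
# The function name and the use of httpx are illustrative, not the video's code.
import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=2, max=30))
def fetch_html(url: str) -> str:
    response = httpx.get(url, timeout=10)
    # A non-200 status raises here, which Tenacity catches and retries.
    response.raise_for_status()
    return response.text
```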

Who Is This For?

Essential viewing for Python developers who write web scrapers and want a maintainable, scalable architecture rather than stopgap scripts. Great for those curious about ETL-style design, async fetching, and modular project layouts.

Notable Quotes

"But there's a big problem with this. It's not modular. It doesn't handle uh anything like errors, and it's very, very difficult to upgrade and change when things inevitably go wrong."
Rooney emphasizes why one-off scripts fail to scale and why modularity matters.
"What I mean by that is at the moment we just have like your main py file and that's fine. This it works and it runs."
Illustrates the starting point before introducing a modular ETL approach.
"We can start to think of how our code and our project structure can reflect using these three things."
Introducing the ETL framing (extract, transform, load) as the guiding structure.
"This is very much an easier way to manage everything."
Rooney touts the practical benefits of the modular design for long-term maintenance.
"The whole thing becomes much more of a system that you can actually build and use and understand what it's going to do for you."
Summarizes the transition from scripts to a usable scraper system.

Questions This Video Answers

  • How do I design a modular Python scraper with an ETL workflow?
  • What is Tenacity and how can I use it to add retries to web scraping in Python?
  • How can I structure a Python project for synchronous and asynchronous web scraping?
  • What are best practices for logging in a web scraping project?
  • How do I switch a scraper from JSON output to database storage? (A minimal sketch follows this list.)
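
For that last question, here is one way a load component could write JSON by default while leaving a hook for a database backend; the class, method names, and table schema are assumptions for illustration, not taken from the video:

```python
# Illustrative load component: JSON by default, SQLite as a swap-in backend.
# Class name, method names, and schema are assumptions, not the video's code.
import json
import sqlite3


class Load:
    def save_json(self, items: list[dict], path: str = "results.json") -> None:
        with open(path, "w", encoding="utf-8") as f:
            json.dump(items, f, indent=2, ensure_ascii=False)

    def save_sqlite(self, items: list[dict], path: str = "results.db") -> None:
        # Same input shape, different destination: the rest of the pipeline
        # doesn't need to change when the storage backend does.
        with sqlite3.connect(path) as conn:
            conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, url TEXT)")
            conn.executemany(
                "INSERT INTO products (name, url) VALUES (:name, :url)", items
            )
```
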
Python, Web Scraping, ETL, Extract-Transform-Load, Tenacity, AsyncIO, Proxies, Logging, uv tooling, Python project structure
Full Transcript
For the longest time, I would write code that looks like this. Pretty straightforward, very understandable, and you know, it worked. But there's a big problem with this. It's not modular. It doesn't handle anything like errors, and it's very, very difficult to upgrade and change when things inevitably go wrong. You'll end up having to rewrite most of this if you want to implement a different package, for example, or anything else like that. So, what I suggest now is to actually make your Python code much more reusable and build yourself an actual scraping system rather than one-off scripts which are going to break and you'll have to fix. What I mean by that is at the moment we just have, like, your main.py file, and that's fine. It works and it runs. What we can actually do is start to think about, hey, what does our code actually do? So, let's have a think. Well, the first thing we do is extract the data, and this is possibly going to be the most important piece here. Then we want to do something with it, and then we want to save it somewhere. These three keywords are pretty common and pretty important. ETL is very, very common in the programming world and in the tech world, and we're basically doing exactly the same thing. But what we can do now is start to think about how our code and our project structure can reflect these three things. So let's have a look at it to start with. What does our extract need to do? Well, it might need to fetch some data. There might be, you know, a sync client, or we might need to use a browser. We might want to do something with async. We might need to set up our client and our proxies. And the list goes on. Next is our transform. What are we going to do here? Well, we're going to take some data in and do something with it. So we're probably going to, you know, parse HTML or something with JSON. And then we're maybe going to use Pydantic classes or something else, maybe attrs, or just, you know, some kind of class system for our data. These are all very important things that we need to consider. And finally, we want to load the data somewhere. We're going to want to save it to something. This could be CSV, JSON, a database, you know, any database that you want, or whatever you're trying to do. So underneath these three headers we can start to think: well, this is going to be a class on its own, this will be a class on its own, and so will this. Within each one we can then start to implement these methods and expand on them as we need them. Then each one of these classes gets pulled into whatever file we need, possibly something like a pipeline. And from there our pipeline just gets executed as and when we need it, and this is going to give us that much more modularity, the ability to change things as and when we need to and to switch packages out. So let me show you what I mean in a bit more of a project example. On the screen here is the basic project that I've been using recently, pulling some data both synchronously and asynchronously. I have my extract, my load, my main.py, and my transform. I'm using uv for this, which is the new kind of hot stuff when it comes to Python projects, and I think it's really cool. I'm really, really liking it so far.
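
As a rough sketch of the structure Rooney describes here, extract, transform, and load each become a class in their own file, and main.py wires them together; the class and method names below are assumptions for illustration, not his exact code:

```python
# Skeleton of the ETL-style layout described above: extract.py, transform.py,
# load.py, and main.py. Names are illustrative assumptions.

class Extract:
    def fetch_html(self, url: str):
        ...  # clients, proxies, retries, sync/async fetching live here


class Transform:
    def parse_list_page(self, html) -> list[dict]:
        ...  # HTML/JSON parsing and data classes live here


class Load:
    def save(self, items) -> None:
        ...  # JSON today, CSV or a database later, without touching the rest


def main() -> None:
    extract, transform, load = Extract(), Transform(), Load()
    html = extract.fetch_html("https://example.com/category")
    items = transform.parse_list_page(html)
    load.save(items)


if __name__ == "__main__":
    main()
```
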
So that's the only difference; that's what the uv.lock file and the pyproject.toml file are. I've also got a logging file here. So we're starting to build up our scraper as if it's an actual, proper Python application. Let's open it up here. I'm going to open my extract. So I have my extractor class. Now, generally speaking, without too much extra project-specific customization, I can just copy and paste this extractor class to whichever project I'm doing next. There's nothing project-specific here. This class is solely designed to extract the data from whatever website or URL I send it to. If we have a look at our initialization, I'm getting my proxies, which I'm probably always going to want to use; there's a link for the ones I use in the description. Then I'm creating my client and updating some of the session information. There are two clients here, as you'll see: this one, the session client, which is my asynchronous client, and also the blocking client here. This is actually using rnet, which I talked about in a recent video. I'm testing it out a bit more, and so far, pretty good; I'm really enjoying it. I'm adding my proxies to these clients, and then I'm just logging some info out. Again, logging is very useful, so when I run this I can see exactly what's happening and in which parts. If we look at the actual methods on this class, they're fairly straightforward. I have the fetch HTML sync version, which uses the blocking client, and this has a retry on top, a retry decorator. Now, this comes from a package called Tenacity. I've started using Tenacity a bit, and I'm going to be doing a video on other retry options and, you know, which is the easiest to use and which is the most straightforward. That's what Python's all about for me, really: ease of use and being straightforward. But this is quite simple. We just say, hey, stop after attempt number five, and wait on exponential, so the wait between attempts keeps getting longer. Then we have our logging in here as well, so we can see what's actually going on. Then I have the fetch HTML function itself, which is pretty self-explanatory: if we don't get a 200 status, I'm going to raise an exception, and this exception is then going to be picked up by the retry. Then two more functions: fetch JSON asynchronously, where I know I'm going to be expecting JSON data back from this function specifically, and a fetch all function which essentially, you know, gathers all the tasks and URLs that I give it and runs them asynchronously, so I can scrape much quicker. Now let's think about what's actually going on here. There's nothing particularly project-specific other than perhaps this part where I'm returning the JSON data; I should possibly be returning the actual response object here instead. That would probably be better. But what we can do now is that if we come across something like, hey, I need to run a browser for this page, I can just import that in, come down here, and create a new function, you know, get HTML with browser or whatever you want to call it. Then I can write this in here and give it self and then url, whatever I want to do, and I spelled browser wrong, but you kind of get the idea. So we start to build it up, and it becomes much more modular and much easier to manage and improve.
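
A hedged sketch of the extractor pattern described in this section: one class holding both a blocking and an async client with a shared proxy setup, logging, and a Tenacity retry on the sync fetch. The video uses a different HTTP client (rnet); httpx stands in here, and all names and parameters are illustrative assumptions:

```python
# Hedged extractor sketch: shared proxy setup, sync + async fetching, retries.
# httpx stands in for the client used in the video; names are illustrative.
import asyncio
import logging

import httpx
from tenacity import retry, stop_after_attempt, wait_exponential

logger = logging.getLogger(__name__)


class Extractor:
    def __init__(self, proxy: str | None = None) -> None:
        # One proxy/timeout setup shared by both clients (assumes httpx >= 0.26).
        self.client = httpx.Client(proxy=proxy, timeout=10)
        self.async_client = httpx.AsyncClient(proxy=proxy, timeout=10)
        logger.info("Extractor ready (proxy configured: %s)", proxy is not None)

    @retry(stop=stop_after_attempt(5), wait=wait_exponential())
    def fetch_html(self, url: str) -> str:
        logger.info("Fetching HTML: %s", url)
        response = self.client.get(url)
        response.raise_for_status()  # non-200 raises, which triggers the retry
        return response.text

    async def fetch_json(self, url: str) -> dict:
        logger.info("Fetching JSON: %s", url)
        response = await self.async_client.get(url)
        response.raise_for_status()
        return response.json()

    async def fetch_all(self, urls: list[str]) -> list[dict]:
        # Gather all the requests so product pages are fetched concurrently.
        return await asyncio.gather(*(self.fetch_json(u) for u in urls))
```
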
And obviously, with that retry and the logging, if something goes wrong when we're scraping, we can see exactly what's going on. I put most of the effort into the extractor class here because, you know, extracting the data is the hardest part. But if I come to the transform, it's very, very straightforward and very simple. I'm just using selectolax as my HTML parser and basically saying, you know, here's my parser that I'm using. This is parsing the list page. So when I'm scraping data from this specific site, I know that I want the elements returned from the HTML by my CSS selectors, and then, you know, to extract the text from them and extract the href data from that information as well. Now, if we wanted to get more data or do something else with it, we just need to write a new method here, and then we can call that when we instantiate this class in our main code. And quickly, just the load here: all I'm doing in this class is saving to a JSON file. Nothing particularly interesting here, but again, if I wanted to add a database, it's very simple; I can add it in here, initialize the connection string, or whatever I need to do. And you can even, you know, start building a config.py file, or a config.json file, that you can then import bits from. So the whole thing becomes much more of a system that you can actually build and use and understand what it's going to do for you, and then you can take what you need from it from project to project, so you never have to sit there and write out your client again and all that sort of stuff. It makes your life much, much easier and gets you away from those one-off scripts that have such a high chance of failing. Unless, of course, you write loads of good logging and whatever, and then that's fine; this is just a much easier way to manage all of that. So I come to the main file. What I've got here is I'm basically just importing what I need. I decided from this I wanted to scrape these categories, so I've instantiated my classes here. I thought I must have been writing Go for a second, using the e, the t, and the l. Very not Pythonic at all. Now, it's a straightforward, you know, main function that goes through the categories and a value in a range and then, you know, pulls the data out using the functions and methods that I've created, logging as we go. And then I'm just running the function here. Now, I opted not to use a sort of pipeline class, although you absolutely could have done that, depending on what you're trying to achieve. This is very much an easier way to manage everything. So if I come out of this now, clear this up, and do uv run main.py, we're going to see all of the logging that I'm doing here. You can see this is the first time calling it, and that is from Tenacity. And you can see I'm going through page by page, but then I'm asynchronously grabbing the product information, 48 at a go, so I don't have to wait for each one of those pages to load. And then I'm going to have my results output here. If I open that up, my results.json file, you can see I've got 5,800 lines, and this is the information; we can see it there. I didn't quite format this properly; I must have done something wrong with my formatting there. But you can kind of get the idea.
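
A hedged sketch of the transform step described here, using selectolax's HTMLParser with CSS selectors to pull out text and href values; the selector and field names are placeholders for whatever the target site actually uses:

```python
# Hedged transform sketch with selectolax: CSS selectors, text and href extraction.
# The selector and field names are placeholders, not the video's actual ones.
from selectolax.parser import HTMLParser


class Transform:
    def parse_list_page(self, html: str) -> list[dict]:
        tree = HTMLParser(html)
        items = []
        # Adjust the selector to match the target site's markup.
        for node in tree.css("a.product-item"):
            items.append(
                {
                    "name": node.text(strip=True),
                    "url": node.attributes.get("href"),
                }
            )
        return items
```

In the main file this would sit alongside the extractor and load classes, driven by a loop over categories and page numbers, and run with uv run main.py.
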
So, what I'm trying to get at with this is that with a little bit more effort and a little bit more time on your behalf when you're learning Python, you can really start to treat your scrapers as a project in themselves and make your life much, much easier when it comes to managing the project going forward. Because we all know that scraping maintenance is one of the hardest things; I put it as the second hardest thing when we're talking about actual, proper scraping projects. The first one is actually getting the data; the second is maintaining it. And especially when you have lots of sites to scrape and maintain data from, having a nice project structure like this, where you can easily see what's going on with your logging, you can work with the retries, and you can implement new utilities and new methods to do things for you as you need to, is going to be absolutely it. And essentially, what we've done here is rewritten Scrapy.
