Launch a Real-Time AI Video Generation SaaS in 24 Hours
Introduction to building a real-time video AI generator that takes webcam input and outputs stylized video in real time.
Build a real-time AI video generator in 24 hours using Flux 2 Klein, Modal GPUs, and Chargebee for flexible, credit-based pricing.
Summary
developedbyed walks you through engineering a real-time AI video generation SaaS from scratch. The core is Flux 2 Klein, a 4B-parameter model tuned for fast, high-quality frames, hosted on Modal GPUs. He demonstrates a real-time pipeline that streams webcam input, runs it through Flux 2 Klein, and returns stylized video frames in milliseconds. The setup also covers a self-contained billing flow with Chargebee, using Clerk user metadata for per-user usage and plan status instead of a separate database. Deployment notes include local development with a requirements.txt and app.py, WebSocket-based communication, and a 512x512 input/output constraint to match Flux 2 Klein's pipeline. He explains trade-offs like model quantization (bfloat16 vs. float8) and why stream diffusion wasn't yet compatible with Flux 2 Klein at the time of recording. The result is a fully functional, credits-based SaaS prototype that supports saving output as video or snapshots and live prompt streaming for on-the-fly style changes. The hands-on demo includes multiple effects (oil painting, clay, 3D render, Fortnite, pixel art) and live prompt updates without reloading. A practical takeaway is how to price usage (1-minute free trial, tiered Pro/Ultra plans) and manage subscriptions via Chargebee while keeping state in Clerk metadata.
Key Takeaways
- Flux 2 Klein (4B parameters) delivers higher quality frames at real-time speeds compared to larger diffusion models, with 2-4 step inference.
- Stream diffusion batches multiple frames and applies denoising steps to the whole batch, which raises FPS over per-frame processing for Stable Diffusion models; it isn't yet compatible with Flux 2 Klein, which instead runs fast per-frame inference through its own pipeline on Modal.
- A 512x512 webcam input is recommended to align with Flux 2 Klein’s default processing size, preventing processing errors.
- Chargebee is used for enterprise-grade billing with metered usage and plan tiers, while Clerk stores per-user metadata (plan, usage seconds, status).
Who Is This For?
Essential viewing for developers building real-time AI video SaaS or experimenting with Flux 2 Klein and hosted GPU inference. It also helps teams evaluating Stripe-like billing alternatives (Chargebee) and serverless GPU hosting on Modal.
Notable Quotes
"Flux 2 Klein is a distilled model down from 9 billion parameters to 4 billion and it's really high performance and the quality of the images are actually great."
—Key rationale for choosing Flux 2 Klein as the real-time backbone.
"This essentially works on top of the model by batching multiple frames and applying the steps to all of those rather than one by one. That's the difference here that I want you to know."
—Explains the benefit of streaming/batching frames via the stream diffusion concept.
"You can download those weights and persist them in a volume. So then it's just like ready to go to be loaded up inside your GPU."
—Shows the deployment approach with Modal volumes for weights.
"Chargebee makes this incredibly easy, and they're the sponsor of this episode. This is your go-to tool for deploying AI-ready SaaS applications and making the payment process really, really easy."
—Highlights Chargebee as the chosen billing platform and its advantages.
"After you log in, you can buy a subscription and you will get an hour of usage. So, you can generate these videos and then you can save them locally on your hard drive as well."
—Describes the credits/usage model and output saving capability.
Questions This Video Answers
- How does Flux 2 Klein enable real-time AI video generation on consumer hardware?
- What are the pros and cons of using Modal GPU hosting for AI workloads vs local GPU servers?
- How do you implement a credits-based billing flow for a real-time AI SaaS (Chargebee vs. Stripe)?
Flux 2 Klein, Modal GPU hosting, real-time AI video, WebSocket API, Chargebee billing, Clerk user metadata, stream diffusion, bfloat16/float8 quantization, video export (WebM), Flux 2 Klein pipeline
Full Transcript
Hey there, my gorgeous friends on the internet. Hope you're excited because today I'm going to show you how I built out a real-time video AI generation tool that essentially allows you to feed in your webcam data. And it gives you back another video in real time with any effect that you want, whether that's oil painting, a 3D render, a clay effect, or even Fortnite if you want. And we'll do this using an open-source model called Flux 2 Klein, which I'll show you how we can host on our own GPUs. Not only that, we'll take this a step further by adding auth and also a payment provider like Chargebee.
They're really awesome, really intuitive. We'll be doing this essentially to assign credits. So, after you log in, you can buy a subscription and you will get an hour of usage. So, you can generate these videos and then you can save them locally on your hard drive as well. So, hope you are excited. I can't wait to show you how to make this. Now, generating an AI image these days is fairly trivial. You can go on Gemini and use Nano Banana Pro or you can go on ChatGPT. You feed in the prompt and you get back an image 10 to 20 seconds later.
However, getting something working in real time is way trickier than that. And it doesn't matter if you use a proprietary model like Google has here or an open-source model like Stable Diffusion XL or 1.5. The problem is that all of these stable diffusion models go through the same process. And that's by taking a really noisy image like this and then applying steps to it to essentially make it clearer until you get the cat image. And training basically goes the other way, right? You start with the cat image and the prompt and then you apply steps to add noise to it until it's full of noise like that.
So what are the options out there? Well, we can go first of all with a smaller parameter model, something like Stable Diffusion XL. It's way, way smaller than something like Gemini, and for real time it usually gives you a much better result. Now, just using it on its own like this is okay, but what you can also do is apply a different pipeline to it, like stream diffusion. This essentially works on top of the model by batching multiple frames and applying the steps to all of those rather than one by one. That's the difference here that I want you to know.
Stream diffusion takes multiple frames, starts applying the stepping process to those, and then streams it back to you, which essentially gives you higher FPS if you want to do something like this. Now, from my experience, I tried it with Stable Diffusion XL and 1.5, and I wasn't too happy with the results I got. It was looking along the lines of this, and that was just not too interesting to me. It was cool for like two seconds. So, thankfully, I found a different model that came out literally a month ago, called Flux 2 Klein.
So this is a distilled model down from 9 billion parameters to 4 billion, and it's really high performance and the quality of the images is actually great. It's fine-tuned to work really well between two and four steps, so you're actually going to get really good results. And I got this working really well without using something like stream diffusion (which doesn't work with Flux 2 Klein right now), just by simply feeding it a frame, getting a response back, and doing that over and over again. So this is what we're going to use, Flux 2 Klein, and we are going to host it on Modal, which is a really cool website that essentially lets you hire out GPUs.
So, they were nice enough to give me some credits, but even if you want to get some, when you sign up they give you $25 in credits, which is super cool. Not sponsored or anything, I just really like the service. Now, I hosted this on an H100 GPU. You don't need this. You can host it just fine with an L40S, and if you look at the pricing here, it's really not bad at all. So, if we go per hour here, it will work with this. The Nvidia L40S is $1.95 an hour.
Now, you are not going to get 12-14 FPS. You might get somewhere along the lines of 4 to 5 FPS. Going up to an H100, it pretty much doubles. So, let me give you a quick idea about how you want to deploy this model. If you also want to deploy this locally, there's not much you need to change. All you really need is a requirements.txt that just holds all your dependencies, and then we have an app.py. That's it. So here all I'm doing is loading up the model repo.
So, Black Forest Labs Flux 2 Klein. This is again the 4 billion parameter distilled version. There's a 9 billion one that takes up twice the amount of VRAM. So if you want to run this locally, this one takes around 12 GB of VRAM; the 9 billion model takes around 24 to 26 in my testing. But if we run it through Modal, we can also download those weights and persist them in a volume. So then it's just ready to go to be loaded up inside your GPU. So that's super cool. So then we're just downloading the models here.
And then down here, we define an app class where we can tell it what kind of GPU we want to use. I'm using the H100 here with 65 GB of VRAM. And then I also have the scale-down window here. This is really cool if you're not using the application: it essentially turns the GPU off after 120 seconds. Now the downside of that is that if you want to use the application, it's going to take a minute or two to warm up and to load those weights from the storage into the VRAM.
Now remember how I showed you the stream diffusion pipeline? I tried to get that working with Flux 2 Klein, but I don't think it's supported yet. So, unfortunately, that doesn't work; you have to use their own Flux 2 Klein pipeline. Hopefully in the future. And then here, I'm just passing the model down in bfloat16. That's the format of the weights, right? They're in bfloat16. And that's where you hear the word quantization. To quantize means you're essentially cutting this down to maybe a float8 or float4, which gives you fewer bits per number.
So, if I do this in float8, it essentially cuts down the amount of VRAM that I need to use by half, but you get less precision. The models are a bit crappier in general. And the only other logic I really have in here is creating a WebSocket through FastAPI, and I'm making a public endpoint here at /ws where I'm just passing the raw JPEG bytes back and forth. So the webcam one comes in and the other one goes out to the front end. Let's break down each hook and component that is essential for this application.
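The client side of that exchange is small enough to sketch. Here is a minimal TypeScript version with the socket injected as a narrow interface so the logic is testable; the names `wireFrameStream` and `FrameSocket` are illustrative, not from the project:

```typescript
// Minimal sketch of the frame-streaming client. Raw JPEG blobs go up the
// socket; binary frames come back and are handed to a callback.
interface FrameSocket {
  send(data: Blob | ArrayBuffer): void;
  onmessage: ((ev: { data: ArrayBuffer }) => void) | null;
}

function wireFrameStream(
  socket: FrameSocket,
  onFrame: (bytes: ArrayBuffer) => void
) {
  // Every message from the backend is a finished stylized frame.
  socket.onmessage = (ev) => onFrame(ev.data);
  return {
    // Called once per captured webcam frame.
    sendFrame: (jpeg: Blob) => socket.send(jpeg),
  };
}
```

In the browser you would pass a real `WebSocket` pointed at the Modal URL, with `binaryType = "arraybuffer"` set so incoming frames arrive as `ArrayBuffer`s.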
Starting with the useWebcam hook. The whole purpose of this is to stream the raw JPEG bytes to the model for us to get the generated frame back. So it's a simple interface here where we have a stream, an isActive, a start, a stop, and a captureFrame. Now, the resolution you capture this at is really important. By default, Flux 2 Klein works at 512 by 512, so you want to make sure that the webcam feed you send is in the same format. Otherwise, it's going to blow up. I'm not joking.
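Since webcams rarely produce square frames, a common approach (not shown in the video) is to center-crop before drawing onto the 512x512 canvas. A minimal sketch, with illustrative names:

```typescript
// Hypothetical helper: compute the source rectangle for a center crop,
// so an arbitrary webcam resolution maps cleanly onto a 512x512 canvas.
const TARGET_SIZE = 512;

interface CropRect { sx: number; sy: number; sw: number; sh: number }

function centerCropRect(width: number, height: number): CropRect {
  // Use the largest centered square that fits inside the frame.
  const side = Math.min(width, height);
  return {
    sx: Math.floor((width - side) / 2),
    sy: Math.floor((height - side) / 2),
    sw: side,
    sh: side,
  };
}
// Browser usage:
//   const { sx, sy, sw, sh } = centerCropRect(video.videoWidth, video.videoHeight);
//   ctx.drawImage(video, sx, sy, sw, sh, 0, 0, TARGET_SIZE, TARGET_SIZE);
```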
So, to capture the frame, all I'm doing here is adding this to a canvas, getting the context, and then essentially drawing the image out and sending it back. I'm returning a promise here from canvas.toBlob. Okay, this is really, really important. You can use canvas.toDataURL, and I tried that before, but that's really bad because it's synchronous: it blocks the main thread and it just lags the UI. But you can convert it to a blob instead, and that does a way better job because it runs off the main thread.
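That promise wrapper around `toBlob` can be sketched like this. The canvas is typed structurally (just the `toBlob` method) so the function also runs against a mock; the name `canvasToJpegBlob` and the default quality are assumptions:

```typescript
// Sketch of the capture step: encode the canvas contents as JPEG bytes via
// toBlob, wrapped in a Promise. Unlike toDataURL, toBlob encodes off the
// main thread, so the UI does not stutter at webcam frame rates.
interface BlobCanvas {
  toBlob(cb: (blob: Blob | null) => void, type?: string, quality?: number): void;
}

function canvasToJpegBlob(canvas: BlobCanvas, quality = 0.8): Promise<Blob> {
  return new Promise((resolve, reject) => {
    canvas.toBlob(
      (blob) => (blob ? resolve(blob) : reject(new Error("encode failed"))),
      "image/jpeg",
      quality
    );
  });
}
```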
And then I simply have two little functions here in this file. One is start, which just gets the user media (the webcam) and sets the stream to that. And then we have one for stop as well that does the reverse: it just gets all the tracks and sets the stream to null. And that's it. We are ready to go. Now, to actually render this out, we have a video-canvas.tsx that essentially takes that canvas and then uses requestAnimationFrame to render it out. And one thing I did wrong here at the beginning was to use React state.
The problem is you end up re-rendering the component 30 times per second, which can be quite bad for performance. So instead what I'm doing here is using a ref. I'm passing a canvasRef down here and then just updating that directly, avoiding all those re-renders. So here we go: drawFrame. I'm doing a useCallback, checking if the canvas exists. If it doesn't, that's fine. I'm setting the interval here for the playback, which we can control through the UI. And then we are simply awaiting createImageBitmap here.
This is really important. This is a browser API that decodes JPEGs directly on the GPU. That's the big difference. It returns an ImageBitmap, which can be drawn on a canvas instantly. If you use a new Image() instead and set its src, that runs much slower and is quite CPU intensive, so createImageBitmap is the way. And then we simply render out the video canvas component here. We pass in the props we need, like the height and the width (512 for both of them), the canvas ref, and finally the frame buffer. This is actually coming from our GPU.
So ws here, if I look this up quickly, you're going to see that it's using this useWebsocket hook. So let's go into it and see what's going on. useWebsocket essentially just creates that connection. We have some little functions here like disconnect and connect, and this just looks for that specific URL that Modal provides for me. And that's it. Now for auth, I'm using Clerk here, but you can use anything you want. I just created a middleware here where I'm protecting certain routes like the API usage, the checkout, and the portal.
And what's cool is that we can run this all without actually hooking up a database. Because what we're going to do with Chargebee and the payment is: when the user actually pays for a subscription, we are going to save all that information in the metadata for that specific user, like the usage and what kind of subscription they have. And now for the payment plans, I essentially ended up setting up three here: free, pro, and ultra. Again, you can do it however you want, but the way I did it was, for the free plan, if you sign up, you can use 1 minute of generation.
And then you need to pay for it, obviously. And then I have a pro plan and an ultra plan here where you get 60 minutes a month. So, how can you set this up? Well, Chargebee makes this incredibly easy, and they're the sponsor of this episode. So, thank you so much, Chargebee. This is your go-to tool for deploying AI-ready SaaS applications and making the payment process really, really easy. So, I highly recommend you check them out. I'm going to leave a link in the description down below. What I like about Chargebee is that it's an enterprise-ready subscription and billing platform.
Compared to something like Stripe, they offer quite a bit more, even when it comes to subscription complexities. For example, you have tiers here, metered usage, and hybrid as well. But to create one, let me just quickly show you how you can do that. You might have product families, so when you create a plan, you can add it to a specific family, in this case, real time. But I'll just go over here to plans. As you can see, I have two set up here: real time pro and real time ultra. But let's create a new one here.
So you go here, create a new plan. You give it a name. Here we go. It goes in this product family, so it lives there. So if I want to make a new one, for example, I can do real time extra pro ultra. All right, that's the best one that we got. So there we go. We create that. You can add your redirect URL here, which is going to be localhost:3000 for our use case. You create the product, and here you can do different frequencies as well. So if you want to charge them daily, weekly, monthly, or yearly, you can set a price here.
So let's set a price. I'll just do a test one here, 15. And then we can create this product. There we go. And once you create it, you are going to have an ID here with the name that we can use in our application. So, as you can see in my plans TypeScript file here, I just pass down the Chargebee item price ID. That's all you need here. Then you can add your label and the price to it as well. So I created a file here called chargebee.ts. We import a package here from chargebee.
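A plans file along the lines he describes might look like this. The item price IDs and second allowances here are placeholders, not the real values from the video:

```typescript
// Illustrative plans.ts: map each UI plan to a Chargebee item price ID.
interface Plan {
  id: "free" | "pro" | "ultra";
  label: string;
  chargebeeItemPriceId: string | null; // the free plan has no Chargebee price
  secondsIncluded: number;
}

const PLANS: Plan[] = [
  { id: "free", label: "Free", chargebeeItemPriceId: null, secondsIncluded: 60 },
  { id: "pro", label: "Pro", chargebeeItemPriceId: "Real-Time-Pro-USD-Monthly", secondsIncluded: 1800 },
  { id: "ultra", label: "Ultra", chargebeeItemPriceId: "Real-Time-Ultra-USD-Monthly", secondsIncluded: 3600 },
];

function planById(id: Plan["id"]): Plan {
  const plan = PLANS.find((p) => p.id === id);
  if (!plan) throw new Error(`unknown plan: ${id}`);
  return plan;
}
```

The checkout endpoint then only needs `planById(...).chargebeeItemPriceId` to start a hosted checkout session.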
You can get that on npm. And then I just export an instance of it, and you just need to pass in the site and the API key. So let me show you where you can get those. If we head over here to the browser, the site ID is this one here: real time test. And to get the API key, you can head over to Settings, then Configure. If you scroll down, you should have the API key. So just copy that over and paste it. And one more thing is going to be the webhooks.
So I have it set up here as a tunnel with Cloudflare, and that needs to point to /api/webhooks/chargebee. That's how I have it set up. So let's just have a look at that route. Okay, this is going to be a POST request that monitors the subscription status: whether it's been created, activated, renewed, or cancelled. And then I just have a little function set up for each of those cases. And that's essentially it. So if we have a look at handleSubscriptionActive, all I'm doing is pulling out the Clerk user ID.
And after I do that, I get the plan, and then I simply update that Clerk user with the plan and the usage as well. And this way, by updating just the user metadata, we eliminate the need for a database in the first place. We're just receiving the webhook from Chargebee and updating the Clerk user metadata. And that's it. Now, to complete the checkout process, we just need to get that price item ID for each of these subscriptions. We can show some cards, render them out in the UI, and then we can simply send that through to an API endpoint, which is a POST request.
Okay, so we're just checking for that price ID here. If it's available, then we are good to go. And if it is, we're getting the user here through Clerk again. And then we are calling the getChargebee function here; we're running this to initialize that singleton. And then the checkout here is hosted by them, so you don't need to set up a separate page, though you can do that if you want to integrate it directly into your application. So we're just passing the price ID and the quantity, which is going to be one in our case.
And that's it. And then finally the redirect URL, which we have in our .env.local. But yeah, that's Chargebee. So if you want a really complete billing platform that has multi-gateway billing, tax calculation and compliance, and advanced invoicing as well, it's all in one bundle here. Definitely check it out. So now I'm just rendering some cards out here, like the pro and ultra plans. And when I click on one, I'll be redirected to Chargebee's fully hosted checkout. And then when I pay, I finally get those credits assigned to me.
So let me just quickly show this off in the UI here. I can record two webcams at the same time, so I'll do it here this time. Right after we pay through Chargebee, as you can see, it says manage there. And we also get a bunch of credits assigned to us, or in this case time for generation. So here's the magic of this webhook. When we run this Chargebee webhook and we pay, we update the Clerk user metadata, and this essentially dictates everything for us. This is where we set the usage seconds used.
So we start off at zero, right? We haven't used anything. And then the period start and the status of that subscription, which will be set to active as soon as we pay. And if we check the Clerk dashboard, as you can see, we have that right here in the metadata: plan ultra and the seconds that we used. So, we used almost half an hour of this, and the subscription is active. So, there we go. That's it. And then from here on out, we just need to decrement this time, right? That's kind of the whole goal.
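The core of that webhook-to-metadata flow can be sketched as plain logic. The event names, metadata shape, and the injected updateMetadata function are simplified assumptions, not the project's exact code:

```typescript
// Sketch: translate a Chargebee subscription event into a Clerk metadata
// update. Injecting the updater keeps the logic free of SDK dependencies.
type SubEvent =
  | "subscription_created"
  | "subscription_activated"
  | "subscription_renewed"
  | "subscription_cancelled";

interface UsageMetadata {
  plan: string;
  secondsUsed: number;
  periodStart: number;
  status: "active" | "cancelled";
}

async function handleSubscriptionEvent(
  eventType: SubEvent,
  clerkUserId: string,
  plan: string,
  now: number,
  updateMetadata: (userId: string, meta: UsageMetadata) => Promise<void>
): Promise<void> {
  const status = eventType === "subscription_cancelled" ? "cancelled" : "active";
  // created / activated / renewed all reset the usage counter for the period
  await updateMetadata(clerkUserId, { plan, secondsUsed: 0, periodStart: now, status });
}
```

In the route handler, `updateMetadata` would wrap Clerk's user metadata update call, which is what makes a separate database unnecessary.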
When you're on the free plan, you get a minute for free. But after that, if you sign up, you get half an hour or an hour. So, how does that work? Well, we have another hook called useUsage, which essentially just counts down. It's a local timer that literally counts down a second at a time. And once it does that, it syncs to the server: it calls this API usage sync endpoint with a POST request, and that's it. And then it updates the UI on the front end as well. So, as you can see, we just have a timer set here.
But if I check this little API usage sync route, let's go in there. Look, that's all we're doing. We're getting that request and calculating: we check whether we're on the free plan and how much allowance we should have, and if we're not on the free plan, how much we should deduct. And finally, we just update that Clerk metadata. With that done, you can press start. And as you can see, it will start decreasing as soon as we start getting that response back from the model. Now, I did it in a way so it doesn't start decreasing the time automatically as soon as you press start, because look, we hit a cold start now.
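The deduction math in that sync route reduces to a small pure function. This is a sketch under assumed allowances (free = 1 minute, paid = 30 or 60 minutes); the function name is illustrative:

```typescript
// Sketch of the usage-sync math: clamp the usage counter to the plan's
// allowance and report whether the user has run out of generation time.
function applyUsage(
  secondsIncluded: number,
  secondsUsedSoFar: number,
  secondsToDeduct: number
): { secondsUsed: number; exhausted: boolean } {
  const secondsUsed = Math.min(secondsIncluded, secondsUsedSoFar + secondsToDeduct);
  return { secondsUsed, exhausted: secondsUsed >= secondsIncluded };
}
```

When `exhausted` is true, the frontend can swap the stream for the subscribe prompt, which matches the behavior shown in the demo.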
So it might take a minute for the GPU to start running, but once it starts running, the timer starts decreasing. And that's how I have it set up. The little payment plan also comes up: once it goes to zero, you get the option to subscribe to a pro or ultra plan. So there we go. We should now be up and running, and we get frame generation nicely. And one final thing: once we hit stop there, this also gets saved as a video. So let me show you how we can do this. As you can see, we get an export of the full video here.
And the way that's done is quite simple. We just have a useRecorder hook here that takes in the canvas ref that we're currently using. And I'm using the canvas captureStream method and converting it into a WebM. That's it. I'm saving that. And then here on stop, we are adding it into a blob. We're creating an object URL with an anchor tag here, so we can actually save it and download it, and then we can revoke it to get rid of it. I have another one where you can take a snapshot.
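The export path can be sketched like this. The pure parts (joining chunks into a WebM blob, naming the file) run anywhere; the browser wiring is shown in comments. The helper names are illustrative:

```typescript
// Sketch of the video export: collect recorder chunks, join them into a
// WebM blob on stop, and build a timestamped filename for the download.
function makeExportName(prefix: string, when: Date): string {
  // Colons and dots are not filename-safe on every OS, so replace them.
  return `${prefix}-${when.toISOString().replace(/[:.]/g, "-")}.webm`;
}

function finishRecording(chunks: Array<string | Blob>): Blob {
  // MediaRecorder's ondataavailable pushes chunks here; on stop, join them.
  return new Blob(chunks, { type: "video/webm" });
}

// Browser wiring (not runnable outside the browser):
//   const stream = canvas.captureStream(30);
//   const rec = new MediaRecorder(stream, { mimeType: "video/webm" });
//   rec.ondataavailable = (e) => chunks.push(e.data);
//   rec.onstop = () => download(finishRecording(chunks),
//                              makeExportName("realtime", new Date()));
```

The `download` helper would be the anchor-tag trick he describes: create an object URL, click a hidden `<a download>`, then revoke the URL.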
So it essentially just takes a PNG image. You do kind of the same process here: you still end up creating an anchor tag, getting the data URL, hitting download, and then getting rid of it. And as for the styles here that you can pick, you can actually just prompt this with whatever you want, honestly, but I have some predefined custom tags here that you can click on. And the way that works is it's just an array here with an ID and a prompt. So I just set up custom prompts for each of these.
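That array of presets, plus a fallback to free-form prompts, might look like this. The prompt texts here are illustrative, not the ones from the video:

```typescript
// Illustrative style presets: each button maps an id to a canned prompt,
// and anything that is not a preset id passes through as a live prompt.
interface StylePreset { id: string; prompt: string }

const PRESETS: StylePreset[] = [
  { id: "oil", prompt: "oil painting, thick brush strokes" },
  { id: "clay", prompt: "claymation character, stop motion look" },
  { id: "3d", prompt: "smooth 3D render, studio lighting" },
  { id: "pixel", prompt: "pixel art style, 16-bit" },
];

function resolvePrompt(input: string): string {
  return PRESETS.find((p) => p.id === input)?.prompt ?? input;
}
```

Because the prompt is just a string sent alongside each frame over the WebSocket, swapping it mid-stream restyles the video without reconnecting, which is what the live demo shows.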
And the cool thing is, since we are streaming this through a WebSocket, you can directly edit this live whilst you're getting back that stream from the model on the GPU, which is super cool. Okay, so let's try a couple of examples here. So this is an oil painting one here that we can do. And again, I set up a couple of default ones here. I feel like my computer is going to blow up with recording and doing all this at the same time. But let's try clay. Let's see how that works.
Look at that. We got a cool clay effect. Hello. There we go. How cool is that? We also got a 3D render. I kind of like this. Makes your face look nice and smooth. No acne. Oh my gosh. And I added a couple more here, like Fortnite and whatever. There's a low poly one. Let's try this one out. It actually gets the text pretty right there, or at least the shape of the triangle on my chair. That's pretty cool. And we should be able to also type this live here. So, if we just get rid of this and say studio ghibli, that should work.
There we go. See, it works live. So, you don't even need to stop the thing. It just updates automatically. How about an "add style" drawing? That's not really it. What else can we do? I'm curious. How about pixel art style? Here we go. We got pixel art. So, there we go. Have fun with this. It's pretty fun, I'm not going to lie. So, there we go. That's your base foundation here to creating your next AI SaaS application. Now, I'll leave a link in the description down below with the GitHub page for this.
I'm going to play around with this more, just because it's so fun and new models always come out, just to see how far I can take this. Yeah, that's going to be it for me. Thank you so much for watching.