Build a Voice-Based Image Editor with Replicate, Deepgram, and the Agents SDK
The speaker highlights building with Replicate models and Cloudflare, emphasizing how combining these building blocks enables scalable, secure applications on a planetary network.
A hands-on tour of stitching together Replicate, Deepgram, and the Cloudflare Agents SDK to build a voice-driven image editor with permanent storage in R2.
Summary
Cloudflare Developers’ video showcases how to fuse Replicate models, Deepgram, and the Cloudflare Agents SDK into a voice-controlled image editor. The creator uses Gemini 2.5 Flash on Replicate to generate descriptive prompts, renders them into images with a Flux image model, then edits the results with Qwen Edit Plus, all inside a single web app built on Hono with a React frontend. Voice input is captured via a Deepgram Flux websocket connected to an agent, enabling natural language edits like removing butterflies or adding objects such as a horse or an elf. Edits are applied through remote procedure calls (RPC) to an image agent, with state synced in real time on the frontend via useAgent. To make results shareable and durable, temporary Replicate image URLs are stored permanently in Cloudflare R2 using a workflow called the storager, which retries on failure and sleeps past the one-hour expiry of the temporary URL before cleaning it up. The video walks the end-to-end flow, from prompt and generation to edits, voice commands, and permanent storage, all while leveraging the Agents SDK’s state sync and callable methods. The presenter emphasizes the ease of connecting building blocks (image/text generation, image editing, and cloud storage) into a scalable, secure app that runs on Cloudflare’s planetary network, and floats future ideas like user accounts and social features, for example likes that trigger automated edits, showing how automation can scale from simple rule-based triggers.
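The generate-then-edit chain described above is, at its core, two model calls. Here is a hedged sketch of how a Worker might chain them through Replicate's HTTP predictions endpoint; the model slugs and the injected `post` helper are illustrative assumptions (the network call is abstracted so the sketch stays self-contained), not the app's actual code.

```typescript
// Shape of the part of a Replicate prediction response we use here.
interface Prediction {
  output: string[];
}

// Run one Replicate model via the models/{owner}/{name}/predictions
// endpoint. `post` stands in for an authenticated fetch helper.
async function runModel(
  post: (url: string, body: object) => Promise<Prediction>,
  model: string,
  input: Record<string, string>
): Promise<string> {
  const prediction = await post(
    `https://api.replicate.com/v1/models/${model}/predictions`,
    { input }
  );
  return prediction.output[0]; // temporary image URL (~1 hour lifetime)
}

// Chain: text -> image, then image + instruction -> edited image.
// Model slugs are assumptions, not taken from the video's source.
async function generateThenEdit(
  post: (url: string, body: object) => Promise<Prediction>,
  prompt: string,
  instruction: string
): Promise<string> {
  const imageUrl = await runModel(
    post,
    "black-forest-labs/flux-1.1-pro-ultra",
    { prompt }
  );
  return runModel(post, "qwen/qwen-image-edit-plus", {
    image: imageUrl,
    prompt: instruction,
  });
}
```

Because the output of the first call feeds the input of the second, any Replicate image model with a compatible input shape could be swapped in at either step.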
Key Takeaways
- Gemini 2.5 Flash on Replicate generates descriptive prompts, which a Flux image model then turns into the initial image.
- Qwen Edit Plus handles targeted image edits (e.g., removing butterflies) directly from natural language prompts.
- A Deepgram Flux websocket enables real-time voice input, with the model’s end‑of‑turn event triggering transcription and the resulting edit.
- RPC calls on the Agents SDK let client-side actions (like edits) reuse the same server-side logic without duplication.
Who Is This For?
Essential viewing for developers building voice-driven media apps with Replicate, Deepgram, and Cloudflare Agents—shows concrete patterns for model chaining, RPC, and durable storage.
Notable Quotes
"I love Replicate. There are so many models and I can stitch them together and I can build whatever I want."
—Intro framing the multi-model workflow.
"Check out this voice-led photo editing app that I built with Replicate, Deepgram, and the Cloudflare Agents SDK."
—Highlighting the tech stack and project scope.
"So let's try that. Let's say add a horse in the distance."
—Demonstrating voice-driven editing flow.
"I'm going to wire up Deepgram Flux websocket from Workers AI to my agent."
—Explaining the voice input integration.
"Replicate URLs are temporary, and we saved these in R2 so they can last forever."
—Permanent storage approach using R2 and workflows.
Questions This Video Answers
- How do you connect Replicate models with Cloudflare Workers and the Agents SDK for a live editing app?
- What is the role of Deepgram in a voice-controlled image editor and how is end‑of‑turn handled?
- How do you store temporary Replicate image URLs permanently in Cloudflare R2 with a workflow?
- What is RPC in Cloudflare Agents SDK and why is it useful for reusing server-side logic?
- Can you replicate a voice-driven editing workflow across different image editing models like Qwen Edit Plus and Flux Chanel Ultra?
Replicate, Deepgram, Gemini 2.5 Flash, Qwen Edit Plus, Cloudflare Agents SDK, Hono, R2, RPC, Websockets, Flux Websocket
Full Transcript
I love Replicate. There are so many models and I can stitch them together and I can build whatever I want. And of course, you probably already know that I love Cloudflare. I can whip up any sort of application, and I can plug those building block primitives together and mash them into a scalable, secure application that runs on our planetary network. And when you merge the Replicate models and the Cloudflare network together, magic happens. Check out this voice-led photo editing app that I built with Replicate, Deepgram, and the Cloudflare Agents SDK.
So you've got this place here where you can fill out what you're trying to do, and it kicks things off. But if you don't want to, I made it so that you can use this generate prompt button, and it uses Gemini 2.5 Flash on Replicate. I asked it to be a little bit weird. So let's just give it a spin: a vibrant autumn forest path bathed in golden light with leaves scattered around the ground, featuring a vintage diving helmet resting on a mossy log and a flock of iridescent butterflies. Great. Let's get weird.
All right, so I'm going to click that: a sarcophagus floating lazily down the river. Wow, that's weird. Okay, awesome. Gorgeous, though, right? So what I did is I took that prompt and I passed it into the image generation model. In this case, it's Flux Chanel Ultra, but there are tons of models, and it's super wild how good it's gotten, right? That's an awesome looking photo. So now I'm going to use the Qwen Edit Plus image editing model, which is wild how good it is, too. So I'm going to come in here and say, let's remove all the butterflies.
I don't think I want those there. Right, so we've got butterflies in there. I'm going to apply this edit. It's going to run the edit, and it should get rid of all these butterflies and keep everything else looking the same. There we go. The butterflies are gone. Pretty awesome, huh? Just from English, it was able to do that. So let's see, what could we do? What's missing from this? It feels like maybe a horse in the distance.
So let's do that. Let's say add a horse in the distance. And you know it's using the image editing model to add that new horse, and it's going to keep everything else the same. There's a horse in the distance. Isn't that amazing? But editing by typing, that's not the future we're entering. We probably just want to speak and have it happen, right? So let's try that. What I did here was I wired up the Deepgram Flux websocket from Workers AI to my agent.
Check this out. I'm going to start the voice stream and it's going to kick off here. We're going to get this going. Make the log be on fire. There we go. And let's put an elf on the horse. I guess we replaced the horse with an elf. Kind of cool looking. That's a pretty cool photo, right? I don't know the story that's going on here, but I'm pretty excited about it. And you've got to remember to turn your voice stream off. [laughter] Awesome. And all the history is stored down here, so you can kind of see what happened, right?
I kind of want to see that, you know? There's the butterflies. There's us getting rid of the butterflies. There's us adding the horse. And, you know, sometimes this gets a little bit tricky. So I wanted to make sure that if you were speaking, you could say stuff like "make it bigger" and have it figure out what that refers to. So I wrote another Gemini prompt here to help contextualize the query. And what's cool is I can take this URL and share it with anybody and let them go and edit it too.
One of the things that I love is I was able to store these images in Cloudflare object storage, R2, so everyone can see my creation later. Replicate image URLs are temporary, so we saved these in R2 so they can last forever. And that's pretty great, right? Want to see how I built it? First off, this is a Hono web app with a React front end. I'm here in worker/index.tsx. I have an API route right here, api/images/create. It comes in, gets the prompt and everything that it needs, and it creates a new unique image ID based on that prompt.
And then I go and get an agent. I create a new agent with that name, and I call this method setName on it, passing it the ID. And then I also call this method called createImage, which exists on the agent, and I'm doing that with RPC, or remote procedure call, from the server side to the agent. Now I'm making use of the Agents SDK and I'm storing everything that I need in its state. Let's take a look at that. Over here under agents, if I go under image, I have this image agent and it uses image state.
And if you look at it here, this image state has edits, and it's an array of image edits. And the image has the prompt that created it, the generated prompt (that was the additional information I was adding there), a temporary image URL, which is where we're going to put the Replicate URL, and then the image file name for when it's stored properly. So it's an array of edits, and that is how I'm storing things. That's all the storing I'm doing.
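A guess at the shape of that state; the field names here are inferred from the narration rather than copied from the project.

```typescript
// One entry in the edit history, as described above.
interface ImageEdit {
  prompt: string;            // the edit instruction ("remove the butterflies")
  temporaryImageUrl: string; // short-lived Replicate output URL
  imageFileName?: string;    // set once the image is persisted to R2
}

// The agent's synced state: the originating prompts plus the edit history.
interface ImageState {
  prompt: string;            // the prompt that created the image
  generatedPrompt?: string;  // extra detail added by the prompt generator
  edits: ImageEdit[];        // ordered history of edits
}

const initialState: ImageState = { prompt: "", edits: [] };
```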
And that state, the initial state here, gets synced in real time because in the front-end code I'm using useAgent. Let's look at that real quick. Over in source, if I go under pages and then image details, this is what we're using to show the agent off. So we have useAgent from agents/react. We have a bunch of normal React state stuff, and all the audio things that are going on here.
And I'll be honest, if you check the truth window, I used a lot of AI to actually build the audio piece, because this is kind of new to me. So if we come in here, we have this agent, right? I'm doing useAgent, getting hold of the same type and that same state from the server, so I'm importing that. This is the name of that agent, right? So this is the namespace, and this is that ID, the image ID that came through, and we're getting that from the URL here.
So when the state changes, any time that state changes, this function is going to get called. It gets passed that state, and I can update the page. That's how it works. The agent object has a websocket connection automatically, so in fact that's how I can send data back. I'm using that send method. So if we go look for agent, I'm sending the audio chunk. I'm sending the base64, right?
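The two surfaces described here, state syncing down and raw sends up, can be illustrated with a small local stand-in. The real hook is useAgent from agents/react; this mock only mimics the behavior the video relies on, and all names are assumptions.

```typescript
type StateListener<S> = (state: S) => void;

// Local stand-in for the useAgent connection object: it supports
// onStateUpdate (server -> client state sync) and send (client ->
// server raw websocket messages).
class MockAgentConnection<S> {
  private listeners: StateListener<S>[] = [];
  public sent: string[] = []; // messages "delivered" to the server

  onStateUpdate(fn: StateListener<S>) {
    this.listeners.push(fn);
  }

  send(message: string) {
    this.sent.push(message);
  }

  // In the real SDK the server pushes new state over the websocket;
  // here we invoke the listeners directly to simulate that.
  simulateServerState(state: S) {
    this.listeners.forEach((fn) => fn(state));
  }
}
```

In the app, the onStateUpdate callback re-renders the page with the new edit history, while send carries base64 audio chunks up to the agent.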
Whenever that audio chunk is coming in, when the recorder is running and I have the microphone on, I'm using agent.send to deliver that audio data. So let's take a look at that agent code, in worker/agents/image. In the image file here, I have a thing that says onMessage, right? So onMessage is going to come in here, and I'm going to parse it, and I'm going to get access to this Deepgram socket, because I'm using a Workers AI connection to this Deepgram Flux model.
Now, this model's pretty rad. It gives you what's known as an end of turn, so you know when the person is done talking. And when I get that, I trigger a transcription. I trigger onTranscription, and I use that to do the edit to the image. So let's take a look at that really quick. I'm sending to this Deepgram socket: I get the Deepgram socket and I send it. And when I first got this, what I did was I set up a get-deepgram-socket helper in here. I went and got it, right?
With websocket equals true, I'm doing env.AI.run, that's a Workers AI run call, and I'm getting the websocket from it. I'm accepting things to come through here, and I'm adding the event listener so that when that comes back from Deepgram and I get the end of turn, I know they're done talking. I have the transcript, and I call onTranscription. And when I do call onTranscription, what happens is, if there is not an active edit already happening, I call this method editCurrentImage.
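The server-side audio path described above can be sketched with both sockets mocked locally. The event shape and method names (EndOfTurn, onTranscription) are assumptions based on the narration, not the actual Deepgram Flux payload.

```typescript
// Assumed shape of events coming back from the Deepgram Flux socket.
interface FluxEvent {
  type: "Turn" | "EndOfTurn";
  transcript?: string;
}

class VoiceEditAgent {
  public transcripts: string[] = [];

  // Called for each websocket message from the browser: forward
  // base64 audio chunks to the Deepgram socket (mocked here).
  onMessage(raw: string, deepgram: { send(chunk: string): void }) {
    const msg = JSON.parse(raw) as { type: string; data: string };
    if (msg.type === "audio-chunk") deepgram.send(msg.data);
  }

  // Wired as the Deepgram socket's event listener: only a completed
  // turn with a transcript triggers an edit.
  onDeepgramEvent(event: FluxEvent) {
    if (event.type === "EndOfTurn" && event.transcript) {
      this.onTranscription(event.transcript);
    }
  }

  // Same code path the text box uses; here we just record the
  // transcript where the real agent would call editCurrentImage.
  onTranscription(transcript: string) {
    this.transcripts.push(transcript);
  }
}
```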
And that's the same code path that happens if I use the text box. Remember, we saw me type in the text box. On the client side, if we go back (we're popping back and forth a lot), we have agent.stub.editCurrentImage, right? So while it's handling editing the current image, it calls that, does it, and clears the prompt input. That's all the form handling that you would do, but I'm calling that same method from the client side.
So they're both calling the same method. [snorts] I love that. I think that's so cool. You just write the function once and everyone can use it via remote procedure calls, or RPC. Now, the other thing I want to point out is that I'm making use of a workflow. I'm going to go back here really quick. When we look at this editCurrentImage, you can see I've marked it as callable. That's why I was able to use the stub here. Not all the methods have to be callable.
If you don't want to expose them, you don't need to. You just mark the ones you want as callable. That's why the client side is able to call it, while the server side is able to call any public method there. So I go ahead and get the generated prompt. Remember, if I say "make it bigger," it needs to know what that refers to, so it gets a contextual image edit prompt there.
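The callable gate can be illustrated with a plain allow-list; the real Agents SDK uses its own marker for this, so the names and mechanism here are hypothetical.

```typescript
// Only methods on this list are reachable from the client stub;
// everything else stays server-only (illustrative stand-in for the
// SDK's callable marking).
const callableMethods = new Set(["editCurrentImage"]);

class ImageAgent {
  async editCurrentImage(prompt: string) {
    return `edited: ${prompt}`; // would run the model edit here
  }

  // Not marked callable: the client stub cannot invoke this.
  async dropAllState() {
    return "gone";
  }
}

// What a client-originated RPC dispatch might check before invoking.
async function rpcFromClient(agent: ImageAgent, method: string, arg: string) {
  if (!callableMethods.has(method)) {
    throw new Error(`method ${method} is not marked callable`);
  }
  return (agent as any)[method](arg);
}
```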
I go and get the current image, and what I wanted to show you is this workflow. This workflow is called the storager. Remember, when the Replicate URLs are created, they're temporary and they last for one hour, which is great for quick edits, but in the long term we want more permanent storage, right? Now, I could download it and store it right here, but I'd be doing that in-process. So what I do is take that temporary URL and kick off a workflow.
So I do this create workflow, and it's actually doing a bit of a job. Not only does it get this out of band, it also handles scheduling. Let's take a look at the workflow here under storager. So storager is the name of it. I've defined the storager, and you define what the parameters are. I'm passing in the agent, the file name that I want it to be called eventually, and the temporary URL, and then I'm just doing what you would think: a step.do, and if this fails it will retry, I think by default five times.
And so I'm going to store it in R2, like you would imagine. I go grab that from the URL, and then I set the permanent object, again using RPC methods in this step. But here's the really neat thing: workflows can sleep, right? And we need that, because the URL is going to expire in an hour. So I run this workflow, I've updated the image so that it's stored in R2, and I'm going to keep making use of that Replicate URL for as long as I can.
Because it's faster to use that than to go fetch from R2. So of course I'm going to sleep, and I'm going to sleep for one hour, and when that's done I'm going to clean up the temporary URL, because after that the URL is no good. I want it out of there, and I've made a backup so that the R2 file names will work as well. And then from the front end, if we get back to right where we started in this worker, I just go and get the thing by the file name, right?
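The storager's steps described above can be sketched against a Workflows-style step API (step.do is retried on failure, step.sleep is durable), with R2, fetch, and the agent callback mocked so the sketch stays self-contained; all names here are assumptions, not the project's code.

```typescript
// Minimal stand-in for the Cloudflare Workflows step API.
interface Step {
  do<T>(name: string, fn: () => Promise<T>): Promise<T>; // retried on failure
  sleep(name: string, ms: number): Promise<void>;
}

interface StoragerParams {
  agentId: string;
  fileName: string;
  temporaryUrl: string;
}

// Fetch the temporary Replicate URL, persist the bytes to R2 inside a
// retried step, sleep past the one-hour expiry, then clear the
// temporary URL on the agent (via RPC in the real app).
async function storagerRun(
  params: StoragerParams,
  step: Step,
  r2: Map<string, string>,                       // stand-in for an R2 bucket
  fetchBytes: (url: string) => Promise<string>,  // stand-in for fetch()
  clearTemporaryUrl: (agentId: string) => void
) {
  await step.do("store in R2", async () => {
    const bytes = await fetchBytes(params.temporaryUrl);
    r2.set(params.fileName, bytes);
  });
  // The temporary URL is faster to serve, so keep it until it expires.
  await step.sleep("wait for expiry", 60 * 60 * 1000);
  await step.do("clean up temporary URL", async () => {
    clearTemporaryUrl(params.agentId);
  });
}
```

Running the copy out of band in a workflow keeps the agent's request path fast, and the durable sleep is what lets the cleanup fire reliably an hour later.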
I grab the blob off of there, whatever the file name is, and that's it. See how I stitched together image and text generation models with the image editing model from Replicate, and then used Cloudflare primitives as building blocks? Now, I'm pretty obsessed with the Agents SDK and how much it lets me lean on its execution framework and state sync mechanisms. It's awesome. One thing I wanted to do next was add users and likes, and do something like: if this photo gets 50 likes, I'll make it have more elves.
I add an elf for every 50 likes, something like that, right? And you can see how easy it would be to automate that. So, what are you going to build with all these building blocks? Let me know in the comments and subscribe so that you can build along with us. Thanks so much for hanging out, and we'll see you real soon.