Build a conversational form with Deepgram | WorkerShop

Cloudflare Developers | 01:01:20 | Mar 26, 2026

A hands-on walkthrough building a voice-enabled “personal readme” app with Cloudflare Workers, Deepgram Flux, and the Agents SDK to turn speech into actionable form updates in real time.

Summary

Craig (Cloudflare) and Gina (Deepgram) team up in this live workshop to demonstrate a conversational form powered by voice. They show how to wire a Cloudflare agent to a Deepgram Flux model via WebSockets, capturing audio from the browser, streaming it to the model, and using end-of-turn transcripts to drive updates to a Durable Object-backed form. The session covers practical gotchas like handling 429s with workflows, tuning end-of-turn thresholds, and supporting multilingual transcription with Deepgram's models. Gina explains what Deepgram does (speech-to-text and text-to-speech) and why end-of-turn signaling matters for natural conversations. Craig walks through the architecture, including the Durable Object that stores per-user state and the agent's workflow-driven patching of the form. They also discuss future directions: more voice-driven interactions, email/Slack integration, and the possibility of voice-aware front ends. The vibe is exploratory and hands-on: generate code from the talk, experiment with parameters, and imagine AI-assisted user interfaces that respond with voice. The code examples come from a repo that Craig built using the Cloudflare Agents SDK and Deepgram models, with live edits and demonstrations of configuring thresholds and key terms for domain-specific accuracy. It's a practical peek at building interactive, voice-first apps that run close to the user and scale with Durable Objects and serverless workflows.

Key Takeaways

  • Deepgram Flux end-of-turn signals let the agent know when a user has finished speaking, enabling timely and accurate responses.
  • End-of-turn threshold and end-of-turn timeout are tunable, allowing you to balance latency with conversational safety for different users and domains.
  • Durable Objects in Cloudflare provide per-user state that lives close to the user, with a hibernation API that keeps WebSocket connections efficient.
  • Agents SDK workflows can run multiple tasks in parallel (fan-out) and handle retries, making long-running processes like form updates robust under API limits.
  • Multilingual transcription is supported (with a path to full multi-language workflows); model choices like Flux and Nova-3 enable language-aware conversations and code-switching.
  • The architecture supports future voice-to-action flows, where agents can respond back with voice and perform actions (e.g., updating emails, Slack messages) automatically.
  • Code generation from the agent framework is usable for quickly prototyping voice-enabled apps that run entirely in the browser and on Cloudflare’s edge network.

Who Is This For?

Essential viewing for developers exploring voice-enabled interfaces, Cloudflare Workers, and AI-assisted agents. It’s especially valuable for teams prototyping conversational forms, personal assistants, or domain-specific chat/voice workflows.

Notable Quotes

"What is Deepgram? Deepgram is a voice AI company... we do everything with voice. That means speech-to-text and text-to-speech."
Gina defines Deepgram's core capability and sets up why voice makes sense for the demo.
"End of turn signals the system that it is time to process."
Gina explains end-of-turn detection with Flux as a core signal for the conversation flow.
"The durable object will fall asleep but the connection will appear to stay alive to your client."
Craig highlights Cloudflare Durable Objects and WebSocket behavior for edge-based apps.
"You can imagine me saying, let's just summarize this profile and choose to email or Slack."
Demonstrates how a voice-driven workflow could trigger different output channels based on user preferences.

Questions This Video Answers

  • How does Cloudflare’s Agents SDK enable voice-driven workflows on the edge?
  • What is end-of-turn detection and how does it improve conversational apps?
  • Can Deepgram Flux be used for multilingual speech recognition in real-time apps?
  • How do durable objects support stateful conversational agents in Cloudflare Workers?
  • What are practical tips to tune end-of-turn threshold and timeout for voice UIs?

Full Transcript

see what happens. Hello. Hello, everybody. Welcome to the first workershop. Gina, thank you so much for joining me on this little experiment that we're doing. This is not a webinar. This is a worker shop. We are going to build together today. I've been working on a thing and I needed some help from Gina. Gina is coming in from Deepgram. Anybody out there in the chat using text-to-speech, or the other way around, speech-to-text, who has not used Deepgram: it's time for you to use that, right? Yeah. Actually, if you're interested in it, it can be just plug and play, or you can follow along with us in this worker shop and see how you can build these things and how you can make your app listen to you. Absolutely. And it's called worker shop because it's a Cloudflare Worker, right? We build these web apps, and I've gone and built a web app. And as I was working through it, I realized that maybe the way we do things is changing a little bit. I've been thinking about this project. I don't know how many people out there have this idea of what a personal readme is. I feel like we're in this time, Gina, where we don't really know people, right? You and I have spent some time on video calls, but I don't know how you like to work, really. And so maybe I'm going to send you an email. You might not like email. Maybe that's not how we're supposed to do it. That happens all the time. Yeah. Because it's not just in one country. Sometimes we need to collaborate with different people across time zones, across the globe. And I want to be as accommodating as possible to my collaborators. But sometimes I just don't know their working styles. That's where, if they have a personal readme, it will be super helpful for me.
Or even the other way around: if my collaborator knows my working style, that makes it easier for me to collaborate. Yeah, absolutely. And I've seen managers use them too, where they say, this is the style of how I want you to interact with me, or the reverse: I would like to know how you would like feedback, things like that. So it's a concept that exists out there, and I started thinking about it. I was like, "Oh, that would make a fun little app where we could store this stuff," because I think what's going to happen pretty soon is that it's not going to be just people talking to us, right? It's going to be agents coming in, and how agents interact with us as well. We're entering a new world where it's not just humans in this space that are going to be reaching out and talking to people. So I was trying to think about what that would look like. And I started noodling on it and I built a little thing. A pretty cool thing. Yeah. And then I brought Gina in to make it even cooler. And I started with this concept: we're in a time where we are generating so much code, and oftentimes we go and build the whole spec of things. I wanted to try an approach with this app where I said, hey, go out and research what people have been saying about personal readmes, and then I want you to build me an application that helps the stuff come in. So I went out and built this application, and you know, it's experimental, but I think it's pretty cool. I like where it's gone, and I am so happy that I was able to reach out to Gina to help me even more with it as we go. I'm going to share my screen here. And I'm going to kick off a brand new profile.
And this is using the Cloudflare Agents SDK under the covers. We're going to do this one live, just in case I had already made one before. So I'm going to click this open, and we are going to be looking at this readme here. I said, just go and do it, and it made a pretty complex form, right? I like what it did. I like the questions it asks: what are your preferred channels? There are these checkboxes I can go and click. How do you communicate? What's your role? And it decided, obviously, it needs to know what your team is and that sort of stuff. And I started thinking, I don't really want to think about this and fill it out, right? Forms like this are daunting, and I feel like this is the kind of thing that was starting to go away. So now I'm on stage two of my thought, which was: man, it would be kind of cool if I could just tell it what is happening. So I took a swing and I used an LLM in the background. I say, my name is Craig, and I apply this text update to it, and then I say, I am a developer educator, and I do this text update. I was hoping that I could kick off a little thing in the background, run some LLM processing on the text, and it would update my thing, right? So I'm updating that from text. And then I started thinking a little bit more about it. Wait a second. So it's running. One of the things I was thinking about is that I might want to type a lot of stuff and not even look at the form, and just let it go. And it doesn't really matter; eventually it's going to happen. So I started doing this with the agent. The agent framework has this concept called workflows. And I know that sometimes the LLM is going to get choked up. We've been playing with these LLMs.
We know that sometimes you're going to run into problems where you send some text, it runs a function, and maybe that thing hits an API that gives back a 429. So you need to back off, and you need to retry if you want things to happen there. So conceptually, what I thought was: oh, I should run those through a workflow. And I'm testing to make sure that I can check and uncheck stuff. So, I like to use Slack and email, and I wanted it to automatically go and check the boxes, and then if I change my mind, I can undo it. So this is kicking off one of those workflows. And if I look here, you can see that my workflow is running, and it's doing this "extract a profile patch" step. Now, I could queue a whole bunch of those things all at once, and it would just go and edit things as well. So I was doing that, and this should go; it takes a little bit of time, right? And that's on my side. I just want to make this clear before I hand this over to Gina: I'm playing with using a smaller model on this. I'm using one of my favorite models on Workers AI, this GLM model. On Workers AI we host models that you can go and use, and among the models we host are these Deepgram models, which is how Gina and I know each other. So there are several Deepgram models that we host, and I was like, "Oh, wouldn't this be cool if I could actually just talk to this form?" And as I started thinking about that, I thought that might be what the future feels like. So I called up Gina to say, "Hey, Gina, what should I do here?" And I'm so glad to have Gina here to explain a little bit about what is going on. And what is Deepgram?
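The 429-and-retry pattern Craig describes can be sketched as a generic back-off wrapper. This is a minimal illustration, not the Agents SDK workflow API itself (the SDK's workflows give you durable retries without writing this by hand); the function and parameter names below are invented for the example.

```typescript
// Minimal retry-with-exponential-backoff sketch for a flaky LLM call.
// Retries only on HTTP 429 (rate limit); any other error is rethrown.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 250,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      const isRateLimit = (err as { status?: number }).status === 429;
      if (!isRateLimit || attempt === maxAttempts - 1) throw err;
      // Back off 250ms, 500ms, 1s, ... before the next attempt.
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

A workflow step that extracts a profile patch from an LLM could wrap its model call in `withBackoff(...)` so a burst of queued updates survives rate limiting.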
Let's back up a little bit here, because I'm coming in with a little bit of knowledge, but let's bring this to the audience. What is Deepgram? So Deepgram is a voice AI company. We do everything with voice. What does that mean? Say, as humans, when you and I are interacting, what we do is: I listen to what you are asking, in my head I process the question, and then I reply back to you. When I'm hearing your question, that's your speech transcribed in my head. My head generates a response, and then I reply back to you. In computer-science analogy: the first part, when I'm listening, that's speech-to-text; when I'm thinking, that's the LLM; and when I'm speaking back, that's text-to-speech. So when I say Deepgram does everything with voice, that encompasses those two parts: speech-to-text and text-to-speech. If you want to know what someone is saying to you, you can use Deepgram for live transcription or batch processing. You can also pass it text, like a response from your LLM, and have it speak back to you. Nice. That's a rough overview of what Deepgram can do. And what I do is work with my clients to learn their actual business problem or the use case they're trying to solve, and then create a voice-AI-based solution for them. Awesome. And I think there's a lot of benefit to these agents that we're starting to have, right? You're seeing this all the time; I'm sure your customers are coming in saying agents need to speak. Yeah. Right. And I think that can actually be quite helpful.
So just a few days ago, on a weekend, I needed emergency medical assistance, and instead of going to urgent care and waiting for hours, I chatted with the One Medical bot, and it was actually able to solve my issue within about half an hour, which is pretty impressive with respect to the wait time. But what if, instead of chatting, I could just talk with it? Right. Oh, when you said chat, I actually thought you were speaking with it, but you meant text-based. No, I was texting. Yes. Basically, it's a back-and-forth chat, like Slack communication. It was as if, instead of my doctor, I was Slacking a specialist working on behalf of my doctor: what is your symptom? Okay, I'm refilling your prescription. That's it. Right, right. So anyway, if you are like me and want to do less, and like to make your programs work for you, maybe you also want to just talk to it instead of inputting text by hand on your keyboard. And that is exactly where Deepgram can help you. And basically, if you have any questions, please feel free to stop me at any moment. I think what I just accidentally did there was an interrupt, and I feel like that's something Deepgram offers [laughter] as well. Yes. Actually, this is one of the very crucial things. When you're speaking, especially with a bot, no one wants to be interrupted by a bot. From a human, we may tolerate it, but a bot interrupting me mid-sentence? No, I don't like it at all, and many users won't like it at all. So you want to know when the user has actually completed his or her sentence. That's end of turn. End of turn: what does it mean? That you have high confidence that the user is finished, so now you can start your thought process, and you are not working with half-baked ideas.
Number one. And number two, you are not deteriorating your user experience, and you can actually reply when the user has completed his or her thoughts. Yeah. Where is it coming from? It is coming from Deepgram's Flux, which is our latest speech-to-text model, and its end-of-turn event. That basically signals to the system that it is time to process. So it is not going to wait too long, where the interaction feels super slow, and it is not going to respond too early and interrupt the user and make the user experience unpleasant. Right. And so when I showed Gina, she said: you should pick up Flux because of this end-of-turn thing; it's really powerful. Thank you for pointing me to this model. It's available on Workers AI. So take a look at this link here. If you are watching on YouTube, come to the Cloudflare Developers YouTube channel and we'll chat there; I would love for you to see that. Oh, I pasted the wrong thing; that didn't work. I pasted the repo. Let me copy and paste better. That was a mistake. So, I wanted to show you this model that's available to you. You can connect a WebSocket to it. Cloudflare is excellent for WebSockets, and we're able to connect to this Deepgram model. And that's what we were just doing: I was waiting for Gina's end of turn, right, when she was done talking, so that I could talk, and I knew when it was the right time to talk, because we want to find what that right beat is. And oh my gosh, this model is so good at that. Playing with this model is really awesome. So Gina suggested: well, you already have this part working with text.
Why don't you just do a transcription when that happens and feed it into the flow that you already have working here? And so I tried that. Let's do a quick demo. I've got to look at this form a little bit here. I'm going to show you the reverse of this. Earlier I said that I like Slack and email, and it checked my Slack and email. Check this out. There's this "start speech" button. Let me talk about what's going to happen when I do this, because I can't talk while I am running the thing. I'm going to click start speech, and it's going to connect to Workers AI, and Workers AI is going to return back a WebSocket to that Deepgram Flux model. From the client side, we're going to send audio to my agent, and my agent is going to send that across the WebSocket. Okay? So the browser microphone sends to my Cloudflare agent, and that agent connects to Workers AI and holds a connection open to it. I'm going to keep pushing audio through, results are going to come back down, and I'm going to wait for the end of turn. And when end of turn happens, Gina, there's a detection of: is this what he means? And I haven't tweaked this at all. This is out of the box, but we could tweak it. We'll talk about that a little bit after this. Let's do this really quick. I'm going to start my speech here. Okay. So, I actually like video calls. I prefer regular check-ins. I want to get better at technical writing. I like project planning. I don't like async docs-first, actually. And now I've stopped my speech, but it's gone through. You could see the end of turn, and remember, it's my code that's slow. It's not this.
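The loop Craig narrates (stream audio up, wait for Flux to signal end of turn, then act on the final transcript) can be sketched as a small WebSocket message handler on the agent side. The event shape below is a hypothetical simplification; the field names are assumptions, so check the Deepgram Flux docs for the real event schema.

```typescript
// Hypothetical, simplified shape of a Flux WebSocket message.
// Field names here are assumptions, not the documented schema.
interface FluxEvent {
  type: "Update" | "EndOfTurn";
  transcript: string;
}

// Track the latest transcript and fire a callback only when Flux
// signals end of turn; that is the moment the agent would queue a
// profile-patch workflow rather than reacting to every interim update.
function makeTurnHandler(onTurn: (finalTranscript: string) => void) {
  let latest = "";
  return (raw: string) => {
    const event = JSON.parse(raw) as FluxEvent;
    if (event.transcript) latest = event.transcript;
    if (event.type === "EndOfTurn") {
      onTurn(latest);
      latest = ""; // reset for the next turn
    }
  };
}
```

On the agent, `socket.addEventListener("message", (e) => handle(e.data))` with `const handle = makeTurnHandler(queuePatchWorkflow)` would wire this in.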
You see how quickly the input was coming in, and how well it nailed the right end of turn. When I was done saying what I was thinking, it was acting on it. So you could imagine somebody looking at a form and speaking about it: I need to speed my site up. It was clear that Deepgram was going very fast on this. I need to clean my back end up. But the processing is happening, and it's all happening with a workflow, right? So I queued all that stuff up. In the case that you do have something running slow, you can use these workflows. Or maybe I'm hitting some other API that tells me I've overloaded it; I know that eventually that will happen. That's the nice thing about this durable execution. Agents now use workflows inside: there's a nice workflow API inside the agents where I just say, create this workflow. Really cool; first time I actually used that here. And I think this style is something I'm going to play more with, now that you've introduced me to this end of turn. I feel like it's a really powerful way to capture what was happening. You could imagine me doing a tool call, right? You could imagine me saying, "Ah, I'd like the lights in here to be green," and then they go green, or something like that. I think it's a new input field, a new style of input. And I don't have to do the heavy work in the browser, so this will work from my phone. If I was filling this out on my phone, I'd just press that button and it would work. I'm using the browser, but it's not heavy computation, because I'm doing all of that on the back end, right? So it's really neat, really powerful. And go ahead. Go ahead. No, no, that was my end of turn. I end-of-turned, and you can jump in there. Is there anything you want to add?
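Each workflow run produces a small "profile patch" that gets merged into the agent's stored form state, which is how "I like Slack and email" can later be undone by another utterance. A minimal sketch of that merge, with a profile shape invented for the example (the real app's fields are whatever the generated form defined):

```typescript
// Hypothetical profile shape for the personal-readme form.
interface Profile {
  name?: string;
  role?: string;
  timezone?: string;
  channels: string[]; // e.g. ["slack", "email"]
}

// Merge one extracted patch into the current profile. Scalar fields
// are overwritten when present; channels are replaced wholesale so a
// later "uncheck email" patch can remove an earlier choice.
function applyProfilePatch(current: Profile, patch: Partial<Profile>): Profile {
  return {
    ...current,
    ...patch,
    channels: patch.channels ?? current.channels,
  };
}
```

In the Agents SDK, the result of each merge would be written back with the agent's state API so every connected client re-renders the form.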
So actually, I wanted to say that you can configure it for your use case. As you experienced, for most use cases the default threshold, which is 0.7, works pretty well. But if you want a faster response, you can set a few parameters, like an eager end-of-turn threshold, so your LLM can start processing before the user even fully finishes the sentence, if your app requires that. Or maybe your users speak with long pauses, as their style: say you're a care provider, and there may be users who pause longer between words and [clears throat] sentences. You can also play with the end-of-turn timeout to avoid cutting turns prematurely. Oh, that's cool. It's like, however you want it to behave. The end-of-turn threshold controls how confident your model must be before saying: okay, my user has completed his sentence; I can speak now. And if you want to adjust for your user base or your personal speaking style, you can set the timeout so that it waits longer. Yeah. And I love that it did exactly what it needed to do out of the box. And I think it's really neat to have those knobs, to be able to run this thing and say, oh, it jumped in too quick, right? Because I ramble on, as you know, Gina, I ramble on, and maybe I don't want the whole thing to go, so I want it to be more confident before it jumps in there. But yeah, to your point, it's a new tool, and I want people to think about that, because I hadn't really processed it before I actually touched it and played with it. And I think that's true of all the AI stuff that's out now: you actually have to use it to see how the thing might work. And this is a great way to do that, right? Build an app and then ask, "Huh, can I put voice in this?"
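Gina's two knobs, the end-of-turn confidence threshold (default 0.7) and the end-of-turn timeout, would typically be passed as connection parameters when opening the WebSocket. The parameter and model names below (`eot_threshold`, `eot_timeout_ms`, `flux-general-en`) are assumptions modeled on Deepgram's query-string style; verify them against the Flux documentation before relying on them.

```typescript
// Build a Deepgram Flux connection URL with end-of-turn tuning.
// Parameter and model names are assumptions; check the Flux docs.
function buildFluxUrl(opts: { threshold?: number; timeoutMs?: number } = {}): string {
  const params = new URLSearchParams({ model: "flux-general-en" });
  // Higher threshold = wait for more confidence before ending the turn
  // (good for ramblers); lower = snappier but riskier interruptions.
  params.set("eot_threshold", String(opts.threshold ?? 0.7));
  if (opts.timeoutMs !== undefined) {
    // Longer timeout tolerates users who pause mid-sentence.
    params.set("eot_timeout_ms", String(opts.timeoutMs));
  }
  return `wss://api.deepgram.com/v2/listen?${params.toString()}`;
}
```

A care-provider app might call `buildFluxUrl({ threshold: 0.85, timeoutMs: 4000 })` to avoid cutting off slow speakers, exactly the trade-off Gina describes.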
Because I think that's coming. I feel like it's coming. I was just thinking: say we're actually chatting with our code agents. By chatting, I mean texting. They can process pretty good text input and convert it to good code. The hallucination rate is pretty low nowadays, at least in my experience. How about, instead of chatting, I just give voice commands? That's how you started it: go research how people are building personal readmes and then build me an app for that, instead of writing it all out. I was just telling it, okay, I want this; and maybe you're doing other productive work somewhere else, building another cool demo, while talking with one agent to build your personal readme. Yeah, absolutely. And I did do this, and you're right. I built this one with Codex, because, like I said, I wanted to see what the app would look like if I didn't even give it specs. So I was basically like, I want to build this on Cloudflare, and I want to do it with agents, and go. And I could imagine it saying, here's the app, and me responding, "Oh, actually, I want you to do this." If I spoke that, I feel like it would be a little more natural than me sitting there and typing it out. And it's wild that we're getting there, right? I did a thing just yesterday: I pair-programmed building something, and we were switching off. There's a local tool that does speech-to-text, and we were doing exactly what you were saying, because we didn't want to share a keyboard. We were just leaning forward and talking, and we built an entire app with just our voices. And we're at a time now where the speech-to-text probably does better with the typos than we do.
The LLMs are excellent at typos, and they're probably also great at finding the right context if I ramble on about something, which is just wild. There might be something fun to even look at here. And I hear somebody talking. I really like the idea of multiple people speaking, because it's much easier than grabbing the keyboard from your colleague and saying, okay, I have to type now; instead, we're just talking into a common mic, and it can pick up the sound pretty easily. Somebody responded here in chat. I'm going to try to show it and read it out loud: they think it's cool. They're using a healthcare app, and they're worried about max connections and supported languages. Supported languages, I think, is pretty awesome. Do you want to talk about that? Depending on the language, they might be able to use Flux. So, what are the supported languages? Flux currently has an English model, but we are going to extend it to be multilingual as well. If you'd like multi-language support right now, I would suggest Nova-3. It has different languages, and it also handles code-switching. What does that mean? Say I'm a Bengali speaker. I can speak Bengali and English, and in my day-to-day I often actually talk in Bangla. Can you give me an example of some Bangla? Oh yeah. [speaks in Bengali], which basically translates to: I'm not feeling great this morning. I have a light headache. Okay. And that happens sometimes with my partner, and, funnily enough, it also happens when I am in a deep brainstorming session with my colleagues. The word I often say is "accha," which means okay. It's kind of an acknowledging word in Bengali.
I'm just so used to it that I keep saying it. But they're like, "Are you sneezing?" Accha to you. No, it's a dumb joke. Sorry. Okay, I think that was a dumb joke. I apologize. But it would find it; and now it's stuck in my head: each time I say it, I'll be thinking about that. So it does do that, right? Yes. That's cool. I could be speaking English, and it detects that I maybe said "accha," so it knows I said okay. Yes. But for that, you have to set the language to multi. So if you anticipate code-switching, you should be using multi. But if you're actually anticipating that it will be just a Bengali conversation, I would set the language tag to Bengali, because that's more accurate. If you think about how the model's power is being invested, it is not spent figuring out whether a word is an English word or a Bengali word; it just knows the Bengali words and transcribes them. Gotcha. But there are other models, right, that do other languages? Yes. Bengali was just my language, so I kept rambling about that. But there's Arabic, Croatian, Dutch, French, German, different accents of English, Hindi, Italian, Japanese; there's a whole list of languages. And if someone is interested, let me paste the language link here; you can check whether your language is there. Awesome. William, there was also a question about max connections. I don't believe that's a problem here. William, can you talk about how many connections you're thinking about there? Just in case: I don't want to say [laughter] never say never. I don't know what workload you're working with there, but if there's a ton, there might be a better way to think about it.
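Gina's rule of thumb (a specific language tag when you know the audio is one language, `multi` when you expect code-switching) can be captured in a tiny helper. The model name and parameter values below are assumptions for illustration; check the Nova-3 documentation for the supported language tags.

```typescript
// Pick transcription parameters following the rule of thumb from the
// talk: a specific tag (e.g. "bn" for Bengali) when the audio is one
// known language, "multi" when code-switching is expected, because a
// fixed tag lets the model spend its capacity on one vocabulary.
// Model name "nova-3" and the tag values are assumptions.
function transcriptionParams(expectedLanguages: string[]): Record<string, string> {
  const language = expectedLanguages.length === 1 ? expectedLanguages[0] : "multi";
  return { model: "nova-3", language };
}
```

So a Bengali-only consult would use `transcriptionParams(["bn"])`, while Gina's Bangla-plus-English brainstorms would use `transcriptionParams(["bn", "en"])` and get `language=multi`.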
But as far as I'm aware, I have not heard of limitations on how many active connections you can have. Now, I don't know if you want to have it all inside the same agent, right? In my case, each one of these personal readmes is its own agent. And in fact, this is kind of interesting. I'll show a little of the power of the agents framework here: the agents automatically sync. So if I take this left half, and I pull this over here and do this on the right half, there we go, this is kind of like me looking at the edited version of it. The way the agents work is that they have state, and this page is connected to my agent. So if I were ever to update this, it would just update automatically. Let's see if I can get it to go. Let's do the time zone. It says time zone not set over here. And this is really neat, right? The time zone options here are kind of bonkers. This is where I'm showing off, conceptually, why plain English might be better than forms: I'll do the start speech, kick it off again, and say, I'm in the Pacific standard time zone, actually. So it grabbed it. It queued it. We should see this time zone change here, and we should see it change over here automatically, because the agents update automatically when you need something real-time like that. And of course, again, the slow part is my code; we saw the transcription was already there. But you see, it automatically updated both of these, because the agent itself has WebSockets, right? So I think what's powerful is: you'd hit a maximum of 100 speakers? Of course, not in the same room. Yeah, of course. So, totally fine.
A hundred is great; you're fine there. I was making sure we're not talking about millions, because if so, maybe we need to get you connected with a team. And then, you know, Cloudflare also has great WebRTC support. So if you're thinking about doing chat channels and things like that, we are very much able to plug things together; you should be seeing more and more of that coming from us. I don't want to talk too much about roadmap here, but you should be able to speak to these agents automatically, and use models like these powerful Deepgram models, right? You should be able to choose your model and speak to it. And then, one of the things we've been talking about here, what we're doing here, is speech-to-text. But we could go the other way. Right. Gina, there's a whole other world that we talked about a little bit here. You can imagine that things are coming in to these agents, and then they're going out as well. So they're talking back. One of the things Gina suggested was a response like, "I've updated your form," each time, or some sort of notification of that. Or let's think about this the other way, right? Here's the profile. What if I came in here and I don't want to read this? I don't want to know what this is. I just want to click a button that says: just summarize it. Should I send him an email? And it could say, you should not send an email, or, he actually likes emails. Yes, he does like emails, right? But if I uncheck emails now, I'll uncheck that here, and I need to do a save on it; it will happen automatically. So if I was driving this by voice now, it would know not to send an email.
And so, again, what I was talking about: I think in the future we're going to want to know what this is at all times, because it's not just people talking to each other. I love this for people, and I've seen this readme idea work really well on teams, which is where my brain went for it. But I think agents are also going to talk to us, and I would imagine one of these checkboxes becomes "I would prefer you talk to my agent," right? [laughter] We're getting to the point where it's like, let's just communicate over... I don't know what that world is, but it is coming. And Gina, you're in this world way more than I am. Do you see that? Yeah, obviously. The real next step for me is: okay, my app listens to me, it understands me; now I want it to respond back to me. So instead of going back to my terminal to check whether the output is ready, what if, when the agent is done generating the output, in this case saving the profile, it just gives me a voice answer: "Your profile is saved." And then, when I am sending an email to Craig, it responds: "Craig has not set email as his preferred communication method. Are you sure you want to send it?" I would prefer that. So instead of me going through the whole thing, say I'm just typing the email, instead of sending the email I can copy-paste it to Slack, since he likes Slack. That's awesome. And maybe even, "I want to send this message to..." you're speaking to something, I don't know what it is.
You're talking to an agent that you've built that has access to your Slack and access to your email, and you say, "I want to tell Craig that I'm going to be 5 minutes late," and it goes, okay, and it looks and sees that Craig likes email, he doesn't like Slack, he's never on Slack, and it chooses to write an email that says, "Hey Craig, this is Gina, I'm going to be 5 minutes late." Just from you being able to communicate, and the agent having access to these tools that we're talking about. That's totally doable. The one caveat is that you need to allow it the access, but if you allow it the access, that's doable. Yeah. And that's a situation where you might want your end of turn to last a little bit longer, right? If I'm talking to my agent, I might want to say, "Tell Craig I'm going to be 5 minutes late to the lunch tomorrow," but it's already gone and kicked off and tried to start. The tuning of that is something that will be powerful, and something you can control. When we talk about voice, we talk about voice agents... I like to tease my Alexa, who's gotten a lot better, but I couldn't interrupt her. She would misunderstand me and start going before things would happen. [laughter] Exactly. And I think everybody, all the builders out there who are watching this, we have felt what that is, and now you have the ability to tune it, right? You have the ability to create the experience that you wish you could have had, here in the web. Through a web page we're able to go and do a bunch of stuff on the back end as well, right?
to whatever sort of connectivity and things like that. But it's a new input channel and output channel, and I think that's really powerful to wrap your brain around. This code is available, and I wanted to be really clear about that: there is a repo available, and I started it from code generation, right? Agent code generation, which I know not everybody has leaned into. The code's really good; I'm very happy with how it works. It's structured a little differently than I would normally do it, but it's still good. I went through and made sure everything was looking good. In fact, I'm going to click into the code here so I can show you. I told it that I wanted to use the Cloudflare Agents SDK, and I used a skill. And then I told it I wanted to draw what is happening here. So let's take a quick look at this architecture. It generated this Mermaid diagram for my readme, because that's something you can do in readmes now. I'll share it here so it's not so big on the screen, and we can talk through what's happening. We have React running in the browser, and Gina helped me figure out exactly how we use the MediaRecorder API: we get chunks of audio data, and I send those chunks into my personal readme, which is a Durable Object. Each user has a Durable Object, right? If you haven't seen our Durable Objects before, you can think of one as an instance of an object that lives close to wherever it was created and has state: it has a little database inside of it. Each one of them has that, and they live forever.
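The capture step Craig just described (MediaRecorder chunks forwarded over the agent's WebSocket) can be sketched like this. The MediaRecorder calls are illustrative and only run in a browser, and they are not the repo's exact code; the small `sendChunk` helper is the pure piece:

```typescript
// Forward an audio chunk to the agent's WebSocket, dropping it if the
// socket isn't open yet. Kept separate so it's easy to test with a fake.
type ChunkSocket = { readyState: number; send(data: ArrayBuffer): void };
const WS_OPEN = 1; // value of WebSocket.OPEN

function sendChunk(socket: ChunkSocket, chunk: ArrayBuffer): boolean {
  if (socket.readyState !== WS_OPEN) return false; // still connecting/closed
  socket.send(chunk);
  return true;
}

// In the browser (illustrative sketch; in real code e.data is a Blob):
//   const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
//   const rec = new MediaRecorder(stream, { mimeType: "audio/webm" });
//   rec.ondataavailable = (e) => sendChunk(socket, e.data);
//   rec.start(250); // hand us a chunk roughly every 250 ms
```

The timeslice passed to `start()` controls how often chunks arrive, which is the knob for how "live" the transcription feels.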
It sounds a little too good to be true, but sometimes this is exactly what you want for an agent, and that's why the Agents framework wraps around Durable Objects and adds some additional things. One of the things that's really nice about this is that it's serverless: we don't want you to pay for stuff when it's not doing anything; we only want you to pay when it's actually doing work. One of the things that happens typically with WebSockets (awesome technology) is that you need to have a connection to a server at all times. So one of the nice things about Durable Objects is the Hibernation API: the Durable Object can fall asleep, but the connection appears to stay alive to your client. Your client can say, "Hey, I'm sending this thing and I'm ready for it to kick off," and it still has a connection to my agent automatically. And the code for that... I'm basically saying "use the agent that has this name," and it gives me that. Yeah, exactly. That's what I use most in my app. So Durable Objects are great; when you see "DO" there, that's what it means. I always lean on them. After I started using these agents, I haven't stopped using them. And people ask, "Is that really an agent?" Yeah, it's going to be an agent. We're in a world where I think everything we build is going to be an agent. So this personal readme agent has all the state of what that form says, right? And I can query that in different ways: I can query it with voice, from any sort of application. Think of it as an object that says: this is how Craig likes to behave, this is how Craig likes to interact, this is what he's working on, things like that. The Durable Object then sends to a Workers AI WebSocket, right? That's the Deepgram model that's here. It's sending to Deepgram Flux.
It sends those audio chunks: the chunks come from the browser into the agent, and the agent sends them to this WebSocket. So I connect to that WebSocket, and it is so fast. This is so awesome: Deepgram Flux is running here, and it's close, so it's going to move really, really fast. And when it detects an end of turn, it sends back a stream of data that includes the end-of-turn transcript. When that comes back, that's when I say, "Okay, put this in the queue," because now I need to go use Zod. Zod's a schema library. It should actually be in here; I'm surprised it's not. That's how I'm doing the patch, right? I'm saying, here's the current schema, filled out this way in the form, and I want you to make changes to that form. And as you saw, maybe I'm not doing that the best way; maybe that's slow. This is what's happening, and I worked on this with the coding agent, so it looks right, but I think there may be some things in there that I need to work through. And then I kick off one of these workflows: the text-update workflow, with the text from the end of turn. Then there are these callbacks, and we use the workflow to update when the thing has happened, because that could take time. Again, what I was trying to show here is that you can kick off a whole bunch of things with the workflow, and the agent itself has workflows built into it now; it's called an agent workflow. It kicks off, it comes back, and I can update the profile as the text comes through. So it's a really nice fan-out method: I can throw a whole bunch of things in there and they can all run in parallel. I know we were talking earlier about how many connections we can have; we can run a whole bunch of things in parallel, and it will come back to that one.
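The two steps Craig just walked through, pulling the end-of-turn transcript out of the message stream and then patching the stored profile from it, can be sketched together. Two loud assumptions here: the Flux message shape (a `TurnInfo` message whose `event` is `"EndOfTurn"`) reflects my reading of Deepgram's docs and should be checked against the current API reference, and the plain merge below is a simplified stand-in for the repo's Zod-validated patch:

```typescript
// 1) Filter the Flux message stream for the end-of-turn transcript.
//    Field names are assumptions based on Deepgram's Flux message format.
interface FluxMessage {
  type: string;        // e.g. "TurnInfo"
  event?: string;      // "Update" for interim results, "EndOfTurn" when done
  transcript?: string;
}

function endOfTurnTranscript(raw: string): string | null {
  const msg = JSON.parse(raw) as FluxMessage;
  if (msg.type === "TurnInfo" && msg.event === "EndOfTurn" && msg.transcript) {
    return msg.transcript; // this is what kicks off the update workflow
  }
  return null; // interim messages: useful for live captions, skipped here
}

// 2) Merge the LLM's proposed changes into the stored profile.
//    (The repo validates the patch with a Zod schema; this merge just
//    shows the shape of the step.)
interface Profile {
  timezone: string;
  prefersEmail: boolean;
}

function applyPatch(current: Profile, patch: Partial<Profile>): Profile {
  const next = { ...current };
  for (const key of Object.keys(patch) as (keyof Profile)[]) {
    if (patch[key] !== undefined) (next as any)[key] = patch[key]; // skip unset fields
  }
  return next;
}
```

A full turn then flows: audio → end-of-turn transcript → LLM proposes a partial patch such as `{ timezone: "PST" }` → `applyPatch` → the agent broadcasts the new state.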
So it's kind of fan out and come back, right? A really powerful pattern that we're just starting to play with here, because it's pretty new in the Agents framework. [clears throat] Yeah, Craig, sorry to interrupt, but one thing here: if you want the interim results, those are also available. What I mean by interim results is, say I say, "I love building... I'm thinking with Python, not with C++." There was a pause, and you are getting the interim results, and if you want to save those interim results or start working with them, you can do it right away. So if someone doesn't want to wait for an end of turn and wants to start some processing on the back end for some reason, like live transcription, stuff like that, that's doable too. Nice. Right. So those messages coming through: I'm specifically looking for end of turn and then doing some work, but there are other messages coming across that WebSocket, like each word and how confident it is that it is that word. Yes. And actually, I think we had somebody in the chat talking about medical use cases, and that's probably something that's pretty dangerous, but you can control that a little bit, right? Yeah, that's actually a great thing. If it is a medical thing, that's a domain-specific thing, which means there will be some key terms or keywords, and you can specify those keywords so you are not going to mess up the important words for your domain. And that's this key term field here. You can pass multiple key terms. Now, say I am a CS person and I can talk about different CS fields, but I have no idea what the medical terminology stands for, and I could mispronounce it or spell it incorrectly. You don't want your bot to do that. To avoid that, you can use key terms.
And the fun thing with it is that you don't have to append the full dictionary to this list of medical keywords. That can be a lot. You can actually append only the necessary keywords for this specific conversation. Say we are now discussing... That's clever, like contextually. Exactly, contextually. So, from the dictionary, say this patient is specifically discussing the last prescribed antibiotic, and say that was amoxicillin. You are going to add just "amoxicillin," to make sure it is spelled out correctly and recognized correctly. Can you spell amoxicillin? I'm just kidding. Why? Why do you have to do that to me? No, I was just saying, because of course I can pronounce it correctly. But say I was reading my pills: "I need a refill of blah blah blah," and I don't know what that is. This is going to help it detect it, because you know your prescription. Is that right? Oh, that's so cool, because it's personalized to me. So again, in the agent space, this is a nice tool: I know that Craig is on these medications, here are the key terms he might say in this conversation, so let's preload the key terms with what's happening. That's awesome. That's really neat. So then you don't get the "I'm sorry, I don't understand"; not to go back to our old voice agents, but it's a different world. It's a different world, and it will not keep saying, "I don't know what you were saying." We're going to get past that. That's awesome. I love it. Do you know how big that field can be? I think that's the question I'd have if I were in the audience right now: how many key terms can you pass?
I think it's 100 key terms per request. Okay. That makes sense. And again, I set this when I kick off the stream, right? So to that point, you know what, you're making me think: one of the powerful things about agents is they get better over time. You could figure out whether your person is interrupting you or talking too quickly: I know now that this specific person takes longer to talk. So I could tweak these with AI, find the best speed for this person over time, so we get fewer and fewer interruptions or angry responses. We could do all sorts of stuff on the back end, tracking what we want to do for these end of turns. I think that's cool. I hadn't even thought about that. Sorry to interrupt, but I just found that we can actually support up to 500 tokens in the key terms. So not 100; it's up to 500. 500! Awesome. And you were not going to wait for my end of turn! Yuck. I am so tired. There's no end of turn coming in the future; he is such a rambler. No, actually, that's the thing: I'm a human, so you are not angry with me, I hope. I'm not angry. I am not. I could never be; you helped show me all this cool stuff. I can't wait to build what's next. I do want to work with more Deepgram models, like doing text-to-speech. Yes, let's make text-to-speech next, so that this one is talking back to us: "I have saved your profile, and we are deliberately trying to send you an email where you don't like it," and see how it's actually interrupting us. Yeah. And let's wire up an agent that gets better at knowing when the end-of-turn stuff happens. I really like that. I think that's a really cool, fun idea we can do there.
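The key-term preloading Gina describes (up to 500 tokens per request) is a connection-time option. Here is a sketch, assuming Deepgram's key-term prompting is exposed as a repeated `keyterm` query parameter; verify the parameter name against the current API reference before relying on it:

```typescript
// Add per-conversation key terms to a Deepgram listen URL so that
// domain words ("amoxicillin") are transcribed and spelled correctly.
// The repeated `keyterm` parameter is an assumption based on Deepgram's
// key-term prompting docs; check the API reference for the exact name.
function withKeyTerms(baseUrl: string, terms: string[]): string {
  const url = new URL(baseUrl);
  for (const term of terms) url.searchParams.append("keyterm", term);
  return url.toString();
}

// Preload only what this specific conversation needs, not a whole dictionary:
const listenUrl = withKeyTerms(
  "wss://api.deepgram.com/v2/listen?model=flux-general-en",
  ["amoxicillin"],
);
```

This is the "personalized to me" part: an agent that knows Craig's prescriptions can preload exactly those terms when it opens the stream.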
Also, a fun experiment I like is to balance latency against conversational safety. By conversational safety I mean: say we are in a room full of friends, and if there is an awkward pause, someone starts talking. But how long a pause is an awkward pause, right? So is it just that we want lower latency, so we are going to start talking, or is this audience a slow speaker we should respect? Play with those two values and see which is best suited for the use case. Talking about real-life use cases: I just try to [clears throat] listen to what the actual audio is, play with it, and do parameter tuning, finding the best set of parameters where the bot doesn't make awkward interruptions. Right. Do you think this is the future? Is voice... are we getting rid of the front end altogether? I don't believe that, no. I don't think we will get rid of everything, but voice will enhance how we are interacting, because seeing is still a very important part of how we perceive the world. So I think the front end will still be there, but there will be a way to communicate with that front end. The front end is not just an abstraction or a device sitting there; the front end is changing in front of your eyes, and that will be more user-friendly, more exciting. At least for me. I think there's this bit that I've always wanted. I've been making websites for a long time, and you show somebody something and they say, "Can we make the font a little bigger?" I want them to say, "Can we make the font a little bigger?" and it gets a little bigger. I want that. That is what people want.
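The latency-versus-conversational-safety tuning Gina describes happens when you open the Flux connection. A sketch, assuming the Flux end-of-turn parameters are named `eot_threshold` and `eot_timeout_ms` (that matches my reading of Deepgram's docs, but check the current API reference):

```typescript
// Build a Deepgram Flux WebSocket URL with end-of-turn tuning.
// Parameter names are assumptions; verify against Deepgram's docs.
function fluxUrl(opts: { eotThreshold: number; eotTimeoutMs: number }): string {
  const url = new URL("wss://api.deepgram.com/v2/listen");
  url.searchParams.set("model", "flux-general-en");
  // Higher threshold: the model waits for more confidence that the speaker
  // is done, trading latency for fewer awkward interruptions.
  url.searchParams.set("eot_threshold", String(opts.eotThreshold));
  // Hard cap: declare end of turn after this much silence regardless.
  url.searchParams.set("eot_timeout_ms", String(opts.eotTimeoutMs));
  return url.toString();
}

// "Tell Craig I'm 5 minutes late to lunch tomorrow" wants a patient agent:
const patient = fluxUrl({ eotThreshold: 0.9, eotTimeoutMs: 5000 });
// A snappy live-caption view can afford to jump in earlier:
const snappy = fluxUrl({ eotThreshold: 0.5, eotTimeoutMs: 2000 });
```

These two values are exactly the pair Gina suggests experimenting with per use case, and per speaker.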
I feel like we're in the space where it's, "Can we change that text to say this?" and they just say it, specifically and in context. We have the key terms, we know when to stop, and then it just goes and updates. I feel like that's a cool app. It's like mixing in the coding agents, but not for us developers; you're allowing more people to play in this new space that we're in. Making it more accessible based on the needs of the user. Yeah. Are you seeing people do long-form content with this? Like, could I just narrate a blog? Yeah, you definitely can. Or instead of a blog, how about a personal diary: at the end of the day you're too lazy to write it, but you just take a voice memo, and that voice memo is saved, not lost under thousands of files on your phone. And if, say, ten years down the road, you want to see what you were doing and what you were feeling, it's there. That's awesome. And you could also use that to train some stuff as well; I always lean into AI. Depends on how personal you get with it, whether you're willing to share that information with your agent. Yeah, it depends on how you want it. I really like that part of using agents or AI in my life: how much I want to share, and how I want it. Do I want it just to transcribe, or do I want it to generate a summary of my last month? Say I'm just transcribing my work: I completed this, this, and that. Then after the week, instead of me looking back over the whole thing and writing the weekly summary, I ask my agent, "Summarize this week," and I can put a note there for myself, for my teammates, and for the leadership. It's just simpler. I've been driving around and using some voice stuff in my car, right?
Like, safely driving around, but also, to that memo point: "Hey, help me remember to do this thing," or "Hey, let's research this. What does end of turn mean?" You're basically getting an interactive podcast, if your LLM's good enough, and it's getting better as it goes. I often write code in the car with my voice, and I think we're getting to what you were saying earlier: "I want to build this thing," and if you talk to it, it works really well, and it can ask questions, and the questions are all punctuated right. Speech-to-text feels so rich these days. I think we have a lot of that future coming. I really think this is such a cool input and output that will enable things like accessibility. Text-to-speech is also getting pretty good right now. You can also play with how fast or slow you want it to respond; accents were there already, but you can also play with the speed: "I want a fast response," or "When you are speaking with me, speak slowly." So you can control that? Yes, you can control that. That's neat. That's like a new style, right? Almost like CSS, but for how I want my agent, my response, to sound. I like that analogy. Yeah, CSS for agents. That's cool. Speaking agents. Oh, that's really neat. Let's build that; let's work with that a little. Let's do it. Let's see. You just keep talking to it until you tune it; you tune it with your voice, you have it speak something back to you, and you keep on talking back to it. That'd be kind of cool. So by "my voice," do you mean voice cloning? Do you do that too? No, not yet. Okay.
I mean, not yet. I'm not sure; maybe if clients want it a lot, we might do it down the road, but not voice cloning yet. But you can pick what you want. What does that mean? You want a calmer accent, an enthusiastic person, or a pretty excited one; you want it to respond slowly. A sports commentator versus a healthcare representative: they should speak differently. Absolutely. They should. Oh, that's a weird trip to the doctor if they're doing sports commentary on your visit. Yeah: "And he's coming in with a broken leg, here he comes!" That would be fun for a day, but every day, maybe not. Probably not. And for your personal readme, or not really readme, your personal assistant that's talking to you on your car trip: if it's a busy road with jam-packed traffic, it can just speak slowly, because you have to navigate through too many... agitated drivers, let's just say that. But when you are on a long drive on a free road and you want to learn something, you can say, "Okay, I'm less occupied with my driving now, so you can speak faster," and it can adjust like that. It's so cool. It's such a neat time, and I think we're just starting to scratch the surface of what these tools can do. I don't know what everybody in the audience is thinking about what you could build with this, but I hope you saw from these talks that you can literally build anything. You can control both sides of what's coming in and out, right? You're controlling the style on both sides; you're controlling how long it waits when you need the information from where they're at. If you need it immediately, they don't even need to finish their sentence; you can get each one of those words as it goes.
If you're building something with that... when you started saying that, I thought, "Oh, that's kind of fun." Have you seen that game? I've got a 14-year-old, so there's a game where they put a bunch of emojis up and you have to say the emoji as it happens. Have you seen that? Oh, yes. You could make that game, right? Have you made that game? You play that game; you knew what I was talking about. I knew what you were talking about, but I'm old, so I haven't played it. You can't move that fast. I hear it from the other room: "apple, banana, cat, dog," and you have to try to hit the thing at the right time. So you could imagine that's an app you could build on this stuff. I don't know if that's exactly what you want to do, but it's a thing. Recognition along with... so once the image is recognized, you just say the name, that's it. Yeah. Oh, you could probably cheat: you're suggesting cheating the game by having the other side do text-to-speech, so it does image recognition and it can always win. You can impress everybody with that side of it. I told you I'm old, I don't play games; if I have to win, I will just find something. You're going to reach for an API. I see how you work, Gina. I get it. That's super fun. Everybody, we're coming up to the end of this first WorkerShop. I hope you had a good time here. We do have access to the app; I'll share it if you want to give it a spin. You can come here, and we'll send out an email with a link to it as well, with all sorts of information. Gina, thank you so much for, first, helping me build this app, helping me make my dream come true,
and for coming on and sharing your knowledge, doing your knowledge drop. It was really fun. Thank you so much, Craig, for inviting me. And thank you, everybody, for hanging out. If you're watching this in the future, make sure you click into those links and come play along. The footer of the app has links to Deepgram, links to the Workers AI models, and also links to the code that was generated. I want everybody to know that I not only said I wanted to use the Deepgram models and the agents; it went and built the things I wanted, and when I talked about my concerns, it helped me write those workflows. It was really fascinating watching that whole thing happen. So I'd love to see more of this. We saw a comment here; let's look at it really quick, since we're just about ready to close. Last words: "Building a design agency with a colleague of mine, where part of the process involves editing AI-generated images." Oh, that's cool. And use it for TTS. Yeah, exactly. "Make that a different dog." I don't know what you're generating, and I don't know why I said dog, but say there's a picture of a dog and you want to make the dog longer, make the [laughter] dog... "Change the Maltese to a husky." There you go. And it switches the dog out just by you talking to it with the image on the screen. I think that's super powerful, and it's a great use of the end of turn, same as before: you say what it is you're trying to do, and the thing goes and changes as it happens.
So, an awesome idea, and I think there are so many of those. First of all, the fact that we can generate an image at all is relatively new, and being able to generate an image with your voice is also pretty great, and you could also generate it and then put it in the background. Henry, thank you for showing up; we have somebody who just showed up two minutes before this is over. Thank you, Henry, for being here. Appreciate you always showing up, even if it's at the last minute. This will be recorded. I'm glad we have the opportunity to say this will be recorded, or has been recorded, and we will be sharing it out with links to the readme and more information. Yeah. [laughter] Exactly. And to William's point, just one more thing: we were talking about a dachshund. He said, "Make the dog hot," and it makes a hot dog, right? But we could put in a key term that it was a dachshund we were talking about, and then it probably wouldn't accidentally draw a hot dog; it would probably know about the dachshund. I'm hoping that's what would happen. That's the hope. I'm not sure what the image generator will do, but I hope it would get it from the context. All right. So we will be back, because Gina and I are going to build lots of stuff. Voice is only picking up speed, and in my brain, I love to build new stuff, and I'm going to reach for voice all the time. Thank you, Gina, for introducing me to that. Sure thing. Awesome. Thanks, everybody. We will see you on the next WorkerShop. Bye. All right, let's just hang out.
