I Built a Voice AI Agent That Runs on Cloudflare Workers
Introduces a customer support voice agent that can talk, call tools, and operate in real time.
A real-time voice AI agent runs on Cloudflare Workers, talks back, and calls tools like order lookup and returns in live conversations.
Summary
Cloudflare Developers walks through building a customer-support voice agent that users can talk to, and which can invoke real-time tools. The demo uses the Gemma model from Workers AI to power the LLM conversation, with a system prompt that normalizes data such as converting order IDs to numbers. The implementation leans on Cloudflare’s voice SDK components, including withVoice for full speech-to-text and text-to-speech, and withVoiceInput for transcription-only flows. UI integration is handled via React hooks (useVoiceAgent) and a simple status/transcript model to show idle, listening, thinking, and speaking states. Three small endpoints (look up order, request return, and check product availability) drive the tool calls, and the app demonstrates how to route tool outputs back into the chat in real time. The talk also covers practical touches like preventing Markdown in voice responses and logging tool invocations for debugging. Finally, it wraps up with tips on giving example references in tool descriptions so the model picks the right tool, and on using a framework-agnostic voice client for broader framework support.
Key Takeaways
- Gemma model from Workers AI powers the LLM conversation, enabling contextual voice interactions within Cloudflare Workers.
- withVoice provides end-to-end speech-to-text and text-to-speech for natural, real-time voice conversations.
- withVoiceInput offers transcription-only input, letting the agent capture user speech without producing a spoken response.
- Three endpoints—look up order, request return, and check product availability—are wired as tools to demonstrate live tool calls.
- Normalization is applied in the system prompt to convert order IDs to numbers, ensuring consistent tool handling.
- Markdown output is disabled for voice responses to keep audio output clean and natural.
- A framework-agnostic voice client and React hooks (useVoiceAgent) make the UI and integration straightforward for developers.
Who Is This For?
Essential viewing for developers building voice-enabled assistants on Cloudflare Workers, especially those curious about integrating tool calls, speech interfaces, and a modular SDK workflow.
Notable Quotes
"I just built a customer support voice agent. You can talk to it and it can also call tools in real time and it's all built with Cloudflare."
—Opening summary of the project and capability.
"We have withVoice. It's a full voice agent with speech-to-text as well as text-to-speech."
—Describing core SDK components.
"I've also added do not return markdown because sometimes it kept returning markdown and I didn't want my voice agent to respond back markdown characters with voice."
—Notes a practical UX consideration.
"These are quick small endpoints that I've written for creating a fake API/CRM of sorts."
—Explains tool endpoints and fake API usage.
Questions This Video Answers
- How do I run a voice AI agent on Cloudflare Workers with cloudflare/voice SDK?
- What is the Gemma model from Workers AI and how does it drive conversations in voice agents?
- How can I integrate tool calls (order lookup, returns, product availability) into a voice assistant using Workers AI?
- How do I prevent Markdown in voice responses for a speech-based agent?
- What are the benefits of using a framework-agnostic voice client in Cloudflare projects?
Cloudflare Workers, Cloudflare Voice SDK, Workers AI, Gemma model, speech-to-text, text-to-speech, tool calls, React hooks, AI SDK, tool endpoints
Full Transcript
I just built a customer support voice agent. You can talk to it and it can also call tools in real time and it's all built with Cloudflare. Here's how it works. Hi there. Welcome to Acme support. I can help you track an order, start a return, or check if a product is in stock. What can I do for you? Can you look up order 9831? I found your order. It was delivered and it included two pairs... Yeah, can you just initiate a refund for it? I can certainly help with that. Could you just let me know the reason you'd like to return the order?
They just don't fit me. I started that return for you. You should receive... Thank you. Also, can you check if you have any notebooks available? Yes, we have classic line notebooks from Paper Vault in stock for $26.79. Would you like me to help you with anything else? What was my first message again? Your first message was asking if I could look up order 9831. Okay, thank you. You're very welcome. Feel free to reach out if you need anything else. Have a great day. And you saw how it was able to process the context.
It was able to invoke tool calls to third-party services. And it also handled interruptions very gracefully. All of this is possible with Cloudflare's new voice SDK. The Cloudflare/voice SDK comes with a lot of nice little mixins and abstractions for you to use in your app. We have withVoice. It's a full voice agent with speech-to-text as well as text-to-speech. Then we have withVoiceInput. It's a speech-to-text only option, so you get transcription without a response. We also have React hooks that go along with this very nicely. And there's also a framework-agnostic client called voice client that you can use in any JavaScript framework.
Let's dig into some code. We're importing withVoice and Workers AI Flux STT. It's the Deepgram model that comes with the Workers AI platform, so we don't need any API keys to use it. And we have Workers AI TTS and the pipes for the turn server. We're also using AI SDK to invoke tool calls, which I'll explain in a bit. Then we are also importing createWorkersAI, the Workers AI provider for AI SDK. And then we also need Zod for creating these tools. So, let me show you what the app actually looks like. We have a voice agent class, which extends the base voice agent class.
We have a transcriber and a TTS method, and we have an onCallStart hook that will run before the call starts. So, you can use this to, say, send a greeting to a user. You can even use it for analytics. So, anything that you want to do right before the call starts. And then we have the onTurn method, where a lot of the agentic loop runs. So, all of the tool calls that we have written will run within the onTurn method. So, I'm actually using the Gemma model from Workers AI for the LLM conversation.
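To make the shape of that per-turn loop concrete, here is a minimal, self-contained sketch in plain TypeScript. It is not the SDK's implementation: `runTurn`, `ModelStep`, `ChatMessage`, and the stub model are hypothetical names invented for illustration, and the default cap of five steps mirrors the step limit mentioned later in the walkthrough.

```typescript
// Hypothetical sketch of a bounded tool-calling loop; not the Cloudflare voice SDK or AI SDK API.

// A model step either asks for a tool call or produces the final reply text.
type ModelStep =
  | { kind: "tool"; toolName: string; args: Record<string, unknown> }
  | { kind: "text"; text: string };

type ChatMessage = { role: "user" | "assistant" | "tool"; content: string };
type Model = (messages: ChatMessage[]) => ModelStep;
type ToolFn = (args: Record<string, unknown>) => string;

function runTurn(
  userText: string,
  model: Model,
  toolset: Map<string, ToolFn>,
  maxSteps = 5,
): { reply: string; steps: number; messages: ChatMessage[] } {
  const messages: ChatMessage[] = [{ role: "user", content: userText }];

  for (let step = 1; step <= maxSteps; step++) {
    const next = model(messages);

    // Final text ends the turn; in a voice agent this is what gets spoken.
    if (next.kind === "text") {
      messages.push({ role: "assistant", content: next.text });
      return { reply: next.text, steps: step, messages };
    }

    // Otherwise run the requested tool and feed its output back to the model.
    const tool = toolset.get(next.toolName);
    const output = tool ? tool(next.args) : `Unknown tool: ${next.toolName}`;
    messages.push({ role: "tool", content: `${next.toolName}: ${output}` });
  }

  // Step cap reached without a final answer: stop instead of looping forever.
  const fallback = "Sorry, I could not finish that request.";
  messages.push({ role: "assistant", content: fallback });
  return { reply: fallback, steps: maxSteps, messages };
}

// Stub model: calls one tool, then answers using the tool output.
const stubModel: Model = (messages) => {
  const last = messages[messages.length - 1];
  if (last.role === "tool") {
    return { kind: "text", text: `Here is what I found. ${last.content}` };
  }
  return { kind: "tool", toolName: "lookUpOrder", args: { orderId: 9831 } };
};

const demoTools = new Map<string, ToolFn>([["lookUpOrder", () => "delivered"]]);
const demoTurn = runTurn("Can you look up order 9831?", stubModel, demoTools);
console.log(demoTurn.reply);
```

The cap matters because a model that keeps requesting tools would otherwise never hand control back to the caller, which in a live call means silence.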
I also have a system prompt. So, yeah, this is like a nice little system prompt. I'm also adding a little bit of normalization to the prompt, so in case the order numbers come in as strings, I'm just prompting the agent to convert them to numbers, so it's normalized. And the fake API that I've written will automatically pick it up. And yes, I've also added do not return markdown because sometimes it kept returning, I mean, Gemma kept returning markdown, and I didn't want my voice agent to respond back markdown characters with voice.
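Both prompt rules can also be backed up in code as a safety net, since a model does not always follow its instructions. The sketch below is not from the video: `normalizeOrderId` and `stripMarkdown` are hypothetical helper names, and the Markdown stripping is deliberately crude (it also removes underscores inside words).

```typescript
// Hypothetical helpers; not part of the Cloudflare voice SDK.

/** Coerce an order ID that may arrive as "9831", "#9831", or "order 9831" into a number. */
function normalizeOrderId(raw: string | number): number | null {
  if (typeof raw === "number") {
    return Number.isInteger(raw) && raw > 0 ? raw : null;
  }
  const match = raw.match(/\d+/); // first run of digits in the string
  return match ? Number.parseInt(match[0], 10) : null;
}

/** Remove common Markdown syntax so a TTS engine does not read symbols aloud. */
function stripMarkdown(text: string): string {
  return text
    .replace(/```[\s\S]*?```/g, " ")             // fenced code blocks
    .replace(/!?\[([^\]]*)\]\([^)]*\)/g, "$1")   // links and images: keep the label
    .replace(/^\s{0,3}#{1,6}\s+/gm, "")          // heading markers
    .replace(/^\s*[-*+]\s+/gm, "")               // bullet markers
    .replace(/[*_`~]/g, "")                      // emphasis, inline code, strikethrough
    .replace(/\s+/g, " ")                        // collapse whitespace and newlines
    .trim();
}

console.log(normalizeOrderId("order #9831"));
console.log(stripMarkdown("**Order 9831** was `delivered`."));
```

Running the reply through a filter like this before it reaches text-to-speech keeps the spoken output clean even when the prompt instruction is ignored.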
And yes, we have some text and some steps. Okay, let me go to the e-com tools that I've written. So, I've written three tools: look up order, request return, and check product availability. So, one quick tip is to always give references in your tool descriptions, so it's easy for the models to pick up the right tool, because sometimes they hallucinate and keep missing the right tool. So, for example, I have given the order ID example format, so that it picks up correctly when I say, or the user says, "Can you look up order 1234?" And it'll automatically pick this tool.
Similarly, I have a request return. And then there's also check product availability. These are quick small endpoints that I've written for creating a fake API/CRM of sorts. And then we are creating a base voice agent class and using it, I mean, extending it, in the primary voice agent class. And these are the tools that I am passing to the Workers AI SDK, sorry, the Vercel AI SDK's generateText. And I also have a stop when the step count is five. This is a native Vercel AI SDK feature. And these are just some logs that I have, so whenever the tools get called, I just added these for debugging.
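The three tools and their debug logging can be pictured with a small in-memory stand-in. This is a sketch under stated assumptions, not the code from the video: the data shapes, the `callTool` dispatcher, and the `toolLog` array are hypothetical, and the real app defines its tools with Zod schemas and the AI SDK rather than a plain `Map`. Only the order number 9831 and the $26.79 Paper Vault notebook come from the demo.

```typescript
// Hypothetical in-memory fake CRM; field names and helpers are illustrative only.
type Order = { id: number; status: string; returnReason?: string };
type Product = { title: string; brand: string; price: number; inStock: boolean };
type ToolResult = { ok: boolean; message: string };
type Tool = { description: string; run: (args: Record<string, unknown>) => ToolResult };

const orders = new Map<number, Order>([[9831, { id: 9831, status: "delivered" }]]);
const products: Product[] = [
  { title: "Classic Line Notebook", brand: "Paper Vault", price: 26.79, inStock: true },
];

// Each description carries an example phrase, following the tip about tool descriptions.
const tools = new Map<string, Tool>([
  ["lookUpOrder", {
    description: 'Look up an order by numeric ID. Example: "Can you look up order 1234?" means orderId 1234.',
    run: (args) => {
      const order = orders.get(Number(args.orderId)); // accepts "9831" or 9831
      return order
        ? { ok: true, message: `Order ${order.id} is ${order.status}.` }
        : { ok: false, message: "Order not found." };
    },
  }],
  ["requestReturn", {
    description: 'Start a return for an order. Example: "Return order 1234, it does not fit."',
    run: (args) => {
      const order = orders.get(Number(args.orderId));
      if (!order) return { ok: false, message: "Order not found." };
      order.returnReason = String(args.reason ?? "unspecified");
      return { ok: true, message: `Return started for order ${order.id}.` };
    },
  }],
  ["checkProductAvailability", {
    description: 'Check stock for a product. Example: "Do you have any notebooks?" means query "notebook".',
    run: (args) => {
      const query = String(args.query ?? "").toLowerCase();
      const hit = products.find(
        (p) => query.length > 0 && p.inStock && p.title.toLowerCase().indexOf(query) !== -1,
      );
      return hit
        ? { ok: true, message: `${hit.title} from ${hit.brand} is in stock for $${hit.price.toFixed(2)}.` }
        : { ok: false, message: "No matching product in stock." };
    },
  }],
]);

// Log every invocation before dispatching, as a simple debugging aid.
const toolLog: string[] = [];
function callTool(toolName: string, args: Record<string, unknown>): ToolResult {
  toolLog.push(`[tool] ${toolName} ${JSON.stringify(args)}`);
  const tool = tools.get(toolName);
  return tool ? tool.run(args) : { ok: false, message: `Unknown tool: ${toolName}` };
}

console.log(callTool("lookUpOrder", { orderId: "9831" }).message);
```

Logging the tool name and arguments in one place makes it easy to see whether the model picked the wrong tool or passed a malformed ID.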
And that's it. This is everything that you need to create a voice agent. Let's look at the UI side of things. So, I have this small little React page. I am importing the useVoiceAgent hook. So, I'm using a few little things it's exporting. One is the status. So, when the voice agent is idle or listening or thinking or speaking, the status flag will give you that value. Let's actually look at what's inside the status. So, it's idle, listening, thinking, and speaking. So, yeah, that's right. And yes, then we have transcript.
This will contain an array of all the messages the user and the assistant are exchanging. And then we have audio level, which is for the mic volume level. If you have some sort of waveform that you want to show or animate, you can use the audio level. Is muted is a separate flag to mute the mic and not send any voice to the agent. Then the start call method and the end call method, if you want to start and end the call, like the name says. Then we have a toggle mute button to just toggle the is muted flag.
This is just a flag to know when the agent is active or not. It's right here. So, if I click on it, it just toggles the start call and the end call method. The transcript.map actually has a list of all the messages the users are exchanging, and it's reactive, so you should be able to update the UI in real time. And there we go. We learned how to create an app with voice agent capabilities, and it works in real time. We also learned how the Cloudflare/voice SDK works. So, let me know in the comments if you are building something with it, and if you have any questions, drop them in the comments.
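The UI state described in the walkthrough (a status with four values, a transcript array, an audio level, and a mute flag) can be modeled in plain TypeScript without React. This is only a sketch: the type and function names are hypothetical, the message shape is a guess, and treating the audio level as a value between 0 and 1 is an assumption rather than something the video states.

```typescript
// Hypothetical model of the UI state; not the actual return type of useVoiceAgent.
type VoiceStatus = "idle" | "listening" | "thinking" | "speaking";
type TranscriptMessage = { role: "user" | "assistant"; text: string };
type VoiceUiState = {
  status: VoiceStatus;
  transcript: TranscriptMessage[];
  audioLevel: number; // assumed to be in the range 0..1
  isMuted: boolean;
};

/** Map each status to a label the page could show next to the call button. */
function statusLabel(status: VoiceStatus): string {
  switch (status) {
    case "idle": return "Tap to start a call";
    case "listening": return "Listening...";
    case "thinking": return "Thinking...";
    case "speaking": return "Speaking...";
  }
}

/** Turn the transcript array into display lines, similar to a transcript.map in JSX. */
function renderTranscript(transcript: TranscriptMessage[]): string[] {
  return transcript.map((m) => `${m.role === "user" ? "You" : "Agent"}: ${m.text}`);
}

/** Draw a text meter from the mic level, standing in for a waveform animation. */
function levelMeter(audioLevel: number, width = 10): string {
  const clamped = Math.min(1, Math.max(0, audioLevel));
  const filled = Math.round(clamped * width);
  return "#".repeat(filled) + ".".repeat(width - filled);
}

/** Flip the mute flag without mutating the previous state object. */
function toggleMute(state: VoiceUiState): VoiceUiState {
  return { ...state, isMuted: !state.isMuted };
}

const uiState: VoiceUiState = {
  status: "listening",
  transcript: [{ role: "user", text: "Can you look up order 9831?" }],
  audioLevel: 0.5,
  isMuted: false,
};
console.log(statusLabel(uiState.status), levelMeter(uiState.audioLevel));
console.log(renderTranscript(uiState.transcript).join("\n"));
```

Keeping the status as a union type lets the compiler flag any state the UI forgets to handle.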