AI Gateway’s next evolution: an inference layer designed for agents
Meet Ming Louu and learn about her role as a product manager on Cloudflare's AI Gateway team.
Cloudflare’s AI Gateway expands into a unified inference layer with a one-stop catalog, automatic failovers, and unified billing for multiple model providers.
Summary
Cloudflare’s Ming Louu explains AI Gateway’s next evolution: turning it into a unified inference layer for agents. The new approach emphasizes a single API to access multiple models, making it easier to switch providers as models evolve. Ming highlights real-world use cases in generative media and agentic workflows, where multi-model pipelines are common. A key feature is the unified catalog, showing models available through Cloudflare, hosted on Workers AI, or accessed via third-party proxies. The blog also introduces an integration with Workers AI bindings, enabling third-party models alongside Cloudflare-hosted options. Reliability is improved with automatic fallbacks and cascading failure protection, ensuring continued responses when a model provider hiccups. Latency is addressed by keeping calls within Cloudflare’s network for Workers AI models, while the bindings extend to external models. The conversation underscores improved billing and observability, consolidating spend, logs, and metadata in one place for better ROI and governance. Ming finishes by pointing readers to the blog and previewing more innovations coming during Agents Week.
Key Takeaways
- Unified catalog will show models across Cloudflare-hosted, Workers AI, and third-party proxies in one place for easy discovery.
- Binding integration now lets you call third-party models from within Workers AI workflows, expanding beyond Cloudflare-hosted options.
- Automatic fallbacks enable multiple providers per model to prevent total outages in agent workflows.
- Billing consolidation via the AI gateway wallet simplifies spend and invoicing to one Cloudflare invoice.
- Observability improvements centralize logs and allow metadata tagging (e.g., by customer, plan, or agent) for cost and usage insights.
- Latency benefits are realized when using Workers AI models due to network proximity, reducing public internet hops.
- Streaming inferences are resilient to disconnects; AI Gateway buffers and restarts seamlessly to avoid recomputation.
Who Is This For?
Developers, platform engineers, and product leaders building AI-powered agents who need multi-model pipelines, simplified billing, and unified observability in one ecosystem.
Notable Quotes
“We want to make that as easy as possible for people. And also, we want to make it really easy for people to switch out models as these models change.”
—Ming Louu explains the motivation for a unified inference layer and easy model switching.
“We can try your preferred provider and then fall back to some of these other providers so that if Anthropic is having an issue, you're not out of luck.”
—Introduction of automatic failovers across multiple providers.
“There'll be one place you can go to see what models are available through Workers AI, and what models are available as third-party proxies.”
—Announcement of the unified catalog and its scope.
“You just have to load money into your AI Gateway wallet and then any spend that you do across these providers will just draw down from that wallet.”
—Unified billing and simplified vendor management.
“If you're building agents using Cloudflare's Agents SDK and you're using AI Gateway, we've made it so that streaming inferences are resilient to disconnects.”
—Resilience feature for streaming inferences.
Questions This Video Answers
- How does Cloudflare AI Gateway unify billing across multiple model providers?
- What is the unified catalog in Cloudflare AI Gateway and how do I use it?
- How do automatic failovers work when a model provider experiences downtime?
- Can I call third-party models from Cloudflare Workers using AI Gateway bindings?
- What latency benefits do Workers AI models offer for agents?
Cloudflare AI Gateway, Agents SDK, Workers AI bindings, Unified catalog, Model providers, Automatic failover, Billing and observability, Streaming inferences, Latency optimization
Full Transcript
Hello everybody, welcome to Agents Week. I hope that you read the blog we're about to talk about right now. It is awesome. And I am here with the author of the blog, and it happens to be her first blog, Ming Louu. Can you please introduce yourself and tell us what you do here at Cloudflare? Yeah, hi everybody. I'm Ming. I am a product manager on our developer platform team, specifically for AI Gateway. I'm relatively newer to Cloudflare; I joined back in December through Cloudflare's acquisition of Replicate. I was leading product at Replicate, and now I'm here.
We are so glad to have you here, Ming. You've been building all sorts of incredible stuff, some of which is talked about here in the blog, so let's get into it. Overall, what problem are we solving here? Yeah, so what we're trying to do is lean into AI Gateway more as this unified inference layer. We have a lot of customers who are currently using AI Gateway as a way to proxy requests to model providers to get that observability.
But we're now leaning more into this being one API that you can use to access a bunch of different models. If you're trying to build for a real-life use case and solve a real-life problem, you're often having to use a bunch of different models, not just one model from one provider. And we want to make that as easy as possible for people. We also want to make it really easy for people to switch out models as these models change, right?
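The single-API idea Ming describes can be sketched like this in TypeScript, assuming a hypothetical account ID and gateway name, and assuming the gateway exposes an OpenAI-compatible `compat/chat/completions` path; treat the exact path and model ID strings as assumptions, not confirmed API details:

```typescript
// Minimal sketch: one gateway endpoint, many models. ACCOUNT_ID, GATEWAY_ID,
// and the compat path are hypothetical placeholders.
const ACCOUNT_ID = "your-account-id";
const GATEWAY_ID = "my-gateway";

// Build a gateway URL for a given provider path (pure, so it is easy to test).
function gatewayUrl(accountId: string, gatewayId: string, providerPath: string): string {
  return `https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/${providerPath}`;
}

// Switching models is just a different string; the surrounding code is unchanged.
async function ask(model: string, prompt: string, apiKey: string): Promise<string> {
  const res = await fetch(gatewayUrl(ACCOUNT_ID, GATEWAY_ID, "compat/chat/completions"), {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  const data: any = await res.json();
  return data.choices[0].message.content;
}
```

Swapping the best coding model three months from now is then a one-string change at the `ask` call site, e.g. from an `"openai/..."` model ID to an `"anthropic/..."` one.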
The model landscape is getting better and changing so quickly. The best coding model today might not be the same model, or even from the same provider, in three months, so we want to make that experience for developers as easy as possible. Awesome. Are we getting specific customer feedback that's driving that? We worked with so many customers, especially at Replicate, that were building out these real-world use cases. Especially if you look at generative media, people using image models and video models, what we'll see is that if someone's trying to get to a video output, they very rarely go from text to video directly. Oftentimes people are generating an image, or taking a pre-existing image, and then feeding that image into a video model so that they can really control what the first frame and the last frame of their output is.
So especially in this generative media use case, there are a lot of models you need to create one workflow. And even if you look at the more agentic use cases people are building, if you take something like a coding agent, people will often use a very large model to do planning but then hand off execution to smaller models that might be cheaper, or certain models are better for coding versus better for writing. So in all the use cases you see, it often requires multiple models.
Nice. By having it through one place and thinking about it through one place, that solves a lot of those problems. In the blog post we talked a little bit about cascading failures in an agent workflow. Can you explain what that is? Yeah, sure. If you think about a very simple use case, if you're building a very simple chatbot, when the user gives one prompt or asks a question, that generally gets translated into one inference request, like "what is the capital of the United States?" That just gets sent to the LLM and you get a response back.
If you're building an agent, it's very rare that one inference request will get you what you need. A typical request, like solving a customer support question, might involve looking at the support question, then calling an MCP server to look at the documentation for that part of the product. It might involve looking up that customer and what the current setup of their account is, and then looking at both those things. Maybe it would involve looking up a third thing, like how much money they've paid you.
Basically, the agentic workflow by nature often involves a multi-step approach where the output of a previous step feeds in as the input of the next step. So if one step really early on fails, that prevents the rest of the flow from going, and if you have to restart that loop, it makes all the steps you've done in the beginning somewhat wasted, because you have to remake those requests. That's what we mean by cascading failures.
Okay, cool. And this helps fix that? Yeah, I think it adds another layer of reliability to that inference layer, so that if there's a problem at the model provider level, you're not dealing with it directly. One thing we're trying to build into AI Gateway is this concept of automatic fallbacks, or automatic failovers, for a particular model that is served on different providers. For example, if you look at some of the Claude models, they're served through Anthropic, of course, but they're also served through, I believe, Bedrock and Google Vertex. So for a model that is served on multiple providers, we can try your preferred provider and then fall back to some of these other providers, so that if Anthropic is having an issue, you're not out of luck and you can still get a response, because we'll try some of these secondary providers.
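That fallback chain might be expressed like the sketch below, loosely modeled on a gateway accepting an ordered array of provider attempts for the same model; the provider names, endpoint paths, and header shapes here are illustrative assumptions, not the exact wire format:

```typescript
// Illustrative sketch of an ordered fallback chain for one model served by
// multiple providers. Endpoint paths and header shapes are assumptions.
interface ProviderStep {
  provider: string;
  endpoint: string;
  headers: Record<string, string>;
  query: unknown;
}

// Preferred provider first, backups after; the gateway would try each in order.
function fallbackChain(prompt: string, keys: Record<string, string>): ProviderStep[] {
  const messages = [{ role: "user", content: prompt }];
  return [
    {
      provider: "anthropic",
      endpoint: "v1/messages",
      headers: { "x-api-key": keys.anthropic, "anthropic-version": "2023-06-01" },
      query: { model: "claude-sonnet-4", max_tokens: 512, messages },
    },
    {
      // Hypothetical backup serving the same underlying model.
      provider: "aws-bedrock",
      endpoint: "model/anthropic.claude-sonnet/invoke",
      headers: { Authorization: `Bearer ${keys.bedrock}` },
      query: { max_tokens: 512, messages },
    },
  ];
}
```

Sending an array like this asks the gateway to try Anthropic first and fall through to the backup if that attempt fails, which is what keeps a single provider outage from cascading through an agent run.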
And that's awesome. And that's super important in that agent workflow, like you were saying, because if one provider goes down, you don't want everything to stop; it just switches over. Super cool. Awesome. One thing that I'm super excited about is this one catalog, this one endpoint. Let's talk a little bit about what that's bringing to us as developers. Yeah. So today, we've had AI Gateway out for a while, but AI Gateway has never had its own model catalog.
You just kind of had to know what models were available. And we've heard a lot of customer feedback: people want to know what models they can call and what models are supported through unified billing. When a new model comes out, how do I know that it's out now and that I can use it? So we're releasing what we're calling a unified catalog for all of the inference that you can run through Cloudflare. There'll be one place you can go to see what models are available through Workers AI, as in hosted on Cloudflare's GPUs, and what models are available as third-party proxies.
And so I think that'll just be a really great way to help people discover what models are available, because you might not know the exact model you want to use, and you might not know all the models Cloudflare offers, or even all the models a given provider offers. Yeah, and then also use them within one interface. Right. So the big thing that we're launching here for AI Gateway is an integration with Workers AI bindings, so that in a Worker, where before you could call Workers AI, the Cloudflare-hosted models, you can now also call third-party proxied models through the bindings interface.
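A minimal Worker sketch of that bindings interface: the third-party model ID and the `gateway` option shape are assumptions for illustration, and only the general pattern of one `env.AI.run` call serving both hosted and proxied models is the point:

```typescript
// Sketch: one binding, both Cloudflare-hosted and third-party models.
// Model IDs and the gateway option shape are illustrative assumptions.
interface AiBinding {
  run(model: string, inputs: unknown, options?: unknown): Promise<unknown>;
}
interface Env {
  AI: AiBinding;
}

const worker = {
  async fetch(_req: Request, env: Env): Promise<Response> {
    const options = { gateway: { id: "my-gateway" } }; // hypothetical gateway name

    // A Cloudflare-hosted model, served on Cloudflare's GPUs:
    const hosted = await env.AI.run(
      "@cf/meta/llama-3.1-8b-instruct",
      { prompt: "Summarize AI Gateway in one sentence." },
      options,
    );

    // A third-party model through the same binding interface (illustrative ID):
    const proxied = await env.AI.run(
      "anthropic/claude-sonnet-4",
      { prompt: "Summarize AI Gateway in one sentence." },
      options,
    );

    return Response.json({ hosted, proxied });
  },
};

// In a real Worker this object would be the default export.
```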
That's awesome. Just in case somebody hasn't seen or thought about that binding, let's take a moment here, because this is something rad that I think might have snuck through a little bit. So you can now run models that aren't just on Workers AI with the binding, is what we're getting at. What kind of models are we talking about there? Yeah. Basically, we've tried to make this initial launch set of models the state-of-the-art models that people are using for their real-world use cases.
So we'll have all the models from Anthropic, OpenAI, Google, all the main LLMs, but we're also really expanding our model catalog to other multimodal models: image, video, voice, so speech-to-text and text-to-speech. I think we've even got a music model in there. So it's really this full-featured model catalog, whether you want to do things with LLMs and agents or whether you want to do generative media. Awesome. And then developer-wise, how long does it take to switch between those things?
What does that flow look like? Yeah, if you're calling an image model and you want to switch to a different image model, it's now super easy to do. You just have to change the model ID. So going from Google Nano Banana to Black Forest Labs is just changing that string, and then basically all, or nearly all, of the other parameters just work. That's awesome.
So I think there's a bigger conversation here about what that does, what that feels like. I know that in the past it was: am I paying for OpenAI? Am I paying for Anthropic? I've got to submit my expense for this and my expense for that. That gets us into unified billing, which I think is a great place to jump. How does that change the conversation for a developer like myself?
How does that change the conversation for building teams, for developers and their billing teams? Yeah. So before, oftentimes people start out with, okay, I'm just going to use OpenAI, and they're doing their thing, and then you want to improve, but you realize you need to add on other models, or you want to use a different model for a different part of your workflow. That would involve opening an account with this other model provider, getting an API key, putting in a credit card, or setting up invoicing.
If you're at a larger company, it might involve going through procurement to get that new vendor approved, which can be quite a lengthy process. So instead, we've now greatly expanded the number of providers that AI Gateway supports through unified billing, and you don't have to do any of that. You don't have to juggle API keys, save the API key in the secret store, forward the invoices on to your finance team, whatever that might be. You just have to load money into your AI Gateway wallet, and any spend that you do across these different model providers will just draw down from that wallet, so all of it will be on your Cloudflare invoice.
So it's one vendor that you have to manage and get approved, rather than the three to five that you might have otherwise, or possibly even more. That's super nice. Then I guess if you can move everybody internally onto that, you get better visibility, right? So what have we seen from people who have done that? What are some of the insights from people actually running through this unified system? Yeah, that's exactly it. Obviously it's a great simplification of your billing operations, but also of your observability. You're not juggling between different consoles and dashboards to look at what the behavior is, what errors am I seeing on OpenAI versus Anthropic, and trying to trace a particular session through a bunch of these provider dashboards.
All of your logging now lives in one place, so you can really see the entirety of your inference traffic. That's such a superpower, having everything in one place. So the things you can see now include your overall inference spend: for a particular product that I've shipped and launched, how much money is it actually costing me across all of my inference providers? And then with our analytics, we have this feature where you can pass in metadata alongside a request.
Oh. So that lets you pass in arbitrary data that you can then later filter on. One example of a use case here is, if you have an application, you might want to offer different models, or use different models to support different tasks, between your free-tier users and your paid-tier users. And now you can cut your data by that metadata, free versus paid, and see, okay, how much am I spending on my free users? How much am I spending on my paid users?
You can even go as far as sending a particular customer's ID through all of their requests, so you can see how much a particular customer costs at an inference level. Or if you're running a couple of agents that do certain tasks, you can see that this agent is costing me this amount of money. So it's really powerful to understand how people are using your product, or how your internal users are using your product: where is your spend creeping up? Maybe that leads you to swap out different models for a particular task, stuff like that.
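That tagging pattern might look like the following sketch, using the gateway's custom-metadata request header; the header name `cf-aig-metadata` is AI Gateway's metadata mechanism, while the tag names (`plan`, `customerId`, `agent`) are made up for illustration:

```typescript
// Sketch: attach filterable tags to every gateway request so spend can later
// be sliced by plan, customer, or agent in the gateway's logs and analytics.
// Tag names are illustrative; cf-aig-metadata carries arbitrary JSON.
function metadataHeaders(tags: Record<string, string | number>): Record<string, string> {
  return {
    "Content-Type": "application/json",
    "cf-aig-metadata": JSON.stringify(tags),
  };
}

// e.g. every request made on behalf of a free-tier user by a support agent:
const headers = metadataHeaders({ plan: "free", customerId: "cus_123", agent: "support-bot" });
```

Filtering gateway logs on the `plan` tag then answers "how much am I spending on my free users" directly.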
So cool. I can only imagine how much people are going to use that for exactly that point, right? You see people saying AI is an expense and we're not sure exactly how expensive it's going to be. Being able to predict and track that, like, oh, the growth is happening in my free tier, we should change that model. And like you said earlier, we could very easily change the model to match the free users.
When the free users use this, we want them to use this model; otherwise we want to use that one. And you can use that data to make those decisions, which is huge. Yeah. And I think if you think about AI deployments within a company and internal use cases, everyone is telling their employees to use AI, and people are building agents and doing all these things. But the next question that naturally arises is, what is the ROI of all these agents that we've got running around?
This is the first thing you need to start answering that question, right? The company might build a code-reviewer bot, and it's good to see, okay, this is costing us X amount of money, but maybe it's prevented this many incidents or potential incidents over the last quarter. It's hard to have that conversation without knowing the costs of these things. Absolutely. Or bringing that data together yourself: you'd probably use AI to write the report that pulls from all the different places.
So, all in one place with the same metadata. It's huge. It's super cool. Now, being Cloudflare, it wouldn't be a conversation if we didn't talk about latency. What's going on in this blog post regarding latency? Yeah. Like we said earlier, one of the big things that we're launching is this Workers AI binding where you can call third-party models, but obviously you can still use the binding to call Workers AI models.
So if you're trying to call Workers AI models through AI Gateway, that's a particularly great way to get even lower latency, right? Because these Workers AI models are within the Cloudflare network, there's no extra hop over the public internet when you're calling, for example, Kimi K2 through Workers AI. Your inference runs on the same network, so your agents have even lower latency than if you were to call an external model. Great, that's awesome. Are there any other hidden gems in this blog post that we should call out?
Yeah. One thing that I think is notable is, if you're building agents using Cloudflare's Agents SDK and you're using AI Gateway, we've made it so that if you're making a streaming inference call, it's resilient to disconnects. Basically, AI Gateway will buffer the responses as they're generated, and that happens independently of the agent. So if your agent gets interrupted for some reason, it can reconnect to AI Gateway, find the request it already made, and basically read it back, rather than you having to make that inference again, which takes time and also costs you money.
You can just restart, and it's as seamless as possible. Well, that is so awesome. So I guess what we're saying is: use AI Gateway all the time. Use AI Gateway, use the Agents SDK, use Cloudflare generally. Yeah, exactly. Ming, thank you so much for doing this. Congratulations on this first blog post. It's amazing, it's really great, and I hope that everybody goes and reads it. Any last words of wisdom, Ming? Anything you want people to do? I mean, you just wrapped that up really nicely.
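The buffer-and-replay idea behind that resilience can be illustrated with a toy sketch; this is a conceptual model of the behavior described, not AI Gateway's actual implementation:

```typescript
// Toy model of buffer-and-replay streaming: the gateway keeps chunks it has
// already produced, so a client that disconnects can resume from its last
// offset instead of paying for the inference a second time.
class BufferedStream {
  private chunks: string[] = [];
  private done = false;

  push(chunk: string): void {
    this.chunks.push(chunk); // generation continues even if no client is connected
  }

  finish(): void {
    this.done = true;
  }

  // A client that disconnected after reading `offset` chunks resumes here.
  readFrom(offset: number): { chunks: string[]; done: boolean } {
    return { chunks: this.chunks.slice(offset), done: this.done };
  }
}

const stream = new BufferedStream();
stream.push("The capital ");
stream.push("is Washington, D.C.");
stream.finish();
// A reconnecting agent that already saw chunk 0 replays only what it missed.
const resumed = stream.readFrom(1);
```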
Any last thoughts? Yeah, so many great things are coming out during Agents Week. This is actually my first Innovation Week at Cloudflare, too, so it's so cool seeing all the new stuff that we're launching. There are so many great, awesome things. So read the blog, try the new stuff; we've got some great things cooking. Awesome. Thank you everybody, and we will see you next time.