How to Save 50% on AI Tokens Using Cloudflare Queues

Cloudflare Developers | 00:05:09 | Apr 23, 2026
Chapters (9)
This chapter introduces a method to cut AI token costs by using the batch APIs from OpenAI and Anthropic, explaining how queuing requests when a response isn’t needed right away can yield a 50% discount.

Cut your AI token costs in half by batching requests with Cloudflare Queues and the Anthropic/OpenAI batch API discounts, while keeping reliability with retries and a dead-letter queue.

Summary

Cloudflare Developers’ latest walkthrough shows how to pair Cloudflare Queues with a simple web app to cut AI token costs by 50%. The video explains how to use the Anthropic and OpenAI batch API discounts whenever a user isn’t actively waiting for a response, and why you should queue those requests instead of calling the model inline. The demo centers on a feedback classifier that tags entries as bug, feature request, praise, or billing. A producer (the web app) inserts each entry into D1 and puts a message on the queue, while the consumer (the queue function on the Worker) receives batches of up to three messages, or whatever has arrived after 60 seconds, whichever comes first, and submits them to the Anthropic batch API. Because batch APIs are asynchronous, a cron trigger configured in Wrangler runs every minute in the demo, gathers the batch IDs that are still in progress, and polls Anthropic for completion. When a batch completes, the results are written back to D1; if errors occur, messages are retried up to three times before landing in a dead-letter queue. The video stresses reliability as well as cost: a message is either processed or ends up on the dead-letter queue, so it is never lost. It also shares practical tips, like using a one-minute cron cadence for demos and a slower one (for example, hourly) for real-world use. The bottom line: use Cloudflare Queues for non-real-time AI tasks to unlock 50% token discounts and improve resilience. GitHub-hosted code and a hands-on demo show how to implement the producer/consumer pattern, batch processing, and error handling with Queues.

Key Takeaways

  • Routing non-urgent requests through Cloudflare Queues lets you use the OpenAI and Anthropic batch APIs, which the video says give a 50% token discount.
  • The demo consumer uses a batch size of three with a 60-second timer; a batch is delivered as soon as either limit is hit.
  • Max retries are capped at three, with failed messages routed to a dead-letter queue to prevent data loss.
  • The cron-triggered job polls in-progress batches every minute, then writes results back to D1 once a batch is complete.
  • If an external API is down, the consumer backs off and retries the batch; after the maximum retries, messages move to the dead-letter queue rather than being lost.
  • Producer/consumer design: enqueue in the web app, consume in the Worker's queue function, and persist state in D1.
  • Cloudflare Cron Triggers and Queues enable cost-efficient AI workflows, and Queues are now available on the Workers Free plan (10,000 operations/day).
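
The batching and retry rules in the takeaways above can be sketched as a toy model. This is only an illustration of the "whichever limit hits first" rule and the retry cap using the demo's values; the function names are hypothetical and this is not how Cloudflare implements Queues internally.

```typescript
// Toy model of the rules described in the video. Not Cloudflare's implementation.
const MAX_BATCH_SIZE = 3; // demo value from the video
const MAX_BATCH_TIMEOUT_SEC = 60; // demo value from the video
const MAX_RETRIES = 3; // demo value from the video

// A batch is delivered when it is full OR when the timer expires, whichever is first.
function shouldDeliverBatch(queuedMessages: number, secondsSinceFirstMessage: number): boolean {
  if (queuedMessages === 0) return false; // nothing to deliver
  return queuedMessages >= MAX_BATCH_SIZE || secondsSinceFirstMessage >= MAX_BATCH_TIMEOUT_SEC;
}

// After a failure, a message is retried until the retry cap, then dead-lettered.
function nextStepAfterFailure(retriesSoFar: number): "retry" | "dead-letter" {
  return retriesSoFar < MAX_RETRIES ? "retry" : "dead-letter";
}
```

With these values, two queued messages are held until the 60-second mark, while a third message triggers delivery immediately.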

Who Is This For?

Developers and DevOps engineers exploring cost-efficient AI pipelines on Cloudflare Workers. Essential for teams building non-real-time AI features who want to reduce token spend while maintaining reliability.

Notable Quotes

"The batch is ready, the cron job pulls the data and drops it in our D1 database."
Illustrates the flow where completed batches are written back to D1.
"Max retries three. If we hit the max retries, then it goes to a dead-letter queue."
Shows the retry policy and dead-letter handling.
"Cloudflare Queues are available on the free plan now. You get 10,000 operations per day."
Highlights cost and plan details for beginners.

Questions This Video Answers

  • How can I save on AI tokens using Cloudflare Queues with the OpenAI and Anthropic batch APIs?
  • What is the producer-consumer pattern with Cloudflare Workers and D1 for batch AI processing?
  • How do I configure a Cloudflare Cron Trigger to poll batch results from Anthropic/OpenAI?
  • What happens to failed AI requests in Cloudflare Queues and how does the dead-letter queue work?
  • What are the cost implications of Cloudflare Queues on the Free vs Paid Workers plans?
Cloudflare Queues, OpenAI batch discounts, Anthropic API, Cloudflare Cron Triggers, Wrangler, D1 database, Queue-based batch processing, Dead-letter queue, Producer-consumer pattern, AI token optimization
Full Transcript
Cloudflare Queues recently hit the Workers Free plan, and I want to show you my absolute favorite way to use them: cutting your AI token bills. Now look, I know that sounds kind of clickbaity, but both OpenAI and Anthropic give you a 50% discount if you use their batch APIs. If a user isn't actively waiting on an AI response, don't call the model inline. You're paying full price if you do. Drop the messages on the queue, let the queue ship them off to Anthropic and OpenAI, and get a 50% discount. It's cheaper. It's resilient. Let's dive right in.
Here we have a little demo app. It's a feedback classifier. We can choose between bug, feature request, praise, and billing. Let's start with a bug. We have a web app, and the user says, "The homepage didn't load on Windows." Send it to the web app. Beautiful. Okay. A sales call: "The guy was really nice." Great. Another one. And see how fast this is. We're not waiting on a response. And finally, a support email: "You need a haircut." Okay, let's go to D1. And here are our entries: the homepage didn't load, the guy was really nice. And they're in progress. And on the left, you see a cron job. And you'll see it's processing all three. It's in progress. Because batch APIs are asynchronous, we use a Cloudflare cron trigger to poll for completion. Once the batch is ready, the cron job pulls the data and puts it right back into our D1 database.
So, while this is churning, let me show you some of the code. It all starts, of course, at Wrangler. And here we have our cron trigger. This means it runs every minute, which is good for a demo. In real life, you'd probably have it run a little slower, maybe every hour or so. Let's go to the implementation of the cron job. It's on scheduled. This gets invoked every minute. We do some logging. We get all the batch IDs that are in progress. And then we poll Anthropic. And of course, this code will be available on GitHub. So that's the cron job.
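
A minimal sketch of the polling step just described, assuming the storage and provider calls are injected. The names `BatchStore`, `BatchApi`, and `pollInProgressBatches` are hypothetical and are not taken from the demo repo; the real code talks to D1 and the Anthropic batch API directly.

```typescript
// Hypothetical sketch of the cron-driven poller. None of these names come from the demo repo.
type BatchStatus = "in_progress" | "ended";

interface BatchResult {
  jobId: string;
  category: string;
  priority: string;
}

// Stand-in for the D1 queries the demo runs.
interface BatchStore {
  listInProgressBatchIds(): Promise<string[]>;
  saveResults(batchId: string, results: BatchResult[]): Promise<void>;
}

// Stand-in for the provider's batch API client.
interface BatchApi {
  getStatus(batchId: string): Promise<BatchStatus>;
  fetchResults(batchId: string): Promise<BatchResult[]>;
}

// Checks every in-progress batch once and stores results for the finished ones.
// Returns the IDs of the batches that completed on this tick.
async function pollInProgressBatches(store: BatchStore, api: BatchApi): Promise<string[]> {
  const completed: string[] = [];
  for (const batchId of await store.listInProgressBatchIds()) {
    const status = await api.getStatus(batchId);
    if (status !== "ended") continue; // still processing; try again on the next tick
    const results = await api.fetchResults(batchId);
    await store.saveResults(batchId, results);
    completed.push(batchId);
  }
  return completed;
}
```

In a Worker, a function like this would be called from the `scheduled` handler, with the store backed by D1. Batches that are still processing are simply left for the next cron run.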
Now how do we actually get things on the queue? It's with the producer, and the producer is our web app, feedback queue. Here you can see we have this API call. This is our API where we have handle feedback submission. We look at the feedback, create an ID, and then we insert it into D1. Then we put it on the queue and then we just return the JSON. That's how it was so fast when I added the feedback. We talked about our producer. We talked about the cron job. Now we need the consumer. This is the queue function on our Worker, and it just takes a batch, and a batch is a list of those messages. In our consumer here, I've set the size to three, that's why we needed three messages for it to run, and a timer to 60 seconds. Whichever one hits first. If batch size is first, then it uses batch size. If 60 seconds have passed and there's only two items in the queue, then it sends those two items. And we have a maximum of three retries. So we go to this process batch method, and all it does is it gets the jobs, we turn them into requests, and then we just send those requests to the Anthropic API. If there's an error, then we retry. If there's an error more than three times, it goes to the dead-letter queue. If it's successful, we mark the job as submitted. And then, very important, we acknowledge the message here.
Let's see what we got here. Yay, it worked. The batch is completed. We polled. After a couple minutes, it was completed. So the web app, "The homepage didn't load on Windows": that's a bug, high priority. The sales call, "The guy was really nice": praise, but low priority. And the support email that I need a haircut, I mean, I know, I know, is low. So, this is cool because we're only paying half price for this. Because batch APIs are asynchronous, we use the Cloudflare cron trigger. Once the batch is ready, the cron job pulls the data and drops it in our D1 database.
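
The producer and consumer steps described above might look roughly like the sketch below. It assumes the D1 writes, the queue send, and the batch submission are injected as plain functions; `handleFeedbackSubmission`, `processBatch`, and the `*Deps` interfaces are hypothetical names. The `ack()` and `retry()` methods mirror the per-message methods on Cloudflare Queues messages, while the batch size, timeout, retry cap, and dead-letter queue are consumer settings in the Wrangler configuration (`max_batch_size`, `max_batch_timeout`, `max_retries`, `dead_letter_queue`) rather than code.

```typescript
// Hypothetical sketch of the producer/consumer flow. Names are not from the demo repo.
interface FeedbackJob {
  id: string;
  text: string;
}

// Minimal local stand-ins for a queue message and batch.
interface QueueMessage<T> {
  body: T;
  ack(): void;
  retry(): void;
}
interface MessageBatchLike<T> {
  messages: QueueMessage<T>[];
}

interface ProducerDeps {
  newId(): string;
  insertFeedback(job: FeedbackJob): Promise<void>; // e.g. a D1 INSERT
  enqueue(job: FeedbackJob): Promise<void>; // e.g. a queue send
}

// Producer: persist, enqueue, and return immediately without calling the model.
async function handleFeedbackSubmission(
  text: string,
  deps: ProducerDeps,
): Promise<{ id: string; status: "in_progress" }> {
  const job: FeedbackJob = { id: deps.newId(), text };
  await deps.insertFeedback(job);
  await deps.enqueue(job);
  return { id: job.id, status: "in_progress" };
}

interface ConsumerDeps {
  submitBatch(jobs: FeedbackJob[]): Promise<string>; // returns the provider's batch ID
  markSubmitted(jobIds: string[], batchId: string): Promise<void>;
}

// Consumer: submit the whole batch, then ack on success or retry on failure.
async function processBatch(batch: MessageBatchLike<FeedbackJob>, deps: ConsumerDeps): Promise<void> {
  const jobs = batch.messages.map((m) => m.body);
  try {
    const batchId = await deps.submitBatch(jobs);
    await deps.markSubmitted(jobs.map((j) => j.id), batchId);
    for (const m of batch.messages) m.ack(); // acknowledged messages are not redelivered
  } catch {
    for (const m of batch.messages) m.retry(); // redelivered until the retry cap is reached
  }
}
```

The producer never waits on the model, which is why submissions in the demo return so quickly. The consumer only acknowledges messages after the batch ID has been recorded, so a failure at any earlier step leads to a retry rather than a lost job.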
And if I go to our D1 database, you'll see it has upserted, and we have the categories, the priorities, and then a summary of what the bug actually is. But this isn't just about money. It's also about reliability. LLM APIs are flaky. We get rate limited. We get overloads, transient errors. This is exactly what queues are used for. That's why we use max retries. Let me show you again here. Max retries three. If we hit the max retries, then it goes to a dead-letter queue, which is probably one of the coolest names for a paradigm. That means if Anthropic errors out, our consumer will just back off and retry the whole batch, up to three times. If Anthropic is down for a longer time, it will go to the dead-letter queue and you actually never lose a message. Now, that's pretty crucial. You just never lose a message, because it will either be processed or on the dead-letter queue.
So to summarize, if you're building with AI and the work doesn't need to happen in real time, use queues. Save some money with the batch API. Cloudflare Queues are available on the free plan now. You get 10,000 operations per day. And if you're on the Workers Paid plan, you get a million operations per month. After that, it's 40 cents per million operations. I hope this was useful for you and that you can save some money on tokens. I use Queues all the time. They're one of my favorite and most underrated products, and I hope with this video you're excited to try them too. The repo's in the description. I hope you build something awesome. I'll see you later.
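
The pricing figures quoted at the end of the video can be turned into a rough estimate. This is a sketch under the numbers as stated in the transcript (one million included operations per month on the paid plan, then $0.40 per million, and a 50% batch discount); the function names are hypothetical and current pricing should be checked before relying on the output.

```typescript
// Figures as quoted in the video; they may change over time.
const PAID_INCLUDED_OPS_PER_MONTH = 1000000;
const PAID_USD_PER_MILLION_OPS = 0.4;
const BATCH_DISCOUNT = 0.5;

// Queue cost on the paid plan: only operations beyond the included million are billed.
function paidPlanQueueCostUsd(opsPerMonth: number): number {
  const billableOps = Math.max(0, opsPerMonth - PAID_INCLUDED_OPS_PER_MONTH);
  return (billableOps / 1000000) * PAID_USD_PER_MILLION_OPS;
}

// Token cost after moving an inline workload to a discounted batch API.
function batchTokenCostUsd(inlineTokenCostUsd: number): number {
  return inlineTokenCostUsd * (1 - BATCH_DISCOUNT);
}
```

For example, a workload that costs $20 in tokens inline would cost $10 through a batch API, and 3.5 million queue operations in a month would add about $1 on the paid plan.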
