Forget Codex vs Claude Code, This Finally Fixes Both
Chapters
Gary the snail spotlights a market need for an autonomous task platform and introduces the problem of managing long running tasks.
Goal Buddy finally fixes Claude Code’s long-running tasks by adding local state, a dashboard, and a three-agent guard trio.
Summary
Gary the snail’s quest to automate long-running tasks runs headlong into the limitations of Claude Code’s goal command. AI LABS argues that past tools like Ralph Wiggum and Claude Code’s own Goal were too brittle, especially without persistent state. Goal Buddy steps in as a plugin that enforces a local state store (state.yaml), adds a transparent dashboard, and coordinates three specialized agents (scout, worker, and judge) under a PM-like manager. The team demonstrates the workflow on Codex and Claude Code, showing how slices, an oracle, and clearly defined task boundaries improve safety and traceability. The judge reviews hard decisions, the scout maps progress, and the worker executes tasks one at a time while the PM keeps the loop running. The setup is straightforward: install the plugin, run goal prep, and let the system generate a structured goal MD file, state.yaml, and a dashboard. The video ends with notes on parallelization opportunities and a candid nod to the practical realities of real-world AI project pipelines. Overall, Goal Buddy appears to resolve several pain points of the native goal command, turning subjective judgments into auditable, repeatable progress.
Key Takeaways
- Goal Buddy adds a persistent local state (state.yaml) so tasks aren’t lost when chat history degrades or sessions end.
- Three defined agents coordinate the workflow: scout (read-only, evidence mapping), worker (edit access, executes tasks), and judge (read-only, high-scrutiny decisions).
- The oracle, slices, and a PM loop convert vague requests into programmable, testable objectives with clear pass/fail criteria.
- The setup includes a goal prep flow that clarifies ambiguities by asking questions before forming the objective and the tasks.
- Progress is visually trackable in a dashboard, with a state machine that moves tasks from active to completed and a final audit by the PM.
- The tool supports multiple backends (Claude Code and Codex) and is delivered as a plugin, installed via a simple command.
- Despite strong planning and governance, parallelization is still limited in the demo, suggesting future improvements for concurrency.
Who Is This For?
Essential viewing for AI/ML engineers and product teams deploying long-running AI workflows who want auditable progress, safer task execution, and better governance than the native goal command offers.
Notable Quotes
"Goal Buddy is a tool that was built with one purpose, to make the goal command actually work the way it should."
—Introductory claim highlighting the plugin’s purpose.
"It fixes all the problems we just talked about, but it's not really getting as much attention as it should given how useful it is."
—Praise for Goal Buddy’s practical value.
"It fixes that and actually forces the goal to read and update local state instead of relying on chat history."
—Key improvement: persistent state rather than ephemeral chat.
"You’ll know what each task is doing, which agent is active, what tasks are queued, and which ones are completed."
—Describes the dashboard’s visibility into workflow.
"The PM is the manager in this case, and keeps the goal running until the final audit marks the goal as met."
—Defines the PM’s governance role in the loop.
Questions This Video Answers
- How does Goal Buddy change the way Claude Code evaluates task completion?
- Can Goal Buddy run tasks in parallel across multiple agents in Codex or Claude Code?
- What is state.yaml and why is it essential for long-running AI workflows?
- How do slices and the oracle work together to make goals verifiable?
- How does Goal Buddy compare to prior tools like Ralph Wiggum or the original Goal command?
Claude Code, Codex, Goal Buddy, Goal command, scout agent, worker agent, judge agent, state.yaml, dashboard, oracles and slices
Full Transcript
This is Gary the snail, and he's identified a market gap to build a dating platform for snails. But since he's super slow, he wants Claude Code to autonomously handle his long-running tasks. Fortunately for him, agents have gotten really good at long-running tasks. Claude Code has a goal command that just keeps the agent running until the task gets completed. But during our testing, we've found a lot of issues with the goal command. Since Gary recently went through a divorce, and we want him to be happy, we found this open-source tool that actually fixes the problem.
And it doesn't only work with Claude Code, but Codex as well, spreading love just like your mom, who I'm sure loves you just as much as your employed sibling. Claude Code previously released a command called Goal, which keeps the agent working until a certain condition is met. We didn't cover this one on our channel, but you probably already know about it. Before this, there was a plug-in called Ralph Wiggum that gained a lot of traction, which essentially did the same thing. It used hooks to feed the prompt back to Claude Code until the condition was actually met.
But the thing is, these conditions need to be an exact match because the Ralph loop uses a shell script to check for the condition literally, like the airport guard who doesn't let you through because your manly body spray is over the baggage limit. The goal command works differently. It takes the condition and the conversation so far and gives them to a small model, which is Haiku. And this model intelligently evaluates if the task is done or not. It returns a yes or no decision, and a no tells Claude to keep iterating on the same task, like when your boss tells you to improve the user experience because he just can't find a button on the page.
So, this makes the evaluation subjective, and for things that we cannot quantify on their own, that's a real improvement. The goal command does work well for a lot of tasks, but it still has a lot of issues. The first issue is that it does not use any knowledge base or file system that tracks the progress of the task. And since it's not doing that, the only source of truth for the agent becomes the chat context. This might trigger you since it was your dad who wrote the crypto fortune on a sticky note that fell off the fridge back in 2017.
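The difference between the two completion checks described here can be sketched in a few lines. This is only an illustration, not the actual Ralph hook or the goal command's evaluator: the function names, the literal `promise` string, and the injected `evaluator` callable (standing in for the small model) are all assumptions.

```python
from typing import Callable, Optional


def exact_match_done(output: str, promise: str) -> bool:
    """Literal check (script-style): done only if the exact promise text appears."""
    return promise in output


def judged_done(condition: str, transcript: str,
                evaluator: Callable[[str, str], bool]) -> bool:
    """Subjective check: hand the condition and transcript to an evaluator.

    In the video the evaluator is a small model; here it is any callable
    returning True (yes) or False (no), which is a hypothetical stand-in.
    """
    return evaluator(condition, transcript)


def run_until_done(step: Callable[[int], str],
                   is_done: Callable[[str], bool],
                   max_iterations: int = 10) -> Optional[int]:
    """Re-run `step` until `is_done` accepts its output.

    Returns the iteration number that finished, or None if the budget ran out.
    """
    for iteration in range(1, max_iterations + 1):
        if is_done(step(iteration)):
            return iteration
    return None
```

With the literal check, an output of "all done" never satisfies a promise of "DONE"; the judged check can accept it, at the cost of depending on the evaluator's opinion.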
Once the session ends for any reason and the goal wasn't completed, you sure can resume it using Claude's resume command. The goal will not be lost. But the only way it knows where it left off is the chat context. And since this command is meant for long-running tasks, not simple ones, things can get messed up in between. And of course, with the goal running for hours, context bloat and hitting compaction are bound to become a real problem at some point. After compaction, the agent's output gets worse. It's going to start behaving like my grandma, who because of her dementia is starting to forget this channel's name.
I need you guys to watch the last video for her. Another problem is that it doesn't break the task down into smaller ones. Instead, it just uses the main agent and does the task breakdown on its own the way Claude Code normally does. So, there's no structured plan, and the agent may lose track of what's left to do. And even though this might work well for some cases, an unclear definition of what done looks like for agents is never the right thing. The goal command relies entirely on the model to evaluate completion. So it might not be as effective in some cases.
It is better than Ralph Wiggum, which is completely strict because it uses scripts, but at least there should be some metric that tells the agent what done might look like, just like your wedding photographer that kept saying one more shot until the whole event was over. So this is where the goal command falls short. And these things might not look like much, but when put into real heavy workflows, they can bring some serious issues. Now, Goal Buddy is a tool that was built with one purpose, to make the goal command actually work the way it should. It solves all the problems we just talked about, but it's not really getting as much attention as it should given how useful it is.
It's like the hot babysitter except instead of flirting with you, she's just babysitting your long-running tasks. Goal doesn't preserve the state of the work locally. So this tool fixes that and actually forces the goal to read and update local state instead of relying on chat history. And it also finishes with proof. So the agent actually knows what done looks like before it starts. In order to track progress, it also includes a whole dashboard where you can watch your agent work while it's working. And to handle all this, it's built upon three agents which are the scout, the worker, and the judge.
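The idea of reading and updating local state instead of relying on chat history can be illustrated with a small sketch. Goal Buddy itself keeps a state.yaml; to stay within the Python standard library this example stores JSON instead, and the field names `active_task` and `completed` are assumptions rather than the tool's real schema.

```python
import json
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read progress from disk; a fresh session starts from the file, not chat."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"active_task": None, "completed": []}


def save_state(path: Path, state: dict) -> None:
    """Write to a temporary file first, then swap it in, so a crash mid-write
    does not leave a half-written state file behind."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp.replace(path)


def complete_task(path: Path, task_id: str) -> dict:
    """Record a finished task and clear the active slot."""
    state = load_state(path)
    if task_id not in state["completed"]:
        state["completed"].append(task_id)
    state["active_task"] = None
    save_state(path, state)
    return state
```

Because every step goes through the file, a resumed or compacted session can recover where it left off by calling `load_state` rather than by rereading the conversation.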
Basically a Y Combinator startup team where one does all the work, one watches him do it, and one judges both of them on Twitter. The installation is pretty straightforward. Just copy the install command and paste it into your project folder. It will be installed as a plug-in available for both Claude Code and Codex. Once you start a new session, you can see the command available for use. So these three agents each have a strictly defined role and access level. Since this tool is built for Codex as well, the agents are defined in TOML instead of the standard Markdown.
The first agent is the judge which only has read access. It skeptically analyzes hard decisions like risky scope, contradictory sources, and other patterns to make sure the task is completed safely. Its instructions forbid editing because it exists only for making judgments, nothing else. And since its task is highly critical, this agent's reasoning is set to the highest so that decisions are made properly. It's exactly like when you've been composing that one text to your crush for 4 hours straight in the middle of the night. After it finishes working, it returns a JSON structure with the approved and rejected decisions along with the rationale.
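A consumer of the judge's output might validate it along these lines. The video only says the JSON carries approved and rejected decisions plus a rationale; the exact key names `approved`, `rejected`, and `rationale` used below are assumptions made for the sketch.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    approved: list[str]
    rejected: list[str]
    rationale: str


def parse_verdict(raw: str) -> Verdict:
    """Parse the judge's JSON reply and fail loudly if a field is absent."""
    data = json.loads(raw)
    # Hypothetical field names; the real schema is not shown in the video.
    missing = {"approved", "rejected", "rationale"} - set(data)
    if missing:
        raise ValueError(f"verdict is missing fields: {sorted(missing)}")
    return Verdict(
        approved=list(data["approved"]),
        rejected=list(data["rejected"]),
        rationale=str(data["rationale"]),
    )
```

Rejecting malformed verdicts up front keeps a vague or truncated judge reply from being treated as approval.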
The scout is another read-only agent that maps an active task and creates a compact evidence receipt for it. Since its job is just to check the state of the task, its reasoning effort is kept low. Just like your favorite strip club's bouncer, it doesn't actually care that much. And then there's the worker agent, the only one with edit access. It does the actual work and it's only allowed to execute one task at a time. There's also the PM role, which is the main thread that coordinates the workflow. It behaves like an actual project manager doing the minimal work possible.
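The roles and access levels described so far can be summarized as data. This is a sketch, not the plug-in's TOML definitions: the type names are invented for illustration, and the worker's reasoning effort is an assumption because the video does not state it.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    READ_ONLY = "read-only"
    EDIT = "edit"


@dataclass(frozen=True)
class AgentRole:
    name: str
    access: Access
    reasoning_effort: str


ROLES = {
    "scout": AgentRole("scout", Access.READ_ONLY, "low"),
    "judge": AgentRole("judge", Access.READ_ONLY, "high"),
    # Assumption: the video does not say what effort the worker runs at.
    "worker": AgentRole("worker", Access.EDIT, "medium"),
}


def can_edit(role_name: str) -> bool:
    """Only a role with edit access may change files."""
    return ROLES[role_name].access is Access.EDIT


def can_mark_done(actor: str) -> bool:
    """Per the video, only the PM (the main thread) may mark a task as done."""
    return actor == "pm"
```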
It's the only authority that can actually mark the task as done. The core workflow starts by expressing the intent of the task in proper words. Not vaguely the way us homo sapiens usually do, but in a way the agent can properly understand and then the oracle is defined. The oracle is basically an observable signal that identifies the outcome. It is what the system iterates against to see if the task can be marked as done or not. It could be anything. A test suite, a browser rundown, any artifact, benchmarks, or the code that turns my microwave into a time machine because why not?
AI agents are doing anything at this point. Then the next step is surface. It breaks down the task into actionable steps, creates the dashboard, and maps the tasks into a visual format. The last piece is the PM. He's the manager in this case, and keeps the goal running until the final audit marks the goal as met. To use Goal Buddy, you just run the goal prep command. This is the one that initializes the workflow and you define the goal that you want it to achieve. It first ensures the agents are installed and ready to be used.
It then initiates the workflow, but unlike the native goal command, it's extremely self-conscious and it first removes its own ambiguities by asking you questions so that you can clearly define the implementation. And just like your suspicious wife, it will keep asking questions until it has understood. The first step focuses on creating the goal files. It places the original request along with our answers and then maps it to the proper objective in agent understandable language. It contains a summary of all the information and then defines the oracle which is the most important part. The oracle for this task is straightforward.
All tests must pass with proper behavior. This kind of goal is specific because it can be programmatically evaluated unlike your cover story last night that your wife is totally not buying. Goal Buddy breaks down the whole workflow into small doable tasks. These are called slices. But unlike the real world, size doesn't matter here because a small slice doesn't mean a small task. It means something that is safe, can be verified easily, and can be run individually. It explicitly defines the safe slicing size in the document as well. It creates the state.yaml which tracks the project and tasks and defines how the PM loop would look.
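An oracle of the "all tests must pass" kind boils down to an observable exit status. The following is a minimal sketch under that assumption; the function name and the idea of passing the test command as an argument list are illustrative and not taken from Goal Buddy.

```python
import subprocess


def oracle_passes(command: list[str]) -> bool:
    """Run a verification command and treat exit code 0 as 'oracle satisfied'.

    `command` is whatever observable check the goal defines, for example a
    test-suite invocation. Output is captured so it can be logged as evidence.
    """
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0
```

A slice that is "safe and easy to verify" is then one whose own check can be expressed as a command like this and run in isolation.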
The state.yaml consists of all the goals and rules with all the tasks broken down by their ids and the assigned agent. It contains a field for tracking the active task too. It also mentions the linked dashboard. It lists all the to-do tasks and the in progress tasks. In our case, the scout agent is currently in progress and is mapping all the files and endpoints. So to start the loop, you just copy this command and run it. It instructs Claude to set the goal of doing everything in the goal MD file. From there, it will pick up the first active task like a king and then call out its subordinate agents to perform it.
Once the scout has completed the work, it updates the progress file with all its findings and documents them in a separate directory. It also updates the board from active to completed. Then the loop picks up the next task, marks it as active, and starts the judge agent. The judge critically reviews the findings and sequences the report into the fewest possible vertical slices, which is the task breakdown for the worker to carry out independently. It then updates the slice count and updates the state file accordingly. Each task explicitly lists the allowed files, how to verify them, and when to stop.
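Since each task lists its allowed files, how to verify it, and when to stop, a slice can be modeled as a small record with a scope check. The field names and the use of glob-style patterns are assumptions made for this sketch.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class SliceSpec:
    slice_id: str
    allowed_files: tuple[str, ...]  # glob-style patterns (hypothetical format)
    verify: str                     # how to check the slice, e.g. a test command
    stop_when: str                  # plain-language stop condition


def out_of_scope(spec: SliceSpec, changed_files: list[str]) -> list[str]:
    """Return the changed paths that match none of the slice's allowed patterns.

    Note that fnmatch's '*' also matches '/', so 'src/auth/*.py' covers
    nested paths under src/auth as well.
    """
    return [
        path for path in changed_files
        if not any(fnmatch(path, pattern) for pattern in spec.allowed_files)
    ]
```

An empty result means the worker stayed inside the slice; anything else is a signal to stop and escalate rather than continue.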
This is how it defines each slice so that agents have a clear expected output, checks, and all the necessary details. Then one by one, it initializes the worker agent and begins with the first slice. The progress of each agent can be tracked using the dashboard. You'll know what each task is doing, which agent is active, what tasks are queued, and which ones are completed. So you don't have to monitor things yourself, and can actually give your kids the time that they need. Once all the tasks have been completed, it performs the last audit as PM, making sure that all the tests have been properly conducted.
Once the audit is done, it marks the judge agent's final audit task as done and then marks the goal as completed. After this, you have to start the prayers and hope that those agents didn't hallucinate. Overall, this worked considerably well given the complexity and the scale of the app we gave it, but we think more effective parallelization could be added because it did everything sequentially. It handled one task at a time and didn't make use of Claude Code's parallelization capabilities at all. Dario would have been actually disappointed to see this. But given how well it planned the workflow, it did work pretty well.
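The sequential loop described in this walkthrough (pick the next task, mark it active, run its agent, mark it completed, then audit) can be sketched as follows. The task dictionary layout, the `blocked` status, and the injected `run_agent` and `final_audit` callables are all assumptions for illustration.

```python
from typing import Callable


def run_pm_loop(tasks: list[dict],
                run_agent: Callable[[str, str], bool],
                final_audit: Callable[[list[dict]], bool]) -> str:
    """Process tasks strictly one at a time, as in the demo.

    Each task is a dict with 'id', 'agent', and 'status' keys (hypothetical
    layout). Returns 'met', 'not_met', or 'blocked'.
    """
    for task in tasks:
        if task["status"] == "completed":
            continue  # already done in an earlier session
        task["status"] = "active"
        if not run_agent(task["agent"], task["id"]):
            task["status"] = "blocked"  # assumption: stop on the first failure
            return "blocked"
        task["status"] = "completed"
    # Only the PM's final audit decides whether the goal is met.
    return "met" if final_audit(tasks) else "not_met"
```

Nothing here runs concurrently, which mirrors the limitation noted in the video: independent slices wait their turn even when they could be dispatched in parallel.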
Also, if you are enjoying our content, consider pressing the hype button because it helps us create more content like this and reach out to more people. We also wanted to test Goal Buddy on something more generic like designing a UI to see how it handles tasks that can't be evaluated programmatically. The previous test was on a specific workflow with clear pass and fail criteria. But just like you getting that fresh cut from your barber, some tasks just don't have that. So we first gave the usual goal command a vague prompt. It initialized the goal tasks, consulted the adviser and gave a website in no time.
Being lazy, it just created a simple HTML page and didn't go for any framework. But the landing page didn't look bad. So we gave the same exact prompt to Goal Buddy as well. Once it started, it followed the same workflow and gave a similar questioning session to clarify the intent with us. Here, Goal Buddy actually asked for the tech stack as well. Normally, I'd call this kissing, but since I take my AI agent seriously, I'll call it being thorough. Similarly, it created the board and the goal MD file and translated our original request into a proper objective.
It also properly identified the oracle. But the oracle in the previous task was simple. It just needed to pass all the tests. This one had different goals. It defined the task as complete when the dev server is up and running and browser walkthroughs confirm all the sections work as defined. This is how it turned a non-quantifiable task into something quantifiable. It also created the state.yaml again with the oracle, rules, agents, and all the tasks listed out and then started working in the same way. It took a longer time than the normal goal command, but it ended up implementing the app properly.
This won't be a problem for Gary the snail, but you should do some push-ups in the meantime. I can see you've gotten fat. Comparatively, the whole website performed significantly better than what the simple goal command created. If you actually want to be an AI B2B SaaS founder who likes to build instead of just watching tutorials, then you should be an AI LABS Pro. You'll actually get like-minded nerds like our team in there with resources from the videos and lots of other goodies as well. The link's going to be in the description and you can check that out.
That brings us to the end of this video. If you'd like to support the channel and help us keep making videos like this, you can do so by using the super thanks button below. As always, thank you for watching and I'll see you in the next one.