Build Hour: API & Codex

OpenAI | 01:00:58 | Mar 26, 2026
Chapters: 7
This session introduces Codex, outlining practical goals to empower teams with OpenAI APIs, featuring a live demo on agent legibility scores, a talk on harness engineering, a customer spotlight, and a Q&A segment.

OpenAI Build Hours demonstrates Codex and API advances, showcasing agent delegation, harness engineering patterns, and real-world production workflows.

Summary

OpenAI’s Christine hosts a session with Charlie and Ryan to explore the latest in Codex and API capabilities. The trio highlights Codex’s evolution from autocomplete and pair programming to agent delegation, including the Codex desktop app, Windows support, and new features like skills, apps, and hosted shells. They walk through a practical demo of agent legibility scoring on the Symphony repository, then introduce harness engineering via Symphony and “no-human-code” pipelines that scale to millions of lines of code. Mitch, co-founder of Basis, joins for a customer spotlight to discuss company-wide adoption, standards, sub-agents, context sharing, and how to manage codebases at scale with Codex. The session also covers safety, security, and governance patterns such as AGENTS.md files, skills front matter, decision records, and integration with Slack, Linear, and other tools. Finally, audience questions address checkpointing, collaboration, browser automation, and how to bootstrap harness engineering in brownfield projects. The broadcast promises post-session resources, a GitHub repository with examples, and upcoming Build Hours events.

Key Takeaways

  • GPT-5.4 enables native state-of-the-art computer use (CUA, the computer use agent), supports up to 1,000,000 tokens of context, and introduces a tool search tool for better context management via progressive disclosure.
  • The Codex app now ships with native Windows sandboxing (no WSL required) and supports “skills” and “apps” to extend agent capabilities beyond the IDE.
  • Hosted shell gives the model a container it can run bash commands in, and WebSocket mode improves latency by 20–30% for extremely tool-heavy workloads.
  • Harness engineering patterns (Symphony, agent legibility scoring, parallel worktrees, and encoded non-functional requirements) dramatically increase throughput and code quality in large codebases.
  • Docs-first, code-first: encode standards as AGENTS.md files, skills, and sub-agents, keeping context in-repo for reproducible, auditable AI-assisted development.

Who Is This For?

Engineers and engineering managers exploring scalable AI-assisted software development, particularly those adopting Codex, Codex-based automation, and harness engineering to ship large, high-quality codebases with minimal human input.

Notable Quotes

“GPT-5.4, which we'll get into in a little bit.”
Mention of the new GPT-5.4 model discussed during the features review.
“The app itself… has Windows native sandboxing through and through. It doesn’t rely on WSL.”
Highlighting the Windows-native Codex app’s security and portability.
“We have upgraded our app to the blessed cryptography library that we support.”
Ryan on encoding security standards and toolchain choices into the repo.
“No humans would write any lines of code.”
Ryan on the hard constraint behind the harness engineering project.
“If you can articulate what it is about the code you don't like… you can encode those non-functional requirements.”
Strategy for reducing AI slop by embedding guardrails.

Questions This Video Answers

  • How does GPT-5.4's CUA mode improve tool usage in large-scale AI assistants?
  • What is the Codex app’s role in managing agents and worktrees for codebases?
  • What are harness engineering patterns and how can they be applied to brownfield projects?
  • How can teams enforce non-functional requirements when code is generated by AI agents?
  • What are best practices for integrating hosted shell, skills, and apps in OpenAI Codex workflows?
Tags: OpenAI Codex, GPT-5.4, CUA, Codex app, Skills and Apps, Hosted Shell, WebSocket mode, Agent legibility score, Harness engineering, Symphony, brownfield deployment
Full Transcript
Hey everyone, welcome back to OpenAI Build Hours. I'm Christine. I'm on the startup marketing team, and today I'm joined by Charlie and Ryan. Hey folks, how's it going? Hey everybody. Awesome. So Charlie is on our DevX team and he will be leading the session, and Ryan actually came all the way from Seattle to be with us live in the studio today, and he's going to be chatting about the future of work. So today's session is all about the API and Codex. If this is your first Build Hour, the goal of the session is to empower you with the best practices, tools, and AI expertise to scale your company using OpenAI APIs and models. This session in particular is really going to teach you how to use Codex for all of your engineering work, and what the next step is in using Codex that goes beyond pair programming. So here's a snapshot of what you can expect today. We're going to spend five minutes on all the new features that we launched across Codex and the API. We have lots to share on the app, as well as a new model that just came out last week. So, perfect timing. Then we're going to dive into a demo on agent legibility score. More to come, and this is going to be super practical for you to apply right away with your team. Then Ryan is going to talk about harness engineering. We released a blog on harness engineering from Ryan, who's going to be here in person talking to you about what this means and how you can apply it to your team. Then we'll save some time for a customer spotlight: Basis will be joining us live, and we're really excited to have co-founder Mitch talk about how he uses Codex with his engineering team. And last but not least, we'll save 15 minutes for Q&A. On the right side of your screen there's a Q&A chat box where you can submit your questions. We'll answer them during the session and save a few to answer live at the end.
So with that, I'll turn it over to you both. Thanks Christine. So let's kick off by talking about what's new in Codex. We've really entered this third phase of using AI for software development. The first phase was autocomplete, the ghost text. The second phase was pair programming, where you'd have a model sit alongside you in your IDE and work on generating code, potentially merging it into your existing code. But we've really come into this new era of agent delegation, where you can use things like the Codex desktop app to manage multiple agents across increasingly bigger tasks and workflows. I'm sure you all have felt this, but there was a huge step change in capability with GPT-5.2, and with GPT-5.3 and GPT-5.4 on top of that, we've only trended further into this long-horizon, full-delegation-to-the-agent direction. For me as a builder this has been a fantastic thing, empowering me to do more, and to do it more quickly. Absolutely. It's been kind of surprising for me how quickly my own workflow has changed, especially with the Codex desktop app; it's mostly replaced the IDE for me at this point. So yeah, just in the last couple of months, like you were mentioning: GPT-5.3-Codex, which got even faster; the Codex app, which is now about managing agents rather than necessarily living in your terminal or your IDE; and we just released GPT-5.4, which we'll get into in a little bit. We'll be using the Codex app a lot today, and if you weren't aware, it's now available on Windows, which is a big milestone for us. Super excited about this app. One thing that is truly amazing about it is that it has Windows-native sandboxing through and through. It doesn't rely on WSL, and it allows you to build with Codex in ways that are native to your local development best practices.
I think it's super cool that we're bringing all the power of Codex's security model and best-in-class coding capability to developers on Windows. We really aim to meet you all where you are to help you build. So much of the world is building on Windows, and we'd really love for you to check out the latest Codex app. The app itself has a few new features from recent weeks and months. The big ones are skills and apps. You can bring capabilities and expertise via skills; a number of them are recommended and come pre-installed inside the app. And then we also have apps, which you can share across both ChatGPT and Codex. They were previously known as connectors, and you can use them to connect ChatGPT and Codex to the tools that you use every day. Yeah, skills are super powerful. The way I think about them is as context we give to the agent to show it to what end to use the tools it has, and how to use those tools well. The fact that we're bundling a bunch of them with the Codex app means you get all the best parts of how we use our tools to supercharge your own workflows. And let's talk about what's new in the API. The big news from last week is GPT-5.4. It's got native state-of-the-art computer use capabilities; you'll see us refer to this as CUA, the computer use agent. It supports up to a million tokens of context, and it has a new tool search tool. We found that in practice, a lot of the bleeding-edge builders were using hundreds and hundreds of tools, which was creating issues with context management. So we now have tool search, which can help you namespace those tools, and the agent can intelligently, progressively expose them to get the tools that it needs. Yeah, the term of art here, if you've heard it, is progressive disclosure.
The idea is that not all the tools need to be in context all the time for every task the agent does. So instead of giving the agent everything up front, we hide all the tool descriptions and individual calls behind something that allows the agent to do more natural discovery, and it chooses which parts to pull into context intelligently. Exactly. And it's our most token-efficient reasoning model. You can see on our benchmarks that it achieves the same performance as GPT-5.3-Codex and GPT-5.2, but with a fraction of the token consumption and latency. We're not going to touch on all of these today; I'd encourage everybody to check out developers.openai.com and look through the API changelog. But some of the big things we've shipped in the last month alone: the CUA API, which I mentioned, really beefs up developers' ability to manage browsers and computers. We have a new code mode, where the model can generate JavaScript to run in a REPL. That dramatically speeds things up: previously you would have to click, screenshot, and find coordinates, and now you can compress all of that into a single JavaScript execution statement. Love to give the model more tools and let it cook. We also have skills and hosted shell, which we'll be using in this demo. With skills, you can upload to create a skill ID which the model can reference. Hosted shell gives the model a container environment which it can spin up and then execute bash commands inside. And then we also have WebSocket mode: if you're building an integration on top of the API, you can switch over to using WebSockets, and for extremely tool-heavy use cases we find it improves latency by 20 to 30%.
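Progressive disclosure as described here can be sketched in plain JavaScript. This is not the API's actual tool-search implementation — the registry, names, and two-step search/expand shape are invented for illustration — but it shows the core idea: the agent first sees only a cheap searchable index, and a tool's full schema is paged into context only once it is selected.

```javascript
// Illustrative sketch of progressive disclosure: a namespaced tool registry
// where search returns names only, and full schemas are expanded on demand.
const toolRegistry = {
  "github.create_pr": {
    summary: "Open a pull request",
    schema: { title: "string", body: "string", base: "string" },
  },
  "github.merge_pr": {
    summary: "Merge an approved pull request",
    schema: { number: "number", method: "string" },
  },
  "slack.post_message": {
    summary: "Post a message to a channel",
    schema: { channel: "string", text: "string" },
  },
};

// Step 1: cheap search over names and summaries; returns tool names only.
function searchTools(registry, query) {
  const q = query.toLowerCase();
  return Object.entries(registry)
    .filter(([name, t]) => name.includes(q) || t.summary.toLowerCase().includes(q))
    .map(([name]) => name);
}

// Step 2: expand full schemas for just the tools the agent picked.
function expandTools(registry, names) {
  return names.map((name) => ({ name, ...registry[name] }));
}

const hits = searchTools(toolRegistry, "pull request");
console.log(hits); // only the PR-related tools surface
console.log(expandTools(toolRegistry, hits));
```

With hundreds of tools, only the few matching descriptors ever enter the context window, which is what makes the pattern cheap at scale.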
One of the things in here that I'm most excited about is the hosted shell tool, because it takes all the magic of coding agents — giving them a computer and a shell so they can go cook and solve really complicated problems — and puts it in our API in a way that is super customizable, able to be embedded deeply into the agents that you build, and optimized for the workflows you're evaluating against. And if you want to use the Codex harness itself, we make that easy too, by shipping adapters with the Agents SDK that allow you to give Codex to your agents as well. So we give you the best of both worlds with these two options. 100%. Our vision is to meet developers where they are and give them primitives, whether hosted on our infrastructure or running locally on yours. Cool. So with that in mind, let's jump into our demo. Sweet. One of the principles we're going to talk about in the harness engineering portion of today's Build Hour is trying to make repositories more legible for agents. So we had this idea of using Codex to make an app that will score a GitHub repository along a certain set of metrics that we've predefined. This is exciting because it's an easy checklist that will guide you toward removing humans from the loop in different parts of the code-authoring process. So, I'm here in the Codex app. I've got a new folder here. In the interest of time, I've actually created a plan in advance for the agent to implement. This is the kind of thing you could easily create yourself with plan mode in the Codex app, but we've only got about 10 minutes here.
So now I'm going to go ahead and tell it: implement plan.md. Behind the scenes, the model is going to think for a minute, decide to explore the workspace, and figure out what files are available and what the current state of the GitHub repository is, and then it's going to take a look at what it can do. Yeah, this is super exciting, because the app makes this much more digestible for me. It lifts me out of my terminal and puts me into a higher-level operating mode where I'm reviewing the agent rather than being deep in the weeds with it. And this is the transformation in how we interact with this tool that was enabled by the huge step change in capability that really only became possible this year. There's something interesting here: the model has actually realized that we built a previous version of this app in an archived folder, and it's going in to take a look at that code to, I think, learn from previous runs. This is a peek at an interesting technique I've used to build some agents for myself, which is to provide pointers to previous trajectories — previous iterations of the task. This gives the model a reference for how the task was completed before and what it might learn from those trajectories, and it gives you a very cheap way to implement what we call episodic memory: a way for the agent to improve how it works over time while completing similar tasks. While it's working, let's take a look at some of the things going on in the Codex app. At the bottom, we've got a task list. This isn't anything that we generated or specified; this is just the model internally saying, here is the roadmap I need to complete, and giving us some visual indicators on what it's doing.
Yeah, and this is distinct from plan mode; it's really just a nice way for me to have high-level sight of what the agent thinks is up next. I use it as a tool to figure out whether I should interrupt or provide steering to Codex, as opposed to just letting it cook. It helps me have confidence that the agent is doing the things I expect. Yeah. And if we open up a new thread, we can see that if you toggle on plan mode, it's really powerful for gathering context, asking which direction you want to take things in, and developing a really thorough plan. We now have speed controls, and you can choose whether to run Codex in standard or fast mode. Fast mode is fantastic, but it does consume your usage a bit more quickly. And then we have the ability to toggle some of the permissions and settings. I've been using local here, but I love the worktree capability of the Codex app. Worktrees, if you're not familiar, are essentially git sandboxes; they allow you to work on multiple things simultaneously within the same folder without stepping on or clobbering changes in a different task thread. Yeah, if you've seen folks with five screens and 18 tmux tabs, worktrees are what they're using to go parallel here. And I think getting your codebase working with worktrees is one of the steps you can take toward a harness engineering mentality: basically, make it really cheap to work on multiple copies of your code at once. Yeah. And the Codex app has a really great handoff feature: if you're in a worktree and you want to bring those changes back to your local branch, you can click a single button to do so, or if you want to push them back to the worktree, a single button does that too. There's a lot of great git support in here.
This is a pretty simple repo, so we just have the main branch, but you can easily commit and push right from the app. You can also open these files in your IDE of choice. And then there are a few built-in panels: there's a terminal, so we can take a look at what's going on in here, and there's a diff panel. In this case we've got quite a lot of changes now; it's written 8,200 lines of code in just the few minutes we've been talking. Over on the left, we've got the skills and apps that we were talking about earlier. I really love the set of skills we provide out of the box here, as well as the apps you can install for yourself. One neat thing, with Google Calendar in apps, is that it should get you thinking about ways you can use Codex to automate parts of your work other than just the code you're writing. I've vibed up a couple of tools to help me digest my calendar and make sure I'm allocating my time effectively. Really cool to unblock the other parts of my job with this tool. I love that you brought that up. So let's talk about automations, which to me are the killer feature of the Codex app. Automations — and you can see a lot of my messy inbox here — are a way for you to run a command on a regular schedule. It sounds simple, but I've been surprised at how powerful it is in practice. Yes. One I run all the time has the Codex app review all my open PRs and make sure they're mergeable and haven't come into conflict. Codex is really good at resolving merge conflicts, and I am super lazy, so I am more than happy to delegate this work to Codex, so I'm always unblocked to smash the merge button as soon as my code is approved. Yeah.
In my case, I think my most popular one is Slack management: checking on work streams I've got going, or updating to-dos from Slack. I've seen other folks at OpenAI use it for things like, hey, take my most recent PRs and review them for any potential bugs, or look at my git history for the last 24 hours and surface a quick summary for standup today. Amazing. Okay, so let's take a look at how the app did. It looks like it went ahead and built something for us. It did some testing; it ran lint, test, and build. Fantastic. So we can go into the repo I asked it to make and take a look at what it's got. Cool. So it went ahead and one-shotted this app for us. There's a lot going on here, but I'll give a quick overview. The way this is meant to work is we put in a GitHub repo, and maybe some custom instructions. It's going to analyze that repo — again, using hosted shell — and score it based on these seven metrics for agentic legibility. I wonder what would happen if you put the Symphony repo in here. Yeah, that's a great callout. I actually already had this loaded up. Symphony is a repo we just open-sourced last week, alongside this theme of harness engineering. I've previously tested this, so I know to tell the model to only score the elixir folder. But Ryan, why don't you tell us a little bit about — oh, a "missing OpenAI model" error. Cool. So I think we probably need to give it some environment variables. I should have had these preloaded. Let's take another — oops. The power of live demos, folks. You can never be too confident in what's going to come back out. All right, let's try that one more time. I do like that the agent biased toward configurability, though.
That's a nice touch. Very twelve-factor-app style. Okay, it's going ahead. Yeah, why don't you tell us a little bit about Symphony? So, Symphony is the next step that came out of the work we did in this repo, which is very bootstrapped for harness engineering. It takes the idea that, if we have enough guardrails around Codex that it is reliably producing code that will be accepted by all the human engineers and agentic reviewers on my team, then let's remove the humans from the loop entirely: instead of poking at their terminals, they work at a much higher level, defining work in Linear. Symphony is an orchestrator that manages advancing that work through the ticket queue — making sure that Codex is spun up in a worktree, implementing the code, putting it up for review, going back and forth with CI and agentic code reviewers, and only bringing the humans into the loop once all our previous constraints are satisfied. This allows me to truly focus my time on the hard stuff: prioritizing work, reviewing work, and making sure that the work we're doing is accruing value to the actual products we want to build. Yeah. And it's really nice to be given a fully baked PR and be able to make a cheap yes-or-no decision on whether we can merge, and then Symphony will manage merging it for me. Nice. And it looks like this is reflective of the fact that you've put a lot of thought into those design patterns: our app here gave you a B grade on your agentic legibility. Nailed it. There are a few things, so let's take a look at what it scored and its recommendations. Bootstrap self-sufficiency: basically, can the repo get set up from scratch? Is there external knowledge you'd need, or commands you'd need to run, to make it work? Then, task entry points.
So, is it easy for the agent to just run make, or build, or lint, that sort of thing? Validation harness: how easily can it check the changes it made? It's hard to know you completed the job if you can't measure that the job is done. Exactly. Linting and formatting: I think for this one it found that you had a linter, but maybe it wasn't obvious to the model where to find it and how to invoke it. In practice, I think linting is probably the best low-hanging fruit to tackle. It saves so many cycles, because the model can check its work really cheaply. That's right. And it's super easy to add leverage to your codebase by vibing up some new lints. Codex is really, really good at this. Yeah. Then there's: is there a map for the agent showing where things are? Are the docs structured well? And are there any decision records present in the repo? I think in this case that was the only one you missed on. Yeah, an artifact of open-sourcing things. Sorry, folks. But it flagged all of those and made its recommendations. And you can see over here, these are logs from the container — the hosted shell running in the background. If we go back to the Codex app, we can see that in firing off this Responses API request, we send it a model, and we specified this skill ID, which has all of that knowledge of the metrics in the background. And the team has actually put a lot of thought into safety and security around this stuff. It's not trivial to just say, let's open up all network access here. So we did have to specify a whitelist of domains that we wanted the shell to be able to access.
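For a rough mental model, the demo's request might look something like the sketch below. To be clear, the exact parameter names for hosted shell, skills, and domain allowlists here are assumptions for illustration, not the documented API surface — the point is only the shape: a model, a sandboxed shell tool with default-deny networking, a skill ID carrying the rubric, and an input prompt.

```javascript
// Hedged sketch of a legibility-scoring request payload. Field names
// ("shell", "allowed_domains", "skills") are hypothetical stand-ins.
function buildScoringRequest(repoUrl) {
  return {
    model: "gpt-5.4",
    tools: [
      {
        type: "shell", // hosted container the model can run bash in
        network: {
          // default-deny networking; only these domains are reachable
          allowed_domains: ["github.com", "raw.githubusercontent.com"],
        },
      },
    ],
    // skill ID that packages the seven agentic-legibility metrics
    skills: ["skill_agent_legibility_v1"],
    input: `Clone ${repoUrl} and score it on the agentic legibility rubric.`,
  };
}

const req = buildScoringRequest("https://github.com/example/repo");
console.log(JSON.stringify(req, null, 2));
```

The allowlist is the safety lever the speakers emphasize: the agent can cook inside the container, but it can only reach the domains you named.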
Yeah, I think it's super nice how high-level the container API is. It gives you very good secure defaults that empower the agent to cook while maintaining the safety, security, and invariants that you want, because fundamentally these agents are going to do what they do, and we need to make sure we put them in the right environment to do it. Yeah. So we'll share these slides; these are the legibility metrics we started from. But ultimately, this was a one-shot example: it's a single plan, we one-shotted it, we vibe-coded this app. I think the more interesting question is how you continue to build bigger things — how you extend that vibe-coding session into something much, much larger. That's right, and it's why I'm excited to talk today about harness engineering: what it is, and what we learned while operating on a project my team took on for about five months, where we had the hard constraint that no humans would write any lines of code. We ended up with a product that we're shipping internally today — about a million lines of code — where 100% of it has been written by Codex. And this has been a very interesting experience as an engineer, because I can't actually clack on the keyboard to make progress. I have to step back and think at a systems level about how to enable my team of agents to go do the thing. Agents see the world, and code, slightly differently than the humans I would normally have on my team, so over the course of this project we've been figuring out a bunch of patterns for refining the output of these coding agents so we can get to merge over and over again. And one might assume that a million lines of code entirely written by agents has got to be a hot mess, right?
That's got to be — you know, how do you manage that and make sure it's good? Yeah. I believe the term for this is AI slop. And one thing that's great, because code is now so much cheaper to produce than before, is you can simply say: I do not tolerate AI slop, which to me is slang for code I don't like. If you can articulate what it is about the code you don't like, the next step is to write that down — in documentation, or in a bespoke code-reviewer agent, or in tests, or in lints. If you can say what you don't like about the code being produced, you can work toward encoding those non-functional requirements in your codebase to ban that code from happening in the first place. One nice example: a common pattern we see is that coding agents like to optimize for local ease of use, which means you can end up with several copies of the same routine in your codebase. What happened to us is we ended up with very many copies of a bounded-concurrency helper, but only one of them is the one we've invested in instrumenting with our OpenTelemetry stack. So we have a lint that bans that function, in its various shapes, from being defined anywhere except our canonical async-utils package. And it's just a vibe-coded ESLint rule, which we can assert is 100% covered, because we can just do that now: have Codex exhaustively write tests for the positive and negative cases. This is what it means to dive deeper into systems thinking within your codebase: observe the mistakes, or classes of misbehavior, the agents are making, and do what it takes to statically disallow them. This has been really cool, because it allows us to encode what it means to do trusted engineering directly in the code. It's just not possible for AI slop to enter the codebase. And one other neat thing is that these things have stacked incredibly well, right?
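A minimal version of that kind of ESLint rule is sketched below. The helper names and canonical path are hypothetical stand-ins (the team's real rule isn't public), but the rule structure — `meta`, `create(context)`, `context.report` — is ESLint's standard custom-rule API.

```javascript
// Sketch of a lint banning bounded-concurrency helpers outside one package.
const CANONICAL = "src/async-utils/"; // hypothetical canonical location
const BANNED_NAMES = ["boundedMap", "pMap", "mapWithConcurrency"]; // hypothetical

const rule = {
  meta: {
    type: "problem",
    docs: {
      description:
        "Bounded-concurrency helpers may only live in the canonical async-utils package.",
    },
    schema: [],
  },
  create(context) {
    const filename = context.getFilename();
    return {
      // Fires on every `function foo() {}` declaration in a linted file.
      FunctionDeclaration(node) {
        if (
          node.id &&
          BANNED_NAMES.includes(node.id.name) &&
          !filename.includes(CANONICAL)
        ) {
          context.report({
            node,
            message: `Define ${node.id.name} in ${CANONICAL} only; import it from there.`,
          });
        }
      },
    };
  },
};

module.exports = rule;
```

A quick way to sanity-check the rule's positive and negative cases without a full ESLint run is to drive its visitor with a stubbed `context`, which is also how the exhaustive agent-written tests mentioned above can assert full coverage.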
We've observed that, as we've onboarded engineers to the team, they each have a different set of experiences and ideas of what good, high-quality, correct code looks like, which means each engineer who has joined my team is able to reduce slop in a unique way. And because everyone is invested in putting that knowledge into the codebase, everyone else's coding agents get the best instincts of everyone on the team. One neat thing — yeah, maybe we can dig into some of that Codex usage. So I think when we started, this was very, very slow: Codex did not have a good idea of what my acceptance criteria were, so I often had to double-click into a task and build some building blocks to get the refined output I actually wanted. But once I'd invested in those building blocks, I went from maybe a quarter or half an engineer's worth of throughput at the beginning to three to ten engineers' worth of throughput per engineer. And one other neat thing: as I've added members to my team, our throughput has actually increased, because we have that flywheel where every engineer's knowledge improves everyone's coding agents. Another neat thing is that, because the only way to get code accepted is to get the coding agent to do it, we've had to pull a ton more context into the repository than I otherwise would have in the past. A good example: I partnered with our security engineering team to upgrade our app to the blessed cryptography library that we support. But that decision happened in a Slack thread buried in our team channel two months ago. We hired a new engineer onto the team, and while mucking around in this part of the code, they brought in a new npm package for some of this functionality. I knew when I reviewed the code that this was not aligned with our engineering practices.
But this isn't the engineer's fault, and it isn't Codex's fault. That information just wasn't encoded in the constraints of our system. So I went back to that Slack thread, I did @Codex, please add guardrails to our codebase, reflected that knowledge back in, and then redid the change, which resulted in aligned output, because it used the blessed cryptography library it was supposed to. Wow. Okay, so it sounds like you're heavily relying on these integrations, these apps — into Slack or Linear or whatever else — to get context back and forth. Yeah, and it's super cheap to do: I @Codex, I go away for 5, 10, 15 minutes, and I come back to four proposed PRs on how we can make our code higher and higher quality in the future. The other pattern that I think is really neat is that we have a lot of context in the codebase — too much to put in an AGENTS.md file. Just our security best practices alone are a 250-line markdown file; same for what it means to write reliable code. I'm sure all of you have at some point gotten a page that was resolved by adding a retry and a timeout to some network code. We all know this, but code still ends up in production without retries and timeouts, and code produced by Codex is the same. So we write that non-functional requirement down, give the agent pointers in AGENTS.md to our best practices for reliability, and then reflect it back to the agent in a dependable way via a code reviewer operationalized around reliability, which leaves comments on PRs that our primary authoring agent is forced to read. That means we can steer the system to converge on our guardrails in a way that's actually more reliable than having humans do it. Codex is very patient and willing to take as many code reviews as we can throw at it, whereas I might get frustrated and just YOLO-merge anyway.
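The author-and-reviewers convergence loop described here can be sketched abstractly. The `author` and reviewer functions below are toy stand-ins for real agent calls; the point is the control flow: the authoring agent keeps revising until every bespoke reviewer has no remaining comments, or a round limit is hit.

```javascript
// Hypothetical sketch of the author/reviewer loop: revise until all bespoke
// reviewers (reliability, security, QA, ...) are satisfied.
function reviewLoop(author, reviewers, task, maxRounds = 5) {
  let draft = author(task, []);
  for (let round = 1; round <= maxRounds; round++) {
    const comments = reviewers.flatMap((review) => review(draft));
    if (comments.length === 0) {
      return { draft, rounds: round, merged: true };
    }
    // The authoring agent is forced to read and address every comment.
    draft = author(task, comments);
  }
  return { draft, rounds: maxRounds, merged: false };
}

// Toy stand-ins: the "reliability reviewer" insists on a timeout.
const author = (task, comments) =>
  comments.some((c) => c.includes("timeout")) ? "fetch with timeout" : "fetch";
const reliabilityReviewer = (draft) =>
  draft.includes("timeout") ? [] : ["add a timeout to network calls"];

console.log(reviewLoop(author, [reliabilityReviewer], "add fetch call"));
```

Because the agent never tires of review rounds, the loop can afford many more iterations than a human author would tolerate, which is exactly the reliability argument made above.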
So we do end up with very high-quality code as a result. Another neat thing is that this stacks very well. We had a new engineer join our team, very product-minded, who was frustrated that Codex could not effectively smoke-test changes to critical user-journey flows in our app. So they pointed Codex at the codebase, asked it to spider through all of our user-facing features, write product specs for them, write manual QA plans, write down the user needs these flows address, and then pushed that into a bespoke reviewer agent that looks at every PR and generates a manual QA plan, which, again, Codex can use to validate that its changes work end to end and are aligned with our app's functionality. What I find really fascinating about this setup is that normally these things exist, maybe, in Slack or in conversations. We've never taken the time to actually write them all down and encode them in our code repositories. But to your point, it's now so cheap to generate these things that this is a new avenue for documentation. Yeah, if you want Codex to be an effective member of your team, it needs to know everything it takes for code to be acceptable to merge. It really is about the non-functional requirements if you want truly aligned output from these agents. So, I touched on this already, but as we've onboarded new folks to our team, they each bring their own experience and expertise. I myself bias toward backend, infra, and architecture, which means I just didn't have the know-how to get high-quality React code out of Codex. But as soon as we hired someone with deep front-end architecture experience and they started reflecting that knowledge into the codebase, all of a sudden hooks were being decomposed into a single hook per file.
We were getting small components that were easily testable via snapshot testing and that composed together well, with smaller files that are more efficient for the agent to page into context. And with each subsequent engineer on the team, more of their taste and expertise gets encoded, which means everybody's coding agents get more effective. That's so cool. So, some key tips on how you might apply these patterns in your own work. You should check out the Codex repo, which already has much of this in it. Just clone it and ask Codex to explain how it is structured and to teach you the best practices the team has used. Another cool thing I'm seeing all over X right now: folks are literally taking the link to the harness engineering blog post, giving it to Codex in their own repository, and saying "make my codebase more agent-legible," and they're getting good results that improve the autonomy, quality, and long-horizon task execution of Codex. It's super cool that we can essentially ship prompts to the world to help everyone be more effective with Codex. Yeah, and I love this shift we keep coming back to, from coding at the keyboard to orchestrating at the next level up. Which is how we got to the repo name, Symphony. That's right. All right, then I'll hand it back over to Christine. Awesome, thank you guys for walking us through that. And now for our customer spotlight: really excited to welcome Mitch, the co-founder of Basis. So we're going to pop out of this screen share, and Mitch should be joining us on stage. Cool, how do I stop sharing? There we go. Hey guys, thanks for having me. Hey Mitch. I think I'm going to be saying a lot of similar things. Perfect. You were just muted there for a sec. Okay, great.
I was just saying I think I'm going to be saying a lot of the same stuff Ryan and co. have been saying about putting things in the repo, but this will be a fun discussion. Let me see if I can share my screen. So I had Codex make a quick presentation here. Can you guys see that? Yeah. So I'm going to quickly walk through some of the stuff we do, which I think is very similar to the harness engineering work, both for engineering in our codebase and also at the company level, and how you can think about using Codex beyond just the codebase. First, let me give a quick introduction of myself. I'm Mitch, one of the two co-founders of Basis. Basis is essentially an agent platform for accountants. We recently raised a Series B, and we are very focused on engineering our company in such a way that we can move extremely fast. The reason I always tell people this is so important is that our ambitions are so large that if you think about the scale we need to reach in very short periods of time, it would literally be beyond the laws of physics to hire that many people. So we can't rely on that. Instead, you have to really engineer your company and your codebase to have as much output as a company ten times your size. So, going back to the codebase side, doing that really does require a big mindset shift. Part of what you just heard from the Symphony side is that people need to shift from doing to managing, and that's actually a very hard paradigm shift to make; different people get there at different speeds.
But it's hard to get people to make that shift, whether it's engineers, the sales team, or the post-sales team, if the agents aren't actually working well. It's hard to convince someone, "hey, you should just have this agent write the code," if the agent isn't actually writing the code well. That's number one. Two, you have to make the mindset shift of building for agents, not just for people. You want agents to have the ability to self-verify. You want them to understand the different intents people are giving them. And even then, if you're in a production codebase or a larger company, you need ways to collaborate so that one person's preference for how an agent behaves doesn't override another person's preference. You really have to build this up. So, what actually needs to be true at the codebase level for this to work? You need good standards for how you want the code written. Those standards need to be turned into good context. You need good processes to keep that context up to date. And you need an internal setup that's easy to operate: if you have a bunch of dev tooling, none of which has MCP servers or other ways to integrate with your agents, then the agents won't be able to validate their work or use the same tooling a developer would. At Basis, for example, we built something called Satellite, a single MCP server that essentially wraps all our other MCP servers, so developers only have to integrate with one MCP instead of fifty. That's just one example. So why is Codex really good at this?
For production codebases specifically, if you want the agents to be really good, you need really good instruction following, and Codex is very good at instruction following, to the point that it actually forces you to be even more precise in your documentation and your context so that Codex follows your instructions. Codex is also really good at gathering context ahead of time. Ryan was talking earlier about progressive disclosure: if you're going to have lots of progressive disclosure, which is really important at any level of complexity, you want the model and the harness to be good at walking through that disclosure to gather the context they need for a task. Codex is excellent at that; it will constantly go out and do it. And you need to be good at making decisions in areas of ambiguity, and Codex is really good at that too. So let's talk a little about Basis, and I'll go through a couple of demos of how we set those standards. We focus on a few things. We focus on AGENTS.md files: we have an AGENTS.md at the root, and then AGENTS.md files in different modules throughout the codebase, with the principle that if the information is specific to a module, it goes into that module's AGENTS.md, whereas if it's generic across modules, it goes into a skill. And, let's see what else I have here, yeah: I think it's super important that you set these standards, treat them as canonical, and then build the processes to keep them updated. Maybe I'll show you one quick example. I have to be careful on the live stream about what I can demo, but let's look at this. This is a skill that I have; you can see it's an agent skill.
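The root-plus-module AGENTS.md layout described above can be sketched as a small resolver: given a file the agent is about to edit, collect every AGENTS.md from the repository root down to that file's directory, so module-specific instructions layer on top of the general ones. This is an illustrative sketch of the convention as described, not Basis's or Codex's actual lookup logic, and the function name is made up.

```python
from pathlib import Path

def applicable_agents_files(repo_root: str, edited_file: str) -> list[Path]:
    """Return the AGENTS.md paths that apply to `edited_file`, ordered
    from the repo root (most general) down to the file's own module
    (most specific), mirroring the per-module layering described above."""
    root = Path(repo_root).resolve()
    directory = (root / edited_file).resolve().parent
    found: list[Path] = []
    while True:
        candidate = directory / "AGENTS.md"
        if candidate.is_file():
            found.append(candidate)
        # Stop at the repo root (or the filesystem root as a safety net).
        if directory == root or directory == directory.parent:
            break
        directory = directory.parent
    return list(reversed(found))  # root-level instructions come first
```

An agent harness could concatenate these files in order, so module guidance overrides or refines the root guidance.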
It's called "paper updating"; I'll explain what Paper is in a second. This is a skill we have on how to update something we internally call Paper. The key here is that there is an owner right there in the skill's front matter, which means you can assign very distinct responsibility for each skill. So, for example, say you were running an agent that looks across the skills to check whether any have descriptions that conflict with each other: because each skill records who its owner is, the agent can Slack them, or something like that. Which brings us to this internal tool we built called Paper, which I'll show you for a second. Paper is essentially our way of managing all of the context in the repository, which is where we want it, and our monorepo is called Arnold, because that's what we called it. Fun fact: when I came up with that, it was because Arnold was the most popular name for an accountant in the US, according to ChatGPT, back around 2023. So that's why it's called Arnold. So if I go to Paper and show this to you quickly: here is Paper, and as you can see we have codebase engineering, engineering at Basis. I'm not going to show you all the different skills, but I already have the skill I was just looking at set up here. It's very clear who the owner is and what the flow is, and you can go and look through it. This makes it really easy, at least for us, to make sure you're not getting slop into the codebase, that you're actually keeping it updated, and that people are actually looking at the skills and reading them.
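The owner-in-front-matter pattern lends itself to exactly the kind of audit agent Mitch describes. A minimal sketch: read each skill file's front matter and report the name and owner, so an auditing agent knows whom to ping about a conflicting or stale description. The field names `name` and `owner` and the one-file-per-skill layout are assumptions for illustration; Basis's actual schema isn't shown in the session.

```python
from pathlib import Path

def read_front_matter(skill_path: str) -> dict[str, str]:
    """Parse simple `key: value` pairs from the YAML-style front matter
    block between the leading `---` fences of a skill file."""
    lines = Path(skill_path).read_text().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at all
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # closing fence: front matter ends here
        if ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

def audit_skills(skills_dir: str) -> list[tuple[str, str]]:
    """Return (skill name, owner) pairs for every skill that has
    front matter, so an audit agent knows whom to contact."""
    return [
        (meta.get("name", path.stem), meta.get("owner", ""))
        for path in sorted(Path(skills_dir).glob("*.md"))
        if (meta := read_front_matter(str(path)))
    ]
```

A real setup would likely use a proper YAML parser and validate that `owner` is always present; the point is that ownership lives next to the instructions it governs.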
So that's an example of how we do skills. We also use sub-agents. Codex recently added support for sub-agents, and Codex is really good at enforcing standards through them; we use them very heavily. For example, we have a specific sub-agent whose whole job is to enforce standards: Codex will do a pass, then call the standards-enforcer sub-agent to catch anything Codex might have missed against our standards. We also have a sub-agent for babysitting pull requests, in case Codex wants to send something off to do that. Sub-agents are a really good way to take flows that Codex is great at, but where you want to be really sure, and package them into a specialized sub-agent that Codex can call. Let's see what else we have. A couple of other quick things, and then I'll go into a couple more quick demos. I think it's really important that agents have the opportunity to verify their own work. At the end of the day, Codex is really good at going for long periods of time, and one of the things it's really good at is using tools and context to make sure it's doing a good job. So you need to give it ways to do that, and that includes tests. Can it quickly run tests against the work it's doing so it can validate itself? One flow I like a lot is a more test-driven-development approach, where you don't just have a product spec, but a list of tests, which Codex can generate for you, that outline the intent of the change, so Codex can ensure the different user flows it's building actually work by the time it comes back to you. Okay.
And lastly, one of the big things that starts to happen once Codex is writing all the code and executing things really well is that decision-making becomes the main bottleneck on your velocity. So we've been building our setup so that decision-making gets faster too when working with Codex. One of the things we do is a concept we call dotnotes. Dotnotes is essentially this: if you're working with Codex and some decision comes up, say you told Codex, "hey, actually you should do it like this," Codex will record the decisions made throughout the session into dotnotes, which is a part of the repository, just to streamline decisions. You can think of it like commit messages, except Codex can write to it whenever it wants rather than only when making a commit, which creates a full history, for every commit, of the decisions made in that session, and you can use that as a historical debugger. If you go back to a historical commit, you can ask, "wait a second, why did Codex, or whoever made this commit, implement it this way and not that way?" You look at the note in the notes folder, and it will tell you. And obviously, specs that live in the repo are really, really key. As a quick example of that, I have a hero product spec that I wrote for Paper. If you look at it on GitHub, it's not exactly the easiest thing to read, and it's hard to leave comments. If I open it here, it looks okay. So one of the things we built internally, because we have this internal app called Paper, is the ability to collaborate on specs right inside your codebase.
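The dotnotes idea can be sketched as an append-only, per-session decision log living in the repository, which the agent writes to at any point rather than only at commit time. Everything here is hypothetical detail: the `.notes/` directory, the per-session file, and the entry format are one plausible layout for the concept Mitch describes, not Basis's actual implementation.

```python
from datetime import datetime, timezone
from pathlib import Path

def record_decision(repo_root: str, session_id: str, decision: str) -> Path:
    """Append a timestamped decision to .notes/<session_id>.md, creating
    the session log on first write. Unlike a commit message, this can be
    written at any point during the session, not only when committing."""
    notes_dir = Path(repo_root) / ".notes"
    notes_dir.mkdir(exist_ok=True)
    log = notes_dir / f"{session_id}.md"
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with log.open("a") as f:
        f.write(f"- {stamp} {decision}\n")
    return log
```

Because the log is committed alongside the code, `git log` on a `.notes/` file gives the "historical debugger" view: which decisions landed with which commit.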
So I can go here, for example, to the spec I'm looking at, and leave a comment: "test Codex comment." That leaves a comment that lives directly inside the codebase. All of the commenting, all of the resolutions, all of the interactions, all of the planning literally happens inside the codebase; you see it now says refresh, and if you go here, you'll see the comments. It's just a cool internal tool we built. The reason for it is exactly what Ryan was talking about: bringing things inside the codebase so we have as much context there as possible, so Codex has all the context it needs to help us make decisions and implement things. And lastly, then Christine, I'll hand it back to you, I think it's really important not to stop at the codebase. You'll note I didn't have time to actually give Codex the screenshots I told it I would. For the company, we think of this mindset of having context in a repository not just as a production-code question but as a company-context question. So we have two repos at the company. One is called Arnold, which is our monorepo, and the other is called Atlas. Atlas is the monorepo of all the company context that is not in the production repo.
For example, if I go here, where was it, here I'm in Atlas, not Arnold, and this is in Paper again: here are some of our operating principles, which you can also find on our careers page. For each operating principle you can see who the owner is. Having that very clearly in a codebase means that if you're working with Codex to help plan something, it can go read all the information about the company to help make decisions about different things. And once you start realizing this, you can do a lot more. So maybe I'll show you one quick example of what that allows. Let me quickly share a different part of my screen. If I pull up Codex, and since we used the app earlier, I'll use the CLI quickly. In this case, hey, this is me making the PowerPoint earlier. So, for example, I have a skill in my personal setup called "start my day." What "start my day" does is run through my morning routine: it grabs all the context from the last 24 hours, refreshes a bunch of different things, and so on. I'm not going to actually let it run, because it would show some information, but it can do this really well because it has all the context it needs from my personal notes, our company repository, and Arnold. So I'll pause there and give it back to you, Christine. Awesome, thanks so much. We have about 10 minutes left for Q&A. Mitch, if you want to stay on, we had some questions come in through the chat specifically for you. We'll jump back into the screen share. Cool. Okay, first question, Mitch: how did you use Codex to generate the deck?
Yeah, you can have it make PowerPoint in Python, but that was just HTML. I told it to make HTML, and I dictated by voice all the context about what I wanted. And if you noticed, I think it used OpenAI's fonts: I told it to go find what the OpenAI fonts are and find a similar one. I do this all the time too; it's super good at just vibing a bunch of single-page React slides into place and doing a pretty good job of it. Oh, totally. Yeah. Nice. Okay, next question says: "Shout out to Mitch, met you at an event in New York City. This is awesome. A few questions: is there any checkpointing or ability to revert? From a management perspective, are there any plans for collaboration features like shared threads? Can Codex execute browser previews or browser automations?" I can take that last one. Codex can definitely execute browser automations. There's a Playwright skill that's been around for a little bit, and we just released a Playwright interactive skill, which I think takes it even further in terms of the depth Codex can go to in using the browser. Personally, I use it whenever I'm doing front-end work: if I have feedback for Codex, say it got a box or an alignment wrong, I'll tell it, "use Playwright and just keep iterating and screenshotting until you're confident you've got it correct." On checkpointing and shared threads: I'm not sure about shared threads; that's a great question. In terms of the ability to revert in the app, everything shows up as staged or unstaged files, so it's very easy to click and stage things, or you can simply tell Codex, or put it in your AGENTS.md, to git-commit everything as it goes. Once you set up that scaffold, it can take care of itself. Yeah.
I definitely let Codex commit its own work, and I give it a skill for our expectations around commit messages and the granularity of those commits; it works pretty well. Okay, next question: as we move from AI pair programming to full agent delegation, what are the key design patterns that make an agent reliable enough for production systems? What I've found here is that it's highly beneficial to put in place pretty rigid architecture patterns, the kind you would expect to deploy at a company of a thousand or ten thousand engineers. So I've leaned really heavily into separation of concerns, proper package layering, and making sure business domains have high cohesion, encapsulation, and boundary separation. This helps with context management: if you're working in one set of business logic and the agent can take as given an opaque interface whose invariants it can rely on, for example, if I'm building a task-summarization feature as a core piece of business logic and it can accept inference as a dependency, then the agent doesn't have to deeply understand that dependency, and I've limited the amount of context it needs to page in. This is the same pattern as the progressive disclosure we're doing in the docs: structuring the codebase in a highly opinionated way lets code-as-prompts follow the same pattern. Yeah, and by the way, we do far more refactors than you would have five years ago because of this, so I totally agree. Okay, next question says: "How does your team maintain agent instructions? Does each engineer provide their own personal set of instructions, or do instructions get shared and version-controlled? How does the team sync up on adding or improving instructions?"
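The opaque-interface point above can be made concrete with a structural interface: the task-summarization logic depends only on a narrow inference protocol, so an agent editing the business logic never needs to page in the inference implementation; it can rely on the invariant "prompt in, text out." The names `Inference`, `summarize_task`, and `FakeInference` are illustrative, not from the speakers' actual codebase.

```python
from typing import Protocol

class Inference(Protocol):
    """Opaque boundary: business logic sees only this one method and
    its invariant (prompt in, completion text out)."""
    def complete(self, prompt: str) -> str: ...

def summarize_task(task_description: str, inference: Inference) -> str:
    """Core business logic. It accepts inference as a dependency, so
    the agent working here never pages in how completion happens."""
    prompt = f"Summarize this task in one sentence: {task_description}"
    return inference.complete(prompt).strip()

class FakeInference:
    """Test double satisfying the protocol, so the domain logic can be
    verified (self-verified, even, by an agent) without a model call."""
    def complete(self, prompt: str) -> str:
        return "A short summary."
```

The same boundary also serves the self-verification theme: tests can exercise `summarize_task` against a fake, which is exactly the kind of fast local check an agent can run on its own work.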
So everything lives in the codebase, which means, at least on my team, we have very few personal AGENTS.md files or personal skills; we treat this as a way to encode leverage for all the agents operating on the team's behalf. We've also converged on only a handful of pretty general skills: a skill for making PRs, a skill for landing and deploying changes, a skill for making commits, a skill for code review, and a skill for analyzing the architecture of proposed plans, and that's really it. Any additional leverage we want to give the agent goes into reference documentation attached to those skills, scripts attached to those skills, or the documentation and tests in the repository itself. This has the benefit of getting things out of individual teammates' heads and into a space where everybody's agents benefit from the shared knowledge. We also bias toward really high-frequency in-person standups: a 30-minute standup every day, which is maybe a little different from how folks have been working recently. But the code velocity is so high, and each engineer is disconnected enough from the code, that it can be weeks before I realize core architectural patterns have changed, so bringing those into synchronous time for the humans to collaborate has made a ton of sense for us. Mitch, is there anything Basis does differently here? No, I think it's all the same. The thing I'd actually be curious to ask you about is the thinking behind putting some of those front-end or backend standards into docs versus a skill. Maybe it's marginal, but we put a lot of that in skills, and we keep docs as docs, meant for humans, obviously still useful for agents, but for now a lot of that lives in the skill.
So there would be, say, a front-end skill that Codex consults any time it touches front-end code, with a bunch of references under it that get specific about different parts of the front end depending on what it's doing, assuming the guidance isn't completely scoped to a module, in which case it goes in that module's AGENTS.md. Yeah, we've instead put all of that task-specific expertise in the documentation hierarchy and focused the skills on high-level operating modes. So we'll have, for example, a "principal review" skill that a human can steer toward specific documentation, but which is otherwise just a general mode in which Codex should approach the task. Cool, that's cool. Worktrees sound cool; can you demo that quickly? Sure, let's take a quick look, back in Codex. The way worktrees work: right now we just pick a new worktree, and then I'll say, "modify the plan.md to be shorter," right? I'm going to toggle off plan mode and put it on low reasoning just so it's fast. You saw briefly there that it created a new worktree, and now it's going to start making changes. Yeah, it's smart enough to know it should read the file first. It's looking for which one it should edit, because we previously had some archived ones. One thing that didn't happen in this branch is that we didn't automatically bring in our existing diffs: all of the changes that happened locally are currently separate from this worktree, so I can work on this branch over there. There's an option to bring those changes in if you want them, but by default it takes a clean checkout of main, or whatever branch you want, and goes ahead. So it's going to modify these files.
And then we'll have the ability to either spin off a branch and create it later, or, once it starts making diffs, if you're on a separate branch, it'll decide whether you want to hand those off to your local environment. Yeah, the way to think about this is that I can have multiple working directories with multiple different branches of the repository on my file system at the same time, without having to clone the repo multiple times. And the git worktree commands are pretty fiddly; I hate using them, so it's great that the Codex app just does this for me. So it made a bunch of changes here, but those are not going to be reflected in our main thread. Okay, next question; we'll rapid-fire through these, as we're almost at time. Given progressive disclosure, should you just enable all skills so they can be called if needed? I would say yes. The thing to invest in is short, high-quality descriptions in the front matter of those skills, because that's the teaser the model uses to decide whether to invoke a particular skill. So we go from a short snippet describing the available capabilities to the detailed instructions the model should follow. Yeah, we actually have a Codex sub-agent whose whole job is to spin off versions of itself that test different skills: it gives example instructions and checks whether the sub-agent actually triggered and grabbed the skill, as a way to pseudo-test whether the front-matter descriptions work well. And Ryan, what about specific skills in our library? Which library? This was a follow-up question about the app store. Oh, I really like using the Figma skill.
It's super neat to be able to close the loop from work that other members of my team are doing and pull that context into Codex, in the same way we're doing all these tricks with Slack to bring context into the codebase. Nice. Okay: "I'm curious if any work was put into creating a bootstrap for harness engineering on a brownfield application, for example documentation management, generic rules and conventions, and if not, what specific implementation details could you advise on?" Yeah, so I would say we are still actively figuring out what it means to deploy these techniques into a brownfield codebase, and this is something we're super happy to collaborate with the community on to figure out what works. But in general, the techniques have been: carve parts of the repository apart into isolated business-logic domains, add interfaces, and add a high-quality local set of documentation on the best practices that the team owning that code is looking for. So we've been adding individual doc sub-trees close to the code being modified, using that same structure we laid out. And yes, turn on all the linters you can. Okay: "Do you have any resources or examples of these MD prompt files to help guide Codex on long-running tasks?" I'll let Ryan answer, but all the skills we've shown, the metrics, and the code for the sample app are going to be in the Build Hours GitHub repository, so you can take a look at how we're asking the model to grade these MD prompt files. Yeah, and I'll drop those links after this session and in a follow-up email. Mitch, there were some questions for you too; people were really interested in your slide deck as well as Paper.
One thing I'll say on seeding these files, on what reliability looks like for your codebase, what security looks like, what front-end component architecture looks like: just sit down as a team and rapid-fire throw some bullets on a whiteboard about what you think good looks like. These can be pretty high-level; something like "secure code has secure interfaces that are impossible to misuse" is worth writing down. Getting a couple of files into the codebase that carry the seeds of this makes it really easy to incrementally add to them. So take the time as a team to figure out what good looks like for you. Cool. Yeah, like I said, it'll be in GitHub. Awesome, and with that we will wrap up. Here are some additional resources, which I'll also send out post-session. And a quick plug for upcoming Build Hours: on the next slide you'll see we have two coming up, one on agent capabilities happening March 24th, and GPT Realtime 1.5 on April 15th. Follow the link below to our homepage, where you can find past recorded sessions and sign up for everything coming up. And with that, we'll wrap up. Thank you so much for attending, and thanks, Mitch, for joining us. Have a good one. Thanks, fam. Happy building.
