The Infrastructure Nightmare Nobody Is Talking About
Chapters11
Emma leads the data platform infrastructure engineering group, responsible for the data systems that underlie analytics, streaming, ML infrastructure, feature stores, and data pipelines, as well as training/eval data preparation and secure, scalable data flows used by every team.
OpenAI's Emma explains how data infrastructure is becoming the bottleneck and enabler as autonomous agents accelerate engineering at scale, and why platform teams must architect guardrails now.
Summary
Emma from OpenAI (2023 hire) leads the data platform infrastructure group, the team responsible for data systems that underlie every product and research system. She details how the data layer powers analytics, streaming, feature stores, and training/eval data, serving every other team—from GTM to HR. The conversation centers on how rapid model improvements (CodeEx and agentic tooling) are accelerating work, yet creating new systemic risks for platform teams that must maintain security and reliability. Emma shares concrete wins like autonomous release pipelines and knowledge capture inside agents, which reduce hours of manual toil and even perform multi-system triage and patching overnight. She also discusses uneven acceleration across teams, the need for autonomous code reviews, defense-in-depth guardrails, and the tension between platform reliability and app-team velocity. Throughout, she emphasizes a culture of experimentation, the importance of an eval harness to test new model capabilities, and the strategic shift leaders must adopt as AI-driven automation reshapes how we operate infrastructure. The takeaway is clear: the future of infrastructure is agent-powered, but needs careful layering of tools, reviews, and guardrails to scale safely.
Key Takeaways
- CodeEx- and agentic tooling have automated release processes, cutting hours of manual testing and triage.
- Autonomous data pipelines and data export for training now run largely without human intervention, with agents diagnosing and patching issues across internal systems.
- Platform teams must implement multi-agent review and governance (separate code owners and reviewers) to safely scale autonomous changes.
- Capture and codify institutional knowledge into agent skills and runbooks to enable safer, smarter automation.
- There is an imminent scalability cliff: app-layer acceleration outpaces platform-layer capabilities, requiring deliberate investments in operations and autonomous guardrails.
- A strong internal culture of experimentation and an eval suite helps teams safely push model capabilities forward, even with fast-changing technology.
Who Is This For?
Infra engineers, data platform leads, and engineering leaders exploring how to safely scale AI-driven automation in complex environments. If you’re worried about platform vs. app-team velocity under AI, this provides a realistic view and practical steps to start experimenting now.
Notable Quotes
""We're completely hands-off. That's like hours of time that people are spending like every day.""
—Emma highlights the impact of autonomous release processes on team workload.
""The models are getting really good. So I think we'll get there pretty fast.""
—Optimism about rapid progress toward fully autonomous operational capabilities.
""It's almost like a defense in-depth kind of strategy... autonomous agentic code reviews""
—Describing how to guard against unsafe autonomous changes at scale.
""CodeEx can turn out features by features... you wake up and it's done.""
—Concrete example of end-to-end autonomous work using CodeEx.
""Business as usual is not going to fly anymore. If you are a leader, you need to be a visionary.""
—A closing advisory to leaders navigating rapid AI-driven change.
Questions This Video Answers
- How can a data platform team scale autonomous tooling without compromising safety?
- What is CodeEx and how does it accelerate software delivery in AI-first companies?
- What guardrails are essential when deploying autonomous infrastructure at scale?
- How do platform teams balance speed of innovation with reliability in an AI-enabled environment?
- What does an eval harness look like for evaluating new AI models in production environments?
OpenAIAI InfrastructureData PlatformCodeExAgentic toolingAutonomous pipelinesGuardrailsCode reviewsPlatform vs. App velocityEval harness
Full Transcript
I would love to hear from you like you know tell me about Emma. Tell me about how you're here at OpenAI, what you're doing etc. So my name is Emma. I joined OpenAI back in 2023 to lead um the data platform infrastructure engineering group here. Um so what we do is anything that involves data systems, the guts and bowels of creating any product or any research system or etc. There's a lot of data systems that underly it. So my group actually take cares of all of that. So whether it's big data so if you want to do analytics you want to crunch some data um or if it's like streaming like you know streaming processing or like event buses or if it's ML infra like you want to do some ranking algorithm etc like feature stores um or if it's and then we get to even like higher level abstractions which builds on top of this which is like piping data from one system to another making sure that's done securely in a scalable way um as well as something unique to open eye which is like preparing training data um and eval data and making sure the systems can sustain kind of that kind of load.
So that's all on data platform. Um so we're we're very low level as a team like product team sits on top of us, research team sits on top of us, we really serve every team in the company. Um whether it's like go to market, you know, trying to sell customers, understanding what customers are doing or finance trying to close the books or like doing M&A deals to understand what's going on. um or even HR um you know but also like you know for the business itself right building the product like personalization is a feature that is very much reliant on data to crunch the um the information used for memory um really every product touches us in some way whether you're logging into OpenAI or API um or integrity it's all plugging into data systems underneath so I I I guess maybe the first question I would have is How has that world, how has your job, how has the team's role shifted maybe in the last year, maybe in the last six months, last three months, things keep moving really quickly.
And I'm curious how you feel like things have evolved for you and the team just as you see the models get better and sort of how that's impacted the data layer. It's all very new for us too. I would say a year ago it wasn't that different from you know artisal software engineering of the of the Y. Um but really in the last six months or so things really started accelerating. I think Codeex got a lot better. Um the models are also like getting much better m like very very fast. Um so we're seeing like really rapid changes in the last six months and I don't think this is the end of it.
It's it's very much the beginning and I think we're going to see even bigger changes coming up very soon um that I don't know if anybody has a clear idea what's going to look like in six months. Um but in terms of on the platform teams like ours um we we we kind of see what's happening in terms of accelerating our own work which is the exciting part. We're also seeing how how much it's accelerating everybody else around us which um if if like different parts of the organism are growing at different rates then you can imagine that there may be like actual problems that are caused by um using a lot of agents and autonomous like coding tools.
So yeah, that makes a ton of sense. Yeah, I I can dig into this as well. Um but uh in terms of just on the on the data infrastructure front in terms of how we're accelerating things um we're using codecs as well as like agentic tooling for things that are just like uh we didn't really imagine that we can use like agents therapy for for example our release process. Um so but okay so I'll back up a little bit. So we have a lot of um uh basically proprietary uh oss software that we use and we basically patch those um like underlying software uh pieces up um every so often and then we release that and then we use that in our systems right so we have our own release process for these underlying components um and it's a very manual process to test validate um make sure like run all the like you know cases through and then to promote it to like the next stage let's say from staging to um canaries and then from canaries to like prod right and then we have a lot of different packages like this you know dozens so every time we need to do a release process it's very manual people have to like watch the jobs like it's couple hours sometimes days and then you have to remember to like check all the results and promote it so now that's all controlled by the agent the release process is the agent itself it does everything um that I mentioned autonomously it will ping you know us in Slack to let us know what the status of things are if something is is broken it will also like try to do its own triage and give suggestions on what's going on.
So, we're completely hands-off. That's like hours of time that people are spending like every day. Um, and it's done a fantastic job, probably better than humans can. Uh, so that's one area. Another area is like, um, we've been using a lot more um, we've been just capturing like the the knowledge, the specialized knowledge that we have as infra teams into more skills um, and more kind of like capture in the agent itself. Um so that users can take advantage of that, right? So um so before there's just like we need to build a lot of guard rails to like oh you you know user can do this wrong and then it'll end up being like a really long running job it'll break.
But now we can encode that in skills. So then when users are using that it actually is like extremely smart about what to do and how to um debug. A recent um kind of example is um we launched a skill for users to help export data for training purposes. So that used to be a very manual process. It's going to take hours. There's a lot of sharp edges in terms of what could go wrong. And um with the new kind of like codeex assisted like um you know skills kind of way of doing things uh the user launched this job and then just went to sleep and then the agent was running in the background.
it actually found some issues. It got blocked. Um, and then normally if you know if a person's blocked, you have to ping us and find out what's going on. The agent was actually able to go into the internal systems itself because it's all one big codebase and figure out what might be wrong there. So, it went into like four or five different internal systems and check the code there and be like, hey, why is this why is this breaking in this way? Um, and then it also pinged us in our support channel, but this was midnight, so we didn't see it.
Um, and then and then it went and actually like three layers deep, it found like this tiny bug that existed three layers deep, it went and somehow patched it and fixed it. Um, and basically worked around it in a way that the job could continue. So by the time a user woke up in the morning, it was completely done. Like no conversations needed to be had. Um, so that's kind of where we're at right now. Um, I can give you more examples on how we're using like kind of agents, but there's like more kind of I think more like kind of vanilla cases like I think everybody's using it this way in that we have some like um kind of uh data tooling um that we launch for all users.
So kind of think about um autonomous dashboarding, autonomous notebooking, you know, uh and in order to make that really good um we take a lot of user feedback and they will just like put in Slack is like I want this feature, I want this feature. So we basically have the agent drive this. We have Codex go pick things up um and really try to do it on its own. And we have created tooling so that Codex can plug into like the full cycle like it can it can it supports browser use for example, right? It can load the whole DOM.
It can figure out like it can do the clicks and figure out what's going on and it can validate its own behavior and its own um uh kind of like fixes and so it can do the full loop. So essentially uh for many tickets not all I would say it's not like quite completely there yet for many tickets it takes like a user like a comment and some description and then codeex will just execute on full stack full loop and then it'll come back with a PR with all the the proof of how it fix it with a video and everything.
Wow. Um and then yeah so it's it's really accelerated a lot of parts of engineering for us. Um, and then I'll talk about the other piece that I mentioned earlier, which is like, okay, what if the acceleration is uneven? And I I think I'm seeing this a little bit right now. I think it'll even out longer term, but right now, this is what we're seeing is that if you are on a team where um the surface area or um like basically the spread of um the damage if something breaks is like a little bit more limited.
Let's say you have a like you're you're still building like um like a project that um hasn't been launched in production yet. You're iterating on it or it's like um you know only launched alpha customers. You're iterating around very very fast. So if you think about like a lot of these kind of teams, they can go very very fast. They can vibe code completely, right? Like codecs can just like turn out like features by features. But if you're on a very like like a like a like um like a root level team um like an infrastructure team where if I change one thing it will like basically affect like thousands of different teams um it's harder to go uh just complete by coding.
We still have to have a lot of guardrails in place. we still have to do a lot of manual checking to make sure that like it doesn't you know affect things um in a way that will like you know harm a lot of users and um uh and so so like I don't think the model is like quite I think it's going to get there but it's not quite there yet in terms of creating like a 100% oneshot it like perfect code so we have to iterate on it a lot more so you can imagine is like a lot of users who are sitting on top of our platform are vibe coding like all these new applications and things um and then they land on our platform which we're in charge of running, right?
If you imagine like a Spark job or like a Flink workload um if they break often times we we we will check with the user and the user probably like has less of a clue of what's going on because it's codeex generated. We've had instances in the past where like users are like I don't even know what Flink is. I don't really know how how to use it. Um but you know but it's it was working I thought you know so I can't really help you. you guys should figure out like what's broken and fix it.
So there's almost like a transfer of responsibility and load onto a lot of these platform teams who are running the code. There's more code, right? There's a lot more code is being created really fast and the teams are responsible for ultimately running it successfully and making sure that it's not breaking. Um they're inheriting a lot of the the burden. So so what are we doing for that? So um at open we're also thinking about how you know we don't have a good term for it. So, so you know, sorry for I'm using my own terminology, but it's like almost like a defense in-depth kind of um kind of strategy where it's like, okay, on the code review side, um this is kind of the interface that we have right now.
We will evolve from here, but how do we make autonomous agentic code reviews happen for each team? So, it captures our specialized knowledge to make sure bad things don't happen. Right now, we still have to spend a lot of manual effort looking at code and inspecting it. Yes, we're using codec to inspect it, but it's in order for things to to really be tied down, we we really still need to do that. How do we encode all of that knowledge and knowhow and all of our runbooks and all the past incidents and all of that and have this agent kind of do that work for us?
Um, I think some people, again, it's an unsolved problem. These are all unsolved problems. Some people are like, okay, why can't we encode that in like agent MD files and like the the the agent that's writing the code itself, right? Why doesn't it just know to do the right thing? Um it's it's an open question but in my perspective I do think the incentives will always be somewhat misaligned. That's why we have you know people who write code and people who review code and they're separate people. You know I I think it'll be it'll be hard for the model to like juggle those two things uh consistently.
I think there should be a multi- aent architecture for for this kind of thing. Um and where a separate agent will will look at the code and review the code and um kind of like you know it's kind of like a code owners like plus+ situation where it's like if this code touches like different teams then the different teams different agents will come in and review the code and make sure that it's like you know tiptop according to that team's like specifications. Um but then also every other layer on the infra operations layer as well.
How can we make that autonomous right? If we if we find like a bad workload, how can we sequester that a lot faster in an autonomous way where it doesn't have to like page us? Um we've also see incidents where like a user again um you know would can can by code something that um they really didn't intend to. they like flip um feature flag from like uh you know like uh that that they didn't mean to turn on and it just took took the entire cafa cluster down you know those kinds of things like how can we build systems that can very quickly capture these kinds of erroneous usages and and and do something about it because I don't think the world is force users to write manual code or like review their code completely manually understand every single thing I don't think that's where we're going we're going to like full autonomous mode where it's like code produced producers and code reviewers and code deployers and code um and also like just people running the systems.
It's all autonomous. It's all agents um every layer. How do we get there um safely? And so I think the models are getting really good. So I think we'll get there pretty fast. So it's a very exciting world we're we're in right now. Um yeah, I'll stop there. No, there's there's a lot to dig into there. Um I could pick at probably any number of half a dozen threads. One of the ones that I think popped out for me is you talk a lot about sort of framing your world in terms of the kinds of autonomous problems that are solvable at the platform team layer versus the kinds of autonomous problems that you know teams that sit on top app teams might might engage with.
And maybe we can dig into that a little bit because I think that you have a really great point that I've seen chatting with other companies as well where they feel like there's this huge separation and difference between what the front-end app teams can go after with perhaps less less need to be completely correct on code versus where platform teams have to be near 100% correct and need to insist on holding that standard even as they're supposed to speed up. And so maybe if you can dig into a little bit more like what are what are the pieces of this where you feel like we have a handle as a platform team on where autonomous agents are at now uh and then maybe we have line of sight to some problems that we think are linearly solvable uh for platform teams and then from there maybe we can blue sky a little bit and and I can hear a little bit about how you're thinking about sort of the longer run and I and by the way I fully realize agents are going so fast the longer run is probably like six months or something like this is the world we live in but I' I'd love to hear your perspective.
Yeah, I mean it's a really tough question because it's so open-ended right now and nobody knows exactly what to do. Um I can tell you what our thoughts are. Um and uh all of these things are are more conjectures right now than things are like um you know close to completion. Um, but I I really do think we need a better at least, you know, as as one of the first things that we should do is um a a better like code review harness almost where it can also plug into different tool calls and also um plug into different knowledge bases so that it can perform autonomously what a human right now needs to do.
uh we we have like a a code review bot but it's not um kind of specialized and I think it needs to be specialized because you know there's just so much vast human knowledge that goes into a code review that's not captured by like a generic agent that just has access to like monor repo um or even like the tools uh yeah so I think iterating on kind of a framework for that will be really important and that's something we're looking into as well um but I Uh yeah the like to to your question is like what is the difference in this world between like you know these teams are like you know maybe building frontends and you know apps that are running on platforms and the platforms itself.
Uh yeah I I think until we solve this problem the disparity will get very great. That's the thing with AI is it's very much power law dynamics and we're seeing this you know in the industry as well like people who are accelerating with HI AI can go much much faster than um teams without and so even within companies that dynamic needs to be noticed and dealt with or else it gets um it can really harm a company's like you know uh ability to like you know uh just run its business. Um, so I guess I'm sorry like I don't have like really great answers to be like, "Oh, this is what we should do to like solve it." Um, but I do think it's something that we should pay attention to um, and invest in.
So on on our end, we're investing in these harnesses. We're investing in like autonomous like operations underneath. But I think there may be even more of a fundamental shift in how things get operated. I can imagine because of power dynamics um that like a lot of like um uh like you know higher layers you know higher layer abstractions teams on top will just be like um extremely agentic whereas the the platform layers um increasingly need to like uh like support more and more load and therefore it needs to like temporarily grow as well in order to build a systems to catch up and once it catches up then we can grow like together.
Um but right now it's almost the scaling laws of the upper layers are like AI scaling laws and the lower layers are human scaling laws and that's not sustainable. So how do we get the lower layers to be into in get into the AI scaling laws as well? We need to build those platforms and I think we I think we're super focused on like code creation and feature creation and not about operations and running right now. And I think that's the next le lever that we need to pull and I think we're we're working on that right now.
So hopefully in another three months we'll have a better answer for you. No. Well, I mean we can always come back and we we can chat some more. Um you know, another thread maybe that I'd pull at uh you mentioned sort of the whole concept of someone sort of typing in Slack something they'd like to see and the agent picking it up and running with it. It got me thinking a little bit about different pathways that uh agents unlock for communication between teams and orgs. Uh so like if we're talking about communicating between teams and orgs, I'm thinking yes, Slack is one where the agents and the humans are together.
But if you think a little bit more creatively, arguably like distributing a plug-in for codeex would also be a form of communication. And you talked about that as well. You're trying to give app teams essentially your brain, your ability to understand how their proposed data pull is actually going to work with your platform and review that code before they decide to set it up and stand it up on their own. Do it. It's sort of playing with that abstraction a little bit. We can be very creative about how how far we want to take that thread.
How have you seen um agent communication patterns shifting how teams work across that platform app team boundary? Yeah, this is a very good question as well. There's so many layers to all of these. Um I'll talk about like maybe the surface layer which is like people are increasingly using agents to like communicate with others like via Slack. There's a lot of generated messages at this point, right? um they they look like they come from humans, but if you read it, it's very obviously agent generated. They tend to be very verbose. Um so so for folks reading it, it it takes a long time and sometimes it's a little bit even harder to like get to the point where it's like, you know, it's too diplomatic or too wrapped up.
Um and then so what people are doing is they're digesting this with their codecs as well. So it's like codeex, right? And then codeex redistill it back to human language for me. What is the point here? Um, we're seeing a little bit of that and I think that's actually like not a bad thing. Um, uh, you know, every time we chat with Codex, every time you give Codex anything, it it it, you know, like it it basically builds a hive brain even even even better. So like the fact that we're mediating through CEX is not a it's not a bad thing.
It'll it'll help us in the future as it accumulates more of this knowledge. Um, but you can imagine as a human in the middle, you know, it sometimes is jarring to see Slack become like a little bit like, you know, filled with like agent communication. Um, so I think there's there's a little bit of like, you know, sense of like, oh, things are really changing, but also like excitement that, you know, we can do this. Um, another layer is, uh, you know, like increasingly as we kind of launch more kind of capabilities, um, as a platform team or as any team like surfacing with users, uh, we're we're trying to get to a point where our agents are good enough that it can handle a lot of the user communications, right?
So, we actually launched this um this kind of support Slack support bot. uh the initial iterations were not that good but now the models have gone very very good and we can leverage codeex um the support is actually getting a lot better so um it's actually able to answer a lot of really great like hard questions very intelligently um and previously you know what I mentioned before is kind of like what users feel as well previously when users see an automated response they instinctively it's like I don't even want to read this it's super long you know like I just want a human to answer my question but because of quality answers have gone so much better.
Um, and because it has tool use and all of this as well, it's not just like searching through like a dead Slack corpus. Um, we see that interaction pattern changing as well. It's like people are more open-minded about like, oh, this is a generated Slack response. It but it could be very helpful. So, I see people reading it a little bit more. So, it's shifting the other direction as well. Um, as the models get really really good, I can imagine like these are all solvable problems by by you know with with model capabilities, right? like whether it's super long and looks super like informal and like too much detail like that's also solvable.
Um including like you know how do you like answer questions better that's also solvable. So I can see the future where it right now in my perception I think the co uh you know the agent can do like sometimes as good of a job but on the most part like maybe less than average than a human does in terms of communication but I can see very quickly that can change. It could be super human. It gets to the point super fast. It's also like very good at like understanding how humans perceive and understand information to make it like super condensed and and and and context aware.
So if I'm in this channel talking about these kind of people, I will use this type of language. If they're in for people, I should talk about this, you know, like you know, really like gain the psychology even. Um and you know, other teams will have their agents that that could do the same as well. I can see that happening very quickly, but we're not quite there yet. Yep. Yep. That makes sense. Um maybe we could play with like one one of the things that's been underneath the conversation so far for me has been thinking about how on a platform team on the infrastructure side of things you have a different at least a potentially different set of primitives that you're working with that you're giving to agents to address and to deal with and I feel like when you talk about the need to have agents scale into the infra layer scale into the platform layer appropriately it feels like part of the conversation is basically ensuring they can work with the right primitives that scale for you versus primitives that scale at the app layer.
It's it's perhaps it's an open-ended question, right? You can come back and say, "No, no, no. I think the primitives are the same." But like I wonder if you would say that. I wonder if you would say no, we have some different primitives that we want agent primitives. I I'll just like give some examples like and maybe I'm like being simplistic about like you know um a lot of other areas as well, but let's say you're you're building a front-end app, right? like you need the codebase. Um you need like a browser, you need to be able to like control the browser and whatever to test things and if you have a backend you need like you know a live service and and being able to spin that up and connect to that, right?
Um but for the most part you can probably just like um use like you know like fake data or whatever uh to to stub data to to kind of test it. But if you are like supporting let's say um a spark cluster for example you know few thousand nodes and we have like you know dozens of clusters across different regions and let's say the cluster is having issues right the agent needs to connect to so many different tools in order to understand what's going on it needs to connect to like the logging the observability needs to connect to like kubernetes it needs to understand the pod it needs to control like all these things um there's you know with with a spark cluster it's not just like one service is connected to like dozens of other services.
There's like the shuffle service. There's also like you know our um our our cluster routing service and there's the quota management service. So you have to connect all of those to understand those two in a live setting. It's not code right. It's not it's code yes but it's not it's it's live code running on a platform. I need to figure out very very different systems plug into all of them understand what's going on put the picture together. So just in terms of number of connectors, the capability that you need to have to delve into all of these different areas and also to do live operations in a way that's secure.
You can't just like try different things. Oh, let's try this, you know, see see if that like solves it. It just probably break everything down. Doesn't go well. It doesn't go well. Exactly. So this is where I feel like the capabilities and the primitives will be different from you know um you know a lot of things but you know it's getting there. Again, the models are getting very very good, very very smart. We just need to apply them in these settings more and more um to test them out. So, it's like a little bit of chicken egg problem, which we're we're solving currently right now, which is like, you know, we don't trust it enough to do a lot of live operations and therefore like it's hard to get a lot better at it.
So, therefore, we're like setting up more kind of isolated environments for it to do it. Right now, we're not quite there where we trust it to completely take over, but we've done a lot of different things with agents. So far, for example, we already have agents pull um for like statuses for different services and give us updates on how things are going during a deploy or during incidents, etc. It's very good at getting like information. We haven't gotten to the point where we fully trust it to like actually come up with like fixes and apply them quite yet, but it can.
We ask it to suggest things and then we'll we'll we'll we'll monitor and look at it ourselves. Um, but again, all things solvable by model capability and we're getting there. Yeah, that's the other piece that like is popping out for me is like I think in a sense the the most of the conversations I'm in what I hear is conversation about models and code. That is a lot of what you've talked about, you know, generating code this and that. Um, and it's framed as a mostly solved problem these days. It's framed as something that the models have become very good at.
And I think that your this conversation opens up a really interesting uh sort of more nuanced picture where you can say it's one thing to say that the model has grasped app level code bases and can kind of get into that and mostly solve that. It's very different to talk about how you support platform and infra teams and how you make sure the model is able to operate autonomously in that space. And when even when you were saying, you know, the the tool calls are there there's like many many more tool calls just to understand what's going on and to understand what's going on correctly.
It sort of got me sort of chewing on this idea that effectively platform teams are on the cutting edge of where scaling laws are going and like you have maybe an order of magnitude maybe two orders of magnitude more complexity that you're dealing with and therefore as the scaling law comes up you're sort of seeing that it hit app team first but you're feeling I'm sort of feeling that cutting edge of like it's not quite there but we can see it getting there as you deal with a much more complex environment. Yeah. Yeah. Yeah. And it adds to the complexity that everything around you is growing very very fast.
Like all the app layer like you know logic is growing very very fast. So not only do we have to deal with the complexity of like trying to upgrade our systems to be agentic. We also are dealing with a deluge of scaling problems just pure raining down from the app layer. Right. The app layer is like okay now we're going to just scale like crazy and like they expect platform to keep up. Yes. Yes. Exactly. So so we're we're kind of in a double whammy right now. Um, but but then again, I'm I'm very optimistic.
We already have like really awesome like ideas and solutions in place to solve a lot of these. And we're very very hopeful. I think we're like like you like we were talking about, I think because of the complexity of kind of platform teams, the model needs to get a little bit stronger for it to be able to do this safely. Um, and we're very hopeful because we see the progress there. Yep. Yep. That makes sense. If you were to give some perspective, you know, I have lots of folks who are deep on data, deep on platform, deep on infra, but of course, they don't work at OpenAI and and they're in my audience.
I'm I'm curious what you would sort of tell them as like their Swiss Army knife or their pocket guide to sort of how to handle some of the complexity they're dealing with because I hear from them some of the same challenges, but of course, they're not working inside a hyperscaler and so they don't necessarily have access to some of the cutting edge stuff that you're going to be able to pull from. And so in that world, you know, what would you say to an infra engineer? What would you say to someone leading a data team um who who isn't working at a hyperscaler?
That's a really good question. Um I think a lot of us on platform teams and infert teams are feeling a little bit underwater. So I would say to those teams first try to get yourself a little bit more time so that you can innovate out of this. Um and and for us for example, we you know rolled out support bots. Again, this this buys us more time, right? With the delage of user requests, like at least some of these less urgent ad hoc requests can be dealt with by an agent. Um we encoded a lot of our best practices in agent MD files.
Um so that you know the agents usually respect those. Um we encoded a lot of our best practices and skills. Again, hopefully like agents will hopefully respect those. And um um but agents are you know just just to to say they're very very innovative. They've they've done things where like okay that was not that's an internal API should have never been exposed. I don't know how you found out about that. Um so yeah so so even if you have it all in scales at HMD you still can do a very squirly stuff. Um and uh and just like build in if you have the resources to just build in a little bit more reinforcement and shoring up of your of your systems to ban bad actors.
I think yeah I I it's it's not intention it's not intentional right um but but a a lot of these a lot of these PRs generated um they can they're almost quite adversarial like the case I mentioned before where they're like doing things they're not supposed to do changing internal APIs to do the thing that they want but you know will break everybody else um you just got to put in more systems and like maybe obiscate those APIs from them make it less accessible to to like you know potentially agent coder to kind of build that defense as well.
And once you buy yourself enough time, innovate on your systems, right? Build that like harness tool so that you can have an agentic response to PRs, right? And build those systems um like for example isolated like you know testing systems so you you feel confident in like a very minimal agentic kind of operations um uh response to like you know some some critical systems and therefore hopefully roll that out in production as well. you need to invest in like wholesale upgrades of your system and so buy yourself time and like start thinking about what needs to be changed underlying to to make this possible for you to scale.
Yeah, I that's just going to be a fascinating tibbit for a lot of folks right there. I think this that like one of the things I'm hearing really pop out is I hear a ton from people who are earlier in the AI transformation journey obviously than than folks at hyperscalers than you are. Um, and sort of the the telegraph from the future that you're sending is basically when you get this AI transformation to a certain point and it starts to take off and your app team start to widely adopt coding agents. The pinch point that starts to emerge is that you now have uh n highly goal- directed agents who are coding in your system and they are going to hit your data and platform layers really hard and they may hit it in a way that feels adversarial even if that is not the intent of the human because the agents are very good and very goal directed and really go after what they want to get.
Yes. Yes. Exactly. Exactly. So, so exactly like you said, it's unintentionally, but sometimes it can get a little adversarial in terms of just the deluge and also in terms of the methods that the agents use to get at your systems. Yep. Yep. Y and ultimately we want to make it so that like we're like our systems are innovative enough and responsive enough and fast enough in an autonomous way where these are not problems. Like again, no bad intentions, right? What is a user trying to do? why does it need to do this and get around things like how can we make that better for the users and go faster together right that's ultimately the goal um but we just need to make sure that we are you know investing in those systems how how do you think about the sort of hierarchy of utility around you've mentioned uh plugins you've mentioned skills you've mentioned markdown files when you're thinking about architecting something that is intended to provide a guardrail or intended to provide some kind of input to an agentic system.
How do you go about sort of segregating that out in a way that you you feel confident you've got the right form factor for that for that tool? It really is an iterative process. I don't think we figured it out. Like all of these are attempts to make it better and most of them have made it m much better, but we still see like gaps. Um uh Got it. Yeah. So it sounds like bias to ship and then you figure it out from there. Yeah. That's why I keep on advocating for like agentic response to like changes right now like skills these are these are model mediated responses yes but it's not like fully like multi- aent world where I can encode my values um separate from um you know like code creators values and then we can have like a balanced dynamic.
We're not quite there yet. I think that's the missing ingredient. All of these things you described are like we still rely on one single model to do the right thing. One single agent to do the right thing. I again this is my personal opinion. I don't think that's enough. Yeah. You were talking about the idea of like a a need to separate incentives between a coding agent and a code reviewer agent for example. Yeah. Yeah. Yeah. Yeah. That makes sense. And so if you were to think about sort of buying your your team time back, the it sounds like just triaging inbound and starting to take action on inbound requests is one of the first places that you look.
Yeah. So for a lot of platform teams like ours, like we have like literally the whole company's are user base. Like we have like 6,000 users essentially, right? Um if you look at like our support channels, like each one of them has like hundreds of questions every day. It's like, oh, you know, my service uh is missing this like access to this service. Like, can you like make that, you know, happen for me? Um, this job is broken in this way. I need someone to help take a look at this and why this is like there's just like so many different very like high cardality type of like, you know, questions being asked.
Yeah. And um, you know, we dedicated resources to just answer those questions and we still do. Um but we're we're really heavily invested in training a model to do this better especially for things that are like more questionoriented like debugging um like why is this happening giving solutions I think there's you know we're we're we're at in this front the model is beyond the frontier it can do this you know it can debug like a job um fairly well it can connect to the sources and and do these and the users can like try those out because it's low consequences a single job you know if it fails they can try it again so the models are like very very good at that so we just need to roll that out more broadly um to all of our systems.
The other thing I mentioned which is another team where a lot of requests are more feature- shaped. It's like I want this dashboard to have like be able to hide this line when I click this thing, right? That's a different type of response and we have a different agent dealing with that, right? And that agent literally just runs a whole loop and tries to like take as much off the plate of of our um of our engineering team as possible. So for every team the response will be very different in terms of what agent will solve your problem the most and and and fill in that gap and then um because the the model capabilities are changing so fast like what it can do can change so every month we have to go and reevaluate can it do this now can it do this other thing now um uh but yeah so we're very optimistic on where it's going but right now it's able to help in specific ways and we're kind of like seeing what it can do now and try to you know help ourselves a little bit in what it can do But like seeing that very quickly it's going to be able to do more and take advantage of that.
Yeah. And that's another piece. Maybe we can tug at that thread before we close here. I It feels to me like you have an intuitive sense of where you would want to push, experiment, or play with a model where you're like, I'm not sure if the model can do this yet. It feels like it's on the edge of the capability set. Let me go and kind of nudge it and play with it and see what we get. I think my my impression as we talk is that that feels very organic for you and I'd love it if you unpack that a little bit for folks who may be struggling with and trying to understand like you know I'm in a complex codebase I'm in a complex environment I understand exactly what you say when you say the model isn't there yet on something how do I know when to push it how do I know when to play with it how do I know when to experiment with it yeah so we have um and the same for folks are using our models as well internally we have previews of models are coming up.
So we do get to play with a a few more versions of them. So we do always try to test the most frontier model um with existing kind of capabilities. The other thing we're trying to do across the different teams is we actually have our EBALs library. Um so for different teams they have EBLs for the core capabilities that they want the agent to do and we're still fleshing that out because again with greater model capabilities we can add even more like types of things we can do. But for example, this eval for um you know full stack like figuring out like you know feature sets and and and deploying that um we have evalu oh it's really getting good at this maybe we should try to get it to do this you know um so we do have those and then we do do a lot of live testing we're always like pushing the frontier on things um you know and this may be more of a manager like you know culture thing is like at OpenAI We're always pushing people to like try more things with it.
Have you tried this other thing with it? If someone comes back like, "Oh, this took me six hours to do." It's like, "Have you pushed Codeex hard enough to try to do this?" Why six hours? Come on. Yeah. And the answer maybe it's like it literally just has to take six hours and slower, but we we're pushing people all the time, right, to like try new things. Um, and there's an internal culture of experimentation as well. It's like we kind of send kudos to people who have innovative use cases. We all share with each other how we're using codecs and how it's accelerating and new workflows that's coming up and and we have um learning sessions where we swap these but also just like culturally we just like there's admiration for people who are able to do this find innovative working use cases um with agents and it's considered cool here it's considered like you know you're you're innovative here.
So that itself creates this kind of culture to like people really want to try what else we can do with it. as soon as a new model comes up like people are experimenting um things that they tried and didn't really work out before they'll try again right again this is where the emails library come up you know we already have a set of things that we really would like the agent to do and we can just keep on trying um and see if it improves there and I will say like most of the larger teams that I talk with do not have the discipline yet of maintaining a private eval suite that is designed to test emerging model capabilities.
So model drops, they're going back and they're either looking at production and they're saying do we want to try and swap it in and see what happens? Which is really scary. Um or they're saying we don't have a defined process for this. We need to assign someone at some point to look at this. And both of those are fairly inefficient ways to explore the space. And I think that one of the things that is a great takeaway is invest in that eval suite. Um and even if you're not in a hyperscaler, you're still going to get new models coming in.
you will still get the chance to use that and it will be very efficient for you to understand what's going on. Yes. Yes. And you it doesn't have to be heavy investment. You can do a very janky email suite. It's just like you have a notion doc with like all of the emails and expected out outputs and just like run it through the model every time it comes out. It can be very janky. It doesn't have to be fancy. Um and that's kind of like you know a scrappy way to do it. Yeah. It's funny that you say that because one of the things that tipped for me and I love how this ties into this idea that you want to be exploring and finding new ways to use models creatively.
I had a tipping point with Codeex uh just this week uh where uh it's obviously much smaller scale from a data perspective, but what I was doing was I was spending time essentially assembling a local folder um with a bunch of meeting transcripts with a bunch of uh research that the agent had done. Um, and my goal was actually turn that into a variety of documents. It wasn't just one. Uh, I think it was eight. And for the first time, I was able to tell CODC, okay, we've had a conversation about this. You've got a local folder with a bunch of stuff, right?
It's got text of various types from different uh input settings. Like some of it is formal reporting, some of it is very informal conversation. Your task is to now simultaneously produce eight different documents. And it did it. Um, and those were angled. It was really That's kind of Yeah. Yeah. Eight different documents. That That is pretty cool. I have never tried that. I need to go try that. There you go. Like kids right here. It's like you're now you're the cool dog, you know? They'll have to learn from me. No, it was really fun. And And like I could feel like for me it feels like the the wind is like blowing my hair back.
Like it's like wow, the future is here. Like I'm doing eight docks at once. Uh, and then inside the same context window, I was able to give it edits and it was able to agentically version out the eight docs and rewrite them until I got them to where I wanted them to be. It was It was really fun. Wow. Wow. Wow. That must have been like crazy. Also, like think about the timesaving. Eight docs in the past. Man, we're we're in a brave new world now. For sure. That's right. That's right. Uh and maybe that that leads us to the last question I have for you which is uh we talk a lot about autonomous AI, autonomous infra kind of where the models are going in this conversation.
Uh human attention continues to be very scarce and so where do you allocate yours? How do you think about how your human attention scales, your team's human attention scales? Uh such a good question. Um I think as an infra engineer or as software engineers in general um we spend a lot of time thinking about scalability and I I think we're we're right in the midst of that. It's like the perfect thing for an inframinded person to think about because it's all about scalability right now. And so I I think a lot about what systems need to be upgraded in order to scale what people systems as well.
like if I want my my folks to like really maximize their time here, they need to like leverage our um agentic coding a lot more or a like agentic tooling a lot more. So I encourage that also on the systems level is like what are the systems we need to fix right now and change and uplevel so that we can respond to the onslaught. Um this this is this ties into all the other things I talk about as well. And then another thing is just you know in this world there's a lot of information um and nobody really knows how this is going to play out and you know in my role a big part of my job is to kind of let other people including um other leaders uh at the company know what's going on on the ground right so uh you know this this kind of like difference acceleration it it could be a problem if it continues like this right it could be a problem in terms of like you know reliability or scalability may really be compromised if we don't invest in um our tooling.
Uh so a big part of my job is also just like talking to various folks and making sure that people understand what's going on on the ground. Makes a ton of sense. Um well, if you had one thing to say and maybe we'll frame it as like if you had one thing to say as an AI powered leader to advise other leaders on like this is how you should think about your job and how it's shifting, what would it be? H I think business as usual is not going to fly anymore. If you are a leader, you need to be a visionary.
You need to like wear that visionary hat and be like things are changing rapidly. How do I navigate in a world where things are changing navally? How do I get all the information I need to understand where we're going? And then give clear direction to your your own folks um and encourage them to not be scared of the brave new world. I think human a lot of times human like the first instinct is like oh my gosh it's like this is scary it's my job going away you know all these things you need to provide that positivity and like um kind of just like inspiration and kind of like a vision for them to realize that this is actually a really exciting time to be you know and we can all be at the forefront of this um so yeah just provide the support for your team.
Love that. Well thank you for taking the time. I really appreciate it Emma. Thank you so much, Nate. Yeah, it's been really fun. I'm glad we got a chance to chat and I'm look looking forward to hearing how like the models make your job easier in the future. I know, right? It's going to get serious in the next few months. I know. Like we have some really awesome stuff coming up. So, uh it's it's going to be great. Codeex is like amazing. Everybody should use it. Yeah. Yeah. It's been a lot of fun to use for me, too.
I've I've been watching the like I I actually do token burn counts every day and like I have Codeex let me know how many tokens I burn. like, "Wow, you can see the power law here because I'm suddenly using it more and more and more." So, yeah. Yeah. Yeah. I love it. I love it. Let's go. Yeah. There you go. Amazing. Thank you so much. Okay. Thank you so much. All right. Bye. All right. See you.
More from AI News & Strategy Daily | Nate B Jones
Get daily recaps from
AI News & Strategy Daily | Nate B Jones
AI-powered summaries delivered to your inbox. Save hours every week while staying fully informed.









