From one person to 80: Scaling a hypergrowth engineering org with Claude Code

Claude | 00:23:58 | May 20, 2026

The speakers outline Base 44 as a vibe coding platform and the plan to grow from a solo founder to a larger engineering team while preserving velocity.

A practical playbook for scaling an engineering org from 1 to 80 using Claude Code, focused on lean processes, simple onboarding, and real-time automation.

Summary

Yav from Base 44 and Gabrielle, who leads their AI team, share a blueprint for scaling a hypergrowth engineering org with Claude Code. They describe starting as a solo founder with a vision for a platform that lets anyone build software, and how this evolved into an 80-person team after an acquisition by Wix. The talk splits into two phases: moving from 1 to 15 engineers with tight, low-friction processes, then from 50 to 80 engineers with more sophisticated experimentation and QA practices. Key tactics include ultra-simple onboarding prompts that map the organization in real time, Claude-assisted PR reviews that amplify a cautious maintainer’s code oversight, and a lightweight, production-focused feedback loop that uses user frustration signals to gauge agent performance. Gabrielle then surveys how Base 44 scaled experiments, built out evals (AI testing suites), and automated QA with Claude Code across browser-based tests, end-to-end scenarios, and a centralized dashboard. Across both halves, the central thread is keeping things deliberately simple, letting past actions guide “taste,” and preparing for the next bottleneck as the team grows. The session closes with reflections on the value of dogfooding your own product, the importance of real-world feedback loops, and an eye toward post-production validation metrics.

Key Takeaways

Start new hires with two simple prompts: (1) summarize who matters in the codebase by reviewing commits, and (2) generate a real-time Mermaid chart of how a component works. This approach rapidly embeds organizational knowledge for the next engineer.
Amplify a cautious maintainer’s PR review by having Claude distill the maintainer’s past review comments into the most crucial points, putting them into review instructions, and refreshing them every few days, enabling faster, safer code deployment without a heavyweight process.
Use a simple, production-first feedback signal—customer frustration when a chat agent misbehaves—to steer version releases and monitor agent health across model, prompt, and infrastructure changes.
Adopt a lightweight experimentation shell connected to PostHog (A/B testing) and BigQuery to give each PR a clear verdict (ship it, roll out gradually, or run an A/B test), with guidelines distilled from past experiments.
Build a user simulator and CI/CD loop that spins up real Base 44 app instances to test AI changes, enabling rapid, low-cost validation of new features in a controlled environment.
Shift left QA by combining Claude Code with browser automation (Playwright MCP) and domain-specific skills to cover the 80% most common flows, reducing manual QA load while preserving coverage.
Recognize that the bottleneck will move as you scale; plan for post-validation metrics (e.g., support tickets, feature usage) to confirm a change’s real impact on users.

Who Is This For?

This talk is essential for startup engineers and engineering leaders who are rapidly expanding teams—especially those using AI-assisted tooling—to learn lean onboarding, PR review, experimentation, and QA automation strategies that keep velocity without sacrificing quality.

Notable Quotes

"Every new engineer that comes into the company, we'll give him a task to basically use two prompts before he starts working on his task."

—Yav explains the ultra-simple onboarding prompts used to map the org in real time.

"We basically put a small percentage of the customers on that version and we can track the frustration level."

—Yav describes using user frustration signals to validate agent performance during releases.

"We wanted to shift left product management decisions in A/B testing, so we also needed better evals."

—Gabrielle on evolving experimentation and AI testing into a scalable process.

"The bottleneck will keep moving."

—Reflects on the ongoing challenge of scaling processes as the team grows.

"This is how we dogfood our own platform."

—Gabrielle emphasizes using Base 44 itself to build the team's internal tools.

Questions This Video Answers

How did Base 44 scale from 1 to 80 engineers using Claude Code in practice?
What simple onboarding prompts did Base 44 use to onboard new engineers quickly?
How can you use customer frustration signals to validate AI agent performance in production?
What is PostHog A/B testing and how was it integrated with Claude Code at Base 44?
How does Base 44 implement lean QA with Claude Code and Playwright MCP?

Claude Code, Base 44, WhatsApp integration, PR review automation, PostHog A/B testing, BigQuery, Playwright MCP, evals (AI testing), QA automation

Full Transcript

Hello everyone. My name is Yav. I lead product at Base 44. Joining me on stage later is Gabrielle, who leads our AI. We're going to talk about how Base 44 scaled from a solo founder-engineer all the way up to 80 engineers, and how Claude Code helped us facilitate that growth while maintaining our velocity. We split this talk into a short intro and then two phases: going from one engineer to 15 engineers, and then going from 50 engineers to 80 engineers.
So let's talk a little bit about the first phase, which is mostly an intro to Base 44 and our solo founder. Base 44 is a vibe coding platform, but that's a new term. A year ago it was Maor thinking, "I want to build a platform that will let anyone build software, non-technical user or technical user. Let's build up the speed." He started the platform at the end of 2024, and by 2025 he already had a working product. He started building in public on LinkedIn and Twitter and gained a lot of traction, and by April 2025 the product was already profitable. That's the moment I joined, because money was starting to flow in and we were getting a lot of traction. And because this was a profitable product with an AI-focused user base and a crazy founder, it started getting the attention of a lot of companies and acquisition opportunities, which leads us to the next phase, which is post-acquisition.
Wix has a very similar user base to Base 44, so they saw Base 44 as a big bet. They wanted to maintain the velocity of Base 44 but expand it dramatically. So we basically went from a two-member team to a 15-engineer team. We needed to scale as fast as possible, and we had four major challenges. One is that onboarding doesn't scale: we can't have Maor onboard each engineer to the team. Code review doesn't scale: Maor was really, really cautious about what goes inside the back end of Base 44, so he wanted to review each PR on his own.
We can't have each engineer sit with our beta testers to understand whether the product is working as expected, so we needed to find a way to automate that as well. And an interesting part of having very immediate product-market fit is that there's a lot of product surface you need to cover. Whether it's integrations, the agentic flow, or the visual editor, there are so many areas, and you need engineers to ramp up really, really quickly.
So let's jump in. How did we solve each one of the challenges? The key takeaway I want everyone to come out of here with, especially those with small teams, is that you need to keep everything very, very simple. The meetings where we tried to tackle those challenges would start with, "Hey, let's build this process where we review everything, and then build an onboarding doc, and we'll do a nightly job that updates it." And we were thinking, actually, no, let's keep it very simple. Every new engineer that comes into the company, we'll give him a task to basically use two prompts before he starts working on his task.
One: go over all the commits and tell me what everyone cares about. So after we were three or four engineers and people started building their knowledge in each area, the fifth and sixth engineers came, wrote this prompt, and they already got this map of the organization. You don't need to think about how to keep onboarding docs updated as new engineers come in. No, a simple prompt gives you the entire map of the organization in real time.
The second thing, before you dive into each area, is to ask Claude, "Hey, can you give me a Mermaid chart of how this component works?" And again, this works in real time, because everything keeps evolving. You don't want to keep thinking, "I need to keep this document up to date." No, Claude keeps it for you. Very, very simple.
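The two onboarding prompts described above could be assembled roughly as follows. This is a minimal sketch, not Base 44's actual tooling: the function names, the prompt wording, and the `author|subject` log format are all assumptions made for illustration. The idea it shows is simply that the commit history, grouped by author, is the raw material for the "who cares about what" prompt.

```python
from collections import defaultdict


def commits_by_author(log_text: str) -> dict[str, list[str]]:
    """Group commit subjects by author.

    Expects one commit per line shaped like 'author|subject', which is what
    `git log --pretty=format:%an|%s` prints. Lines without a '|' are skipped.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for line in log_text.splitlines():
        if "|" not in line:
            continue
        author, subject = line.split("|", 1)
        grouped[author.strip()].append(subject.strip())
    return dict(grouped)


def ownership_prompt(grouped: dict[str, list[str]]) -> str:
    """Build prompt 1 (hypothetical wording): who cares about what."""
    lines = [
        "Go over these commits and tell me what each person cares about",
        "and which areas of the codebase they know best.",
        "",
    ]
    for author, subjects in sorted(grouped.items()):
        lines.append(f"{author}, {len(subjects)} commit(s):")
        lines.extend(f"  - {subject}" for subject in subjects)
    return "\n".join(lines)


def mermaid_prompt(component: str) -> str:
    """Build prompt 2 (hypothetical wording): a diagram of one component."""
    return (
        f"Read the code for the {component} and give me a Mermaid chart "
        "of how this component works today."
    )
```

In practice the log text would come from running `git log --pretty=format:%an|%s` in the repository, and both prompts would be pasted into (or sent to) Claude Code at the start of a new engineer's first task. Because the inputs are read from the repository each time, nothing has to be kept up to date by hand.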
One prompt gives you everything an engineer needs to know in order to start working inside of Base 44.
The second thing is, as I mentioned, Maor was very, very cautious about what code goes inside our agent and what code goes inside the back end of Base 44. So we needed a way to amplify Maor's PR review abilities. After about one or two weeks we already had a big pool of PR comments Maor had added inside our repo. So again, instead of sitting down and brainstorming what instructions we need, let's have Claude review the PRs and say what are the most important and most crucial things we need to keep in mind while engineers are writing their new code. We put that in the instructions, run it every couple of days, and have more PR reviewers inside of Base 44 without having to build a sophisticated and complicated process.
The cool thing is that this is when we really started to see velocity picking up. One of the PRs that we remember and keep referring to: we wanted to do a WhatsApp integration inside Base 44, to communicate with the agent using WhatsApp, and we handed it over to a new engineer. This kind of feature requires an integration, requires working on the agentic flow, requires a new Meta API. We assumed it was going to be a one-to-two-week endeavor. And it was really, really awesome to see that we gave it to him Thursday night, and Sunday morning everything was ready. He onboarded on Thursday using those simple prompts. He sent over the PR. The PR review had two or three small comments, and we were ready to move on to production.
Okay, so we managed to resolve most of the issues. Now we have the issue of how to make sure that what goes into production, especially our agent, works really, really well for our customers.
Previously, when we were a tiny team, we would just sit with customers and hear how they interact with Base 44. But now we needed to find a way to scale. And like almost every naive AI company out there, we said, "Hey, let's build an eval suite. We'll make sure that everything that comes out, we'll run it through our evals. It will work perfectly and we'll understand what's going on." I don't know if you've tried to build an eval mechanism before, but usually a 15-person team is not ready for it. It's a much bigger endeavor.
So we sat down and we said, okay, we already have a tremendous amount of traffic in production. How do we use that traffic to understand whether the model is working for our customers or not? We have conversion rate, which is nice, but we want to understand whether the agent itself, especially for paying customers, is working as expected. So we started looking at the conversations, and a very simple pattern emerged: when everything is working well, the user doesn't say anything. They just go to the next feature, and the next feature, and the next feature. But when things start to break, that's when users get really, really loud inside the chat and say, "Hey, why is this broken? I can't believe it's not working." It's really easy to see that things are broken.
So we said, okay, we have a very strong signal when things aren't working. Why don't we leverage that and ask Claude, using a simple model, a Haiku model, to classify each message on whether the frustration level of the user is high or low? Once we have that, then for every single version of the agent that we want to release, we basically put a small percentage of the customers on that version and we can track the frustration level. And this works whether we're changing the infrastructure, changing the prompt, or changing the model.
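The frustration signal described above reduces to a per-version rate: label each user message as frustrated or not, then compare the share of frustrated messages on the new agent version against the current one. The sketch below assumes that shape. The `ChatMessage` type, the keyword-based `keyword_classifier` (a stand-in for the small-model classification call, which is not shown in the talk), and the `tolerance` threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ChatMessage:
    agent_version: str  # which agent version served this user
    text: str           # the user's message in the chat


def keyword_classifier(text: str) -> bool:
    """Hypothetical stand-in for the model call: True means high frustration."""
    markers = ("broken", "not working", "doesn't work", "can't believe")
    lowered = text.lower()
    return any(marker in lowered for marker in markers)


def frustration_rates(
    messages: Iterable[ChatMessage],
    is_frustrated: Callable[[str], bool] = keyword_classifier,
) -> dict[str, float]:
    """Return the share of frustrated messages for each agent version."""
    totals: dict[str, int] = {}
    frustrated: dict[str, int] = {}
    for message in messages:
        version = message.agent_version
        totals[version] = totals.get(version, 0) + 1
        if is_frustrated(message.text):
            frustrated[version] = frustrated.get(version, 0) + 1
    return {v: frustrated.get(v, 0) / totals[v] for v in totals}


def regressed(
    rates: dict[str, float], baseline: str, candidate: str, tolerance: float = 0.02
) -> bool:
    """Flag the candidate if its frustration rate exceeds the baseline by more than `tolerance`."""
    return rates[candidate] - rates[baseline] > tolerance
```

Swapping `keyword_classifier` for a function that calls a small model keeps the rest unchanged, which is why the classifier is passed in as a parameter. The same comparison applies whether the candidate version differs in infrastructure, prompt, or model.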
And we can understand whether the agent works as well as expected for our users after the change. The key takeaway again is just keeping everything super simple, without building a sophisticated process around it. We hear a lot about "let's build an agent for this" and agent orchestration, but when you're a small team you have very simple ways of getting almost the same amount of value while keeping processes really, really lean. But when you scale from 15 to 80, it becomes a bit of a different challenge, and that's what Gabrielle is going to walk you through. Thank you.
Hello everyone. My name is Gabriel and I lead the app builder agent for Base 44. I had a lot of time to watch Yav behind the scenes, so I got a little bit nervous. Yav has just told you about the first two phases of our growth. In the last couple of months we reached a new point of growth: we started hiring more externally, we had more internal movers moving from Wix to Base 44, and then we even merged in a different product working on vibe coding, and in one single night we doubled our headcount from 40 people to almost 80. That brought a new set of challenges that we had to solve.
We had many new challenges, and I'd like to focus on the three most interesting ones. The first one is how do we do experimentation at scale. Yav just shared how we did the frustration metric and how we A/B tested in production, but you can't expect any new hire to understand exactly which KPIs to test, how long we want to test things, or whether you can just be brave and ship it. Not everything needs an experiment, right? So we knew we wanted to shift left product management decisions in A/B testing. We also needed better evals. Again, back to what Yav just said, we were previously at a point where we knew that evals were not the best ROI for us, but now they became something we really needed to focus on.
And the last thing is how do you do QA properly in a company that's very consumer oriented without growing your testers linearly with the rest of the headcount.
Let's start with experimentation. We started with a general shell of what we wanted to have. We knew we wanted a process that runs when a pull request is ready. We knew that eventually we wanted a bot commenting on GitHub, telling a developer whether or not she could just ship it. If she needs an A/B test, how long should it run? Which KPIs does the experiment need to monitor? And we also wanted it to open the experiment on PostHog. The shell was the easy part, but we also needed the guidelines, the actual logic of how we operate. We never sat down and articulated that. We didn't have a guideline committee. We just had really good product sense and intuitions. So we had one option: get a multi-stakeholder committee and enter a lot of meetings. But we really hate meetings.
So we figured out that our past actions could convey our guidelines in the best way possible. We thought, wait, we can just take the last 100 experiments we had on PostHog and the matching pull requests, and distill our guidelines from that. So we spawned Claude Code and hooked it up to the PostHog MCP. PostHog is an A/B testing and experimentation platform, a pretty great product by the way. We had Claude suggest the first iteration of the guidelines, and it did a great job. It wasn't perfect, very rough around the edges, but we had a working document.
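The "shell" described above (a per-pull-request verdict, an optional test duration, and the KPIs to monitor) can be pictured as a small data structure plus a rule function. The sketch below is only an illustration of that shape. The rules in `recommend`, the `PullRequestInfo` fields, the KPI names, and the 7-day and 30-day durations are placeholders invented for the example; the actual guidelines in the talk were distilled by Claude from past experiments and are not reproduced here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(Enum):
    SHIP = "ship it"
    GRADUAL_ROLLOUT = "gradual rollout"
    AB_TEST = "A/B test"


@dataclass
class PullRequestInfo:
    # Hypothetical signals a bot might extract from a pull request.
    touches_agent: bool
    touches_pricing_or_signup: bool
    user_visible: bool


@dataclass
class Recommendation:
    verdict: Verdict
    test_days: int = 0
    kpis: list[str] = field(default_factory=list)


def recommend(pr: PullRequestInfo) -> Recommendation:
    """Apply placeholder guidelines; real ones would come from the distilled document."""
    if pr.touches_pricing_or_signup:
        # Small effects on conversion need a long test to show up.
        return Recommendation(Verdict.AB_TEST, 30, ["conversion_rate", "premium_rate"])
    if pr.touches_agent:
        return Recommendation(Verdict.AB_TEST, 7, ["frustration_rate", "ai_cost"])
    if pr.user_visible:
        return Recommendation(Verdict.GRADUAL_ROLLOUT)
    return Recommendation(Verdict.SHIP)


def format_comment(rec: Recommendation) -> str:
    """Render the verdict as the text of a pull request comment."""
    if rec.verdict is Verdict.AB_TEST:
        kpis = ", ".join(rec.kpis)
        return f"Verdict: {rec.verdict.value} for {rec.test_days} days. KPIs to monitor: {kpis}."
    return f"Verdict: {rec.verdict.value}."
```

Keeping the rules in one function makes them easy to regenerate or revise as the guidelines document is iterated on, while the comment format and the verdict types stay stable.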
We could just iterate, and a couple of hours later we had something working: each pull request opened gets a clear verdict on whether you can just ship it, do a gradual rollout, or do an A/B test, and for how long. Some features deserve seven days of testing at our scale; some need a full month, because you might affect conversion rates and premium rates by very small percentages.
To wrap all of that up, we needed a central place where everyone could just see what's going on. So it was a great opportunity for us to dogfood our own product: a Base 44 app connected to BigQuery (our data warehouse), to PostHog, to GitHub, to everything, giving us a central place where everyone could see which experiments are running, how they're moving the needle, whether something's causing more AI cost, whether something's reducing the rate of published apps, all the things we cared about. For now this solved the problem for us and allowed us to open up a new paradigm in how we scale our experiments.
Okay, so the next part is evals. This could easily be a full one-hour talk, and maybe next year the team here will even do that. But our challenge was very short term: we needed something to give us real value. We didn't want it to be a three-month project, and we couldn't afford to take our top AI engineers; we need them to work on features and improve the product. So we asked ourselves what we really needed to build. Do we want to just evaluate the output of the model, or do we want to check the correctness of the apps that our users are building? Eventually we had to build a user simulator. Now, for Base 44, when a user types in a request, they want an app, and if some small part of it doesn't work, that doesn't mean that the eval fails. That was a great epiphany moment for us.
It means that our eval suite needs to pipe back the rejection and ask our agent to fix the missing parts. Then we ended up looking at latency, how many turns things took, how much it cost us, and how many credits we took from our users. We got to a working CI/CD pipeline where any change in our AI code spins up a real Base 44 app instance, and we use Stagehand to simulate real user actions, as if there's an automated QA engineer spun up in a small box. That's how we look at it.
And this is what the internal app we built to support that looks like. Again, a great opportunity for us to dogfood our own platform. You can see here an example: the most canonical eval we have is the hello world app. It doesn't mean that Base 44 is performing the way we want. It's a smoke test. It's just making sure we didn't break anything. The way we do that: we ask Base 44 to build us a simple hello world app and assert that the right text is visible and that it looks good. That's very subjective, but we trust the AI to fail it. If it doesn't, we then ask for a very small text change, and then we ask for a small feature. As you can see, most of them just pass. And fun fact, these evals will pass on the smallest model you can think of, which is really cool.
Of course, we have many more complex evals. For example, we have scenarios where we start with an existing app and make many changes. We have scenarios that check our compaction mechanism, which is very complex and requires a lot of user messages. So this brought us to a new paradigm in evals. It's not perfect yet, and we're constantly working on it, but it was the right moment to build such a system.
And the third thing I want to share is how we streamlined QA.
We do believe in shifting quality left, of course, like all of us: unit tests, end-to-end tests. It's obvious that everyone working at Base 44 needs to have complete ownership of what they build. But most of the time you're working on really deep features that have a lot of edge cases. For example, imagine testing a feature that only affects users at a specific subscription tier when they reach a specific point of their credit limit. And imagine your feature has a lot of permutations that affect that. It would be very tedious for everyone to test it manually. That's a classic case where we would hand off to a QA engineer, but then we'd have longer feedback loops, and you have to wait for someone to be available. That wasn't ideal.
We knew that Claude Code could operate a browser, right? Playwright MCP, browser use, there are tons of tools out there. But it was missing critical pieces of how to do it well. For example, each time it had to relearn the platform, the selectors, the flows. Each time it had to understand which events to look for in our database and in Mixpanel. So we started wrapping our common flows in skills. For example, we have one skill that taught Claude Code, combined with the browser, how to go over all the major user flows that most features will touch. And of course, for new features, Claude Code can just understand what the feature does and how to get along. So you don't need to cover 100% in the skill. You just need to maximize the 80% so you have enough context. It's a thin trade-off between the right abstraction level and what you just trust Claude Code with.
The second challenge we had is how to do proper test setup. Let's take the example from before, where you want to test a very specific edge case.
Now, you could just click through and do it very manually, like a human manual QA person would, but that would be very, very slow, right? So what a good QA engineer will do is go to the database and override the setup so that they can test just that case. We needed Claude to be able to do the same thing. So we created CLI tools that abstract our APIs and database cases specifically for the use case of setting up tests, and we built skills that taught Claude how to use those properly.
Eventually we combined all of these efforts and skills into one meta skill of how to do proper QA, and we got to a flow where a pull request opens, the agent triggers, it creates a test plan, sends it to a Base 44 app (also a great opportunity to dogfood our own product), starts testing, and reports back. And this is how it looks for a single test: you get screenshots, and you can know what it tested and what it didn't test. Sometimes I get cases where I know I'm stretching the boundary of what it can do, and then it just writes, "I couldn't test that," and surfaces the missing capabilities. But that works 80% of the time, and it allowed us to shift left deep, edge-case quality assurance and move faster.
Okay, so that's all for the challenges, and I'd love to share a little bit about the common thread across all of our challenges and solutions. Just as Yav said before, we really value simplicity. We try to think bold and simple, and sometimes we'll work very hard not to build complex things when it's not the right time. Evals are a great example. We held off until it was the right moment to build them, and then we went all in.
The second thing is that taste is a big word, right?
Recently everyone's talking about taste, and how it's the last moat of us humans against machines. I believe in that too. But I do think that you can encode a big chunk of your team's or company's taste by looking at your past actions. That pipes back to the memory talk from the last session: you can just look at what you actually did in the last week or so and understand what your guidelines are, for code reviews, for A/B testing, you name it.
The third thing is that if you're lucky enough to work on a product that you yourself can use, that's also a huge win. I think the team at Anthropic constantly speaks about how magical it is to be working on Claude Code, Cowork, and the whole product suite, and how you get the feedback and insight loop going in a magical way. So if you can do it (sometimes you have to stretch a little bit, if you're working on, I don't know, a finance app), find ways to do it. It will be of value.
And the last thing is that the bottleneck will keep moving. For example, our current challenge is, first of all, how to continue and scale all of the processes I just showed, but also how to do post-validation correctly. Once a pull request reaches production, how do you make sure it's moving the right needle? For example, is a bug fix really reducing support tickets? You don't want a human to keep that in his head. Is a feature really being used by users? Of course you want it to raise business metrics, but not everything will show that fast. So we want to automate that.
And that's it for today. We really appreciate you coming, and I really hope you found at least one thing you can take back to your company or organization. Thank you.