The Biggest Test Bench I’ve Ever Seen
The chapter explains the purpose of test benches for hardware development, highlights the need for large, powerful setups to evaluate rack-scale systems, and introduces ASUS's R&D lab and its liquid-cooled Grace Blackwell GB300 hardware, with a glimpse ahead to the upcoming Vera Rubin generation.
ASUS's R&D lab is preparing for Vera Rubin racks that weigh just shy of two tons, using liquid cooling, 1.1 MW of room power, and a 100 kW environmental chamber to stress-test AI hardware at scale.
Summary
Linus Tech Tips tours ASUS's R&D lab to see how big-league, rack-scale hardware gets tested. The centerpiece is a Grace Blackwell GB300 compute node with two Nvidia Bianca boards, each pairing a single 72-core ARM Grace CPU with two Blackwell Ultra GPUs (288 GB of HBM3e each). Power is distributed on a 54 V architecture rather than 12 V to keep wire gauges down, with DC/DC conversion feeding the rest of the system. Each node has a power budget of about 8 kW, so ASUS relies on liquid cooling, including a dedicated CDU next door that brings chilled water under the floor; the room as a whole draws about 1.1 MW. The lab also hosts a massive environmental chamber capable of 100 kW of cooling, with a -40 to 85 °C range for testing extremes and long-term aging. Software tools like AIDC handle design and deployment, while ASUS CC dashboards let teams drill into specific failures, track carbon emissions, and enforce access rules. ASUS estimates that Vera Rubin racks will draw more power and weigh just shy of two tons, which means a power upgrade for the lab and possibly heavier-duty floors for customers, illustrating the trade-offs of next-gen, data-center-grade evaluation. The video also covers practical lessons in testing cadence: running workloads 24/7 while monitoring for data errors and latency across GPU-to-GPU, GPU-to-network, and GPU-to-storage paths. The lab's proximity to ASUS's mass-production lines is highlighted as a benefit for quickly reproducing issues and rolling out fixes. Finally, the video jokes about the chiller's "not-so-chill" 20 °C water setpoint, which trades colder water for roughly $20,000 a year in energy savings.
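The testing cadence described in the summary (long 24/7 load runs while watching for data errors and latency anomalies on GPU-to-GPU, GPU-to-network, and GPU-to-storage paths) can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration only: the path names, the rolling-baseline rule, the 3-sigma threshold, and the `LatencyMonitor` class are hypothetical and are not ASUS's or Nvidia's actual validation tooling.

```python
from collections import deque
from statistics import mean, stdev

# Hypothetical path names mirroring the three data paths mentioned in the video.
PATHS = ("gpu_to_gpu", "gpu_to_network", "gpu_to_storage")


class LatencyMonitor:
    """Flag latency spikes and data errors per data path during a soak test.

    Assumed rule (illustrative only): a sample is anomalous if it reports any
    data errors, or if its latency exceeds the rolling baseline mean by more
    than k standard deviations once enough clean samples have been collected.
    """

    def __init__(self, window: int = 50, k: float = 3.0, min_samples: int = 10):
        self.k = k
        self.min_samples = max(min_samples, 2)  # stdev needs at least 2 points
        self.baseline = {p: deque(maxlen=window) for p in PATHS}
        self.error_counts = {p: 0 for p in PATHS}

    def record(self, path: str, latency_us: float, data_errors: int = 0) -> bool:
        """Record one sample for `path`; return True if it should be flagged."""
        history = self.baseline[path]
        self.error_counts[path] += data_errors
        anomalous = data_errors > 0
        if len(history) >= self.min_samples:
            threshold = mean(history) + self.k * stdev(history)
            if latency_us > threshold:
                anomalous = True
        # Only clean samples update the baseline, so spikes don't inflate it.
        if not anomalous:
            history.append(latency_us)
        return anomalous


if __name__ == "__main__":
    mon = LatencyMonitor()
    for i in range(20):  # simulated steady-state samples around 5 microseconds
        mon.record("gpu_to_gpu", 5.0 + 0.1 * (i % 3))
    print(mon.record("gpu_to_gpu", 5.1))                     # False: within baseline
    print(mon.record("gpu_to_gpu", 50.0))                    # True: latency spike
    print(mon.record("gpu_to_network", 2.0, data_errors=1))  # True: data error
```

Keeping flagged samples out of the baseline is a deliberate choice in this sketch: otherwise a long run of spikes would slowly raise the threshold and hide the very anomalies a soak test is meant to catch.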
Key Takeaways
- Each Grace Blackwell GB300 node uses two Nvidia Bianca boards, each with a 72-core ARM Grace CPU (LPDDR5X memory on LPCAMM modules for flexible configurations) and two Blackwell Ultra GPUs.
- Each Blackwell Ultra GPU carries 288 GB of on-package HBM3e and draws up to 1,400 W; each GPU is cooled by its own single large cold plate.
- Each GB300 node has a power budget of about 8,000 W; the room draws about 1.1 MW, and ASUS estimates Vera Rubin racks at just shy of two tons each.
- The lab employs liquid cooling via a dedicated CDU and under-floor chilled water to validate high-power, data-center-scale deployments.
- Chiller water is intentionally kept around 20 °C, versus the 6 to 7 °C typical in data centers, saving about $20,000 a year in energy; ASUS doesn't recommend this to customers, and a building-wide 7 °C supply is available when colder water is needed.
- AIDC covers 3D space planning, OS deployment, driver management, and firmware updates; ASUS CC handles long-term management, with error drill-down, custom dashboards (e.g., carbon emissions or QoS thresholds), and access rules.
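The roughly 8,000 W node budget can be sanity-checked against the figures quoted in the video: two Bianca boards per node and two Blackwell Ultra GPUs per board at up to 1,400 W each. This sketch only splits the budget into the GPU share and everything else; it does not break down the remainder (Grace CPUs, ConnectX NICs, memory, conversion losses) because the video gives no per-component numbers for those.

```python
# Figures quoted in the video; the split itself is an illustrative sketch.
BOARDS_PER_NODE = 2
GPUS_PER_BOARD = 2
GPU_MAX_W = 1_400
NODE_BUDGET_W = 8_000  # "a power budget of about 8,000 watts"

gpu_total_w = BOARDS_PER_NODE * GPUS_PER_BOARD * GPU_MAX_W
remainder_w = NODE_BUDGET_W - gpu_total_w
gpu_share = gpu_total_w / NODE_BUDGET_W

print(f"GPUs at max: {gpu_total_w} W ({gpu_share:.0%} of budget)")
print(f"Left for CPUs, NICs, memory, and conversion: {remainder_w} W")
```

Under these figures the four GPUs alone account for 5,600 W, about 70% of the node budget, which is why the cold plates are sized around the GPUs first.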
Who Is This For?
Essential viewing for AI developers and data-center engineers evaluating rack-scale hardware and cooling strategies, especially those curious about how enterprise-grade test benches validate next-gen GPUs and CPUs before mass production.
Notable Quotes
"This is a Grace Blackwell GB300 compute node that includes two Nvidia Bianca boards side by side."
—Describes the core hardware setup of the GB300 node.
"Water cooled networking. That's what we've come to."
—Highlights the extreme cooling approach for networking in the rack.
"The total power coming into this room, it's about 1.1 megawatts right now."
—Gives scale of the lab's power envelope.
"Vera Rubin is more power hungry and also heavier than GB300 with ASUS estimating that each rack will come in just shy of two tons."
—Outlines future rack-scale demand and floor implications.
"ASUS targets around 20°C for the water that is going over to the racks next door."
—Explains their cooling strategy and cost considerations.
Questions This Video Answers
- How does ASUS's liquid cooling enable 100 kW+ rack testing for AI hardware?
- What makes the Nvidia Grace Bianca and Blackwell Ultra combo unique for data-center benchmarks?
- What is AIDC and how does ASUS CC help manage a rack-scale deployment?
- Why are Vera Rubin racks heavier and more power-hungry than GB300 racks?
- What are realistic cooling temperatures for data-center test benches and why do some labs push colder water?
ASUS R&D lab, Grace Blackwell GB300, Nvidia Bianca, Blackwell Ultra GPUs, HBM3e memory, Grace CPU LPCAMM, liquid cooling, CDU (coolant distribution unit), AIDC design/deployment software, ASUS CC management software, data center power and cooling testing
Full Transcript
When developing or evaluating new hardware, a test bench is an essential piece of the puzzle. It saves a ton of time just by putting power, cooling, and important tools at your fingertips, which makes it quick and painless to swap the device under test. What it also does is provide a controlled environment so that accurate comparisons can be made to earlier models or competing solutions. But what if the new hardware that you need to evaluate is a giant rack that weighs nearly two tons and sucks back over 100,000 watts of power? Well, then you're going to need a bigger test bench.
And that's exactly what ASUS sponsored us here to see. They're going to be showing off the R&D lab where they are hard at work performing development, maintenance, and long-term torture testing on their enterprise and data center products. We'll be looking at their current setup for Grace Blackwell, including my first look under the hood of a liquid-cooled GB300. And we'll be talking about some of the upgrades that they'll be making to accommodate the just-announced next generation, Vera Rubin. This R&D lab is purpose-built to test rack-scale products. In a lot of ways, it's kind of like a data center, but at a much smaller scale and with a bit more flexibility.
Here in the center is a traditional air cooled setup. So, chilled air comes in from the sides, runs through the servers, and the hot air gets sucked up into the ceiling above us. But because many of the systems here are newer and so power hungry, many of their customers have moved toward liquid cooling and they've got to be able to validate those, too. So, there's a dedicated CDU or coolant distribution unit next door that brings chilled water under the floor and then up anywhere that it's needed. We're going to go look at that in a minute, but first let's look at the kinds of systems they're testing in here.
This is a Grace Blackwell GB300 compute node that includes two Nvidia Bianca boards side by side. Each of those boards gets a single 72-core ARM Grace CPU that uses LPCAMM to offer flexible memory configurations, alongside a pair of Blackwell Ultra GPUs that have 288 gigs of HBM3e each and consume up to 1,400 watts each. To improve efficiency and keep wire gauges down a little, Nvidia is using a 54-volt architecture rather than the 12-volt one that we use in desktop PCs. And the DC/DC conversion to split out to the rest of the system is done with this power supply right here.
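The reasoning behind the 54-volt choice is basic circuit arithmetic: for a fixed power P, bus current is I = P / V, and resistive loss in a conductor scales as I²R. The sketch below applies that to the roughly 8,000 W node budget mentioned in the video; the 8,000 W load is the only figure taken from the video, and the 1 mΩ path resistance is a made-up placeholder chosen only to show the ratio.

```python
def bus_current_a(power_w: float, voltage_v: float) -> float:
    """Current drawn from a bus at a given voltage: I = P / V."""
    return power_w / voltage_v


def conduction_loss_w(current_a: float, resistance_ohm: float) -> float:
    """Resistive loss in the distribution path: P_loss = I^2 * R."""
    return current_a ** 2 * resistance_ohm


NODE_POWER_W = 8_000     # approximate node budget quoted in the video
PATH_RESISTANCE = 0.001  # hypothetical 1 milliohm path, for illustration only

losses = {}
for volts in (12, 54):
    amps = bus_current_a(NODE_POWER_W, volts)
    losses[volts] = conduction_loss_w(amps, PATH_RESISTANCE)
    print(f"{volts:>2} V bus: {amps:6.1f} A, {losses[volts]:5.1f} W lost")

# Same conductor, 4.5x the voltage: 1/4.5 the current, (54/12)^2 = 20.25x less I^2R loss.
ratio = losses[12] / losses[54]
print(f"12 V loss / 54 V loss = {ratio:.2f}")
```

In practice the benefit can be taken partly as thinner conductors (the "keep wire gauges down" point) rather than purely as lower loss; the sketch only shows why both become possible at the higher voltage.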
For networking, each of our Biancas gets a pair of ConnectX 800 Gbit/s NICs, which contribute alongside the CPUs and GPUs to this thing having a power budget of about 8,000 watts. That is why liquid cooling is a must. Let's take a closer look at this module. Man, this thing is heavy, and gorgeous if you're, well, if you're into that sort of thing. Uh, it's all color-coded, too, so you can see where the cold supply comes in here, then splits off in parallel to go to each of those Blackwell Ultra GPUs. Then it comes together to cool the Grace CPU, and you can actually see the contact pads for all of the LPDDR5X memory that goes around it.
That's not necessary on the GPUs because they're using HBM3e stacks, and those are right on the same package. So those are cooled by this single giant plate for each of them. Another thing you might be wondering about is, hey, what's up with these little flexible PCBs that kind of look like little antennas? Those are leak sensors. If any water gets on these, it bridges the two sides, and that gets fed into ASUS's management software, which we're going to take a look at a little bit later. The last thing this cools is the network cards. Water-cooled networking.
That's what we've come to. Of course, nobody buys just one of these nodes, so the R&D lab has to accommodate rack-scale deployments of them. The total power coming into this room, it's about 1.1 megawatts right now. But, uh, with validation coming up for Vera Rubin, they're going to have to upgrade that. Fun fact, by the way: Vera Rubin is more power hungry and also heavier than GB300, with ASUS estimating that each rack will come in just shy of two tons. Uh, you guys might need a heavier-duty floor. And if they do, that's going to come at their own cost.
While Nvidia may at their discretion provide GPUs and CPU chips for development purposes, the responsibility falls to the manufacturer of the rack to procure, design, and build everything else around it. I was really interested in what kind of testing they would do on systems like these. And from talking to ASUS, they say that the exact software differs, but the function is actually surprisingly similar to what we might use to validate a desktop at home. They use a combination of their own software and packages that are provided by Nvidia to artificially load the system, often running it 24/7 for long periods of time while simultaneously monitoring for everything from temperatures to data errors, or especially any anomalies in transfer speeds and latency, whether it's from GPU to GPU, GPU to network, or GPU to storage.
And this is key because while compute matters a lot in any kind of AI inference, the latest reasoning models are especially sensitive to how quickly you can move data through the system. On the subject of speed, ASUS pointed out several times, actually, that one of the best things about this lab is that it is literally just a few blocks away from their mass production. That helps make it easier to collaborate whenever they need to try to reproduce an error or roll out a fix. Now, let's roll out and check out the chiller. This is one of the least chill chillers that I've ever seen.
Not 'cause it's a bad one. ASUS says it's good for about 1.3 megawatts of cooling capacity. It's just not that chill because ASUS takes a dad approach to managing the thermostat. While most data centers use very cold water (Data Center Dynamics says around 6 to 7 °C is typical), ASUS targets more like 20 °C for the water that is going over to the racks next door. That's not something they actually recommend to customers, but it's good enough for a test bench and apparently saves them about $20,000 a year in energy costs. So, I think I finally get why dad was always so stingy.
And besides, it's not like ASUS doesn't still have access to colder water if they need it. The piping in here is all color-coded. So, yellow is the coldish supply to next door. Green is the warm return. Blue is the R-134a refrigerant that chills the water. And then white here is actually a building-wide cold supply that runs at 7 °C and handles chilling the air in the cold aisle next door for any air-cooled deployments. Again, remember this setup is all about flexibility. Now, let's talk about endurance. This environmental chamber honestly makes mine look like a toy.
Even my big one, which by the way is still for sale. Just slide into my DMs. Anyway, this mamajama can do up to 100,000 watts of cooling and has a temperature range as low as -40 °C and as high as 85 °C. Now, nobody would want to put a live server into temperatures that cold or that hot, not if they want it to stay alive. But then again, sometimes they don't want things to stay alive. In here right now, ASUS has some GB200s that are undergoing long-term aging analysis. That means (well, we'll open that later) putting them under dynamic loads:
sometimes very heavy, sometimes lighter, and in wildly varying environmental conditions, as low as -20 °C and as high as 45 °C. Let's see what it's at right now. Oh god, that's unpleasant. Okay, we're going to have to get out of here pretty quick. It's 110 dB and almost 40 °C. Oh, it's so humid. I, uh, I rescued your temperature sensor. You're welcome. What about the second thermal lab? This one behind me is less for long-term aging and more for cooling validation. In there right now is an Nvidia HGX. And what they want to know is, okay, we can see how the fans ramp up at 25 °C, but what if it's deployed in a data center in, say, India and they experience some kind of challenge with their cooling?
For that, they can turn this thing up as high as 45 °C. How much will the fans ramp up? Will the system be stable? There's only one way to know. Of course, nobody wants to hang out in there during all that testing, so that's where ASUS's software comes in. AIDC is for design and deployment. So, they've got everything from a planning utility that lets you just kind of plonk servers down in 3D space and calculate your structural, power, and cooling requirements, to handy, scalable tools that can do operating system deployments, driver management, and firmware updates on everything from boards to NICs to switch trays and even SSDs.
It's even got an app store, but instead of, like, iBeer, it's, like, WEKA, you know. Then next to this, they're showing off ASUS CC, which is more for the long-term management of your deployment. They did a quick demo for me showing how you can track all the vital stats for your site and use this to drill down to, say, the specific machine that logged a given error. You can also create custom dashboards that will report on whatever is important to your organization, like carbon emissions or quality of service thresholds. And then there's a bunch of tools in here for setting rules around software access and managing notifications.
Now it's time to manage your attention to another video you might like. How about the one that we did touring Simon Fraser University's latest supercomputer? That was a really great look at what a real-world deployment of this kind of tech looks like.