SE Radio 709: Bryan Cantrill on the Data Center Control Plane
Bryan Cantrill, the co-founder and CTO of Oxide Computer Company, speaks with host Jeremy Jung about challenges in deploying hardware on-premises at scale. They discuss the difficulty of building up Samsung data centers with off-the-shelf hardware, how vendors silently replace components that cause performance problems, and why AWS and Google build their own hardware. Bryan describes the security vulnerabilities and poor practices built into many baseboard management controllers, the purpose of a control plane, and his experiences building one in Node.js while struggling with the runtime's future during his time at Joyent. He explains why Oxide chose to use Rust for its control plane and the OpenSolaris-based illumos as the operating system for their vertically integrated rack-scale hardware, which is designed to help address a number of these key challenges.
Brought to you by IEEE Computer Society and IEEE Software magazine.
Show Notes
Related Episodes
- SE Radio 413: Spencer Kimball on CockroachDB
- SE Radio 690: Florian Gilcher on Rust for Safety-Critical Systems
Related links
Transcript
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.
Jeremy Jung 00:00:18 Hey, this is Jeremy Jung for Software Engineering Radio and today I’m talking to Bryan Cantrill. He’s the co-founder and CTO of Oxide Computer Company, he was previously the CTO of Joyent, and he co-authored the DTrace tracing framework while he was at Sun Microsystems. Bryan, welcome to Software Engineering Radio.
Bryan Cantrill 00:00:38 Awesome, thanks for having me. It’s great to be here.
Jeremy Jung 00:00:41 You’re the CTO of a company that makes computers, but I think before we get into that, for a lot of people who build software now, the actual computer is abstracted away. They’re using AWS or some kind of Cloud service. So, I thought we could start by talking about data centers, because you were previously working at Joyent, and I believe you got bought by Samsung, and you’ve previously talked about how you had to figure out how to run things at Samsung’s scale. So yeah, how was your experience with that? What were the challenges there?
Bryan Cantrill 00:01:21 Yeah, I mean, so Joyent was a Cloud computing pioneer. We competed with the likes of AWS and then later GCP and Azure. And we were operating at a scale, right? We had a bunch of machines, a bunch of DCs, but ultimately, we were a VC-backed company and a small company, certainly by Samsung standards. And the reason, by the way, that Samsung bought Joyent is that Samsung’s Cloud bill was, let’s just say, extremely large. They were spending an enormous amount of money every year on the public Cloud. And they realized that in order to secure their fate economically, they had to be running on their own infrastructure. It just did not make sense otherwise. And there was not really a product that Samsung could go buy that would give them that on-prem Cloud. In that regard, the state of the market was really no different.
Bryan Cantrill 00:02:19 And so they went looking for a company and bought Joyent. And when we were on the inside of Samsung, we learned about Samsung scale. And Samsung loves to talk about Samsung scale. And I got to tell you, it is more than just chest thumping. Samsung scale really is, I mean, just the sheer number of devices, the number of customers, just this absolute size. They really wanted to take us out to levels of scale that we certainly had not seen. The reason for buying Joyent was to be able to stand up on their own infrastructure, so we were going to go buy, and we did go buy, a bunch of hardware. And I remember just thinking, God, I hope Dell is somehow magically better. I hope the problems that we have seen in the small, I just remember hoping, and hope is of course a terrible strategy, and it was a terrible strategy here too.
Bryan Cantrill 00:03:09 And the problems that we saw at the large were, I mean, when you scale out, the problems that you see kind of once or twice you now see all the time, and they become absolutely debilitating. And we saw a whole series of really debilitating problems. I mean, in many ways comically debilitating, in terms of showing just how bad the state of the art is. And it should be said, we had great software and great software expertise, and we were controlling our own system software. But even controlling your own system software, your own host OS, your own control plane, which is what we had at Joyent, ultimately you’re limited. I mean, you’ve got the problems that you can obviously solve, the ones that are in your own software, but then there are the problems that are beneath you: the problems that are in the hardware platform, the problems that are in the componentry beneath you, the problems that are in the firmware.
Bryan Cantrill 00:04:09 Those problems become unresolvable, and they are deeply, deeply frustrating. And we just saw a bunch of them. Again, they were comical in retrospect. And I’ll give you a couple of concrete examples just to give you an idea of what you’re looking at. One of our data centers had really pathological I/O latency. We had a very database-heavy workload, and this was right at the period where you were still deploying on rotating media, on hard drives. All-flash did not make economic sense when we did this in 2016. It’d be interesting to know when was the last time that actual hard drives made sense, because I feel like this was close to it. So, we had a bunch of pathological I/O problems, but we had one data center in which the outliers were actually quite a bit worse, and there was so much going on in that system.
Bryan Cantrill 00:05:07 It took us a long time to figure out why. Because when you’re seeing worse I/O, naturally you want to understand what the workload is doing. You’re trying to take a first-principles approach: what’s the workload doing? So this is a very intensive database workload to support the object storage system that we had built, called Manta, and the metadata tier was stored in Postgres. And that was just getting absolutely slaughtered and was ultimately very I/O bound, with these kind of pathological I/O latencies, as we were trying to peel away the layers to figure out what was going on. And I finally had this thing, so it’s like, okay, we are seeing at the device layer, at the disk layer, we are seeing pathological outliers in this data center that we’re not seeing anywhere else.
[...]