ASHRAE Journal: ASHRAE Journal presents
Justin Seter:
Welcome, everyone, to today's episode of the ASHRAE Journal Podcast. We've got a very interesting topic today regarding liquid cooling and data centers, and how that ties in with ASHRAE. So we'll go around and do some introductions of who's here on the call, and then we'll jump right into it. So I'll start: I'm Justin Seter with DLB Associates, and I'll be your facilitator here for today. I've been with DLB for almost 18 years, with about five years in the industry prior to that. I've been working in data centers essentially my whole career. Mechanical by education. I have been a member of ASHRAE since I was a student, back in 2002, and I'm currently a member of ASHRAE TC 7.9 on building commissioning and ASHRAE TC 9.9, the data center committee. And so I'm going to go ahead and pass the introduction to Dave Quirk next.
David Quirk:
Good morning, thanks for having me. So I'm Dave Quirk, president and CEO of DLB Associates. We're about a 330-person consulting, engineering, commissioning, and controls company that operates nationwide in nothing but data centers. My background is as a mechanical engineer, and I've been in the industry over 25 years. I'm a prior chair of ASHRAE TC 9.9 for mission critical facilities and data centers. I'm also a current voting member of that committee, and have participated in other committees, like 90.4, et cetera. Thanks for having me. How about Tom?
Tom Davidson:
Thanks, Dave and Justin. My name is Tom Davidson, I'm a senior mechanical engineer, also at DLB Associates. I have been with the company for about 35 years, and Dave said we probably started doing data centers in the range of 25 years ago, so lots of experience. Also been a corresponding and voting member of TC 9.9. That's all for now, thanks.
Dustin Demetriou:
I'm Dustin Demetriou, I'm not from DLB. I'm the current vice chair of the IT subcommittee of ASHRAE TC 9.9, and the former chair of TC 9.9. I've been in the data center industry for over 15 years now, pretty much all that time focused on data center energy efficiency and liquid cooling for IT equipment. My background is also as a mechanical engineer. I'm currently an ASHRAE Distinguished Lecturer, and also an Uptime Institute accredited sustainability advisor.
Justin Seter:
Thank you, guys, for your intros. It's an action-packed topic we've got here today, so we've definitely got the right people in the industry who can help educate the rest of the industry on why liquid cooling is so important today. So what I want to do first, and then I'll open it up to you guys, is just go through a little bit of the outline of what we're going to talk about today. I think first, we should back up and talk a little bit about the history of server technology and data center cooling over the last, call it, 10 to 15 years, and then what has happened here in the last 24 months to really accelerate and ramp up the adoption of liquid cooling. It feels like everything's going at breakneck pace right now, so what does that mean for the industry? And then what are these trends, and what are the latest innovations in what is happening on the ground today?
And then a very timely topic, since the latest ASHRAE Journal was just published: what is ASHRAE doing about it? How does it relate to standards and guidelines and technical committees within ASHRAE? And then maybe we'll talk about some of the case studies, practical applications, and lessons learned from things that we're seeing right now. And then, where are we headed? What does the future look like? Because it sure does feel like this is only the beginning of this rapid change within the industry. So I'll start with the first question: how did we get here, and why is liquid cooling top of mind for everyone in the industry right now? Maybe Dave, you can start with the answer on that one.
David Quirk:
Sure. Well, this goes back a great many years. For those that have been in the industry a long time, liquid cooling is really nothing new. It's been in the industry going back to the mainframe days, but what is new is the scale at which it's now being deployed in the industry, largely driven by artificial intelligence software applications. So we chugged along for probably the last 20 years doing air cooling within the data center, with CPUs, or central processing units. And now we have this new thing called a GPU, the graphics processing unit, that's being deployed at scale, that is changing the game and requires a different form of heat transfer in order to make the higher densities of those chips work, because we've kind of outgrown the ability of air to solve that problem.
So for me personally, I've been involved with liquid cooling for over 20 years. It dates back to the early days of my involvement with the National Science Foundation. I was an advisory member for the direct-to-chip applications in the early days, when all of this stuff was being developed. Then fast forward a little bit, my involvement through ASHRAE, the writing of the liquid cooling book, and managing that committee for many years made me heavily immersed in this topic, pun intended there. Fast forward to today, as a voting member, it's a topic of conversation on literally every call that we have.
Justin Seter:
You mentioned a couple of things there, both liquid-to-chip and the immersion pun, love it. So Dustin, can you tell us a little bit about that? There are a couple of different types of liquid cooling, right, and I think we're going to sort of focus on liquid-to-chip a little further in. But what is the liquid cooling backdrop? What are some of the other types, and how do they work, as well?
Dustin Demetriou:
Yeah, sure. So as Dave mentioned, you probably hit the top two, right? You have the direct-to-chip solutions, where you would basically replace the air-cooled heat sink on the server processor chip, CPU or GPU, graphics processing unit, with a cold plate, as we call them, where you have some fluid that flows through that to extract the heat, and then ultimately reject that heat through the TCS, the technology cooling system loop, out into the data center. And again, this technology is not new; it's been around since probably the late 1960s or early 1970s with mainframe computers. So that's sort of one type, and you can have that with what we call single-phase coolant, where the fluid stays as a liquid the whole time. You could also have that technology with a two-phase coolant, where the coolant starts as a liquid, picks up the heat and boils, again enhancing heat transfer in that process, and then ultimately recondenses through a condenser loop back to liquid.
So those are kind of the two variants there. Dave mentioned the liquid cooling book; a lot of those approaches we refer to as conductive cooling, where you're basically attaching a device directly to the chips or the components within a server to do that heat rejection. The other type is immersion cooling. This is the case where you're taking the electronics, in whole or in part, and literally immersing them into a fluid that is non-electrically conductive, a dielectric coolant. There are a number of advantages to this: because you're fully immersing the server, you're able to reject all of the heat, not just the heat from specific components as you would with a cold plate, since the whole system is immersed. Again, you have either a system where you're pumping that fluid through the immersion solution, or there are natural convection-based systems, where you're just relying on buoyancy to move the fluid and do the heat transfer.
But again, there, you're fully immersing the system, and we can come back and talk about pros and cons and some things there. But that's a second type. And again, there, you can also have single-phase, where again, the fluid remains a liquid the entire time, or you can have a two-phase version of that, where again, you're boiling the fluid off, and then you're re-condensing that fluid back into a tank or a bath to do the heat transfer. So those are the two types in the nomenclature of ASHRAE when we talk about liquid cooling. It's probably also worth just mentioning we often think of things like rear door heat exchangers, or more in-row kind of cooling solutions, where you are bringing that heat rejection closer to the IT equipment, right? In ASHRAE's nomenclature we call those close-coupled cooling, right? In those cases, the air, or the heat, is still being rejected to air. So we don't really think about those as liquid cooling, we think of those as close-coupled, and really the direct-to-chip and immersion as what we call liquid cooling in the case of ASHRAE nomenclature.
Justin Seter:
Perfect. Thank you for that background, it definitely covers all the bases there. So why, today, is direct-to-chip liquid cooling the thing that's on everybody's mind? Isn't it easy? We've had chilled water in data centers forever, and that kind of thing. So I'll throw that one out for whoever wants to take it first.
David Quirk:
Sure. So the short answer is artificial intelligence. The software is the occupant of the data center, for those that haven't heard that before. And the software is what drives the hardware design and the hardware architecture, which then drives the infrastructure that we, as ASHRAE members, get to design and commission and operate in these facilities.
So artificial intelligence is really, really demanding of computation and bandwidth, and it requires a lot of densification of the hardware in order to make that happen, and for both the large language models and the inference applications of artificial intelligence to work properly. So that's what's causing this incredible boom in the need for really big scale, direct liquid-to-chip applications, because we have to put thousands, tens of thousands, hundreds of thousands of GPUs in a small footprint in the data centers, and each of them has really high density. I mean, in the air-cooled sites, we were working with something like 300 watts a chip, and now we're up to 1,000 watts, 1,500 watts a chip and beyond in the GPU conversation. So that's really what drives this demand, and at such a large scale that we've never seen before.
Dustin Demetriou:
Just to add to that, I agree 100% with the GPUs, but it's actually starting to go even beyond GPUs, right? It's across the board, where density and power consumption are starting to go up across all components. Not too long ago, I mean if I go back to, say, the early 2000s, we had sort of a blip where we thought we were going to have to go back to liquid cooling, and we came out with, at that time, multi-core processors. And that kind of continued for a long time, but really for the last number of years, Moore's Law has kind of gone away. We used to be able to get that performance by putting more and more things onto a single chip while chip power stayed the same.
It doesn't matter today if it's a CPU, if it's a GPU, if it's memory; power is going up across all of those components. And so even for some of the systems that don't have GPUs, that just have CPUs, we're talking 600-, 700-, 800-watt CPUs today, plus a whole bunch of memory, and it just becomes really difficult to cool that with air, especially once you account for the preheat and the exhaust temperatures. It's really challenging, even outside of the GPUs today.
Justin Seter:
So that kind of leads to a point of sort of baselining, for those that may be listening here that are not super familiar with what data center densities are like, or have been like. If we were to imagine a typical server rack, a 42U server rack, what are sort of the ranges of densities for that rack that we've been seeing for, call it, the last couple of decades, that have been able to be cooled with just air, no problem, and where are we headed for that rack density?
David Quirk:
Sure, I can start again on this one. So the average cloud site today, whether we're looking at a hyperscaler or the colo type of site, probably ranges between 10 and 20 kilowatts a rack. So 10 kilowatts, just as a frame of reference, that's like a residential fireplace going full blast, and that's not sitting in front of it, that's like putting your head in the chimney kind of level of heat, so you get the full 10 kilowatts. So that's your typical rack in a data center today. But with the liquid-to-chip applications, what we're seeing is anything from 40 to 60 kilowatts a rack today being deployed at mass scale, all the way up to 120, 130, 300 to 400. And now there are lots of conversations about 500 kilowatts a rack and beyond, already listed on some manufacturers' websites out there. So really, really incredible step functions, or leaps, in the power density per rack, and hence the overall density in the data center going way up from what we used to have.
Justin Seter:
So somewhere between a 10X and a 50X density increase, so now you've got 10 to 50X more heat in essentially the same footprint. So I guess I have two questions related to that. One, it sure sounds like the design and operability are going to be significantly different. And number two, what is ASHRAE doing about that, in terms of being able to publish industry guidance to keep people straight? Because that is a truckload of heat and power.
Dustin Demetriou:
So there's a lot going on within ASHRAE around trying to provide guidance and best practices at the rapid pace that everybody's trying to deploy technology. I mean, the good news is ASHRAE has had guidance on liquid cooling, I think, since 2014, or maybe before that even, when the first version of the liquid cooling book came out. That was done a lot at the time because a lot of the national labs and supercomputing centers were using liquid cooling. And so we've had, for example, the ASHRAE Facility Water System W classes, which try to provide some guidance and standardization around temperatures of the facility water systems.
More recently, in the latest version of the liquid cooling book, which was published last year, or earlier this year actually, in the new Datacom Encyclopedia, we've provided additional guidance on the TCS loop, the technology cooling system, as well as introduced new, what we call, S classes there to further bring more, I'll say, guidance and standardization around the temperatures that this equipment really needs in the data center, to make sure that we can design both the facility and the IT equipment in a consistent manner.
So I'd say that was the first thing. There are also other efforts outside of TC 9.9 directly, things like ASHRAE Standard 127, which is a method of test, I think it's called a method of test for air conditioning units today, but that name is being changed. But the point is that as part of that, we're also introducing things like methods of test for equipment like coolant distribution units, to again try to provide more industry guidance, standardization, and ways that we can really compare technologies, to make sure we're making the right design decision when we think about the deployment of these. And so those are some of the ways that I can think of that ASHRAE is contributing at this point, others have-
David Quirk:
Yeah, I would add to that, Justin, that ASHRAE TC 9.9 did put out a technical bulletin in September of this year. It's a very short document, only four pages, but it provides a lot of critical guidance around liquid cooling applications, and really, it's about resiliency guidance for cold plate applications and deployment. So we can talk about that more later, here, in the podcast, but that's an important piece that just recently was published.
And I'll put a shameless plug here, in that there was a related article in the ASHRAE Journal, in the December edition, that further expanded on that technical bulletin, to help provide the industry with more guidance around the design of the data centers for liquid-to-chip applications. And then last but not least, ASHRAE has always been a leader in research. So there's a research project specific to this topic, out there as well, which I know Tom can speak more to, but it's Work Statement 1972, and it also is about the thermal resiliency for these server applications in data centers, and what sorts of things, what practices need to be done to make sure that we don't create damage to the hardware based on how we operate the infrastructure that supports it.
Justin Seter:
Yeah. Maybe we can spend some time and talk about that now, because going back a decade-plus, the publication of the X-factor for air-cooled servers back in the day was pretty revolutionary in the way it allowed people to own and operate different classes of servers. So it sounds like that doesn't currently exist for liquid cooling, and that's sort of at the heart of this research. So maybe, Tom, do you want to talk about that a little bit more?
Tom Davidson:
Yeah. This work statement, the title is Data Center Direct-to-Chip Liquid Cooling Resiliency - Failure Modes and IT Throttling Impacts, and then also Liquid Cooling Energy Use Metrics and Modeling. And it's working its way through ASHRAE. These things don't happen overnight, but I think Dave Quirk came up with the suggestion for the project probably a little bit over a year ago. We've gone through what's called the RTAR phase, and now we're in the work statement phase, where we're working with the Research Administration Committee, or RAC, to really fine-tune the research project, and hopefully, within a year or so, it will take off.
But I'm just going to switch to the energy metrics part of it. There is an ASHRAE standard for data centers, that's Standard 90.4. I think the currently published version is 2022; 2025 may be out. But in any case, I was kind of interested in that standard and how it applied to data centers. So I did a search for the term "liquid cooling," and not only was there no definition of liquid cooling, but the term didn't even show up in the standard. That doesn't mean that nothing's happening.
And how this relates to the work statement, the 1972 work statement, is that we did reach out to the 90.4 committee and asked them if they could provide us with the calculations that they used to come up with the current MLC and ELC components of the energy standard, the mechanical load component and the electrical loss component, and they've agreed to do that. So I think this new research, in addition to looking at cooling resiliency, will also be able to compare the energy consumption of liquid cooling products to air-cooled products, so that we can kind of, on an apples-to-apples basis, figure out what 90.4 needs to do from an energy standpoint to look at the efficiency of liquid cooling over time.
Justin Seter:
So efficiency and resiliency, I think those are probably the two topics that are going to be most top of mind for everybody that's listening to this podcast right now. So maybe if we dive into each of those a little bit here, and just give some of our thoughts on, maybe we'll start with resiliency. So obviously, the research project's going to take some time, but what do we anticipate as outcomes of that? Is it going to be sort of standard entering conditions, rate of rise limits, maximum case temperatures? What do you guys sort of forecast, if you look into the crystal ball on what does that resiliency guideline look like?
David Quirk:
Yeah, the reason why I prompted that research project is that, as design engineers, we need better guidance for this interface between the infrastructure and the hardware. When data centers were air-cooled, that interface was air, and there's a different time constant associated with that, meaning we had a lot more safety factor, a lot more buffer. If things went awry with the infrastructure supporting that air cooling to the servers, there was a lot more time to deal with that deviation in temperatures, whether on the entering conditions to the rack or otherwise. And when we look at direct liquid-to-chip, that heat transfer is much faster and that time constant is a lot smaller. We just don't have the safety factors that we once had with air cooling, so we're really trying to define some criteria, not only around the entering conditions, but also around what the transient scenarios look like. Because all of these facilities are connected to the grid, or to their own standalone power supply, and all of those are subject to failure, including the backup systems.
And so what happens when we transfer power in these facilities and we have open-transition conditions on the electrical gear? Those transient conditions then create ripples into the cooling infrastructure. And we need to understand, as design engineers, how much time we have, how much deviation we have on those supply temperatures going into those racks, and what kind of impact it has, not only on the software operations of those processors, but on the hardware itself, and whether we will be creating premature failures of the hardware. So this research project is really about trying to define the criteria such that both parties, the IT hardware side of the world and the infrastructure guys, have this common terminology and meeting place, where they can make certain that we don't create those problems with either the software operation or the hardware during all those transient conditions that data centers inevitably go through, no matter what their availability and redundancy criteria are.
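To make that time-constant point concrete, here is a minimal sketch of the ride-through arithmetic in Python. The loop volume, allowable temperature rise, and load below are illustrative assumptions, not values from the technical bulletin or the work statement.

```python
# Minimal ride-through sketch (illustrative assumptions only, not ASHRAE guidance):
# if facility heat rejection is lost but the IT load keeps running, roughly how long
# until the TCS supply temperature drifts past an allowable rise?

def ride_through_seconds(loop_volume_L: float,
                         allowable_rise_K: float,
                         it_load_kW: float,
                         rho_kg_per_L: float = 1.0,     # assume a water-like coolant
                         cp_J_per_kgK: float = 4186.0) -> float:
    """Lumped estimate: time = (mass * cp * allowable dT) / heat load.
    Ignores piping/cold-plate metal mass and any residual heat rejection,
    so it is only an order-of-magnitude figure."""
    thermal_mass_J_per_K = loop_volume_L * rho_kg_per_L * cp_J_per_kgK
    return thermal_mass_J_per_K * allowable_rise_K / (it_load_kW * 1000.0)

# Hypothetical example: 500 L TCS loop, 5 K allowable rise, 1 MW of liquid-cooled load
print(f"{ride_through_seconds(500, 5, 1000):.0f} s of buffer")   # ~10 s
# The same 1 MW in an air-cooled hall rides on the room air and building mass instead,
# which is why the air-cooled time constants felt so much more forgiving.
```

With those assumptions, a fully loaded TCS loop has on the order of ten seconds of buffer, which is the kind of window the open-transition scenarios described above have to fit inside.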
Dustin Demetriou:
Yeah. I mean, I'll add to that: part of the reason we have these challenges today is also kind of the way this industry has evolved, right? 10, 15, 20 years ago, when the industry was delivering liquid cooling systems, I'll say most of those systems were completely delivered by a single vendor. They would design not only the server equipment or the IT equipment, they would design the coolant distribution units and the pumps and the plumbing, and all the infrastructure you need, and the TCS loop, to make this thing work. And for many reasons, today the industry has sort of evolved to where, as we're doing these huge, large-scale deployments, it really isn't practical for a single vendor to deliver all of that infrastructure. And we're starting to see this infrastructure being delivered more like facility infrastructure, which is bits and pieces from different vendors, and having to make all that stuff work together with controls.
So that's also been a big challenge that the industry is starting to have to contend with: how do we take those practices that have been delivered as a sort of packaged unit, disaggregate them, and still design them efficiently, with all the resiliency and all of the ride-through that we need, as a bunch of disaggregated pieces? So that's the other thing that I think the industry's contending with here too. And again, it's about getting those best practices and guidance out through things like this research, so that not only the IT vendors, but also the design engineers and the commissioning engineers, understand it and talk the same language.
Tom Davidson:
I have pulled up the objectives from the work statement, so there are a lot of them, but maybe I'll just briefly go through them so people have a sense as to the breadth of the research that we're looking at. Objective one: research failure system design for air-cooled, hybrid, and liquid-cooled equipment, using models and empirical data, just kind of an overall objective. But we're also looking at ITE power and thermal capacitance impact on IT throttling time; this could be normalized, such as by curb weight or processor and server types. Another objective is to look at the impact of the TCS loop liquid inlet temperatures, and these are the S classes from the ASHRAE liquid cooling book, and to look at that impact on IT throttling time. We also want to look at the impact of the TCS loop liquid delta-T on IT throttling time; there is not a whole lot of information on that that we've seen. We'd like to look at the impact of hybrid servers' liquid cooling percentage on the rate of rise and the IT throttling time.
So these servers have a combination of liquid and air cooling; they're typically not 100% liquid-cooled unless they're immersion. This research does not include immersion, but to the extent that we're able to look at that variation, it's just another data point that we can look at. And we want to assess the impact of the TCS loop liquid flow rate failure percentage on IT throttling time. So hopefully the loop is designed with more than one pump, so that if some percentage of the pumps go down, how does that impact the overall failure rate? There's a difference between a complete failure and a partial failure.
Let's see. Once we get the basic data, then the plan is to really sit back and come up with about 30 tests that will look at various combinations of these, and really try to define what works best in terms of practical information that the industry can use to help with liquid cooling design. And finally, looking at that energy impact. Again, 90.4 isn't really addressing liquid cooling, but we feel that it needs to be, and I think they know that they need to. And to that extent, as I mentioned previously, they'll be supplying some calculations that they've done based on air cooling, and we'll be able to compare that with some liquid cooling results. And that has some tweaks and challenges, which maybe I'll just touch on a little bit later. So those are the main objectives. Thanks.
Justin Seter:
Yeah, that's very exciting research, and I do hope we get good industry support and donations, and things like that, to allow ASHRAE to take on that research expeditiously, because I know that there are a lot of people that really need this right now. And you hit on a couple of things there, Tom, and I do want to come back to energy efficiency, and we'll get there. But you mentioned hybrid liquid/air a couple of times there, so maybe, Dave or Dustin, if you guys want to just talk about what that looks like a little bit. If we imagine a 10-megawatt data hall today, you have just 10 megawatts of cooling, perimeter cooling, or close-coupled, or something like that. What does a 10-megawatt data hall for a direct-to-chip application look like?
David Quirk:
Yeah, sure. I'll take it first here, from an infrastructure standpoint. So it's a bit of a misnomer when we talk liquid cooling. As Dustin mentioned before, we have two different types that are defined in the liquid cooling book: there's the kind that has a conductive interface, which includes things like a heat pipe or cold plates, and then there's the immersion type. For this podcast, we've been talking about the cold plate applications. Well, that conductive interface is typically specific to the chips, so the CPU and GPU, and perhaps a few other elements within the motherboard. But server motherboards have lots of other components on them, things like power supplies, RAM, DRAM, and other elements on the board, which Dustin can speak about, that still require air cooling. And so from an infrastructure standpoint, when we look at a 50-kilowatt-a-rack liquid-to-chip application, we are still dealing with 10 kW a rack of air cooling.
A common rule of thumb emerging from the industry right now is the 80/20 rule; it seems to apply to everything in life, and it does so with liquid-cooled servers, in that roughly 80% of the heat rejection is from the liquid at the chip, and 20% tends to be air cooling. And again, it's a rule of thumb, so if we're dealing with a 50-kilowatt rack, we're still at 10 kilowatts a rack of air cooling for that liquid-to-chip application. If we double that again and go up to 100, you can see quickly that the math for the air cooling continues to rise. And obviously, we will eventually still run into problems with air cooling as we get hotter and hotter there, as that ratio continues to scale proportionally.
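As a rough sketch of that rule-of-thumb math, the Python below splits a rack's heat with an assumed 80/20 ratio and converts each side to a flow using the common HVAC approximations for water and air; the rack sizes and temperature differentials are illustrative assumptions, not ASHRAE values.

```python
# Rough sketch of the 80/20 rule of thumb for a direct-to-chip rack.
# The split, delta-Ts, and rack sizes below are illustrative assumptions,
# not values from ASHRAE guidance.

BTU_PER_KW = 3412.0

def rack_split(rack_kw: float, liquid_fraction: float = 0.8,
               water_dt_F: float = 15.0, air_dt_F: float = 20.0):
    """Split rack heat between the TCS loop and room air, then apply the
    common HVAC approximations: GPM ~ BTU/hr / (500 * dT) for water and
    CFM ~ BTU/hr / (1.08 * dT) for air."""
    liquid_kw = rack_kw * liquid_fraction
    air_kw = rack_kw - liquid_kw
    gpm = liquid_kw * BTU_PER_KW / (500.0 * water_dt_F)
    cfm = air_kw * BTU_PER_KW / (1.08 * air_dt_F)
    return liquid_kw, air_kw, gpm, cfm

for rack_kw in (50, 100, 150):
    liq, air, gpm, cfm = rack_split(rack_kw)
    print(f"{rack_kw} kW rack: {liq:.0f} kW to liquid (~{gpm:.0f} GPM), "
          f"{air:.0f} kW still to air (~{cfm:.0f} CFM)")
```

At an assumed 150 kW rack, that residual 20% is already a 30 kW air-cooled load, which is more than most air-cooled racks today.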
Dustin Demetriou:
Yeah, sure. It's a great point. I mean, we're talking about having to still cool a 30-kilowatt air-cooled rack, which really not many sites are even doing today. But I'll just add a couple of other things: as we start to go up in densities, everything becomes harder and harder to cool. Whereas 15, 20 years ago the industry was talking about liquid cooling primarily for energy efficiency reasons, and at that time there was a lot of discussion around waste heat reuse, we were talking ASHRAE W40, W45 temperatures, that is becoming harder and harder today with these very high power density systems. And there are a lot of things that go into why we have density. Probably in the AI space, the biggest one is interconnection, and having to actually physically have these systems very densely packed and close together. Oftentimes, we get asked the question, "Well, why can't we just spread these things out and reduce the rack density a little bit?"
Well, the way we drive the performance is by making these things denser and denser, and shortening the interconnection between all of those things. So I mean there's a lot of challenges there. I mean, the other thing I will point out is that AI is one use case, there are certainly other use cases that probably need liquid, that aren't going to be driving 400, 500 kilowatt racks, like Dave said. But I think it's important from a data center perspective, that we don't just go and design a data center for one use case, right? Because if we design everything for 400, 500 kilowatt racks, it's probably going to push us to lower water temperatures to do that, and it might not be the right solution if we're only trying to cool a 50, 60 kilowatt rack, too.
So I think that's also a part that's really important, is making sure we're thinking about these different use cases as we think about the infrastructure, otherwise we could be either wasting energy, or not delivering the performance that we need across these sites, or making huge investments, that five years from now, we can't really leverage anymore. So I mean, that's the other thing I would add.
Justin Seter:
Those are great points. And so now it feels like we're starting to get into the meat and potatoes here of what this density means for designs. So when we think about use cases and we think about who the customers are that are building these, and Dustin, you mentioned earlier the single ecosystem, where all of the equipment comes from a single manufacturer and they're in control of it from heat rejection through software. How does that apply in today's landscape, when we have the hyperscalers, for those who don't know, all the big internet company names, and then you have colocation providers that are also building facilities for, in some cases, a lot of the same customers? How is that different between when the hyperscalers are building it internally and when the colocation providers just sort of have to meet the deadline on a contract?
David Quirk:
Oh, my favorite topic. Every time you add more players to the mix, you always complicate a problem, and this is no different. We like to think of it in terms of control boundaries in the engineering world, and so who owns what control boundary is really what this conversation comes down to. When we had air-cooled servers, that control boundary was simple: just draw a box around the IT racks and servers, and that was their domain, and everything else was easily divided as the data center owner-operator domain. Well, now these two are literally close-coupled, with piping and fixed connections at a different level. That interconnect used to be just fiber and Ethernet cables; now it's fiber, Ethernet cables, and liquid cooling piping. So there's a physical hard connection there between the cooling infrastructure and the server racks. So I'm going to use your favorite line there, Justin: "Has anybody told the IT guys they're now part of the commissioning group?"
And that really speaks to the fundamental change that has come about by way of liquid-to-chip applications. Everybody's role has now really changed, because those boundary conditions are no longer just really simple and clean. And so even when we look at the difference between a tenant, let's say a hyperscaler or an enterprise customer that's going to be in a data center, and the colo company who built and is going to operate that facility: who owns the CDU? Who owns that TCS piping system? Who owns the servers and the server racks? And oftentimes, the answers to those questions vary for every project, and that's what creates the challenge, because now you have this fixed system where you can draw a line for the boundary condition, but there's overlap in who owns, operates, tests, commissions, designs, and ultimately holds the liability within the system. And no matter how far you draw that boundary condition upstream, you still have that same problem to deal with in these facilities now. Anything to add there, Dustin?
Dustin Demetriou:
Yeah. I mean, I'd say the other challenge is in an enterprise or a colo site, where, in the hyperscaler world, maybe they have a bit more control over exactly all those aspects, but a colo probably needs to be supporting many different vendors' equipment. So this is where, in the past, when the solution was delivered as a solution, vendor A could have their own water quality requirements and materials requirements for the system, right? Now, if you try to do that at the facility level and put in different vendors' equipment that may have different requirements, it becomes a real challenge, and how do you coordinate?
So whereas in the colo world, maybe they didn't want to talk to the IT vendors, because that was the tenant's responsibility, right? Now, if you're delivering a colo site, you really need to know what equipment's going to go in there, because all of this equipment's going to have different water treatment plans and requirements and materials requirements, and things like that. So I think there's a lot of challenges that come along with just the disaggregation, and the fact that these are heterogeneous sites, so I think that's the other challenge.
David Quirk:
Yeah. Further, to add to that, this is the reason why ASHRAE is doing publications like the liquid cooling book and the latest bulletins: to try to drive that industry consistency and standardization. So things like the S classes are really important. But interestingly, this is brand new to the industry, barely a year old, and what we're seeing is that the industry at large is still unaware of these, and hasn't adopted or started using them. And then what's compounding the issue, and this is also going to compound the energy efficiency problem, is that when you don't have a homogeneous environment where it's all one stakeholder, you introduce lots of inefficiency and safety factors, largely driven by contracts. So the manufacturer of the hardware may say, "Hey, I can take this entering temperature," but then the buyer of those services says, "Well, I'm going to run this software here, and I'm going to overclock those processors, so I want this lower number to give me a factor of safety."
And then the colo guy says, "Well, okay, I'm going to make sure I don't violate my SLA limit based on a number that you set lower, so I'm going to set mine even lower than that." And then the design engineers get a hold of that and say, "Oh, we've got to set it lower than that to make sure we don't hit that limit on your end." And so all of these factors of safety get added in when you add more stakeholders, because every one of those has a legal agreement with the others. And all of that drives energy inefficiency into the design, and it's really hard to overcome right now as an industry. So that further underscores the importance of the standardization, of getting everybody on the same page, and of the research projects, and why those have to happen: to provide that comfort level, close some of those factors of safety between everybody's divisions and domains, and improve the overall end product as a result. We have a long way to go, in my opinion, but this type of podcast is important for getting that word out.
Justin Seter:
Yeah, it sure sounds like the equation is getting a lot more complicated. I mean, I can remember back in the late 2000s, when the only story that mattered in the data center was how can we get the temperature up? Because we had run data centers far too cold, for far too long, costing far too much energy, and so ASHRAE TC 9.9 led the way on being able to increase temperatures in data centers and get that efficiency under control. And everything that I just heard sure sounds like there's a whole bunch of temperatures that are going down, and a lot of safety factors. I mean, if we're talking about safety factors to prevent IT hardware failure, not just throttling but failure, and knowing the cost of these systems these days, I mean we're talking hundreds of millions or billions of dollars potentially per data center, what does the future look like? How do you trade that off against efficiency?
David Quirk:
Yeah, great question. There is a figure in the liquid cooling book, I believe it's in chapter six, that shows this, and it is really representative of the challenge we're up against. And it's physics. We have silicon-based processors today, and they have an upper limit on their operating temperatures when doing conductive heat transfer. So we have this heat flux equation that we have to work against: as the density, the power, and the heat flux in that chip go up, we have to basically bring the temperature of the TCS loop, or that entering condition, down in order to get enough of that conductive heat transfer from the chip. So that graph actually shows that as the device power at the chip level goes up, we have to bring that TCS entering temperature down. And all those factors of safety I mentioned before will continue to compound on top of that.
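A minimal way to see that relationship, assuming a simple lumped thermal-resistance model (the figure in the liquid cooling book is more detailed, and the numbers below are illustrative assumptions):

$$
T_{\text{fluid,in,max}} \;=\; T_{\text{case,max}} \;-\; q \cdot R_{\text{case-to-fluid}}
$$

With an assumed case limit of 85°C and an assumed case-to-fluid resistance of 0.03 K/W, a 300 W chip can tolerate coolant up to roughly 85 − 300(0.03) ≈ 76°C, while a 1,500 W chip is already down to roughly 85 − 1,500(0.03) ≈ 40°C: the same downward trend in allowable TCS entering temperature the figure describes.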
Probably a better way to think about the efficiency as we go denser is that we're able to do far more processing. The unit of work in those much denser chips is significantly higher. So if you think of deploying a whole data center with much, much denser chips, it's getting orders of magnitude more work done versus one at the lower density. So I think that's going to be the real challenge for 90.4 and others to wrap their heads around: even though we have to drive the temperatures down for these higher density chips, we're ultimately going to be improving the useful work per watt, as it were. And that's really going to be difficult for that committee to measure and quantify. I think we're probably not going to be able to, and we're going to have to find other ways to solve this problem. Dustin, you got anything you want to add on that?
Dustin Demetriou:
No. I mean, that's the exact point, right? Whether we look at sustainability or just energy efficiency, the metrics we're using have to really reflect back to, as you said, the useful work we're getting per watt of energy, because that's ultimately the energy efficiency goal. And we've all heard the stories from 10, 15 years ago, back to your point about raising temperatures, Justin, where all of a sudden you raise the temperature in the data center and save some chiller energy in your mechanical plant, but you actually end up consuming more kilowatts in your data center because your server fans sped up to overcome that. And so that challenge of making sure we're not looking at a sole metric to drive all the decisions is going to be key in liquid cooling.
David Quirk:
And I think this is really, fundamentally, the challenge with today's data centers on the water usage front as well. Right now, the industry has largely gone to air cooling, which has a higher penalty on energy usage at a site level. But when we look at the trade-offs between energy and water, water has become the dominant factor in that equation, which has forced us, and limited us, in our cooling solutions on the heat rejection side. I think it's a similar issue we're going to have with driving energy efficiency equations into liquid-cooled applications, for a similar reason: you have to zoom out far enough and measure the right metrics. And I've commented on this in the industry in other forums, but I think we're going about the water problem the wrong way, too.
So while we're on liquid cooling, let's talk water for a moment. I really think the industry as a whole needs a new metric, like water usage credits, similar to RECs for renewable energy, where the industry puts funding back into the development of sustainable water resources so it can use water at the site. Because one way or the other, we're likely using water back at the source, and we're going to run out of water eventually, so we have to figure out a more sustainable model as an industry to make that dynamic work and sustain long-term, as we add many gigawatts of power for these facilities, whether on-grid or off-grid.
Justin Seter:
Yeah, great. But I think we should talk about that for a moment, because for any metric we're using to measure things like efficiency for water and power, the denominator has always been pretty simple, frankly, although the decades of arguing over PUE in subcommittee meetings would probably suggest otherwise. But fundamentally, PUE is a pretty simple metric in a lot of ways. Now, though, we have to get useful compute as the denominator, and maybe it's worth talking about that. I think the industry has come up with some different words for GPU. GPU has always meant graphics processing unit, and that's what they were used for, PC gaming essentially; that's where that was born from. But GPUs are really good at parallel computing versus serial computing, so now you've basically supercharged a CPU with a GPU and allowed it to do a one-to-many factor of work.
And so if you look at a row of accelerated CPUs that have GPU components in them, you could potentially replace two, five, 10, 20 rows of traditional cloud servers. So now think about the efficiency that one row can deliver: maybe the efficiency of that one row, taken by itself, looks different, but you also just replaced 20 rows, so the building is smaller and the room is smaller, and you get all the other types of efficiency that come with that. So how does that ultimately tie into useful work, or this fundamental change in the IT hardware, for an industry that's got well-established metrics that are going to be difficult to overcome? I know that's a tough question.
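Purely as an illustration of that trade-off, with made-up numbers rather than anything cited in this conversation, a quick sketch of "useful work per watt" for a consolidated row versus the rows it replaces might look like this:

```python
# Made-up numbers, purely to illustrate the "useful work per watt" framing:
# a single dense, liquid-cooled row can draw more power per rack than the rows
# it replaces and still come out well ahead on work delivered per kilowatt.

scenarios = {
    "20 rows of traditional cloud servers": {"rows": 20, "kw_per_row": 100, "work_per_row": 1.0},
    "1 row of accelerated (GPU) servers":   {"rows": 1,  "kw_per_row": 500, "work_per_row": 20.0},
}

for name, s in scenarios.items():
    total_kw = s["rows"] * s["kw_per_row"]
    total_work = s["rows"] * s["work_per_row"]          # arbitrary "units of useful work"
    print(f"{name}: {total_kw} kW total, {total_work / total_kw:.3f} work units per kW")
# Same total work (20 units) either way, but 2,000 kW versus 500 kW to deliver it.
```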
Dustin Demetriou:
Yeah, it's tough, I wish I had the answer, but this is an area where the industry, for years and years and years, has been trying to come up with this useful work per watt metric. There's been a lot of work that's been done. I'll say, if you look at things like Energy Star for servers, there are what we call the server efficiency rating tool benchmarks that can be used there; you get out of that a metric of performance per watt, and there are a bunch of different benchmarks. But I think the key challenge here is that those are benchmarks, so they serve a purpose, but the purpose is not exactly, "This is exactly what you're going to get when I deploy this," right? It's just a benchmark.
But the real challenge is that a lot of these things don't apply across the board. For GPU-type equipment today, we don't have a standard set of benchmarks. And it comes back to the value of metrics, whether it's PUE or whatever metric: we often get into this case where we're trying to use all these metrics to compare my thing versus your thing. And really, the value in a lot of these is in how you're using these metrics to show that what you're doing is continually improving. And I think that's where something like performance per watt is necessary. What is the measure of performance for your business? I mean, if you're a bank versus a startup company, you probably have different useful work that is important for your company. So I think that's where it becomes about having to define that for your business, and then using that to kind of trend over time, because I don't think it's consistent for every single business, every single organization across the world. So yeah, a hard question.
Justin Seter:
That one's going to take a few years to solve, I think, so we'll keep working on that. So let's circle back here real quick, and talk about some of the unique design considerations for liquid-cooled applications. We talked about CDUs and TCS loops, what else do design engineers and owners, when they're sort of specifying what they're looking for, need to be thinking about for these sort of super high-density liquid-cooled applications?
David Quirk:
Yeah, so everything's new. Everything is new. We've got new equipment, we have new vendors, we have new design criteria. We've got new fluids in the mix, and we have new processes that we're having to wrap around all of this. You can't even do the commissioning process the same way as we used to before. And we have a lot of new players entering the space, even in the number of manufacturers doing CDUs, for example, and we don't even have a method of test or rating standard out for those things yet. So it's really, really challenging as we're building gigawatts of these and we've got all this newness. And then, layered on top of that, as we published in that recent bulletin, we have things like thermal energy storage now coming into the fold, which the industry traditionally didn't have to deal with at scale, I'll say, and having to consider putting pumps on UPS so we can maintain continuous flow.
And then, in order to model and size and look at the different dynamics of things like that in the system, we're having to do some new modeling, using tools like co-simulation, where we can look at a combination of the hydraulic system analysis along with CFD, or computational fluid dynamics analysis. So you've got this dynamic happening between the air-cooled system and the liquid-cooled systems, both under steady-state and transient conditions, and it's really, really dynamic because they have different time constants, so they react differently when temperatures in the system change. So it's exciting but challenging at the same time, because nobody wants to go slower in the data center industry, they want to go faster. And when you add all this newness into the mix, it really creates both, like I say, an exciting and challenging, and kind of high-risk, environment to do all this work.
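As a toy illustration of that co-simulation idea, the Python below steps a crude hydraulic model (TCS flow simply tracking pump speed) together with two lumped thermal nodes, one for the liquid loop and one for the room air, through an assumed 10-second open-transition event. Every capacity, load, and temperature in it is an assumption for illustration; a real co-simulation couples full hydraulic network and CFD models.

```python
# Toy co-simulation sketch (illustrative assumptions only): a crude hydraulic model
# stepped together with two lumped thermal nodes to show how differently the liquid
# and air sides respond to the same short power-transfer event.

CP_WATER, CP_AIR = 4186.0, 1005.0      # J/(kg*K)
liquid_mass = 500.0                    # kg of coolant in the TCS loop (assumed)
air_mass = 30000.0                     # kg of room air and near-term thermal mass (assumed)
q_liquid, q_air = 800e3, 200e3         # W, assumed 80/20 split of a 1 MW hall
design_flow = 38.0                     # kg/s TCS flow at full pump speed (assumed)
t_facility = 30.0                      # facility water supply temperature, deg C (assumed)

def cooling_available(t):
    """Assumed event: pumps and air handlers ride down from t = 5 s to 15 s."""
    return 0.0 if 5.0 <= t < 15.0 else 1.0

t_liq, t_air, dt = 35.0, 25.0, 0.1
for step in range(int(60 / dt)):
    t = step * dt
    avail = cooling_available(t)
    q_rejected = avail * design_flow * CP_WATER * (t_liq - t_facility)   # toy CDU model
    t_liq += dt * (q_liquid - q_rejected) / (liquid_mass * CP_WATER)
    t_air += dt * q_air * (1.0 - avail) / (air_mass * CP_AIR)            # air node drifts only during the event
    if step % 100 == 0:
        print(f"t={t:4.0f} s   TCS supply ~{t_liq:5.2f} C   room air ~{t_air:5.2f} C")
```

Even this crude version shows the point about time constants: the liquid node climbs several degrees in seconds while the air node barely moves.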
Dustin Demetriou:
And to that end, I mean, a big part of this challenge is just access to the data that you need to do a lot of this physical-planning type of work. What is the pressure drop if I add this device to my loop, and then how do I use that data to do my hydraulic network analysis? So to that end, TC 9.9, just a month or two ago, released a liquid cooling server thermal template. If you're aware, back in, I think it was the second edition of the thermal guidelines, we introduced a thermal template that gave, for an air-cooled server, the information you need from a physical planning perspective: airflow rates, maximum temperatures, power. That has been around for a while, and most manufacturers today are using it as part of the physical planning material that somebody could use when doing that work.
So we introduced a version of that for liquid-cooled servers, where you get things like how much of the heat of this server is being captured by the TCS loop versus being rejected to air. What are the pressure drop and flow rates, both on the air-cooled and the liquid-cooled side? What is the maximum pressure that this system can operate at? Because that's another thing that's really important with these liquid-cooled systems: now you're attaching these pretty fragile cold plates to big liquid cooling loops that can have pumps putting out hundreds of PSI of pressure, so you have to make sure all this stuff works together. So that's, again, something relatively new that the committee has put out, but I think it can go a long way in helping get the right information into the hands of the people that need to do the design for these things. So I'll put another shout-out in for that, and we're really starting to look at how we can push that to the vendors to get all this information out into the industry.
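As a rough sketch of the kind of physical-planning data Dustin is describing, the Python below models a hypothetical template record; the field names and example values are illustrative, not the actual ASHRAE TC 9.9 template layout.

```python
# Hypothetical sketch of the kind of physical-planning data a liquid-cooled server
# thermal template carries, per the description above; field names and example values
# are illustrative, not the actual ASHRAE TC 9.9 template.

from dataclasses import dataclass

@dataclass
class LiquidCooledServerTemplate:
    model: str
    total_heat_kw: float          # total server heat load
    heat_to_tcs_fraction: float   # share of heat captured by the TCS loop vs. room air
    tcs_flow_lpm: float           # required coolant flow
    tcs_pressure_drop_kpa: float  # pressure drop across the server's cold-plate network
    tcs_max_pressure_kpa: float   # maximum pressure the cold plates may see
    airflow_cfm: float            # residual air-side flow requirement
    max_inlet_air_c: float        # air-side inlet limit
    max_inlet_coolant_c: float    # liquid-side inlet limit (ties back to the S classes)

srv = LiquidCooledServerTemplate("example-gpu-node", 10.0, 0.8, 30.0, 50.0, 600.0,
                                 350.0, 35.0, 40.0)
air_kw = srv.total_heat_kw * (1.0 - srv.heat_to_tcs_fraction)
print(f"{srv.model}: {air_kw:.1f} kW must still be handled by room air")
```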
Justin Seter:
Great. So maybe one last question here, I'll just throw it out to the group. And thinking about the applications of all this new design and the new IT hardware, what does the load profile look like that we're having to design for here? And do we even know that when we're doing the design, and does it have an impact?
David Quirk:
It has a huge impact. And everybody's trying to cast a really wide net right now, which becomes really capital-intensive, potentially, if you don't bound a lot of the decisions. But we're seeing people trying to design for a 100% air-cooled environment simultaneously with trying to design for a large percentage of that same space being liquid-cooled at 5X the power density. And so you have an overlay of these piping networks, particularly on the facility water side, that are provisioned to be able to accommodate CDUs, should they come to the facility. So there's a lot of, I'll call it, partial future-proofing happening in the facilities right now, where people are trying to guess and predict when they're going to see these liquid-cooled servers in their sites. So it's introduced a lot of unknowns, a lot of, "Hey, let's try and provision for as much as we can now, without spending too much money," and everybody doing a lot of guessing.
Justin Seter:
Awesome. Well, we're almost at time here. Thank you, gentlemen, for joining and sharing lots of great information today. Maybe you guys could just give a quick plug for how can people stay up to date on what's the latest on this? It's changing so rapidly right now, and people are going to want to know where to go for guidance.
David Quirk:
So obviously, ASHRAE is one of those places. They've been writing the book on data centers since about 2004, and I think there's going to be a lot more coming out from TC 9.9 in particular. But as Dustin mentioned, SPC-127 is also working on it, and SSPC-90.4 is also working on it from an energy efficiency standpoint. I'm sure soon AHRI will be working on a rating standard for things like CDUs, so that's definitely one place. Others are the Open Compute Project, or OCP; they regularly publish white papers and related content to help with this platform interface that has to occur between the hardware and the infrastructure sides of the world. Then there are the industry conferences, and of course, things like this ASHRAE podcast. So I'm sure we'll be doing more of these in the future as well. Anything to add there, Dustin?
Dustin Demetriou:
No, I think just a further plug for the TC 9.9 Datacom Encyclopedia. If you're familiar with the TC, for many, many years we published physical books. There are 14 books in the Datacom series, and there have been multiple versions of many of those. Earlier this year, we transitioned all of that to a new online platform, available for a very low subscription cost, and the whole point of that is to be able to get material out much quicker. So we're doing quarterly updates to that Datacom Encyclopedia. A lot of the new liquid cooling material we talked about is already published in there, and we will be continually updating it quarterly with the latest and greatest on liquid cooling. So just a further plug for the work of the TC in getting that encyclopedia up and running and out there.
Justin Seter:
Great. Thanks, everyone, for joining today, and I look forward to chatting more about this in the future. So we'll sign off here. Thank you.
ASHRAE Journal:
The ASHRAE Journal Podcast team is editor, Drew Champlin; managing editor, Kelly Barraza; producer and assistant editor, Allison Hambrick; assistant editor, Mary Sims; associate editor, Tani Palefski; and technical editor, Rebecca Matyasovski. Copyright ASHRAE. The views expressed in this podcast are those of individuals only, and not of ASHRAE, its sponsors or advertisers. Please refer to ASHRAE.org/podcast for the full disclaimer.