Machine learning applications are just starting to be used in data centers. So far, these have mostly been vendor-specific and built into existing platforms, leaving data centers with a hodgepodge of solutions centered around vendor and device type. What's been lacking is an AI solution that addresses the entire data center.
About a year and a half ago, a San Jose, California-based startup called LitBit released an AI technology that uses machine learning to take a predictive approach to all aspects of a data center's infrastructure -- keeping an eye on everything from servers to cooling units to generators -- to warn of possible upcoming failures. The platform has performed well in relatively small use cases, but now it's about to get tested at scale, as Montreal-based ROOT Data Center prepares to put LitBit's technology to work.
It works by doing the same things data center technicians do when they walk a data center's aisles: it listens and watches for the out-of-the-ordinary. It's primarily a SaaS offering, although the company does make available an on-premises hardware/software component customers can use to keep the system running in the event that LitBit's servers become unreachable over the public internet.
The company's CEO and founder, Scott Noteboom, is a veteran of the data center, having served two years as Apple's head of infrastructure strategy, design and development, and nearly seven years as Yahoo's VP of data center engineering and operations before that. In a blog post earlier this year he named LitBit's data center solution Dac, calling it "the first AI-powered data center operator."
In a recent interview with Data Center Knowledge, he said most organizations get started using LitBit's AI simply by placing "pulse point stickers" -- basically QR codes that identify the machine -- on devices they wish to monitor. After that, operators place a mobile phone on the pulse point and use the phone's microphone and accelerometer to take the device's pulse by sensing its vibration and analyzing the sound it produces.
"The first thing you want to teach the AI is that this is, say, a Liebert 610 series UPS," Noteboom said. "It's in normal mode, and it's under a 10 percent load. You can take a pulse point reading, which takes about two minutes, and your mobile phone will actually capture the operational fingerprint of that machine. It will teach the AI [what normal is] for this particular machine with these load dynamics."
The first time the AI "hears" something like a failing motor bearing, it's not going to know exactly what it's hearing -- only that it's abnormal -- so the first time it warns of a particular failure it will only send a warning that something is amiss. But after the technicians at the facility identify the problem, they can input their diagnosis into the AI, so the next time it runs across the same abnormality it can be more specific and warn of a possible motor bearing failure in the affected device. The phone's camera can also be used to identify normal and damaged components of devices, such as rectifier controller cards or bus arrays.
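The learn-normal, flag-abnormal, then label-the-cause loop described above can be illustrated with a minimal Python sketch. It is not LitBit's implementation, which the article does not describe. It assumes each pulse point reading has already been reduced to a short list of numbers (for example, energy in a few vibration or sound frequency bands), and the class name `PulseMonitor`, its methods, and the thresholds are all hypothetical.

```python
import math
from statistics import mean, pstdev


class PulseMonitor:
    """Hypothetical sketch: learn a device's normal fingerprint, flag deviations,
    and reuse technician diagnoses for abnormalities seen before."""

    def __init__(self, normal_readings, threshold=3.0):
        # Each reading is a list of features, e.g. energy per frequency band (assumed).
        columns = list(zip(*normal_readings))
        self.mean = [mean(c) for c in columns]
        # Floor the spread so a perfectly constant feature cannot divide by zero.
        self.std = [max(pstdev(c), 1e-6) for c in columns]
        self.threshold = threshold
        self.labeled = []  # (deviation vector, diagnosis) pairs supplied by technicians

    def _deviation(self, reading):
        # How many standard deviations each feature sits from the learned normal.
        return [(x - m) / s for x, m, s in zip(reading, self.mean, self.std)]

    def is_abnormal(self, reading):
        return max(abs(z) for z in self._deviation(reading)) > self.threshold

    def teach(self, reading, diagnosis):
        # A technician has identified what this abnormal reading meant.
        self.labeled.append((self._deviation(reading), diagnosis))

    def diagnose(self, reading, match_radius=5.0):
        if not self.is_abnormal(reading):
            return "normal"
        z = self._deviation(reading)
        # Look for the closest previously diagnosed abnormality.
        best = min(self.labeled, key=lambda item: math.dist(z, item[0]), default=None)
        if best is not None and math.dist(z, best[0]) <= match_radius:
            return f"possible {best[1]}"
        return "abnormal: cause unknown"
```

In this sketch, the first abnormal reading only produces a generic warning. Once `teach` has been called with the technicians' diagnosis, a later reading that deviates in a similar way returns the more specific message, while an unfamiliar deviation still falls back to the generic one.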
Noteboom pointed out that with this small amount of input, using no more than a handheld smartphone as a sensor, a data center can realize considerable manpower savings.
"If you have a security guard doing rounds on a device, or you hear something that sounds a little different, then you can take your mobile phone out of your pocket, put it on the pulse point and then basically ask the AI, 'What do you think this is?' It can come back and say, 'This is a Liebert 610 series UPS and it sounds like and it feels like it may have an unusually high-frequency noise,' or it could say that, 'It sounds like you have a grinding fan inside the device.' It depends on what it's been taught and what its experience is."
Users can delve deeper and supply input from sensors installed in the data center, such as temperature and humidity readings, and incorporate those readings into the process. Those sensors can include a facility's security cameras as well, and the system can be trained to recognize not only the presence of people in an area that should be unoccupied, but a human lying on the floor, indicating an accident has occurred and that an employee has been injured. Microphones can be permanently mounted on devices, allowing constant monitoring of machinery.
The latter approach is being taken by LitBit's new customer, ROOT, which is testing the system at MTL-R2, a 175,000-square-foot facility whose current power capacity, according to its website, is 20MW but whose potential capacity at full build-out is 50MW. Although the company eventually plans to monitor most if not all of the data center's equipment using LitBit's technology, its initial test involves the center's 14 emergency diesel generators.
"In any power outage with all 14 generators running it would be difficult for technicians to be standing beside every generator and listen to make sure there's no noises or sounds or anything to indicate there's a problem in those generators," ROOT's CEO AJ Byers told Data Center Knowledge. "Instead of hiring 14 people, we can have these microphones using artificial intelligence inside of every generator, alerting us if they hear an odd sound.
"That odd sound could be caused by a bearing rubbing the wrong way, a fuel level that is too low, or something knocking in the generator. It's an engineering persona that is like having engineers standing beside each piece of equipment on the property."
Because the generators sit outside, the AI's learning process was a bit more thorough than might be necessary for equipment located within the data center's walls.
"They're inside of a heated compartment," Byers said, "but the sound of rain would make a difference, and the sound of wind would make a difference. So we've gone through kind of an extensive learning pattern of 'This is what they sound like in all normal weather conditions.'"
Byers said that the system's training has been completed and that it is now ready to go into full-operation mode.
As ROOT begins the process of expanding LitBit's machine learning to include the rest of the data center, other large data center operators will doubtless be paying attention, because downtime is expensive. According to IDC, the average cost of an infrastructure failure is $100,000 per hour, a number that rises to between $500,000 and $1 million per hour in a critical application failure.
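To put those hourly figures in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes cost scales linearly with outage duration, which the IDC figures do not claim; the function names and the 90-minute example are illustrative only.

```python
# Hourly figures are the IDC estimates quoted above, in US dollars per hour.
INFRASTRUCTURE_FAILURE_PER_HOUR = 100_000
CRITICAL_APP_FAILURE_PER_HOUR = (500_000, 1_000_000)  # (low, high) estimate


def outage_cost(hours, hourly_rate=INFRASTRUCTURE_FAILURE_PER_HOUR):
    # Assumption: cost grows linearly with the length of the outage.
    return hours * hourly_rate


def critical_outage_cost_range(hours):
    # Returns the (low, high) cost range for a critical application failure.
    low, high = CRITICAL_APP_FAILURE_PER_HOUR
    return hours * low, hours * high


if __name__ == "__main__":
    print(outage_cost(1.5))                 # 90-minute infrastructure failure
    print(critical_outage_cost_range(1.5))  # 90-minute critical application failure
```

Under that linear assumption, a 90-minute infrastructure failure would cost about $150,000, and a critical application failure of the same length between $750,000 and $1.5 million.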
In a white paper released December 5, ROOT pointed out that the actual cost could be even higher. "For data centers themselves, a significant incident can undermine the perception of the center as a key partner. Instead, it can mean being perceived as a barrier to productivity that undermines a company's ability to compete effectively. It is dangerous as well for the data center's own corporate health because significant resources are allocated to reacting instead of building value initiatives into the business."