Microsoft to Open-Source Its Secret Weapon Against Cloud Network Outages

  • Microsoft researchers said they’re planning to open-source Open Network Emulator, the system that simulates the entire network that powers the company’s hyperscale cloud platform
  • The company has been using it for about a year to test changes made to the network before they’re deployed in production
  • The researchers said Microsoft’s network engineers caught hundreds of bugs in proposed changes, potentially preventing major outages

Computer networks are complex things, which makes them fragile. And the bigger a network is, the more damage a single mistake can do.

Microsoft has spent more than a year using a system that emulates the entire global network powering its Azure cloud to catch the disastrous errors engineers inevitably make. Now the company is planning to open-source the code behind the emulator.

“We have decided that this is such an important resource for everybody that just hoarding it [ourselves] is not the right thing to do,” Victor Bahl, distinguished scientist and director of mobility and networking at Microsoft Research, said in a live interview at the company’s Research Faculty Summit earlier this month. “So, we are making it available to the entire community.”

Called Open Network Emulator, or ONE, the system simulates in software all the hardware and software devices that make up a network and the ways they’re interconnected. It runs in Docker containers and VMs, and its purpose is to test changes network engineers make before they’re deployed on the live network, whose uptime is critical to so many people and businesses.

Giving the public access to the technology would help large enterprises improve their network uptime, Bahl explained. Students and researchers would get a tool for simulating hyperscale networks like those Microsoft, Google, and Amazon have built, letting them innovate without access to the actual networks.

It would also give networking product vendors a way to test new control-plane software at scale, according to Microsoft.

The company hasn’t said when it plans to open-source ONE. Searches on the open-source software repository GitHub – which is in the process of being acquired by Microsoft – returned no results, and a Microsoft spokesperson did not respond to a request for clarification in time for publication. (We’ll update this article once they get back to us.)

The company first revealed the system last year, about six months after it went into use internally. At the time, it was called CrystalNet, as in a crystal ball that shows the network’s future.

Microsoft researchers hinted then that they were thinking of releasing the technology to the public. They confirmed the plans to open-source it as ONE at the Sigcomm Conference this June.

“Our network is large, heterogeneous, complex and undergoes constant churns. In such an environment, even small issues triggered by device failures, buggy device software, configuration errors, unproven management tools and unavoidable human errors can quickly cause large outages,” Microsoft researchers explained in a description of ONE submitted for Sigcomm. “Therefore, the ability to validate the impact of every planned change in a realistic setting, before the change is deployed in production, is crucial to maintaining and improving the reliability of our network.”

Azure network engineers have used ONE daily for more than a year now, according to the Sigcomm paper. They’ve “spent millions of core-hours on ONE emulations – and caught hundreds of bugs in proposed changes, potentially preventing major outages.”

As businesses increasingly depend on cloud services like Microsoft’s, ensuring those services don’t go down is more important than ever. But no matter how well the systems are designed and how smart and vigilant the engineers running them are, humans do occasionally make mistakes. A micro-error made during a change in a hyperscale network can lead to a mega-outage.

“So, let’s say everything is working perfectly well. Barring hardware failure, everything should be fine,” Bahl said, explaining ONE at the Faculty Summit. (Microsoft Research released a recording of the interview as a podcast this week.) “But then, somebody, who is part of your team, goes and changes something somewhere – and I have horror stories about that that I can tell you, but I will not – but goes and changes something at some point, and that thing – as has happened in the past – can bring down an entire [cloud availability] region, because, you know, if you break the network, your packets are going nowhere.”

At hyperscale, an outage like that can affect millions of people, he said, “and I don’t want to be the source of that, right?”

Today, when Azure network engineers make changes, those changes are applied to the simulation first, but that first step is seamless as far as they’re concerned. “They actually don’t even know if they’re making the change to the network,” Bahl said. “They actually are changing the emulator. Because it mimics the network underneath so amazingly that you can’t tell the difference.”

If the change doesn’t cause any errors in the simulation, it gets propagated down to the production network automatically, he said.
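The workflow Bahl describes, where a change hits the emulation first and reaches production only if nothing breaks, can be sketched in a few lines of Python. This is a toy illustration, not ONE's code or API (which had not been published at the time of writing): the `Network` class, the `validate_then_deploy` function, and the `no_empty_configs` check are all hypothetical names, and a real emulator runs actual device software rather than comparing strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical model: device name -> configuration text.
Config = Dict[str, str]


@dataclass
class Network:
    """Toy stand-in for a network: just a device-to-config mapping."""
    configs: Config = field(default_factory=dict)

    def apply(self, change: Config) -> None:
        self.configs.update(change)


# A check inspects a network and returns a list of error messages.
Check = Callable[[Network], List[str]]


def validate_then_deploy(production: Network, change: Config,
                         checks: List[Check]) -> List[str]:
    """Apply `change` to an emulated copy; deploy only if no check fails."""
    # The "emulator" here is simply a copy of the production state.
    emulated = Network(dict(production.configs))
    emulated.apply(change)

    errors = [msg for check in checks for msg in check(emulated)]
    if not errors:
        # Clean run in emulation: propagate the change to production.
        production.apply(change)
    return errors


def no_empty_configs(net: Network) -> List[str]:
    """Hypothetical check: flag any device left with an empty config."""
    return [f"{dev}: empty config"
            for dev, cfg in net.configs.items() if not cfg.strip()]
```

Calling `validate_then_deploy(prod, {"tor-1": ""}, [no_empty_configs])` on a network whose `tor-1` device has a non-empty config returns one error and leaves production untouched; a change that passes every check is applied to production and the function returns an empty list.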
