To fight the COVID-19 pandemic, huge swaths of humanity have transformed their daily routines. Offices and schools are closed, city streets are empty, and most people are trying to substitute as many of their normal activities as they can with internet-powered alternatives.
But the cloud platforms behind some of the most popular internet services the quarantined world is now leaning on heavily for work, socializing, and entertainment – Zoom, Dropbox, and Netflix – have so far had no major trouble absorbing the massive surge in usage.
That’s according to infrastructure leads for each of the three companies, who spoke as candidly as they could about the situation in a webinar Wednesday. Conducted over Zoom, the virtual event was organized by Kentik, developer of network monitoring tools which some of the speakers’ companies use.
That their technical infrastructure has been able to handle the surge and, importantly, a shift in traffic patterns doesn’t mean it has been effortless. A ton of work is taking place in the background to ensure things stay this way.
“Last couple weeks it’s been all hands on deck,” Alex Guerrero, senior manager of SaaS operations at Zoom, said.
It’s also important not to be overconfident that “the internet,” the massive collection of independent networks that interlink, can handle what may come in the future, as more cities around the world go on lockdown, more employees get sick, and infrastructure operators start feeling the impact of disrupted supply chains more acutely.
Network intelligence company ThousandEyes has been tracking network outages of ISPs, public cloud providers, unified communications, and edge services globally and has noted an upward trend in the number of weekly outages between the second week of February and last week.
Zoom Scales Up
To date, Guerrero’s team at Zoom has been focusing primarily on scaling up bandwidth in various places on its network. That’s meant peering with more carriers and ISPs, ordering more transit, and increasing bandwidth on existing interconnections, with a particular focus on doing more peering closer to end users.
“That’s mainly what I’m looking at: bandwidth and being as close to the customer as possible,” Guerrero said. “Our product can handle a lot of latency, but still, the closer you are to the eyeballs the better performance you’re going to get across the board.”
Zoom traditionally keeps about 50 percent more capacity on its network than its maximum actual usage, he said, and the team has been busy in recent weeks maintaining that cushion.
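To make the arithmetic behind that cushion concrete, here is a minimal sketch. It assumes the 50 percent figure is measured against peak usage, and the traffic numbers and function names are hypothetical illustrations, not Zoom's actual figures or tooling.

```python
def required_capacity(peak_gbps: float, headroom: float = 0.5) -> float:
    """Capacity needed to stay `headroom` (a fraction) above peak usage."""
    if peak_gbps < 0 or headroom < 0:
        raise ValueError("peak_gbps and headroom must be non-negative")
    return peak_gbps * (1.0 + headroom)


def capacity_shortfall(current_capacity_gbps: float,
                       peak_gbps: float,
                       headroom: float = 0.5) -> float:
    """Extra capacity to add so the cushion is restored (0.0 if none is needed)."""
    needed = required_capacity(peak_gbps, headroom)
    return max(0.0, needed - current_capacity_gbps)


if __name__ == "__main__":
    # Hypothetical numbers: a network sized for a 100 Gbps peak
    # suddenly sees its peak grow to 140 Gbps.
    capacity = required_capacity(100.0)
    print(capacity)
    print(capacity_shortfall(capacity, 140.0))
```

Under these assumptions, a 40 percent jump in peak traffic calls for 40 percent more total capacity to keep the same cushion, which is why a surge translates directly into orders for more peering and transit.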
Being an Equinix customer has helped Zoom increase its network’s bandwidth, Guerrero said. The company has been using Equinix’s Cloud Exchange Fabric, the software-defined network interconnection platform, to a great extent to boost capacity, he said.
Zoom today is in 19 data centers around the world, and each facility is connected to the biggest exchange in the market it’s in, Guerrero explained. Now, however, its network engineers are looking at the second-biggest and in some cases third-biggest exchanges in those markets to bring its network closer to more end users.
As usage goes up, the platform is designed to scale both network and compute automatically, “with very little human intervention,” he said.
Zoom uses a combination of its own data centers and public cloud (by Amazon Web Services) for its compute infrastructure. While it’s had some challenges quickly scaling compute in its own data centers, due to the lockdown-related “supply chain issues” (details of which Guerrero did not disclose), scaling compute in the cloud hasn’t been a problem.
Other than having to scale “a lot faster” than anticipated, “everything is kind of in our standard operating procedure,” he said.
Netflix Is Careful Not to Scramble
While Netflix runs mostly on AWS, its platform is also a hybrid, because it operates its own content delivery network. Like Zoom, it’s had no trouble scaling cloud capacity, but it did hit a snag last week when trying to get more servers into the ISP locations to increase the capacity of its CDN.
“We have had multiple fires at this point with our supply chain,” Dave Temkin, VP of network and systems infrastructure at Netflix, said during the webinar.
Netflix’s primary server manufacturer (whom Temkin did not name) is in Santa Clara, California, and earlier this month, when six Bay Area counties including Santa Clara issued a shelter-in-place order, Temkin’s team had 24 hours “to get as many boxes out of there as we could.”
Those issues have since been resolved by switching to a different manufacturing location, he said.
Otherwise, the part of Netflix’s infrastructure that delivers content to users has been scaling up as designed. Temkin’s team has effectively “pulled forward” its growth plans for the coming holiday season, he said. “We don’t feel like we’re stressing our cloud infrastructure by the current events.”
Things are different for the part of the company’s infrastructure that’s used to make content. “Right now (it’s not unique to us) most content production is shut down around the globe,” he said.
Besides the problem social distancing presents for shooting movie scenes, other big parts of the production process, such as post-processing, visual effects, and animation, are things you can’t simply do at home, because they require a lot of network and compute power. So Temkin and his colleagues have been busy searching for technological solutions to make at least some of those things possible for creators to do remotely.
“The internet itself seems to be scaling pretty well,” Temkin said. While there has been some strain – in some cases on interconnects, in others on last-mile networks – “generally, nothing is absolutely melting down.”
Netflix has done some things to try to ease the strain, he said. It was unclear whether he was referring to scaling of bandwidth and compute on Netflix’s network or the company’s decision last week to reduce its video bit rates in Europe to ease network congestion.
Overall, Temkin’s philosophy has been to avoid scrambling for resources to ensure other, more essential services can get them if they need them – services like healthcare, e-learning, and video conferencing. Much like US officials have been pleading with the public to avoid hoarding surgical masks because hospital workers badly need them, Netflix doesn’t want to hoard servers and hog network capacity because it’s more important for a doctor to be able to see her patient remotely than for you to be able to re-watch Breaking Bad in glorious 4K.
Dropbox Is Seeking Peers
Unlike Zoom and Netflix, Dropbox runs mostly in its own data centers. The company moved its platform from AWS to its own computing facilities in 2015.
However, the company continues to rely on AWS for unanticipated bursts in capacity and for some technological capabilities it wouldn’t make sense for Dropbox to build in-house.
The value of a hybrid cloud platform is “you can always utilize public cloud capabilities and public cloud scale,” Dzmitry Markovich, senior director of engineering at Dropbox, said.
Like Zoom and Netflix, the cloud storage and collaboration company’s platform has successfully relied on automation to scale along with the recent surge in demand. But there have been some operational challenges with “restricted access” by some vendors around the world, Markovich said.
He didn’t specify what those challenges were, but many data center providers have reduced customer and vendor foot traffic in their facilities to prevent transmission of the coronavirus by allowing access only when it’s absolutely necessary and by rigorously screening visitors.
Another challenge for Dropbox has been the shift of internet traffic from being highly concentrated in big hubs to a more distributed pattern, he said. Instead of having a lot of traffic coming from a thousand accounts in a university, for example, Dropbox is now seeing all those accounts access its platform from many different places, through many different networks.
To address this, Markovich’s team has been analyzing its last-mile connectivity strategy and actively looking for more last-mile ISPs to peer with. Dropbox already peers “heavily,” but it’s now investing in even more peering relationships.
Accelerated Scaling Plans
Markovich, Temkin, and Guerrero all declined to specify by how much the shift to working from home has increased traffic on their networks. Cloudflare, provider of CDN and other internet infrastructure services, said this week that it’s seen roughly a 10 percent decrease in traffic in office areas, a 20 percent increase in residential areas, and a 5 percent decrease for campuses between February 19 and March 18.
Bill Long, senior VP of core product management at Equinix, who also participated in the webinar, said his company was seeing increases in traffic on its infrastructure ranging from 10 to 40 percent starting in December.
Equinix is the world’s largest operator of data centers of the kind where much of the interconnection that enables the internet takes place. “The good thing is all that core infrastructure is actually scaling pretty well,” Long said.
Luckily, the pandemic started as many companies, including Equinix, had been upgrading their networks from 10 Gigabit links to 100 Gigabit links, and that ten-fold increase in capacity has been partly responsible for things running as smoothly as they have, he explained.
Because they can quickly provision multiple 100G links and scale their bandwidth in an automated fashion, many companies are in a good position to absorb massive increases in traffic. The technical capabilities to scale network capacity were there, and many Equinix customers had been planning to scale anyway – just not as quickly, Long said.
“People are scaling much faster than they intended,” he said. What was expected to happen over one or two years “is now happening in months.”