As you probably know by now, the internet has held up well during the coronavirus pandemic. A few snags aside, this has been one of the very few positive stories to emerge from the crisis.
Outside some expected growing pains, most likely caused by the system tweaks network operators have been making to accommodate new traffic patterns, the internet as a whole has so far hummed through the crisis without major meltdowns.
But the COVID-19 crisis has highlighted the urgency of providing broadband access to communities that don’t have it. We can celebrate schools’ pivot to e-learning, but there are vast numbers of students, particularly in rural and tribal areas, who simply don’t have that option. Members of their households can’t work or get seen by a doctor remotely.
And, while the internet has shown remarkable resilience so far, a recent incident should serve as a reminder that this “best-effort” network of networks is by its nature fragile.
Network Outages Spike but Aren’t Felt by Many
ThousandEyes, a San Francisco-based software company that builds tools for monitoring network health, said the weekly number of network outages globally reached record numbers in February and March, as more and more governments issued stay-at-home orders.
The company doesn’t monitor performance of last-mile ISPs, the networks that carry internet traffic to the end users. It monitors Tier 1 and Tier 2 carriers. (If the internet were a tree, Tier 1 carriers would be the trunk, and Tier 2 the thick branches attached to it.) It also monitors the networks of public cloud platforms like Microsoft’s and Google’s, and Unified Communications-as-a-Service (UCaaS) providers, such as Zoom and Twilio.
The number of weekly network outages had been trending up between the middle of February and the end of March, before going slightly down in the first week of April, the company said.
It’s important to keep in mind that there isn’t some publicly available global database of all outages of all networks; neither is there such a thing just for the US. The visibility of tools like the ones built by ThousandEyes is limited to the group of networks they monitor. There also isn’t a standard definition of a network “outage,” so it can vary from company to company.
Archana Kesavan, director of product marketing at ThousandEyes, speculated that the most likely explanation for the increase in outages was that the companies operating those networks were making configuration changes and capacity upgrades to accommodate the shift in traffic patterns, as offices and schools closed, and people switched to working, learning, socializing, and being entertained from home.
The infrastructure team at Zoom, for example, scaled up bandwidth in various places on its network to ensure it can handle the surge in usage, Alex Guerrero, senior manager of SaaS operations at the company, said during a webinar in March. The Zoom team’s focus was on peering with more ISPs, buying more transit, and boosting bandwidth on existing interconnections.
Around the same time, infrastructure engineers at Netflix were busy boosting capacity on the company’s content delivery network, Dave Temkin, the company’s VP of network and systems infrastructure, said during the same webinar.
Network providers “are trying to improve their networks, be that through some sort of traffic engineering, some sort of maybe even upgrades to their network,” which most likely explained the spike in outages ThousandEyes had observed, Kesavan told DCK.
ThousandEyes defines an outage as an event where 100 percent of packets are being dropped on a network path within a single “autonomous system,” impacting infrastructure and a “certain number of sensors and services.” Because the company tracks performance deep in the network infrastructure, the outages it registers don’t necessarily propagate to end users.
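To make that definition concrete, here is a minimal Python sketch of how such a rule might be expressed. It is not ThousandEyes’ implementation: the `PathSample` type, the `is_outage` function, and the `min_affected` threshold are all hypothetical, and the threshold in particular stands in for the unspecified “certain number of sensors and services.”

```python
from dataclasses import dataclass


@dataclass
class PathSample:
    """One sensor's measurement of a network path (hypothetical structure)."""
    asn: int        # autonomous system the measured path segment belongs to
    sent: int       # packets sent by the sensor
    received: int   # packets that arrived


def loss_percent(sample: PathSample) -> float:
    """Packet loss for one sample, as a percentage."""
    if sample.sent == 0:
        return 0.0
    return 100.0 * (sample.sent - sample.received) / sample.sent


def is_outage(samples: list[PathSample], min_affected: int = 2) -> bool:
    """Flag an outage when enough sensors see 100 percent loss in one AS.

    min_affected is an assumed value; the article does not give the
    real threshold.
    """
    if not samples:
        return False
    # The definition applies to a path within a single autonomous system.
    if len({s.asn for s in samples}) != 1:
        return False
    fully_dropped = [s for s in samples if s.sent > 0 and s.received == 0]
    return len(fully_dropped) >= min_affected
```

Under this sketch, two sensors that each lose every packet inside the same autonomous system would count as an outage, while heavy but partial loss would not. That is consistent with the point above: a registered outage describes a path deep in the network, not necessarily something an end user notices.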
The increase in network outages wasn’t surprising, said Avi Freedman, co-founder and CEO of Kentik, also a San Francisco-based software company that builds network monitoring tools. Freedman is an internet infrastructure veteran. He spent the first decade of this century building out Akamai Technologies’ global content delivery network.
“It’s natural to expect micro outages in rough proportion to traffic growth,” he said. “Given the growth [in network traffic] that we’re seeing, the number of outages is not unexpected.”
The outages have been relatively brief and haven’t meaningfully affected companies’ ability to serve their users, Freedman said.
Even at its worst points in recent weeks, the internet has been more reliable than it was a couple of decades ago, he said. “In the 90s, things were much worse.”
The Digital Divide Could Deepen the Economic One
The bigger concern about the internet during this crisis is the lack of broadband access among many in the US, Freedman said. There are households that don’t have enough bandwidth for e-learning, Zoom, or gaming.
While 93.5 percent of the total US population had access to fixed terrestrial broadband with download speeds of 25 Mbps or higher as of the end of 2017 (the most recent FCC data available), more than a quarter of Americans in rural areas and close to a third of Americans in tribal areas didn’t.
One of the outcomes of the crisis, Freedman suggested, could be a deepening of the country’s already deep economic divide – simply as a result of unequal access to broadband.
The $2 trillion US economic stimulus package enacted in late March included $100 million for rural broadband and $200 million for telehealth for hospitals and other health care providers. The telecom industry and broadband-access advocates say a lot more funding is needed and have been lobbying lawmakers to include it in the upcoming stimulus bills.
Much of the concern around internet performance has been bandwidth on some last-mile ISPs’ networks, and it remains a concern going forward, Freedman said.
That’s the reason big content providers like Netflix and YouTube have switched to lower default video bit rates, John Graham-Cumming, CTO of Cloudflare, told DCK. They want to help prevent overwhelming ISPs whose networks aren’t architected for all their users to log on at the same time.
Among other things, Cloudflare operates a large content delivery network, so it has visibility into a sizable slice of the world’s ISPs.
“Life has gone online,” Graham-Cumming said. As a result, some smaller local internet providers, particularly ones that aggregate connections from many homes in a single link to the internet upstream, are having trouble managing congestion on their networks.
The Internet’s Fragile Design
Individual networks and how they interconnect are just a part of the puzzle. There are many other elements that make moving data across the big, messy network of networks possible, while adding to its overall fragility.
One incident early this month served as a reminder of that fragility.
On April 1, as if making a bad April Fools’ joke, the Russian ISP Rostelecom mistakenly “advertised” a wrong network route, effectively inserting its own network into the path of traffic meant for Cloudflare, according to ThousandEyes. It announced the route using BGP (Border Gateway Protocol), which is the language autonomous systems on the internet use to tell each other how they can be reached.
Level 3, the CenturyLink-owned Tier 1 transit network, picked up the bad route and propagated it to its peers, which ended up causing issues not only for Cloudflare but also for others, including Amazon Web Services, according to ThousandEyes.
Because of the way BGP works, there “was really no way to prevent this,” ThousandEyes’ Kesavan said. “Level 3 basically took the route that was coming in from Rostelecom and said, ‘You know what, to reach Cloudflare I’m getting instructions to send traffic to Rostelecom.’”
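The trust problem Kesavan describes can be illustrated with a toy Python model. This is a deliberately simplified sketch, not real BGP: the `TrustingRouter` class is hypothetical, the prefixes and AS numbers are reserved documentation values rather than the ones involved in the incident, and it models only one common way a bad advertisement can win, namely that routers prefer the most specific matching prefix. The article does not say which prefixes or which selection rule were at play on April 1.

```python
import ipaddress


class TrustingRouter:
    """Toy router that accepts every advertisement without checking it."""

    def __init__(self):
        self.routes = []  # list of (network, advertising AS number)

    def receive(self, prefix: str, asn: int) -> None:
        # No verification that this AS is entitled to announce the prefix.
        self.routes.append((ipaddress.ip_network(prefix), asn))

    def next_hop(self, destination: str):
        """Return the AS to forward to, using longest-prefix match."""
        dest = ipaddress.ip_address(destination)
        matches = [(net, asn) for net, asn in self.routes if dest in net]
        if not matches:
            return None
        # The most specific (longest) matching prefix wins.
        return max(matches, key=lambda m: m[0].prefixlen)[1]


router = TrustingRouter()
router.receive("203.0.113.0/24", 64500)  # legitimate announcement
router.receive("203.0.113.0/25", 64501)  # mistaken, more specific announcement

print(router.next_hop("203.0.113.10"))   # covered by the /25
print(router.next_hop("203.0.113.200"))  # covered only by the /24
```

In this model, traffic for addresses covered by the mistaken, more specific announcement is sent toward the wrong network, while the rest still follows the legitimate route. Nothing in the router asks whether the second announcer should be making that claim at all.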
Even though this type of incident is often referred to as “BGP hijacking,” the Rostelecom issue most likely resulted from an error and “doesn’t appear to be malicious,” Angelique Medina, director of product marketing at ThousandEyes, said.
A very similar incident involving Rostelecom happened about three years ago, Kesavan said, and, as it did this month, the ISP quickly corrected course.
Such incidents underscore the fact that “this whole fabric of the internet… by default, does not have a security wrap around it,” she said. “It didn’t seem necessary when it was built decades ago.”
BGP is a perfect example. “It’s a game of telephones,” she said. “It’s kind of built on trust. You just assume there’s not going to be a bad actor in place claiming to be somebody else.”