Wide Area Networks (WANs), the global backbones and workhorses of today’s internet that connect billions of computers across continents and oceans, are the foundation of modern online services. As the Covid-19 pandemic has deepened our reliance on online services, today’s networks are struggling to deliver the high bandwidth and availability demanded by emerging workloads related to machine learning, video calls, and health care.
To connect WANs over hundreds of miles, fiber optic cables that transmit data using light are threaded throughout our neighborhoods. These cables are made of incredibly thin strands of glass or plastic known as optical fibers. While they’re extremely fast, they’re not always reliable: They can easily be broken by thunderstorms and other severe weather, accidents, and even animals. These fiber cuts can cause severe and expensive damage, resulting in 911 service outages, lost connectivity to the internet, and inability to use smartphone apps.
Scientists from the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) and from Facebook recently came up with a way to preserve the network when the fiber is down, and to reduce cost. Their system, called “ARROW,” reconfigures the optical light from a damaged fiber to healthy ones, while using an online algorithm to proactively plan for potential fiber cuts ahead of time, based on real-time internet traffic demands.
ARROW is built at the crossroads of two different approaches: “failure-aware traffic engineering,” a technique that steers traffic to where the bandwidth resources are during fiber cuts, and “wavelength reconfiguration,” which restores failed bandwidth resources by reconfiguring the light.
Though this combination is powerful, the joint problem is computationally difficult to solve: It is NP-hard, meaning no known algorithm can solve it efficiently at the scale of a real network.
The team created a novel algorithm that can essentially create “LotteryTickets” as an abstraction for the “wavelength reconfiguration problem” on optical fibers and only feed essential information into the “traffic engineering problem.” This works alongside their “optical restoration method,” which moves the light from the cut fiber to “surrogate” healthy fibers to restore the network connectivity. The system also takes real-time traffic into account to optimize for maximum network throughput.
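To make the ticket idea concrete, here is a minimal toy sketch in Python. It is not the algorithm from the ARROW paper: the function names (`make_tickets`, `throughput`, `pick_ticket`), the random sampling of candidates, and the figure of 100 Gbps per wavelength are all assumptions chosen for illustration. It only shows the general shape of the approach described above: generate a small set of candidate restoration plans up front, then let the traffic-aware step choose among them using current demand.

```python
import random


def make_tickets(failed_links, spare_wavelengths, num_tickets, seed=0):
    """Sample candidate restoration plans ("tickets").

    Each ticket hands every spare wavelength on the surrogate fibers to one
    of the failed links, chosen at random. Hypothetical helper, not ARROW's.
    """
    rng = random.Random(seed)
    tickets = []
    for _ in range(num_tickets):
        ticket = {link: 0 for link in failed_links}
        for _ in range(spare_wavelengths):
            ticket[rng.choice(failed_links)] += 1
        tickets.append(ticket)
    return tickets


def throughput(ticket, demand_gbps, gbps_per_wavelength):
    """Traffic a ticket can carry: per link, the smaller of demand and restored capacity."""
    return sum(
        min(demand_gbps[link], count * gbps_per_wavelength)
        for link, count in ticket.items()
    )


def pick_ticket(tickets, demand_gbps, gbps_per_wavelength=100):
    """Choose the candidate plan that carries the most of the current demand."""
    return max(
        tickets,
        key=lambda t: throughput(t, demand_gbps, gbps_per_wavelength),
    )


if __name__ == "__main__":
    links = ["A-B", "A-C"]            # links that lost capacity in the cut (made up)
    demand = {"A-B": 300, "A-C": 100}  # current demand in Gbps (made up)
    tickets = make_tickets(links, spare_wavelengths=4, num_tickets=8)
    print(pick_ticket(tickets, demand))
```

The point of the sketch is the division of labor: the optical side is reduced to a short list of feasible plans, so the traffic side never has to reason about individual wavelengths.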
In large-scale simulations and on a testbed, ARROW could carry 2 to 2.4 times more traffic without having to deploy new fibers, while keeping the network highly reliable.
“ARROW can be used to improve service availability, and enhance the resiliency of the internet infrastructure against fiber cuts. It renovates the way we think about the relationship between failures and network management — previously failures were deterministic events, where failure meant failure, and there was no way around it except over-provisioning the network,” says MIT postdoc Zhizhen Zhong, the lead author on a new paper about ARROW. “With ARROW, some failures can be eliminated or partially restored, and this changes the way we think about network management and traffic engineering, opening up opportunities for rethinking traffic engineering systems, risk assessment systems, and emerging applications too.”
The design of today’s network infrastructures, both in data centers and in wide-area networks, still follows the “telephony model,” where network engineers treat the physical layer of networks as a static black box with no reconfigurability.
As a result, the network infrastructure is equipped to carry the worst-case traffic demand under all possible failure scenarios, making it inefficient and costly. Yet, modern networks have elastic applications that could benefit from a dynamically reconfigurable physical layer, to enable high throughput, low latency, and seamless recovery from failures, which ARROW helps enable.
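A back-of-the-envelope calculation shows why worst-case provisioning is costly. The sketch below is a simplified toy model, not taken from the ARROW paper: it assumes two equal parallel paths sharing one demand, a single path failure at a time, and a hypothetical `restorable_fraction` of the failed path’s capacity that can be moved onto healthy fibers. Under those assumptions, the surviving capacity after a failure is `c + restorable_fraction * c`, which must cover the full demand.

```python
def required_capacity_per_path(demand_gbps, restorable_fraction):
    """Capacity each of two parallel paths needs to survive one path failure.

    After a failure, the healthy path offers c and restoration adds
    restorable_fraction * c, so c * (1 + restorable_fraction) >= demand.
    Toy model with assumed parameters; not ARROW's formulation.
    """
    if not 0.0 <= restorable_fraction <= 1.0:
        raise ValueError("restorable_fraction must be between 0 and 1")
    return demand_gbps / (1.0 + restorable_fraction)


def normal_utilization(demand_gbps, restorable_fraction):
    """Share of the installed capacity actually used when nothing has failed."""
    capacity = required_capacity_per_path(demand_gbps, restorable_fraction)
    return demand_gbps / (2.0 * capacity)


if __name__ == "__main__":
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, normal_utilization(1000.0, fraction))
```

With no restoration, each path must be able to carry the entire demand alone, so half of the installed capacity sits idle in normal operation. In this toy model, every bit of capacity that can be restored after a cut shrinks that idle reserve.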
In traditional systems, network engineers decide in advance how much capacity to provide in the physical layer of the network. It might seem impossible to change the topology of a network without physically changing the cables, but since optical waves can be redirected using tiny mirrors, they’re capable of quick changes: no rewiring required. This is a realm where the network is no longer a static entity but a dynamic structure of interconnections that may change depending on the workload.
Imagine a hypothetical subway system where some trains might fail once in a while. The subway control unit wants to plan how to distribute the passengers to alternative routes while considering all possible trains and traffic on them. Using ARROW, then, when a train fails, the control unit just announces to the passengers the best alternative routes to minimize their travel time and avoid congestion.
“My long-term goal is to make large-scale computer networks more efficient, and ultimately develop smart networks that adapt to the data and application,” says MIT Assistant Professor Manya Ghobadi, who supervised the work. “Having a reconfigurable optical topology revolutionizes the way we think of a network, as performing this research requires breaking orthodoxies established for many years in WAN deployments.”
To deploy ARROW in real-world wide-area networks, the team has been collaborating with Facebook and hopes to work with other large-scale service providers. “The research provides the initial insight into the benefits of reconfiguration. The substantial potential in reliability improvement is attractive to network management in production backbone,” says Ying Zhang, a software engineer manager at Facebook who collaborated on this research.
“We are excited that there would be many practical challenges ahead to bring ARROW from research lab ideas to real-world systems that serve billions of people, and possibly reduce the number of service interruptions that we experience today, such as less news reports on how fiber cuts affect internet connectivity,” says Zhong. “We hope that ARROW could make our internet more resilient to failures with less cost.”
Zhong wrote the paper alongside Ghobadi; MIT graduate student Alaa Khaddaj; and Facebook engineers Jonathan Leach, Ying Zhang, and Yiting Xia. They presented the research at ACM’s SIGCOMM conference.
This work was led by MIT in collaboration with Facebook. The technique is being evaluated for deployment at Facebook. Facebook provided resources for performing the research. The MIT-affiliated authors were supported by the Advanced Research Projects Agency–Energy, the Defense Advanced Research Projects Agency, and the U.S. National Science Foundation.