Refactoring Router Software to Minimize Disruption
Speaker: Eric Keller
Series: Final Public Orals
Location: Engineering Quadrangle B327
Date/Time: Friday, August 26, 2011, 2:00 p.m. - 4:00 p.m.
Network operators are under tremendous pressure to make their networks highly reliable and to avoid service disruptions. Yet operators often need to change the network to upgrade faulty equipment, deploy new services, and install new routers. Unfortunately, changes cause disruptions, forcing a trade-off between the benefit of a change and the disruption it will cause. This disruption comes from the very design of the routers and routing protocols underlying the Internet's operation. First, since the Internet is composed of many smaller networks, determining a path between two end points requires a distributed calculation involving many of those networks. Therefore, during any network event that triggers a recalculation, there is a period of time when routers in the various networks disagree, potentially leaving some end points with no available path between them. Second, selecting routes involves computations across millions of routers spread over vast distances, multiple routing protocols, and highly customizable routing policies. This leads to very complex software systems, and like any complex software, routing software is prone to implementation errors, or bugs. Given these disruptions, operators must go to great lengths to minimize their effect. This not only demands substantial human effort, it also increases the opportunity for mistakes in the configuration -- a common cause of outages.
We believe that with a refactoring of today's router software we can make the network infrastructure more accommodating of change, and therefore more reliable and easier to manage.
First, we tailor software and data diversity (SDD) to the unique properties of routing protocols, so as to avoid buggy behavior at run time. Our bug-tolerant router executes multiple diverse instances of routing software and uses voting to determine the output to publish to the forwarding table or to advertise to neighbors. We designed and implemented a router hypervisor that makes this parallelism transparent to other routers, handles fault detection and the booting of new router instances, and performs voting in the presence of routing-protocol dynamics, all without modifying the software of the diverse instances.
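As a rough illustration of the voting idea, the sketch below picks, for each prefix, the next hop proposed by a strict majority of instances. The abstract does not specify the voting rule or any interface, so the strict-majority rule, the function names (`vote_next_hop`, `vote_table`), and the representation of each instance's output as a prefix-to-next-hop dictionary are all assumptions made for this example. It also ignores routing-protocol dynamics (instances updating at different times), which the actual hypervisor has to handle.

```python
from collections import Counter
from typing import Dict, List, Optional


def vote_next_hop(proposals: List[Optional[str]]) -> Optional[str]:
    """Return the next hop proposed by a strict majority of instances.

    `proposals` holds one entry per instance; None means that instance
    has no route for the prefix. Returns None when no majority exists.
    (Strict majority is an assumed rule, not taken from the abstract.)
    """
    votes = Counter(p for p in proposals if p is not None)
    if not votes:
        return None
    winner, count = votes.most_common(1)[0]
    return winner if count > len(proposals) / 2 else None


def vote_table(instance_tables: List[Dict[str, str]]) -> Dict[str, str]:
    """Combine per-instance routing tables into one agreed-upon table."""
    prefixes = set().union(*instance_tables)
    agreed: Dict[str, str] = {}
    for prefix in sorted(prefixes):
        choice = vote_next_hop([t.get(prefix) for t in instance_tables])
        if choice is not None:
            agreed[prefix] = choice
    return agreed


# Hypothetical outputs from three diverse routing instances.
tables = [
    {"10.0.0.0/8": "A", "192.168.0.0/16": "B"},
    {"10.0.0.0/8": "A", "192.168.0.0/16": "C"},
    {"10.0.0.0/8": "X"},  # a misbehaving instance
]
result = vote_table(tables)
```

In this example the third instance's divergent choice for 10.0.0.0/8 is outvoted, while 192.168.0.0/16 is withheld because no next hop reaches a majority.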
Second, we argue that breaking the tight coupling between the physical and logical configurations of a network can provide a single, general abstraction that simplifies network management. Specifically, we propose VROOM (Virtual ROuters On the Move), a new network-management primitive where virtual routers can move freely from one physical router to another. We present the design, implementation, and evaluation of novel migration techniques for virtual routers with either hardware or software data planes.
Finally, we introduce the concept of router grafting. This capability allows an operator to rehome a customer with no disruption, compared to today's downtimes, which are measured in minutes. With our architecture, this rehoming can be performed completely transparently to the neighboring network: the customer's router is not modified and is unaware that a migration is happening.
Together, these three modifications enable network operators to perform the desired change on their network without (i) possibly triggering bugs in routers that cause Internet-wide instability, (ii) causing unnecessary network re-convergence events, (iii) having to coordinate with neighboring network operators, or (iv) needing an Internet-wide upgrade to new routing protocols.