NetBouncer: Active Device and Link Failure Localization in Data Center Networks

March 13, 2019

Authors : Cheng Tan, Ze Jin, Chuanxiong Guo, Tianrong Zhang, Haitao Wu, Karl Deng, Dongming Bi, and Dong Xiang

Introduction :

Large-scale data centers must stay highly available for the services running on them to function smoothly. A major challenge at this scale is accurately locating and containing device and link failures across millions of servers and thousands of network devices. The authors introduce NetBouncer, a failure localization system that detects both device and link failures by actively probing the network.

They present an algorithm for high-accuracy link failure inference and use domain knowledge to make it robust to real-world data inconsistencies.

Pros :

The NetBouncer framework relies on Clos network architectures, which were introduced at Bell Labs in the 1950s and regained popularity with the rise of Ethernet switches. Many software companies now build their data centers on multi-tier Clos networks. The regular, layered topology keeps the implementation simple.
The server that initiates path probing acts as both the sender and the receiver of the probe packets. This avoids having to handle failures of a separate sender or receiver in a distributed systems setting, which simplifies failure handling.
The authors prove a sufficient probing theorem, which reduces the number of paths the probing packets need to traverse and thereby saves computation resources, including on the switch CPUs.
The authors use a specialized regularization term and solve the resulting problem with coordinate descent, which exploits the sparse nature of the path probing problem. The regularization term serves as a two-direction penalty and helps resolve false positives that arise when an imperfect measurement involves both a bad link and a good link. Coordinate descent is also useful when computing full gradients is tedious.
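
To make the last point concrete, here is a minimal sketch of coordinate descent on a regularized least-squares objective of this kind. It is not the authors' implementation. It assumes that a path's success probability is the product of its links' success probabilities, and that the regularizer has the form lam * x * (1 - x), which penalizes values away from both 0 and 1. The function name fit_link_probs, the toy topology, and the value of lam are all hypothetical.

```python
def fit_link_probs(paths, y, n_links, lam=0.1, sweeps=100):
    """Estimate per-link success probabilities x[i] in [0, 1].

    paths: list of paths, each a list of link indices
    y:     observed success probability of each path
    Minimizes sum_j (y[j] - prod_{i in path j} x[i])**2
              + lam * sum_i x[i] * (1 - x[i])
    by updating one link at a time (coordinate descent).
    """
    x = [1.0] * n_links  # assumption: start with every link healthy
    for _ in range(sweeps):
        for i in range(n_links):
            sum_cc = 0.0   # sum of c_j^2 over paths through link i
            sum_yc = 0.0   # sum of y_j * c_j over paths through link i
            covered = False
            for links, y_j in zip(paths, y):
                if i not in links:
                    continue
                covered = True
                c = 1.0    # product of the other links on this path
                for k in links:
                    if k != i:
                        c *= x[k]
                sum_cc += c * c
                sum_yc += y_j * c
            if not covered:
                continue   # no probe crosses this link; leave it unchanged
            # With the other links fixed, the objective is a*x^2 + b*x + const.
            a = sum_cc - lam
            b = lam - 2.0 * sum_yc
            candidates = [0.0, 1.0]
            if a > 0:      # convex case: also try the clipped stationary point
                candidates.append(min(1.0, max(0.0, -b / (2.0 * a))))
            x[i] = min(candidates, key=lambda v: a * v * v + b * v)
    return x


# Toy example: link 0 is shared by all paths, link 2 drops half the packets.
paths = [[0], [0, 1], [0, 2]]
observed = [1.0, 1.0, 0.5]
estimate = fit_link_probs(paths, observed, n_links=3)
print(estimate)
```

On this toy input the estimate settles at roughly 1.0 for links 0 and 1 and 0.5 for link 2, so the shared good link is not blamed for the loss on the bad one. Each update only needs the paths that cross the link being updated, which is why the sparsity of the problem helps.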

Cons :

The framework has produced a few false negatives over the years it has been deployed in Microsoft Azure's data centers. Real-world data inconsistencies seem to be the cause.
The framework also assumes that probing packets experience the same failures as real application traffic, which may not hold in all cases.
The objective function is least squares with a specialized regularization term based on domain knowledge, optimized with coordinate descent, which is simple to implement. Other optimization methods might also produce good results.
The authors repeat the sentence "there are zero false positives and a few false negatives so far". Does this indicate that they expect the framework to produce false positives at some point? The paper does not communicate this clearly.
The framework works well with Clos-based architectures but does not generalize to other multi-stage interconnection networks.

- Aditya Bhat

Comments

jon_weissman, March 14, 2019 at 6:49 AM
Your first two cons are valid concerns: is the network so dynamic and noisy that accurate inference depends on long-term stable failure probabilities on links? Not sure what you mean by your next-to-last con.

CSci 8980 Machine Learning in Computer Systems