NetBouncer: Active Device and Link Failure Localization in Data Center Networks

March 14, 2019

Authors:
Cheng Tan, Ze Jin, Chuanxiong Guo, Tianrong Zhang, Haitao Wu, Karl Deng, Dongming Bi, and Dong Xiang

Motivation:
Reliability and availability are crucial in data centers since loss of service is inexcusable. Any failure in the network needs to be identified and mitigated immediately. However, accurately locating the point of failure is a non-trivial task in a network with millions of servers and network devices. The paper proposes NetBouncer, a failure localization system that can detect both device and link failures. It does so by combining domain knowledge with machine learning, while aiming to pinpoint the failing device or link accurately and with minimal overhead.

Pros:

It manages to identify gray failures (partial or subtle malfunctions where packets are dropped probabilistically). Traditional monitoring systems have difficulty observing gray failures since they are not easily perceived by the end-hosts.
Using IP-in-IP tunneling enables NetBouncer to explicitly specify the path each probe takes, with minimal overhead. It probes a path without involving the CPUs of switches by encapsulating the entire path in nested IP packets. The target network is assumed to be a Clos network, in which there is only a single path from a server to an upper-layer switch, greatly reducing the number of possible duplicate paths. Low overhead is crucial since, in a data center network, even a minute drop in throughput can have adverse effects.
Unlike most probing plans, which try to minimize redundant paths, NetBouncer is not hurt by redundant probing paths. Since redundant paths serve as validation of each other, its accuracy is ultimately improved.
It uses Coordinate Descent (CD) to solve the optimization problem behind link failure inference. CD has certain advantages over SGD, such as the ability to exploit the sparsity of the network data (each link is included in only a few paths compared to the total number of paths in the system) and the ability to handle large volumes of data. The authors also report that CD converges more than an order of magnitude faster than SGD for link failure inference.
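
To make the Coordinate Descent point concrete, here is a minimal sketch of how such an inference could look. It assumes a simplified least-squares objective in which each path's observed success rate is modeled as the product of the success probabilities of its links; the paper's actual objective may differ (for example, it may include a regularization term), and the function name, parameters, and toy topology below are hypothetical.

```python
from math import prod

def infer_link_success(paths, observed, n_links, sweeps=200):
    """Estimate per-link success probabilities by coordinate descent.

    paths    -- list of paths, each a list of link indices (hypothetical format)
    observed -- observed success rate of each path, in [0, 1]
    Simplified objective (an assumption, not taken from the paper):
        minimize  sum_j (observed[j] - product of x[i] over links i in path j)^2
    """
    x = [1.0] * n_links  # start by assuming every link is healthy
    for _ in range(sweeps):
        for i in range(n_links):
            num = den = 0.0
            # Sparsity: only paths that contain link i affect its update.
            for links, y in zip(paths, observed):
                if i not in links:
                    continue
                # Product of the other links on this path, held fixed.
                c = prod(x[k] for k in links if k != i)
                num += c * y
                den += c * c
            if den > 0.0:
                # Closed-form minimizer for x[i], clipped to a valid probability.
                x[i] = min(1.0, max(0.0, num / den))
    return x

# Toy example: three links, three two-link paths; link 1 drops 10% of packets.
paths = [[0, 1], [0, 2], [1, 2]]
observed = [0.9, 1.0, 0.9]
estimate = infer_link_success(paths, observed, n_links=3)
```

Each update only touches the few paths that include the link being updated, which is where the sparsity advantage mentioned above comes from. The third path here is redundant in the sense that it re-covers links already probed, yet it is what lets the solver attribute the loss to link 1 rather than link 0.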

Cons:

NetBouncer assumes that failures are independent. While this may hold for most failures, when failures are correlated the optimization problem could be significantly more complex, since we would have to consider the interaction between different failure incidents.
A number of the techniques and assumptions used (such as the existence of a single path from server to switch, and the sufficient probing theorem) depend on the underlying network being a Clos network. Given these assumptions, the model may not work well with other network topologies.
NetBouncer assumes that the sender and receiver for packet bouncing are the same server. If they were different servers, it would have to consider the failure of the sender or the receiver, which may complicate failure handling. Although this assumption fits well with the Clos network, it may not hold in certain more restrictive networks where the sender and receiver cannot be the same server.
It uses probes that originate from end-hosts rather than switches, since the switches in their data centers are non-programmable. A switch-based probing agent could offer more flexibility, such as probing individual links. Since they claim that NetBouncer should support switch-based approaches as well, a comparison of switch-based and end-host-based probes would be interesting.
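
To illustrate why the independence assumption in the first point matters, here is a small arithmetic sketch. It assumes a model in which a path's success probability is the product of its links' success probabilities, and contrasts it with a hypothetical extreme case where drops on different links are fully correlated (the same packets are dropped on every faulty link). Both function names are made up for this example.

```python
def path_success_independent(link_success):
    # Independence: a packet survives only if every link passes it,
    # and each link decides independently.
    p = 1.0
    for s in link_success:
        p *= s
    return p

def path_success_fully_correlated(link_success):
    # Hypothetical extreme: every faulty link drops the same packets,
    # so the worst link alone determines the path success rate.
    return min(link_success)

links = [0.9, 0.9]  # two links, each passing 90% of packets
independent = path_success_independent(links)
correlated = path_success_fully_correlated(links)
```

With these numbers the independent model predicts about 81% path success, while the fully correlated case yields 90%. An inference procedure built on the product model would therefore misestimate the per-link drop rates if the failures were in fact correlated.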

-Karthik Unnikrishnan

Comments

jon_weissman, March 15, 2019 at 12:25 PM
Good points. As mentioned, there is also the issue of the rate at which dynamism is occurring relative to their prediction overhead.
