Posts

Showing posts from March, 2019

Neural Network Meets DCN: Traffic-driven Topology Adaptation with Deep Learning

Introduction: In this paper, the authors obtain optimal network topologies in data center networks using a machine learning approach. They use multiple neural network models that take a traffic demand matrix as input and output a network topology. The topology is optimized to achieve the best value of a network metric provided by the data center operator.

Motivation: Data center networks are a critical part of the infrastructure. Congestion at a single switch can bring down the performance of a large part of the system. The authors argue that most data centers use static network topologies and overprovision them to handle different scales of traffic, which increases the investment cost and the resource consumption at runtime. The authors instead propose dynamically reconfigurable networks that can change their topology depending on the traffic demand.

Research questions addressed by the paper: the authors propose models that are capable of mapping traffic demands to near-optimal topologies, as sketched below.
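The summary above describes the input/output contract but not a concrete architecture, so here is a minimal, hypothetical PyTorch sketch of the traffic-matrix-to-topology idea; the layer sizes, the link-scoring head, and the "keep the 2N best links" rule are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

N = 8  # number of top-of-rack switches (illustrative)

# Hypothetical model: flatten the N x N traffic demand matrix and
# score every candidate (src, dst) link; top-scoring links form the topology.
model = nn.Sequential(
    nn.Linear(N * N, 256),
    nn.ReLU(),
    nn.Linear(256, N * N),  # one score per candidate link
)

demand = torch.rand(1, N * N)               # toy traffic demand matrix
scores = model(demand).view(N, N)           # per-link scores
keep = scores.flatten().topk(2 * N).values.min()
topology = scores >= keep                   # boolean adjacency: 2N best links
print(topology.sum().item(), "links selected")
```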

Neural Network Meets DCN: Traffic-driven Topology Adaptation with Deep Learning

Authors: Mowei Wang, Yong Cui, Shihan Xiao, Xin Wang, Dan Yang, Kai Chen, Jun Zhu

Motivation: Adopting new network components (e.g., optical circuit switches or wireless radios) into data center networks (DCNs) has become a common approach to improving DCN performance in recent years. However, finding the optimal (or near-optimal) topology configuration to support dynamic traffic demands is a key challenge. To address this challenge, this paper proposes xWeaver, which can find the best global topology to meet the overall traffic demands in a practical DCN. xWeaver is a traffic-driven deep learning system with three key design features: an expressive learning framework, data-driven feature extraction, and learning of the traffic-to-topology mapping. Experiments demonstrate that xWeaver outperforms other solutions such as Weight-matching and Sample, and that it can update its model parameters for new traffic smoothly, without extensive retraining.

Doomsday: Predicting Which Node Will Fail When on Supercomputers

Authors - Anwesha Das, Frank Mueller, Paul Hargrove, Eric Roman, Scott Baden

Main idea - Uses machine learning to predict when a particular node will fail in an HPC system. The current trend is to use reactive approaches for failure recovery, such as checkpoint/restart, which cause significant overhead.

Important terminology -
1. Lead time = (failure occurrence time - time at which an impending failure is flagged by the algorithm); a worked example follows this list.
2. The earlier the failure is flagged, the higher the lead time, and the more time there is for secondary actions.
3. Failure chain: the exact sequence of phrases which lead up to a node failure.
4. False positives: chains very similar to failure chains but NOT leading to a failure.

Research questions answered via the paper:
1. Predict failures with short lead times and pinpoint the failure location.
2. Analyze lead-time sensitivity for feasible proactive measures.

Proposed solution - What does it try to do?
1. Predict that node A may fail at location B.
2. Analyze the trade-off between lead time and prediction accuracy.
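To make the lead-time definition in point 1 concrete, here is a tiny worked example with made-up timestamps:

```python
from datetime import datetime

# Lead time = failure occurrence time - time the impending failure was flagged.
flagged_at = datetime(2019, 3, 1, 10, 15, 0)  # predictor raises an alarm
failed_at = datetime(2019, 3, 1, 10, 42, 0)   # node actually fails

lead_time = failed_at - flagged_at
print(lead_time)  # 0:27:00 -> 27 minutes for proactive action (e.g., job migration)
```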

Doomsday: Predicting Which Node Will Fail When on Supercomputers

Authors: Anwesha Das, Frank Mueller, Paul Hargrove, Eric Roman, Scott Baden

Motivation: Identifying nodes that will fail, prior to the actual failures, can reduce the overhead induced by failure recovery. The authors focus on node failure detection in Cray systems; Cray machines are widely used supercomputers which produce low-level, Linux-style raw logs.

Time-based phrases (TBP) is the approach the authors propose to extract those messages from the temporal logs of Cray systems that are indicative of node failures. TBP utilizes Topics over Time (TOT), a variant of LDA (which extracts phrases and outputs the latent topics present in a document), to form chains of messages/phrases that lead to failures.

Proactive actions for failure prevention can take place only if there is enough lead time. Hence, the focus of the paper is mostly on achieving better lead times while predicting node failures. With sufficient lead times, proactive measures like live job migration, process cloning, and lazy checkpointing become feasible.
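TOT itself has no ubiquitous off-the-shelf implementation, so below is a plain-LDA stand-in using gensim: it shows only the phrase-to-topic step and omits TOT's per-topic timestamp distribution, which is the part that makes TOT temporal. The toy log phrases are made up.

```python
# Plain LDA as a stand-in for Topics over Time (TOT).
from gensim import corpora
from gensim.models import LdaModel

log_phrases = [
    ["machine", "check", "exception", "detected"],
    ["kernel", "panic", "not", "syncing"],
    ["link", "failed", "on", "node"],
    ["machine", "check", "exception", "on", "node"],
]

dictionary = corpora.Dictionary(log_phrases)
corpus = [dictionary.doc2bow(phrase) for phrase in log_phrases]

# Two latent "topics" over the log vocabulary; failure-indicative phrases
# should cluster together.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
print(lda.print_topics())
```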

NetBouncer: Active Device and Link Failure Localization in Data Center Networks

Authors: Cheng Tan, Ze Jin, Chuanxiong Guo, Tianrong Zhang, Haitao Wu, Karl Deng, Dongming Bi, and Dong Xiang

Motivation: Reliability and availability are crucial in data centers, since loss of service is inexcusable. It is important that any failure in the network is identified and mitigated immediately. However, accurately locating the point of failure is a non-trivial task in a network with millions of servers and network devices. The paper proposes NetBouncer, a failure localization system which can detect both device and link failures. It does so through a combination of domain knowledge and machine learning, while accurately finding the path to the point of failure with minimal overhead.

Pros: It manages to identify gray failures (partial or subtle malfunctions where packets are dropped probabilistically). Traditional monitoring systems have difficulty observing gray failures since they are not easily perceived.
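NetBouncer's actual algorithm is a more robust regularized formulation, but the core inference idea, that a probe path succeeds only when every link on it does, can be sketched as a small least-squares problem over log-probabilities; the incidence matrix and probe success rates below are made up.

```python
import numpy as np

# Each probe path succeeds only if every link on it does, so
#   log(path_success) = sum of log(link_success) over the path's links,
# which is a linear system in the unknown per-link log-probabilities.

# Path-link incidence matrix: 4 probe paths over 3 links (hypothetical).
A = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
], dtype=float)

path_success = np.array([0.98, 0.50, 0.51, 0.49])  # measured probe success rates
b = np.log(path_success)

log_link, *_ = np.linalg.lstsq(A, b, rcond=None)
link_success = np.exp(log_link)
print(link_success.round(3))  # the ~0.5 link stands out as the gray failure
```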

NetBouncer: Active Device and Link Failure Localization in Data Center Networks

Authors: Cheng Tan, Ze Jin, Chuanxiong Guo, Tianrong Zhang, Haitao Wu, Karl Deng, Dongming Bi, and Dong Xiang

Introduction: The availability of large-scale data centers ensures that all the operations within the center's network function smoothly. The biggest challenge for such data centers is the ability to accurately locate and contain device and link failures across millions of servers and thousands of network devices. The authors introduce NetBouncer, a failure localization system which is capable of detecting both device and link failures. They present the framework's algorithm for high-accuracy link failure inference and use domain knowledge to make it robust to real-world data inconsistencies.

Pros: The NetBouncer framework makes use of Clos network architectures, which were introduced by Bell Labs in the 1950s and came back into prominence with the advent of commodity Ethernet switches. Many software companies now make use of multi-tier Clos network-based architectures.
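As a small illustration of why Clos topologies suit active probing, here is a toy leaf-spine (two-tier Clos) sketch with hypothetical names and sizes: every leaf connects to every spine, so a probe between servers under two different leaves has exactly one candidate path per spine switch.

```python
from itertools import permutations

leaves = ["leaf0", "leaf1", "leaf2"]
spines = ["spine0", "spine1"]

def probe_paths(src_leaf, dst_leaf):
    """All leaf-spine-leaf paths an active prober could cover."""
    return [(src_leaf, spine, dst_leaf) for spine in spines]

for src, dst in permutations(leaves, 2):
    print(src, "->", dst, ":", probe_paths(src, dst))
```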

CrystalBall: Statically Analyzing Runtime Behavior via Deep Sequence Learning

Authors: Stephen Zekany, Daniel Rings, Nathan Harada, Michael A. Laurenzano, Lingjia Tang, Jason Mars

Motivation: Execution time can be reduced by optimizing a small portion (path) of a program instead of the complete program. The motivation of the paper is finding these small portions, or "hot paths" as the authors term them, within the substantially large pool of possible paths formed by the "basic code blocks" taken from the "intermediate representation" at compile time.

Positive points:
1) Intermediate representations encode the source code in a lower-level form that is independent of the source language. Leveraging this to make the proposed algorithm language-independent is interesting.
2) Precision, recall, and F-measure depend on a threshold, which is not ideal for this case. AUROC, as explained by the authors, is a better metric, although interpreting it still depends on the operating point we choose; a small comparison follows this list.
3) This is not only a sequence learning problem; framing static path prediction this way lets the authors borrow directly from NLP-style sequence models.
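To make point 2 concrete, here is a tiny comparison with made-up labels and scores, showing that F-measure shifts as the threshold moves while AUROC summarizes ranking quality across all thresholds:

```python
from sklearn.metrics import f1_score, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                       # hot (1) vs. cold (0) paths
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]    # model scores

print("AUROC:", roc_auc_score(y_true, y_score))          # threshold-free
for t in (0.3, 0.5, 0.7):
    y_pred = [int(s >= t) for s in y_score]
    print("F1 @", t, "=", f1_score(y_true, y_pred))      # threshold-dependent
```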

CrystalBall: Statically Analyzing Runtime Behavior via Deep Sequence Learning

Authors: Stephen Zekany, Daniel Rings, Nathan Harada, Michael A. Laurenzano, Lingjia Tang, Jason Mars

Intro: Understanding the runtime behavior of software is critical in many aspects of program development. A conventional approach to this problem is dynamic profiling, meaning we have to run the program multiple times in environments representative enough of the actual working conditions. Beyond that, a dynamic profiling approach has other weaknesses: it cannot profile a subset of functions or paths economically, and it requires re-profiling each time any change is applied to the code. This paper presents a static profiling method, harnessing the ability of an RNN to discover the hidden information of hot paths in the code. Also, since the training process uses the intermediate representation as input, the result has better generalization power.

Methodology: The dataset is in the form of basic-block features; each sequence of basic blocks represents a path.
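The paper's features and network are richer than this, but a minimal PyTorch sketch of the RNN idea, assuming each basic block is reduced to a single integer id, might look like the following:

```python
import torch
import torch.nn as nn

VOCAB, EMBED, HIDDEN = 100, 32, 64  # illustrative sizes

class HotPathRNN(nn.Module):
    """Embed basic-block ids, run an LSTM over the path, classify hot/cold."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.lstm = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, 1)

    def forward(self, block_ids):                 # (batch, seq_len) int64
        x = self.embed(block_ids)
        _, (h, _) = self.lstm(x)                  # h: (1, batch, HIDDEN)
        return torch.sigmoid(self.head(h[-1]))    # P(path is hot)

model = HotPathRNN()
path = torch.randint(0, VOCAB, (1, 12))           # one path of 12 basic blocks
print(model(path))
```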

Mitigating the Compiler Optimization Phase-Ordering problem using Machine Learning

Selecting the best order of compiler optimizations is important for designing efficient modern compilers: the right ordering can substantially improve the running time of dynamically compiled programs. The current state-of-the-art approach to finding this ordering is the genetic algorithm (GA), which suffers from (i) an expensive search and (ii) solutions that are specific to the method being compiled. To mitigate these shortcomings of the so-called phase-ordering problem (finding the best order for optimizations), this paper uses an artificial neural network (ANN) to find the best order. There are two possible approaches to this problem from an ANN point of view: first, predicting the complete sequence of optimizations to be applied to the code; and second, predicting only the current best optimization to apply. This paper adheres to the second approach. A salient feature the paper relies on is the Markovian assumption for the state of the method being optimized: the current state of the code suffices to choose the next optimization. This is in contrast to approaches that must consider the entire history of optimizations already applied.
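A rough sketch of the second approach under the Markovian assumption follows; the pass list, the feature extractor, and the "ANN" below are all stand-ins for illustration, since the point here is only the greedy predict-apply loop over the current state.

```python
import random

PASSES = ["inline", "loop-unroll", "dce", "cse", "STOP"]  # hypothetical pass names

def method_features(method):
    """Stand-in static feature extractor: only the current code matters."""
    return [len(method), method.count("loop")]

def predict_next_pass(features):
    """Stand-in for the trained ANN predicting the current best optimization."""
    return random.choice(PASSES)

def apply_pass(method, p):
    """Stand-in for the compiler transformation."""
    return method + [p]

method, schedule = ["entry", "loop", "exit"], []
while True:
    nxt = predict_next_pass(method_features(method))
    if nxt == "STOP" or len(schedule) >= 10:
        break
    method = apply_pass(method, nxt)
    schedule.append(nxt)
print(schedule)  # the phase ordering discovered greedily, one pass at a time
```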

Mitigating the Compiler Optimization Phase-Ordering problem using Machine Learning

TL;DR: Modern compilers offer a lot of optimizations to choose from, and the order in which these optimizations are applied greatly affects performance. In this paper, the authors model phase ordering as a Markov process to capture context, and use ANNs to predict and apply the best optimization at each instant in the optimization process. The instant at which each optimization is applied matters because every optimization interacts with the code and with the other optimization techniques.

Brief introduction: As mentioned above, modern compilers offer many options. I believe this is a sophisticated modeling technique which may not have been necessary given the current state of DL advancement (especially time series and state-space modeling). The interplay between the optimizations applied and the code being optimized is complex.