Doomsday: Predicting Which Node Will Fail When on Supercomputers

Authors - Anwesha Das, Frank Mueller, Paul Hargrove, Eric Roman, Scott Baden

Main idea -
Uses machine learning to predict which node in an HPC system will fail, and when.
The current trend is to recover from failures reactively, using checkpoint/restart, which causes significant overhead.

Important terminologies - 
1. Lead time = (Failure occurrence time – Time at which an impending failure is flagged by the algorithm)
2. The earlier a failure is flagged -> the higher the lead time -> the more time for secondary actions
3. Failure chain: The exact sequence of phrases that leads up to a node failure
4. False positives = Sequences very similar to failure chains that do NOT lead to a failure
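
As a small illustration of the lead-time definition above, the sketch below computes it from two timestamps. The function name and the example timestamps are made up for illustration and are not taken from the paper.

```python
from datetime import datetime

def lead_time_seconds(flagged_at: datetime, failed_at: datetime) -> float:
    """Lead time = failure occurrence time - time the failure was flagged."""
    return (failed_at - flagged_at).total_seconds()

# Hypothetical timestamps, not taken from the paper's logs.
flagged = datetime(2018, 1, 1, 12, 0, 0)
failed = datetime(2018, 1, 1, 12, 2, 30)
print(lead_time_seconds(flagged, failed))  # 150.0
```

A flag raised earlier (a smaller `flagged_at`) gives a larger lead time, which is the second point in the list above.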

Research questions answered via the paper:
1. Predict node failures with short lead times and pinpoint the failure location
2. Analyze lead time sensitivity for feasible proactive measures

Proposed Solution - 
What does it try to do?
1. Node A may fail at location B
2. Analyzes the trade-off between lead time and false positive rate

What does it not try to solve?
1. Root cause diagnosis is not attempted.
2. No failure characterization.

Difficulties in the domain -
1. Which phrases does one consider while making the priority list?
2. How do you decide between minimizing false positive rate and maximizing lead times?

Challenge - Predict node failures with low false positives but adequate lead time

Method proposed - TBP - Time Based Phrase
Steps:
1. Time-based data correlation and integration - Correlation between time-based phrases of failed nodes and scheduled jobs
2. Topic modeling based training - Extract relevant phrases based on the top topics as given by the ToT algorithm
3. Formulation of node failure chains - Look at the original data, examine the sequence of messages/phrases, and then look at the terminal messages
4. Inference on new test data different from training data - Examine sequences with ≥ 50% similarity to the learned failure chains and base the test results on them

Explanations for the steps:
Data Correlation -
Job Logs <=> Node logs, Failed Nodes(Messages Files) <=> Timestamped Console Logs
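
A minimal sketch of what this time-based correlation could look like, assuming job records carry a node name and a start/end time and node log lines carry a node name, timestamp, and phrase. The record types, field names, and sample values are hypothetical; the paper's actual log schema is not described in this review.

```python
from typing import List, NamedTuple, Optional, Tuple

class Job(NamedTuple):
    job_id: str
    node: str
    start: float  # seconds since some epoch (assumed unit)
    end: float

class LogLine(NamedTuple):
    node: str
    ts: float
    phrase: str

def correlate(jobs: List[Job], lines: List[LogLine]) -> List[Tuple[LogLine, Optional[str]]]:
    """Pair each log line with the job running on that node at that time, if any."""
    out = []
    for line in sorted(lines, key=lambda l: l.ts):
        job = next(
            (j for j in jobs if j.node == line.node and j.start <= line.ts <= j.end),
            None,
        )
        out.append((line, job.job_id if job else None))
    return out

# Placeholder data, purely illustrative.
jobs = [Job("j1", "n1", 0, 100), Job("j2", "n2", 50, 150)]
lines = [
    LogLine("n1", 40, "phrase A"),
    LogLine("n2", 10, "phrase B"),
    LogLine("n1", 120, "phrase C"),
]
for line, job_id in correlate(jobs, lines):
    print(line.ts, line.node, line.phrase, job_id)
```

The linear scan over jobs keeps the sketch short; with real log volumes an interval index per node would be the more sensible choice.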

Topic Model -
Topics Over Time algorithm (LDA-style learning infused with time component)
Proceeds in three steps -
1. Assign topics to the correlated logs
2. Run ToT on the training data
3. Make chains from the trained data
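
To give a feel for the "time component" mentioned above: in the original ToT formulation, each topic has a Beta distribution over timestamps normalized to the interval (0, 1), in addition to the usual LDA word distribution. The sketch below only evaluates that per-topic timestamp density; it does not train a topic model, and the Beta parameters are made-up values rather than anything fitted in the paper.

```python
import math

def beta_pdf(t: float, a: float, b: float) -> float:
    """Density of Beta(a, b) at t, for 0 < t < 1."""
    norm = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return norm * t ** (a - 1) * (1 - t) ** (b - 1)

def normalize(ts: float, t_min: float, t_max: float) -> float:
    """Map a raw timestamp into (0, 1) relative to the observation window."""
    return (ts - t_min) / (t_max - t_min)

# Hypothetical topic whose Beta parameters put most mass late in the window.
late_topic = (5.0, 1.0)
early = beta_pdf(normalize(10, 0, 100), *late_topic)
late = beta_pdf(normalize(90, 0, 100), *late_topic)
print(late > early)  # True
```

A phrase observed late in the window is therefore more likely to be assigned to this topic than the same phrase observed early, which is how ToT ties topics to time.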


Formulation of node failure chains -
1. Find out the terminal messages
2. Make chains out of the phrases from the training data
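
The two steps above can be sketched roughly as follows, assuming each node's log has already been reduced to a time-ordered list of phrase identifiers and that the set of terminal messages is known. The function name, the `max_len` cutoff, and the phrase identifiers are assumptions for illustration; the paper's own procedure may differ.

```python
from typing import Iterable, List, Set, Tuple

def build_chains(phrases: Iterable[str], terminal: Set[str], max_len: int = 5) -> List[Tuple[str, ...]]:
    """Cut a time-ordered phrase sequence into chains that each end at a terminal message."""
    chains = []
    current = []
    for p in phrases:
        current.append(p)
        if p in terminal:
            # Keep at most the last max_len phrases leading up to the failure.
            chains.append(tuple(current[-max_len:]))
            current = []
    return chains

# Placeholder phrase ids; "T1" and "T2" stand in for terminal messages.
node_log = ["A", "B", "C", "T1", "A", "D", "T2"]
print(build_chains(node_log, {"T1", "T2"}, max_len=4))
# [('A', 'B', 'C', 'T1'), ('A', 'D', 'T2')]
```

Phrases that follow the last terminal message are dropped, since they do not lead to an observed failure in this log.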

Inference on new testing data -
1. The test data is run through a phrase sequence evaluator, after which potential node failures are flagged


Things about the paper I really liked -

1. It is an extremely interesting area of work, with a huge volume of ongoing research. The method used in this paper is the ToT algorithm, a variant of LDA-style learning. However, many more algorithms with similar properties are emerging.
2. The paper makes a point of filtering out some of the more generic causes of node failure, such as power loss, early on.
3. As the authors mention, even though the study was done only on Cray systems, it can be adapted very easily to different hardware systems. This generalizability of the model is really amazing!
4. The paper compares its algorithm with the related work and explains what it does differently (and how it outperforms them).

A few doubts about the paper -

1. The paper talks about missing data points in its log entries and how they could create a problem for the correlation. How would one handle this in a real-time environment?
2. When the phrases appear in a mixed order compared to the failure chain, is 50% really the best threshold? And what happens when the test data contains a phrase the model hasn't encountered before? Does the ML model also work on online data, so that it keeps producing phrases in real time?
3. Using the ToT algorithm, the authors are able to capture the evolution of topic intensity over time, but what about the evolution of topic content? Does ToT help us there?

-Rahul Biswas

Comments

  1. Nice review. Good point about topic content evolution though in this domain that may not be an issue. I would think new phrases would require re-learning as in any method. Missing data is an important property to address in any live data analysis task!
