Doomsday: Predicting Which Node Will Fail When on Supercomputers
Authors:
Anwesha Das, Frank Mueller, Paul Hargrove, Eric Roman, Scott Baden
Motivation:
Identifying nodes that are about to fail, before the failures actually occur, can reduce the overhead induced by failure recovery. The authors focus on node failure prediction in Cray systems. Cray machines are widely used supercomputers that produce low-level, Linux-style raw logs.
Time-based phrases (TBP) is proposed by the authors to extract messages from the temporal logs of Cray systems that are indicative of node failures. TBP utilizes Topics over Time (TOT), a variant of LDA (which extracts phrases and outputs the latent topics present in a document), to form chains of messages/phrases that lead up to failures.
Proactive actions for failure prevention can take place only if there is enough lead time. Hence, the focus of the paper is mostly on achieving better lead times while predicting node failures. With sufficient lead time, proactive measures like live job migration, process cloning, lazy checkpointing, and quarantine techniques can be employed before a node fails. The authors apply a phrase reduction of 30% to increase lead times. There is a trade-off between lead time and prediction performance: phrase reduction increases lead times and false positives while decreasing recall.
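The lead-time effect of phrase reduction can be illustrated with a toy example (not the paper's implementation): pruning the tail of a failure chain lets the predictor fire at an earlier phrase, lengthening lead time. All timestamps and phrases below are made up.

```python
# Toy illustration of the lead-time / recall trade-off from phrase
# reduction. Pruning the chain's tail means predicting earlier.
failure_time = 100.0  # seconds; hypothetical node-failure instant

# (timestamp, phrase) chain that historically preceded this failure
chain = [
    (40.0, "link degraded"),
    (65.0, "page fault burst"),
    (85.0, "machine check exception"),
    (95.0, "kernel panic"),
]

def lead_time(chain, failure_time, keep_fraction):
    """Predict at the last phrase kept after pruning the chain's tail."""
    kept = chain[: max(1, int(len(chain) * keep_fraction))]
    prediction_time = kept[-1][0]
    return failure_time - prediction_time

print(lead_time(chain, failure_time, 1.0))   # full chain        -> 5.0 s
print(lead_time(chain, failure_time, 0.75))  # last phrase cut   -> 15.0 s
```

The flip side, which the code does not show, is that a shorter chain carries less evidence, so false positives rise and recall drops.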
Positive points:
1. Trivial pitfalls like maintenance windows, power outages, and deliberate shutdowns are filtered out prior to training, improving predictions of node failures caused by internal or external faults.
2. TBP is minimally invasive and application-independent, as it uses neither fault injection nor source-code references to predict node failures.
3. The model considers the complexity of logs beyond flags or severity levels: even non-critical events can serve as indicators of failures over time.
4. A prior study [1] shows that prediction recall has an important impact on the overall efficiency improvement, whereas prediction precision has only a minor one. Based on this study, the paper focuses on improving recall for failure prediction, yet the precision achieved by TBP is still high (98%).
5. Despite the model's complexity, TBP handles scalability well, as it can be deployed over a time-limited window of 6 hours so that node failures are detected faster.
6. Cray system logs are unstructured and act as a superset of the logs of many other systems, such as BlueGene. TBP is generic and, with appropriate integration of data sources, can be applied to other systems that have simpler, structured logs.
Negative points:
1. The model assumes consistent and complete logs. TBP requires integrating and correlating a distributed set of events in space and time from various sources, such as ALPS logs and job logs, for both training and evaluation. Manual validation was used for missing or inconsistent training data, since such gaps affect correlation across the logs. Similar manual intervention by system administrators may be needed to validate log correlation after events such as stopped daemons/logging or job-scheduler upgrades.
2. The model cannot detect failures that did not occur in the past. Such failures might be detectable if the model could somehow incorporate failure detection based on source-code analysis.
3. Some phrases related to rare errors are ranked low and discarded from the chains of phrases used for evaluation, which lowers recall. Could we maintain a set of phrases corresponding to crucial but rare errors and always include them among the top N phrases?
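The suggestion above can be sketched as a whitelist forced into the top-N selection: rare-but-critical phrases are retained even when their learned scores fall below the cutoff. All phrases, scores, and the selection function here are hypothetical, not part of TBP.

```python
# Sketch: always keep whitelisted critical phrases in the top-N set,
# filling the remaining slots with the best-scoring regular phrases.
ranked_phrases = [  # (phrase, score), sorted by descending score
    ("lustre timeout", 0.9),
    ("link degraded", 0.8),
    ("page fault burst", 0.7),
    ("voltage fault", 0.2),  # rare but crucial; normally pruned
]
critical_whitelist = {"voltage fault"}

def select_top_n(ranked, n, whitelist):
    forced = [(p, s) for p, s in ranked if p in whitelist]
    regular = [(p, s) for p, s in ranked if p not in whitelist]
    return forced + regular[: max(0, n - len(forced))]

selected = select_top_n(ranked_phrases, n=2, whitelist=critical_whitelist)
print([p for p, _ in selected])  # ['voltage fault', 'lustre timeout']
```

One open question with such a scheme is whether force-included rare phrases would raise false positives, mirroring the paper's lead-time/recall trade-off.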
-Sravya
Very nice positive points. The manual intervention is an issue as to whether such an approach can be automated and put online in a production setting. I also think the 16% false-negative rate is disappointing, as a missed failure may be more problematic than reacting to a failure that did not occur. I also liked the sequence-pruning idea as a way to gain a longer prediction horizon -- though this may assume earlier phrases in the chain are more indicative of the problem, so to speak.