CrystalBall: Statically Analyzing Runtime Behavior via Deep Sequence Learning
Authors:
Stephen Zekany, Daniel Rings, Nathan Harada, Michael A. Laurenzano, Lingjia Tang, Jason Mars
Motivation:
Execution time can be reduced by optimizing a small portion (path) of a program instead of the complete program. The motivation of the paper is to find these small portions, or "hot paths" as the authors term them, within a substantially large pool of possible paths, which are formed from the "basic blocks" of the "intermediate representation" at compile time.
Positive Points:
1) An intermediate representation (IR) is a form of the source code, lowered toward the machine, that is independent of the source language. Leveraging this to make the proposed algorithm language independent is interesting.
2) Precision, recall, and F-measure all depend on a classification threshold, which is not ideal for this problem. AUROC, as the authors explain, is a better metric, although the operating point we ultimately deploy at still matters. (A small sketch contrasting the two appears after this list.)
3) This is not only a sequence learning problem but also an anomaly detection problem, where a hot path is the anomaly (roughly one in a million). LSTM networks are a good fit here because they are strong at sequence learning and are also widely used for anomaly detection.
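To make points 2 and 3 concrete, here is a minimal, hypothetical sketch (not the paper's actual model or data): an LSTM scores sequences of basic-block IDs as hot or cold, and the scores are evaluated with both a threshold-dependent metric (F1 at a 0.5 cutoff) and the threshold-free AUROC. The vocabulary size, sequence length, and labels below are made up purely for illustration.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import f1_score, roc_auc_score

VOCAB, SEQ_LEN, HIDDEN = 500, 40, 64   # hypothetical sizes

class PathClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)            # basic-block ID -> vector
        self.lstm = nn.LSTM(32, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, 1)                 # hotness logit

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h[:, -1, :]).squeeze(-1)        # score from last time step

# Made-up, heavily imbalanced data: 3 hot paths among 2000.
torch.manual_seed(0)
x = torch.randint(0, VOCAB, (2000, SEQ_LEN))
y = np.zeros(2000); y[:3] = 1                           # hot paths are the rare class

model = PathClassifier()
with torch.no_grad():
    scores = torch.sigmoid(model(x)).numpy()

# F1 changes with the chosen cutoff; AUROC integrates over all cutoffs.
print("F1 @ 0.5:", f1_score(y, scores > 0.5, zero_division=0))
print("AUROC   :", roc_auc_score(y, scores))
```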
Negative Points:
1) The authors propose sampling the data by capping the number of cold paths per function at 2000 while keeping all hot paths, which makes the data more balanced, but the non-sampled data is very skewed. Would the proposed algorithm achieve the same AUROC/performance when tested on the non-sampled data, where hot-path occurrences are very rare? (A sketch of this per-function downsampling appears after this list.)
2) As far as I understand, the authors use a high-level IR, which is very close to the source code and requires only one compiler pass. Would a low-level IR, i.e., abstract machine-level code, give better results, even though it requires multiple compiler passes? Also, the authors compile with the O2 optimization level. How would other optimization levels (e.g., O3) affect the proposed algorithm?
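To illustrate negative point 1, here is a minimal sketch of the kind of per-function downsampling being questioned (the data layout and function names are hypothetical, not the paper's actual pipeline): every hot path is kept, while cold paths are capped at 2000 per function. The question raised above is whether AUROC measured after this step carries over to the original, unsampled distribution.

```python
import random
from collections import defaultdict

MAX_COLD_PER_FUNCTION = 2000

def downsample(paths):
    """paths: iterable of (function_name, block_sequence, is_hot) tuples."""
    kept, cold_by_fn = [], defaultdict(list)
    for fn, blocks, is_hot in paths:
        if is_hot:
            kept.append((fn, blocks, is_hot))        # keep every hot path
        else:
            cold_by_fn[fn].append((fn, blocks, is_hot))
    for fn, cold in cold_by_fn.items():
        k = min(len(cold), MAX_COLD_PER_FUNCTION)
        kept.extend(random.sample(cold, k))          # cap cold paths per function
    return kept
```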
-Kayala Sai Kumar
Good analysis. Point #1 -- I believe they are assuming hot paths are far less likely -- unless by rare you mean REALLY unlikely. For point 2, it might be nice to see how the resulting inference performs relative to more expensive dynamic optimization (which was the motivation for their approach).