Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach
Authors:
Motivation:
The paper introduces a framework that manages the shared resources of a chip multiprocessor, namely 1) off-chip bandwidth, 2) L2 cache space, and 3) the power budget, in a coordinated manner using ANNs. The ANN model takes a program's behaviour at multiple levels of granularity as input and outputs the predicted performance of a given allocation. Since exhaustive search of the allocation space is intractable, the framework uses stochastic hill-climbing with the ANN as a heuristic function to search the allocation space. It produces better results than fair sharing, unmanaged sharing, and individual/uncoordinated resource management.
Main points:
1. This paper, published a decade ago, improved upon previous work [5], which used hill-climbing, by introducing learning models. It has set a benchmark for other research in the area of coordinated resource management.
2. Despite not having true labels (the performance of an allocation is not known ahead of time), the authors estimate the prediction error using the CoV (coefficient of variation) across the ensemble's predictions. The observed near-linear relationship between CoV and prediction error is interesting and plays a crucial role.
3. The learning model is an ensemble of 4 ANNs, each with 12 input nodes, a single hidden layer of 4 nodes, and 1 output node. The 12 input attributes capture the L2 cache state over both the recent and the long-term past (including dirty cache ways) plus the allocations of the three shared resources (points 2 and 3 are illustrated in the first sketch after this list).
4. With a variant of stochastic hill-climbing (a heuristic search), they prune sub-regions with high CoV and thus reduce the search overhead (see the second sketch after this list).
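A minimal sketch of points 2 and 3 above, assuming a 12-4-1 feed-forward network per ensemble member (the paper's actual feature encoding, training procedure, and hardware implementation are not reproduced here): the ensemble's mean output is the performance prediction, and the coefficient of variation across the four outputs serves as the confidence estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyANN:
    """One 12-4-1 feed-forward network: sigmoid hidden layer, linear output."""
    def __init__(self, n_in=12, n_hidden=4):
        self.W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(scale=0.1, size=n_hidden)
        self.b2 = 0.0

    def predict(self, x):
        h = 1.0 / (1.0 + np.exp(-(x @ self.W1 + self.b1)))  # hidden activations
        return float(h @ self.W2 + self.b2)                  # predicted performance

class Ensemble:
    """Four ANNs: the mean is the estimate, the CoV is the confidence proxy."""
    def __init__(self, n_nets=4):
        self.nets = [TinyANN() for _ in range(n_nets)]

    def predict(self, x):
        preds = np.array([net.predict(x) for net in self.nets])
        mean = preds.mean()
        cov = preds.std() / abs(mean) if mean != 0 else float("inf")
        return mean, cov

# The 12 attributes would encode the L2 cache state over the recent and
# long-term past (including dirty ways) plus the three resource allocations;
# random values in [0, 1] stand in for them here.
x = rng.random(12)
perf, confidence = Ensemble().predict(x)
print(f"predicted performance = {perf:.3f}, CoV = {confidence:.3f}")
```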
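And a toy version of the search in point 4, with a made-up predict() standing in for the ANN ensemble and an arbitrary CoV threshold (neither the neighbourhood structure nor the threshold value comes from the paper): candidate allocations whose CoV is too high are pruned before they can steer the climb.

```python
import random

random.seed(0)
APPS, CACHE_WAYS, BW_UNITS, POWER_UNITS = 4, 16, 16, 16

def predict(alloc):
    """Stand-in for the ANN ensemble: returns (predicted performance, CoV).
    A made-up concave reward plus fake ensemble disagreement, just to drive the search."""
    cache, bw, power = alloc
    perf = sum((c * b * p) ** 0.3 for c, b, p in zip(cache, bw, power))
    cov = random.uniform(0.0, 0.2)
    return perf, cov

def random_split(total, parts):
    """Random integer partition of `total` resource units among `parts` apps."""
    cuts = sorted(random.randint(0, total) for _ in range(parts - 1))
    return [b - a for a, b in zip([0] + cuts, cuts + [total])]

def neighbor(alloc):
    """Move one unit of one randomly chosen resource between two random apps."""
    cache, bw, power = (list(r) for r in alloc)
    dim = random.choice([cache, bw, power])
    src, dst = random.sample(range(APPS), 2)
    if dim[src] > 0:
        dim[src] -= 1
        dim[dst] += 1
    return cache, bw, power

def hill_climb(steps=200, cov_threshold=0.15):
    best = (random_split(CACHE_WAYS, APPS),
            random_split(BW_UNITS, APPS),
            random_split(POWER_UNITS, APPS))
    best_perf, _ = predict(best)
    for _ in range(steps):
        cand = neighbor(best)
        perf, cov = predict(cand)
        if cov > cov_threshold:   # prune low-confidence (high-CoV) candidates
            continue
        if perf > best_perf:      # greedy uphill move, otherwise stay put
            best, best_perf = cand, perf
    return best, best_perf

print(hill_climb())
```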
Trade-offs:
1. Re-training is done every fifth interval, which creates overhead. Also, hyper-parameters such as the number of layers, the number of hidden nodes, the number of epochs, momentum, and learning rate are fixed. As the context of the system changes, hyper-parameter tuning may be required along with re-training; the authors only mention that learning rate and momentum are robust. The cross-validation already used to prevent over-fitting could also be used for hyper-parameter tuning (see the sketch after this list).
2. The paper addresses context switches by suggesting kernel modifications to persist the ANNs' weights in the process state, but it doesn't mention what happens when an application is restarted. Is the model retrained from initial, inaccurate weights?
3. The hardware-based solution adds complexity and makes maintenance harder.
4. Now that Moore's law is dead, does the approach still scale with an increasing number of applications?
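On the cross-validation remark in trade-off 1: a small offline illustration of k-fold hyper-parameter search over a hypothetical grid, using scikit-learn on synthetic data (the paper itself fixes these values in hardware, so this is only what such tuning could look like).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 12))                                  # 12 input attributes per sample
y = X @ rng.random(12) + 0.05 * rng.standard_normal(200)   # synthetic performance targets

# Hypothetical candidate values; none of these grids come from the paper.
grid = {
    "hidden_layer_sizes": [(4,), (8,), (4, 4)],
    "learning_rate_init": [0.01, 0.1],
    "momentum": [0.5, 0.9],
}
search = GridSearchCV(
    MLPRegressor(solver="sgd", max_iter=2000, random_state=0),
    grid, cv=4, scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)
```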
-Sravya
For 3, maybe the recent emergence of TPUs can help. For 4, I think we are still increasing core counts, and the paper argues core counts would grow faster than the number of applications. On a 16-core desktop machine, would I have more than 16 apps concurrently executing? Unclear.
Adding to the above comment: since the model is hard-coded, I feel that the ability to tweak the model architecture (e.g., through a reconfigurable layer, like an FPGA) would greatly help in adjusting to future workloads.
Two of my major concerns about this paper:
1. It's not easy to collect input data like "L2 cache read/write misses", especially for a shared cache. It is hard for a software monitor to collect accurate information (like a cache side-channel attack), and a hardware approach needs to differentiate which CPU core executes each read/write.
2. Even in 2008, the number of user/kernel processes was much bigger than 4 (I would guess 20-30). In that case, the hardware ANN becomes a bottleneck for independent processes, even though the interval duration is small.
Here is an update after discussing with my lab mate Kartik: there are many performance counters in modern computers, so collecting this hardware performance information is not hard anymore.
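For example, on Linux such counters can be read from user space with the perf tool; a minimal illustrative sketch (the event aliases are generic and vary by CPU):

```python
import subprocess

# Count last-level-cache loads and misses for a short dummy workload using
# Linux `perf`. "LLC-loads"/"LLC-load-misses" are generic aliases that may be
# named differently (or be unavailable) on some CPUs; purely illustrative.
cmd = ["perf", "stat", "-e", "LLC-loads,LLC-load-misses", "--", "sleep", "1"]
result = subprocess.run(cmd, capture_output=True, text=True)
print(result.stderr)  # perf stat writes its counter summary to stderr
```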
Sorry for my incorrect comment.