Device Placement Optimization with Reinforcement Learning
Why is this interesting?
Modern neural network training is often synonymous with GPUs. However, given the
size of the data sets that deep learning requires, we cannot actually train most
industrial-strength models on a single GPU. Instead, we use combinations of
CPUs and GPUs, which leads to the important question of where each particular
operation should be placed.
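As a concrete illustration of what is being decided here, TensorFlow lets an expert pin operations to devices by hand. The following is a minimal sketch, assuming one CPU and two visible GPUs; it is exactly this kind of manual decision that the paper tries to automate.

import tensorflow as tf

# Manual device placement: each block of operations is pinned to a device by
# hand. Device names assume one CPU and two GPUs are visible on the machine.
with tf.device('/CPU:0'):
    x = tf.random.normal([64, 1024])                             # keep input work on the CPU

with tf.device('/GPU:0'):
    h = tf.nn.relu(tf.matmul(x, tf.random.normal([1024, 512])))  # first layer on GPU 0

with tf.device('/GPU:1'):
    y = tf.matmul(h, tf.random.normal([512, 10]))                # second layer on GPU 1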
A quick review.
Device placement is usually performed by human experts, most commonly by giving
each layer of a DNN its own GPU. Mirhoseini et al. instead propose using
reinforcement learning as an optimization procedure to identify the best device
placement.
Mirhoseini et al. employ a sequence-to-sequence model as the policy, i.e. a
probability distribution over placements: an encoder reads embeddings of the
operations, and an attentional LSTM decoder converts them into actual device
placements. However, large-scale neural networks have far too many operations
for this to be feasible directly, so the operation graph is first compressed by
colocating several operations onto the same device. Finally, they speed up
training using several controllers, each of which samples from the policy
distribution and executes the sampled placement on the workers under it. The
controllers calculate gradients after execution and return them to a parameter
server.
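To make the pipeline concrete, here is a toy sketch of what one controller does per iteration: sample a device for each colocation group from the policy, run the placed graph, and record the runtime as (negative) reward. The per-group logits stand in for the output of the sequence-to-sequence policy network, and measure_runtime is a hypothetical stand-in for executing one training step on real hardware; this is not the paper's exact model.

import numpy as np

rng = np.random.default_rng(0)
num_groups, num_devices = 6, 3                        # colocation groups and available devices
logits = rng.normal(size=(num_groups, num_devices))   # stand-in for the policy network's output

def sample_placement(logits):
    # softmax over devices for each group, then sample one device per group
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    placement = np.array([rng.choice(num_devices, p=p) for p in probs])
    return placement, probs

def measure_runtime(placement):
    # hypothetical: place the graph accordingly, run one training step, time it
    return 1.0 + 0.1 * rng.normal()

placement, probs = sample_placement(logits)
reward = -measure_runtime(placement)                  # shorter runtime -> higher reward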
Things I quite liked.
They recognize that device placement is a subtle question, and notice that human
experts may not be able to optimally locate various operations.
The idea of using reinforcement learning as an optimization technique is
interesting in itself, especially in cases where there is a quick feedback
loop. Their metric, the time taken to perform one gradient update, is simple
and directly measurable, which makes it a useful basis for further analyses.
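Concretely, this reward can be collected by timing one optimizer step under the proposed placement. A minimal sketch, where train_step is a hypothetical callable that runs a single gradient update on the placed graph:

import time

def seconds_per_update(train_step, warmup=2, repeats=5):
    # Warm-up runs absorb one-off costs (graph compilation, memory allocation)
    # so the measurement reflects steady-state step time.
    for _ in range(warmup):
        train_step()
    start = time.perf_counter()
    for _ in range(repeats):
        train_step()
    return (time.perf_counter() - start) / repeats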
Things I didn't quite like.
They use the REINFORCE estimator for policy gradients. Especially towards the
tail end of the learning process, the variance of this estimator can be much
larger than the remaining improvements in the objective, which can make
training unstable.
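To illustrate the concern, the following self-contained toy example draws single-sample REINFORCE gradients for one categorical device choice with small reward differences (synthetic numbers, not from the paper) and compares their spread with and without a baseline, a standard variance-reduction trick.

import numpy as np

rng = np.random.default_rng(1)
logits = np.array([0.5, 0.0, -0.5])
probs = np.exp(logits) / np.exp(logits).sum()
rewards = np.array([-1.00, -1.05, -1.10])            # negative runtimes per device (synthetic)

def grad_sample(baseline=0.0):
    d = rng.choice(3, p=probs)                       # sample a device from the policy
    one_hot = np.eye(3)[d]
    return (rewards[d] - baseline) * (one_hot - probs)   # score-function gradient w.r.t. logits

raw = np.stack([grad_sample() for _ in range(10000)])
base = np.stack([grad_sample(baseline=rewards @ probs) for _ in range(10000)])
print("gradient std without baseline:", raw.std(axis=0))
print("gradient std with baseline:   ", base.std(axis=0))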
Operation colocation is performed using heuristics. Since this step again
requires expert intervention and guesswork, it would be far more interesting to
learn the colocation from the dependence structure of the graph.
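For illustration, one simple (hypothetical) heuristic of this kind collapses operations into colocation groups by their name-scope prefix; the op names below are made up and are not the paper's actual grouping rule.

from collections import defaultdict

ops = ["layer1/matmul", "layer1/bias_add", "layer1/relu",
       "layer2/matmul", "layer2/bias_add", "softmax/logits"]

groups = defaultdict(list)
for op in ops:
    groups[op.split("/")[0]].append(op)              # everything in a scope shares one device

print(dict(groups))                                  # 3 placement decisions instead of 6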
The models are architecture-specific; in particular, changing the architecture
requires retraining from scratch, rather than learning the quirks of different
hardware types once and reusing them.
This could be further improved by viewing it as a regression problem in
the space of graph embeddings. Efficient Bayesian optimization is a good
starting point for better results.
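A hedged sketch of that regression view, using scikit-learn and synthetic data: each candidate placement is represented by a feature vector standing in for a graph embedding, a Gaussian process is fit to observed runtimes, and the next placement to try is chosen by a lower-confidence-bound acquisition rule.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(20, 8))            # 20 candidate placements, 8-dim embeddings (synthetic)
runtimes = 1.0 + np.abs(embeddings[:, 0]) + 0.05 * rng.normal(size=20)

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(embeddings[:10], runtimes[:10])           # runtimes observed for half the candidates

mean, std = gp.predict(embeddings[10:], return_std=True)
next_idx = 10 + int(np.argmin(mean - std))       # lower confidence bound: favour low predicted runtime
print("next placement to evaluate:", next_idx)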
Nice observation w.r.t. colocation, although colocation is used to reduce complexity, and learning the colocation would run counter to that. Including hardware features in the learning space is an interesting idea worth exploring. The last point needs some more explanation.