
Showing posts from February, 2019

Up By Their Bootstraps: Online Learning in Artificial Neural Networks for CMP Uncore Power Management

by Jae-Yeon Won, Xi Chen, Paul Gratz, and Jiang Hu from Texas A&M University, and Vassos Soteriou from Cyprus University of Technology.

This paper focuses on power management of the uncore structures: the Last-Level Cache (LLC) and the Network-on-Chip (NoC). The method uses an online-learning artificial neural network (ANN) assisted by PI (Proportional-Integral) control. The authors compare offline versus online ANN training and many combinations of different optimization methods. Overall, they reduce the energy-delay product of the uncore system by 27% versus state-of-the-art methodologies.

Motivation of this paper:
1. The uncore structures also consume a lot of energy in modern CMPs, since the number of cores and the communication among them keep increasing.
2. Most previous work on power management focused either on core DVFS or on DVFS domain partitioning around the cores, including merely a slice of the uncore.
3. In reality, applications often exhibit abrupt changes that cannot be captured by a rule-based controller such as PI control.
Major …
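To make the control loop concrete, here is a minimal Python sketch (not the authors' implementation) of the bootstrapping idea: a small network is trained online to forecast uncore utilization, and the PI controller drives the voltage/frequency decision until the network's predictions become trustworthy. The feature set, network size, error threshold, and V/F levels are all assumptions for illustration.

```python
import numpy as np

class PIController:
    def __init__(self, kp=0.5, ki=0.1, target_util=0.6):
        self.kp, self.ki, self.target = kp, ki, target_util
        self.integral = 0.0

    def control(self, measured_util):
        # Positive output suggests raising the uncore V/F level.
        error = measured_util - self.target
        self.integral += error
        return self.kp * error + self.ki * self.integral

class OnlinePredictor:
    """Tiny one-hidden-layer network trained online with SGD."""
    def __init__(self, n_features=4, n_hidden=8, lr=0.01):
        rng = np.random.default_rng(0)
        self.W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
        self.W2 = rng.normal(scale=0.1, size=n_hidden)
        self.lr = lr

    def predict(self, x):
        self.h = np.tanh(x @ self.W1)
        return float(self.h @ self.W2)

    def update(self, x, target):
        # One SGD step on squared error; returns the absolute error.
        pred = self.predict(x)
        err = pred - target
        self.W2 -= self.lr * err * self.h
        self.W1 -= self.lr * err * np.outer(x, (1 - self.h ** 2) * self.W2)
        return abs(err)

def choose_vf_level(pred_util, levels=(0.7, 0.8, 0.9, 1.0)):
    # Pick the lowest V/F level that covers the predicted utilization.
    for lvl in levels:
        if pred_util <= lvl:
            return lvl
    return levels[-1]

# Per control interval (illustrative): train on the observed utilization and
# fall back to the PI controller while the predictor is still inaccurate.
pi, ann = PIController(), OnlinePredictor()
features = np.array([0.3, 0.2, 0.1, 0.4])   # e.g. normalized LLC/NoC counters
observed_util = 0.55                         # utilization measured this interval
err = ann.update(features, observed_util)    # online training step
if err > 0.1:                                # ANN not yet trusted: PI decides
    level = 1.0 if pi.control(observed_util) > 0 else 0.7
else:                                        # ANN trusted: act on its forecast
    level = choose_vf_level(ann.predict(features))
```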

ReLeQ: An Automatic Reinforcement Learning Approach for Deep Quantization of Neural Networks

Authors: Amir Yazdanbakhsh*, Ahmed T. Elthakeb*, Prannoy Pilligundla, FatemehSadat Mireshghallah, Hadi Esmaeilzadeh (ACT Lab, UCSD, and Google Brain).

This paper provides a reinforcement-learning-based approach for making deep neural networks computationally faster and smaller. Reducing the bitwidth of weights (known as deep quantization) normally requires manual effort, hyper-parameter tuning, and re-training. The authors provide an end-to-end framework (called ReLeQ) to automate the deep quantization process without compromising the classification accuracy of the network.

Intuition behind the addressed research problem: quantizing all layers to the same bitwidth results in sub-optimal accuracy. The intuition the authors provide is that each layer in a neural network plays a different role and hence has unique properties in terms of its weight distribution. Over-quantizing one layer can result in other layers also being over-quantized …
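As a rough illustration of what per-layer bitwidth selection means, the sketch below uniformly quantizes each layer's weights to a chosen bitwidth and reports the resulting error; the layer names and bitwidths are made up, and this is not the ReLeQ agent itself.

```python
import numpy as np

def quantize(weights, bits):
    """Uniformly quantize a weight tensor to 2**bits levels over its range."""
    lo, hi = weights.min(), weights.max()
    levels = 2 ** bits - 1
    step = (hi - lo) / levels
    return np.round((weights - lo) / step) * step + lo

rng = np.random.default_rng(0)
layers = {"conv1": rng.normal(size=(64, 3, 3, 3)),
          "fc":    rng.normal(size=(10, 256))}

# A hypothetical per-layer bitwidth assignment such as an RL agent might emit.
bitwidths = {"conv1": 5, "fc": 2}

for name, w in layers.items():
    wq = quantize(w, bitwidths[name])
    # Layers react differently to the same bitwidth, which is why a uniform
    # choice across all layers tends to be sub-optimal.
    print(name, "quantization MSE:", float(np.mean((w - wq) ** 2)))
```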

Integrated CPU and L2 Cache Voltage Scaling using Machine Learning

Authors: Nevine AbouGhazaleh, Alexandre Ferreira, Cosmin Rusu, Ruibin Xu, Frank Liberato, Bruce Childers, Daniel Mossé, Rami Melhem.

Motivation: In Multiple Clock Domain chips, the design allows fine-grained power management of each domain using dynamic voltage and frequency scaling. The paper exploits this extra level of power management to generate a custom power-management policy for embedded processors. The authors propose a Power-Aware Compiler-based approach using Supervised Learning (PACSL) to automatically derive an integrated CPU-core and L2-cache DVS policy. The approach exploits the fact that every application goes through memory-intensive and CPU-intensive stages, which can be identified and used to set appropriate voltages and frequencies for the processor and the L2 cache.

Pros:
1. The input state space takes into account CPI, L2PI, MPI, and the CPU-core and cache frequencies, which capture both program and architectural behavior.
2. The technique can be …
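A minimal sketch of the supervised-learning idea follows: train a classifier that maps the observed state (CPI, L2PI, MPI, and the current frequencies) to the frequency pair found best offline. The tiny dataset, frequency levels, and the choice of a decision tree are assumptions for illustration, not the paper's exact learner or data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row: (CPI, L2PI, MPI, current CPU freq, current L2 freq)
X = np.array([
    [0.8, 0.05, 0.01, 1.0, 1.0],   # CPU-intensive phase
    [2.5, 0.40, 0.20, 1.0, 1.0],   # memory-intensive phase
    [1.2, 0.10, 0.05, 0.8, 1.0],
])
# Label: index of the (CPU freq, L2 freq) pair judged best offline for that state.
y = np.array([0, 2, 1])
FREQ_PAIRS = [(1.0, 0.8), (0.8, 1.0), (0.6, 1.0)]   # hypothetical settings

policy = DecisionTreeClassifier(max_depth=3).fit(X, y)

def next_setting(state):
    # The learned policy picks the next (CPU, L2) frequency pair for a state.
    return FREQ_PAIRS[int(policy.predict([state])[0])]

print(next_setting([2.3, 0.35, 0.18, 1.0, 1.0]))    # memory-bound: lower CPU freq
```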

Integrated CPU and L2 Cache Voltage Scaling using Machine Learning

Authors: Nevine AbouGhazaleh, Alexandre Ferreira, Cosmin Rusu, Ruibin Xu, Frank Liberato, Bruce Childers, Daniel Mossé, Rami Melhem, Department of Computer Science, University of Pittsburgh.

Motivation: Power management is no longer trivial as computational and storage capabilities increase. The Multiple Clock Domains (MCD) design greatly improves fine-grained power management through dynamic voltage scaling (DVS). This paper presents a Power-Aware Compiler-based approach using Supervised Learning (PACSL) that automatically constructs an integrated DVS policy for the CPU core and on-chip L2 cache according to the system and workload requirements.

Main points:
1. The system state representation they select consists of CPI, L2PI, MPI, the CPU-core frequency, and the L2-cache frequency, which characterize application behavior. The optimal setting for each state is determined by exhaustive search over the state space.
2. They use supervised learning to learn the policy from this training data.
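To illustrate how the per-state labels for such a learner could be produced, the sketch below exhaustively evaluates every candidate CPU/L2 frequency pair under a toy cost model and keeps the cheapest one; the cost model and frequency levels are stand-ins, not the paper's simulator or metric.

```python
import itertools

CPU_FREQS = [0.6, 0.8, 1.0]
L2_FREQS = [0.6, 0.8, 1.0]

def cost(state, f_cpu, f_l2):
    """Toy energy-delay-style cost: penalizes stalls at low frequency."""
    cpi, l2pi, mpi = state
    delay = cpi / f_cpu + l2pi / f_l2 + 10 * mpi
    energy = f_cpu ** 2 + 0.5 * f_l2 ** 2
    return energy * delay

def best_setting(state):
    # Exhaustive search over all candidate (CPU freq, L2 freq) pairs.
    return min(itertools.product(CPU_FREQS, L2_FREQS),
               key=lambda fs: cost(state, *fs))

# (CPI, L2PI, MPI) samples -> optimal (CPU freq, L2 freq) training labels
for state in [(0.8, 0.05, 0.01), (2.5, 0.40, 0.20)]:
    print(state, "->", best_setting(state))
```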

Device Placement Optimization with Reinforcement Learning

Why is this interesting? Modern neural network training is often synonymous with GPUs. However, given the size of the datasets that deep learning requires, we cannot actually train most industrial-strength models on a single GPU. Instead, we use combinations of CPUs and GPUs, which leads to the important question of where to place each particular operation.

A quick review: device placement is usually performed by human experts, mainly by giving each layer of a DNN its own GPU. Mirhoseini et al. instead propose using reinforcement learning as an optimization procedure to identify the best device placement. They employ a sequence-to-sequence model for the probability distribution over placements, and an attentional LSTM to convert the embedded policy into actual device placements. However, large-scale neural networks have far too many operations to make this modeling feasible, so the operation graphs are compressed by colocating several operations …
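A minimal sketch of the underlying policy-gradient idea follows: sample placements, measure their runtime, and push the policy toward faster placements (REINFORCE with a moving-average baseline). The toy per-operation categorical policy and the stand-in runtime measurement are assumptions for illustration, not the paper's attentional sequence-to-sequence network or its real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
N_OPS, N_DEVICES = 6, 2
logits = np.zeros((N_OPS, N_DEVICES))     # per-operation device preferences
baseline, lr = None, 0.1

def measure_runtime(placement):
    # Stand-in for actually running the graph; penalizes cross-device hops.
    return 1.0 + 0.2 * np.abs(np.diff(placement)).sum() + 0.1 * rng.random()

for step in range(200):
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    placement = np.array([rng.choice(N_DEVICES, p=p) for p in probs])
    runtime = measure_runtime(placement)
    baseline = runtime if baseline is None else 0.9 * baseline + 0.1 * runtime
    advantage = runtime - baseline            # below-baseline runtime is good
    for op, dev in enumerate(placement):
        grad = -probs[op]
        grad[dev] += 1.0                      # d log pi / d logits for this op
        logits[op] -= lr * advantage * grad   # descend toward faster placements
```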

Device Placement Optimization with Reinforcement Learning

Motivation: The main aim of the paper is to learn which TensorFlow operations perform best on which devices, in order to make the best use of the compute and memory capabilities of the available CPUs and GPUs when running a neural network. The environments in use today are heterogeneous and distributed, with a mixture of hardware devices such as CPUs and GPUs, and the decision about which parts of a neural model to place on which device is made by human experts. These decisions are based on simple heuristics and intuition, a limitation the authors tackle using reinforcement learning.

Main points:
1. They use a sequence-to-sequence model in which the operations of a neural network form the input sequence and the device placements form the output sequence.
2. They use three benchmark models for comparison (Recurrent Neural Network Language Model (RNNLM), Neural Machine Translation with attention, …
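As a toy illustration of point 1, the snippet below shows one plausible way to view an operation graph as an input sequence of feature vectors and a placement as an output sequence of device ids; the feature fields and the encoding are simplified assumptions, not the paper's exact representation.

```python
# Hypothetical operation graph: each entry describes one TensorFlow operation.
ops = [
    {"name": "embed",   "type": "MatMul", "out_shape": (64, 512),   "inputs": []},
    {"name": "lstm_0",  "type": "LSTM",   "out_shape": (64, 512),   "inputs": ["embed"]},
    {"name": "lstm_1",  "type": "LSTM",   "out_shape": (64, 512),   "inputs": ["lstm_0"]},
    {"name": "softmax", "type": "MatMul", "out_shape": (64, 10000), "inputs": ["lstm_1"]},
]

def featurize(op, vocab={"MatMul": 0, "LSTM": 1}):
    # One feature vector per operation: [type id, output size, fan-in].
    size = 1
    for d in op["out_shape"]:
        size *= d
    return [vocab[op["type"]], size, len(op["inputs"])]

input_sequence = [featurize(op) for op in ops]   # what the policy consumes
output_sequence = [0, 1, 1, 0]                   # one device id per operation
placement = dict(zip((op["name"] for op in ops), output_sequence))
print(placement)
```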

A Machine Learning Approach to Mapping Streaming Workloads to Dynamic Multicore Processors

Authors: Paul-Jules Micolet, Aaron Smith, Christophe Dubach.

Motivation: There exists no accurate, automated solution for optimizing the performance of streaming applications at both the hardware and software levels. This paper presents a machine-learning technique for reaching near-optimal performance by jointly choosing the number of software-level threads and hardware-level cores. It analyzes the effect of the number of threads and the number of cores on an application's performance, and determines an appropriate number of threads and cores for a Dynamic Multicore Processor using static code features of the application.

Positive points:
[1] The paper analyzes the impact of thread partitioning on the performance of various StreamIt applications for individual cores and composed cores. It concludes that the performance of an application follows the same trend regardless of the composition of cores. Hence, it reduces the …
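A hedged sketch of the general idea: a model trained on static code features predicts a (threads, cores) configuration for an unseen program. The features, candidate configurations, and nearest-neighbour learner below are illustrative assumptions, not the paper's actual feature set or model.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical static features per StreamIt program:
# (# filters, # stateful filters, avg work per filter, pipeline depth),
# labelled offline with the index of the best-performing configuration.
X = np.array([
    [12,  2, 150, 4],
    [40, 10, 800, 9],
    [ 6,  0,  60, 3],
])
CONFIGS = [(4, 4), (16, 8), (2, 2)]   # candidate (threads, cores) pairs
y = np.array([0, 1, 2])               # best config index per training program

model = KNeighborsClassifier(n_neighbors=1).fit(X, y)

def predict_config(features):
    # Map static code features of a new program to a (threads, cores) choice.
    return CONFIGS[int(model.predict([features])[0])]

print(predict_config([30, 8, 700, 8]))   # a large pipeline maps to (16, 8)
```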