Neural Network Meets DCN: Traffic-driven Topology Adaptation with Deep Learning
Introduction:
In this paper, the authors try to obtain an optimal network topology for data center networks using a machine learning approach. They use multiple neural network models that take a traffic demand matrix as input and output a network topology. The topology is optimized to achieve the best value of a network metric chosen by the data center operator.
Motivation:
Data center networks are a critical part of the infrastructure: congestion at a single switch can degrade the performance of a large part of the system. The authors argue that most data centers use static network topologies and overprovision them to handle different scales of traffic, which increases investment cost and resource consumption at runtime. They propose dynamically reconfigurable networks that can change the topology depending on the traffic demand.
Research questions addressed by the paper:
The paper proposes models that are capable of understanding different traffic patterns and providing a network topology optimized for each pattern. This is a centralized approach that observes traffic across the entire data center and computes an optimal solution.
Method:
- With some empirical evidence, the authors claim that topologies that support a given set of traffic demands share a set of critical links. Hence historical traffic patterns and their corresponding optimal topologies can be used to train neural network models offline, which are then used to infer network topologies for a data center online.
- The input to the entire system is a demand matrix, where each cell represents the traffic demand between a pair of racks. The network consists of two parts: a configurable part and a static part. The configurable part consists of Optical Circuit Switches, which enable dynamic optical links to be established between different ports.
- The Xweaver framework/model consists of three main modules:
- Scoring Module: This module takes a traffic pattern (demand matrix) and a network topology as input and outputs a score for the performance metric to be optimized. It consists of two convolutional neural networks that extract latent features from the traffic and topology matrices (a minimal sketch of such a two-branch scorer appears after this list).
- Labeling Module: It labels a demand matrix with a corresponding optimal network topology using a beam-search-style algorithm that relies on the scoring module (see the search sketch after this list). It is used to generate training data for the actual model training (the mapping module).
- Mapping Module: This module takes a traffic matrix as input and outputs a topology optimized to yield the best value of the metric provided by the data center operator (a hypothetical sketch of such a mapper also follows the list).
- Evaluation: They evaluate their model against three topology-configuration baselines: the Weight-matching solution, the Sample solution, and the Optimal solution obtained by brute force. They evaluate each method and compare it with the Xweaver framework using different traffic patterns and different scales of data center networks.
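The paper does not spell out the exact architecture of the scoring module, so here is a minimal sketch of how a two-branch CNN scorer over the demand and topology matrices might look. All layer sizes, channel counts, and names are my own illustrative guesses, not the authors' configuration.

```python
# Hypothetical sketch of the scoring module: two CNN branches, one for the
# traffic demand matrix and one for the topology matrix, whose features are
# concatenated and regressed to a single performance score. Layer sizes are
# illustrative guesses, not the configuration used in the paper.
import torch
import torch.nn as nn

class MatrixEncoder(nn.Module):
    """Small CNN that turns an NxN matrix into a flat feature vector."""
    def __init__(self, n_racks: int, channels: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.flat_dim = channels * n_racks * n_racks

    def forward(self, x):                     # x: (batch, 1, N, N)
        return self.conv(x).flatten(1)        # (batch, flat_dim)

class Scorer(nn.Module):
    """Maps a (demand matrix, topology matrix) pair to a scalar score."""
    def __init__(self, n_racks: int):
        super().__init__()
        self.traffic_enc = MatrixEncoder(n_racks)
        self.topo_enc = MatrixEncoder(n_racks)
        self.head = nn.Sequential(
            nn.Linear(2 * self.traffic_enc.flat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, demand, topology):      # each: (batch, 1, N, N)
        feats = torch.cat([self.traffic_enc(demand),
                           self.topo_enc(topology)], dim=1)
        return self.head(feats).squeeze(-1)   # (batch,) predicted score

# Usage on an 8-rack toy example:
# scorer = Scorer(n_racks=8)
# score = scorer(torch.rand(1, 1, 8, 8),
#                torch.randint(0, 2, (1, 1, 8, 8)).float())
```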
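The search used by the labeling module must respect link-conflict constraints and other details that the paper handles more carefully; the snippet below is only a minimal beam-search sketch of the labeling idea, assuming a trained `score_fn(demand, topology)` is available and treating a topology simply as a set of rack-pair links. All names and parameters are hypothetical.

```python
# Minimal beam-search sketch of the labeling idea: starting from an empty
# configurable topology, repeatedly add candidate links, keeping only the
# top `beam_width` partial topologies according to the scoring function.
# This ignores the port-capacity/conflict constraints the real labeling
# module must enforce; all names here are illustrative.
from itertools import combinations

def label_topology(demand, score_fn, n_racks, n_links, beam_width=4):
    """Return a set of rack-pair links approximately maximizing score_fn."""
    candidates = list(combinations(range(n_racks), 2))   # all possible links
    beam = [frozenset()]                                  # partial topologies
    for _ in range(n_links):
        expanded = set()
        for topo in beam:
            for link in candidates:
                if link not in topo:
                    expanded.add(topo | {link})
        # Keep the beam_width highest-scoring partial topologies.
        beam = sorted(expanded,
                      key=lambda t: score_fn(demand, t),
                      reverse=True)[:beam_width]
    return max(beam, key=lambda t: score_fn(demand, t))
```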
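For the mapping module, the following is likewise only a hedged sketch with made-up layer sizes: a CNN that reads a demand matrix and emits per-link probabilities, from which a concrete topology could then be derived (e.g., by thresholding or a matching step). It is not the paper's architecture.

```python
# Hypothetical sketch of the mapping module: a CNN that reads the demand
# matrix and outputs a symmetric per-link probability matrix. Layer sizes
# are illustrative guesses, not the paper's configuration.
import torch
import torch.nn as nn

class Mapper(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1),   # one logit per rack pair
        )

    def forward(self, demand):                        # (batch, 1, N, N)
        logits = self.net(demand)
        # Symmetrize so that links (i, j) and (j, i) get the same probability.
        logits = 0.5 * (logits + logits.transpose(-1, -2))
        return torch.sigmoid(logits)                  # per-link probabilities
```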
Things I liked about the paper:
- The way they overcome the challenge of generating training data: The paper notes that it is computationally expensive to generate the traffic-topology pairs used as training data for the neural network. Hence they use a scoring module and a labeling module, which are themselves based on neural network models, to generate a sufficient amount of training data. They claim that generating sample traffic-topology pairs with a simulator can take over a month, since it involves detailed simulation of lower-level network protocols and handling link-conflict constraints. As per my understanding, this is a genuinely scalable approach.
- Use of two convolutional neural networks: Both the input topology and the input traffic are provided as matrices, and CNNs are used to extract latent spatial features from them. Using two separate CNNs allows different latent spatial features to be extracted from the traffic and topology matrices, capturing relations between neighboring links and traffic patterns. In my opinion, this adds to the flexibility of the Xweaver framework: we can extract the latent features required to generate traffic-topology pairs (training data) for whatever score metric the operator wants to optimize.
- Use of separate modules for learning traffic patterns and topology patterns when generating samples: This makes Xweaver more scalable than using a single fully connected network.
- CRF Module: The CRF module aims at maximizing the likelihood that the topology satisfies human-specified constraints, embedding prior human knowledge. Changing the model structure makes it possible to embed different knowledge and constraints (a toy example of such a constraint appears after this list).
- Extensive evaluation: The authors perform an extensive evaluation of the models used in the framework. They evaluate the performance of the scoring module, and they compare Xweaver against other solutions such as the weight-matching algorithm, the sample-search algorithm used in the labeling module, and the optimal solution found by brute force. They also evaluate performance under different traffic patterns (application traffic, hot-spot traffic, and hybrid traffic) commonly seen in data center workloads.
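To make the constraint point concrete: the snippet below is not the paper's CRF, just a toy illustration of one kind of "human constraint" an operator might want a generated topology to respect, namely a per-rack optical port limit. All names are hypothetical.

```python
# Not the paper's CRF: just an illustration of one "human constraint" an
# operator might want a proposed topology to respect, namely that no rack
# uses more optical ports than it physically has.
from collections import Counter

def violates_port_limit(links, ports_per_rack):
    """Return True if any rack appears in more links than it has free ports."""
    degree = Counter(rack for link in links for rack in link)
    return any(count > ports_per_rack for count in degree.values())

# Example: with 2 optical ports per rack, rack 0 below is over-subscribed.
# violates_port_limit([(0, 1), (0, 2), (0, 3)], ports_per_rack=2)  -> True
```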
My concerns about some points in the paper:
- Complexity evaluation for beam search (Sample Solution method) is missing: The beam-search algorithm is used to label traffic patterns with optimal topologies, so that training samples can be generated at lower computational cost. The authors compare the topology quality of Xweaver and the Sample Solution against the Optimal Solution, but they do not compare running times, even though they claim earlier that Xweaver and the Sample Solution take less time than exploring the search space with brute-force simulation.
- The method for sample-point generation is not elaborated: They claim that generating 10,000 sample points with complete simulation may take over a month, so I would like to know how they generated the 20,000 sample points later used to evaluate Xweaver, the Sample Solution, the Optimal Solution (brute force), and the weight-matching algorithm. I am not questioning the authenticity of the method; I am just interested in how the sample data was generated.
- In continuation of the previous point: to evaluate the scoring module, they generate 25,000 topology-traffic pattern pairs and compute their scores by running a flow-level simulator, which by their own earlier claim should take a long time. Generating such a large sample to train the scoring module, whose purpose is to save time on sample generation for the mapping module, seems counterintuitive to me.
- Structure of the neural networks: Designing neural networks involves a lot of hyper-parameter tuning, and one important parameter is the convolution window size. It would have been helpful to know the window size used for the convolutions. They evaluate performance for different network sizes, which changes the size of the input matrices (traffic/topology). The window size indicates how much localized information is extracted from a matrix (see the small illustration after this list), so in my opinion it would be helpful to report it when changing the scale of the network.
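To make this concern concrete, the receptive field of a stack of convolutions tells us how large a patch of the demand matrix a single output feature "sees"; for kernel sizes k_i and strides s_i it grows as RF = 1 + sum_i (k_i - 1) * prod_{j<i} s_j. The tiny helper below is my own illustration, not something from the paper.

```python
# Illustration of the window-size concern: compute the receptive field of a
# stack of convolutions, i.e. how many rows/columns of the input demand
# matrix a single output feature covers. Purely for intuition.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, first layer first."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Two 3x3 convolutions with stride 1 see a 5x5 patch of the demand matrix:
# receptive_field([(3, 1), (3, 1)])  -> 5
# Whether that is "local" or "global" depends on the number of racks.
```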
Conclusion:
The paper was really well written and thorough. It provides everything from empirical justification for using deep learning to an extensive evaluation.

-Hrushikesh Nimkar
This is a fantastic evaluation, particularly of the positive points. Your negative points are mostly focused on the lack of parameter details, which is fine. As to point #3, we should discuss in class.