I. Introduction
With the rapid development of network applications, switching devices in traditional networks are carrying more and more control logic. This makes it difficult to adapt networks to the needs of virtualization, cloud computing, big data, and related business development for high-speed data transmission, flexible resource configuration, and rapid protocol deployment. A software-defined network (SDN) is a promising network paradigm that separates the control plane and data plane so that switches become simple data-forwarding devices and network management is handled by logically centralized servers. The SDN concept has inspired widespread research in both academia and industry. However, although the centralized control of an SDN enables innovation and convenience for network applications [1], [2], it also creates reliability [3] and scalability [4] problems.
To solve these problems, researchers proposed the use of multiple controllers in OpenFlow v1.2 [5]. For example, multiple controllers can cooperatively manage different switch regions to improve an SDN's scalability. They can also tolerate fail-stop faults by backing up the master, and resist Byzantine faults or attacks by executing a Byzantine fault tolerance (BFT) protocol, which enhances the reliability of an SDN. Thus, the deployment of multiple controllers in an SDN plays an important role in improving its scalability and reliability. Furthermore, with respect to establishment time, communication overhead, fault recovery, etc., [6]–[8] illustrate the effect of a controller's optimized schedule and deployment on the performance and reliability of an SDN. However, these studies ignored a key point: the reliability of the controllers themselves affects the control plane of an SDN.
Therefore, we propose a dynamic and self-learning schedule method for multiple controllers in an SDN in order to further improve the control plane's reliability. This method is inspired by the old Chinese saying, “They who learn the history, know what thrives and what is calamitous in the future.” The system continually and statistically learns the historical behavior of the controllers' reliability, and treats this as a metric to combine and schedule them. This is beneficial in selecting controller groups with higher reliability.
Based on the satisfaction of constraints such as communication delay and load capacity, we attempt to combine and schedule controllers so as to 1) avoid scheduling controllers with poor reliability, that is, those with higher fault rates in the historical statistics; and 2) avoid combining controllers with isotype faults (see Section II for details) into a group, which reduces the fault-induced handover frequency between a Master controller and a Slave controller (or Equal controller), and also helps the BFT protocol defend against faults.
This method is appropriate for SDN networks in data centers or cloud environments, which have many controllers and strict reliability requirements. In addition, the method is original and easy to deploy, and it optimizes the combination of multiple controllers. The main contributions of this paper are as follows:

• First, we summarize the combination and schedule problems of multiple controllers in an SDN, and propose a concept of isotype faults between controllers.

• Second, we propose a dynamic and self-learning schedule method to combine and schedule multiple controllers. In order to further improve the reliability of the control plane, this method continually and statistically learns the historical behavior of the controllers' reliability, and treats it as a metric to schedule controllers.

• Finally, we propose an efficient heuristic algorithm to solve the optimization problem of combining and scheduling controllers. This algorithm can compute an approximate optimal solution to this NP-hard problem within an acceptable time.
The rest of this paper is organized as follows. Section II reviews the background and related work. Our system architecture and detailed design are introduced in Section III. Section IV presents an efficient algorithm to solve its optimization problem. Section V introduces an experimental environment and examines the simulation results. Finally, Section VI concludes this paper and suggests extensions to this work.
II. Background and Related Work
1. Multiple Controllers’ Combination and Schedule
Multiple controllers play an important role in improving the scalability and reliability of an SDN. Bari and others researched the dynamic assignment of controllers, which makes them more efficient in managing large-scale sets of switches [6]. As shown in Fig. 1, owing to the limited load capacity of a single controller, switches in the data plane are divided into several regions. Each switch region is managed by a group of controllers. Controller groups collaborate with each other, achieving systematic management of all switches in the data plane. Meanwhile, controllers in the same group can form redundant relations in the roles of Master, Slave, or Equal [9]. This can improve the reliability of the control plane.
Li and others studied the use of multiple controllers to resist Byzantine faults in [7], which addresses the security problem of SDNs. According to the BFT protocol [10], at least 3f + 1 controllers are grouped into a quorum in the control plane at any time. This can tolerate Byzantine faults in f controllers. When the number of faulty controllers exceeds the upper bound f of the controller quorum, as at time t_{1} in Fig. 2, the quorum view is changed, and a new controller quorum is selected to manage the SDN network.
After analyzing the application scenario above, we conclude that there exists a combination and schedule problem for multiple controllers in an SDN. To optimize the controllers’ schedule and deployment, [6] and [7] researched how to select and assign them under constraints such as load capacity and communication delay. However, in practical application, there are usually many controller groups or combinations that satisfy the constraints.
Therefore, based on the above methods, we consider other metrics to further optimize the controllers’ schedule. For example, we can treat reliability as a metric to schedule controllers by statistically learning their reliability information from historical behaviors. This will improve the reliability of the SDN.
2. Controllers’ Reliability
Shalimov and others examined the reliability and security of popular open-source SDN controllers (NOX, POX, Beacon, etc.) in [11]. Their evaluation shows that modern SDN controllers are not ready to be used in production and have to be improved. In [11], the researchers evaluated reliability by measuring the number of faults during long-term testing under a given workload profile, and examined security by studying how the controllers handle malformed OpenFlow messages. In this paper, we define the general notion of reliability as the controllers' ability to work correctly in any scenario, encompassing both the reliability and the security defined in [11].
Furthermore, with respect to the research in [11] and [12], we make two important points about reliability. First, there are differences in reliability between different types of controller. For example, most controllers can work correctly under a given workload profile, but MuL and Maestro start to drop PacketIn messages after several minutes of work. In addition, because the computers that run the controller software also differ in reliability, the differences in reliability between controllers (including the computers that they run on) are even more significant.
Second, the OpenFlow messages that lead to controller breakdowns differ between controllers. For example, when receiving OpenFlow messages with incorrect lengths, NOX crashes and POX closes its connection with the switches, whereas Ryu is not affected by the malformed messages. Therefore, we propose the concept of an isotype fault, defined as a fault of multiple controllers that is caused by the same trigger. We should avoid scheduling controllers with higher isotype fault rates into the same group, because isotype faults will cause multiple controllers to crash simultaneously or in quick succession, which significantly decreases the reliability of the control plane. The above points reveal the essence and rules of the reliability of multiple controllers, and are important in optimizing their combination and schedule. Table 1 compares our work with related works.
III. System Description
In this work, we propose a dynamic and self-learning schedule method for multiple controllers in an SDN. First, we design the general architecture by adding a management framework to the control plane of the SDN. As shown in Fig. 3, controllers can be divided into many groups that manage different switch regions. Under the condition of satisfying some constraints (for example, communication delay and load capacity), controllers can be combined arbitrarily, and one controller can manage one or more switch regions. Other application scenarios of multiple controllers' combination and scheduling can be obtained by simplifying the general architecture. For example, we obtain the network model in [7] by dividing each switch into a single region.
1. Management Framework
The management framework runs on the management plane of the SDN and collects information on the working status of the control plane. It periodically learns and analyzes the historical reliability behavior in order to optimize the combination and schedule for the next working cycle. Our management framework contains four modules, as depicted in Fig. 3 and explained below:

• Monitor and Statistics Module monitors the controllers’ working status and collects their fault information, including the fault controller’s ID, the time when it crashes, the ID of the set of controllers that encounters isotype faults, and the number of isotype faults.

• Reliability Evaluation Module quantitatively evaluates each controller group’s reliability based on the above fault information by designing a reasonable mathematical model.

• Combination and Schedule Module schedules controller groups to each switch region as optimally as possible based on the reliability evaluation (that is, it makes the total reliability value of the control plane as high as possible) while satisfying the given constraints.

• Reassignment Module reestablishes the correspondence between controllers and switches according to the above results, achieving their dynamic reassignment.
One important part of implementing the prototype of this framework is monitoring the controllers' isotype fault information, which requires a deep reading of the controller status and a comprehensive analysis. To simplify the design, we adopt an approximation. If a + 1 controllers in the same group crash within a short window of a × Δt_{f} (where Δt_{f} is the average time a controller takes to process an OpenFlow message [13]), we assume that an isotype fault has occurred in these controllers.
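The approximate detection rule above can be sketched in a few lines of Python. This is a hypothetical helper of ours, not the authors' implementation; the function name and the timestamp representation are illustrative:

```python
def detect_isotype_fault(crash_times, a, dt_f):
    """Report an isotype fault when a + 1 controllers of the same group
    crash within a window of a * dt_f seconds (dt_f is the average time
    a controller needs to process one OpenFlow message)."""
    times = sorted(crash_times)
    window = a * dt_f
    # slide a window of a + 1 consecutive crashes over the sorted timestamps
    for i in range(len(times) - a):
        if times[i + a] - times[i] <= window:
            return True
    return False
```

For instance, with a = 2 and Δt_{f} = 10 ms, three crashes within 20 ms of one another would be flagged as a single isotype fault.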
2. Evaluation of Controllers’ Reliability
Reliability evaluation is the key part of the management framework and determines how much the reliability of the SDN can be improved. How to design this part is an open problem: designers can propose different evaluation algorithms based on their reliability focus. We describe the evaluation algorithm used in this paper in detail.
For convenience, we first establish a formal description of the system. In an SDN, there are n (n ≥ 2) controllers, which are numbered from C_{0} to C_{n − 1} and will be divided into m (m ≥ 1) groups. Any controller group (set) S_{i} has a bijective relation with a positive integer Z:

Z(S_{i}) = Σ_{C_{k} ∈ S_{i}} 2^{k}.

This mapping simplifies the representation of controller groups significantly. In addition, η(Z) denotes the number of elements in set S_{i}. For example, when controller set S_{i} is {C_{1}, C_{3}, C_{4}}, the corresponding integer Z is 26 (2^{1} + 2^{3} + 2^{4}), and η(Z) is 3. In addition, if S_{i} ⊂ S_{j}, we denote this as Z(S_{i}) ∝ Z(S_{j}).
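This mapping and the subset relation map directly onto bitmask operations. The sketch below is ours (the function names are not from the paper) and assumes Z(S_{i}) is the sum of 2^{k} over the controllers C_{k} in the set, which matches the {C_{1}, C_{3}, C_{4}} → 26 example:

```python
def z_of(controllers):
    """Z(S_i): sum of 2^k over every controller C_k in the set."""
    z = 0
    for k in controllers:
        z |= 1 << k
    return z

def eta(z):
    """η(Z): the number of controllers in the set encoded by z."""
    return bin(z).count("1")

def is_proper_subset(z_i, z_j):
    """S_i ⊂ S_j, written Z(S_i) ∝ Z(S_j): every bit of z_i is set in z_j."""
    return (z_i & z_j) == z_i and z_i != z_j

# {C1, C3, C4} maps to 2^1 + 2^3 + 2^4 = 26, with η(26) = 3
assert z_of({1, 3, 4}) == 26 and eta(26) == 3
assert is_proper_subset(z_of({1, 3}), 26)
```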
The Monitor and Statistics Module needs to record, for each controller set S_{i}, the number of faults F(Z(S_{i})) and the amount of work U(Z(S_{i})) that the set has performed while scheduled. The initial values of F(Z(S_{i})) and U(Z(S_{i})) are set to 0. Note that if a controller set S_{i} is scheduled to work (or a fault occurs in it), the module should record information for all of its subsets. For example, in Fig. 2, the group that contains three controllers is denoted as {C_{1}, C_{3}, C_{4}}. In a schedule period ΔT, we assume that the total number of OpenFlow messages that the controller group processes is y, and that the numbers of faults monitored in controller groups {C_{1}}, {C_{3}}, {C_{4}}, {C_{1}, C_{3}}, {C_{1}, C_{4}}, {C_{3}, C_{4}}, and {C_{1}, C_{3}, C_{4}} are x_{1}, x_{2}, x_{3}, x_{4}, x_{5}, x_{6}, and x_{7}, respectively. Now, we update U(Z) and F(Z) as follows:
U(Z(S)) ← U(Z(S)) + y for every nonempty subset S of {C_{1}, C_{3}, C_{4}},

and

F(Z({C_{1}})) ← F(Z({C_{1}})) + x_{1}, … , F(Z({C_{1}, C_{3}, C_{4}})) ← F(Z({C_{1}, C_{3}, C_{4}})) + x_{7}.
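This per-subset bookkeeping can be done with the classic bitmask subset-enumeration trick. The sketch below is our interpretation of the update: U accumulates the y processed messages and F the observed fault counts for every nonempty subset of the working group (dictionary names are illustrative):

```python
def nonempty_subsets(z):
    """Yield every nonempty subset of the set encoded by bitmask z,
    including z itself, via the sub = (sub - 1) & z trick."""
    sub = z
    while sub:
        yield sub
        sub = (sub - 1) & z

def record_period(U, F, z_group, y, fault_counts):
    """After a schedule period in which group z_group processed y OpenFlow
    messages, update U and F (dicts keyed by subset bitmask) for every
    subset; fault_counts maps a subset bitmask to its observed faults."""
    for sub in nonempty_subsets(z_group):
        U[sub] = U.get(sub, 0) + y
        F[sub] = F.get(sub, 0) + fault_counts.get(sub, 0)
```

For the group {C_{1}, C_{3}, C_{4}} (Z = 26), `record_period` touches exactly the 7 nonempty subsets and adds y to each U entry.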
Based on the above statistics, we make a quantitative description Q(Z) of the reliability of controller group S_{i} itself. This description represents the group's ability to control networks correctly and is defined as follows:

Q(Z) = η(Z)^{2} × F(Z)/U(Z). (1)
Here, F(Z)/U(Z) denotes the average fault rate over all of the group's work. Equation (1) includes the factor η(Z)^{2} because the more controllers involved in an isotype fault, the greater the effect on the total reliability. At system initialization, Q(Z) is set to 0 (the highest reliability), which allows the system to explore controller groups that have not yet been scheduled.
Then, we define the total reliability of controller group S_{i} as follows:
R(Z) = (1/(2^{η(Z)} − 1)) Σ_{Z′ ∝ Z or Z′ = Z} Q(Z′). (2)
R(Z) aggregates the reliability of all of the subsets of group S_{i}, whose number is 2^{η(Z)} − 1. The total reliability is measured by the average reliability of all of these subsets in (2), which coincides with the intuitive experience that more controllers lead to higher reliability.
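Under one plausible reading of (1) and (2) — Q(Z) = η(Z)² · F(Z)/U(Z) as a fault measure with 0 being best, and R(Z) as the average of Q over all 2^{η(Z)} − 1 nonempty subsets — the evaluation can be sketched as follows (helper names are ours):

```python
def eta(z):
    """η(Z): number of controllers in the set encoded by z."""
    return bin(z).count("1")

def nonempty_subsets(z):
    """All nonempty subsets of bitmask z, including z itself."""
    sub = z
    while sub:
        yield sub
        sub = (sub - 1) & z

def Q(z, U, F):
    """Eq. (1): fault rate F/U weighted by eta(z)^2; 0 (best) if never run."""
    if U.get(z, 0) == 0:
        return 0.0
    return eta(z) ** 2 * F.get(z, 0) / U[z]

def R(z, U, F):
    """Eq. (2): average of Q over all 2^eta(z) - 1 nonempty subsets."""
    subs = list(nonempty_subsets(z))
    return sum(Q(s, U, F) for s in subs) / len(subs)
```

A never-scheduled group therefore evaluates to R = 0, which is what lets the system keep exploring unscheduled combinations.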
Analyzing the procedure for calculating R(Z), we find that its time complexity is O(2^{n}), which requires significant computational capacity. However, the number of controllers in a typical SDN is not very large (usually within 30), so the algorithm's computation cost is acceptable.
3. Multiple Controllers’ Combination and Schedule
Under the given constraints, we select m groups based on the above reliability evaluation to optimize the reliability of the control plane. First, the reliability of the control plane in an SDN is defined as follows:
Y = Σ_{i = 1}^{m} R(Z(S_{i}′)). (3)
Here, ${S}_{i}^{\prime}$ denotes the controller groups that are scheduled to work at the current time, and the result Y is the sum of their reliability in (3).
To simplify the description and analysis, we consider only the load capacity constraint in the remainder of this paper; other constraints can be handled in a similar way. For each controller C_{i}, we assume its maximal load capacity is l_{i}. If a controller C_{i} is scheduled to manage one or more switch regions, its load g(C_{i}) is the total number of switches that it manages. Thus, the problem of multiple controllers' combination and schedule (MCCS) is formally described as follows:

optimize Y = Σ_{i = 1}^{m} R(Z(S_{i}′))
s.t. g(C_{i}) ≤ l_{i}, for each controller C_{i}. (4)
Obviously, the MCCS problem is similar to the bin-packing problem, which is NP-hard [14]. Because one controller is allowed to work in different groups, the problem is even more difficult to solve. For example, if the number of controllers in each group assigned to a switch region is p, a search over a space of O((C_{n}^{p})^{m}) is needed to obtain the optimal result for (4). Although this problem has exponential computational complexity similar to the reliability evaluation described above, solving the MCCS problem requires far more computation. For example, if there are 20 controllers and 6 switch regions in an SDN, the computation amount for the reliability evaluation is on the order of 10^{6}, but that of the MCCS problem is on the order of 10^{18}, which is beyond what we can accept. Therefore, we need a heuristic algorithm to reduce the computation amount significantly.
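These magnitudes are easy to check: with n = 20 controllers, the reliability evaluation visits every nonempty subset (2^{20} − 1 ≈ 10^{6}), while an exhaustive MCCS search over m = 6 regions with, say, p = 3 controllers per group must consider C(20, 3)^{6} ≈ 10^{18} assignments (p = 3 is an assumed value for illustration):

```python
from math import comb

n, m, p = 20, 6, 3           # controllers, switch regions, controllers/group

eval_cost = 2 ** n - 1       # nonempty subsets visited by the evaluation
mccs_cost = comb(n, p) ** m  # brute-force assignments of a p-subset per region

print(f"evaluation: {eval_cost:.1e}")   # about 1.0e+06
print(f"MCCS:       {mccs_cost:.1e}")   # about 2.2e+18
```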
IV. Proposed Heuristic
In this section, we propose a greedy algorithm to solve the MCCS problem. Its basic idea is to iteratively schedule controller groups, sorted according to their reliability in descending order, to switch regions. The pseudocode of the proposed algorithm is shown in Algorithm 1. Because different switch regions may need different numbers of controllers, the input parameter p_{i} is included.
In the algorithm, we first divide the sets S into different set groups Φ_{i} according to the number of controllers in each set, and sort the sets S_{j} in each group Φ_{i} according to their reliability in descending order (lines 2 to 3). This facilitates the subsequent operation of frequently finding the set with the highest reliability, or the next one, in Φ_{i}. Second, for each switch region i, we select the set S_{k} with the highest reliability in the corresponding set group Φ_{j}, and add S_{k} to the approximate optimal result Γ (lines 4 to 6). However, the result of this greedy step usually does not satisfy the constraints, so we need to check Γ and find any controller C_{k} in Γ that exceeds its load limit (lines 7 to 8).
Third, we find the controller set S_{h} with the minimum reliability in Γ that includes the controller C_{k} (line 9). Then, we replace the set S_{h} with the set S_{h}′ that has the next highest reliability in the group Φ_{q} to which S_{h} corresponds (lines 10 to 13). Finally, we iteratively perform the above procedures until approximate optimal controller sets that satisfy the constraints are found.

Algorithm 1. Greedy algorithm for the MCCS problem.

Input: Controller sets, S
       Controller sets' reliability, R
       Capacity constraints of controllers, l
       Number of total groups in data plane, m
       Number of controllers in each group, p
Output: Approximate optimal sets, Γ = {S_{1}, S_{2}, … , S_{m}}

1:  Γ = ∅
2:  for i = 1 to Max{p_{i}} do
3:      Φ_{i} ← sort the sets S_{j} with η(Z(S_{j})) = i according to R
4:  for i = 1 to m do
5:      S_{i} ← Max R{Φ_{j}}, where j = p_{i}
6:      Γ = Γ + S_{i}
7:  while Γ does not satisfy l do
8:      search for a controller C_{k} that does not satisfy l_{k}
9:      S_{h} ← Min R{Γ} such that C_{k} ∈ S_{h}
10:     Γ = Γ − S_{h}
11:     Φ_{q} = Φ_{q} − S_{h}, where q = p_{h}
12:     S_{h} ← Max R{Φ_{q}}
13:     Γ = Γ + S_{h}
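A compact Python sketch of the greedy procedure may look like the following. This is our interpretation under stated assumptions: higher R is treated as more reliable, the capacity check counts every switch of a region against each controller in its group, and all names are illustrative:

```python
from itertools import combinations

def greedy_mccs(n, R, load, p, cap):
    """Greedy MCCS sketch. n: number of controllers; R(group): reliability
    score of a controller group (higher = more reliable); load[i]: number of
    switches in region i; p[i]: controllers required by region i; cap[c]:
    load limit of controller c. Returns one group per region, or None."""
    # lines 2-3: for each needed group size, sort candidates by reliability
    phi = {s: sorted(combinations(range(n), s), key=R, reverse=True)
           for s in set(p)}
    # lines 4-6: greedily pick the most reliable group for every region
    gamma = [phi[p[i]][0] for i in range(len(p))]

    def overloaded():
        """Return a controller exceeding its limit, or None (lines 7-8)."""
        used = {}
        for i, g in enumerate(gamma):
            for c in g:
                used[c] = used.get(c, 0) + load[i]
        for c, u in used.items():
            if u > cap[c]:
                return c
        return None

    # lines 7-13: repair violations by swapping the least reliable selected
    # group containing the overloaded controller for the next-best candidate
    while (c := overloaded()) is not None:
        h = min((i for i, g in enumerate(gamma) if c in g),
                key=lambda i: R(gamma[i]))           # line 9
        cand = phi[p[h]]
        nxt = cand.index(gamma[h]) + 1               # lines 10-12
        if nxt >= len(cand):
            return None                              # no feasible replacement
        gamma[h] = cand[nxt]                         # line 13
    return gamma
```

For example, with three controllers, one region needing two of them, and a tight limit on controller 2, the repair loop walks down the sorted candidate list until the capacity constraint holds.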
Analyzing the algorithm's process, we can show that it searches a space of O(Σ_{i} C_{n}^{p_{i}}) to obtain the approximate optimal result. For the example described in Section III.3, the computation amount for solving the MCCS problem with this algorithm is much lower, decreasing from the order of 10^{18} to the order of 10^{3}, at the price of some loss in solution accuracy. In addition, the algorithm generalizes well because other constraints are easy to add. For example, if an application scenario includes more constraints (such as communication delay or fault recovery time), we only need to extend the pseudocode in lines 7 to 8 to check whether the selected controller sets Γ satisfy these constraints.
V. Evaluation
1. Simulation Setup
To evaluate the effect of our proposed method (DSL), we tried different simulation methodologies to find one suitable for our purpose. First, we attempted to use Mininet [15] to simulate the data plane of the SDN, and FlowVisor [16] to divide the switches into groups. However, Mininet cannot generate the malformed OpenFlow messages that lead to controller faults, which makes it difficult to achieve the expected experimental environment. Therefore, we designed an SDN network simulator, based on C++, that can reproduce multiple controller faults. Network parameters in the simulator can be freely configured, including the number of controllers, fault rate, load capacity, and number of switches. After each controller group has worked for a schedule period ΔT, during which the controllers in the group process 10,000 OpenFlow messages, the management framework evaluates the reliability of the controllers and schedules them again.
The simulation process is driven by a discrete-event approach [17] in which the state transitions of controllers processing messages are the events. As shown in Fig. 4, when a controller group receives an OpenFlow message of_{1}, it is first processed by the main controller C_{i1} in the group. We determine whether a fault occurs in the main controller according to its fault probability P(C_{i1}). If a fault occurs, the next controller C_{i2} becomes the main controller to process the message of_{1}. We then determine whether a fault occurs in controller C_{i2}. Note that we determine whether the message of_{1} will cause a fault in controller C_{i2} according to the conditional probability

P(C_{i1}C_{i2})/P(C_{i1}),

instead of the fault probability P(C_{i2}), because the message of_{1} has already caused a fault in controller C_{i1}. Similarly, if a fault also occurs in controller C_{i2}, we follow the above procedure for controller C_{i3} and determine its state according to the conditional probability P(C_{i1}C_{i2}C_{i3})/P(C_{i1}C_{i2}). If controller C_{i3} can correctly process the message of_{1}, it becomes the main controller and processes the next OpenFlow message of_{2}.
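The failover walk with conditional probabilities can be sketched as one discrete-event step. This is a simplified model of ours, not the simulator's code: p_joint is an illustrative input holding the joint probabilities, with p_joint[0] = 1 and p_joint[k] = P(C_{i1} … C_{ik}):

```python
import random

def process_message(chain, p_joint, rng):
    """Process one OpenFlow message along the failover chain. chain holds
    controller ids in failover order; p_joint[k] is the probability that the
    first k controllers all fail on this message (p_joint[0] = 1). The k-th
    controller fails with conditional probability p_joint[k + 1] / p_joint[k].
    Returns the controller that handled the message, or None if all failed."""
    for k, ctrl in enumerate(chain):
        p_cond = p_joint[k + 1] / p_joint[k]
        if rng.random() >= p_cond:   # this controller survives the message
            return ctrl
    return None

# degenerate cases are deterministic: no faults vs. a certain isotype fault
assert process_message(["C1", "C2", "C3"], [1, 0, 0, 0], random.Random(1)) == "C1"
assert process_message(["C1", "C2", "C3"], [1, 1, 1, 1], random.Random(1)) is None
```

The controller that finally handles of_{1} then stays the main controller for the next message, matching the procedure described above.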
Next, we introduce the test scenario used in the experiment. There are 20 controllers in the SDN network, and each controller's load capacity is randomly distributed within the range [2,000, 7,000]. As listed in Table 2, we configure the fault probability of each controller group according to the number of controllers it contains. For example, controller group {C_{i1}, C_{i2}, C_{i3}} contains three controllers, so its fault probability is in the range [0, 0.017]. Switches in the data plane are divided into four regions. The number of switches and required controllers in each region are listed in Table 3.
Table 2.

Number of faulty controllers | 1         | 2          | 3          | 4
Range of fault probability   | [0, 0.03] | [0, 0.024] | [0, 0.017] | [0, 0.01]
2. Results
Using the simulator and test scenario above, we compare the random schedule method (RDM) and the dynamic self-learning schedule method (DSL). When the simulator adopts RDM to manage controllers, it randomly selects controller groups that satisfy the constraints, but does not take their reliability into account. In the simulation, we also record the number of controller faults during each schedule period ΔT, including both single-controller faults and isotype faults of multiple controllers.
Figure 5 shows the experimental results, which contain fault information for the controllers in the first 200 schedule periods. Similar behavior is observed for RDM and DSL in Figs. 5(a) to 5(d). With regard to DSL, we can see that the number of controller faults in the first 18 schedule periods is much higher than in later periods. This is because, at the beginning, the management framework has not yet collected sufficient information about the controllers' reliability and needs to explore controller groups that have not been scheduled. After it has explored all combinations of controllers (that is, after time 18ΔT), the total reliability of the control plane tends to converge, and the number of faults decreases significantly.
We compared the effects of DSL and RDM in Fig. 5. For both single-controller faults and isotype faults of multiple controllers, DSL clearly performs much better than RDM after it has reached convergence: the number of faults with DSL is lower than that with RDM in each schedule period. Therefore, DSL improves the reliability of the SDN by treating reliability, statistically learned from the controllers' historical behaviors, as a scheduling metric.
In addition, we verified the effect of the heuristic algorithm (GA) proposed in Section IV. Because the scale of the SDN network in the test scenario is small, we can exhaustively search for the optimal solution to its MCCS problem within an acceptable time. During the period from 90ΔT to 100ΔT, the simulator not only calls GA to solve the MCCS problem but also calls an exhaustive search algorithm (ES) for comparison. We evaluated each algorithm's effectiveness by the total reliability Y in (3) that it achieves, and its computational overhead by the number of controller groups n that it searched. As shown in Fig. 6, the approximate optimal result found by GA is close to the optimal result found by ES, but the computational overhead of GA is much lower. Therefore, the proposed optimization algorithm GA is more practical for application.
Like other optimization problems, designing an appropriate management network for an SDN generally requires trade-offs between reliability and performance. We evaluated DSL's effects on network performance, including the latency of OpenFlow requests and responses and the proportion of control traffic. We used the test parameters in [11] to configure the simulator, which is similar to NS3 [18], and adopted the OS3E network as the topology of the data plane. The average update rate of each node's network events is random in the range [10, 100]. The simulator runs for 30 min at a time and schedules the controllers every 20 s (real system time). Compared with the traditional method (without DSL), the effect of the DSL method on system performance was measured by monitoring events and their occurrence times (virtual simulation times).
The delay of an OpenFlow request is the time from the creation of a packet by a switch to its capture by a controller. As shown in Fig. 7, compared with the traditional method, the latency of OpenFlow requests increased by 0.17 ms on average when the DSL method was deployed in the SDN. This increase in request latency is small and is caused by message blocking when the master controller changes. Similarly, the delay of an OpenFlow response is the time from the sending of a request by a switch to the reception of the response by the switch.
The latency of OpenFlow responses increased by 9.2 ms on average, which includes the time for the management framework to schedule controllers and for the new controller sets to discover links. In addition, we counted the numbers of control packets and data packets in the simulator. The ratio of control packets to data packets increased by approximately 3.3%, which is related to the frequency of controller scheduling by the management framework. We consider this performance overhead acceptable, and thus the DSL method is suitable for networks with a high demand for reliability (see Fig. 7).
VI. Conclusion
In this paper, we proposed a dynamic and self-learning schedule method for multiple controllers to improve the reliability of the control plane in an SDN. By statistically learning reliability information from historical behaviors, this method treats reliability as a metric to combine and schedule controllers. Experimental results show that this method can improve the total reliability of an SDN.
In our follow-on research, we intend to extend this work in two directions. First, we will refine the implementation of the prototype system to make it more practical, and open its source code. Second, we want to optimize the reliability evaluation algorithm to adapt it to different fault scenarios.