DSL: Dynamic and Self-Learning Schedule Method of Multiple Controllers in SDN
Junfei Li, Jiangxing Wu, Yuxiang Hu, and Kan Li
vol. 39, no. 3, June 2017, pp. 364–372.
http://dx.doi.org/10.4218/etrij.17.0116.0460
Keywords: SDN, Multiple controllers, Reliability, Combine and schedule, Self-learning.

This is an Open Access article distributed under the term of Korea Open Government License (KOGL) Type 4: Source Indication + Commercial Use Prohibition + Change Prohibition (http://www.kogl.or.kr/news/dataFileDown.do?dataIdx=71&dataFileIdx=2).
Manuscript received Jul. 05, 2016; revised Dec. 19, 2016; accepted Jan. 12, 2017.
  • Abstract

      For the reliability of controllers in a software defined network (SDN), a dynamic and self-learning schedule method (DSL) is proposed. The method is original and easy to deploy, and optimizes the combination of multiple controllers. First, we summarize the combination and schedule problem of multiple controllers in an SDN and analyze its reliability. Then, we introduce the architecture of the schedule method, the evaluation of multi-controller reliability, the DSL method itself, and its optimized solution. By continually and statistically learning information about controller reliability, the method treats reliability as a metric for scheduling controllers. Finally, we compare and test the method in a given test scenario based on an SDN network simulator. The experimental results show that DSL can significantly improve the total reliability of an SDN compared with a random schedule, and that the proposed optimization algorithm is more efficient than an exhaustive search.
  • Authors

      images/2017/v39n3/ETRI_J001_2017_v39n3_364_a001.jpg

      Corresponding Author winfelin@gmail.com

      Junfei Li received his BS and MS degrees from the department of computer science and technology at the National University of Defense Technology, Changsha, China, in 2012 and 2015, respectively. He is currently pursuing his PhD in the department of computer science and technology at the National Digital Switching System Engineering & Technological R&D Center, Zhengzhou, China. His research interests include software-defined networks and network security.

      images/2017/v39n3/ETRI_J001_2017_v39n3_364_a002.jpg

      chxachxa@126.com

      Jiangxing Wu received his BS degree from the department of computer science and technology at the Engineering and Technology College, Zhengzhou, China, in 1978. He was elected an academician of the Chinese Academy of Engineering in 2003. He is currently a professor at the China National Digital Switching System & Technology Research Center, Zhengzhou, China. His main research interests include network security and future network architecture.

      images/2017/v39n3/ETRI_J001_2017_v39n3_364_a003.jpg

      ndscwjx@126.com

      Yuxiang Hu received his PhD in information and communication engineering from the National Digital Switching System & Technology Research Center, Zhengzhou, China, in 2011. He is currently an associate professor at the China National Digital Switching System & Technology Research Center. His research interests include future network communication and future network architecture.

      images/2017/v39n3/ETRI_J001_2017_v39n3_364_a004.jpg

      studying2012@126.com

      Kan Li received his MS degree in information and communication engineering from the National Digital Switching System & Technology Research Center, Zhengzhou, China, in 2009. He is currently an assistant in the department of information science. His research interests include information security and future networks.

  • Full Text
    • I. Introduction

      With the rapid development of network applications, switching devices in traditional networks are carrying more and more control logic. This makes it difficult to adapt networks for virtualization, cloud computing, big data, and the needs of related business development for high-speed data transmission, the flexible configuration of resources, and rapid deployment protocols. A software defined network (SDN) is a promising network paradigm that separates the control plane and data plane in a network so that switches become simple data-forwarding devices and network management is controlled by logically centralized servers. The SDN concept has inspired widespread research in both academia and industry. However, although the centralized control of an SDN results in innovation and convenience for network applications [1], [2], it also creates reliability [3] and scalability [4] problems.

      To solve these problems, support for multiple controllers was introduced in OpenFlow v1.2 [5]. For example, multiple controllers can manage different switch regions cooperatively to improve an SDN's scalability. They can also deal with fail-stop faults by backing up the master, and resist Byzantine faults or attacks by executing a Byzantine fault tolerance (BFT) protocol, which enhances the reliability of an SDN. Thus, the deployment of multiple controllers plays an important role in improving both the scalability and the reliability of an SDN. Furthermore, with respect to metrics such as establishment time, communication overhead, and fault recovery, [6]–[8] illustrate the effect of an optimized controller schedule and deployment on the performance and reliability of an SDN. However, these studies ignored a key point: the reliability of the controllers themselves also affects the control plane of the SDN.

      Therefore, we propose a dynamic and self-learning schedule method of multiple controllers in an SDN in order to further improve the control plane’s reliability. This method is inspired by the old Chinese saying, “They who learn the history, know what thrives and what is calamitous in the future.” The system continually and statistically learns the historical behavior of a controller’s reliability, and treats this as a metric to combine and schedule them. This is beneficial in selecting controller groups with higher reliability.

      Based on the satisfaction of constraints such as communication delay and load capacity, we attempt to combine and schedule controllers to 1) avoid scheduling controllers with poor reliability, which means they have higher fault rates in historical statistics; and 2) avoid combining controllers with isotype faults (see more details in Section II) into a group, which reduces the hand-over frequency between a Master controller and Slave controller (or Equal controller) owing to faults, and also benefits the BFT defending the effects against faults.

      This method is appropriate for SDN networks in data centers or cloud environments, which have many controllers and strict reliability requirements. In addition, the method is original, easy to deploy, and optimizes the combination of multiple controllers. The main contributions of this paper are as follows:

      •  First, we summarize the combination and schedule problems of multiple controllers in an SDN, and propose the concept of isotype faults between controllers.

      •  Second, we research a dynamic and self-learning schedule method to combine and schedule multiple controllers. In order to further improve the reliability of the control plane, this method continually and statistically learns the historical behavior of the controllers’ reliability, and treats it as a metric to schedule controllers.

      •  Finally, we propose an efficient heuristic algorithm to solve the optimization problem of combining and scheduling controllers. This algorithm can calculate an approximate optimal solution of the NP-hard problem within an acceptable time.

      The rest of this paper is organized as follows. Section II reviews the background and related work. Our system architecture and detailed design are introduced in Section III. Section IV presents an efficient algorithm to solve its optimization problem. Section V introduces an experimental environment and examines the simulation results. Finally, Section VI concludes this paper and suggests extensions to this work.

      II. Background and Related Work

      1. Multiple Controllers’ Combination and Schedule

      Multiple controllers play an important role in improving the scalability and reliability of an SDN. In [6], Bari and others researched the dynamic assignment of controllers, which makes it more efficient to manage large numbers of switches. As shown in Fig. 1, owing to the limited load capacity of a single controller, the switches in the data plane are divided into several regions. Each switch region is managed by a group of controllers. The controller groups collaborate with each other to achieve systematic management of all switches in the data plane. Meanwhile, controllers in the same group can form redundant relations in the roles of Master, Slave, or Equal [9], which improves the reliability of the control plane.

      Fig. 1.

      Multiple controllers to solve scalability problem of SDN.

      images/2017/v39n3/ETRI_J001_2017_v39n3_364_f001.jpg

      Li and others studied the use of multiple controllers to resist Byzantine faults in [7], which addresses the security problem of SDNs. According to the BFT protocol [10], at least 3f + 1 controllers are grouped into a quorum in the control plane at any time, which can tolerate Byzantine faults in f controllers. When the number of faulty controllers exceeds the upper bound f of the controller quorum, as at time t1 in Fig. 2, the quorum view is changed, and a new controller quorum is selected to manage the SDN network.

      Fig. 2.

      Multiple controllers to resist Byzantine fault.

      images/2017/v39n3/ETRI_J001_2017_v39n3_364_f002.jpg

      After analyzing the application scenario above, we conclude that there exists a combination and schedule problem for multiple controllers in an SDN. To optimize the controllers’ schedule and deployment, [6] and [7] researched how to select and assign them under constraints such as load capacity and communication delay. However, in practical application, there are usually many controller groups or combinations that satisfy the constraints.

      Therefore, based on the above methods, we consider other metrics to further optimize the controllers’ schedule. For example, we can treat reliability as a metric to schedule controllers by statistically learning their reliability information from historical behaviors. This will improve the reliability of the SDN.

      2. Controllers’ Reliability

      Shalimov and others examined the reliability and security of popular open-source SDN controllers (NOX, POX, Beacon, etc.) in [11]. Their evaluation shows that modern SDN controllers are not ready to be used in production and have to be improved. In [11], the researchers evaluated the reliability by measuring the number of faults during long-term testing under a given workload profile, and examined security by studying how the controllers manipulate malformed OpenFlow messages. In this paper, we define the general notion of reliability as the controllers’ ability to work correctly in any scenario, including the reliability and the security defined in [11].

      Furthermore, with respect to the research in [11] and [12], we make two important points about reliability. First, there are differences in reliability between different types of controller. For example, most controllers can work correctly under a given workload profile, but MuL and Maestro start to drop PacketIns after several minutes of work. In addition, because the computers that run the controller software also differ in reliability, the differences in reliability between controllers (including the machines they run on) are even more significant.

      Second, the OpenFlow messages that lead to breakdowns differ from controller to controller. For example, when receiving OpenFlow messages with incorrect lengths, NOX crashes and POX closes its connection with the switches, whereas Ryu is not affected by the malformed messages. Therefore, we propose the concept of an isotype fault, which is defined as a fault of multiple controllers that is triggered by the same cause. We should avoid scheduling controllers with higher isotype fault rates into the same group, because isotype faults cause multiple controllers to crash simultaneously or successively, which significantly decreases the reliability of the control plane. These points reveal the essence and rules of the reliability of multiple controllers, and are important in optimizing their combination and schedule. Table 1 compares our work with related works.

      Table 1.

      Comparison of different methods to improve reliability in SDN.

      Methods          | Implementation | Prerequisite          | Protection scope
      DSL              | Simple         | Supported management  | Control plane
      Deployment       | Moderate       | Supported management  | Data plane
      Failure recovery | Difficult      | Supported switches    | All
      FLOWGUARD        | Difficult      | Supported controllers | Controller

      III. System Description

      In this work, we propose a dynamic and self-learning schedule method of multiple controllers in an SDN. First, we design the general architecture by adding a management framework to the control plane in an SDN. As shown in Fig. 3, controllers can be divided into many groups that manage different switch regions. Under the condition of satisfying some constraints (for example, communication delay and load capacity), controllers can be combined arbitrarily, and one controller can manage one or more switch regions. Other application scenarios of multiple controllers’ combinations and schedules can be obtained by simplifying the general architecture. For example, we obtain a network model in [7] by dividing each switch into a single region.

      1. Management Framework

      The management framework runs on the management plane of the SDN and collects information about the working status of the control plane. It periodically learns and analyzes the historical reliability behavior in order to optimize the combination and schedule for the next working cycle. Our management framework contains four modules, as depicted in Fig. 3 and explained below (a minimal interface sketch follows the list):

      •  Monitor and Statistics Module monitors the controllers’ working status and collects their fault information, including the faulty controller’s ID, the time when it crashes, the IDs of the sets of controllers that encounter isotype faults, and the number of isotype faults.

      •  Reliability Evaluation Module quantitatively evaluates each controller group’s reliability based on the above fault information using a reasonable mathematical model.

      •  Combination and Schedule Module schedules controller groups to switch regions as optimally as possible using the reliability evaluation (that is, making the total reliability value of the control plane as high as possible) while satisfying the constraints.

      •  Reassignment Module reestablishes the correspondence between controllers and switches according to the above results, achieving their dynamic reassignment.
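
      The division of labor among these modules can be sketched with a few interface stubs. This is a minimal, hypothetical Python sketch: none of the class or method names come from the paper, and a real system would interact with live controllers rather than in-memory records.

      ```python
      # Hypothetical interface stubs for the four modules; names and types are
      # illustrative only.
      from dataclasses import dataclass, field
      from typing import Dict, List, Set


      @dataclass
      class FaultRecord:
          controllers: Set[int]      # controllers involved (singleton for a single fault)
          timestamp: float           # when the crash was observed


      @dataclass
      class MonitorAndStatistics:
          """Collects fault records and the per-group counters U(Z) and F(Z)."""
          faults: List[FaultRecord] = field(default_factory=list)
          usage: Dict[int, int] = field(default_factory=dict)        # U(Z)
          fault_count: Dict[int, int] = field(default_factory=dict)  # F(Z)

          def report_fault(self, record: FaultRecord) -> None:
              self.faults.append(record)


      class ReliabilityEvaluation:
          """Turns the statistics into per-group scores Q(Z) and R(Z) (Section III.2)."""
          def evaluate(self, stats: MonitorAndStatistics) -> Dict[int, float]:
              raise NotImplementedError


      class CombinationAndSchedule:
          """Chooses one controller group per switch region under the constraints
          (Algorithm 1 in Section IV)."""
          def schedule(self, reliability: Dict[int, float]) -> List[int]:
              raise NotImplementedError


      class Reassignment:
          """Re-binds switches to the newly selected controller groups, for example
          by pushing Master/Slave role changes to the controllers."""
          def apply(self, assignment: List[int]) -> None:
              pass
      ```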

      One important part of implementing the prototype of this framework is monitoring the controllers’ isotype fault information, which requires a deep reading of the controller status and a comprehensive analysis. To simplify the design, we adopt an approximation: if a + 1 controllers in the same group crash within a period of a × Δt_f (where Δt_f is the average time that a controller takes to process an OpenFlow message [13]), we assume that an isotype fault has occurred in these controllers.
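
      The window-based approximation above can be sketched as follows. This is an illustrative Python fragment, not the authors' implementation; crash_log and dt_f are assumed inputs (a time-sorted list of crash events for one group and the average per-message processing time Δt_f).

      ```python
      # Illustrative detection of isotype faults for one controller group:
      # a + 1 controllers crashing within a * dt_f count as one isotype fault.

      def detect_isotype_faults(crash_log, dt_f):
          """crash_log: list of (time, controller_id) pairs for one group;
          dt_f: average time a controller needs to process an OpenFlow message."""
          crash_log = sorted(crash_log)
          isotype = []
          i = 0
          while i < len(crash_log):
              # Grow a window of nearby crashes: a window holding a + 1 crashes may
              # span at most a * dt_f.
              j = i
              while (j + 1 < len(crash_log)
                     and crash_log[j + 1][0] - crash_log[i][0] <= (j + 1 - i) * dt_f):
                  j += 1
              if j > i:   # at least two controllers crashed close together
                  isotype.append({cid for _, cid in crash_log[i:j + 1]})
              i = j + 1
          return isotype
      ```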

      Fig. 3.

      System architecture.

      images/2017/v39n3/ETRI_J001_2017_v39n3_364_f003.jpg

      2. Evaluation of Controllers’ Reliability

      Reliability evaluation is the key part of the management framework because it determines how much the reliability of the SDN can be improved. How to design this part is an open problem: designers can propose different evaluation algorithms depending on their focus. Below, we describe the evaluation algorithm used in this paper in detail.

      For convenience, we first establish a formal description of the system. In an SDN, there are n (n ≥ 2) controllers, numbered C0 to Cn−1, which will be divided into m (m ≥ 1) groups. Each controller group (set) Si has a bijective relation with a positive integer Z:

      $$Z(S_i) = \sum_{C_j \in S_i} 2^j .$$

      This mapping simplifies the representation of controller groups significantly. In addition, η(Z) denotes the number of elements |Si| in set Si. For example, when the controller set Si is {C1, C3, C4}, the corresponding integer Z is 26 (2^1 + 2^3 + 2^4), and η(Z) is 3. In addition, if Si ⊆ Sj, we write Z(Si) ∝ Z(Sj).
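
      A minimal sketch of this encoding, assuming controllers are represented by their indices, is shown below; the helper names Z, eta, and is_subset are illustrative.

      ```python
      # Illustrative helpers; controller C_j is represented simply by its index j.

      def Z(group):
          """Encode a controller group (set of indices) as the integer bitmask
          Z(S_i) = sum of 2**j over C_j in S_i."""
          return sum(1 << j for j in group)

      def eta(z):
          """Number of controllers |S_i| in the group encoded by z."""
          return bin(z).count("1")

      def is_subset(z_i, z_j):
          """The paper's Z(S_i) 'proportional to' Z(S_j), that is, S_i ⊆ S_j."""
          return z_i & z_j == z_i

      assert Z({1, 3, 4}) == 26 and eta(26) == 3    # the example given in the text
      ```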

      The Monitor and Statistics Module records, for each controller set Si, the number of faults F(Z(Si)) and the number of messages U(Z(Si)) processed while the controllers were scheduled to work. The initial values of F(Z(Si)) and U(Z(Si)) are set to 0. Note that if a controller set Si is scheduled to work (or a fault occurs in it), the module records this information for all of its subsets as well. For example, in Fig. 2, the group that contains three controllers is denoted as {C1, C3, C4}. In a schedule period ΔT, we assume that the total number of OpenFlow messages that the controller group processes is y, and that the numbers of faults monitored in controller groups {C1}, {C3}, {C4}, {C1, C3}, {C1, C4}, {C3, C4}, and {C1, C3, C4} are x1, x2, x3, x4, x5, x6, and x7, respectively. We then update U(Z) and F(Z) as follows:

      $$\begin{cases}
      U(2) = U(Z(\{C_1\})) = U(2) + y,\\
      U(8) = U(Z(\{C_3\})) = U(8) + y,\\
      U(16) = U(Z(\{C_4\})) = U(16) + y,\\
      U(10) = U(Z(\{C_1, C_3\})) = U(10) + y,\\
      U(18) = U(Z(\{C_1, C_4\})) = U(18) + y,\\
      U(24) = U(Z(\{C_3, C_4\})) = U(24) + y,\\
      U(26) = U(Z(\{C_1, C_3, C_4\})) = U(26) + y,
      \end{cases}$$

      and

      $$\begin{cases}
      F(2) = F(2) + x_1 + x_4 + x_5 + x_7,\\
      F(8) = F(8) + x_2 + x_4 + x_6 + x_7,\\
      F(16) = F(16) + x_3 + x_5 + x_6 + x_7,\\
      F(10) = F(10) + x_4 + x_7,\\
      F(18) = F(18) + x_5 + x_7,\\
      F(24) = F(24) + x_6 + x_7,\\
      F(26) = F(26) + x_7.
      \end{cases}$$
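
      The bookkeeping above can be sketched as follows. The dictionaries U and F and the itertools-based subset enumeration are assumptions of this sketch; it reproduces the update rule of the {C1, C3, C4} example.

      ```python
      from itertools import combinations
      from collections import defaultdict

      U = defaultdict(int)   # U(Z): messages processed while this subset was on duty
      F = defaultdict(int)   # F(Z): faults attributed to this subset

      def subsets_of(group):
          """All non-empty subsets of a controller group (given as a set of indices)."""
          members = sorted(group)
          for r in range(1, len(members) + 1):
              for combo in combinations(members, r):
                  yield frozenset(combo)

      def update_statistics(group, messages, fault_counts):
          """group: the working set, e.g. {1, 3, 4}; messages: y;
          fault_counts: dict mapping each faulted subset (frozenset) to its count x."""
          for sub in subsets_of(group):
              z = sum(1 << j for j in sub)
              U[z] += messages
              # F(Z) also accumulates the faults of every superset of `sub`,
              # exactly as in the x_1 ... x_7 example above.
              F[z] += sum(x for s, x in fault_counts.items() if sub <= s)
      ```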

      Based on the above statistics, we make a quantitative description Q(Z) of the reliability of controller group Si itself. This description represents the ability to control networks correctly and is defined as follows:

      $$Q(Z) = \begin{cases}
      \eta(Z)^2 \times \dfrac{F(Z)}{U(Z)}, & U(Z) \neq 0,\\
      0, & U(Z) = 0.
      \end{cases} \tag{1}$$

      Here, F(Z)/U(Z) denotes the average fault rate over all of the group's work. Equation (1) includes η(Z)^2 because the more controllers that are involved in an isotype fault, the greater its effect on the total reliability. At system initialization, Q(Z) is set to 0 (the highest reliability), which allows the system to explore controller groups that have not yet been scheduled.

      Then, we define the total reliability of controller group Si as follows:

      $$R(Z) = \frac{1}{2^{\eta(Z)} - 1} \times \sum_{Z_j \propto Z} Q(Z_j). \tag{2}$$

      R(Z) accounts for the reliability of all subsets of group Si, whose number is 2^η(Z) − 1. The total reliability in (2) is measured by the average reliability of all of these subsets, which coincides with the intuitive experience that more controllers lead to higher reliability.

      Analyzing the procedure for calculating R(Z), we find that its time complexity is O(2^n), which requires significant computational capacity. However, the number of controllers in a general SDN is not very large (usually within 30), so the computation is acceptable.
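
      Equations (1) and (2) can be computed directly from the U and F counters, for example as in the following sketch; the dictionary-based counters and the sub-bitmask enumeration are assumptions of this sketch, not part of the paper.

      ```python
      def eta(z):
          return bin(z).count("1")           # number of controllers in the group

      def Q(z, U, F):
          """Per-group score of (1); 0 when the group has never been scheduled."""
          if U.get(z, 0) == 0:
              return 0.0
          return eta(z) ** 2 * F.get(z, 0) / U[z]

      def R(z, U, F):
          """Score of (2): the average Q over all 2**eta(z) - 1 non-empty subsets."""
          total, sub = 0.0, z
          while sub:                         # enumerate the non-empty sub-bitmasks of z
              total += Q(sub, U, F)
              sub = (sub - 1) & z
          return total / (2 ** eta(z) - 1)
      ```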

      3. Multiple Controllers’ Combination and Schedule

      Under the given constraints, we select m groups based on the above reliability evaluation so as to optimize the reliability of the control plane. First, the reliability of the control plane in an SDN is defined as follows:

      $$Y = \sum_{i \in [1,\, m]} R(S_i). \tag{3}$$

      Here, Si denotes the controller groups that are scheduled to work at the current time, and the result Y in (3) is the sum of their reliability values.

      To simplify the description and analysis, we consider only one constraint, the load capacity, in the remainder of this paper; other constraints can be handled in a similar way. For each controller Ci, we assume its maximal load capacity is li. If a controller Ci is scheduled to manage one or more switch regions, its load g(Ci) is the total number of switches that it manages. Thus, the problem of multiple controllers’ combination and schedule (MCCS) is formally described as follows:

      $$\begin{cases}
      \text{Max} \quad Y(\{S_1, S_2, \ldots, S_m\}),\\
      \text{s.t.} \quad \forall\, i \in [0,\, n-1], \; g(C_i) \le l_i.
      \end{cases} \tag{4}$$

      Obviously, the MCCS problem is similar to the bin-packing problem, which is NP-hard [14]. Because one controller is allowed to work in different groups, the problem becomes even harder to solve. For example, if the number of controllers in each group assigned to a switch region is p, a search over a space of O((C_n^p)^m) is needed to obtain the optimal result for (4). Although the reliability evaluation described above also has exponential computational complexity, solving the MCCS problem requires far more computation. For example, if there are 20 controllers and 6 switch regions in an SDN, the computation for the reliability evaluation is on the order of 10^6, but the computation for the MCCS problem is on the order of 10^18, which is beyond what we can accept. Therefore, we need a heuristic algorithm to reduce the computation significantly.
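
      To make the size of this search space concrete, the following sketch shows a brute-force solver for (4) in the spirit of the exhaustive-search baseline used in Section V. All function and parameter names are illustrative; a practical deployment would use the heuristic of Section IV instead.

      ```python
      from itertools import combinations, product

      def exhaustive_mccs(n, regions, load, switches, R):
          """Brute-force baseline: n controllers, regions[i] = p_i,
          load[c] = l_c, switches[i] = switches per region, R = scores keyed by Z."""
          candidates = [list(combinations(range(n), p)) for p in regions]
          best, best_score = None, None
          for assignment in product(*candidates):          # O(prod_i C(n, p_i)) choices
              g = [0] * n                                  # accumulated load g(C_i)
              for group, s in zip(assignment, switches):
                  for c in group:
                      g[c] += s
              if any(g[c] > load[c] for c in range(n)):    # constraint of (4)
                  continue
              score = sum(R.get(sum(1 << c for c in group), 0.0) for group in assignment)
              if best_score is None or score > best_score: # objective of (4)
                  best, best_score = assignment, score
          return best, best_score
      ```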

      IV. Proposed Heuristic

      In this section, we propose a greedy algorithm to solve the MCCS problem. Its basic idea is to iteratively assign controller groups, sorted by their reliability in descending order, to switch regions. The pseudocode of the proposed algorithm is shown in Algorithm 1. Because different switch regions may need different numbers of controllers, the input parameter pi is included.

      In the algorithm, we first divide the sets S into set groups Φi according to the number of controllers in each set, and sort the sets Si in each group Φi by their reliability in descending order (lines 2 to 3). This facilitates the subsequent operations, which repeatedly need the set with the highest reliability in Φi, or the next one. Second, for each switch region, we select the set Sk with the highest reliability in the corresponding set group Φj and add Sk to the approximate optimal result Γ (lines 4 to 6). However, the result of this greedy step usually does not satisfy the constraints, so we check Γ and find a controller Ci in Γ that exceeds its load limit (lines 7 to 8).

      Third, we find the controller set Sh with the minimum reliability in Γ that includes the controller Ci (line 9). Then, we remove Sh and replace it with the next-best set Sh′ in the group Φq to which Sh corresponds (lines 10 to 13). Finally, we iterate the above procedure until an approximate optimal collection of controller sets that satisfies the constraints is found. A Python sketch of this procedure is given after the pseudocode.

      Algorithm 1. Greedy algorithm for the MCCS problem.

      Input:  Controller sets, S
              Controller sets’ reliability, R
              Capacity constraints of controllers, l
              Number of total groups in the data plane, m
              Number of controllers in each group, p
      Output: Approximate optimal sets, Γ = {S1, S2, … , Sm}

      1.  Γ = ∅
      2.  for i = 1 to Max{pi} do
      3.      Φi ← sort sets Sj with |Sj| = i according to R
      4.  for i = 1 to m do
      5.      Si ← Max R{Φj}, where j = pi
      6.      Γ = Γ + Si
      7.  while Γ does not satisfy l do
      8.      search Ck that does not satisfy lk
      9.      Sh ← Min R{Γ} with Ck ∈ Sh
      10.     Γ = Γ − Sh
      11.     Φq = Φq − Sh, where q = ph
      12.     Sh ← Max R{Φq}, where q = ph
      13.     Γ = Γ + Sh
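
      A Python rendering of Algorithm 1 is sketched below under the same assumptions as the earlier sketches (candidate groups are tuples of controller indices and R is a dictionary of scores keyed by the bitmask Z). It follows the pseudocode line by line but is only an illustrative sketch, not the authors' implementation; like the pseudocode, it assumes a feasible replacement always exists.

      ```python
      from itertools import combinations

      def greedy_mccs(n, regions, load, switches, R):
          """n: number of controllers; regions[i] = required group size p_i;
          load[c] = l_c; switches[i] = switches in region i; R = scores keyed by Z."""

          def z_of(group):
              return sum(1 << c for c in group)

          # Lines 2-3: bucket candidate sets by size and sort each bucket by R.
          phi = {p: sorted(combinations(range(n), p),
                           key=lambda g: R.get(z_of(g), 0.0), reverse=True)
                 for p in set(regions)}

          # Lines 4-6: pick the top-ranked set of the required size for every region.
          gamma = [phi[p][0] for p in regions]

          def overloaded_controller():
              g = [0] * n                                  # g(C_i): accumulated load
              for group, s in zip(gamma, switches):
                  for c in group:
                      g[c] += s
              return next((c for c in range(n) if g[c] > load[c]), None)

          # Lines 7-13: while some controller exceeds its limit, drop the lowest-
          # ranked set in gamma that contains it and take the next best of that size.
          ck = overloaded_controller()
          while ck is not None:
              idx = min((i for i, grp in enumerate(gamma) if ck in grp),
                        key=lambda i: R.get(z_of(gamma[i]), 0.0))
              bucket = phi[regions[idx]]
              if gamma[idx] in bucket:                     # Phi_q = Phi_q - S_h
                  bucket.remove(gamma[idx])
              gamma[idx] = bucket[0]                       # S_h <- Max R{Phi_q}
              ck = overloaded_controller()
          return gamma
      ```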

      Analyzing the algorithm’s procedure, we can show that it searches a space of O(Σ_i C_n^{p_i}) to obtain the approximate optimal result. For the example described in Section III.3, the computation required to solve the MCCS problem with this algorithm is much lower, decreasing from the order of 10^18 to the order of 10^3, at the cost of some solution accuracy. In addition, the algorithm is general because other constraints are easy to add. For example, if an application scenario includes more constraints (such as communication delay or fault recovery time), we only need to extend lines 7 to 8 of the pseudocode to check whether the selected controller sets Γ satisfy these constraints.

      V. Evaluation

      1. Simulation Setup

      To evaluate the effect of our proposed DSL method, we tried different simulation methodologies to find one suitable for our purpose. First, we attempted to use Mininet [15] to simulate the data plane of the SDN and FlowVisor [16] to divide the switches into groups. However, Mininet cannot generate the malformed OpenFlow messages that lead to controller faults, which makes it difficult to build the expected experimental environment. Therefore, we designed an SDN network simulator, based on the C++ language, to model multiple controller faults. The network parameters of the simulator can be freely configured, including the number of controllers, fault rate, load capacity, and number of switches. We assume that after each controller group has worked for a schedule period ΔT, during which the controllers in the group process 10,000 OpenFlow messages, the management framework evaluates the reliability of the controllers and schedules them again.

      The simulation is driven by a discrete-event approach [17] in which each message processed by a controller, together with the resulting state transition, is treated as an event. As shown in Fig. 4, when a controller group receives an OpenFlow message of1, it is first processed by the main controller Ci1 in the group. We determine whether a fault occurs in the main controller according to its fault probability P(Ci1). If a fault occurs, the next controller Ci2 becomes the main controller and processes the message of1. We again determine whether a fault occurs in controller Ci2. Note that we determine whether the message of1 causes a fault in controller Ci2 according to the conditional probability

      $$P(C_{i2} \mid C_{i1}) = \frac{P(C_{i1} C_{i2})}{P(C_{i1})},$$

      instead of the fault probability P(Ci2), because the message of1 has already caused a fault in controller Ci1. Similarly, if a fault also occurs in controller Ci2, we follow the same procedure for controller Ci3 and determine its state according to the conditional probability P(Ci1Ci2Ci3)/P(Ci1Ci2). If controller Ci3 can process the message of1 correctly at that moment, it becomes the main controller and processes the next OpenFlow message of2.

      Fig. 4.

      Multiple controllers process OpenFlow messages.

      images/2017/v39n3/ETRI_J001_2017_v39n3_364_f004.jpg
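
      The per-message fault model described above can be sketched as follows. Here joint_p is an assumed dictionary mapping a set of controllers to their joint fault probability P(Ci1 … Cik); the random draws stand in for the behavior of the authors' C++ simulator.

      ```python
      import random

      def process_message(group, joint_p):
          """group: controllers in order, the current main controller first.
          joint_p: dict mapping a frozenset of controller indices to the joint
          fault probability P(C_i1 ... C_ik).
          Returns (index of the controller that handled the message, crashed set)."""
          crashed = []
          for k, c in enumerate(group):
              p_joint = joint_p.get(frozenset(group[:k + 1]), 0.0)
              if k == 0:
                  p_cond = p_joint                          # P(C_i1)
              else:
                  p_prev = joint_p.get(frozenset(group[:k]), 0.0)
                  p_cond = p_joint / p_prev if p_prev > 0 else 0.0
              if random.random() < p_cond:                  # this controller crashes too
                  crashed.append(c)
              else:
                  # c handles the message and would become the main controller
                  # for the next message.
                  return k, set(crashed)
          return None, set(crashed)                         # the whole group crashed
      ```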

      Next, we introduce the test scenario used in the experiment. There are 20 controllers in the SDN network, and each controller’s load capacity is randomly distributed within the range of [2,000, 7,000]. As listed in Table 2, we configure the fault probability of each controller group according to the number of controllers it contains. For example, controller group {Ci1, Ci2, Ci3} contains three controllers, so its fault probability is in the range of [0, 0.017]. The switches in the data plane are divided into four regions. The number of switches and the required number of controllers in each region are listed in Table 3.

      Table 2.

      Controller fault probability.

      Number of faulty controllers | 1         | 2          | 3          | 4
      Range of fault probability   | [0, 0.03] | [0, 0.024] | [0, 0.017] | [0, 0.01]

      Table 3.

      Switch regions.

      Switch group ID                | 1     | 2     | 3     | 4
      Number of switches             | 2,000 | 3,000 | 3,600 | 4,400
      Required number of controllers | 2     | 3     | 3     | 4

      2. Results

      Using the simulator and the test scenario above, we compare the random schedule method (RDM) and the dynamic and self-learning schedule method (DSL). When the simulator adopts RDM to manage the controllers, it randomly selects controller groups that satisfy the constraints, without taking their reliability into account. In the simulation, we also record the number of controller faults during each schedule period ΔT, including single-controller faults and isotype faults of multiple controllers.

      Figure 5 shows the experimental results, which contain the fault information of the controllers in the first 200 schedule periods. Similar behavior is observed for RDM and DSL in Figs. 5(a) to 5(d). For DSL, the number of controller faults in the first 18 schedule periods is much higher than in the later periods. This is because, at the beginning, the management framework has not yet collected sufficient information about the controllers’ reliability and needs to explore the controller groups that have not been scheduled. After it has explored all combinations of controllers (that is, after time 18ΔT), the total reliability of the control plane converges, and the number of faults decreases significantly.

      We also compare the effects of DSL and RDM in Fig. 5. For both single-controller faults and isotype faults of multiple controllers, DSL clearly performs much better than RDM after it has converged; that is, the number of faults with DSL is lower than with RDM in each schedule period. Therefore, DSL improves the reliability of the SDN, because it treats reliability as a metric for scheduling controllers by statistically learning their reliability information from historical behavior.

      Fig. 5.

      Number of controller faults: (a) single controller fault, (b) isotype fault of two controllers, (c) isotype fault of three controllers, and (d) isotype fault of four controllers.

      images/2017/v39n3/ETRI_J001_2017_v39n3_364_f005.jpg

      In addition, we verified the effect of the heuristic algorithm (GA) proposed in Section IV. Because the SDN network in the test scenario is small, we can exhaustively search for the optimal solution of its MCCS problem within an acceptable time. During the period 90ΔT–100ΔT, the simulator not only calls GA to solve the MCCS problem but also calls an exhaustive search algorithm (ES) for comparison. We evaluated each algorithm’s effect by the total reliability Y in (3) that it achieves, and its computational overhead by the number of controller groups n that it searched. As shown in Fig. 6, the approximate optimal result obtained by GA is close to the optimal result obtained by ES, but the computational overhead of GA is much lower. Therefore, the proposed optimization algorithm GA is more practical for application.

      Fig. 6.

      Effect of algorithm GA.

      images/2017/v39n3/ETRI_J001_2017_v39n3_364_f006.jpg

      Like other optimization problems, designing an appropriate management network for an SDN generally requires a trade-off between reliability and performance. We evaluated the effect of DSL on network performance, including the latency of OpenFlow requests and responses and the proportion of control traffic. We used the test parameters in [11] to configure the simulator, which is similar to NS-3 [18], and adopted the OS3E network as the topology of the data plane. The average update rate of each node’s network events is random in the range of [10, 100]. The simulator runs for 30 min at a time and schedules the controllers every 20 s (real system time). Compared with the traditional method (without DSL), the effect of the DSL method on system performance was measured by monitoring events and their occurrence times (virtual simulation times).

      The delay of an OpenFlow request is the time from the creation of a packet by a switch to its capture by a controller. As shown in Fig. 7, compared with the traditional method, OpenFlow requests incurred an additional 0.17 ms of latency on average when the DSL method was deployed in the SDN. This increase is small and is caused by message blocking when the master controller changes. Similarly, the delay of an OpenFlow response is the time from the sending of a request by a switch to the switch’s reception of the response.

      OpenFlow responses incurred an additional 9.2 ms of latency on average, which includes the time for the management framework to schedule controllers and for the new controller sets to discover links. In addition, we counted the numbers of control packets and data packets in the simulator. The ratio of control packets to data packets increased by approximately 3.3%, which is related to the frequency with which the management framework schedules controllers. We consider this performance overhead acceptable, so the DSL method is suitable for networks with a high demand for reliability (see Fig. 7).

      Fig. 7.

      Average latency of OpenFlow messages.

      images/2017/v39n3/ETRI_J001_2017_v39n3_364_f007.jpg

      VI. Conclusion

      In this paper, we proposed a dynamic and self-learning schedule method of multiple controllers to improve the reliability of the control plane in an SDN. By statistically learning the reliability information from historical behaviors, this method treats reliability as a metric to combine and schedule controllers. Experimental results proved that this method can improve the total reliability of an SDN.

      In follow-on research, we intend to extend this work in two directions. First, we will refine the implementation of the prototype system to make it more practical and release its source code. Second, we will optimize the reliability evaluation algorithm to adapt it to different fault scenarios.

      Footnotes

      Junfei Li (corresponding author, winfelin@gmail.com), Jiangxing Wu (chxachxa@126.com), and Yuxiang Hu (ndscwjx@126.com) are with National Digital Switching System Engineering & Technological R&D Center, Zhengzhou, China.

      Kan Li (studying2012@126.com) is with Xi’an Communication Institute, China.

  • References

      [1] 

      N. McKeown et al., “OpenFlow: Enabling Innovation in Campus Networks,” ACM SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, Apr. 2008, pp. 69–74.

      [2] 

      A. Lara, A. Kolasani, and B. Ramamurthy, “Network Innovation Using OpenFlow: a Survey,” IEEE Commun. Surveys Tutorials, vol. 16, no. 1, 2014, pp. 493–512.  

      [3] 

      X. Guan, B.Y. Choi, and S. Song, “Reliability and Scalability Issues in Software Defined Network Frameworks,” Res. Educ. Experiment Workshop, Salt Lake City, UT, USA, Mar. 20–22, 2013, pp. 102–103.

      [4] 

      S. Sezer et al., “Are We Ready for SDN? Implementation Challenges for Software-Defined Networks,” IEEE Commun. Mag., vol. 51, no. 7, July 2013, pp. 36–43.

      [5] 

      ONF-TS004, OpenFlow Switch Specification Version 1.2, CA, USA, 2011.

      [6] 

      M.F. Bari et al., “Dynamic Controller Provisioning in Software Defined Networks,” Netw. Service Manage., Beijing, China, Oct. 14–18, 2013, pp. 18–25.

      [7] 

      H. Li et al., “Byzantine-Resilient Secure Software-Defined Networks with Multiple Controllers in Cloud,” IEEE Trans. Cloud Comput., vol. 2, no. 4, Oct.–Dec. 2014, pp. 436–447.  

      [8] 

      D. Hock et al., “Pareto-Optimal Resilient Controller Placement in SDN-Based Core Networks,” Int. Teletraffic Congr., Beijing, China, Sept. 10–12, 2013, pp. 1–9.

      [9] 

      V. Pashkov, A. Shalimov, and R. Smeliansky, “Controller Failover for SDN Enterprise Networks,” Int. Sci. Technol. Conf. (Modern Netw. Technol.), Chicago, IL, USA, Oct. 28–29, 2014, pp. 1–6.

      [10] 

      M. Castro and B. Liskov, “Practical Byzantine Fault Tolerance and Proactive Recovery,” ACM Trans. Comput. Syst., vol. 20, no. 4, Nov. 2002, pp. 398–461.  

      [11] 

      A. Shalimov et al., “Advanced Study of SDN/OpenFlow Controllers,” Proc. Central Eastern European Softw. Eng. Conf., Moscow, Russia, Oct. 24–25, 2013, pp. 1–7.

      [12] 

      D. Klingel et al., “Security Analysis of Software Defined Networking Architectures: PCE, 4D and SANE,” Proc. AINTEC Asian Internet Eng. Conf., Bangkok, Thailand, Nov. 26–28, 2014, pp. 15–23.

      [13] 

      A. Tootoonchian et al., “On Controller Performance in Software-Defined Networks,” Proc. USENIX Conf. Hot Topics Manage. Internet, Cloud, Enterprise Netw. Services, San Jose, CA, USA, Apr. 24, 2012, pp. 1–6.

      [14] 

      T.K. Truong, K. Li, and Y. Xu, “Chemical Reaction Optimization with Greedy Strategy for the 0–1 Knapsack Problem,” Appl. Soft Comput., vol. 13, no. 4, Apr. 2013, pp. 1774–1780.  

      [15] 

      R.L.S. de Oliveira et al., “Using Mininet for Emulation and Prototyping Software-Defined Networks,” IEEE Colombian Conf. Commun. Comput., Bogota, Colombia, June 4–6, 2014, pp. 1–6.

      [16] 

      R. Sherwood et al., “FlowVisor: A Network Virtualization Layer,” OpenFlow Switch Consortium Tech. Rep., vol. 15, no. 7, Oct. 2009, pp. 1–13.

      [17] 

      M. Lu, “Simplified Discrete-Event Simulation Approach for Construction Simulation,” J. Constr. Eng. Manage., vol. 129, no. 5, Oct. 2003, pp. 537–546.  

      [18] 

      G.F. Riley and T.R. Henderson, The ns-3 Network Simulator, Berlin, Heidelberg, Germany: Springer, 2010, pp. 15–34.
