Modeling, Monitoring and Scheduling Techniques for Network Recovery from Massive Failures

Author: Zad Tootaghaj, Diman
Graduate Program: Computer Science and Engineering
Degree: Doctor of Philosophy
Document Type: Dissertation
Date of Defense: May 23, 2018
Committee Members:

  • Thomas F Laporta, Dissertation Advisor
  • Thomas F Laporta, Committee Chair
  • Ting He, Committee Member
  • Nilanjan Ray Chaudhuri, Committee Member
  • Marek Flaska, Outside Member


Keywords:

  • Network Recovery
  • Massive Disruption
  • Stochastic Optimization
  • Uncertainty
  • Cascading Failures
  • Interdependent Networks
  • Power Grid
  • Software-Defined Networking

Abstract: This dissertation explores modeling, monitoring, and scheduling techniques for network recovery from massive failures, with a focus on optimization methods under uncertain knowledge of failures. Large-scale failures in communication networks due to natural disasters or malicious attacks can severely affect critical communications and threaten the lives of people in the affected area. In 2005, Hurricane Katrina led to the outage of over 2.5 million lines in the BellSouth (now AT&T) network. In the absence of a proper communication infrastructure, rescue operations become extremely difficult. Progressive and timely network recovery is therefore key to minimizing losses and facilitating rescue missions. Many prior works on failure detection and recovery assume full knowledge of failures and use a deterministic approach for the recovery phase. In real-world scenarios, however, the failure pattern might be unknown or only partially known, so classic recovery approaches may not work. To this end, I focus on network recovery assuming partial and uncertain knowledge of the failure locations.

I first studied large-scale failures in a communication network and proposed a progressive multi-stage recovery approach that uses incomplete knowledge of the failure to find a feasible recovery schedule. At each iteration, I selected from the elements of this solution the node with the highest centrality to repair and then exploited it as a monitor to increase knowledge of the network state, until all critical services were restored. The recovery problem can be addressed by giving different priority to three performance aspects: 1) demand loss, 2) computation time, and 3) number of repairs (or repair cost). These aspects conflict with each other, and I studied the trade-off among them.

Next, I focused on failure recovery of multiple interconnected networks, in particular the interaction between a power grid and a communication network. I modeled cascading failures in a power grid using a DC power flow model. I tackled the problem of mitigating an ongoing cascade by formulating the minimum cost flow assignment problem as a linear program. The optimization finds a minimum-cost DC power flow setting that stops the cascading failure by redistributing power among the generators and loads without violating the overload constraint on any line, where the total cost is defined as the total weighted amount of unsatisfied load.

Then, I focused on network monitoring techniques that can be used to diagnose the performance of individual links and localize soft failures (e.g., highly congested links) in a communication network. I studied the optimal selection of monitoring paths to balance identifiability and probing cost. I considered four closely related optimization problems: (1) Max-IL-Cost, which maximizes the number of identifiable links under a probing budget, (2) Max-Rank-Cost, which maximizes the rank of selected paths under a probing budget, (3) Min-Cost-IL, which minimizes the probing cost while preserving identifiability, and (4) Min-Cost-Rank, which minimizes the probing cost while preserving rank. I showed that while (1) and (3) are hard to solve, (2) and (4) possess desirable properties that allow efficient computation while providing a good approximation to (1) and (3). I proposed an optimal greedy-based approach for (4) and a (1-1/e)-approximation algorithm for (2). My experimental analysis revealed that, compared to several greedy approaches that directly solve the identifiability-based optimization (i.e., (1) and (3)), the proposed rank-based optimization (i.e., (2) and (4)) achieved better trade-offs between identifiability and probing cost.

Finally, I addressed a minimally disruptive routing framework in software-defined networks (SDN). I showed that flow disruption, congestion, and violation of policies can occur during updates of flow tables in software-defined networks. I aimed to minimize the update disruption and the number of affected flows during the update while taking into account link capacity constraints and the importance of various flows to upper-layer applications. I formulated the problem as an integer linear program and showed that it is NP-hard. I proposed two randomized rounding algorithms with bounded congestion and demand loss to solve this problem. In addition to experiments on a small SDN testbed, I performed a large-scale simulation study to evaluate my proposed approaches on real network topologies. Extensive experimental and simulation results show that the two randomized rounding approaches have a disruption cost close to the optimal while incurring a low congestion factor and a low demand loss.
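
The randomized rounding step mentioned above can be illustrated with a minimal sketch. It assumes a fractional LP solution is already available as per-flow weights over candidate paths, and it picks one path per flow with probability equal to its weight. The flow names, paths, demands, and capacities are hypothetical, and this is a generic rounding scheme, not necessarily either of the two algorithms in the dissertation.

```python
import random

def round_paths(fractional, rng):
    """Pick one candidate path per flow, with probability equal to its LP weight."""
    chosen = {}
    for flow, weights in fractional.items():
        paths = list(weights)
        probs = [weights[p] for p in paths]
        chosen[flow] = rng.choices(paths, weights=probs, k=1)[0]
    return chosen

def link_loads(chosen, demand):
    """Sum the demand placed on every directed link by the chosen paths."""
    loads = {}
    for flow, path in chosen.items():
        for link in zip(path, path[1:]):
            loads[link] = loads.get(link, 0.0) + demand[flow]
    return loads

# Hypothetical fractional solution: weights per flow sum to 1.
fractional = {
    "f1": {("a", "b", "d"): 0.7, ("a", "c", "d"): 0.3},
    "f2": {("a", "c", "d"): 1.0},
}
demand = {"f1": 2.0, "f2": 1.0}
capacity = {("a", "b"): 2.0, ("b", "d"): 2.0, ("a", "c"): 2.0, ("c", "d"): 2.0}

chosen = round_paths(fractional, random.Random(0))
loads = link_loads(chosen, demand)
# Congestion factor: worst-case ratio of load to capacity after rounding.
congestion = max(load / capacity[link] for link, load in loads.items())
```

Because each flow is rounded independently, a link can end up above capacity; the congestion factor measures by how much, which is the quantity the abstract says is bounded.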

CAGE: A Contention-Aware Game-theoretic Model for Heterogeneous Resource Assignment

Traditional resource management systems rely on a centralized approach to manage users running on each resource. This centralized approach does not scale to large servers: the number of users running on shared resources is increasing dramatically, and a centralized manager may not have enough information about applications’ needs. In this paper, we propose a distributed game-theoretic resource management approach that uses a market auction mechanism to find an optimal strategy in a resource competition game. The applications learn through repeated interactions which shared resources to choose. Specifically, we look into two case studies: a cache competition game, and a main processor and co-processor congestion game. We enforce costs for each resource and derive a bidding strategy. Accurate evaluation of the proposed approach shows that our distributed allocation is scalable and outperforms the static and traditional approaches.

To appear in the 35th IEEE International Conference on Computer Design (ICCD)

Parsimonious Tomography: Optimizing Cost-Identifiability Trade-off for Probing-based Network Monitoring

Network tomography using end-to-end probes provides a powerful tool for monitoring the performance of internal network elements. However, active probing can generate tremendous traffic, which degrades the overall network performance. Meanwhile, not all the probing paths contain useful information for identifying the link metrics of interest. This observation motivates us to study the optimal selection of monitoring paths to balance identifiability and probing cost. Assuming additive link metrics (e.g., delays), we consider four closely related optimization problems: 1) Max-IL-Cost that maximizes the number of identifiable links under a probing budget, 2) Max-Rank-Cost that maximizes the rank of selected paths under a probing budget, 3) Min-Cost-IL that minimizes the probing cost while preserving identifiability, and 4) Min-Cost-Rank that minimizes the probing cost while preserving rank. While (1) and (3) are hard to solve, (2) and (4) are easy to solve, and the solutions give a good approximation for (1) and (3). Specifically, we provide an optimal algorithm for (4) and a (1-1/e)-approximation algorithm for (2). We prove that the solution for (4) provides tight upper/lower bounds on the minimum cost of (3), and the solution for (2) provides upper/lower bounds on the maximum identifiability of (1). Our evaluations on real topologies show that solutions to the rank-based optimization (2, 4) have superior performance in terms of the objectives of the identifiability-based optimization (1, 3), and our solutions can reduce the total probing cost by an order of magnitude while achieving the same monitoring performance.
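
As an illustration of Min-Cost-Rank, here is a minimal sketch of one standard way to obtain a minimum-cost set of paths that preserves rank: scan paths in order of increasing cost and keep a path only if its row of the routing matrix is linearly independent of the rows already kept. This is the classic greedy for a linear matroid and may differ in detail from the paper's algorithm. The path names, costs, and 0/1 routing rows are hypothetical.

```python
from fractions import Fraction

def residual(basis, row):
    """Eliminate `row` against the stored basis rows and return what is left."""
    row = [Fraction(x) for x in row]
    for pivot_col, brow in basis:
        if row[pivot_col] != 0:
            factor = row[pivot_col] / brow[pivot_col]
            row = [r - factor * b for r, b in zip(row, brow)]
    return row

def min_cost_rank(paths, cost):
    """Greedy selection: cheapest paths first, keep only rank-increasing ones."""
    basis = []      # list of (pivot column, reduced row)
    selected = []
    for name in sorted(paths, key=lambda p: (cost[p], p)):
        rest = residual(basis, paths[name])
        pivot = next((i for i, v in enumerate(rest) if v != 0), None)
        if pivot is not None:            # row is independent of the basis
            basis.append((pivot, rest))
            selected.append(name)
    return selected

# Hypothetical routing matrix: one row per path, one column per link.
paths = {
    "p1": [1, 1, 0],
    "p2": [0, 1, 1],
    "p3": [1, 0, 0],
    "p4": [0, 1, 0],   # equals p1 - p3, so redundant once both are kept
    "p5": [1, 1, 1],
}
cost = {"p1": 2, "p2": 2, "p3": 1, "p4": 3, "p5": 3}

selected = min_cost_rank(paths, cost)
total_cost = sum(cost[p] for p in selected)
```

In this example three of the five paths already reach the full rank of 3, so probing the other two would add cost without adding any linearly independent measurement.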


To appear in IFIP Performance 2017 [PDF].


Controlling Cascading Failures in Interdependent Networks under Incomplete Knowledge

Vulnerability due to inter-connectivity of multiple networks has been observed in many complex networks. Previous works mainly focused on robust network design and on recovery strategies after sporadic or massive failures in the case of complete knowledge of failure location.
We focus on cascading failures involving the power grid and its communication network with consequent imprecision in damage assessment.
We tackle the problem of mitigating the ongoing cascading failure and providing a recovery strategy.
We propose a failure mitigation strategy in two steps: 1) Once a cascading failure is detected, we limit further propagation by redistributing power among the generators and loads. 2) We formulate a recovery plan to maximize the total amount of power delivered to the demand loads during the recovery intervention.
Our approach to coping with insufficient knowledge of damage locations is based on a new algorithm that determines consistent failure sets (CFS). We show that, given knowledge of the system state before the disruption, the CFS algorithm can find all consistent sets of unknown failures in polynomial time, provided that each connected component of the disrupted graph has at least one line whose failure status is known to the controller.
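
Power redistribution of this kind is commonly analyzed with a DC power flow model. The sketch below, under assumed data, solves the DC power flow equations on a three-bus example and flags lines whose flow exceeds a capacity limit, which is the condition a mitigation step would need to remove. The topology, susceptances, injections, and capacity are hypothetical, and the optimization itself is not shown.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A assumed nonsingular)."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]

def dc_power_flow(n_bus, lines, injection, slack=0):
    """Return the flow on each line, with the slack bus angle fixed at zero."""
    B = [[Fraction(0)] * n_bus for _ in range(n_bus)]
    for (i, j), b in lines.items():
        B[i][i] += b
        B[j][j] += b
        B[i][j] -= b
        B[j][i] -= b
    keep = [k for k in range(n_bus) if k != slack]
    reduced = [[B[r][c] for c in keep] for r in keep]
    angles = solve(reduced, [injection[k] for k in keep])
    theta = [Fraction(0)] * n_bus
    for k, t in zip(keep, angles):
        theta[k] = t
    # Flow on line (i, j) is susceptance times the angle difference.
    return {(i, j): b * (theta[i] - theta[j]) for (i, j), b in lines.items()}

# Hypothetical triangle network: bus 0 generates, buses 1 and 2 consume.
lines = {(0, 1): 1, (0, 2): 1, (1, 2): 1}
injection = [Fraction(3, 2), Fraction(-1, 2), Fraction(-1)]
flows = dc_power_flow(3, lines, injection)

capacity = Fraction(4, 5)   # assumed identical limit on every line
overloaded = [line for line, f in flows.items() if abs(f) > capacity]
```

With these numbers line (0, 2) carries 5/6 units against a limit of 4/5, so it would trip next; reducing the load at bus 2 or shifting generation is the kind of redistribution step 1 describes.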

Check our paper in SRDS 2017:

Download our paper here

Network Recovery from Massive Failures under Uncertain Knowledge of Damages

We address progressive network recovery under uncertain knowledge of damages. We formulate the problem as a mixed integer linear program (MILP) and show that it is NP-hard. We propose an iterative stochastic recovery algorithm (ISR) to recover the network in a progressive manner to satisfy the critical services. At each optimization step, we make a decision to repair a part of the network and gather more information iteratively, until critical services are completely restored. Three different algorithms are used to find a feasible set and determine which node to repair, namely, 1) an iterative shortest path algorithm (ISR-SRT), 2) an approximate branch and bound (ISR-BB), and 3) an iterative multi-commodity LP relaxation (ISR-MULT). Further, we have modified the state-of-the-art iterative split and prune (ISP) algorithm to incorporate the uncertain failures. Our results show that ISR-BB and ISR-MULT outperform the state-of-the-art “progressive ISP” algorithm, while we can configure our choice of trade-off between the execution time, the number of repairs (cost), and the demand loss. We show that our recovery algorithm, on average, can reduce the total number of repairs by a factor of about 3 with respect to ISP, while satisfying all critical demands.
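
The repair-then-monitor loop can be sketched as follows. This is a deliberately simplified illustration, not ISR-SRT, ISR-BB, or ISR-MULT: it scores each node of unknown status by how many unmet demands route through it on the pre-failure shortest path, repairs the top-scoring node, and then lets that node reveal the true status of its neighbors. The topology, demands, and failure set are hypothetical, and the sketch assumes demand endpoints are working and connected in the pre-failure topology.

```python
from collections import deque

def shortest_path(adj, src, dst, usable):
    """BFS shortest path from src to dst using only nodes in `usable`."""
    if src not in usable or dst not in usable:
        return None
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj[u]:
            if v in usable and v not in parent:
                parent[v] = u
                queue.append(v)
    return None

def recover(adj, demands, failed):
    """Repair one node per step until every demand has a working path."""
    failed = set(failed)                        # ground truth, hidden from the planner
    known_ok = {n for d in demands for n in d}  # assumption: endpoints are up
    everything = set(adj)
    repairs = []
    while True:
        working = everything - failed
        unmet = [(s, t) for s, t in demands
                 if shortest_path(adj, s, t, working) is None]
        if not unmet:
            return repairs
        # Demand-based centrality over nodes whose status is still unknown.
        score = {}
        for s, t in unmet:
            for node in shortest_path(adj, s, t, everything):
                if node not in known_ok:
                    score[node] = score.get(node, 0) + 1
        target = max(sorted(score), key=score.get)
        failed.discard(target)                  # repair the node
        known_ok.add(target)
        repairs.append(target)
        for nb in adj[target]:                  # monitor: probe the neighbors
            if nb not in failed:
                known_ok.add(nb)

# Hypothetical topology: chain a-b-c-d-e plus spurs f-c and c-g.
adj = {
    "a": ["b"], "b": ["a", "c"], "c": ["b", "d", "f", "g"],
    "d": ["c", "e"], "e": ["d"], "f": ["c"], "g": ["c"],
}
demands = [("a", "e"), ("f", "g")]
repairs = recover(adj, demands, failed={"c", "d"})
```

Node c lies on both demand paths, so it is repaired first; monitoring from c then confirms that b is working, which keeps the second step from spending a repair on b and sends it to d instead.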

Check our paper in IFIP Networking 2017:

Download our paper here

Presentation Slides