Author: Zad Tootaghaj, Diman

Graduate Program: Computer Science and Engineering

Degree: Doctor of Philosophy

Document Type: Dissertation

Date of Defense: May 23, 2018

Committee Members:

- Thomas F Laporta, Dissertation Advisor
- Thomas F Laporta, Committee Chair
- Ting He, Committee Member
- Nilanjan Ray Chaudhuri, Committee Member
- Marek Flaska, Outside Member

Keywords:

- Network Recovery
- Massive Disruption
- Stochastic Optimization
- Uncertainty
- Cascading Failures
- Interdependent Networks
- Power Grid
- Software-Defined Networking

Abstract: This dissertation explores modeling, monitoring, and scheduling techniques for network recovery from massive failures, with a focus on optimization methods under uncertain knowledge of failures. Large-scale failures in communication networks due to natural disasters or malicious attacks can severely affect critical communications and threaten the lives of people in the affected area. In 2005, Hurricane Katrina led to the outage of over 2.5 million lines in the BellSouth (now AT&T) network. In the absence of a proper communication infrastructure, rescue operations become extremely difficult. Progressive and timely network recovery is therefore key to minimizing losses and facilitating rescue missions. Many prior works on failure detection and recovery assume full knowledge of failures and use a deterministic approach for the recovery phase. In real-world scenarios, however, the failure pattern might be unknown or only partially known, so classic recovery approaches may not work. To this end, I focus on network recovery assuming partial and uncertain knowledge of the failure locations.

I first studied large-scale failures in a communication network. In particular, I proposed a progressive multi-stage recovery approach that uses the incomplete knowledge of the failure to find a feasible recovery schedule. At each iteration, I selected from the elements of this schedule the node with the highest centrality, repaired it, and used it as a monitor to increase knowledge of the network state, until all critical services were restored. The recovery problem can be addressed by giving different priority to three performance aspects: 1) demand loss, 2) computation time, and 3) number of repairs (or repair cost). These aspects conflict with each other, and I studied the trade-off among them.

Next, I focused on the failure recovery of multiple interconnected networks, in particular the interaction between a power grid and a communication network. I modeled the cascading failures in a power grid using a DC power flow model. I tackled the problem of mitigating an ongoing cascade by formulating the minimum cost flow assignment problem as a linear program. The optimization aimed at finding a minimum cost DC power flow setting that stops the cascading failure, where the total cost is the total weighted amount of load left unsatisfied after power is redistributed among the generators and loads without violating the overload constraint on any line.

Then, I focused on network monitoring techniques that can be used for diagnosing the performance of individual links and localizing soft failures (e.g., highly congested links) in a communication network. I studied the optimal selection of the monitoring paths to balance identifiability and probing cost. I considered four closely related optimization problems: (1) Max-IL-Cost that maximizes the number of identifiable links under a probing budget, (2) Max-Rank-Cost that maximizes the rank of selected paths under a probing budget, (3) Min-Cost-IL that minimizes the probing cost while preserving identifiability, and (4) Min-Cost-Rank that minimizes the probing cost while preserving rank. I showed that while (1) and (3) are hard to solve, (2) and (4) possess desirable properties that allow efficient computation while providing a good approximation to (1) and (3). I proposed an optimal greedy algorithm for (4) and a (1-1/e)-approximation algorithm for (2). My experimental analysis revealed that, compared to several greedy approaches that directly solve the identifiability-based optimization (i.e., (1) and (3)), the proposed rank-based optimization (i.e., (2) and (4)) achieved better trade-offs in terms of identifiability and probing cost.

Finally, I addressed a minimally disruptive routing framework for software-defined networks. I showed that flow disruption, congestion, and violation of policies can occur during updates of flow tables in software-defined networks. I aimed to minimize the update disruption and the number of affected flows during the update while taking into account link capacity constraints and the importance of various flows to upper-layer applications. I formulated the problem as an integer linear program and showed that it is NP-hard. I proposed two randomized rounding algorithms with bounded congestion and demand loss to solve this problem. In addition to experiments on a small SDN testbed, I performed a large-scale simulation study to evaluate my proposed approaches on real network topologies. Extensive experimental and simulation results show that the two randomized rounding approaches have a disruption cost close to the optimal while incurring a low congestion factor and a low demand loss.
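As a rough illustration of the randomized rounding idea mentioned for the software-defined networking problem, the sketch below rounds a fractional routing solution by choosing one path per flow with probability equal to its fractional weight, then reports the resulting congestion factor. It is not the dissertation's algorithm. The flow names, link names, fractional weights, demands, and capacities are made-up inputs, and the LP relaxation that would produce the fractional solution is assumed to have been solved elsewhere.

```python
import random

def round_paths(fractional, rng):
    """Pick one path per flow, with probability equal to its fractional weight."""
    chosen = {}
    for flow, weights in fractional.items():
        paths = list(weights)
        probs = [weights[p] for p in paths]
        chosen[flow] = rng.choices(paths, weights=probs, k=1)[0]
    return chosen

def congestion_factor(chosen, demand, capacity):
    """Largest load-to-capacity ratio over all links after rounding."""
    load = {link: 0.0 for link in capacity}
    for flow, path in chosen.items():
        for link in path:
            load[link] += demand[flow]
    return max(load[link] / capacity[link] for link in capacity)

# Hypothetical fractional solution: each path is a tuple of link names.
fractional = {
    "f1": {("a", "b"): 0.7, ("c",): 0.3},
    "f2": {("a", "b"): 1.0},
}
demand = {"f1": 4.0, "f2": 5.0}
capacity = {"a": 10.0, "b": 10.0, "c": 10.0}

chosen = round_paths(fractional, random.Random(0))
factor = congestion_factor(chosen, demand, capacity)
print(chosen, factor)
```

Because each flow ends up on exactly one path, the rounded solution is integral; the congestion factor shows how far the random choice pushes any link relative to its capacity.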

Modeling, Monitoring and Scheduling Techniques for Network Recovery from Massive Failures

Traditional resource management systems rely on a centralized approach to manage the users running on each resource. A centralized resource manager does not scale to large servers: the number of users running on shared resources is increasing dramatically, and the centralized manager may not have enough information about the applications' needs. In this paper we propose a distributed game-theoretic resource management approach that uses a market auction mechanism to find the optimal strategy in a resource competition game. The applications learn through repeated interactions how to choose among the shared resources. Specifically, we look into two case studies: a cache competition game and a main processor and co-processor congestion game. We enforce costs for each resource and derive a bidding strategy. Our evaluation of the proposed approach shows that the distributed allocation is scalable and outperforms the static and traditional approaches.
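To make the idea of learning a bidding strategy through repeated interaction concrete, here is a minimal sketch using a proportional-share auction, in which each application receives a fraction of the resource equal to its bid divided by the sum of all bids. This mechanism is an assumption chosen for illustration; the abstract does not say which auction the paper uses. The utility `value * share - bid`, the starting bids, and the example values are likewise hypothetical.

```python
import math

def allocate(bids):
    """Proportional-share auction: each bidder gets bid / (sum of bids)."""
    total = sum(bids)
    return [b / total for b in bids]

def best_response(value, others_total, min_bid=1e-6):
    """Bid that maximizes value * share - bid when the other bids sum to others_total."""
    if others_total <= 0:
        return min_bid
    return max(min_bid, math.sqrt(value * others_total) - others_total)

def learn_bids(values, rounds=50):
    """Let each application repeatedly best-respond to the others' current bids."""
    bids = [1.0] * len(values)  # hypothetical starting bids
    for _ in range(rounds):
        for i, value in enumerate(values):
            others = sum(bids) - bids[i]
            bids[i] = best_response(value, others)
    return bids

values = [4.0, 1.0]  # hypothetical value each application places on the resource
bids = learn_bids(values)
shares = allocate(bids)
print(bids, shares)
```

In this two-application example the bids settle at a point where neither application can gain by changing its bid alone, and the application that values the resource more ends up with the larger share.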

To appear in the 35th IEEE International Conference on Computer Design (ICCD)

Network tomography using end-to-end probes provides a powerful tool for monitoring the performance of internal network elements. However, active probing can generate tremendous traffic, which degrades the overall network performance. Meanwhile, not all the probing paths contain useful information for identifying the link metrics of interest. This observation motivates us to study the optimal selection of monitoring paths to balance identifiability and probing cost. Assuming additive link metrics (e.g., delays), we consider four closely related optimization problems: 1) Max-IL-Cost that maximizes the number of identifiable links under a probing budget, 2) Max-Rank-Cost that maximizes the rank of selected paths under a probing budget, 3) Min-Cost-IL that minimizes the probing cost while preserving identifiability, and 4) Min-Cost-Rank that minimizes the probing cost while preserving rank. While (1) and (3) are hard to solve, (2) and (4) are easy to solve, and their solutions give a good approximation for (1) and (3). Specifically, we provide an optimal algorithm for (4) and a (1-1/e)-approximation algorithm for (2). We prove that the solution for (4) provides tight upper/lower bounds on the minimum cost of (3), and the solution for (2) provides upper/lower bounds on the maximum identifiability of (1). Our evaluations on real topologies show that solutions to the rank-based optimization (2, 4) have superior performance in terms of the objectives of the identifiability-based optimization (1, 3), and our solutions can reduce the total probing cost by an order of magnitude while achieving the same monitoring performance.
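One way to see why Min-Cost-Rank is tractable: choosing a cheapest set of paths that keeps the rank of the path-link matrix is a minimum-weight basis problem, which the standard matroid greedy rule solves exactly. The sketch below applies that textbook rule (scan paths from cheapest to most expensive, keep a path only if it is linearly independent of those already kept). It is offered as an illustration under that assumption and may differ from the paper's own algorithm; the four-link topology and the costs are invented.

```python
from fractions import Fraction

def reduce_row(row, basis):
    """Eliminate `row` against the stored basis rows and return the remainder."""
    row = [Fraction(x) for x in row]
    for pivot, basis_row in basis:
        if row[pivot] != 0:
            factor = row[pivot] / basis_row[pivot]
            row = [a - factor * b for a, b in zip(row, basis_row)]
    return row

def min_cost_rank(paths, costs):
    """Greedy minimum-cost subset of paths with the same rank as all paths."""
    order = sorted(range(len(paths)), key=lambda i: costs[i])
    basis = []     # (pivot column, reduced row) for every kept path
    selected = []
    for i in order:
        rest = reduce_row(paths[i], basis)
        pivot = next((j for j, x in enumerate(rest) if x != 0), None)
        if pivot is not None:  # the path adds new information, so keep it
            basis.append((pivot, rest))
            selected.append(i)
    return selected

# Hypothetical example: rows are probing paths, columns are links (1 = path uses link).
paths = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 1],  # sum of the first two paths, so it is redundant given both
    [0, 1, 1, 0],
]
costs = [2, 2, 3, 1]

selected = min_cost_rank(paths, costs)
print(selected, sum(costs[i] for i in selected))
```

Exact rational arithmetic is used so that the independence test is not affected by floating-point rounding.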

To appear in IFIP Performance 2017.

Vulnerability due to inter-connectivity of multiple networks has been observed in many complex networks. Previous works mainly focused on robust network design and on recovery strategies after sporadic or massive failures in the case of complete knowledge of failure location.

We focus on cascading failures involving the power grid and its communication network with consequent imprecision in damage assessment.

We tackle the problem of mitigating the ongoing cascading failure and providing a recovery strategy.

We propose a failure mitigation strategy in two steps: 1) Once a cascading failure is detected, we limit further propagation by redistributing the power of the generators and loads. 2) We formulate a recovery plan to maximize the total amount of power delivered to the demand loads during the recovery intervention.

Our approach to coping with insufficient knowledge of damage locations is based on a new algorithm that determines consistent failure sets (CFS). We show that, given knowledge of the system state before the disruption, the CFS algorithm can find all consistent sets of unknown failures in polynomial time, provided that each connected component of the disrupted graph has at least one line whose failure status is known to the controller.
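For readers unfamiliar with how a line outage propagates in a power grid, the following sketch computes line flows with the standard DC power flow approximation and shows one step of a cascade on a three-bus example. It does not implement the paper's mitigation or CFS algorithms. The bus numbering, reactances, injections, and capacities are invented, and the small linear solver is included only to keep the example self-contained.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def dc_power_flow(n_buses, lines, injection, slack=0):
    """lines: {(i, j): reactance}; injection: net power per bus (sums to zero)."""
    keep = [i for i in range(n_buses) if i != slack]
    idx = {bus: k for k, bus in enumerate(keep)}
    b = [[0.0] * len(keep) for _ in keep]
    for (i, j), x in lines.items():
        for u, v in ((i, j), (j, i)):
            if u in idx:
                b[idx[u]][idx[u]] += 1.0 / x
                if v in idx:
                    b[idx[u]][idx[v]] -= 1.0 / x
    angles = solve(b, [injection[i] for i in keep])
    theta = [0.0] * n_buses  # the slack bus angle stays at zero
    for bus, k in idx.items():
        theta[bus] = angles[k]
    return {(i, j): (theta[i] - theta[j]) / x for (i, j), x in lines.items()}

def overloaded(flows, capacity):
    """Lines whose absolute flow exceeds their capacity."""
    return [line for line, f in flows.items() if abs(f) > capacity[line] + 1e-9]

# Hypothetical 3-bus grid: bus 0 generates 1.5 units, bus 2 consumes 1.5 units.
lines = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0}
capacity = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 0.8}
injection = [1.5, 0.0, -1.5]

flows = dc_power_flow(3, lines, injection)
tripped = overloaded(flows, capacity)

# One cascade step: remove the tripped lines and recompute the flows.
remaining = {line: x for line, x in lines.items() if line not in tripped}
flows_after = dc_power_flow(3, remaining, injection)
tripped_after = overloaded(flows_after, capacity)
print(flows, tripped, flows_after, tripped_after)
```

In this example the direct line from bus 0 to bus 2 is overloaded first; once it trips, its flow shifts onto the other two lines and overloads both, which is the propagation that redistributing generator and load power is meant to stop.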

Check our paper in SRDS 2017:

We address progressive network recovery under uncertain knowledge of damages. We formulate the problem as a mixed integer linear program (MILP) and show that it is NP-hard. We propose an iterative stochastic recovery algorithm (ISR) to recover the network in a progressive manner to satisfy the critical services. At each optimization step, we make a decision to repair a part of the network and gather more information iteratively, until critical services are completely restored. Three different algorithms are used to find a feasible set and determine which node to repair, namely, 1) an iterative shortest path algorithm (ISR-SRT), 2) an approximate branch and bound (ISR-BB), and 3) an iterative multi-commodity LP relaxation (ISR-MULT). Further, we have modified the state-of-the-art iterative split and prune (ISP) algorithm to incorporate the uncertain failures. Our results show that ISR-BB and ISR-MULT outperform the state-of-the-art "progressive ISP" algorithm, while allowing us to configure the trade-off between execution time, number of repairs (cost), and demand loss. We show that our recovery algorithm, on average, can reduce the total number of repairs by a factor of about 3 with respect to ISP, while satisfying all critical demands.

Check our paper in IFIP Networking 2017:

The MapReduce framework is widely used to parallelize batch jobs since it exploits a high degree of multi-tasking to process them. However, it has been observed that when the number of servers increases, the map phase can take much longer than expected. This paper analytically shows that the stochastic behavior of the servers has a negative effect on the completion time of a MapReduce job, and that continuously increasing the number of servers without accurate scheduling can degrade the overall performance. We analytically model the map phase in terms of hardware, system, and application parameters to capture the effects of stragglers on the performance. Mean sojourn time (MST), the time needed to sync the completed tasks at a reducer, is introduced as a performance metric and mathematically formulated. Following that, we stochastically investigate the optimal task scheduling, which leads to an equilibrium property in a datacenter with different types of servers. Our experimental results show the performance of the different types of schedulers targeting MapReduce applications. We also show that, in the case of mixed deterministic and stochastic schedulers, there is an optimal scheduler that can always achieve the lowest MST.
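The straggler effect can be illustrated with a simple probabilistic fact: if a reducer must wait for the slowest of n parallel map tasks, and task times are independent and exponentially distributed, the expected wait is the mean task time multiplied by the n-th harmonic number, so it keeps growing as more tasks run in parallel. The sketch below checks that closed form against a small simulation. The exponential distribution and the parameter values are assumptions made for this illustration and are not the paper's model of MST.

```python
import random

def expected_slowest(n_tasks, mean_time):
    """Expected maximum of n i.i.d. exponential task times (mean * harmonic number)."""
    return mean_time * sum(1.0 / k for k in range(1, n_tasks + 1))

def simulate_slowest(n_tasks, mean_time, trials, rng):
    """Monte Carlo estimate of the time until the slowest task finishes."""
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0 / mean_time) for _ in range(n_tasks))
    return total / trials

analytic = expected_slowest(4, 1.0)
simulated = simulate_slowest(4, 1.0, 20000, random.Random(1))
print(analytic, simulated)
```

With a fixed mean task time, going from 4 to 16 parallel tasks raises the expected wait for the slowest one, which is why adding servers without adjusting the scheduling does not automatically shorten the map phase.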

GTNS is a discrete-event network simulator intended primarily for research and educational use. It is written in Visual C++ and supports different network topologies. The simulator was first built to implement the locally multipath adaptive routing (LMAR) protocol, a new reactive distance-vector routing protocol for MANETs. LMAR can find an ad hoc path free of selfish nodes and wormholes using an exhaustive search algorithm in polynomial time. When the primary path fails, it also discovers an alternative safe path, provided the network graph remains connected after the selfish and malicious nodes are eliminated. The key to finding a safe route in polynomial time is LMAR's search algorithm and flooding stage: the traffic it generates is equivalent in load to that of single-path routing protocols, yet it bypasses attacks much more effectively than other multipath routing protocols. The LMAR concept was introduced to provide the security property known as availability, and the simulator was developed to analyze its behavior in complex network environments [1]. We then added a detection mechanism to the simulator that can detect selfish nodes in the network. The proposed algorithm is resilient against collision, can be used in networks whose wireless nodes use directional antennas, and defends against attacks in which malicious nodes try to break communications by relaying packets in a specific direction. Several game-theoretic strategies for enforcing cooperation in the network have been implemented in GTNS, for example Forwarding-Ratio Strategy, TFT-Strategy, and ERTFT. This tutorial helps new users get familiar with GTNS and run different network scenarios.
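As a small illustration of how a cooperation-enforcing strategy behaves, the sketch below plays a repeated packet-forwarding game using tit-for-tat, assuming that "TFT" in the strategy list stands for tit-for-tat. It is written in Python for brevity and is not GTNS code; the function names and the always-drop neighbor are hypothetical.

```python
def tit_for_tat(neighbor_history):
    """Forward on the first round, then copy the neighbor's previous action."""
    return True if not neighbor_history else neighbor_history[-1]

def always_drop(neighbor_history):
    """A selfish neighbor that never forwards packets."""
    return False

def play(rounds, neighbor_strategy):
    """Return the forwarding decisions (True = forward) of both nodes per round."""
    my_moves, their_moves = [], []
    for _ in range(rounds):
        mine = tit_for_tat(their_moves)
        theirs = neighbor_strategy(my_moves)
        my_moves.append(mine)
        their_moves.append(theirs)
    return my_moves, their_moves

vs_selfish, _ = play(4, always_drop)
vs_cooperative, _ = play(4, tit_for_tat)
print(vs_selfish, vs_cooperative)
```

Against a selfish neighbor the node forwards once and then stops, while two tit-for-tat nodes keep forwarding for each other in every round.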

Check our published Code:
