Traditional resource management systems rely on a centralized approach to manage users running on each resource. The centralized resource management system is not scalable for large-scale servers as the number of users running on shared resources is increasing dramatically and the centralized manager may not have enough information about applications’ need. In this paper we propose a distributed game-theoretic resource management approach using market auction mechanism to and optimal strategy in a resource competition game. The applications learn through repeated interactions to choose their action on choosing the shared resources. Specially, we look into two case studies of cache competition game and main processor and co-processor congestion game. We enforce costs for each resource and derive bidding strategy. Accurate evaluation of the proposed approach show that our distributed allocation is scalable and outperforms the static and traditional approaches.
To appear in the 35th IEEE International Conference on Computer Design (ICCD)
Network tomography using end-to-end probes provides a powerful tool for monitoring the performance of internal network elements. However, active probing can generate tremendous traffic, which degrades the overall network performance. Meanwhile, not all the probing paths contain useful information for identifying the link metrics of interest. This observation motivates us to study the optimal selection of monitoring paths to balance identifiability and probing cost. Assuming additive link metrics (e.g., delays), we consider four closely-related optimization problems: 1) Max-IL-Cost that maximizes the number of identifiable links under a probing budget, 2) Max-Rank-Cost that maximizes the rank of selected paths under a probing budget, 3) Min-Cost-IL that minimizes the probing cost while preserving identifiability, and 4) Min-Cost-Rank that minimizes the probing cost while preserving rank. While (1) and (3) are hard to solve, (2) and (4) are easy to solve, and the solutions give a good approximation for (1) and (3). Specifically, we provide an optimal algorithm for (4) and a (1-1/e)-approximation algorithm for (2). We prove that the solution for (4) provides tight upper/lower bounds on the minimum cost of (3), and the solution for (2) provides upper/lower bounds on the maximum identifiability of (1). Our evaluations on real topologies show that solutions to the rank-based optimization (2, 4) have superior performance in terms of the objectives of the identifiability-based optimization (1, 3), and our solutions can reduce the total probing cost by an order of magnitude while achieving the same monitoring performance.
To appear in IFIP Performance 2017 [PDF].
Vulnerability due to inter-connectivity of multiple networks has been observed in many complex networks. Previous works mainly focused on robust network design and on recovery strategies after sporadic or massive failures in the case of complete knowledge of failure location.
We focus on cascading failures involving the power grid and its communication network with consequent imprecision in damage assessment.
We tackle the problem of mitigating the ongoing cascading failure and providing a recovery strategy.
We propose a failure mitigation strategy in two steps: 1) Once a cascading failure is detected, we limit further propagation by re-distributing the generator and load’s power. 2) We formulate a recovery plan to maximize the total amount of power delivered to the demand loads during the recovery intervention.
Our approach to cope with insufficient knowledge of damage locations is based on the use of a new algorithm to determine consistent failure sets (CFS). We show that, given knowledge of the system state before the disruption, the CFS algorithm can find all consistent sets of unknown failures in polynomial time provided that, each connected component of the disrupted graph has at least one line whose failure status is known to the controller.
Check our paper in SRDS 2017:
Download our paper her
We address progressive network recovery under uncertain knowledge of damages. We formulate the problem as a mixed integer linear programming (MILP), and show that it is NP-Hard. We propose an iterative stochastic recovery algorithm (ISR) to recover the network in a progressive manner to satisfy the critical services. At each optimization step, we make a decision to repair a part of the network and gather more information iteratively, until critical services are completely restored. Three different algorithms are used to find a feasible set and determine which node to repair, namely, 1) an iterative shortest path algorithm (ISR-SRT), 2) an approximate branch and bound (ISR-BB) and 3) an iterative multi-commodity LP relaxation (ISR-MULT). Further, we have modified the state-ofthe-art iterative split and prune (ISP) algorithm to incorporate the uncertain failures. Our results show that ISR-BB and ISR-MULT outperform the state-of-the-art ”progressive ISP” algorithm while we can configure our choice of trade-off between the execution time, number of repairs (cost) and the demand loss. We show that our recovery algorithm, on average, can reduce the total number of repairs by a factor of about 3 with respect to ISP, while satisfying all critical demands.
Check our paper in IFIP Networking 2017:
Download our paper here
MapReduce framework is widely used to parallelize batch jobs since it exploits a high degree of multi-tasking to process them. However, it has been observed that when the number of servers increases, the map phase can take much longer than expected. This paper analytically shows that the stochastic behavior of the servers has a negative effect on the completion time of a MapReduce job, and continuously increasing the number of servers without accurate scheduling can degrade the overall performance. We analytically model the map phase in terms of hardware, system, and application parameters to capture the effects of stragglers on the performance. Mean sojourn time (MST), the time needed to sync the completed tasks at a reducer, is introduced as a performance metric and mathematically formulated. Following that, we stochastically investigate the optimal task scheduling which leads to an equilibrium property in a datacenter with different types of servers. Our experimental results show the performance of the different types of schedulers targeting MapReduce applications. We also show that, in the case of mixed deterministic and stochastic schedulers, there is an optimal scheduler that can always achieve the lowest MST.
Download our journal paper here
Download our technical report here