Archive of ‘MapReduce’ category

ML-NA: A Machine Learning based Node Performance Analyzer Utilizing Straggler Statistics

Abstract: Current Cloud clusters often consist of heterogeneous machine nodes, which can trigger performance challenges such as the task straggler problem, whereby a small subset of parallel tasks run abnormally slower than their sibling tasks. The straggler problem extends job response times and deteriorates system throughput. Poorly performing nodes are more likely to engender stragglers and can undermine the effectiveness of straggler mitigation. For example, speculative execution, the dominant mechanism for straggler alleviation, works by creating redundant task replicas on other machine nodes as soon as a straggler is detected. When speculative copies are assigned to poorly performing nodes, they can hardly catch up with the stragglers, unlike replicas run on fast nodes. Because performance heterogeneity is caused not only by variations in static attributes such as physical capacity, but also by fluctuations in dynamic characteristics such as contention level, analyzing node performance is important yet challenging. In this paper we develop ML-NA, a Machine Learning based Node performance Analyzer. By leveraging historical parallel-task execution log data, ML-NA classifies cluster nodes into different categories and predicts their performance in the near future, serving as a scheduling guide to improve speculation effectiveness and minimize straggler generation. We consider MapReduce as a representative framework for our analysis, and use the published OpenCloud trace as a case study to train and evaluate our model. Results show that ML-NA can predict node performance categories with an average accuracy of up to 92.86%.

Keywords—Node Performance, Straggler Problem, Machine Learning, Prediction.
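The paper's pipeline is not reproduced here, but the core idea, classifying nodes into performance categories from features aggregated out of task execution logs, can be sketched in a few lines. The features, category labels, and choice of a random forest below are illustrative assumptions, not ML-NA's actual design:

```python
# Minimal sketch (not the paper's pipeline): classify nodes into performance
# categories from per-node features aggregated out of historical task logs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-node features:
# [mean task duration (s), straggler ratio, duration variance, tasks/hour]
X = np.array([
    [42.0, 0.02,  9.5, 120.0],   # historically fast, stable nodes
    [45.0, 0.04, 12.1, 110.0],
    [88.0, 0.21, 55.3,  60.0],   # historically slow, contended nodes
    [91.0, 0.25, 61.0,  55.0],
])
y = np.array([0, 0, 1, 1])       # 0 = "fast" category, 1 = "slow" category

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Predict the near-future category of a node from its latest log window,
# e.g. to steer speculative copies away from likely-slow nodes.
new_node = np.array([[85.0, 0.18, 50.0, 62.0]])
print("predicted category:", clf.predict(new_node)[0])   # -> 1 ("slow")
```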

Fluid Petri Nets for the Performance Evaluation of Map-Reduce and Spark Applications

ABSTRACT: 

Big Data applications make it possible to analyze large amounts of data, not necessarily structured, though at the same time they present new challenges. For example, predicting the performance of frameworks such as Hadoop and Spark can be a costly task, hence the need for models that can provide valuable support to designers and developers. Big Data systems are becoming a central force in society, and the use of models can also enable the development of intelligent systems that provide Quality of Service (QoS) guarantees to their users through runtime system reconfiguration. This paper contributes a novel modeling approach based on fluid Petri nets for predicting the execution time of MapReduce and Spark applications, one that is suitable for runtime performance prediction. The models have been validated by an extensive experimental campaign performed at CINECA, the Italian supercomputing center, and on the Microsoft Azure HDInsight data platform. Results show that, on average, predictions are within about 9.5% of the actual measurements for MapReduce and about 10% for Spark.
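The fluid Petri net models themselves are beyond the scope of an abstract, but the underlying intuition, approximating many discrete tasks as continuous work drained by a fixed pool of slots, can be shown with a deliberately simplified sketch. The function and numbers below are illustrative and are not the paper's model:

```python
# Minimal fluid-style approximation of a MapReduce stage (not the paper's
# fluid Petri net model): tasks are treated as continuous work drained in
# parallel by a fixed number of slots at the mean task service rate.
def stage_time(n_tasks: int, n_slots: int, mean_task_s: float) -> float:
    """Approximate stage completion time in seconds."""
    total_work = n_tasks * mean_task_s          # total sequential work
    return total_work / min(n_slots, n_tasks)   # drained in parallel

# Example: 400 map tasks on 80 map slots at 30 s each,
# then 64 reduce tasks on 32 reduce slots at 45 s each.
t_map = stage_time(400, 80, 30.0)
t_reduce = stage_time(64, 32, 45.0)
print(f"estimated job time ~ {t_map + t_reduce:.0f} s")   # ~240 s
```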

PBSE: A Robust Path-Based Speculative Execution for Degraded-Network Tail Tolerance in Data-Parallel Frameworks

Abstract: We reveal loopholes in Speculative Execution (SE) implementations under a unique fault model: node-level network throughput degradation. This problem appears in many data-parallel frameworks such as Hadoop MapReduce and Spark. To address it, we present PBSE, a robust, path-based speculative execution that employs three key ingredients: path progress, path diversity, and path-straggler detection and speculation. We show how PBSE is superior to other approaches, such as cloning and aggressive speculation, under the aforementioned fault model. PBSE is a general solution, applicable to many data-parallel frameworks such as Hadoop/HDFS+QFS, Spark, and Flume.
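As a rough illustration of the path-straggler idea (not the PBSE implementation), the sketch below flags data-flow paths whose progress rate falls far below the median of their siblings; the threshold and path names are hypothetical:

```python
# Toy path-straggler detector: a task whose data-flow path is much slower
# than sibling paths is a candidate for speculation on a disjoint path.
from statistics import median

def find_path_stragglers(path_rates: dict[str, float],
                         slow_factor: float = 0.5) -> list[str]:
    """Return paths whose progress rate is below slow_factor * median."""
    med = median(path_rates.values())
    return [p for p, rate in path_rates.items() if rate < slow_factor * med]

# Example: per-path progress rates (MB/s) observed for one task.
rates = {"nodeA->nodeB": 95.0, "nodeA->nodeC": 90.0, "nodeA->nodeD": 12.0}
for path in find_path_stragglers(rates):
    print(f"speculate on a path disjoint from {path}")
```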

 

Performance Modeling and Optimization of Map-Reduce Programs

Abstract: 
MapReduce is a developer-friendly framework that encapsulates the underlying complexities of distributed computing. It is increasingly being used across enterprises for advanced data analytics, business intelligence, and data mining tasks. Two questions, however, trouble Hadoop users: how to improve the performance of MapReduce workloads, and how to estimate the time needed to run a MapReduce job. In this paper, we provide performance optimization techniques based on workload characterization. Once the cluster achieves its best performance, we further propose a modeling method to help Hadoop users estimate the execution time of MapReduce jobs. For evaluation, typical benchmarks are used to assess the accuracy of our techniques.
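The abstract does not detail the paper's modeling method; a common style of such estimators, shown below purely as a hedged sketch, counts scheduling "waves": a stage with n tasks on s slots needs roughly ceil(n/s) rounds of the average task duration:

```python
# Hedged wave-based job-time estimate (a common modeling style; not the
# paper's actual model). Each stage runs in ceil(tasks / slots) waves.
import math

def job_time(n_maps: int, map_slots: int, avg_map_s: float,
             n_reduces: int, reduce_slots: int, avg_reduce_s: float) -> float:
    map_waves = math.ceil(n_maps / map_slots)
    reduce_waves = math.ceil(n_reduces / reduce_slots)
    return map_waves * avg_map_s + reduce_waves * avg_reduce_s

# Example: 300 maps on 100 slots at 40 s, 20 reduces on 20 slots at 120 s.
print(f"estimated: {job_time(300, 100, 40, 20, 20, 120):.0f} s")  # 3*40 + 1*120
```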


Modeling and Optimization of Map-Reduce

ABSTRACT

The Map-Reduce framework is widely used to parallelize batch jobs, since it exploits a high degree of multi-tasking to process them. However, it has been observed that when the number of mappers increases, the map phase can take much longer than expected. This paper shows analytically that the stochastic behavior of mapper nodes has a negative effect on the completion time of a Map-Reduce job, and that continuously increasing the number of mappers without accurate scheduling can degrade overall performance. We analytically capture the effects of stragglers (delayed mappers) on performance. Based on an observed delayed exponential distribution (DED) of mapper response times, we then model the map phase in terms of hardware, system, and application parameters. The mean sojourn time (MST), the time needed to sync the completed map tasks at one reducer, is formulated mathematically. We then optimize MST by finding the optimal task inter-arrival time to each mapper node. The optimal mapping problem leads to an equilibrium property, which we investigate for different types of inter-arrival and service time distributions in a heterogeneous data center (i.e., a data center with different types of nodes). Our experimental results show the performance and important parameters of different types of schedulers targeting Map-Reduce applications. We also show that, in the case of mixed deterministic and stochastic schedulers, there is an optimal scheduler that can always achieve the lowest MST.
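Assuming "delayed exponential distribution" denotes a shifted exponential (a plausible reading of the name, not confirmed by the abstract), a mapper's response time T with delay d and rate \lambda would have:

```latex
% Shifted (delayed) exponential: no completion before the delay d,
% exponential decay at rate \lambda afterwards.
f_T(t) =
  \begin{cases}
    \lambda\, e^{-\lambda (t - d)}, & t \ge d,\\
    0,                              & t < d,
  \end{cases}
\qquad
\mathbb{E}[T] = d + \frac{1}{\lambda}.
```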

[Tech Report] [Master Thesis] [IEEE Trans]

Latest version: MapReduce_Performance_Optimization

 

SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters

As a widely used parallel computing framework for big data processing today, the Hadoop MapReduce framework puts more emphasis on high throughput of data than on low latency of job execution. However, more and more big data applications developed with MapReduce now require quick response times. As a result, improving the performance of MapReduce jobs, especially short jobs, is of great significance in practice and has attracted increasing attention from both academia and industry. Many efforts have been made to improve the performance of Hadoop at the job-scheduling or job-parameter-optimization level. In this paper, we explore an approach that improves the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First, by analyzing the job and task execution mechanism in the MapReduce framework, we reveal two critical limitations on job execution performance. Then we propose two major optimizations to the MapReduce job and task execution mechanisms: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time spent in the initialization and termination stages of the job; second, instead of adopting the loose heartbeat-based communication mechanism to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging communication mechanism to accelerate performance-sensitive task scheduling and execution. Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that aims to shorten the execution time of MapReduce jobs, especially short jobs. Experimental results show that, compared to standard Hadoop, SHadoop achieves a stable performance improvement of around 25% on average on comprehensive benchmarks, without losing scalability or speedup. Our optimization work has passed a production-level test at Intel and has been integrated into the Intel Distributed Hadoop (IDH). To the best of our knowledge, this work is the first effort to explore optimizing the execution mechanism inside the map/reduce tasks of a job. The advantage is that it can complement job-scheduling optimizations to further improve job execution performance.
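To see why replacing heartbeat polling with push messaging matters for short jobs, consider the toy comparison below. The 3-second heartbeat period is typical of classic Hadoop, but the code is an illustration of the latency argument, not SHadoop's implementation:

```python
# Toy contrast of heartbeat polling vs. instant (push) messaging for task
# assignment. Names and numbers are illustrative, not Hadoop internals.
import math

HEARTBEAT_INTERVAL_S = 3.0   # classic JobTracker/TaskTracker polling period

def assignment_latency_polling(ready_at_s: float) -> float:
    """Task waits until the next heartbeat after it becomes ready."""
    next_beat = math.ceil(ready_at_s / HEARTBEAT_INTERVAL_S) * HEARTBEAT_INTERVAL_S
    return next_beat - ready_at_s

def assignment_latency_push(rtt_s: float = 0.001) -> float:
    """Task is pushed to a worker immediately; latency ~ one network RTT."""
    return rtt_s

# A task ready at t = 1.0 s waits 2.0 s under polling (up to 3.0 s worst case),
# but only about one RTT under push messaging.
print(f"polling: {assignment_latency_polling(1.0):.1f} s")
print(f"push:    ~{assignment_latency_push() * 1000:.0f} ms")
```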

Efficient big data processing in Hadoop MapReduce

This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data processing engine for big data is Hadoop MapReduce. Early versions of Hadoop MapReduce suffered from severe performance problems; today, this is becoming history. There are many techniques that can be used with Hadoop MapReduce jobs to boost performance by orders of magnitude, and in this tutorial we teach such techniques. First, we briefly familiarize the audience with Hadoop MapReduce and motivate its use for big data processing. Then, we focus on different data management techniques, ranging from job optimization to physical data organization, such as data layouts and indexes. Throughout the tutorial, we highlight the similarities and differences between Hadoop MapReduce and parallel DBMSs. Furthermore, we point out unresolved research problems and open issues.
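As one concrete flavor of the layout/index techniques such tutorials cover (the example itself is ours, not taken from the tutorial), split-level min/max statistics let a job skip input splits that cannot match a query predicate:

```python
# Toy split-level min/max "index": skip input splits whose key range
# cannot overlap the query predicate, so less data is read and mapped.
splits = [
    {"path": "part-0000", "min_key": 0,    "max_key": 999},
    {"path": "part-0001", "min_key": 1000, "max_key": 1999},
    {"path": "part-0002", "min_key": 2000, "max_key": 2999},
]

def splits_to_scan(lo: int, hi: int) -> list[str]:
    """Return only the splits whose [min_key, max_key] overlaps [lo, hi]."""
    return [s["path"] for s in splits
            if s["max_key"] >= lo and s["min_key"] <= hi]

print(splits_to_scan(1500, 1600))   # ['part-0001']: two of three splits skipped
```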