Performance and Energy Efficiency of Big Data Systems: Characterization, Implication and Improvement

ABSTRACT
Huge volumes of data are produced by applications worldwide, and processing data at such scale poses great challenges not only to performance but also to energy efficiency. Researchers have proposed various techniques to improve either performance or energy efficiency, yet the techniques for these two goals differ significantly. When both performance and energy efficiency matter in big data systems, striking a balance between them becomes a pressing and challenging problem for data center administrators and hardware designers. In this paper, we conduct comprehensive evaluations on two representative platforms with different types of processors. We quantify performance and energy efficiency, relating the evaluation results to micro-architectural activities and application characteristics. Our evaluations yield two interesting findings: (1) performance and energy efficiency are determined not only by the hardware technology but also by the application characteristics; (2) no microprocessor is consistently superior in both performance and energy efficiency across all big data workloads. Based on these findings and the quantified evaluation results, we provide guidance and implications for both data center administrators and big data system designers, and we argue that a hybrid-core design is an efficient way to improve the energy efficiency of big data systems with minimal performance degradation.


PBSE: A Robust Path-Based Speculative Execution for Degraded-Network Tail Tolerance in Data-Parallel Frameworks

Abstract: We reveal loopholes of Speculative Execution (SE) implementations under a unique fault model: node-level network throughput degradation. This problem appears in many data-parallel frameworks such as Hadoop MapReduce and Spark. To address this, we present PBSE, a robust, path-based speculative execution that employs three key ingredients: path progress, path diversity, and path-straggler detection and speculation. We show how PBSE is superior to other approaches such as cloning and aggressive speculation under the aforementioned fault model. PBSE is a general solution, applicable to many data-parallel frameworks such as Hadoop/HDFS+QFS, Spark and Flume.
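
A hedged sketch of the path-straggler idea (a minimal illustration, not PBSE's actual implementation): the path-progress and path-diversity ingredients suggest comparing each data-transfer path's observed throughput against its peers and flagging any path far below the median, so speculation can route around a degraded NIC. All class and method names here are hypothetical.

    import java.util.*;

    public class PathStragglerDetector {
        /** A data-transfer path with its observed throughput in MB/s. */
        public record Path(String id, double throughputMBps) {}

        /** Flags paths slower than threshold * median of all observed paths. */
        public static List<Path> findStragglers(List<Path> paths, double threshold) {
            if (paths.size() < 2) return List.of(); // no peers to compare against
            double[] rates = paths.stream().mapToDouble(Path::throughputMBps).sorted().toArray();
            double median = rates[rates.length / 2];
            List<Path> stragglers = new ArrayList<>();
            for (Path p : paths) {
                if (p.throughputMBps() < threshold * median) stragglers.add(p);
            }
            return stragglers;
        }

        public static void main(String[] args) {
            List<Path> paths = List.of(
                    new Path("dn1->dn2", 95.0),
                    new Path("dn1->dn3", 102.0),
                    new Path("dn1->dn4", 4.0)); // degraded NIC: the path straggler
            // A speculative copy would then be scheduled over a different path.
            System.out.println(findStragglers(paths, 0.5));
        }
    }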

 

Q Learning Based Workflow Scheduling in Hadoop

Abstract: Hadoop is a popular analytical platform for enterprises, and cloud vendors host Hadoop clusters in their datacenters to provide high-performance analytical computing facilities to their customers. When many concurrent users run their jobs on such clusters, scheduling must be effective so that jobs complete on time while resources are used efficiently, with sound cost and time management. Workflows are repeatable patterns of interdependent jobs; they are executed in the Hadoop datacenter by allocating VMs. In our earlier papers, we proposed a mechanism to pack and execute customer jobs as workflows on the Hadoop platform, minimizing VM cost while executing the workflow's Hadoop-MapReduce jobs within the deadline. In this paper, we propose a Q-learning-based scheduling method to optimize cloud resources for workflows. Q-learning is a model-free reinforcement learning technique used to find an optimal action-selection policy for a given Markov decision process. The parameters considered for optimization are the VMs consumed, the bandwidth at the data center, and the electric power consumption.
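
Since the abstract names standard Q-learning, a minimal tabular sketch of the update rule it relies on follows; the state/action encoding and the reward weights over VMs, bandwidth, and power are illustrative assumptions, not the paper's formulation.

    import java.util.Random;

    public class QSchedulerSketch {
        static final int STATES = 4;   // hypothetical coarse cluster-load levels
        static final int ACTIONS = 3;  // hypothetical VM-allocation choices
        static final double ALPHA = 0.1, GAMMA = 0.9, EPSILON = 0.1;

        final double[][] q = new double[STATES][ACTIONS];
        final Random rng = new Random(42);

        /** Epsilon-greedy action selection over the current Q-table. */
        int chooseAction(int s) {
            if (rng.nextDouble() < EPSILON) return rng.nextInt(ACTIONS);
            int best = 0;
            for (int a = 1; a < ACTIONS; a++) if (q[s][a] > q[s][best]) best = a;
            return best;
        }

        /** Standard update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)). */
        void update(int s, int a, double reward, int sNext) {
            double maxNext = q[sNext][0];
            for (int a2 = 1; a2 < ACTIONS; a2++) maxNext = Math.max(maxNext, q[sNext][a2]);
            q[s][a] += ALPHA * (reward + GAMMA * maxNext - q[s][a]);
        }

        /** Hypothetical reward: penalize VMs consumed, bandwidth, and power. */
        static double reward(double vms, double bandwidthGbps, double powerKw) {
            return -(1.0 * vms + 0.5 * bandwidthGbps + 0.8 * powerKw);
        }

        public static void main(String[] args) {
            QSchedulerSketch agent = new QSchedulerSketch();
            int s = 1, a = agent.chooseAction(s);
            agent.update(s, a, reward(3, 10, 2.5), 2); // one simulated scheduling step
            System.out.println(agent.q[s][a]);         // -1.0 after the first update
        }
    }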

An Optimization Algorithm for Heterogeneous Hadoop Clusters Based on Dynamic Load Balancing

Abstract: Hadoop is a popular cloud computing framework, and its major component, MapReduce, can efficiently perform parallel computation in homogeneous environments. In practice, however, heterogeneous clusters are common, and in such settings load imbalance is likely. To solve this problem, this paper proposes a model for heterogeneous Hadoop clusters based on dynamic load balancing. The model starts from MapReduce and tracks node information in real time through a monitoring module. A maximum node hit rate priority algorithm (MNHRPA) is designed and implemented; it achieves load balancing by dynamically adjusting data allocation according to each node's computing power and load. The experimental results show that, compared with Hadoop's default algorithm, MNHRPA effectively reduces task completion time and balances the cluster's load.

Keywords: Hadoop; heterogeneous cluster; data allocation; load balancing.
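
The abstract does not define the hit-rate metric, so the following is only a hedged sketch of the dynamic-allocation idea: score each node from its computing power and current load, and give the next data block to the highest-scoring node. The score formula and all names are hypothetical.

    import java.util.*;

    public class LoadBalancerSketch {
        static class Node {
            final String name;
            final double computePower; // relative capability, e.g., from benchmarking
            double load;               // blocks/tasks currently assigned
            Node(String name, double computePower) { this.name = name; this.computePower = computePower; }
            double score() { return computePower / (1.0 + load); } // hypothetical priority
        }

        /** Greedily assigns each block to the currently highest-priority node. */
        static void assign(List<Node> nodes, int blocks) {
            for (int b = 0; b < blocks; b++) {
                Node best = Collections.max(nodes, Comparator.comparingDouble(Node::score));
                best.load += 1;
                System.out.printf("block %d -> %s%n", b, best.name);
            }
        }

        public static void main(String[] args) {
            // The fast node ends up with roughly four times the slow node's share.
            assign(List.of(new Node("fast", 4.0), new Node("slow", 1.0)), 5);
        }
    }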


The anatomy of big data computing

Abstract

Advances in information technology and its widespread growth in several areas of business, engineering, medical, and scientific studies are resulting in an information/data explosion. Knowledge discovery and decision-making from such rapidly growing voluminous data is a challenging task in terms of data organization and processing, an emerging trend known as big data computing: a new paradigm that combines large-scale compute, new data-intensive techniques, and mathematical models to build data analytics. Big data computing demands huge storage and compute capacity for data curation and processing, which can be delivered from on-premises or cloud infrastructures. This paper discusses the evolution of big data computing, the differences between traditional data warehousing and big data, a taxonomy of big data computing and its underpinning technologies, the integrated platform of big data and clouds known as big data clouds, the layered architecture and components of a big data cloud, and finally open technical challenges and future directions.

Performance Modeling and Optimization of Map-Reduce Programs

Abstract: 
MapReduce is a developer-friendly framework that encapsulates the underlying complexities of distributed computing. It is increasingly used across enterprises for advanced data analytics, business intelligence, and data mining tasks. Two questions, however, trouble Hadoop users: how to improve the performance of MapReduce workloads, and how to estimate the time needed to run a MapReduce job. In this paper, we provide performance optimization techniques grounded in workload characterization. Once the cluster achieves its best performance, we further propose a modeling method that helps Hadoop users estimate the execution time of MapReduce jobs. For evaluation, typical benchmarks are used to assess the accuracy of our techniques.
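
As a flavor of what such an execution-time model can look like (a back-of-the-envelope sketch under stated assumptions, not the paper's model): approximate job time by the number of map and reduce "waves" the cluster's slots allow, with per-task durations measured on a sample run.

    public class JobTimeEstimator {
        /**
         * @param mapTasks    number of map tasks (input splits)
         * @param reduceTasks number of reduce tasks
         * @param mapSlots    map slots available cluster-wide
         * @param reduceSlots reduce slots available cluster-wide
         * @param tMapSec     average map task duration from a sample run
         * @param tReduceSec  average reduce task duration from a sample run
         * @return estimated job time in seconds, ignoring shuffle overlap
         */
        public static double estimateSeconds(int mapTasks, int reduceTasks,
                                             int mapSlots, int reduceSlots,
                                             double tMapSec, double tReduceSec) {
            double mapWaves = Math.ceil((double) mapTasks / mapSlots);
            double reduceWaves = Math.ceil((double) reduceTasks / reduceSlots);
            return mapWaves * tMapSec + reduceWaves * tReduceSec;
        }

        public static void main(String[] args) {
            // e.g., 200 maps over 50 slots = 4 waves of 30 s, plus 1 wave of 60 s reduces
            System.out.println(estimateSeconds(200, 20, 50, 20, 30, 60)); // 180.0
        }
    }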


Modeling and Optimization of Map-Reduce

ABSTRACT

The Map-Reduce framework is widely used to parallelize batch jobs, since it exploits a high degree of multi-tasking to process them. However, it has been observed that when the number of mappers increases, the map phase can take much longer than expected. This paper analytically shows that the stochastic behavior of mapper nodes has a negative effect on the completion time of a Map-Reduce job, and that continuously increasing the number of mappers without accurate scheduling can degrade the overall performance. We analytically capture the effects of stragglers (delayed mappers) on the performance. Based on an observed delayed exponential distribution (DED) of the response time of mappers, we then model the map phase by means of hardware, system, and application parameters. The mean sojourn time (MST), the time needed to sync the completed map tasks at one reducer, is mathematically formulated. Following that, we optimize MST by finding the optimal task inter-arrival time to each mapper node. The optimal mapping problem leads to an equilibrium property investigated for different types of inter-arrival and service time distributions in a heterogeneous data-center (i.e., a data-center with different types of nodes). Our experimental results show the performance and important parameters of the different types of schedulers targeting Map-Reduce applications. We also show that, in the case of mixed deterministic and stochastic schedulers, there is an optimal scheduler that can always achieve the lowest MST.
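
One plausible reading of the delayed exponential distribution (DED) named above, hedged because the paper's exact form is not reproduced here, is a shifted exponential: a mapper never responds before some fixed delay d (setup, scheduling), after which the remaining time is memoryless with rate \lambda:

    f(t) = \lambda e^{-\lambda (t - d)}  for t \ge d,    f(t) = 0 otherwise,

giving mean response time E[T] = d + 1/\lambda and variance 1/\lambda^2. An MST formulation then aggregates such per-mapper response times, and the slowest of m independent draws (the straggler) grows with m, which is consistent with the abstract's claim that adding mappers without accurate scheduling can lengthen the map phase.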


 

Big Data Deep Learning: Challenges and Perspectives

Abstract:
Deep learning is currently an extremely active research area in the machine learning and pattern recognition community. It has achieved huge success in a broad range of applications such as speech recognition, computer vision, and natural language processing. With the sheer size of data available today, big data brings big opportunities and transformative potential for various sectors; on the other hand, it also presents unprecedented challenges to harnessing data and information. As the data keeps getting bigger, deep learning is coming to play a key role in providing big data predictive analytics solutions. In this paper, we provide a brief overview of deep learning and highlight current research efforts, the challenges posed by big data, and future trends.

SHadoop: Improving MapReduce performance by optimizing job execution mechanism in Hadoop clusters

As a widely used parallel computing framework for big data processing today, the Hadoop MapReduce framework puts more emphasis on high throughput of data than on low latency of job execution. However, more and more big data applications developed with MapReduce require quick response times. As a result, improving the performance of MapReduce jobs, especially short jobs, is of great practical significance and has attracted growing attention from both academia and industry. Many efforts have been made to improve Hadoop's performance at the job scheduling or job parameter optimization level. In this paper, we explore an approach that improves the performance of the Hadoop MapReduce framework by optimizing the job and task execution mechanism. First, by analyzing the job and task execution mechanism in the MapReduce framework, we reveal two critical limitations on job execution performance. We then propose two major optimizations: first, we optimize the setup and cleanup tasks of a MapReduce job to reduce the time spent in the job's initialization and termination stages; second, instead of the loose heartbeat-based communication mechanism used to transmit all messages between the JobTracker and TaskTrackers, we introduce an instant messaging mechanism for performance-sensitive task scheduling and execution (sketched below). Finally, we implement SHadoop, an optimized and fully compatible version of Hadoop that shortens the execution time of MapReduce jobs, especially short jobs. Experimental results show that, compared with standard Hadoop, SHadoop achieves a stable performance improvement of around 25% on average on comprehensive benchmarks, without losing scalability or speedup. Our optimization work has passed a production-level test at Intel and has been integrated into the Intel Distributed Hadoop (IDH). To the best of our knowledge, this is the first work that explores optimizing the execution mechanism inside the map/reduce tasks of a job; its advantage is that it complements job scheduling optimizations to further improve job execution performance.
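
A toy model (not Hadoop's internal API) of why the second optimization matters: with heartbeat polling every H seconds, an event waits on average H/2 before the JobTracker learns of it, and a short job pays that cost many times; a push-style instant message removes the wait. The event count and interval below are illustrative assumptions.

    public class HeartbeatVsPush {
        /** Expected scheduling delay per event under heartbeat polling. */
        static double pollingDelaySec(double heartbeatIntervalSec) {
            return heartbeatIntervalSec / 2.0; // event arrives uniformly within the interval
        }

        public static void main(String[] args) {
            double h = 3.0;        // e.g., a 3 s heartbeat interval
            int eventsPerJob = 10; // e.g., task launch/finish notifications on the critical path
            System.out.printf("polling adds ~%.1f s per short job; push adds ~0 s%n",
                    pollingDelaySec(h) * eventsPerJob);
        }
    }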

Efficient big data processing in Hadoop MapReduce

This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data processing engine for big data is Hadoop MapReduce. Early versions of Hadoop MapReduce suffered from severe performance problems. Today, this is becoming history. There are many techniques that can be used with Hadoop MapReduce jobs to boost performance by orders of magnitude. In this tutorial we teach such techniques. First, we will briefly familiarize the audience with Hadoop MapReduce and motivate its use for big data processing. Then, we will focus on different data management techniques, going from job optimization to physical data organization like data layouts and indexes. Throughout this tutorial, we will highlight the similarities and differences between Hadoop MapReduce and Parallel DBMS. Furthermore, we will point out unresolved research problems and open issues.
