Category: Big Data

Q Learning Based Workflow Scheduling in Hadoop

Abstract: Hadoop in the datacenter is a popular analytical platform for enterprises. Cloud vendors host Hadoop clusters in their datacenters to provide high-performance analytical computing facilities to their customers. When many concurrent users run jobs on the same clusters, scheduling must be effective enough to complete each job on time while using resources efficiently in terms of both cost and time. Workflows are repeatable patterns of dependent jobs, and they are executed in the Hadoop datacenter by allocating VMs. In our earlier papers, we proposed a mechanism to pack and execute customer jobs as workflows on the Hadoop platform; it minimizes VM cost and executes the workflow's Hadoop MapReduce jobs within their deadlines. In this paper, we propose a Q-learning-based scheduling method to optimize the use of cloud resources in workflows. Q-learning is a model-free reinforcement learning technique used to find an optimal action-selection policy for a given Markov decision process. The parameters considered for optimization are the number of VMs consumed, the bandwidth at the datacenter, and the electric power consumption.
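For readers unfamiliar with Q-learning, the standard tabular update rule can be sketched as follows. This is a generic illustration, not the scheduler proposed in the paper: the state and action encodings, the cost weights, and the function names `cost_reward`, `q_update`, and `choose_action` are all assumptions made for the example.

```python
import random
from collections import defaultdict


def cost_reward(vms, bandwidth, power, w_vm=1.0, w_bw=1.0, w_power=1.0):
    """Hypothetical reward: the negative weighted sum of the three costs
    the abstract names (VMs, bandwidth, power). The weights are assumed."""
    return -(w_vm * vms + w_bw * bandwidth + w_power * power)


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """Apply one standard tabular Q-learning update and return the new value.

    Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[(next_state, a)] for a in actions)
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]


def choose_action(q, state, actions, epsilon=0.1, rng=random):
    """Epsilon-greedy selection: explore with probability epsilon,
    otherwise pick the action with the highest current Q-value."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


# Toy usage: two hypothetical VM choices, integer states.
actions = ["small_vm", "large_vm"]
q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
q_update(q, state=0, action="small_vm", reward=-1.0, next_state=1, actions=actions)
```

Because Q-learning is model-free, the update needs only observed transitions (state, action, reward, next state); no model of the cluster's dynamics is required.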

An Optimization Algorithm for Heterogeneous Hadoop Clusters Based on Dynamic Load Balancing

Abstract: Hadoop is popular cloud computing software, and its major component, MapReduce, can efficiently complete parallel computation in a homogeneous environment. In practice, however, heterogeneous clusters are common, and in such clusters the load is prone to becoming unbalanced. To solve this problem, this paper proposes a model of a heterogeneous Hadoop cluster based on dynamic load balancing. The model starts from MapReduce and tracks node information in real time by using its monitoring module. A maximum node hit rate priority algorithm (MNHRPA) is designed and implemented; it achieves load balancing by dynamically adjusting data allocation based on each node’s computing power and load. The experimental results show that, compared with Hadoop’s default algorithm, the algorithm effectively reduces task completion time and balances the load across the cluster.

Keywords: Hadoop; heterogeneous cluster; data allocation; load balancing.
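The idea of allocating data according to each node's computing power and load can be illustrated with a simple proportional split. This is a minimal sketch and is not MNHRPA, whose details the abstract does not give: the weighting `power * (1 - load)`, the function name `allocate_blocks`, and the node figures are assumptions chosen for illustration.

```python
def allocate_blocks(total_blocks, nodes):
    """Split data blocks across nodes in proportion to spare capacity.

    nodes maps a node name to (computing_power, load), with load in [0, 1].
    The weight power * (1 - load) is an assumed heuristic, not the paper's.
    """
    weights = {name: power * (1.0 - load) for name, (power, load) in nodes.items()}
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("no node has spare capacity")

    # Ideal fractional share per node, then round down.
    exact = {name: total_blocks * w / total_weight for name, w in weights.items()}
    shares = {name: int(x) for name, x in exact.items()}

    # Hand out the blocks lost to rounding, largest remainder first,
    # so the shares always sum to total_blocks.
    leftover = total_blocks - sum(shares.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - shares[n], reverse=True)
    for name in by_remainder[:leftover]:
        shares[name] += 1
    return shares


# Hypothetical cluster: a fast but half-loaded node, a slow idle node,
# and a mid-range half-loaded node.
cluster = {"fast": (4.0, 0.5), "slow": (1.0, 0.0), "busy": (2.0, 0.5)}
shares = allocate_blocks(100, cluster)
```

Re-running the allocation as the monitored load changes gives a crude form of dynamic adjustment; a real scheduler would also have to account for data locality and in-flight tasks.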

07943366-1cqh172

Performance Modeling and Optimization of Map-Reduce Programs

Abstract: MapReduce is a developer-friendly framework that encapsulates the underlying complexities of distributed computing. It is increasingly being used across enterprises for advanced data analytics, business intelligence, and data mining tasks. Two questions, however, trouble Hadoop users: how to improve the performance of MapReduce workloads, and how to estimate the time needed to run a MapReduce job. In this paper, we present performance optimization techniques grounded in workload characterization. Once the cluster achieves its best performance, we further propose a modeling method to help Hadoop users estimate the execution time of MapReduce jobs. For evaluation, typical benchmarks are used to assess the accuracy of our techniques.

07175726-2c7ctk0

Modeling and Optimization of Map-Reduce

Abstract: The Map-Reduce framework is widely used to parallelize batch jobs since it exploits a high degree of multi-tasking to process them. However, it has been observed that when the number of mappers increases, the map phase can take much longer than expected. This paper analytically shows that the stochastic behavior of mapper nodes has a negative effect on the completion time of a Map-Reduce job, and that continuously increasing the number of mappers without accurate scheduling can degrade overall performance. We analytically capture the effects of stragglers (delayed mappers) on performance. Based on an observed delayed exponential distribution (DED) of the response time of mappers, we then model the map phase in terms of hardware, system, and application parameters. Mean sojourn time (MST), the time needed to sync the completed map tasks at one reducer, is mathematically formulated. We then optimize MST by finding the task inter-arrival time to each mapper node. The optimal mapping problem leads to an equilibrium property, which is investigated for different types of inter-arrival and service time distributions in a heterogeneous data-center (i.e., a data-center with different types of nodes). Our experimental results show the performance and important parameters of the different types of schedulers targeting Map-Reduce applications. We also show that, in the case of mixed deterministic and stochastic schedulers, there is an optimal scheduler that can always achieve the lowest MST.
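The straggler effect can be illustrated with a small calculation. The sketch below assumes that a delayed exponential response time means a fixed delay plus an exponentially distributed tail, that mappers are independent and identically distributed, and that the map phase ends when the slowest mapper finishes. Under those assumptions the expected finishing time of the slowest of n mappers is delay + H_n / rate, where H_n is the n-th harmonic number. This is not the paper's MST formulation, and the function names and parameter values are hypothetical.

```python
import random


def expected_slowest(n_mappers, delay, rate):
    """Expected maximum of n i.i.d. (delay + Exp(rate)) response times.

    For i.i.d. exponentials the expected maximum is H_n / rate, where
    H_n = 1 + 1/2 + ... + 1/n; the fixed delay simply shifts it.
    """
    harmonic = sum(1.0 / k for k in range(1, n_mappers + 1))
    return delay + harmonic / rate


def simulate_slowest(n_mappers, delay, rate, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(delay + rng.expovariate(rate) for _ in range(n_mappers))
    return total / trials


# The expected wait for the slowest mapper keeps growing with n,
# even though every individual mapper has the same mean response time.
for n in (1, 4, 16, 64):
    print(n, round(expected_slowest(n, delay=2.0, rate=0.5), 3))
```

Since H_n grows roughly like ln(n), the wait for the slowest mapper rises slowly but without bound as mappers are added, which is consistent with the abstract's observation that adding mappers without careful scheduling can lengthen the map phase.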

[Tech Report] [Master Thesis] [IEEE Trans]

Last version > MapReduce_Performance_Optimization


Big Data Deep Learning: Challenges and Perspectives

Abstract: Deep learning is currently an extremely active research area in the machine learning and pattern recognition community. It has achieved great success in a broad range of applications such as speech recognition, computer vision, and natural language processing. With the sheer size of data available today, big data brings big opportunities and transformative potential for various sectors; on the other hand, it also presents unprecedented challenges to harnessing data and information. As data keeps getting bigger, deep learning is coming to play a key role in providing big data predictive analytics solutions. In this paper, we provide a brief overview of deep learning and highlight current research efforts, the challenges posed by big data, and future trends.
