The anatomy of big data computing

categories: Distributed Computing, Heterogeneous Cluster

Abstract

Advances in information technology and its widespread growth in several areas of business, engineering, medical, and scientific studies are resulting in information/data explosion. Knowledge discovery and decision-making from such rapidly growing voluminous data are a challenging task in terms of data organization and processing, which is an emerging trend known as big data computing, a new paradigm that combines large-scale compute, new data-intensive techniques, and mathematical models to build data analytics. Big data computing demands a huge storage and computing for data cu-ration and processing that could be delivered from on-premise or clouds infrastructures. This paper discusses the evolution of big data computing, differences between traditional data warehousing and big data, taxonomy of big data computing and underpinning technologies, integrated platform of big data and clouds known as big data clouds, layered architecture and components of big data cloud, and finally open-technical challenges and future directions.

20
Oct
2014

Modeling and Optimization of Map-Reduce

categories: Big Data, Distributed Computing, Hadoop, Heterogeneous Cluster, MapReduce, Modeling and Optimization

ABSTRACT

Map-Reduce framework is widely used to parallelize batch jobs since it exploits a high degree of multi-tasking to process them. However, it has been observed that when the number of mappers increases, the map phase can take much longer than expected. This paper analytically shows that stochastic behavior of mapper nodes has a negative effect on the completion time of a Map-Reduce job, and continuously increasing the number of mappers without accurate scheduling can degrade the overall performance. We analytically capture the effects of stragglers (delayed mappers) on the performance. Based on an observed delayed exponential distribution (DED) of the response time of mappers, we then model the map phase by means of hardware, system, and application parameters. Mean sojourn time (MST), the time needed to sync the completed map tasks at one reducer, is mathematically formulated. Following that, we optimize MST by finding the task inter-arrival time to each mapper node. The optimal mapping problem leads to an equilibrium property investigated for different types of inter-arrival and service time distributions in a heterogeneous data-center (i.e., a data-center with different types of nodes). Our experimental results show the performance and important parameters of the different types of schedulers targeting Map-Reduce applications. We also show that, in the case of mixed deterministic and stochastic schedulers, there is an optimal scheduler that can always achieve the lowest MST.

[Tech Report] [Master Thesis] [IEEE Trans]

Last version > MapReduce_Performance_Optimization

16
May
2014

Big Data Deep Learning: Challenges and Perspectives

categories: Big Data, Distributed Computing, Machine Learning

Abstract:
Deep learning is currently an extremely active research area in machine learning and pattern recognition society. It has gained huge successes in a broad area of applications such as speech recognition, computer vision, and natural language processing. With the sheer size of data available today, big data brings big opportunities and trans-formative potential for various sectors; on the other hand, it also presents unprecedented challenges to harnessing data and information. As the data keeps getting bigger, deep learning is coming to play a key role in providing big data predictive analytics solutions. In this paper, we provide a brief overview of deep learning, and highlight current research efforts and the challenges to big data, as well as the future trends.

BIG DATA Computation

Archive of ‘Distributed Computing’ category

The anatomy of big data computing

Modeling and Optimization of Map-Reduce

ABSTRACT

Big Data Deep Learning: Challenges and Perspectives