Archive of ‘Big Data’ category
Encrypted FTP (FTPS) is available to all Box users at Penn State for transferring files to Box; unencrypted FTP is not available. This tool is designed to be used only for initial bulk uploading and occasional bulk downloading of files from your account. FTP to Box does not support SSO [Single Sign On, or WebAccess] logins, so you will need to create a Box-specific password to supplement your SSO login; Box calls this an external password. Note that you can’t create an external password for a non-person account, so you can only FTP files with a regular Box user account.
To create an external password in Box:
– When you are logged into Box in your browser, near the upper right side of the page, click your name or avatar and select Account settings.
– In the Account tab, in the Authentication section, click Change Password and enter a password. This password is separate and different from your regular Box login password, which is your AccessID password. The interface may imply that you already have an existing password, but just enter your desired password. The external password is used for Box services that do not support WebAccess login.
Box supports implicit FTPS (port 990) and explicit FTPS/FTPES (port 21), both over passive FTP. Box does not support active FTP or SFTP.
The following instructions are for the FileZilla FTP client.
Open FileZilla. In the left pane, navigate to the system from which you are migrating data.
Connect to Box via FTPS:
Host: ftp.box.com
Username: Your primary PSU email address (e.g., xyz123@psu.edu)
Password: The external password you created above.
Port: Use 990 for an implicit encrypted connection (FTPS); this ensures that your password is not sent in clear text as with standard FTP.
Once connected, from the Transfer menu, choose Preserve timestamps of transferred files.
Before transferring files to Box:
Windows: To improve transfer rates, from the Edit menu, select Settings…, click Transfers, and then set the maximum number of simultaneous transfers (the maximum setting is 10).
Mac OS X: To improve transfer rates, from the FileZilla menu, select Settings…, click Transfers, and then set the maximum number of simultaneous transfers (the maximum setting is 10).
In the right pane (Box), you may want to create a new directory for the files you are copying over. Drag the file or folder from the left pane (source) to the appropriate folder in the right pane (Box).
For more about using FTPS with Box, see community.box.com/t5/How-to-Guides-for-Managing/Using-Box-with-FTP/ta-p/26050
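If you prefer to script the initial bulk upload instead of using FileZilla, the same connection details work from code. Below is a minimal sketch, assuming Python’s standard-library ftplib and the explicit FTPS endpoint on port 21 described above; the email address, external password, and file name are placeholders you would replace with your own.

```python
# A minimal sketch, assuming Python's standard-library ftplib and the
# explicit FTPS endpoint on port 21 noted above. The credentials and
# file name below are placeholders, not real values.
from ftplib import FTP_TLS

HOST = "ftp.box.com"
USER = "xyz123@psu.edu"              # your primary PSU email address
PASSWORD = "your-external-password"  # the Box external password, NOT your AccessID password

ftps = FTP_TLS(HOST)   # connects on port 21; login() upgrades to TLS (explicit FTPS)
ftps.login(USER, PASSWORD)
ftps.prot_p()          # encrypt the data channel, not just the control channel
ftps.set_pasv(True)    # Box only supports passive FTP

# Upload one file, as an initial bulk transfer would.
with open("dataset.tar.gz", "rb") as f:
    ftps.storbinary("STOR dataset.tar.gz", f)
ftps.quit()
```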
Abstract: Current Cloud clusters often consist of heterogeneous machine nodes, which can trigger performance challenges such as the task straggler problem, whereby a small subset of parallel tasks runs abnormally slower than its sibling tasks. The straggler problem extends job response times and degrades system throughput. Poorly performing nodes are more likely to engender stragglers and can undermine the effectiveness of straggler mitigation. For example, speculative execution, the dominant mechanism for straggler alleviation, functions by creating redundant task replicas on other machine nodes as soon as a straggler is detected. When speculative copies are assigned to poorly performing nodes, they are less likely to catch up with the stragglers than replicas run on fast nodes. Because performance heterogeneity is caused not only by static attribute variations such as physical capacity, but also by dynamic characteristic fluctuations such as contention level, analyzing node performance is important yet challenging. In this paper we develop ML-NA, a Machine Learning based Node performance Analyzer. By leveraging historical parallel-task execution log data, ML-NA classifies cluster nodes into different categories and predicts their performance in the near future, serving as a scheduling guide to improve speculation effectiveness and minimize straggler generation. We consider MapReduce as a representative framework for our analysis, and use the published OpenCloud trace as a case study to train and evaluate our model. Results show that ML-NA can predict node performance categories with an average accuracy of up to 92.86%.
Keywords—Node Performance, Straggler Problem, Machine Learning, Prediction.
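The abstract does not disclose ML-NA’s model, features, or labels, so the following is purely an illustrative sketch of the task it describes: classifying nodes into performance categories from features derived from historical task-execution logs. It assumes scikit-learn, and every feature name and label is hypothetical.

```python
# Illustrative sketch only: the abstract does not disclose ML-NA's model,
# features, or labels, so everything named here is hypothetical. It shows
# the general shape of the task: classify nodes into performance categories
# from features derived from historical task-execution logs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-node features, e.g. [mean task duration,
# task duration variance, CPU contention, I/O wait].
X = rng.random((500, 4))
# Hypothetical performance categories: 0 = fast, 1 = normal, 2 = slow.
y = rng.integers(0, 3, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"held-out accuracy: {clf.score(X_test, y_test):.2%}")
```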
Abstract:
Large-scale data centers are the growing trend for modern computing systems. Since a large-scale data center has to manage a large number of machines and jobs, deploying multiple independent schedulers (termed distributed schedulers in the literature) that make scheduling decisions simultaneously has been shown to be an effective way to speed up the processing of the large quantity of submitted jobs and data. The key drawback of distributed schedulers is that, since they schedule different jobs independently, their scheduling decisions may conflict with each other when different decisions refer to the same subset of resources in the data center. Conflicting scheduling decisions cause additional scheduling attempts and consequently increase the scheduling cost. The more resources each scheduler demands, the higher the scheduling cost that may be incurred and the longer the job response times users may experience. It is therefore useful to investigate the balanced points in terms of resource demands for each of the independent schedulers, so that the distributed schedulers can all achieve decent job performance without experiencing undesired resource competition. To address this issue, we model distributed scheduling and resource conflict using game theory and conduct a quantitative analysis of scheduling cost and job performance. Further, based on the analysis, we develop conflict-aware scheduling strategies to reduce the scheduling cost and improve job performance. We have conducted simulation experiments with a workload trace and also real experiments on Amazon Web Services (AWS). The experimental results verify the effectiveness of the proposed modeling approach and scheduling strategies.
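As a purely illustrative toy, not the paper’s game-theoretic model, the sketch below shows why the conflict probability grows with per-scheduler resource demand: two schedulers that independently sample machines from the same pool collide more often as each one’s demand increases. All parameters are invented.

```python
# Toy simulation, not the paper's game-theoretic model: two independent
# schedulers each claim `demand` machines at random from a shared pool.
# A round conflicts when their claims overlap, forcing a rescheduling
# attempt. All parameters are invented.
import random

random.seed(0)
MACHINES = 100  # size of the shared resource pool

def conflict_rate(demand, trials=10_000):
    """Fraction of rounds in which the two schedulers' claims overlap."""
    conflicts = 0
    for _ in range(trials):
        a = set(random.sample(range(MACHINES), demand))
        b = set(random.sample(range(MACHINES), demand))
        if a & b:  # overlapping claims -> at least one scheduling retry
            conflicts += 1
    return conflicts / trials

for demand in (1, 5, 10, 20):
    print(f"demand={demand:2d}  conflict rate={conflict_rate(demand):.2%}")
```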
Abstract:
Understanding the performance of data-parallel workloads when resource-constrained has significant practical importance but unfortunately has received only limited attention. This paper identifies, quantifies and demonstrates memory elasticity, an intrinsic property of data-parallel tasks. Memory elasticity allows tasks to run with significantly less memory than they would ideally need while only paying a moderate performance penalty. For example, we find that given as little as 10% of ideal memory, Page-Rank and Nutch-Indexing Hadoop reducers become only 1.2x/1.75x and 1.08x slower. We show that memory elasticity is prevalent in the Hadoop, Spark, Tez and Flink frameworks. We also show that memory elasticity is predictable in nature by building simple models for Hadoop and extending them to Tez and Spark.
To demonstrate the potential benefits of leveraging memory elasticity, this paper further explores its application to cluster scheduling. In this setting, we observe that the resource vs. time trade-off enabled by memory elasticity becomes a task queuing time vs. task runtime trade-off. Tasks may complete faster when scheduled with less memory because their waiting time is reduced. We show that a scheduler can turn this task-level tradeoff into improved job completion time and cluster-wide memory utilization. We have integrated memory elasticity into Apache YARN. We show gains of up to 60% in average job completion time on a 50-node Hadoop cluster. Extensive simulations show similar improvements over a large number of scenarios.
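As a toy illustration of the queuing-time vs. runtime trade-off described above, not the paper’s model, the sketch below assumes a task that ideally needs 8 GB and 100 s, pays a hypothetical elasticity penalty when given less memory, and waits in queue in proportion to the memory it requests. All numbers are invented.

```python
# Toy illustration of the queuing-time vs. runtime trade-off, not the
# paper's model. A task ideally needs 8 GB and runs in 100 s; asking for
# less memory slows it down (elasticity penalty) but also shortens its
# wait for memory to free up. All numbers below are invented.
def completion_time(mem_gb, ideal_mem=8.0, ideal_runtime=100.0, free_gb_per_s=0.1):
    fraction = mem_gb / ideal_mem
    # Hypothetical elasticity curve: a mild, smoothly growing penalty.
    penalty = 1.0 + 0.75 * (1.0 - fraction) ** 2
    queue_wait = mem_gb / free_gb_per_s  # bigger requests wait longer
    return queue_wait + ideal_runtime * penalty

for mem in (8.0, 4.0, 1.0):
    print(f"{mem:3.0f} GB -> completion {completion_time(mem):6.1f} s")
# Under these invented numbers the 4 GB request finishes first: the extra
# runtime is more than repaid by the shorter queuing time.
```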
Abstract: Primitive partitioning strategies for streaming applications operate efficiently under two very strict assumptions: the resources are homogeneous and the messages are drawn from a uniform key distribution. These assumptions are often not true for real-world use cases. Dealing with heterogeneity and non-uniform workloads requires inferring the resource capacities and input distribution at run time. However, gathering these statistics and finding an optimal placement often become a challenge when microsecond latency is desired. In this paper, we address the load balancing problem for streaming engines running on a heterogeneous cluster and processing a skewed workload. In doing so, we propose a novel partitioning strategy called Consistent Grouping (cg) that is inspired by traditional consistent hashing. cg is a lightweight distributed strategy that enables each processing element instance (PEI) to process the workload according to its capacity. The main idea behind cg is the notion of equal-sized virtual workers at the sources, which are assigned to workers based on their capacities. We provide a theoretical analysis of the proposed algorithm and show via extensive empirical evaluation that the proposed scheme outperforms the state-of-the-art approaches. In particular, cg achieves 3.44x better performance in terms of latency compared to key grouping, which is the state-of-the-art grouping strategy for stateful streaming applications.
arXiv:1705.09073
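The following sketch illustrates the core idea as stated in the abstract, not the authors’ implementation: hash keys onto equal-sized virtual workers, then assign virtual workers to physical workers in proportion to capacity, so that faster workers absorb proportionally more load. The capacities and counts are hypothetical.

```python
# Sketch of the core idea as stated in the abstract, not the authors'
# implementation: hash keys onto equal-sized virtual workers, then map
# virtual workers to physical workers in proportion to (hypothetical)
# capacities, so faster workers absorb proportionally more load.
import hashlib
from collections import Counter

NUM_VIRTUAL = 64                          # equal-sized virtual workers
capacities = {"w0": 1, "w1": 2, "w2": 4}  # hypothetical relative capacities

# Assign virtual workers to physical workers proportionally to capacity.
total = sum(capacities.values())
assignment, vw = {}, 0
for worker, cap in capacities.items():
    for _ in range(round(NUM_VIRTUAL * cap / total)):
        if vw < NUM_VIRTUAL:
            assignment[vw], vw = worker, vw + 1
while vw < NUM_VIRTUAL:  # hand any rounding leftovers to the last worker
    assignment[vw], vw = worker, vw + 1

def route(key: str) -> str:
    """Route a message key to a physical worker via its virtual worker."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return assignment[digest % NUM_VIRTUAL]

print(Counter(route(f"key-{i}") for i in range(10_000)))
```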
Abstract: In wireless distributed computing, networked nodes perform intermediate computations over data placed in their memory and exchange these intermediate values to calculate function values. In this paper we consider an asymmetric setting where each node has access to a random subset of the data, i.e., we cannot control the data placement. The paper makes a simple point: we can realize significant benefits if we are allowed to be “flexible” and decide which node in our system computes which function. We make this argument for the case where each function depends on only two of the data messages, as is the case in similarity searches. We establish a percolation in the behavior of the system, where, depending on the amount of observed data, flexibility may allow us to need no communication at all.
arXiv:1705.08464
ABSTRACT:
Big Data applications make it possible to analyze large amounts of data that are not necessarily structured, though at the same time they present new challenges. For example, predicting the performance of frameworks such as Hadoop and Spark can be a costly task, hence the necessity to provide models that can be a valuable support for designers and developers. Big Data systems are becoming a central force in society, and the use of models can also enable the development of intelligent systems providing Quality of Service (QoS) guarantees to their users through runtime system reconfiguration. This paper contributes a novel modeling approach based on fluid Petri nets to predict the execution time of MapReduce and Spark applications, which is suitable for runtime performance prediction. The models have been validated by an extensive experimental campaign performed at CINECA, the Italian supercomputing center, and on the Microsoft Azure HDInsight data platform. Results have shown that the achieved accuracy is within around 9.5% of the actual measurements on average for MapReduce and about 10% for Spark.
Abstract
Hadoop is a free, Java-based programming framework that supports the processing of very large data sets in a parallel and distributed computing environment. In many organizations, Big Data is processed with Hadoop by submitting jobs to a master. Size-based scheduling with aging has been recognized as an effective approach to guarantee efficient and near-optimal system response times. The Hadoop Fair Sojourn Protocol (HFSP) is a scheduler that introduces this technique to a real, multi-server, complex, and widely used system such as Hadoop. In this paper, we present the design of a new scheduling protocol that caters to both a fair and an efficient use of cluster resources, while striving to achieve short response times. Our solution implements a size-based, preemptive scheduling discipline. The scheduler allocates cluster resources such that job size information is estimated while the job makes progress toward its completion. Scheduling decisions use the notion of virtual time, and cluster resources are focused on jobs according to their priority, computed through aging. This guarantees that neither small nor large jobs suffer from starvation. The outcome of our work is a full-fledged scheduler implementation, named HFSP, that integrates seamlessly into Hadoop. Size-based scheduling in HFSP gives priority to small jobs so that they are not slowed down by large ones. The “Shortest Remaining Processing Time (SRPT) policy, which prioritizes jobs that need the least amount of work to complete, is the one that minimizes the mean response time (or sojourn time), that is, the time that elapses between a job’s submission and its completion.” We extend HFSP to pause jobs with higher remaining processing time (higher SRPT) and allow other jobs waiting in the queue to run, on an FCFS basis.
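Since the extension described above hinges on the SRPT policy, here is a minimal single-server SRPT simulation, not the HFSP implementation: at every instant the job with the least remaining work runs, preempting whenever a shorter job arrives, and the mean sojourn time is reported. Job arrival times and sizes are made up.

```python
# Minimal single-server SRPT simulation, not the HFSP implementation:
# at every instant, run the job with the least remaining work, preempting
# whenever a newly arrived job is shorter. Arrival times and job sizes
# below are made up.
import heapq

def srpt_mean_sojourn(jobs):
    """jobs: list of (arrival_time, size). Returns the mean sojourn time."""
    jobs = sorted(jobs)            # by arrival time
    t, i, sojourns = 0.0, 0, []
    ready = []                     # min-heap of (remaining_work, arrival)
    while i < len(jobs) or ready:
        if not ready:              # server idle: jump to the next arrival
            t = max(t, jobs[i][0])
        while i < len(jobs) and jobs[i][0] <= t:
            heapq.heappush(ready, (jobs[i][1], jobs[i][0]))
            i += 1
        remaining, arrival = heapq.heappop(ready)
        # Run until this job finishes or the next arrival may preempt it.
        horizon = jobs[i][0] if i < len(jobs) else float("inf")
        run = min(remaining, horizon - t)
        t += run
        if run < remaining:
            heapq.heappush(ready, (remaining - run, arrival))
        else:
            sojourns.append(t - arrival)  # submission-to-completion time
    return sum(sojourns) / len(sojourns)

print(srpt_mean_sojourn([(0, 10), (1, 2), (2, 4)]))  # ~7.67 (mean of 16, 2, 5)
```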
ABSTRACT
A large volume of data is produced by various applications around the world, and processing data at such a scale poses great challenges in not only performance but also energy efficiency. Researchers have proposed various techniques to improve either the performance or the energy efficiency. The techniques of these two trends, however, are significantly different. When both performance and energy efficiency are of concern in big data systems, how to strike a balance becomes a pressing and challenging problem for data center administrators and hardware designers. In this paper, we conduct comprehensive evaluations on two representative platforms with different types of processors. We quantify the performance and energy efficiency, relating the evaluation results to micro-architectural activities and application characteristics. Two interesting findings emerge from our evaluations: (1) performance and energy efficiency are determined not only by the hardware technology, but also by the application characteristics; (2) no microprocessor is universally superior in terms of both performance and energy efficiency across all big data workloads. Based on these findings and the quantified evaluation results, we provide guidance and implications for both data center administrators and big data system designers, and we argue that a hybrid-core architecture is an efficient way to improve the energy efficiency of big data systems with minimal performance degradation.
Abstract: We reveal loopholes in Speculative Execution (SE) implementations under a unique fault model: node-level network throughput degradation. This problem appears in many data-parallel frameworks such as Hadoop MapReduce and Spark. To address this, we present PBSE, a robust, path-based speculative execution that employs three key ingredients: path progress, path diversity, and path-straggler detection and speculation. We show how PBSE is superior to other approaches such as cloning and aggressive speculation under the aforementioned fault model. PBSE is a general solution, applicable to many data-parallel frameworks such as Hadoop/HDFS+QFS, Spark and Flume.
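As a generic illustration of the speculative-execution trigger the abstract builds on, not PBSE’s path-based detection, the sketch below flags a task as a straggler when its progress rate falls well below the median of its siblings, the usual cue for launching a replica. The threshold and progress numbers are hypothetical.

```python
# Generic illustration of speculative execution's trigger, not PBSE's
# path-based detection: flag a task as a straggler when its progress rate
# falls well below the median of its sibling tasks, the usual cue for
# launching a replica. Threshold and progress numbers are hypothetical.
from statistics import median

def find_stragglers(progress_rates, threshold=0.5):
    """Return ids of tasks progressing slower than threshold * median."""
    med = median(progress_rates.values())
    return [tid for tid, rate in progress_rates.items() if rate < threshold * med]

rates = {"task-0": 1.00, "task-1": 0.95, "task-2": 0.12, "task-3": 1.05}
for tid in find_stragglers(rates):
    print(f"{tid}: straggler detected, launch a speculative replica")
```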