If you work on Mac OS X 10.9 Mavericks or later, you will run into the problem of Eclipse refusing to interactively debug programs that otherwise build and run fine: an attempt to start a debugging session from the menu will result in Eclipse complaining that an Error with command: gdb --version has occurred.
The problem is caused by Apple switching from GDB, the GNU debugger, to LLDB, the LLVM debugger, in their Xcode toolchain (along with the transition from GCC to Clang). Unfortunately, Eclipse cannot (yet) communicate with any debugger other than GDB. Here is a step-by-step guide to installing and configuring GDB.
Installing GDB
As with GCC, the easiest way to install GDB is through Homebrew. In a Terminal window, run the command brew install gdb, and wait for it to complete. (As usual, it may ask for your password.)
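As a quick sanity check, you can confirm that the freshly installed binary is the one on your PATH (this assumes the default Homebrew prefix; the exact version banner will vary):

which gdb          # expect /usr/local/bin/gdb
gdb --version      # should print a "GNU gdb" banner rather than an error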
Now, we need to code-sign the GDB executable, so it will be allowed to control other processes, as necessary for a debugger. For that, we will first create a new certificate in Keychain.
Creating a Certificate
Open the Keychain Access application (it can be found in the Applications/Utilities directory or through Spotlight). From the application menu, select Keychain Access > Certificate Assistant > Create a Certificate…. An assistant window will appear to guide you through the process.
- First, you will be asked for the name and type of the certificate. You may choose the name arbitrarily, but to simplify its future use on the command line, prefer names without spaces or other fancy characters, e.g., gdbcert.
- Make sure that Identity Type is set to Self Signed Root, change Certificate Type to Code Signing, check the Let me override defaults checkbox, and click Continue. Click Continue again in the popup prompt warning about the certificate being self-signed.
- On the next page, leave Serial Number at 1, and set Validity Period (days) to a large enough number of days to cover the duration of the class or more, say, 365. (Certificates cannot last forever; the maximum validity period is 20 years.)
- Then click Continue once again, and keep doing so to skip the next six screens until you see the one entitled Specify a Location For The Certificate. For its only property, Keychain, choose System from the drop-down list. Lastly, click Create, type in your password, if prompted, and click Done.
- Back in the main window, choose the System keychain in the sidebar on the left, and select the newly created certificate from the list. Open the context menu and select Get Info. In the information window that appears, expand the Trust section and set the Code Signing property to Always Trust. Close this window (you may be asked for your password), and quit Keychain Access.
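If you would like to double-check from the command line that the certificate landed in the System keychain (assuming you kept the name gdbcert), the following should locate it:

security find-certificate -c gdbcert /Library/Keychains/System.keychain   # prints the matching certificate entry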
Signing GDB
Our new certificate is now ready to be used. In order to make it immediately available for signing, we need to restart taskgated, the access-control daemon. You can use Activity Monitor to do this (also found in Applications/Utilities). Open it and filter the list of processes by typing taskgated in the search field in the toolbar. (If you cannot find it, make sure the View > All Processes menu item is checked.)
There should be exactly one process left in the list. Highlight it, then select View > Quit Process from the menu, and click Quit in the popup prompt. The taskgated process will be terminated and, consequently, should disappear from the list. In a few seconds, it will be restarted by the system and should reappear in the list. Please wait for this to happen (it may take up to a minute or two, at worst).
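Equivalently, if you prefer the Terminal over Activity Monitor, the same restart can be triggered with a single command (launchd relaunches the daemon automatically):

sudo killall taskgated   # terminates taskgated; the system restarts it on demand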
Finally, in a Terminal window, run codesign -s gdbcert /usr/local/bin/gdb (if you named your certificate differently, replace gdbcert with its name here). Once again, you will be prompted for your username and password. If the command does not produce any output, then GDB has been successfully signed.
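You can also ask codesign to verify the result explicitly (the path assumes the Homebrew install location used above):

codesign --verify --verbose /usr/local/bin/gdb   # should report the binary as valid on disk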
Configuring Eclipse
The only thing left to do is to point Eclipse to the GDB executable. Open Eclipse > Preferences from the main menu (not to be confused with Project > Properties). In the tree of options listed in the sidebar, navigate to C/C++ > Debug > GDB, and set the GDB debugger field to /usr/local/bin/gdb.
If there is no GDB section in the subtree, close the preferences window, and try to first start a debugging session for any project that you can already run without problems. You can do it by either clicking the Debug button on the toolbar, or selecting Run > Debug from the main menu. This attempt will, of course, fail with an error message about the gdb command, but it will force the said settings to appear in the preferences.
This will change the GDB executable for new projects; for all existing ones (that you are going to use debugging for), you will need to manually update their debug configurations. To do that, select Run > Debug Configurations… from the menu. In the window that appears, one after another, select every project under the C/C++ Application section in the sidebar. For each of them, open the Debugger tab, set the GDB debugger field to the same path /usr/local/bin/gdb, and click the Apply button. After repeating this change for all listed projects, click Close.
Encrypted FTP (FTPS) is available to all Box users at Penn State for transferring files to Box; unencrypted FTP is not available. This tool is designed to be used only for initial bulk uploading and occasional bulk downloading of files from your account. FTP to Box does not support SSO [Single Sign-On, or WebAccess] logins, so you will need to create a Box-specific password to supplement your SSO login. Box calls this an external password. Note that you cannot create an external password for a non-person account, so you can only FTP files with a regular Box user account.
To create an external password in Box:
– When you are logged into Box in your browser, near the upper right side of the page, click your name or avatar and select Account settings.
– In the Account tab, in the Authentication section, click Change Password and edit the password. This password is separate and different from your regular Box login password, which is your AccessID password. The interface may imply that you already have an existing external password; just enter your desired password. The external password is used for Box services that do not support WebAccess login.
Box supports implicit FTPS (port 990) and explicit FTPES (port 21), both over passive FTP. Box does not support active FTP or SFTP.
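If you want a quick command-line check that the implicit-TLS port is reachable from your network before configuring a client (a connectivity test only; press Ctrl+C to exit):

openssl s_client -connect ftp.box.com:990   # should print Box's certificate chain if implicit FTPS is reachable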
The following description is for the FileZilla FTP client.
Open FileZilla. In the left pane, navigate to the system from which you are migrating data.
Connect to Box via FTPS:
Host: ftp.box.com
Username: Your primary PSU email address (e.g., xyz123@psu.edu)
Password: The external password you created above.
Port: Use 990 for an implicit encrypted connection (FTPS); this ensures that your password is not sent in clear text as with standard FTP.
Once connected, from the Transfer menu, choose Preserve timestamps of transferred files.
Before transferring files to Box:
Windows: To improve transfer rates, from the Edit menu, select Settings…. Click Transfers, and then set the maximum simultaneous transfers (the maximum setting is 10).
Mac OS X: To improve transfer rates, from the FileZilla menu, select Settings…. Click Transfers, and then set the maximum simultaneous transfers (the maximum setting is 10).
In the right pane (Box), you may want to create a new directory for the files you are copying over. Drag the file or folder from the left pane (source) to the appropriate folder in the right pane (Box).
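If you prefer a command-line client over FileZilla, here is a minimal sketch using curl, whose ftps:// scheme uses implicit TLS on port 990 by default (the local file name and Box folder name are placeholders):

# Upload one file over implicit FTPS; curl prompts for the external password created above.
curl --user "xyz123@psu.edu" -T localfile.zip "ftps://ftp.box.com/BoxFolder/"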
For more about using FTPS with Box, see community.box.com/t5/How-to-Guides-for-Managing/Using-Box-with-FTP/ta-p/26050
Abstract: Current cloud clusters often consist of heterogeneous machine nodes, which can trigger performance challenges such as the task straggler problem, whereby a small subset of parallel tasks run abnormally slower than their sibling tasks. The straggler problem extends job response times and deteriorates system throughput. Poorly performing nodes are more likely to engender stragglers and can undermine the effectiveness of straggler mitigation. For example, speculative execution, the dominant mechanism for straggler alleviation, functions by creating redundant task replicas on other machine nodes as soon as a straggler is detected. When speculative copies are assigned to poorly performing nodes, it is hard for them to catch up with the stragglers, compared to replicas run on fast nodes. And because performance heterogeneity is caused not only by static attribute variations such as physical capacity, but also by dynamic characteristic fluctuations such as contention level, analyzing node performance is important yet challenging. In this paper we develop ML-NA, a Machine Learning based Node performance Analyzer. By leveraging historical parallel-task execution log data, ML-NA classifies cluster nodes into different categories and predicts their performance in the near future as a scheduling guide, in order to improve speculation effectiveness and minimize straggler generation. We consider MapReduce as a representative framework for our analysis, and use the published OpenCloud trace as a case study to train and evaluate our model. Results show that ML-NA can predict node performance categories with an average accuracy of up to 92.86%.
Keywords—Node Performance, Straggler Problem, Machine Learning, Prediction.
Abstract:
Large-scale data centers are the growing trend for modern computing systems. Since a large-scale data center has to manage a large number of machines and jobs, deploying multiple independent schedulers (termed distributed schedulers in the literature) that make scheduling decisions simultaneously has been shown to be an effective way to speed up the processing of large quantities of submitted jobs and data. The key drawback of distributed schedulers is that, since they schedule different jobs independently, the decisions made by different schedulers may conflict with each other, because different scheduling decisions may refer to the same subset of the resources in the data center. Conflicting scheduling decisions cause additional scheduling attempts and consequently increase the scheduling cost. The more resources each scheduler demands, the higher the scheduling cost that may be incurred and the longer the job response times users may experience. It is therefore useful to investigate the balance points in the resource demands of each independent scheduler, so that the distributed schedulers can all achieve decent job performance without experiencing undesired resource competition. To address this issue, we model distributed scheduling and resource conflict using game theory and conduct a quantitative analysis of scheduling cost and job performance. Further, based on the analysis, we develop conflict-aware scheduling strategies to reduce the scheduling cost and improve job performance. We have conducted simulation experiments with a workload trace as well as real experiments on Amazon Web Services (AWS). The experimental results verify the effectiveness of the proposed modeling approach and scheduling strategies.
When “easy_install --user pip” doesn’t work, do the following:
wget https://bootstrap.pypa.io/get-pip.py
python get-pip.py --user
Also add the following line to your shell startup file.
For bash (in ~/.bashrc):
export PATH=$HOME/.local/bin:$PATH
For csh/tcsh (in ~/.cshrc):
setenv PATH ${HOME}/.local/bin:${PATH}
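After reloading the shell configuration, the user-level pip should be found first (shown for bash; the path assumes pip's default --user install location):

source ~/.bashrc
which pip        # expect something like $HOME/.local/bin/pip
pip --version    # confirms pip runs from the user-level install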
Now you can install any package locally by:
pip install --user PackageName
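For example, using requests as an arbitrary package name (any package on PyPI works the same way):

pip install --user requests
python -c "import requests; print(requests.__version__)"   # verifies the package is importable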
Abstract:
Understanding the performance of data-parallel workloads when resource-constrained has significant practical importance but unfortunately has received only limited attention. This paper identifies, quantifies and demonstrates memory elasticity, an intrinsic property of data-parallel tasks. Memory elasticity allows tasks to run with significantly less memory than they would ideally need while only paying a moderate performance penalty. For example, we find that given as little as 10% of ideal memory, Page-Rank and Nutch-Indexing Hadoop reducers become only 1.2x/1.75x and 1.08x slower. We show that memory elasticity is prevalent in the Hadoop, Spark, Tez and Flink frameworks. We also show that memory elasticity is predictable in nature by building simple models for Hadoop and extending them to Tez and Spark.
To demonstrate the potential benefits of leveraging memory elasticity, this paper further explores its application to cluster scheduling. In this setting, we observe that the resource vs. time trade-off enabled by memory elasticity becomes a task queuing time vs. task runtime trade-off. Tasks may complete faster when scheduled with less memory because their waiting time is reduced. We show that a scheduler can turn this task-level trade-off into improved job completion time and cluster-wide memory utilization. We have integrated memory elasticity into Apache YARN. We show gains of up to 60% in average job completion time on a 50-node Hadoop cluster. Extensive simulations show similar improvements over a large number of scenarios.
Abstract: Primitive partitioning strategies for streaming applications operate efficiently only under two very strict assumptions: the resources are homogeneous and the messages are drawn from a uniform key distribution. These assumptions are often not true for real-world use cases. Dealing with heterogeneity and non-uniform workloads requires inferring the resource capacities and input distribution at run time. However, gathering these statistics and finding an optimal placement often become a challenge when microsecond latency is desired. In this paper, we address the load-balancing problem for streaming engines running on a heterogeneous cluster and processing a skewed workload. In doing so, we propose a novel partitioning strategy called Consistent Grouping (cg), inspired by traditional consistent hashing. cg is a lightweight distributed strategy that enables each processing element instance (PEI) to process the workload according to its capacity. The main idea behind cg is the notion of equal-sized virtual workers at the sources, which are assigned to workers based on their capacities. We provide a theoretical analysis of the proposed algorithm and show via extensive empirical evaluation that it outperforms the state-of-the-art approaches. In particular, cg achieves 3.44x better performance in terms of latency compared to key grouping, which is the state-of-the-art grouping strategy for stateful streaming applications.
Abstract: In wireless distributed computing, networked nodes perform intermediate computations over data placed in their memory and exchange these intermediate values to calculate function values. In this paper we consider an asymmetric setting where each node has access to a random subset of the data, i.e., we cannot control the data placement. The paper makes a simple point: we can realize significant benefits if we are allowed to be “flexible” and decide which node computes which function in our system. We make this argument in the case where each function depends on only two of the data messages, as is the case in similarity searches. We establish a percolation phenomenon in the behavior of the system: depending on the amount of observed data, by being flexible we may need no communication at all.
Abstract:
Big Data applications allow large amounts of data, not necessarily structured, to be analyzed successfully, though at the same time they present new challenges. For example, predicting the performance of frameworks such as Hadoop and Spark can be a costly task, hence the necessity to provide models that can be a valuable support for designers and developers. Big Data systems are becoming a central force in society, and the use of models can also enable the development of intelligent systems providing Quality of Service (QoS) guarantees to their users through runtime system reconfiguration. This paper provides a new contribution in studying a novel modeling approach based on fluid Petri nets to predict the execution time of MapReduce and Spark applications, which is suitable for runtime performance prediction. The models have been validated by an extensive experimental campaign performed at CINECA, the Italian supercomputing center, and on the Microsoft Azure HDInsight data platform. The results show that the achieved accuracy is within around 9.5% of the actual measurements for MapReduce and about 10% for Spark, on average.
Abstract
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a parallel and distributed computing environment. In many organizations, Big Data is processed with Hadoop by submitting jobs to a master node. Size-based scheduling with aging has been recognized as an effective approach to guaranteeing effective and near-optimal system response times. The Hadoop Fair Sojourn Protocol (HFSP) is a scheduler that introduces this technique to a real, multi-server, complex, and widely used system such as Hadoop. In this paper, we present the design of a new scheduling protocol that caters to both a fair and an efficient use of cluster resources, while striving to achieve short response times. Our solution implements a size-based, preemptive scheduling discipline. The scheduler allocates cluster resources such that job size information is estimated while the job makes progress toward its completion. Scheduling decisions use the notion of virtual time, and cluster resources are focused on jobs according to their priority, computed through aging. This guarantees that neither small nor large jobs suffer from starvation. The outcome of our work is a full-fledged scheduler implementation, named HFSP, that integrates seamlessly into Hadoop. Size-based scheduling in HFSP gives priority to small jobs so that they are not slowed down by large ones. The Shortest Remaining Processing Time (SRPT) policy, which prioritizes the jobs that need the least amount of work to complete, is the one that minimizes the mean response time (or sojourn time), that is, the time that passes between a job's submission and its completion. We extend HFSP to pause jobs with higher SRPT and allow other jobs waiting in the queue to run, on an FCFS basis.