As the definition of big data continues to evolve, so do the required technical skills. Many educational organizations are starting to pay particularly close attention to big data knowledge and application, where the big data scientist deals more with predictive analytics and unstructured data than with the traditional descriptive analysis of structured data [23].
For information technology students, this can mean acquiring many different kinds of skills, from fluency in Java (the language of Hadoop) to competency in basic statistical methods such as ANOVA and chi-squared tests (core statistical tools for companies whose product is data). Regardless, advanced and technically minded students can find career paths in the management of relational databases, cloud computing platforms, and data visualization, and a person with expertise in any one of these areas can qualify for a full-time job.
Fari Payandeh, in a blog post on Data Science Central, predicts that big data will create specializations in three specific areas: data infrastructure, data management, and data visualization [24]. The full spectrum of a big data scientist's capabilities also includes the statistical ability to act on the data and the machine learning skills to learn from previous analyses. Exploring these five major skill sets allows us to take a closer look at which skills matter most for specific processes in big data manipulation and analysis.
2.8.1 Data Infrastructure
Data infrastructure governs how data is shared and consumed, and big data differs from traditional data in both respects. As a result, big data scientists face an almost entirely new infrastructure for distributing and gathering information.
Distributed computing platforms are the keystone to big data analytics. The idea of big data has always been a dream for database designers, but the current power of cloud hardware and software has made this dream a reality [25].
Running distributed computing platforms requires a data scientist not only to be knowledgeable in the technical aspects of operating such systems, but also to know how to adopt cloud or distributed computing systems that meet an organization's needs and expectations. This is a task that bridges the gap between IT and management.
This implies that a data scientist should have an understanding of the characteristics of a variety of cloud/distributed computing systems and hands-on experience with those systems. Andrew Olivier, founder of a big data consulting firm, has listed several skills that any prospective well-rounded big data employee should have, and more importantly, he emphasizes that these skills are not tied to any specific platform like Hadoop [1]. Potentially useful platforms, systems, and related skills include:
- A distributed file system such as the Hadoop Distributed File System (HDFS), along with underlying storage technologies such as Redundant Array of Independent Disks (RAID),
- SQL database systems, including Oracle or Microsoft SQL Server experience,
- Frameworks behind distributed computing including MapReduce,
- Scripting languages such as Python or Scala, along with frameworks such as Spark that build on them.
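The MapReduce idea behind frameworks like Hadoop can be sketched in plain Python. The toy word count below runs in a single process; a real framework distributes the map and reduce phases across a cluster and handles the shuffle between them:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values for each key into a final count.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs big tools", "data tools evolve"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["big"])   # 2
print(counts["data"])  # 2
```

The point of the three-phase structure is that each phase can run on many machines independently, which is what makes the pattern scale.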
Unstructured databases are a reservoir of information, and data streams feed them. Data streams involve data in motion, like active website behavior or sensor information from a video camera. The information found in data streams is valuable and accessible to many organizations, but too much data is produced to store it all. In many cases, the information must be processed on the fly, a process referred to as stream computing. As more data arrives from new sources, stream computing skills will become ever more essential to a data scientist.
One of the best examples of this skill in action is IBM Stream Computing, with full-fledged development environments that handle language processing, image/voice recognition, and location/time analyses, all without storing the data [26]. With this specialization, a big data analyst can save an organization time, money, and resources, because far less data needs to be stored while many trends can still be found quickly.
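IBM's tooling is far more sophisticated, but the core idea of stream computing, extracting useful signals from data in motion without persisting the raw records, can be illustrated in a few lines of Python. The sensor readings and spike threshold below are invented for illustration:

```python
def sensor_stream():
    # Stands in for data in motion, e.g. readings arriving from a sensor.
    for reading in [12.0, 12.5, 11.8, 40.2, 12.1, 11.9]:
        yield reading

def detect_spikes(stream, threshold=2.0):
    # Process each reading on the fly, keeping only a running mean
    # and the flagged values; the raw stream is never stored.
    count, mean, spikes = 0, 0.0, []
    for x in stream:
        if count and x > threshold * mean:
            spikes.append(x)        # reading far above the running mean
        count += 1
        mean += (x - mean) / count  # incremental mean update
    return spikes

print(detect_spikes(sensor_stream()))  # [40.2]
```

Memory use is constant no matter how long the stream runs, which is exactly what makes this approach viable when storing everything is impossible.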
2.8.2 Data Management
Data management is a large field, but for this discussion, we will use the Data Management Association's definition that "data management is the development, execution and supervision of plans, policies, programs and practices that control, protect, deliver and enhance the value of data and information assets" [27]. As such, technical database management skills are crucial for big data scientists.
Even with the changes big data brings to data infrastructure, the need for traditional data management is not going away. In 2012, the Robert Half Professional Hiring Index showed that database management was the skill most in demand among the 100 Chief Information Officers (CIOs) surveyed [28]. Demand for structured query languages may be growing more slowly than demand for other big data skills [29], but that does not stop SQL from appearing in top lists of big data needs [29] [30].
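The structured queries in question are easy to demonstrate; the snippet below uses Python's built-in sqlite3 module with a throwaway table and invented figures:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # disposable in-memory database
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# A typical structured query: aggregate amounts per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150.0), ('west', 250.0)]
```

The same GROUP BY / aggregate pattern carries over directly to Oracle, Microsoft SQL Server, and the SQL-on-Hadoop layers mentioned above.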
Nonetheless, most business data is unstructured, and it comes in many forms. It has been said that 80% of business data is unstructured, in the form of word processing, presentation, and log files, but industries now face a tipping point where more unstructured data is coming from social media activity on the web [31]. Big data is not only growing exponentially, but many additional data sets are being recorded in an unstructured fashion. Whether this data comes from Twitter or Fitbits, the information is valuable, and so is the skill to actually manage it.
As a result, skills in managing and exploiting non-relational databases are becoming increasingly valuable. This is especially true because relational data management systems that were very popular between 2000 and 2010 may not be flexible enough for big data [31].
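The flexibility argument is easy to see in miniature: a document-style store accepts records whose fields vary from one record to the next, where a fixed relational schema would not. The sketch below is a toy in plain Python; real non-relational systems such as MongoDB add indexing, replication, and a query language on top of the same idea:

```python
import json

# Unstructured records: each "document" carries whatever fields it has.
documents = [
    {"user": "ana", "tweet": "big data!", "likes": 3},
    {"user": "bo", "steps": 10452, "device": "fitbit"},
    {"user": "ana", "tweet": "more data", "hashtags": ["#ml"]},
]

# Persist as schemaless JSON text; no table definition is required.
store = [json.dumps(d) for d in documents]

# Query without a schema: find all documents that have a 'tweet' field.
tweets = [d for d in map(json.loads, store) if "tweet" in d]
print(len(tweets))  # 2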
2.8.3 Visualization and Integration
Expertise in analytical and computational methods alone does not ensure success for a big data team. It is crucial for a big data professional to be able to integrate findings from data analysis into the broader direction of the organization. That direction can take many different forms; in short, an organization, especially a large corporation, can have several different, even contradictory, agendas. One job of a data scientist is to align the priorities of different parts of an organization in order to steer them toward the discoveries made by data analysis [22]. The directional changes usually need to be made at a higher level of administration than the big data professional or data science team.
As a tool for intuitively conveying findings to a non-expert audience, visualization is becoming one of the most important components of big data analytics. According to an article on R-bloggers, the skill with the highest ROI for a data scientist is data visualization, because clients want insight [32]. Beyond the ability to use visualization techniques, communication skills are also getting more attention. On the Datafloq website, founder Mark van Rijmenam breaks down what these interpersonal skills can entail. Specifically, the list includes [33]:
- Strong interpersonal, oral and written communication and presentation skills,
- Ability to communicate complex findings and ideas in plain language,
- Being able to work in teams towards a shared goal,
- Ability to change direction quickly based on data analysis,
- Enjoying discovering and solving problems,
- Proactively seeking clarification of requirements and direction,
- Taking responsibility when needed, and
- Being able to work in stressful situations when insights into (new) data sets are required quickly.
These skills do not represent a “hard science” but rather an art of managing a group whose expertise and knowledge of the details are specialized. Nonetheless, for a data analyst’s success, it is crucial to be able to support both the business and team members.
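Real visualization work relies on dedicated tools, but even a minimal sketch shows the goal: turning raw numbers into something a non-expert can read at a glance. The figures below are purely hypothetical:

```python
def bar_chart(data, width=20):
    # Scale each value to a bar of '#' characters relative to the largest value.
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:>3} | {bar} {value}")
    return "\n".join(lines)

# Invented quarterly figures, purely for illustration.
chart = bar_chart({"Q1": 120, "Q2": 300, "Q3": 180})
print(chart)
```

A reader who would never parse a table of numbers sees immediately that Q2 dominates, which is precisely the "insight" clients pay for.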
2.8.4 Statistics
Statistics is a field in its own right, as broad as big data. Nonetheless, even for those not specializing in statistics, a thorough understanding of R, MATLAB, or some other statistical software or tool is very important for data analysis. Lexin Li of the Department of Statistics at North Carolina State University categorizes the role of statistics in big data into two parts [34]:
1) the statistical skills to build and interpret appropriate models for unusually large and complicated data, and
2) the engineering skills to carry out all the necessary operations.
In big data terms, this means being able not only to design an algorithm that gives interpretable results but also to implement it in a scalable manner.
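The scalability point shows up even in something as basic as variance: the textbook two-pass formula requires holding all the data in memory, whereas Welford's one-pass update handles data of any size in constant memory. A sketch in plain Python:

```python
def streaming_stats(values):
    # Welford's algorithm: one pass, constant memory, numerically stable.
    count, mean, m2 = 0, 0.0, 0.0
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count     # update the running mean
        m2 += delta * (x - mean)  # accumulate squared deviations
    variance = m2 / count if count else 0.0  # population variance
    return mean, variance

mean, var = streaming_stats([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(mean, var)  # 5.0 4.0
```

The same interpretable statistic is produced either way; only the engineering differs, which is exactly the distinction Li draws.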
2.8.5 Artificial Intelligence / Machine Learning
At first glance, artificial intelligence seems like an outlier on this list. Data scientists must be familiar with the architecture, management, representation, and statistics of data; that is somewhat intuitive. But what does AI have to do with big data? Well, someday, possibly everything.
Artificial intelligence adds a “layer to big data to tackle complex analytical tasks much faster than humans could ever hope to” [35]. This reliance on neural networks and machine learning will become even greater over time. By 2020, the size of the “digital universe” – the amount of data there is – will reach 44 zettabytes [36]. That’s over 44 trillion gigabytes of data!
Essentially, neural networks are machine learning and data mining techniques inspired by the biological learning mechanisms of the human brain. They allow much of the computation to be done in parallel and allow behavior to improve as the computer learns from its mistakes. There is far more to machine learning and its technical implementation than can be covered in this chapter, but familiarity with the supporting hardware (like neural networks in the cloud) and communications systems (like sensors) will help many organizations push past the limits of computing on ever larger data sets.
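The "learns from its mistakes" idea is concrete even in the simplest neural unit, a perceptron: whenever the prediction is wrong, the weights are nudged toward the correct answer. A toy sketch in Python, trained on the logical AND function:

```python
def train_perceptron(samples, epochs=10, lr=0.1):
    # Weights for two inputs plus a bias term, all starting at zero.
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x1, x2, target in samples:
            prediction = 1 if w[0]*x1 + w[1]*x2 + w[2] > 0 else 0
            error = target - prediction  # nonzero only on a mistake
            w[0] += lr * error * x1      # nudge weights toward the target
            w[1] += lr * error * x2
            w[2] += lr * error
    return w

AND = [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]
w = train_perceptron(AND)
preds = [1 if w[0]*x1 + w[1]*x2 + w[2] > 0 else 0 for x1, x2, _ in AND]
print(preds)  # [0, 0, 0, 1]
```

Modern deep networks replace this single unit with millions of them and a more refined update rule, but the loop of predict, compare, and adjust is the same.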