3.4 Social Media Analytics

As discussed in Section 3.3, analyzing social media is important. A good analysis demonstrates how information spreads, shows businesses what is working (and what is not), and reveals trends in behavior and activity that may previously have been only implicit. Many projects in the humanities are also moving to social media platforms for their analyses, and with such a rapidly growing pool of information, they need new tools to analyze that data [22].

Social media is analyzed because of the unprecedented amount of information and connections it makes available. At the same time, the sheer size of the data poses new technical challenges for analysts. So far, however, we have barely talked about social media analytics or the technical details behind them. Social media analytics are the processes by which data is collected and turned into information, either to make informed business decisions or to interpret for academic research [8].

3.4.1 Data collection

Social media platforms rely on interaction with their user base, which often includes businesses as well as individuals. Since organizations do not have the leisure time to spend hours casually perusing their favorite website, many social media platforms connect with businesses in a more direct way, particularly by providing greater access to their information. Data collection happens by interacting with the platform's API (application programming interface). Essentially, a social media API is used like a search engine for marketing campaigns [23]; the general shape of such a request is sketched below.
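To make the idea concrete, the following sketch shows what such an API call typically looks like in Python. The endpoint URL, parameters, and token are hypothetical placeholders rather than any real platform's API; most platforms follow this same pattern of an authenticated HTTP request that returns JSON.

```python
import requests

# Hypothetical endpoint and credentials; real platforms document their own
# URLs, parameters, and authentication schemes.
API_URL = "https://api.example-social.com/v1/posts/search"
PARAMS = {"query": "#Thanksgiving", "count": 100}
HEADERS = {"Authorization": "Bearer YOUR_ACCESS_TOKEN"}

response = requests.get(API_URL, params=PARAMS, headers=HEADERS)
response.raise_for_status()          # stop if the platform rejected the call

posts = response.json()              # the payload is typically JSON
print(len(posts.get("results", [])), "posts returned")
```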

Data collection itself is not the most difficult part of big data analytics. There are three major subsets of big data [22]:

  1. Traditional data – data which an organization already has
  2. Structured data – collected through Internet of Things (IoT) [24] and other sensors
  3. Unstructured data – data from media files or free-form text

These categories make sense in the context of an academic project on the relationship between disease and language. Traditional data might consist of previous, established findings on the subject. Structured data might come from measuring heart rate while a subject gives concrete responses to yes/no questions. Unstructured data might come from a large-scale analysis of Facebook posts for a particular region. Many organizations will use a combination of all three.

Yet despite this seemingly straightforward breakdown, the process becomes much more complicated in its technical application. As an example, consider data collection from Twitter. As might be expected given the sheer size of Twitter feeds, there are several ways in which researchers may access Twitter data. Twitter's developer documentation offers three streaming endpoints [25]:

  1. Public streams – the public data flowing through Twitter; used for following specific users or topics, and for data mining
  2. User streams – contain nearly all of the data related to a single user's view of Twitter
  3. Site streams – used mostly by servers that connect to Twitter on behalf of many users

Here are example use cases for each type of stream:

  • Public streams – analysis of users' sentiment about the upcoming Thanksgiving break using the hashtag #Thanksgiving.
  • User streams – analysis of presidential candidates' interactions with the public on Twitter.
  • Site streams – analysis of the contents of Instagram photos retweeted more than 100 times on Twitter.
  • Public & user streams – analysis of the follower networks of users who used the hashtag #blacklivesmatter.

More than likely, a data scientist will focus on Twitter's public and/or user streams. Regardless of the stream, the results returned by the API are in JSON format, a computer-readable format that contains the content as well as context such as location and engagement [26]. This is then parsed (typically via scripting) to extract a particular field or piece of information.
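As a concrete illustration of the public-stream use case above, here is a minimal sketch using tweepy, one commonly used Python wrapper for Twitter's API (its 3.x-era interface is assumed here); the credentials are placeholders obtained from Twitter's developer portal.

```python
import tweepy

# Placeholder credentials from Twitter's developer portal.
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

class HashtagListener(tweepy.StreamListener):
    """Handles statuses as they arrive from the public stream."""

    def on_status(self, status):
        print(status.user.screen_name, status.text)

    def on_error(self, status_code):
        # Returning False disconnects the stream, e.g. when rate-limited.
        if status_code == 420:
            return False

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

stream = tweepy.Stream(auth=auth, listener=HashtagListener())
stream.filter(track=["#Thanksgiving"])  # public stream filtered by hashtag
```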

figure3-2

Figure 2: JSON file excerpt example [26]
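Once a Tweet arrives as JSON, extracting fields takes only a few lines of scripting. The sketch below pulls a handful of fields out of one Tweet object; the field names (text, user.screen_name, retweet_count, place) follow Twitter's Tweet JSON format, and raw_json stands for whatever string the API returned.

```python
import json

def summarize_tweet(raw_json):
    """Extract a few fields of interest from one Tweet's JSON payload."""
    tweet = json.loads(raw_json)
    return {
        "user": tweet["user"]["screen_name"],
        "text": tweet["text"],
        "retweets": tweet.get("retweet_count", 0),
        # "place" is null unless the author attached a location to the Tweet.
        "location": (tweet.get("place") or {}).get("full_name"),
    }
```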

Let’s take a look at an example of collecting Twitter data, done by Allen Zeng. In his example [27], he collects Tweets from verified celebrity accounts and stores them on a schedule. After manually choosing the accounts and verifying them himself, he needed only four tools to collect and analyze the Twitter posts of several celebrities (a sketch of the storage step follows the list):

  1. A Python wrapper for Twitter’s API
  2. MongoDB for managing the database (administered through its JavaScript shell)
  3. PyMongo for connecting Python code to MongoDB
  4. cron on the Linux machine, which lets a user schedule when and how often a script should run
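The storage step can be sketched briefly. The snippet below is not Zeng's code; it assumes a local MongoDB server and illustrative database and collection names, and uses PyMongo upserts keyed on Twitter's numeric Tweet ID so that re-running the script does not store duplicates.

```python
from pymongo import MongoClient

# Assumes MongoDB is running locally; database/collection names are illustrative.
client = MongoClient("mongodb://localhost:27017/")
collection = client["celebrity_tweets"]["tweets"]

def store_tweets(tweets):
    """Insert a batch of Tweet dictionaries, skipping ones already stored."""
    for tweet in tweets:
        collection.update_one(
            {"id": tweet["id"]},          # Twitter's numeric Tweet ID
            {"$setOnInsert": tweet},      # only written if no document matches
            upsert=True,
        )
```

A crontab entry such as `0 * * * * python collect_tweets.py` (with collect_tweets.py standing in for whatever the collection script is actually named) would then rerun the collection once an hour.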

3.4.2 Data Manipulation

Data manipulation is a broad topic, and it is one of the most resource-intensive, time-consuming, and difficult aspects of big data analysis, especially for a new project. In this section, we take a look at a small subset of data manipulation. Since social media is our theme, let’s first look at how one could go about manipulating and analyzing user text.

There are several tools used in text manipulation, and many have been around for a long time. Two of the most basic are sed (stream editor) and awk (named for the last names Aho, Weinberger, and Kernighan), both very powerful tools for big data manipulation. Sed works on data on the fly in small pieces rather than loading the entire contents of a file, and its results can be re-inserted (or piped) into another sed or awk command if necessary [28]. Sed scales well and can be used to replace text in millions of records very quickly. For a social media example, common internet slang terms can be replaced with a counterpart to make analysis easier in a later phase (e.g., “lol”, “rofl”, “haha”, etc. could all simply be replaced with a string like “%LAUGHTER%” to carry the generic meaning); a sketch of this transformation follows below.

Since sed is stream-based, it can be applied to a huge bulk of unstructured text with little problem, but awk is better suited to data separated into fields (by line, comma, etc.). Awk also includes other, more robust programming constructs such as for-loops, which might lead a big data scientist to prefer awk over sed for more structured data. An example task for awk would be to extract and reload text. Suppose there are huge logs for Facebook users, each recording the amount of time spent per session, per device. Awk could sum the session times per user and export a file with the total time each user spent online. Because sed and awk operate at such a low level, the process would be fairly quick even for numerous sets of large files.
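Both examples can also be sketched in Python, which makes the logic explicit even though sed and awk would normally be run directly from the command line. The laughter terms, the %LAUGHTER% placeholder, and the assumed user_id,device,seconds log layout come from the discussion above; a real log's record format would of course differ.

```python
import re
import sys
from collections import defaultdict

# Sed-style pass: rewrite laughter slang one line at a time, roughly what
# `sed -E 's/\b(lol|rofl|haha)\b/%LAUGHTER%/g'` would do on the command line.
LAUGHTER = re.compile(r"\b(lol|rofl|haha)\b", re.IGNORECASE)

def normalize_lines(lines):
    for line in lines:
        yield LAUGHTER.sub("%LAUGHTER%", line)

# Awk-style pass: assume comma-separated records of the form
#   user_id,device,seconds_in_session
# and total the seconds per user, as an awk one-liner summing field 3
# keyed on field 1 would.
def total_time_per_user(lines):
    totals = defaultdict(int)
    for line in lines:
        user_id, _device, seconds = line.strip().split(",")
        totals[user_id] += int(seconds)
    return totals

if __name__ == "__main__":
    # Stream stdin through the normalization pass, like a sed filter.
    for line in normalize_lines(sys.stdin):
        sys.stdout.write(line)
```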

However, much of the content uploaded to social media is not text; it can be images, music, videos, and more. In non-text analyses, the process of data manipulation involves several more steps, sometimes very creative ones. To understand some of these procedures, we can look at a case study done by Ailey Crow of Pivotal.io [29]. In technical fields, it is common to use images to measure the results of chemical reactions or biological processes, perhaps by doing a before-and-after count of nanoparticles or cells. This case study looks at an image of a breast cancer tissue sample, and Crow shows the steps taken to automate counting the nuclei in the image, allowing the analysis to scale rapidly. The original image is shown in Figure 3:

figure3-3

Figure 3: Original image of breast cancer tissue sample [29]

The goal is to count all the nuclei, which appear as the purple ovals. First, the image must be broken down into its color components (either grayscale values or red, green, and blue values) pixel by pixel. In this example, Crow converts the image to grayscale and then “smooths” it; essentially, each pixel value is replaced with the mean of its neighbors’, producing the image in Figure 4:

figure3-4

Figure 4: Smoothed image before thresholding by Otsu’s algorithm [42]
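A rough Python equivalent of this step (Crow's original pipeline runs inside the database rather than with these libraries) might use scikit-image and SciPy; the filename and the 5x5 neighborhood size below are illustrative.

```python
from scipy import ndimage
from skimage import color, io

# Placeholder filename standing in for the tissue image in Figure 3.
image = io.imread("tissue_sample.png")
gray = color.rgb2gray(image)              # collapse RGB into grayscale values

# "Smoothing": replace each pixel with the mean of its neighborhood,
# suppressing small-scale variation before thresholding.
smoothed = ndimage.uniform_filter(gray, size=5)
```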

Then, the foreground must be separated from the background using a thresholding algorithm; in this case, she uses Otsu’s algorithm. After applying another noise filter, she gets the result in Figure 5:

figure3-5

Figure 5: Image after reducing pixel noise [42]
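Continuing the same sketch, Otsu's algorithm chooses the grayscale cutoff that best separates the two pixel populations; the noise filter shown here (dropping tiny connected specks) is one plausible cleanup step, not necessarily the one Crow used.

```python
from skimage.filters import threshold_otsu
from skimage.morphology import remove_small_objects

# Nuclei are darker than the background in this stain, so keep the pixels
# below the Otsu threshold as foreground.
threshold = threshold_otsu(smoothed)
foreground = smoothed < threshold

# Drop connected specks smaller than 20 pixels as noise (size is illustrative).
cleaned = remove_small_objects(foreground, min_size=20)
```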

Finally, this image has enough information to be analyzed properly by object recognition. In short, Crow runs a basic SQL command to find any connected object that is larger than 50 pixels but smaller than 500. The query returns 217 cells in the original image, illustrated in Figure 6:

figure3-6

Figure 6: Resulting image after performing object recognition by connected components algorithm [42]
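Crow performs this last step with a SQL query over the connected components; an equivalent in the same Python sketch labels the connected foreground regions and keeps only those whose area falls between the 50- and 500-pixel bounds.

```python
from skimage.measure import label, regionprops

# Group foreground pixels into connected components, then apply the size
# filter from the case study (more than 50 but fewer than 500 pixels).
labels = label(cleaned)
nuclei = [region for region in regionprops(labels) if 50 < region.area < 500]
print("nucleus count:", len(nuclei))
```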

It is easy to imagine having hundreds if not thousands of these images for each patient in a study. Instead of assistants manually counting the cells in each image, which is both time-consuming and prone to error, the entire process can be automated. Larger research organizations that may have millions of images of this nature could push the entire process to the cloud and simply retrieve a before-and-after count for each image, which can then be easily visualized.

This case study by no means captures every step in the process of non-text manipulation, but it does highlight the complex and creative steps involved. It also demonstrates the potential power that analytics can gain through this kind of manipulation. Admittedly, microscopic images are not likely to be posted on social media platforms. Nonetheless, these tools can be leveraged in conjunction with social media and could be used, for example, to analyze the average number of people appearing in photos uploaded by a particular subset of users.