With social media coming in so many different shapes and sizes, discussing the unique value of each platform for big data would be a daunting task. Throughout this application chapter, we’re going to focus primarily on Twitter as a source of big data. It’s easy to see the sheer size of Twitter live by following this link to see the number of Tweets sent today, which is likely to be about 500 million [9]. The contents of Tweets are largely unstructured data: other than a 140-character limit, a Tweet has no predefined model. In this section, let’s explore how the value of social media has grown and where its limitations lie.
3.3.1 Traditional vs Modern Social Media and Data
“Traditional” social data can take many forms. Even in 2012, “the largest data sets being used in digital humanities projects are much smaller than big data used by scientists,” and by modern definitions none of those projects would even qualify as big data. What further complicates this issue is that contemporary web content and data are “infinitely larger than all already digitized cultural heritage” [10].
Before looking too deeply into modern social media data, it is important to review the history of traditional social data, which takes many different forms and is still being heavily studied today. Networks between individuals were not created by modern technology; modern technology merely facilitated them. Letters were used to spread ideas and political discourse for hundreds of years, and they have shaped world culture ever since. One of the biggest groups analyzing this impact is the Republic of Letters. While even the entire corpus of correspondence may not qualify as “big data” in today's sense, the analysis of its contents has required immense effort and specialization. The case studies the group has already done cover influential people and places such as Benjamin Franklin, John Locke, and even salons [11]. When this information is visualized, it bears a striking resemblance to the outcomes of big data analysis, as seen in Figure 1 [12].
Figure 1: Letters Written to Franklin, 1757 – 1763 by Month by Source Country [12]
Other somewhat traditional forms of communication are still very relevant today. For example, even though Internet Relay Chat (IRC) has been around since the late 1980s, analyzing its logs still poses a challenge. Although groups do not analyze these logs for emotional content or advertising purposes, their size, content, and nature still prove problematic in ways relatable to big data. A 2013 graduate thesis examined this topic and ways to improve current methods [13].
Location information is a big part of recent social data, and social media services support real-time interactions among people. The data is often unstructured, which provides more contextual information on the events recorded, and its real-time nature is very powerful for situational awareness. The analysis of this information can then answer three very important business questions [14]:
- What am I doing right? This can come in the form of several different questions:
  - How many website visits come from Facebook, from YouTube, etc.?
  - Which kinds of content get the most retweets, or the most likes?
  - Which part of the day gets the most engagement?
  - Which region is engaged the most?
- How is my return on investment (ROI)? This is much more objective when looking at the actual number of followers on the platform. For example, an organization can measure how much it has grown and over what period of time. However, subjective analyses play a large role here, too.
- How should I spend my time in the future? Based on the answers to the first two questions, an organization should regroup and plan its social media campaign.
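The first two questions above lend themselves to simple aggregation. The following is a minimal sketch, assuming a small list of hypothetical post records: the field names (`posted_at`, `kind`, `region`, `retweets`, `likes`), the sample values, and the helper functions are all invented for illustration and do not correspond to any real platform's API or schema. Engagement is defined here, as an assumption, as retweets plus likes.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical post records; field names and values are illustrative
# assumptions, not the schema of any real platform API.
posts = [
    {"posted_at": "2015-03-02T09:15:00", "kind": "photo", "region": "US", "retweets": 12, "likes": 30},
    {"posted_at": "2015-03-02T18:40:00", "kind": "link", "region": "UK", "retweets": 3, "likes": 8},
    {"posted_at": "2015-03-03T09:05:00", "kind": "photo", "region": "US", "retweets": 20, "likes": 41},
    {"posted_at": "2015-03-03T13:30:00", "kind": "text", "region": "US", "retweets": 5, "likes": 9},
]


def engagement(post):
    """Assumed engagement metric: retweets plus likes."""
    return post["retweets"] + post["likes"]


def total_engagement_by(posts, key_func):
    """Sum engagement over posts, grouped by whatever key_func returns."""
    totals = defaultdict(int)
    for post in posts:
        totals[key_func(post)] += engagement(post)
    return dict(totals)


def top_key(totals):
    """Return the group with the highest total engagement."""
    return max(totals, key=totals.get)


def follower_growth_rate(start_count, end_count):
    """Relative follower growth over a period (0.2 means 20%)."""
    return (end_count - start_count) / start_count


# "Which part of the day gets the most engagement?"
by_hour = total_engagement_by(
    posts, lambda p: datetime.fromisoformat(p["posted_at"]).hour
)
# "Which kinds of content get the most retweets, or the most likes?"
by_kind = total_engagement_by(posts, lambda p: p["kind"])
# "Which region is engaged the most?"
by_region = total_engagement_by(posts, lambda p: p["region"])

print("Best hour:", top_key(by_hour))
print("Best content kind:", top_key(by_kind))
print("Best region:", top_key(by_region))
print("Follower growth:", follower_growth_rate(1000, 1200))
```

Grouping through a single `total_engagement_by` helper keeps each business question to one line; a real analysis would likely normalize by the number of posts per group, since raw totals favor whichever group posts most often.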
3.3.2 Social Media Limitations
Social media is powerful. It is likely the most powerful tool for analyzing large groups of people, both objectively and subjectively. Altogether, Facebook and Twitter have about a billion and a half users, a huge number given the roughly 7 billion people in the world. Sociologists, psychologists, and other researchers have a huge database of moods and personal reports at their fingertips.
Nonetheless, the data has pitfalls in its representativeness, which becomes a problem especially in academia. One example is a Cornell study that tracked moods through the day, cross-culturally, using text analysis of Twitter posts [15]. In short, its findings reveal statistics about the percentage of populations that are “night owls” vs “morning people,” as well as work-week mood patterns [16]. Without a doubt, the study has its virtues and can shed some light on the population being studied, but it has serious limitations as well. It cannot support accurate conclusions about human behavior overall because it only takes into account one group of people: Twitter users. More generally, using social media as a source introduces a selection bias, in that social media’s users are “typically younger, more affluent individuals” [17].
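The effect of this selection bias can be illustrated with a small worked calculation. The sketch below uses entirely synthetic numbers: the two age groups, their shares, and the "morning person" rates are assumptions invented for illustration and are not taken from the Cornell study or from any survey. It only shows how an estimate drawn from a platform that over-represents one group can drift away from the value for the whole population.

```python
# All figures below are synthetic and chosen only to illustrate the
# arithmetic of selection bias; none come from the cited studies.

# Assumed share of each group in the general population.
population_share = {"younger": 0.4, "older": 0.6}
# Assumed share of each group among the platform's users (skewed young).
platform_share = {"younger": 0.8, "older": 0.2}
# Assumed fraction of "morning people" within each group.
morning_rate = {"younger": 0.3, "older": 0.6}


def weighted_rate(shares, rates):
    """Overall rate for a mix of groups: sum of share * group rate."""
    return sum(shares[group] * rates[group] for group in shares)


true_rate = weighted_rate(population_share, morning_rate)
platform_rate = weighted_rate(platform_share, morning_rate)
bias = platform_rate - true_rate

print(f"Population rate: {true_rate:.2f}")
print(f"Platform-only estimate: {platform_rate:.2f}")
print(f"Bias from sampling only platform users: {bias:+.2f}")
```

Under these assumed numbers, the platform-only estimate understates the share of morning people simply because the younger group is over-represented among users, even though the rate within each group is measured correctly.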
Using social media data comes with its share of ethical issues as well. Much of the data essentially comes from participants who are not completely aware of how their information is shared or used [18]. There are hardly any tangible standards regarding how information from social media may be used across the board; it is frequently up to the platform, and consent is usually all-or-none. If an individual doesn’t want to agree to Facebook’s terms of service, he or she simply can’t use it. This can be an ethical challenge to some researchers [19], and it is compounded by the fact that moral standards make for poor guides when directing information gathering; what one person finds “creepy”, another would find fair [20]. Given the lack of even rudimentary privacy standards across the board, it is likely that big data will continue to battle with privacy concerns for a long time to come.
Finally, visualization of social media data poses special kinds of problems compared with big data gathered elsewhere. This is because social media platforms have their own, sometimes highly complex, agreements about how and where their data or information can be shared. Twitter’s developer agreement lists many rules and restrictions regarding access to its API, which range from protecting a user’s reasonable expectations of privacy to prohibiting reverse engineering [21]. In short, it’s possible to do analyses on Twitter data for commercial purposes, but sharing an individual’s personally identifiable information is not allowed.