
1. Understanding Data and Research

1.1 Definitions of data

Resnik (2008) defines data as “recorded information used to develop or test human knowledge.” Under this very broad definition, very little in the research world falls outside the category of data. However, it is important to note that the definition and scope of “data” vary across disciplines, research communities, and institutions. For example, experimental scientists tend to emphasize the relation between data and experimentation. Thus Joshi and Krag (2010) define data more specifically as “extemporaneous experimental output.” Even so, the “outputs” in Joshi and Krag’s definition “include not only results from experiments and their analyses, but also experimental protocols, research materials, explanation of the data, and the computer programs that were used for the analysis.” This inclusive definition recognizes that data takes different forms at different stages of experimentation: protocols during research design, results during data collection, and explanations during data analysis.


Penn State Nittany Lion in a lab coat, sitting at a computer.

Figure 2 Many Researchers at Penn State Deal with Data of Different Formats.
(Nittany Lion by Penn State News / CC BY-NC 2.0)


Funding agencies and regulators of research also have their own definitions of data. These definitions matter to researchers who seek grant support or who work in areas (e.g., animal research) that are regulated by the corresponding organizations. For example, the Federal Acquisition Regulation defines data as “recorded information, regardless of form or the media on which it may be recorded. The term includes technical data and computer software” (cited in Joshi and Krag (2010)). Similarly, the National Institutes of Health (NIH) defines data as “recorded information, regardless of the form or media on which it may be recorded,” followed by a comprehensive list that includes “writings, films, sound recordings, pictorial reproductions, drawings, designs, or other graphic representations, procedural manuals, forms, diagrams, work flow charts, equipment descriptions, data files, data processing or computer programs (software), statistical records, and other research data” (cited in Joshi and Krag (2010)). This extensive list encompasses not only the “recorded information” per se but also the intermediaries, that is, the procedures and tools used to generate, document, and analyze data. Therefore, if you work on a project that receives NIH funding, you are obligated to follow NIH’s policy and standards in handling both the “recorded information” and the less obvious types of data. It is always worth paying attention to which people and organizations define data, because their definitions often reflect the priorities and concerns of the relevant actors.

Whether to consider some information as data also depends on the context of the research. Consider the following example.

Dr. A is conducting research that seeks to optimize energy supply in rural Pennsylvania. Part of the research involves interviewing residents in rural Pennsylvania to understand their patterns of energy use. One of the residents agreed to be interviewed on December 25, 2015. Should Dr. A enter the interview date as relevant data in her research journal?

One could say that it is always a good habit to record the date when you collect data, whether you are conducting an experiment, an observation, or an interview. Under certain circumstances, however, we might reasonably leave out the date of data collection; for example, when there is no designated space to store this information, or when the date might introduce unnecessary bias into our analysis. In the scenario above, though, agreeing to an interview on Christmas Day might imply that the interviewee is eager to participate, has a strong desire to share her experience, or is very busy on other days of the year. Such factors may be important in deciding whether the date of the interview merits documentation.

Recently, the rise of big data has significantly altered traditional understandings of data in research, business, and communication. Engineering researchers are closely involved with numerous areas of big data research and development; hence they are one of the key groups who can help us evaluate the ethical significance of this emerging technical paradigm. Yet we should note that although big data has become a hot topic in the media, industry, and academia, it still lacks a clear and widely accepted definition. Instead of defining big data, a number of authors have attempted to describe its particular characteristics. For example, some people use “big data” to refer to recent developments in computing hardware and software that enable the processing of significantly larger amounts of data than was possible in the past (Snijders, Matzat, and Reips, 2012). Others use high “volume, velocity, and variety” to portray the nature of the data that is processed under this new paradigm of information technology (Chen and Zhang, 2014).


How “Data” Became “Facts”

To properly understand the role of data in research, we should probably begin by examining how we have come to value data so much. It is often suggested that data provides us with a means to make objective decisions, without which we would be left confused by ambiguous, conflicting, and subjective opinions and interpretations. This suggestion rests on the assumption that data is a representation of facts. But is this always the case? One way to answer this question is to examine the ways in which people use “data” in ordinary language. According to Rosenberg (2013), the word “data” was first introduced to the English language from Latin in the 17th century, and the word originally meant “something given in an argument, something taken for granted.” According to this definition, “data” referred not to observable facts but rather to situations we should assume to be true for the purpose of an argument. Thus in the early 18th century the word “data” appeared primarily in works of mathematics and theology, rather than in the “empirical sciences.” The meaning of “data” shifted significantly over the course of the 18th century: by the end of the century, “data” was most commonly used to describe “facts in evidence determined by experiment, experience, or collection.”


1.2 The ecosystem of data: multiple actors and stakeholders

The following quotation from an ecoscientist (quoted in Ribes and Jackson (2013)) illustrates the complex system of persons, activities, and relations that are responsible for contributing data to a research publication.

“This morning I’m working on a paper and I’m looking at data and I’m making graphs, writing this paper and the graphs are swell and the statistical analysis is coming up super well. I nearly went down the hall to thank the lab crew because whenever I do this…. You realize how many things have to go right in order to get that graph. I mean, so we had to design the study well but then the samples had to be collected right and then they had to be handled right and they had to be extracted right then the chemical analysis and the incubation and like, so many…”

A variety of actors are captured in this snapshot: the researchers (PIs, postdocs, etc.) who designed the study, the lab crew who collected the samples, the (likely) graduate students who handled and extracted the samples, the chemists who analyzed the samples, the statisticians who conducted the statistical analysis, and finally, the author who represented the data in graphs and reported the findings in a paper. However, this seemingly extensive list leaves out numerous actors and stakeholders who are directly or indirectly involved with the production and use of the data reported in this paper. Think about the staff who manage the grants that fund this research, and the experts who are employed by governmental agencies (e.g., NSF or EPA) to ensure that tax dollars are spent in the most productive and beneficial areas of research. Think about the university administrators who are responsible for creating this faculty position, hiring the ecoscientist, and providing a start-up package for her to build a research team: the department head, the dean, the vice-president for research, etc. And there is the university ethics committee that oversees and ensures proper conduct of research. All of the aforementioned, as well as many other actors (e.g., manufacturers of the lab equipment), have a stake in the production of reliable data by the ecoscientist’s group. On the “user” end of the spectrum, a great number of actors and stakeholders are also connected with the data reported in the ecoscientist’s paper: peer scientists looking for new knowledge in the field, industry hunting for opportunities for new products and services, environmental groups seeking to inform citizens about the research findings, and policy makers creating or revising environmental regulations, to name a few. To sum up, the research data acts like a thread in a giant web of humans, organizations, and relationships. This web illustrates what we call the ecosystem or ecology of data.