Authored by Xiaofeng Tang (firstname.lastname@example.org)
With contributions from Eduardo Mendieta and Thomas Litzinger
Welcome to the online tutorial on ethical data management. This tutorial provides some concepts, tools, and examples for engineering graduate students to think about issues related to academic norms, social impacts, and ethics that arise from the interplay between data and engineering research. It also introduces best practices for handling data in the research setting and invites you to reflect on researchers’ social and ethical responsibility.
Although the word “management” implies a strong sense of “control,” ethical data management in research is not an isolated, linear process of merely “putting data under control,” as if one were managing a warehouse of shoes. Instead, we suggest, data flows in multiple directions and interacts with a variety of actors throughout and beyond the research process. Therefore, we prefer to understand data through the analogy of an ecosystem, and we organize this tutorial around the concept of the “ecology of data.”
Another helpful concept for thinking about data management in research is the “lifecycle of research data.” The lifecycle approach breaks the management of data into several phases, thus enabling us to concentrate on the specific challenges related to a given phase (Chisholm, 2015). In this tutorial, we break the lifecycle of research data into four phases: data planning; data generating; data processing; and data using, sharing, and preserving. The concept “lifecycle” reminds us that data does not simply vanish with the conclusion of a research project or the publication of its findings. The existing data lays the groundwork for new research initiatives and thus starts new cycles of life.
The Ecology of Data
The ecological approach to data management is inspired by research in the information sciences. The traditional approach to information management, which focuses almost exclusively on the application of technology, fails to recognize the diverse cultures, people, and relations that shape the flow of information in important ways. Davenport and Prusak (1997) therefore use the term “information ecology” to call attention to “an organization’s entire information environment” (p. 4). Just as an ecosystem comprises a community of distinct and interrelated species, subsystems, and their environment, the “ecology of data” includes not only the datasets but also humans, organizations, technologies, and networks; data interacts with each of these groups much as energy circulates in a biological ecosystem. Unlike a view that treats data management as ensuring the proper functioning of a mechanical system that generates and transfers data, an ecological perspective calls our attention not only to how data performs but also to the values and decisions of the human and organizational actors, which play important roles in shaping the performance of data (Nadim, 2016). In other words, the humans and organizations in the ecology of data not only manage the data but also take part in creating the entire data environment. Accordingly, instead of being masters who exercise complete control over research data, researchers act like “residents” in the ecology of data: their interaction with data is shaped by, and in turn shapes, the entire ecosystem.
The four units in this tutorial do not progress linearly from one to the next. Instead, they act as four self-sustaining and interdependent subparts of an ecosystem of data. Figure 2 illustrates the relationship of these units. Following the lifecycle of research data, we introduce some concepts, norms, and tools related to responsible and ethical practice in research design, data collection, data processing, and the use of data in publishing and applying research findings. These lessons are also meant to invite you to contemplate questions about academic standards, communication in research, and the purposes of research, as well as ethical issues such as the rights of various groups and individuals based on their roles in the research process.
Besides traveling through the lifecycle of research, data also lives in a broader “life cycle,” i.e., the physical and virtual systems in which information about various aspects of our lives is translated into data and then collected, processed, and exchanged. Assisted by the data sciences and computing technologies, this seamless and boundless system of big data continuously records, oversees, supports, and communicates with numerous domains of our daily lives. Just consider how much information a credit card can reveal. Analyzing credit card records can yield information about the cardholder’s routes and means of travel, food consumption, visits to clinics, and purchases of medication, books, movies, and games; the list goes on. How does this all-encompassing and gigantic life cycle of data change the ways we think about ethics? What are researchers’ ethical responsibilities as they take part in creating and applying big data technologies? We also explore these questions as they are embedded in the lifecycle of data management.
Chisholm, Malcolm. 2015. “7 Phases of A Data Life Cycle.” Information Management. http://www.information-management.com/news/data-management/Data-Life-Cycle-Defined-10027232-1.html.
Davenport, Thomas H., and Laurence Prusak. 1997. Information Ecology: Mastering the Information and Knowledge Environment. New York, NY: Oxford University Press.
Nadim, Tahani. 2016. “Data Labours: How the Sequence Databases GenBank and EMBL-Bank Make Data.” Science as Culture 25 (June): 1–24. doi:10.1080/09505431.2016.1189894.