3.2 Period of retention

In theory, all data should be preserved for as long as possible. In practice, however, the length of data retention must be balanced against other constraints, such as the availability of space, the cost of storage, and the protection of confidential information. For research that receives extramural funding, the funding agencies usually have policies specifying a minimum retention period. For example, the NSF Engineering Directorate states that “minimum data retention of research data is three years after conclusion of the award or three years after public release, whichever is later” (NSF Engineering Directorate, 2017). Plans for data retention should also anticipate personnel changes in the research group. For example, when students graduate and leave a research group, or when the principal investigator moves to another institution, the plan should ensure a seamless transfer of responsibility for data retention to appropriate researchers. Once the intended retention period has passed, the stored data should be appropriately destroyed (Coulehan and Wells, 2006). This is especially important when the data contains confidential information.
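The “whichever is later” rule quoted above is straightforward to operationalize. The following Python sketch is a minimal illustration rather than any agency’s official tool: only the three-year term comes from the quoted policy, and the function and variable names are hypothetical.

```python
from datetime import date

# Hypothetical helper illustrating the NSF Engineering Directorate rule:
# retain data for three years after conclusion of the award OR three
# years after public release, whichever is later.
RETENTION_YEARS = 3

def retention_deadline(award_end: date, public_release: date) -> date:
    """Return the earliest date on which the data may be destroyed."""
    def add_years(d: date, years: int) -> date:
        try:
            return d.replace(year=d.year + years)
        except ValueError:          # Feb 29 in a non-leap target year
            return d.replace(year=d.year + years, month=3, day=1)

    return max(add_years(award_end, RETENTION_YEARS),
               add_years(public_release, RETENTION_YEARS))

print(retention_deadline(date(2020, 6, 30), date(2021, 3, 15)))
# -> 2024-03-15, i.e., three years after the later of the two events
```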

3.3 Data security

Securing data means at least three things: preserving the integrity of stored data, effectively controlling access to the data, and protecting the privacy of data contributors. Preserving data integrity during storage is a basic requirement of data security. Threats to data integrity include data erosion and data loss. Eroded data becomes unreadable, or its values are altered by the conditions of storage. Data loss might result from researchers’ mishandling or from data theft. Allowing access to unapproved personnel could also lead to significant harm, such as violation of intellectual property rights, loss of commercially valuable information, and compromise of privacy. It is therefore important to follow proper procedures and apply appropriate techniques to ensure data security. For example, physical data should be locked in a secure space, and electronic data should be stored on reliable digital devices secured with a password that is updated regularly (Coulehan and Wells, 2006). It is also good practice for a research group to establish a formal policy and procedure for applying for access to the stored data.
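One widely used technique for verifying integrity, offered here as an illustrative sketch rather than a prescription, is to record a cryptographic checksum when data is archived and to recompute it at each later audit. The example below uses Python’s standard hashlib module; the file name is hypothetical.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

# At archiving time: record the checksum alongside the data.
data_file = Path("trial_results.csv")   # hypothetical file
recorded = sha256_of(data_file)

# At any later audit: recompute and compare. A mismatch signals
# erosion, corruption, or tampering since the data was stored.
if sha256_of(data_file) != recorded:
    raise RuntimeError(f"Integrity check failed for {data_file}")
```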

When data is collected from human subjects, de-identification is a useful way to protect the privacy of data contributors. De-identification severs the content of the data from its contributor. One could de-identify a data set by destroying all identifiers, making the data completely anonymous. If information about the contributors is necessary for the research, one could instead replace each identifier (e.g., a contributor’s name) with a unique code. The information (e.g., a spreadsheet) that links each code to its contributor should be kept separate from the de-identified data.
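As a minimal sketch of this coding scheme, assuming simple tabular data with a “name” column (the column names and output files are hypothetical), the following Python example writes the de-identified data and the linkage table to separate files; the linkage file should then be stored in a different, access-controlled location.

```python
import csv
import uuid

# Hypothetical input: one row per contributor, with an identifying
# "name" column and two non-identifying study variables.
rows = [
    {"name": "Alice Smith", "age": "34", "response": "yes"},
    {"name": "Bob Jones",   "age": "51", "response": "no"},
]

linkage = {}        # code -> identifier; keep separate from the data
deidentified = []
for row in rows:
    code = uuid.uuid4().hex[:8]   # unique random code per contributor
    linkage[code] = row["name"]
    deidentified.append({"subject_code": code,
                         "age": row["age"],
                         "response": row["response"]})

# Write the de-identified data and the linkage table to separate files.
with open("data_deidentified.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["subject_code", "age", "response"])
    writer.writeheader()
    writer.writerows(deidentified)

with open("linkage_table.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["subject_code", "name"])
    writer.writeheader()
    writer.writerows({"subject_code": c, "name": n}
                     for c, n in linkage.items())
```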

Data security is a systemic issue that intertwines with technology, regulations, ethics, business, the training of researchers, and the organization of research and development. The ways in which we understand and protect data security are in part shaped by our collective commitment to respect, privacy, and the autonomy of people. Our response to data security also reflects how we come to terms with the complex impacts of contemporary data science and technology. In the era of big data, for instance, data security concerns not only the data itself but also the metadata, that is, “data that provides information about other data.” If the data of a book is its content, its metadata includes information such as the author, the length of the book, the publisher, and the publication date. When we create, edit, and save a file on a computer, the computer automatically generates relevant metadata and embeds it in the file (Office of the Privacy Commissioner of Canada, 2006). While many people take care to examine the data they release, it is easy to overlook the metadata, which may contain and reveal sensitive information. In one data security study, for example, researchers retrieved publicly available Microsoft Office files from the websites of Fortune 100 companies. Analysis of the metadata in these files revealed a broad range of information, such as an organization’s network paths, authors’ email addresses, and even shortcomings of the software used by a company’s business partner (Oracle, 2007).
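To make this concrete, the following sketch inspects the metadata embedded in a Microsoft Word .docx file, which is a zip archive whose document properties live in docProps/core.xml. Only Python’s standard library is used, and the file name is hypothetical; fields such as the creator and the last modifier are exactly the kind of information the study above found leaking.

```python
import zipfile
import xml.etree.ElementTree as ET

# A .docx file is a zip archive; its document metadata lives in
# docProps/core.xml (Dublin Core plus Office-specific fields).
DOC = "quarterly_report.docx"   # hypothetical file

with zipfile.ZipFile(DOC) as z:
    core = ET.fromstring(z.read("docProps/core.xml"))

# Print every metadata element, e.g. dc:creator, cp:lastModifiedBy,
# dcterms:created -- fields that can reveal authors and edit history.
for elem in core:
    tag = elem.tag.split("}")[-1]   # strip the XML namespace
    print(f"{tag}: {elem.text}")
```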

Storing data in the cloud poses additional risks to data security. When users upload data to the cloud, they arguably lose control of it, because they do not know where exactly the data is stored or how it will be handled. In other words, users leave the security and integrity of their data in the hands of the service providers and, ultimately, of the vast and complex information systems that constitute the cloud. According to Timmermans et al. (2010), data stored in the cloud faces four types of risk: “unauthorized access, data corruption, infrastructure failure, or unavailability/outage.” In addition, it is difficult to identify who is accountable when the cloud fails to protect the security and integrity of stored data. This challenge stems in part from the complicated and often geographically dispersed subsystems that are connected in the cloud. The power of cloud computing lies in distributing data and tasks across remotely connected data centers, so that scattered computing resources can be used more efficiently. However, this distributed structure makes it extremely difficult to track which subsystem is processing which dataset at a given moment. If a dataset is lost or corrupted in the cloud, the user has little power to identify the mistake or recover the data. Finally, cloud service providers tend to place limits on users’ control of their data. Some providers design their technology to “lock in” users; that is, users will lose existing data if they switch to a different service provider. Researchers should therefore carefully assess these risks when they consider storing data in the cloud.
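One partial safeguard, not discussed in the passage above but consistent with its concerns, is to encrypt data on the researcher’s own machine before uploading, so that the provider only ever holds ciphertext. The sketch below assumes the third-party cryptography package (installable with pip install cryptography); the file names are hypothetical.

```python
from cryptography.fernet import Fernet

# Generate a key and keep it locally; never upload it with the data.
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt before upload: the cloud provider stores only ciphertext.
with open("dataset.csv", "rb") as f:        # hypothetical file
    ciphertext = fernet.encrypt(f.read())
with open("dataset.csv.enc", "wb") as f:
    f.write(ciphertext)

# After download: decrypt locally. Fernet also authenticates the data,
# so silent corruption or tampering raises InvalidToken on decryption.
with open("dataset.csv.enc", "rb") as f:
    recovered = fernet.decrypt(f.read())
```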