On September 5, 2008, I posted a report in the now-defunct Science & Law Blog in the Law Professors Blog Network entitled "Genetics Datasets Closed Due to Forensic DNA Discovery." It concerned a reported procedure for the equivalent of unscrambling a broken egg: using a large number of SNPs to determine whether a known individual’s DNA is part of a mixture of DNA from scores or even hundreds of DNA samples.
If Hollywood had been paying attention, we would have seen CSI techs checking a doorknob to find out whether the suspect ever touched it. And however the report might have been received in Hollywood, it scared the NIH and other scientific organizations that maintain research databases. After reproducing the posting about the international reaction, I shall quote an abstract from a study slated for publication in the journal Forensic Science International: Genetics. The latest work makes it even clearer that limiting access to the databases was unnecessary.
Posting of 5 Sept. 2008
Until last Friday, the National Institutes of Health (NIH) and other groups had posted large amounts of aggregate human DNA data for easy access to researchers around the world. On Aug. 25, however, NIH removed the aggregate files of individual Genome Wide Association Studies (GWAS).
(The files, which include the Database of Genotypes and Phenotypes (dbGaP), run by the National Center for Biotechnology Information, and the Cancer Genetic Markers of Susceptibility database, run by the National Cancer Institute, remain available for use by researchers who apply for access and who agree to protect confidentiality using the same approach they do for individual-level study data.) The Wellcome Trust Case Control Consortium and the Broad Institute of MIT and Harvard also withdrew aggregate data.
The reason? The data keepers fear that police or other curious organizations or individuals might deduce whose DNA is reflected in the aggregated data, and hence, who participated in a research study. These data consist of SNPs (Single Nucleotide Polymorphisms). These are differences in the base-pair sequences from different people at particular points in their genomes. Many SNPs are neutral: they do not have any impact on gene expression. Nonetheless, they can be helpful in determining the locations of nearby disease-related mutations.
The event that prompted the data keepers to act was the discovery at the Translational Genomics Research Institute (TGen) of a new way to check whether an individual’s DNA is a part of a complex mixture of DNA (possibly from hundreds of people). According to the TGen report, Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays, a statistic applied to intensity data from SNP microarrays (chips that detect tens of thousands of SNPs simultaneously) reveals whether the signals from an individual’s many SNPs are consistent with the possibility that the individual is not in the mixture. (Sorry for the wordiness, but the article uses hypothesis testing, and “not in the mixture” is the null hypothesis.)
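To make the hypothesis test a little more concrete, here is a minimal sketch of the statistic as I read the Homer et al. paper: for each SNP, take the distance from the individual's allele fraction to a reference-population frequency, subtract the distance to the mixture's estimated frequency, and apply a one-sample t-test to those differences. The function names and the toy numbers are mine, chosen for illustration only; real input would be allele-frequency estimates derived from microarray intensities.

```python
import math
import statistics


def distance_differences(person, mixture, reference):
    """Per-SNP difference D_j = |Y_j - Pop_j| - |Y_j - M_j|.

    person    : the individual's allele fractions (0, 0.5, or 1 per SNP)
    mixture   : allele-frequency estimates for the mixture
    reference : allele-frequency estimates for a reference population
    A positive D_j means the individual sits closer to the mixture
    than to the reference population at that SNP.
    """
    return [abs(y - pop) - abs(y - m)
            for y, m, pop in zip(person, mixture, reference)]


def t_statistic(person, mixture, reference):
    """One-sample t statistic for the null hypothesis that mean(D) = 0,
    i.e. that the individual is not in the mixture."""
    d = distance_differences(person, mixture, reference)
    mean_d = statistics.fmean(d)
    sd_d = statistics.stdev(d)  # sample standard deviation
    return mean_d / (sd_d / math.sqrt(len(d)))


# Toy data for six SNPs (made-up values, far fewer than any real chip).
person = [0.0, 0.5, 1.0, 1.0, 0.0, 0.5]
mixture = [0.2, 0.5, 0.8, 0.9, 0.1, 0.4]
reference = [0.4, 0.5, 0.6, 0.5, 0.3, 0.5]

print(t_statistic(person, mixture, reference))
```

A large positive value is read as evidence against the null hypothesis of "not in the mixture"; a value near zero or below it is read as consistent with the null.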
How could this compromise the research databases? As best I understand it, the scenario is that someone first would acquire a sample from somewhere. Your neighbor might check your garbage, isolate some of your DNA, get a SNP-chip readout, and check it against the public database to see if you were a research subject who donated DNA. Or, the police might have a crime-scene sample. Then they would use a SNP-chip to get a profile to compare to the record in the public database to see if the profile probably is part of the mixture data there. Finally, if they got a match, the police would approach the researchers to get the matching individual’s name.
Kathy Hudson, a public policy analyst at Johns Hopkins University, stated in an email that “While a fairly remote concern, and there are some protections even against subpoena, NIH did the right thing in acting to protect research participants.” However, scientists such as David Balding in the U.K. are complaining that the restrictions on the databases are an overreaction. Indeed, an author of the TGen study is quoted as stating that the new policy is “a bit premature.” See http://www.nature.com/news/2008/080904/full/news.2008.1083.html.
It seems doubtful that anonymity of the research databases has been breached, or will be in the immediate future, by this convoluted procedure. Of course, the longer-term implications remain to be seen, and the technique has obvious applications in forensic science. If the technique works as advertised, police will be able to take a given suspect and determine whether his DNA is part of a mixture from a large number of individuals that was recovered at a crime scene. Analyzing complex mixtures for identity is difficult to do with standard (STR-based) technology.
- Homer N, Szelinger S, Redman M, Duggan D, Tembe W, et al., Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays, PLoS Genetics (2008). 4(8):e1000167. doi:10.1371/journal.pgen.1000167
- DNA databases shut after identities compromised, Nature 455:13, Sept. 3, 2008
- Natasha Gilbert, Researchers criticize genetic data restrictions, Nature, Sept. 4, 2008, <http://www.nature.com/news/2008/080904/full/news.2008.1083.html>
The latest study is Egeland et al., Complex Mixtures: A Critical Examination of a Paper by Homer et al., Forensic Sci. Int’l: Genetics, 2011. It is in press. Corrected proofs are available online (for a price) at the journal’s website.
Abstract: DNA evidence in criminal cases may be challenging to interpret if several individuals have contributed to a DNA-mixture. The genetic markers conventionally used for forensic applications may be insufficient to resolve cases where there is a small fraction of DNA (say less than 10%) from some contributors or where there are several (say more than 4) contributors. Recently methods have been proposed that claim to substantially improve on existing approaches. The basic idea is to use high-density single nucleotide polymorphism (SNP) genotyping arrays including as many as 500,000 markers or more and explicitly exploit raw allele intensity measures. It is claimed that trace fractions of less than 0.1% can be reliably detected in mixtures with a large number of contributors. Specific forensic issues pertaining to the amount and quality of DNA are not discussed in the paper and will not be addressed here. Rather our paper critically examines the statistical methods and the validity of the conclusions drawn in Homer et al. (2008).
We provide a mathematical argument showing that the suggested statistical approach will give misleading results for important cases. For instance, for a two person mixture an individual contributing less than 33% is expected to be declared a non-contributor. The quoted threshold 33% applies when all relative allele frequencies are 0.5. Simulations confirmed the mathematical findings and also provide results for more complex cases. We specified several scenarios for the number of contributors, the mixing proportions and allele frequencies and simulated as many as 500,000 SNPs.
A controlled, blinded experiment was performed using the Illumina GoldenGate® 360 SNP test panel. Twenty-five mixtures were created from 2 to 5 contributors with proportions ranging from 0.01 to 0.99. The findings were consistent with the mathematical result and the simulations.
We conclude that it is not possible to reliably infer the presence of minor contributors to mixtures following the approach suggested in Homer et al. (2008). The basic problem is that the method fails to account for mixing proportions.
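The 33% threshold quoted in the abstract can be checked by direct enumeration. What follows is my own back-of-the-envelope sketch, not the derivation in Egeland et al. It assumes two unrelated contributors, genotypes in Hardy-Weinberg proportions, a reference frequency equal to the true allele frequency, and the per-SNP difference from Homer et al., D = |Y - Pop| - |Y - M|, where Y is the suspect's allele fraction and M is the mixture's. The function names are hypothetical. Exact fractions are used so that the zero crossing is not blurred by rounding.

```python
from fractions import Fraction
from itertools import product


def genotype_distribution(p):
    """Hardy-Weinberg probabilities of carrying allele fraction 0, 1/2, or 1
    at a SNP whose allele frequency is p (an assumption of this sketch)."""
    q = 1 - p
    return {Fraction(0): q * q, Fraction(1, 2): 2 * p * q, Fraction(1): p * p}


def expected_d(a, p=Fraction(1, 2)):
    """Exact expected value of D = |Y - Pop| - |Y - M| at one SNP when the
    suspect contributes fraction a of a two-person mixture and the other,
    unrelated person contributes 1 - a. The reference frequency is taken
    to be p itself."""
    dist = genotype_distribution(p)
    total = Fraction(0)
    # Enumerate every (suspect, other contributor) genotype pair.
    for (y, prob_y), (z, prob_z) in product(dist.items(), repeat=2):
        mixture = a * y + (1 - a) * z
        total += prob_y * prob_z * (abs(y - p) - abs(y - mixture))
    return total


for a in (Fraction(1, 10), Fraction(1, 3), Fraction(1, 2)):
    print(a, expected_d(a))
# 1/10 -7/80
# 1/3 0
# 1/2 1/16
```

With all allele frequencies at 0.5, the expected difference works out to 1/4 - (3/8)(1 - a), which is negative whenever the suspect's share a is below one third. Under these assumptions, a genuine contributor of, say, 10% pushes the statistic in the wrong direction, which is the point the abstract makes about ignoring mixing proportions.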