Monthly Archives: May 2011

There But for the Grace of God … The “Horrific False-positive DNA Match” to Diane Myers

I have been trying to make sense of a 2004 newspaper report [1] that has achieved a certain degree of fame in articles and briefs emphasizing the risks of false accusations or convictions resulting from cold hits in DNA databases. The article appeared in November 2004, in the Chicago Sun-Times, with no follow-up in that paper or, as far as I can see, in any other. In 2009, the Federal Public Defender in Sacramento cited it as an example of “horrific tales of false-positive DNA matches” [2, at 17].

In this tale, detectives investigating a string of burglaries “were informed of a ‘hit’ between blood recovered at the scene and the genetic profile of a woman named Diane Myers.” [1] Evidently, they were not informed that “the ‘hit’ was not based on a direct match.” It was some kind of “partial match” provided as an “investigative lead.” The article does not explain further. One of the reporters responded to a recent email that she does not recall the details. The suspect promptly cleared herself by showing that “she was locked up in a Downstate prison [when someone] slipped into the Chicago apartment Dec. 12, 2002.” [1]

What might have happened had she not been so lucky as to have an airtight alibi? “Jack Rimland, a criminal defense attorney and former president of the Illinois Association of Criminal Defense Lawyers, said … ‘But for the fact that this woman was in prison [sic] … I absolutely believe she’d still be in custody.’” On the other hand, “Kathleen Zellner, a Naperville attorney who relied on DNA evidence to exonerate four men in the 1986 killing of medical student Lori Roscetti, said it was ‘reassuring’ the error was in paperwork, and not in the scientific process, and that the mistake appears to have been addressed.” [1]

All told, this incident does not seem to me to merit the appellation of “horrific,” but it does illustrate the need for the police to understand the true significance of every cold hit. When only a few loci are involved, the power of the association obviously is reduced. If anyone knows more about the laboratory report in the case, the probability of a random match for the limited number of loci involved, and how reports of database hits have changed in Illinois, please consider posting a comment or emailing me.
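To see roughly how the number of loci drives the strength of a match, here is a minimal sketch of the usual product-rule calculation. It assumes the loci are statistically independent, and the per-locus genotype frequencies are hypothetical values chosen only for illustration; they are not taken from the Myers case or from any real database.

```python
from math import prod

# HYPOTHETICAL genotype frequencies, one per locus (illustration only;
# these are assumptions, not figures from the Illinois case).
ASSUMED_GENOTYPE_FREQS = [
    0.08, 0.05, 0.11, 0.06, 0.09, 0.07, 0.10,
    0.04, 0.08, 0.06, 0.05, 0.09, 0.07,
]


def random_match_probability(genotype_freqs):
    """Product rule: multiply genotype frequencies across independent loci."""
    return prod(genotype_freqs)


# Compare a partial profile with fuller ones.
for n_loci in (4, 9, 13):
    rmp = random_match_probability(ASSUMED_GENOTYPE_FREQS[:n_loci])
    print(f"{n_loci:2d} loci: random match probability about 1 in {1 / rmp:,.0f}")
```

With frequencies of this general size, every locus dropped from the comparison makes a coincidental match more probable by a factor of ten or more, which is why a partial match reported as an "investigative lead" carries far less weight than a full-profile hit.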

References

1. Annie Sweeney & Frank Main, Botched DNA Report Falsely Implicates Woman: Case Compels State to Change How It Reports Lab Findings, Chicago Sun-Times, Nov. 8, 2004, at 18.

2. United States v. Pool, Brief for Defendant-Appellant, No. 09-10303, 9th Cir. Oct. 5, 2009, at 17, available at http://edca.typepad.com/files/pool-opening-brief.pdf

Osama Bin Laden’s DNA? 99.9% Accuracy and 0.1% Nonsense

This morning the New York Times reported that genetic analysis established “with 99.9 percent accuracy” that the man killed by U.S. soldiers in Pakistan and quickly buried at sea was Osama Bin Laden. “Officials said they collected multiple DNA samples from Bin Laden’s relatives in the years since the Sept. 11 attacks. And they said the analysis, which was performed the day Bin Laden was killed but after his body was buried at sea, confirmed his identity with 99.9 percent accuracy.” [1] The 99.9% figure is quoted in other stories and commentary as a precise statement of the probability that the body was Osama’s.

But where would a number like 99.9% come from? In a typical criminal case, the issue is whether a trace of DNA left at a crime-scene or on a victim and a sample from a suspect or defendant share a sufficient number of highly variable features to justify the inference that they originated from the same individual. One can compute the probability of the match under different hypotheses. One hypothesis (I shall call it H) is that the defendant is indeed the source. A rival hypothesis (U) is that an unrelated person is. Other alternatives are F, that a father of the suspect is the source, or S, that a full sibling is. Still other relationships between the trace DNA and its source can be envisioned.

Empirically determined frequencies of the distinct DNA features (the alleles at each locus) can be combined according to a population genetics model to estimate the probability that an unrelated individual, a parent or child, a sibling, etc., would be born with the DNA profile in question. This is the probability of the DNA data if a given hypothesis is true. Suppose these probabilities are P(data | U) = 1/10^12, that P(data | F) = 1/10^5, and that P(data | S) = 1/10^7. Ignoring the chance of laboratory error, the probability of the data if H is true is P(data | H) = 1. These conditional probabilities (for data, given hypotheses) often are called likelihoods. [2]

It is important to understand that none of these numbers is the probability, P(H | data), that the suspect is the source given the genetic data (the 99.9% figure). To find this probability, we would need to know the probability of all the hypotheses before considering the genetic data. Bayes’ rule then would permit us to combine these prior probabilities with the likelihoods. Using genetic data alone, however, it is not possible to state the probability of H. For that, we would need subjective probabilities based on nongenetic information. However, the data can produce likelihoods that swamp any reasonable choice for the prior probability, justifying assertions that the posterior probability exceeds a figure like 99.9%. The box gives an example.

BOX: Sample Computation with Bayes’ Rule
With likelihoods like L(U) = P(data | U) = 1/10^12 and L(H) = P(data | H) = 1, the posterior probability will not be sensitive to the choice of the prior probabilities. Confining the analysis to the four hypotheses and assuming that the priors are P(H) = 0.7 and P(U) = P(F) = P(S) = 0.1, Bayes’ rule tells us that

$$
P(H \mid \text{data})
= \frac{P(H)\,L(H)}{P(H)\,L(H) + P(U)\,L(U) + P(F)\,L(F) + P(S)\,L(S)}
= \frac{(0.7)(1)}{(0.7)(1) + (0.1)(10^{-12}) + (0.1)(10^{-5}) + (0.1)(10^{-7})}
> 0.999
$$

Because the likelihoods for all the rival hypotheses are orders of magnitude smaller than that for H, the terms for the rival hypotheses in the denominator (each prior probability weighted by its likelihood) are negligible, and P(H | data) is close to 1. The “accuracy” of the identification is even greater than 99.9%.
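The computation in the box can be checked with a few lines of code. This is a minimal sketch using the priors from the box and the likelihoods stated earlier in the post (1 for H, 10^-12 for U, 10^-5 for F, 10^-7 for S); the function name `posterior` is introduced here only for illustration.

```python
def posterior(priors, likelihoods):
    """Bayes' rule over a finite set of mutually exclusive hypotheses."""
    # Denominator: each prior weighted by the likelihood of the data.
    total = sum(priors[h] * likelihoods[h] for h in priors)
    return {h: priors[h] * likelihoods[h] / total for h in priors}


# Values from the box: H = suspect is the source, U = unrelated person,
# F = father, S = full sibling.
priors = {"H": 0.7, "U": 0.1, "F": 0.1, "S": 0.1}
likelihoods = {"H": 1.0, "U": 1e-12, "F": 1e-5, "S": 1e-7}

post = posterior(priors, likelihoods)
for hypothesis, probability in post.items():
    print(f"P({hypothesis} | data) = {probability:.10f}")

# Flat priors give nearly the same answer, showing the insensitivity to the prior.
flat = posterior({h: 0.25 for h in priors}, likelihoods)
print(f"P(H | data) with flat priors = {flat['H']:.10f}")
```

Changing the priors in the dictionary is an easy way to see how extreme a prior would have to be before the posterior for H drops below 0.999.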

The Bin Laden case probably is different. Although Bin Laden had contacts with journalists before 9/11, presumably, the CIA had no sample of Osama’s DNA to compare to the body. But it could have obtained some DNA samples of at least one relative–Osama had a remarkable number of half-siblings and children. Kinship analysis is commonly used to produce likelihood ratios for a given relationship (such as paternity or siblingship). [3]

ABC News reported that the government used DNA from the brain of a half-sister who had died at Massachusetts General Hospital in Boston [1], but that level of relationship, standing alone, probably is too weak to give a large enough likelihood ratio to warrant assertions of “99.9 percent accuracy” (unless a huge number of loci were involved).

Could the government have had a sample from Osama’s son, Omar? Why not? ABC News interviewed him in 2010. [4] CNN did an interview in 2008. He was deported from England. (For that matter, eight days after 9/11/2001, at least 13 relatives, along with bodyguards and associates, left Boston on a chartered Ryan Airlines flight.) In the U.S. and elsewhere, police and private individuals have followed people around to get DNA samples without their knowledge. [5, 6]

With Omar’s DNA to compare against a sample from the body, Y-STRs combined with STRs from other chromosomes should have been enough to produce a very large likelihood ratio (relative to an unrelated man) for paternity if the body was indeed Osama’s. But were all the men in the compound that was assaulted unrelated to Osama? The likelihood ratio would be smaller for a comparison to one of Osama’s half-siblings (through Osama’s father). Even with respect to the hypothesis that the body was a half-brother, however, the likelihoods could be quite convincing for a substantial number of loci. If the likelihoods for an uncle or for an unrelated man are many times smaller than that for paternity, the genetic evidence strongly favors paternity.

It also is reported that a different son was killed in the raid. Comparing samples from both bodies could help establish the father-son relationship of those two bodies. Similar analyses helped demonstrate the identities of bones found in a mass grave in Siberia as members of the Russian royal Romanov family. [2]

In short, the claim that kinship testing with DNA from relatives of Osama Bin Laden establishes his death is credible, but 99.9% seems like a metaphor rather than the result of a direct computation. You cannot get around Bayes’ theorem. A posterior probability like 99.9% has to reflect some prior probability. That said, if we assume that the prior probability based on photographs and other information is substantial and that the likelihoods for unrelated men and half-brothers of Osama are small relative to the likelihood for Osama, the posterior probability could well equal or exceed 99.9%.
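The dependence of a 99.9% figure on the prior can be made concrete with the odds form of Bayes' theorem: posterior odds equal prior odds times the likelihood ratio. The sketch below simplifies to two hypotheses (the body is Osama's versus a single alternative), and the prior values are arbitrary assumptions for illustration, not estimates anyone has reported.

```python
def posterior_probability(prior, likelihood_ratio):
    """Posterior probability from a prior and a likelihood ratio (two hypotheses)."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


def required_likelihood_ratio(prior, target=0.999):
    """Smallest likelihood ratio that lifts the given prior to the target posterior."""
    target_odds = target / (1.0 - target)
    prior_odds = prior / (1.0 - prior)
    return target_odds / prior_odds


# ASSUMED priors, chosen only to show the range of possibilities.
for prior in (0.01, 0.5, 0.9):
    lr = required_likelihood_ratio(prior)
    print(f"prior {prior:4.2f}: need likelihood ratio of about {lr:,.0f} "
          f"to reach a posterior of 0.999")
```

A skeptic who starts at a prior of 0.01 needs a likelihood ratio roughly a thousand times larger than someone who starts at 0.9, so no single "accuracy" figure follows from the genetic data alone.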

References

1. Donald G. McNeil Jr. & Pam Belluck, Experts Say DNA Match Is Likely a Parent or Child, N.Y. Times, May 3, 2011, at F2

2. David H. Kaye, The Double Helix and the Law of Evidence (2010)

3. Leslie G. Biesecker et al., DNA Identifications After the 9/11 World Trade Center Attack, 310 Science 1122 (2005)

4. Lara Setrakian, Bin Laden’s Son: Worst Is Yet to Come, ABC News International, May 2, 2011, http://abcnews.go.com/International/osama-bin-ladens-son-death-unleash-violent-enemies/story?id=13509779

5. Amy Harmon, Stalking Strangers’ DNA to Fill In the Family Tree, New York Times, April 2, 2007

6. Tracy Johnson, Police Ruse Case Argued Before State’s Highest Court: Convicted Murderer Says Officers Broke Law with DNA Trick, Seattle Post-Intelligencer, Jan. 27, 2006

Cross-posted: Forensic Science, Statistics, and Law blog

New Doubts About Unscrambling Complex DNA Mixtures with SNPs

On September 5, 2008, I posted a report in the now defunct Science & Law Blog in the Law Professors Blog Network entitled Genetics Datasets Closed Due to Forensic DNA Discovery. It concerned a reported procedure for the equivalent of unscrambling a broken egg — using a large number of SNPs to determine whether a known individual’s DNA is part of a mixture of DNA from scores or even hundreds of DNA samples.

If Hollywood had been paying attention, we would have seen CSI techs checking a door knob to find out whether the suspect ever touched it. And however the report might have been received in Hollywood, it scared the NIH and other scientific organizations that maintain research databases. After reproducing the posting about the international reaction, I shall quote an abstract from a study slated for publication in the journal Forensic Science International: Genetics. The latest work makes it even clearer that limiting access to the databases was unnecessary.

Posting of 5 Sept. 2008

Until last Friday, the National Institutes of Health (NIH) and other groups had posted large amounts of aggregate human DNA data for easy access to researchers around the world. On Aug. 25, however, NIH removed the aggregate files of individual Genome Wide Association Studies (GWAS).

(The files, which include the Database of Genotypes and Phenotypes (dbGaP), run by the National Center for Biotechnology Information, and the Cancer Genetic Markers of Susceptibility database, run by the National Cancer Institute, remain available for use by researchers who apply for access and who agree to protect confidentiality using the same approach they do for individual-level study data.) The Wellcome Trust Case Control Consortium and the Broad Institute of MIT and Harvard also withdrew aggregate data.

The reason? The data keepers fear that police or other curious organizations or individuals might deduce whose DNA is reflected in the aggregated data, and hence, who participated in a research study. These data consist of SNPs (single nucleotide polymorphisms). These are differences in the base-pair sequences from different people at particular points in their genomes. Many SNPs are neutral: they do not have any impact on gene expression. Nonetheless, they can be helpful in determining the locations of nearby disease-related mutations.

The event that prompted the data keepers to act was the discovery at the Translational Genomics Research Institute (TGen) of a new way to check whether an individual’s DNA is a part of a complex mixture of DNA (possibly from hundreds of people). According to the TGen report, Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays, a statistic applied to intensity data from SNP microarrays (chips that detect tens of thousands of SNPs simultaneously) reveals whether the signals from an individual’s many SNPs are consistent with the possibility that the individual is not in the mixture. (Sorry for the wordiness, but the article uses hypothesis testing, and “not in the mixture” is the null hypothesis.)

How could this compromise the research databases? As best as I understand it, the scenario is that someone first would acquire a sample from somewhere. Your neighbor might check your garbage, isolate some of your DNA, get a SNP-chip readout, and check it against the public database to see if you were a research subject who donated DNA. Or, the police might have a crime-scene sample. Then they would use a SNP-chip to get a profile to compare to the record on the public database to see if the profile probably is part of the mixture data there. Finally, if they got a match, the police would approach the researchers to get the matching individual’s name.

Kathy Hudson, a public policy analyst at Johns Hopkins University, stated in an email that “While a fairly remote concern, and there are some protections even against subpoena, NIH did the right thing in acting to protect research participants.” However, scientists such as David Balding in the U.K. are complaining that the restrictions on the databases are an overreaction. Indeed, an author of the TGen study is quoted as stating that the new policy is “a bit premature.” See http://www.nature.com/news/2008/080904/full/news.2008.1083.html.

It seems doubtful that anonymity of the research databases has been breached, or will be in the immediate future, by this convoluted procedure. Of course, the longer-term implications remain to be seen, and the technique has obvious applications in forensic science. If the technique works as advertised, police will be able to take a given suspect and determine whether his DNA is part of a mixture from a large number of individuals that was recovered at a crime scene. Analyzing complex mixtures for identity is difficult to do with standard (STR-based) technology.

References

- Homer N, Szelinger S, Redman M, Duggan D, Tembe W, et al., Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays, PLoS Genetics (2008). 4(8):e1000167. doi:10.1371/journal.pgen.1000167
- DNA databases shut after identities compromised, Nature 455:13, Sept. 3, 2008
- Natasha Gilbert, Researchers criticize genetic data restrictions, Nature, Sept. 4, 2008, <http://www.nature.com/news/2008/080904/full/news.2008.1083.html>

Latest study

The latest study is Egeland et al., Complex Mixtures: A Critical Examination of a Paper by Homer et al., Forensic Sci. Int’l: Genetics, 2011. It is in press. Corrected proofs are available online (for a price) at the journal’s website.

Abstract: DNA evidence in criminal cases may be challenging to interpret if several individuals have contributed to a DNA-mixture. The genetic markers conventionally used for forensic applications may be insufficient to resolve cases where there is a small fraction of DNA (say less than 10%) from some contributors or where there are several (say more than 4) contributors. Recently methods have been proposed that claim to substantially improve on existing approaches [1]. The basic idea is to use high-density single nucleotide polymorphism (SNP) genotyping arrays including as many as 500,000 markers or more and explicitly exploit raw allele intensity measures. It is claimed that trace fractions of less than 0.1% can be reliably detected in mixtures with a large number of contributors. Specific forensic issues pertaining to the amount and quality of DNA are not discussed in the paper and will not be addressed here. Rather our paper critically examines the statistical methods and the validity of the conclusions drawn in Homer et al. (2008).

We provide a mathematical argument showing that the suggested statistical approach will give misleading results for important cases. For instance, for a two person mixture an individual contributing less than 33% is expected to be declared a non-contributor. The quoted threshold 33% applies when all relative allele frequencies are 0.5. Simulations confirmed the mathematical findings and also provide results for more complex cases. We specified several scenarios for the number of contributors, the mixing proportions and allele frequencies and simulated as many as 500,000 SNPs.

A controlled, blinded experiment was performed using the Illumina GoldenGate® 360 SNP test panel. Twenty-five mixtures were created from 2 to 5 contributors with proportions ranging from 0.01 to 0.99. The findings were consistent with the mathematical result and the simulations.

We conclude that it is not possible to reliably infer the presence of minor contributors to mixtures following the approach suggested in Homer et al. (2008). The basic problem is that the method fails to account for mixing proportions.
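The 33% threshold quoted in the abstract can be checked with a rough simulation. The sketch below rests on several assumptions that are not in the abstract: that the statistic is built from the per-SNP distance D = |Y - Pop| - |Y - M| (a simplified reading of the Homer et al. approach), that the reference population frequency is known exactly to be 0.5 at every SNP, that genotypes follow Hardy-Weinberg proportions, and that there is no measurement noise. A positive mean of D points toward "in the mixture" and a negative mean toward "not in the mixture."

```python
import random
import statistics


def simulate_mean_distance(fraction, n_snps=20000, allele_freq=0.5, seed=0):
    """Mean of D = |Y - Pop| - |Y - M| for a two-person mixture.

    `fraction` is the share of the mixture contributed by the person of
    interest. All modeling choices here are simplifying assumptions.
    """
    rng = random.Random(seed)

    def genotype():
        # Allele fraction of one person at one SNP: 0, 0.5, or 1.
        return (int(rng.random() < allele_freq) + int(rng.random() < allele_freq)) / 2

    distances = []
    for _ in range(n_snps):
        y = genotype()  # person of interest
        x = genotype()  # the other contributor
        mixture = fraction * y + (1 - fraction) * x
        distances.append(abs(y - allele_freq) - abs(y - mixture))
    return statistics.mean(distances)


def expected_mean_distance(fraction):
    """Analytic mean of D when every allele frequency is 0.5.

    E|Y - 0.5| = 1/4 and E|Y - X| = 3/8, so E[D] = 1/4 - (1 - fraction) * 3/8,
    which changes sign at fraction = 1/3.
    """
    return 0.25 - (1 - fraction) * 0.375


for fraction in (0.10, 0.33, 0.50):
    print(f"fraction {fraction:.2f}: simulated mean D = "
          f"{simulate_mean_distance(fraction):+.4f}, "
          f"expected = {expected_mean_distance(fraction):+.4f}")
```

Under these assumptions the mean of D is negative whenever the person of interest contributes less than one third of the mixture, so a true minor contributor looks like a non-contributor, in line with the abstract's statement.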