Taking Liberties with the Numbers

I am enhancing material that appeared on the Science and Law Blog in the Law Professors Blog Network. That blog, initiated by David Faigman, a colleague and former student, has been deactivated for want of sustained activity.  In April 2009, I remarked there that an article in the California Lawyer exemplified and perpetuated the confusion in the media about DNA database trawls. Credulous bloggers with PhDs seized on the story, which harks back to a controversial article in the Los Angeles Times. I’ll discuss both articles here.

I. Fuzzy Thinking About Math

In “Guilt by the Numbers: How Fuzzy is the Math that Makes DNA Evidence Look So Compelling to Jurors?” award-winning journalist Edward Humes discusses the unusual case of People v. Puckett (No. A121368, Cal. Ct. App., 1st Dist., May 1, 2008). John Puckett, now an elderly man, is appealing his recent conviction for the 1972 murder of Diane Sylvester, a San Francisco nurse. The conviction rests on a cold hit in California’s convicted-offender database at a small number of STR loci (genetic locations). Humes writes that in Puckett, “the prosecution’s expert estimated that the chances of a coincidental match between the defendant’s DNA and the biological evidence found at the crime scene were 1 in 1.1 million.” Id. at 22. Then he adds “there’s another way to run the numbers” which shows that “the odds of a coincidental match in Puckett’s case are a whopping 1 in 3.” Id. “Both calculations,” he maintains, “are accurate. The problem is that they answer different questions.” Id. The explanation, he believes, lies in “a classic statistical puzzle known as the ‘birthday problem.’” Id.

The author’s skill as a writer exceeds his insight as a mathematician. Surely the probability of “a coincidental match” cannot have such fantastically different “accurate” values. Moreover, the birthday problem has almost nothing to do with these numbers. The fuzziness is in the words of the article, not in the math. Only if we define “a coincidental match” can we begin to see what its probability would be and how unlike the birthday problem it is.

Definition 1. The probability of a coincidental match is the chance that Mr. Puckett is innocent and the match to him is just a coincidence.

The average reader might think that a coincidental match means that Mr. Puckett is innocent and the match to him is just a coincidence.  If this is what it means, however, its probability is neither 1 in 1.1 million nor 1 in 3.  The former figure is the probability that Puckett’s DNA would match if he were the only one whose DNA had been checked and if he were unrelated to the killer. The latter figure is the probability that at least one profile in the California database — not necessarily Puckett’s — would match if no one in the database were the killer.  Notice that both probabilities are conditional — they depend on assumptions about who the real killer is or is not.  They cannot readily be inverted or transposed into the probability of who the real killer is. Under Definition 1, therefore, neither number is an “accurate” statement of the probability of a coincidental match.  Neither one expresses the chance that the match to Mr. Puckett is just a coincidence.

A technical note: This description of the probabilities of 1 in 1.1 million and 1 in 3 assumes, for simplicity, that it was the killer’s DNA that was found near the victim and later typed and that there was no possibility of error in the DNA typing, no ambiguity in the test results, and no selectivity in presenting them. Statisticians will immediately recognize that Bayes’ rule could be used to arrive at the posterior probability of Puckett’s innocence.
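
To make the technical note concrete, Bayes’ rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not an analysis of the case: the function name and the prior probabilities are hypothetical, the 1 in 1.1 million figure is the random-match probability quoted above, and the sketch treats Puckett as the only person tested, so it ignores the database search altogether.

```python
from fractions import Fraction

# Random-match probability quoted in the text.
RANDOM_MATCH_PROB = Fraction(1, 1_100_000)

def posterior_source_given_match(prior_source: Fraction) -> Fraction:
    """P(Puckett is the source | his DNA matches), by Bayes' rule.

    Assumes a true source always matches (no laboratory error) and that
    a non-source unrelated to the killer matches with RANDOM_MATCH_PROB.
    """
    match_if_source = Fraction(1)
    match_if_not_source = RANDOM_MATCH_PROB
    numerator = prior_source * match_if_source
    denominator = numerator + (1 - prior_source) * match_if_not_source
    return numerator / denominator

# Hypothetical priors, chosen only to show how much the answer depends on them.
for prior in (Fraction(1, 2), Fraction(1, 1_000), Fraction(1, 1_000_000)):
    posterior_innocent = 1 - posterior_source_given_match(prior)
    print(f"prior {float(prior):.6f} -> "
          f"P(not source | match) = {float(posterior_innocent):.6f}")
```

The posterior probability moves with the prior, which is why neither conditional probability can simply be read as the chance that the match to Mr. Puckett is a coincidence.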

Definition 2. The probability of a coincidental match means the chance that Mr. Puckett’s DNA would match (and no other DNA in the database would) if he were not the killer and if he were unrelated to the killer.

This definition refers to the probability of the DNA evidence given the hypothesis of coincidence. Again, neither 1 in 1.1 million nor 1 in 3 expresses this value, but 1 in 1.1 million is a far closer estimate than is 1 in 3. The reason is that the DNA evidence includes not merely the datum that Puckett’s DNA matches, but the additional information that no one else’s does. If Puckett were the only one tested (a database of size 1) and if he were innocent, then the chance that he would match would be 1 in 1.1 million. Now we test an unrelated second person. The chance that this individual would match if he were innocent also is 1 in 1.1 million, and the chance that he would match if he were the killer is 1. The chance that Puckett matches and the other man does not is therefore either (1/1,100,000) x (1 - 1/1,100,000) (if both men are innocent) or (1/1,100,000) x 0 (if Puckett is innocent and the other man is the killer, because the killer is certain to match). In other words, the probability that Puckett matches just by coincidence (he matches if he is innocent) in a search of a database of size 2 is, at most, 1 in 1.1 million. Searching the database and finding that only Puckett matches is better evidence than testing only Puckett.  This reasoning is developed more fully, for a database of any size, in, e.g., David H. Kaye, Rounding Up the Usual Suspects: A Legal and Logical Analysis of DNA Database Trawls, 87 N. Car. L. Rev. 425 (2009).
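
The arithmetic for a small database can be checked directly. The following is a minimal sketch, assuming that unrelated innocent people match independently of one another with the quoted probability of 1 in 1.1 million; the function name and parameters are hypothetical and introduced only for illustration.

```python
from fractions import Fraction

# Random-match probability quoted in the text.
P_MATCH = Fraction(1, 1_100_000)

def only_puckett_matches(other_profiles: int,
                         killer_in_database: bool = False) -> Fraction:
    """P(Puckett matches and nobody else does | Puckett is innocent).

    other_profiles counts the database entries besides Puckett's.
    If the killer is among them, the killer is certain to match, so the
    event "nobody else matches" has probability zero.
    """
    if killer_in_database:
        return Fraction(0)
    # Puckett matches by chance; every other innocent profile fails to match.
    return P_MATCH * (1 - P_MATCH) ** other_profiles

print(float(only_puckett_matches(0)))        # database of size 1
print(float(only_puckett_matches(1)))        # size 2, both men innocent
print(float(only_puckett_matches(1, True)))  # size 2, the other man is the killer
```

Under these assumptions the value never exceeds 1 in 1.1 million, however many other profiles are searched.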

Definition 3. The probability of a coincidental match means the chance that one or more DNA profiles in the database would match if no one in the database is the killer.

This definition refers to the probability of one or more hits in the database given that the database is innocent. This probability is approximately 1 in 3. What it has to do with the probability that the DNA in the bedroom was Mr. Puckett’s is obscure.  It is not even the expected rate at which searches of innocent databases would lead to prosecutions. After all, the 1 in 3 figure includes people who were not even born in 1972, when Puckett allegedly killed Diane Sylvester. If the probability that applies under Definition 3 were to be admitted, it should be adjusted so that it is not so misleadingly large. See id.; David H. Kaye, People v. Nelson: A Tale of Two Statistics, 7 L., Probability, & Risk 247 (2008).
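
A figure of this kind comes from multiplying the random-match probability by the number of profiles searched. The sketch below shows the calculation under loudly stated assumptions: the database size of 338,000 is an assumed input that is not given in the text above, the profiles are treated as independent and unrelated, and the function names are hypothetical.

```python
import math

# Random-match probability quoted in the text.
P_MATCH = 1 / 1_100_000

# ASSUMPTION: number of profiles searched. This figure is not stated in
# the text; it is used here only to illustrate the arithmetic.
ASSUMED_DATABASE_SIZE = 338_000

def expected_matches(n: int, p: float = P_MATCH) -> float:
    """Expected number of coincidental matches in an innocent database."""
    return n * p

def prob_at_least_one_match(n: int, p: float = P_MATCH) -> float:
    """P(one or more matches) = 1 - (1 - p)**n, computed stably."""
    return -math.expm1(n * math.log1p(-p))

print(f"n * p            = {expected_matches(ASSUMED_DATABASE_SIZE):.4f}")
print(f"1 - (1 - p) ** n = {prob_at_least_one_match(ASSUMED_DATABASE_SIZE):.4f}")
```

With these assumed inputs the product n x p comes to roughly 0.31, and the exact probability of at least one match is somewhat lower. Either way, the number describes the database as a whole, not any named individual in it.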

The Birthday Problem

Also contrary to the claim in the California Lawyer, the birthday problem is not involved in Puckett. The birthday problem, in its simplest form, asks for the smallest number of people in a room such that the probability that at least two of them will have birthdays on the same day of the same month exceeds one-half. The answer (23) is surprisingly small because no particular birthday is specified. In the Puckett search, however, a particular DNA profile (the one from the crime scene) is specified. Finding that this particular profile matches at least one in the database is much less likely than finding at least one match between all pairs of profiles in the database. The latter event is the kind that is at issue in the birthday problem.  See David H. Kaye, DNA Database Woes: What Is the FBI Afraid Of?, Cornell J. L. & Public Policy (2010, in press). It is not involved in a cold hit to a crime-scene profile.
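
The difference between the two kinds of event shows up clearly with birthdays themselves. The sketch below assumes 365 equally likely birthdays and independent people; the function names are hypothetical. It finds the smallest group for which each probability exceeds one-half.

```python
def prob_any_shared_birthday(n: int, days: int = 365) -> float:
    """P(at least two of n people share a birthday); no date is specified."""
    prob_all_distinct = 1.0
    for k in range(n):
        prob_all_distinct *= (days - k) / days
    return 1.0 - prob_all_distinct

def prob_specified_birthday(n: int, days: int = 365) -> float:
    """P(at least one of n people has one particular, pre-specified birthday)."""
    return 1.0 - ((days - 1) / days) ** n

def smallest_group(prob_fn, threshold: float = 0.5) -> int:
    """Smallest n for which prob_fn(n) exceeds the threshold."""
    n = 1
    while prob_fn(n) <= threshold:
        n += 1
    return n

print(smallest_group(prob_any_shared_birthday))  # 23: any pair may match
print(smallest_group(prob_specified_birthday))   # 253: one fixed target date
```

A search for one specified target, which is the analogue of comparing a single crime-scene profile with a database, needs a far larger group to reach the same probability than the all-pairs comparison of the birthday problem does.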

There are other errors in the California Lawyer article, but I hope I have said enough to caution readers to be wary. The media portrait of the database-trawl issue bears but a faint resemblance to the statistical literature on the subject.

II. The LA Times’s Gaffe

On March 4, 2008, the Los Angeles Times published “When a Match is Far from a Lock,” an account of the perceived need to adjust the probability for a random match when an individual emerges as a suspect because of a trawl through a database of DNA profiles. The reporters suggested that there was a grave injustice because “the prosecutor told the jury that the chance of such a coincidence was 1 in 1.1 million,” but “jurors were not told the statistic that leading scientists consider the most significant: the probability that the database search had hit upon an innocent person. In Puckett’s case, it was 1 in 3.” They added that “the case is emblematic of a national problem,” announcing that “The Times has found [that p]rosecutors and crime labs across the country routinely use numbers that exaggerate the significance of DNA matches in ‘cold hit’ cases, in which a suspect is identified through a database search.”

The Times received some flak for this breathless reporting. Not only do many leading statisticians dispute the claim that an adjustment for the size of the database searched produces the most significant statistic, but, it was said, the description of “1 in 3” as “the probability that the database had hit upon an innocent person” was wrong. The critical readers complained that, at best, 1/3 was the chance of a match to someone in the database if neither Puckett nor anyone else in the database were the source of the DNA in the bedroom of the murdered woman. It is not the chance that Puckett is not the source given that his DNA matches.

To equate the two probabilities is to slip into the transposition fallacy that P(A given B) = P(B given A). Conditional probabilities do not work this way. For instance, the chance that a card randomly drawn from a deck of ordinary playing cards is a picture card given that it is red is not the chance that it is red given that it is a picture card. The former probability is P(picture if red) = 6/26. The latter is P(red if picture) = 6/12.
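
The card example can be verified by enumerating the deck. This is a small sketch, assuming a standard 52-card deck in which the picture cards are the jack, queen, and king; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 (rank, suit) pairs

def is_red(card) -> bool:
    return card[1] in ("hearts", "diamonds")

def is_picture(card) -> bool:
    return card[0] in ("J", "Q", "K")

def conditional(event, given) -> Fraction:
    """P(event | given) for a card drawn uniformly at random from DECK."""
    cards_given = [card for card in DECK if given(card)]
    cards_both = [card for card in cards_given if event(card)]
    return Fraction(len(cards_both), len(cards_given))

print(conditional(is_picture, is_red))  # 6/26, shown in lowest terms as 3/13
print(conditional(is_red, is_picture))  # 6/12, shown in lowest terms as 1/2
```

The same six red picture cards appear in both numerators; only the conditioning set in the denominator changes, and that is exactly what transposition ignores.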

The reporters responded with the following defense:

In our story, we did not write that there was a 1 in 3 chance that Puckett was innocent, which would be a clear example of the prosecutor’s fallacy. Rather, we wrote: “Jurors were not told, however, the statistic that leading scientists consider the most significant: the probability that the database search had hit upon an innocent person. In Puckett’s case, it was 1 in 3.” The difference is subtle, but real.

Interestingly, when evidence professors on a listserv were asked whether there was any difference, two described the statement as ambiguous, while four saw it as a clear instance of transposition.

My view is that the following two statements are true:

1. IF THE DATABASE WERE INNOCENT (meaning that it does not contain the source of the crime-scene DNA and everyone in it is unrelated), then (prior to the trawl) the probability that SOMEONE (regardless of his or her name) would match is roughly 1/3.

2. IF THE DATABASE WERE INNOCENT, then (prior to the trawl) the probability that a man named Puckett would match is 1/1,100,000.

But neither (1) nor (2) is equivalent to

3. The probability that the database search hit upon an innocent person named Puckett was 1/3.

Yet, it seems that reporters Jason Felch and Maura Dolan told at least one juror who had convicted Puckett that he had done so even though the probability was as high as 1 in 3 that the cold hit was to an innocent person named Puckett. The juror responded predictably to this distressing news: “Of course it would have changed things. It would have changed a lot of things.” Perhaps someone should debrief the juror and tell him precisely what the 1/3 figure refers to.

References

Edward Humes, Guilt by the Numbers: How Fuzzy is the Math that Makes DNA Evidence Look So Compelling to Jurors?, California Lawyer, Apr. 2009, at 21-24.

Peter Donnelly & Richard D. Friedman, DNA Database Searches and the Legal Consumption of Scientific Evidence, 97 Mich. L. Rev. 931-984 (1999).

Jane Kaye, Police Collection and Access to DNA Samples, 2 Genomics, Society and Policy 16-27 (2006).

David H. Kaye, People v. Nelson: A Tale of Two Statistics, 7 L., Probability, & Risk 247 (2008).

David H. Kaye, Rounding Up the Usual Suspects: A Legal and Logical Analysis of DNA Database Trawls, 87 N. Car. L. Rev. 425 (2009).