
Trashing Junk DNA: Alice in Genomeland

Earlier today, I introduced the concepts and terms required to ascertain whether the estimated proportion of the genome that encodes the structure of proteins or regulates gene expression has jumped from 5 or 10% to 80%. I now focus on the possible meanings of “functional” to see whether the ENCODE papers state or imply any such seismic change. It appears that they do not.

“Functional” is an adjective, and Alice learned from Humpty Dumpty that adjectives are malleable:

“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean–neither more nor less.”
“The question is,” said Alice, “whether you can make words mean so many different things.”
“The question is,” said Humpty Dumpty, “which is to be master–that’s all.”
Alice was too much puzzled to say anything, so after a minute Humpty Dumpty began again. “They’ve a temper, some of them–particularly verbs, they’re the proudest–adjectives you can do anything with, but not verbs–however, I can manage the whole lot! Impenetrability! That’s what I say!”

Like Humpty, who was redefining the word “glory,” the ENCODE authors recognized that “functional” can have many meanings. As Ewan Birney later explained:

Like many English language words, “functional” is a very useful but context-dependent word. Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism?1/

Still other possibilities exist. For example, the first paper to use the adjective “junk” for noncoding DNA noted that even debris accumulated in the course of evolution or introduced from viral infections could have a function simply by creating spaces between genes.2/ The pieces of dead wood that are joined together to form the hull of a row boat have a function–they exclude the water from the vessel to keep it afloat. This does not mean that the detailed structure of the planks–the precise width of each plank or the number of ridges on its surface–affects its functionality. And, just as something can be inactive and functional, so too something can be alive with activity and yet be nonfunctional.

ENCODE uses biochemical activity–the notion that “the biochemistry would be different”–as a synonym for functional. Here is the definition of “functional” in the top-level paper:

Operationally, we define a functional element as a discrete genome segment that encodes a defined product (for example, protein or non-coding RNA) or displays a reproducible biochemical signature (for example, protein binding, or a specific chromatin structure).3/

This definition may be useful for the purpose of describing the size of ENCODE’s catalog of elements for later study, but it contrasts sharply with the notion of functional as affecting a nontrivial phenotype. The ENCODE papers show that 80% of the genome displays signs of certain types of biochemical activity–even though the activity may be insignificant, pointless, or unnecessary. This 80% includes all of the introns, for they are active in the production of pre-mRNA transcripts. But this hardly means that they are regulatory or otherwise functional.4/ Indeed, if one carries the ENCODE definition to its logical extreme, 100% of the genome is functional–for all of it participates in at least one biochemical process–DNA replication.

That the ENCODE project would not adopt the most extreme biochemical definition is understandable–that definition would be useless. But the ENCODE definition is still grossly overinclusive from the standpoint of evolutionary biology. From that perspective, most estimates of the proportion of “functional” DNA are well under 80%. Various biologists or related specialists have provided varying guesstimates:

  • Under 50%: “About 1% … is coding. Something like 1-4% is currently expected to be regulatory noncoding DNA … . About 40-50% of it is derived from transposable elements, and thus affirmatively already annotated as “junk” in the colloquial sense that transposons have their own purpose (and their own biochemical functions and replicative mechanisms), like the spam in your email. And there’s some overlap: some mobile-element DNA has been co-opted as coding or regulatory DNA, for example. [¶] … Transposon-derived sequence decays rapidly, by mutation, so it’s certain that there’s some fraction of transposon-derived sequence we just aren’t recognizing with current computational methods, so the 40-50% number must be an underestimate. So most reasonable people (ok, I) would say at this point that the human genome is mostly junk (“mostly” as in, somewhere north of 50%).”5/
  • 40%: “ENCODE biologist John Stamatoyannopoulos … said … that some of the activity measured in their tests does involve human genes and contributes something to our human physiology. He did admit that the press conference mislead people by claiming that 80% of our genome was essential and useful. He puts that number at 40%.”6/
  • 20%: “[U]sing very strict, classical definitions of “functional” [to refer only to] places where we are very confident that there is a specific DNA:protein contact, such as a transcription factor binding site to the actual bases–we see a cumulative occupation of 8% of the genome. With the exons (which most people would always classify as “functional” by intuition) that number goes up to 9%. … [¶] In addition, in this phase of ENCODE we did [not] sample … completely in terms of cell types or transcription factors. [W]e’ve seen [at most] around 50% of the elements. … A conservative estimate of our expected coverage of exons + specific DNA:protein contacts gives us 18%, easily further justified (given our [limited] sampling) to 20%.”7/
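The arithmetic behind the last estimate can be sketched in a few lines. This is only one reading of the quoted passage: it assumes that the 18% figure comes from scaling the observed 9% by the roughly 50% of elements said to have been sampled. The function name and the simple linear scaling are illustrative assumptions, not ENCODE's method.

```python
def extrapolate_coverage(observed_pct, fraction_sampled):
    """Scale an observed percentage up to full sampling.

    Assumes (hypothetically) that unsampled elements cover the genome
    at the same rate as the sampled ones.
    """
    return observed_pct / fraction_sampled


# Figures taken from the quoted passage: 9% observed (exons plus specific
# DNA:protein contacts), with about 50% of the elements seen so far.
observed_pct = 9.0
fraction_sampled = 0.5

estimate = extrapolate_coverage(observed_pct, fraction_sampled)
print(f"Extrapolated coverage: {estimate:.0f}% of the genome")
```

On these inputs the sketch gives 18%, which the quotation then rounds up to 20%. Either way, the figure is a small fraction of the 80% headline number.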

So why did the ENCODErs opt for the broadest arguable definition of “functional”? Birney’s answer is that it describes a quantity that the project could measure; that the larger number underscores that a lot is happening in the genome; that it would have confused readers to receive a range of numbers; and that the smaller number would not have counted the efforts of all the researchers.

Whether these are very satisfactory reasons for trumpeting a widely misunderstood number is a matter that biologists can debate. All I can say is that (1) I have been unable to extract a clear number–whatever one should make of it–for a percentage of the genome that constitutes the regulatory elements–the promoters, enhancers, silencers, ncRNA “genes,” and so on; (2) this number is almost surely less than the 80% figure that, at first glance, one might have thought ENCODE was reporting; and (3) “functional element” as defined by the ENCODE Project is not a term that has clear or direct implications for claims of the law enforcement community that the loci used in forensic identification are not coding and therefore not informative.

Of course, none of this means that the description of the information content of the CODIS STRs traditionally presented by law enforcement authorities is correct. It simply means that even after this phase of ENCODE, there are still a huge number of base pairs that might or might not be regulatory or influence regulation and, hence, gene expression. The CODIS STRs might or might not be among them. Published reports suggest that they are not,8/ but the logic that just because a DNA sequence is noncoding (and nonregulatory), it conveys zero information about phenotype is flawed. It overlooks the possibility of a correlation between the nonfunctional sequence and a functional one (because the former sits next to an exon or a regulatory sequence).9/ Again, however, the published literature reviewing the CODIS STRs does not reveal any population-wide correlations that permit valid and strong inferences about disease status or propensity or other socially significant phenotypes.10/

Will this situation change? A thoughtful answer would take up a lot of space.11/ For now, I’ll just repeat the aphorism attributed to Yogi Berra, Niels Bohr, and Storm P: “It’s hard to make predictions, especially about the future.”


1. Ewan Birney, ENCODE: My Own Thoughts, Ewan’s Blog: Bioinformatician at Large, Sept. 5, 2012, http://genomeinformatician.blogspot.co.uk/2012/09/encode-my-own-thoughts.html.

2. David E. Comings, The Structure and Function of Chromatin, in 3 Advances in Human Genetics 237, 316 (H. Harris & K. Hirschhorn eds. 1972) (“Large spaces between genes may be a contributing factor to the observation that most recombination in eukaryotes is inter- rather than intragenic. Furthermore, if recombination tended to be sloppy with most mutational errors occurring in the process, it would [be] an obvious advantage to have it occur in intergenic junk.”). For more discussion of this paper, see T. Ryan Gregory, ENCODE (2012) vs. Comings (1972), Sept. 7, 2012, http://www.genomicron.evolverzone.com/2012/09/encode-2012-vs-comings-1972/.

3. Ian Dunham et al., An Integrated Encyclopedia of DNA Elements in the Human Genome, 489 Nature 57 (2012).

4. These regions do contain some RNA-coding sequences, and those small parts could be doing something interesting (producing RNAs that are regulatory or that defend against infection by viral DNA, for example), but this kind of activity does not exist in the bulk of the introns that are, under the ENCODE definition, 100% functional.

5. Sean Eddy, ENCODE Says What?, Sept. 8, 2012, http://selab.janelia.org/people/eddys/blog/?p=683. He adds that:

[A]s far as questions of “junk DNA” are concerned, ENCODE’s definition isn’t relevant at all. The “junk DNA” question is about how much DNA has essentially no direct impact on the organism’s phenotype–roughly, what DNA could I remove (if I had the technology) and still get the same organism. Are transposable elements transcribed as RNA? Do they bind to DNA-binding proteins? Is their chromatin marked? Yes, yes, and yes, of course they are–because at least at one point in their history, transposons are “alive” for themselves (they have genes, they replicate), and even when they die, they’ve still landed in and around genes that are transcribed and regulated, and the transcription system runs right through them.

6. Faye Flam, Skeptical Takes on Elevation of Junk DNA and Other Claims from ENCODE Project, Sept. 12, 2012, http://ksj.mit.edu/tracker/2012/09/skeptical-takes-elevation-junk-dna-and-o. Stamatoyannopoulos added that:

What the ENCODE papers … have to say about transposons is incredibly interesting. Essentially, large numbers of these elements come alive in an incredibly cell-specific fashion, and this activity is closely synchronized with cohorts of nearby regulatory DNA regions that are not in transposons, and with the activity of the genes that those regulatory elements control. All of which points squarely to the conclusion that such transposons have been co-opted for the regulation of human genes — that they have become regulatory DNA. This is the rule, not the exception.

7. Ewan Birney, ENCODE: My Own Thoughts, Ewan’s Blog: Bioinformatician at Large, Sept. 5, 2012, http://genomeinformatician.blogspot.co.uk/2012/09/encode-my-own-thoughts.html.

8. E.g., Sara H. Katsanis & Jennifer K. Wagner, Characterization of the Standard and Recommended CODIS Markers, J. Forensic Sci. (2012).

9. E.g., David H. Kaye, Two Fallacies About DNA Databanks for Law Enforcement, 67 Brook. L. Rev. 179 (2001).

10. E.g., Sara H. Katsanis & Jennifer K. Wagner, Characterization of the Standard and Recommended CODIS Markers, J. Forensic Sci. (2012).

11. For my earlier, and possibly dated, effort to evaluate the likelihood that the CODIS loci someday will prove to be powerfully predictive or diagnostic, see David H. Kaye, Please, Let’s Bury the Junk: The CODIS Loci and the Revelation of Private Information, 102 Nw. U. L. Rev. Colloquy 70 (2007), and Mopping Up After Coming Clean About “Junk DNA”, Nov. 23, 2007.

Trashing Junk DNA: The Notorious 80%

Last week I noted some of the hyperbolic headlines accompanying the coordinated publication of a large number of datasets from the ENCODE Project. The abstract of the top-level paper begins as follows:

The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions.1/

Hoping to decipher these sentences, I have been reading about gene regulation. This modest effort stems from more than academic curiosity. If the popular and even some of the scientific press is to be believed, ENCODE has exorcized “junk DNA” from the body of scientific knowledge.2/ The bright light suddenly shining on the “dark matter” of the genome (to introduce another sloppy metaphor)3/ raises a giant question mark for the criminal justice system. Law enforcement authorities have always insisted that the snippets of DNA used to generate DNA identification profiles are just nonfunctional “junk.”4/ Now, according to New York Times science correspondent Gina Kolata,

As scientists delved into the “junk” – parts of the DNA that are not actual genes containing instructions for proteins – they discovered a complex system that controls genes. At least 80 percent of this DNA is active and needed. … [¶] … The thought before the start of the project, said Thomas Gingeras, an Encode researcher from Cold Spring Harbor Laboratory, was that only 5 to 10 percent of the DNA in a human being was actually being used.5/

This juxtaposition of percentages suggests that the scientific community has shifted from the view that “only 5 to 10 percent” of the genome is functional (“needed” for the organism to function normally) to a sudden realization that 80% falls into this category.

But the more I read, the clearer it became that this description of a sudden phase transition in science is wildly inaccurate. Johns Hopkins biostatistician Steven Salzberg, in a provocative Simply Statistics podcast interview, describes the 80% figure touted in the ENCODE paper as irresponsible.6/ University of Toronto biochemist Laurence Moran saw it as a repeat of a similar, problematic performance five years ago, at the conclusion of the pilot phase of ENCODE.7/ Responding to criticism, ENCODE Project leader Ewan Birney explained the new knowledge this way:

After all, 60% of the genome with the new detailed manually reviewed (GenCode) annotation is either exonic or intronic, and a number of our assays (such as PolyA- RNA, and H3K36me3/H3K79me2) are expected to mark all active transcription. So seeing an additional 20% over this expected 60% is not so surprising.8/

“Not so surprising”? A whopping 60%–not a minor 5 or 10%–was already estimated to be “active”? What is going on here?

The answer lies in the definition of some key terms (like exons, introns, and transcription) and requires a rudimentary understanding of the fundamentals of gene expression and its regulation in human beings. This posting presents the essential terminology and concepts. A sequel will apply them to explain what ENCODE’s “assign[ing] biochemical functions for 80% of the genome” means. Anyone who knows what RNA transcripts and transcription factors do can skip this first part (or can read it to let me know of my inaccuracies).

To avoid suspense, I shall lay out my conclusions here and now: (1) if ENCODE gives a clear number for a percentage of the genome that regulates genes–the promoters, enhancers, silencers, ncRNA “genes,” and so on–I have yet to find it; (2) this number is almost surely less than the 80% figure reported for functionality; and (3) “functional element” as defined by the ENCODE Project is not a term that has clear or direct implications for claims of the law enforcement community that the loci used in forensic identification are not coding and therefore not informative. Those claims of zero information are somewhat exaggerated, but that is another story. For now, I merely describe some basics of gene expression and regulation.

Genes make proteins. But how? There are three big steps (with many activities within each step): transcription; post-transcription modification and transportation; and translation. All involve RNA, a single-stranded molecule related to DNA, and proteins. The basic picture is

  • Transcription to precursor messenger RNA: DNA + proteins -> pre-mRNA (in nucleus)
  • Post-transcriptional modification and transportation: pre-mRNA + proteins and RNAs -> mature mRNA (in cytoplasm)
  • Translation to protein: mRNA + tRNA and proteins -> expressed protein (in cytoplasm)

In the first big step, the base pairs of the gene are transcribed jot-for-jot into an RNA molecule (precursor messenger RNA, or pre-mRNA). In the second major step, the transcript is modified at its ends, edited to remove parts that do not code for the protein that will be made (splicing), and the mature messenger RNA (mRNA) is moved outside the nucleus. In the third phase, another type of RNA (transfer RNA, or tRNA) stitches together individual amino acids in the order dictated by the mRNA transcript to form a protein, thereby translating the DNA sequence mirrored in the mRNA into the amino-acid order of the protein. Translation occurs on a kind of microscopic workbench (a ribosome) made of yet another RNA (ribosomal RNA, or rRNA).
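For readers who think more easily in code, the three steps can be mimicked with a toy sketch. Everything in it is an illustrative assumption: the 21-base "gene," the intron coordinates, and the four-entry codon table are made up for the example (a real codon table has 64 entries), and real transcription reads the template strand rather than simply swapping T for U on the coding strand.

```python
# Toy codon table: only the codons used below (hypothetical subset).
CODON_TABLE = {"AUG": "Met", "GCU": "Ala", "UGG": "Trp", "UAA": "STOP"}


def transcribe(coding_strand):
    """Step 1: copy the gene into pre-mRNA (simplified as T -> U)."""
    return coding_strand.replace("T", "U")


def splice(pre_mrna, introns):
    """Step 2: drop the intron spans, given as (start, end) index pairs."""
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])
        position = end
    exons.append(pre_mrna[position:])
    return "".join(exons)


def translate(mrna):
    """Step 3: read the mRNA three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "STOP":
            break
        protein.append(residue)
    return protein


# A made-up gene: exon, intron, exon.
gene = "ATGGCT" + "GTAAGTCAG" + "TGGTAA"
pre_mrna = transcribe(gene)
mature_mrna = splice(pre_mrna, introns=[(6, 15)])
protein = translate(mature_mrna)
print(pre_mrna, mature_mrna, protein)
```

Note that the intron is transcribed (it appears in pre_mrna) but leaves no trace in the protein. That distinction between being transcribed and mattering to the product is the one on which the 80% debate turns.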

For all this to happen, the DNA, which lies tightly coiled in the chromosomes (in a protein-DNA matrix known as chromatin), must open up for transcription to occur. Thus, changes in the chromatin regulate transcription, and these changes can be brought about in a number of ways. Transcription factors (specialized proteins) bind to the DNA. The bound transcription factors then recruit an enzyme (RNA polymerase) that produces RNA. This occurs within a region of DNA, known as a promoter, near the start of the protein-coding DNA (the structural gene). The level of transcription is influenced by activator or repressor proteins that bind to still other small regions (enhancers and silencers, respectively) that also lie outside the structural gene. In short, chemical interactions that open or close the chromatin that houses the DNA and transcription factors regulate the first step in the DNA-to-protein process.

In the past decade, other mechanisms of regulation or control of gene expression have been discovered. Many DNA sequences are not transcribed into messenger RNA, but they are transcribed into a variety of other RNAs. These non-protein-coding DNA sequences can be thought of as genes for RNA. Courting confusion, they usually are called “noncoding” (ncDNA)–because they do not code for protein–but they certainly code for RNAs that are crucial to translation–rRNA and tRNA–and for other RNAs that affect transcription, translation, and DNA replication. So it turns out that the genome is abuzz with transcription-to-RNA activity and other events that feed into the expression of the (protein-)coding DNA.

Yet, this hardly means that every biochemical event along the DNA is functionally important. Some, perhaps many, non-mRNA transcripts are just “noise.” They may float around for a while, but they may not do anything except wither away. In addition, large segments of the DNA transcribed in the course of making mRNA appear in the initial transcript (the pre-mRNA) but never make it into mature mRNA. These unused parts of the pre-mRNA transcripts correspond to long stretches of DNA, known as introns, that interrupt the smaller coding parts–the exons–that are translated into proteins. The initially transcribed intronic parts are removed from the pre-mRNA in a process called RNA splicing. Most of the RNA from introns probably just dissipates.9/

All these terms are a mouthful, but armed with this basic understanding of genes, RNA, and proteins, we can see why the 80% figure does not mean what one might think. We shall also see that the estimated proportion of the genome that encodes the structure of proteins or regulates gene expression has not jumped from 5 or 10% to 80%.


1. Ian Dunham et al., An Integrated Encyclopedia of DNA Elements in the Human Genome, 489 Nature 57 (2012).

2. E.g., Elizabeth Pennisi, ENCODE Project Writes Eulogy for Junk DNA, 337 Science 1159 (2012).

3. E.g., Gina Kolata, Bits of Mystery DNA, Far From ‘Junk,’ Play Crucial Role, N.Y. Times, Sept. 5, 2012. In one respect, the “dark matter” metaphor misrepresents dark matter. The presence of dark matter is inferred from its gravitational effects on visible matter. The presence of noncoding DNA is known from experiments that detect and characterize it just as they do coding DNA. Perhaps the metaphor means that the sequence of “dark matter” DNA cannot be deduced from the structure of a protein made in a cell. This, however, is like saying that dark matter is matter that cannot be seen with the naked eye. And that is not what astronomers mean by dark matter.

4. E.g., House Committee on the Judiciary, Report on the DNA Analysis Backlog Elimination Act of 2000, 106th Cong., 2d Sess., H.R. Rep. No. 106-900(1), at 27 (“the genetic markers used for forensic DNA testing … show only the configuration of DNA at selected ‘junk sites’ which do not control or influence the expression of any trait.”); New York State Law Enforcement Council, Legislative Priorities 2012: DNA at Arrest, at 5, http://nyslec.org/pdfs/2012/1_DNA_2012.pdf (“The pieces of DNA that are analyzed for the databank were specifically chosen because they are ‘junk DNA.’”).

5. Kolata, supra note 3.

6. Interview by Roger Peng with Steven Salzberg, podcast on Simply Statistics, Sept. 7, 2012, http://simplystatistics.org/post/31056769228/interview-with-steven-salzberg-about-the-encode (“Why do they feel a need to say that 80% of the genome is functional? … They know it’s not true. They shouldn’t say it. … You don’t distort the science to get into the headlines.”).

7. Laurence A. Moran, The ENCODE Data Dump and the Responsibility of Scientists, Sept. 6, 2012, http://sandwalk.blogspot.com/2012/09/the-encode-data-dump-and-responsibility_6.html (“This is, unfortunately, another case of a scientist acting irresponsibly by distorting the importance and the significance of the data.”).

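8. Ewan Birney, ENCODE: My Own Thoughts, Ewan’s Blog: Bioinformatician at Large, Sept. 5, 2012, http://genomeinformatician.blogspot.co.uk/2012/09/encode-my-own-thoughts.html.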

9. Post-splicing processing of a small fraction of the RNA from introns can produce noncoding RNAs that may regulate protein expression. L. Fedorova & A. Fedorov, Puzzles of the Human Genome: Why Do We Need Our Introns?, 6 Current Genomics 589, 592 (2005).

I am grateful to Eileen Kane for explaining some of the molecular biology to me. This entry is cross-posted to the Forensic Science, Statistics, and the Law Blog.

CODIS Loci Ready for Disease Prediction, Vermont Court Says

A trial court in Vermont has gone where no court has gone before. In State v. Abernathy [1], Chittenden Superior Court Judge Alison Sheppard Arms found that because “[s]ix CODIS loci … have associations with an increased risk of disease or have functional properties,” the custodians of law enforcement DNA databases can make “probabilistic predictions of disease.” According to the judge, modern research has established that “some of the CODIS loci have associations with identifiable serious medical conditions,” making the scientific evidence “sufficient to overcome the previously held belief[s]” about the innocuous nature of the CODIS loci.

Emphasizing this finding that richly information-laden STR profiles reside in identification databases, the court proceeded to strike down “Vermont’s new pre-conviction DNA testing requirement … that requires submission of a DNA sample from a ‘person for whom the court has determined at arraignment there is probable cause that the person has committed a felony … .'” In an atypical opinion, the court applied a “special needs” balancing test, placed the burden of proof on the state, and held that this law violates the state constitution.

A major theme in Judge Arms’ discussion of human genetics is that there has been a revolution in our understanding of what used to be called “junk DNA.” Even though the CODIS loci originally were described as “junk” in “good faith,” that understanding was wrong–we now know that even DNA that does not code for proteins is biologically important.1 Other judges, advocacy groups, and at least one law professor have jumped from the discovery that the triplet code for proteins is not the sole message inscribed in DNA to the conclusion that all the CODIS loci may well convey significant information about disease states or propensities.

There are a couple of problems with this reasoning. All that we actually know is that some non-protein-coding DNA regulates gene expression. Scientists do not believe that all non-protein-coding sequences are regulatory. In particular, whether noncoding, nontranscribed, and largely nonconserved sequences are part of a regulatory system (even if their presence might have some function) is far from established.2 The opinion cites an essay I wrote making this point [3] but then ignores its content. It quotes the legal treatise, Modern Scientific Evidence, for the view that “while it is generally agreed that no single loci [sic] contains a gene that definitively determines any discernible characteristic of significance, there are nonetheless indications that they may play a role in some sensitive matters, and continued debates about their importance.” Before Abernathy, it appeared that the “continued debates” ended five years ago with agreement on what already was known — that even if the loci do not play a functional role, they might, like certain fingerprint patterns or blood types, have some statistical associations with diseases.3

Venturing beyond inconclusive generalities like these, Abernathy refers to the biomedical literature on five loci and to a testifying expert’s characterization of the literature (with no specific references) on another locus. The opinion does not give the magnitude of any putative association, let alone any measure of predictive utility.4 It uses the following phrases: “a fairly large effect size,” “a modest association,” “not the most strongly associated,” “small but … not zero,”5 and “cannot find that this marker has no association.” It does not provide measures of the uncertainty in these estimates. Finally, the opinion does not discuss the extent to which the studies said to prove the associations have been replicated.6

Of course, few judges could confidently review the flood of studies on human genetics. Unlike some previous opinions and law review articles, however, this opinion does not rely entirely or largely on newspaper headlines and stories about “junk DNA.” Here, the iconoclastic findings came after an evidentiary hearing. But, as has happened before with DNA evidence [8], the evidentiary hearing was one-sided. The defendants presented the testimony of Professor Gregory Wray of Duke University, a specialist in genetics and evolutionary biology, and the state did not present an expert in medical genetics or genomics to counter his testimony. Although Professor Wray reviewed the biomedical literature before he testified, the defense submitted no written report, and the state rather than the defense introduced the papers cited in the opinion as exhibits. Scanning the testimony, it seems to me that Dr. Wray never was asked a series of critical questions:

  1. Is it generally accepted that the associations he pointed to apply to the population of individuals whose DNA is placed in law enforcement databanks?
  2. Assuming that they do apply to that population, what is the positive and negative predictive value of any inference about disease status or propensity derived from these particular CODIS alleles?
  3. How would the predictive or diagnostic disease-related information in a state DNA database compare to that of (a) color photographs, (b) fingerprints, (c) blood types used in conventional serology, and (d) the HLA-A and HLA-B haplotypes that used to be a mainstay of parentage testing?
  4. Are the CODIS genotypes likely to be substantially more predictive in the future?

Until these questions are answered, there is reason to ask whether the trial court’s findings fairly represent the status quo or instead are grim predictions of what could come to pass.


1. For a short audio clip reporting on the revolutionary discoveries, click on Joe Palca, Don’t Throw It Out: ‘Junk DNA’ Essential In Evolution, All Things Considered, Aug. 19, 2011 (with a sound bite from Professor Gregory Wray, among other interviewees).

2. According to Judge Arms, “[t]he term ‘junk DNA’ was coined in the early 1980s.” In fact, the phrase normally is attributed to Susumu Ohno, who used it in the title of a 1972 paper [2]. Ohno did not reason that “we don’t know what noncoding DNA does, therefore, it is useless junk.” Indeed, he proposed that the duplication and inactivation of genes produce non-protein-coding DNA (now designated pseudogenes) that might have a function. A video introducing Ohno and reading an excerpt from the paper about the role of the noncoding sequences as “spacers” with evolutionary importance can be found at http://www.youtube.com/watch?v=nomI35DJB40&noredirect=1. Since 1972, other possible functions for noncoding DNA have been proposed. Some functions imply that the sequences should be conserved as one species evolves into another. Others, such as Ohno’s suggestion that noncoding sequences act as buffers between genes, do not.

3. See [5, p. 228] (referring to “a brief debate in the legal literature” necessitated by “a misunderstanding by Simon Cole over some of the things I [John Butler] had written in a review article on STR markers” and emphasizing that “STR markers used for human identity testing do not predict disease.”). One source of confusion, which also infects the Abernathy opinion, is the thought that a statistical association between a locus and a disease detected in a family study in, say, Northern India, establishes that the same association exists throughout the population in the United States.

4. Even a strong association (large relative risk) would not make for a useful predictive test if the prevalence of the condition is very small. See [3].

5. The sentence “[t]he relative risk of developing schizophrenia associated with this marker is small but it is not zero” is technically flawed. A relative risk of 1 would express a 0 correlation.

6. Replication is always important, and the problem of false positives is especially acute with genome-wide association studies. See, e.g., [6, 7].
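The points in notes 4 and 5 can be made concrete with a short calculation. The sketch below is purely illustrative: the prevalence, relative risk, and carrier frequency are hypothetical numbers chosen for the example, not figures from the Abernathy record or from any study of a CODIS locus. It splits an overall disease prevalence into the risk for carriers and non-carriers of an allele, using the identity prevalence = f × (RR × r0) + (1 − f) × r0, where f is the carrier frequency, RR the relative risk, and r0 the risk among non-carriers.

```python
def risk_by_carrier_status(prevalence, relative_risk, carrier_freq):
    """Return (non-carrier risk, carrier risk) implied by the inputs.

    Solves prevalence = carrier_freq * (relative_risk * r0)
                        + (1 - carrier_freq) * r0   for r0.
    """
    r0 = prevalence / (carrier_freq * relative_risk + (1.0 - carrier_freq))
    r1 = relative_risk * r0
    return r0, r1


# Hypothetical inputs: a condition affecting 1 person in 1,000, an allele
# carried by 10% of the population, and a "large" relative risk of 5.
noncarrier_risk, carrier_risk = risk_by_carrier_status(0.001, 5.0, 0.10)
print(f"Risk for carriers: {carrier_risk:.4%}")
print(f"Risk for non-carriers: {noncarrier_risk:.4%}")

# With a relative risk of 1 the allele carries no information at all:
# carriers and non-carriers both sit at the population prevalence.
null_noncarrier, null_carrier = risk_by_carrier_status(0.001, 1.0, 0.10)
```

On these made-up numbers, a carrier of the allele still has only about a 0.36% chance of having the condition, so a database custodian who "predicted" disease from the allele would be wrong more than 99% of the time. The null case is a relative risk of 1, not 0, which is why "small but not zero" is the wrong benchmark.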


1. State v. Abernathy, No. 3599-9-11 (Vt. Super. Ct. June 1, 2012).

2. S. Ohno, So Much “Junk” DNA in our Genome, 23 Brookhaven Symp. Biol. 366 (1972) (also published in Evolution of Genetic Systems 366 (H.H. Smith ed. 1972)).

3. David H. Kaye, Please, Let’s Bury the Junk: The CODIS Loci and the Revelation of Private Information, 102 Nw. U. L. Rev. Colloquy 70 (2007).

4. David H. Kaye, Mopping Up After Coming Clean About “Junk DNA”, Nov. 23, 2007, available at http://ssrn.com/abstract=1032094.

5. John M. Butler, Advanced Topics in Forensic DNA Typing: Methodology (2012).

6. D.J. Hunter & P. Kraft, Drinking from the Fire Hose–Statistical Issues in Genomewide Association Studies, 357 N. Engl. J. Med. 436 (2007).

7. Thomas A. Pearson & Teri A. Manolio, How to Interpret a Genome-wide Association Study, 299 J. Am. Med. Ass’n 1335 (2008).

8. David H. Kaye, The Double Helix and the Law of Evidence (2010).

Cross-posted to Forensic Science, Statistics, and the Law.