Earlier today, I introduced the concepts and terms required to ascertain whether the estimated proportion of the genome that encodes the structure of proteins or regulates gene expression has jumped from 5 or 10% to 80%. I now focus on the possible meanings of “functional” to see whether the ENCODE papers state or imply any such seismic change. It appears that they do not.
“Functional” is an adjective, and Alice learned from Humpty Dumpty that adjectives are malleable:
“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean–neither more nor less.”
“The question is,” said Alice, “whether you can make words mean so many different things.”
“The question is,” said Humpty Dumpty, “which is to be master–that’s all.”
Alice was too much puzzled to say anything, so after a minute Humpty Dumpty began again. “They’ve a temper, some of them–particularly verbs, they’re the proudest–adjectives you can do anything with, but not verbs–however, I can manage the whole lot! Impenetrability! That’s what I say!”
Like Humpty, who was redefining the word “glory,” the ENCODE authors recognized that “functional” can have many meanings. As Ewan Birney later explained:
Like many English language words, “functional” is a very useful but context-dependent word. Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism?1/
Still other possibilities exist. For example, the first paper to use the adjective “junk” for noncoding DNA noted that even debris accumulated in the course of evolution or introduced from viral infections could have a function simply by creating spaces between genes.2/ The pieces of dead wood that are joined together to form the hull of a row boat have a function–they exclude the water from the vessel to keep it afloat. This does not mean that the detailed structure of the planks–the precise width of each plank or the number of ridges on its surface–affects its functionality. And, just as something can be inactive and functional, so too something can be alive with activity and yet be nonfunctional.
ENCODE uses biochemical activity–the notion that “the biochemistry would be different”–as a synonym for functional. Here is the definition of “functional” in the top-level paper:
Operationally, we define a functional element as a discrete genome segment that encodes a defined product (for example, protein or non-coding RNA) or displays a reproducible biochemical signature (for example, protein binding, or a specific chromatin structure).3/
This definition may be useful for the purpose of describing the size of ENCODE’s catalog of elements for later study, but it contrasts sharply with the notion of functional as affecting a nontrivial phenotype. The ENCODE papers show that 80% of the genome displays signs of certain types of biochemical activity–even though the activity may be insignificant, pointless, or unnecessary. This 80% includes all of the introns, for they are active in the production of pre-mRNA transcripts. But this hardly means that they are regulatory or otherwise functional.4/ Indeed, if one carries the ENCODE definition to its logical extreme, 100% of the genome is functional–for all of it participates in at least one biochemical process–DNA replication.
That the ENCODE project would not adopt the most extreme biochemical definition is understandable–that definition would be useless. But the ENCODE definition is still grossly overinclusive from the standpoint of evolutionary biology. From that perspective, most estimates of the proportion of “functional” DNA are well under 80%. Biologists and related specialists have offered varying guesstimates:
- Under 50%: “About 1% … is coding. Something like 1-4% is currently expected to be regulatory noncoding DNA … . About 40-50% of it is derived from transposable elements, and thus affirmatively already annotated as “junk” in the colloquial sense that transposons have their own purpose (and their own biochemical functions and replicative mechanisms), like the spam in your email. And there’s some overlap: some mobile-element DNA has been co-opted as coding or regulatory DNA, for example. [¶] … Transposon-derived sequence decays rapidly, by mutation, so it’s certain that there’s some fraction of transposon-derived sequence we just aren’t recognizing with current computational methods, so the 40-50% number must be an underestimate. So most reasonable people (ok, I) would say at this point that the human genome is mostly junk (“mostly” as in, somewhere north of 50%).”5/
- 40%: “ENCODE biologist John Stamatoyannopoulos … said … that some of the activity measured in their tests does involve human genes and contributes something to our human physiology. He did admit that the press conference mislead people by claiming that 80% of our genome was essential and useful. He puts that number at 40%.”6/
- 20%: “[U]sing very strict, classical definitions of “functional” [to refer only to] places where we are very confident that there is a specific DNA:protein contact, such as a transcription factor binding site to the actual bases–we see a cumulative occupation of 8% of the genome. With the exons (which most people would always classify as “functional” by intuition) that number goes up to 9%. … [¶] In addition, in this phase of ENCODE we did [not] sample … completely in terms of cell types or transcription factors. [W]e’ve seen [at most] around 50% of the elements. … A conservative estimate of our expected coverage of exons + specific DNA:protein contacts gives us 18%, easily further justified (given our [limited] sampling) to 20%.”7/
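The arithmetic behind the last estimate can be reconstructed in a few lines. This is only a sketch of one plausible reading of Birney's reasoning, not his actual method: it assumes that the elements not yet sampled would cover the genome at the same rate as those already seen, so that observed coverage scales linearly with the share of elements sampled. The function name and variable names are hypothetical.

```python
def extrapolated_coverage(observed_fraction: float, sampled_share: float) -> float:
    """Scale observed genome coverage up to the full set of elements.

    Assumption (not stated in the source): unsampled elements cover the
    genome at the same rate as sampled ones, so coverage scales linearly.
    The result is capped at 100% of the genome.
    """
    if not 0 < sampled_share <= 1:
        raise ValueError("sampled_share must be in (0, 1]")
    return min(observed_fraction / sampled_share, 1.0)


# Figures quoted from Birney: 8% of the genome in specific DNA:protein
# contacts, 9% once exons are added, with about 50% of elements seen so far.
observed_with_exons = 0.09
share_of_elements_seen = 0.50

estimate = extrapolated_coverage(observed_with_exons, share_of_elements_seen)
print(f"Extrapolated coverage: {estimate:.0%}")
```

Under these assumptions the 9% observed figure doubles to 18%, which Birney then rounds up to 20% in light of the limited sampling. If fewer than half of the elements have in fact been seen, the same calculation yields a larger number.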
So why did the ENCODErs opt for the broadest arguable definition of “functional”? Birney’s answer is that it describes a quantity that the project could measure; that the larger number underscores that a lot is happening in the genome; that it would have confused readers to receive a range of numbers; and that the smaller number would not have counted the efforts of all the researchers.
Whether these are very satisfactory reasons for trumpeting a widely misunderstood number is a matter that biologists can debate. All I can say is that (1) I have been unable to extract a clear number–whatever one should make of it–for a percentage of the genome that constitutes the regulatory elements–the promoters, enhancers, silencers, ncRNA “genes,” and so on; (2) this number is almost surely less than the 80% figure that, at first glance, one might have thought ENCODE was reporting; and (3) “functional element” as defined by the ENCODE Project is not a term that has clear or direct implications for claims of the law enforcement community that the loci used in forensic identification are not coding and therefore not informative.
Of course, none of this means that the description of the information content of the CODIS STRs traditionally presented by law enforcement authorities is correct. It simply means that even after this phase of ENCODE, there are still a huge number of base pairs that might or might not be regulatory or influence regulation and, hence, gene expression. The CODIS STRs might or might not be among them. Published reports suggest that they are not,8/ but it is a mistake to reason that because a DNA sequence is noncoding (and nonregulatory), it conveys zero information about phenotype. That reasoning overlooks the possibility of a correlation between the nonfunctional sequence and a phenotype (because the sequence sits next to an exon or a regulatory sequence).9/ Again, however, the published literature reviewing the CODIS STRs does not reveal any population-wide correlations that permit valid and strong inferences about disease status or propensity or other socially significant phenotypes.10/
Will this situation change? A thoughtful answer would take up a lot of space.11/ For now, I’ll just repeat the aphorism attributed to Yogi Berra, Niels Bohr, and Storm P: “It’s hard to make predictions, especially about the future.”
1. Ewan Birney, ENCODE: My Own Thoughts, Ewan’s Blog: Bioinformatician at Large, Sept. 5, 2012, http://genomeinformatician.blogspot.co.uk/2012/09/encode-my-own-thoughts.html.
2. David E. Comings, The Structure and Function of Chromatin, in 3 Advances in Human Genetics 237, 316 (H. Harris & K. Hirschhorn eds. 1972) (“Large spaces between genes may be a contributing factor to the observation that most recombination in eukaryotes is inter- rather than intragenic. Furthermore, if recombination tended to be sloppy with most mutational errors occurring in the process, it would [be] an obvious advantage to have it occur in intergenic junk.”). For more discussion of this paper, see T. Ryan Gregory, ENCODE (2012) vs. Comings (1972), Sept. 7, 2012, http://www.genomicron.evolverzone.com/2012/09/encode-2012-vs-comings-1972/.
3. Ian Dunham et al., An Integrated Encyclopedia of DNA Elements in the Human Genome, 489 Nature 57 (2012).
4. These regions do contain some RNA-coding sequences, and those small parts could be doing something interesting (producing RNAs that are regulatory or that defend against infection by viral DNA, for example), but this kind of activity does not exist in the bulk of the introns that are, under the ENCODE definition, 100% functional.
5. Sean Eddy, ENCODE Says What?, Sept. 8, 2012, http://selab.janelia.org/people/eddys/blog/?p=683. He adds that:
[A]s far as questions of “junk DNA” are concerned, ENCODE’s definition isn’t relevant at all. The “junk DNA” question is about how much DNA has essentially no direct impact on the organism’s phenotype–roughly, what DNA could I remove (if I had the technology) and still get the same organism. Are transposable elements transcribed as RNA? Do they bind to DNA-binding proteins? Is their chromatin marked? Yes, yes, and yes, of course they are–because at least at one point in their history, transposons are “alive” for themselves (they have genes, they replicate), and even when they die, they’ve still landed in and around genes that are transcribed and regulated, and the transcription system runs right through them.
6. Faye Flam, Skeptical Takes on Elevation of Junk DNA and Other Claims from ENCODE Project, Sept. 12, 2012, http://ksj.mit.edu/tracker/2012/09/skeptical-takes-elevation-junk-dna-and-o. Stamatoyannopoulos added that:
What the ENCODE papers … have to say about transposons is incredibly interesting. Essentially, large numbers of these elements come alive in an incredibly cell-specific fashion, and this activity is closely synchronized with cohorts of nearby regulatory DNA regions that are not in transposons, and with the activity of the genes that those regulatory elements control. All of which points squarely to the conclusion that such transposons have been co-opted for the regulation of human genes — that they have become regulatory DNA. This is the rule, not the exception.
7. Ewan Birney, ENCODE: My Own Thoughts, Ewan’s Blog: Bioinformatician at Large, Sept. 5, 2012, http://genomeinformatician.blogspot.co.uk/2012/09/encode-my-own-thoughts.html.
8. E.g., Sara H. Katsanis & Jennifer K. Wagner, Characterization of the Standard and Recommended CODIS Markers, J. Forensic Sci. (2012).
9. E.g., David H. Kaye, Two Fallacies About DNA Databanks for Law Enforcement, 67 Brook. L. Rev. 179 (2001).
10. E.g., Sara H. Katsanis & Jennifer K. Wagner, Characterization of the Standard and Recommended CODIS Markers, J. Forensic Sci. (2012).
11. For my earlier, and possibly dated, effort to evaluate the likelihood that the CODIS loci someday will prove to be powerfully predictive or diagnostic, see David H. Kaye, Please, Let’s Bury the Junk: The CODIS Loci and the Revelation of Private Information, 102 Nw. U. L. Rev. Colloquy 70 (2007), and Mopping Up After Coming Clean About “Junk DNA”, Nov. 23, 2007.