Earlier today, I introduced the concepts and terms required to ascertain whether the estimated proportion of the genome that encodes the structure of proteins or regulates gene expression has jumped from 5 or 10% to 80%. I now focus on the possible meanings of “functional” to see whether the ENCODE papers state or imply any such seismic change. It appears that they do not.
“Functional” is an adjective, and Alice learned from Humpty Dumpty that adjectives are malleable:
“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean–neither more nor less.”
“The question is,” said Alice, “whether you can make words mean so many different things.”
“The question is,” said Humpty Dumpty, “which is to be master–that’s all.”
Alice was too much puzzled to say anything, so after a minute Humpty Dumpty began again. “They’ve a temper, some of them–particularly verbs, they’re the proudest–adjectives you can do anything with, but not verbs–however, I can manage the whole lot! Impenetrability! That’s what I say!”
Like Humpty, who was redefining the word “glory,” the ENCODE authors recognized that “functional” can have many meanings. As Ewan Birney later explained:
Like many English language words, “functional” is a very useful but context-dependent word. Does a “functional element” in the genome mean something that changes a biochemical property of the cell (i.e., if the sequence was not here, the biochemistry would be different) or is it something that changes a phenotypically observable trait that affects the whole organism?1/
Still other possibilities exist. For example, the first paper to use the adjective “junk” for noncoding DNA noted that even debris accumulated in the course of evolution or introduced from viral infections could have a function simply by creating spaces between genes.2/ The pieces of dead wood that are joined together to form the hull of a row boat have a function–they exclude the water from the vessel to keep it afloat. This does not mean that the detailed structure of the planks–the precise width of each plank or the number of ridges on its surface–affects its functionality. And, just as something can be inactive and functional, so too something can be alive with activity and yet be nonfunctional.
ENCODE uses biochemical activity–the notion that “the biochemistry would be different”–as a synonym for functional. Here is the definition of “functional” in the top-level paper:
Operationally, we define a functional element as a discrete genome segment that encodes a defined product (for example, protein or non-coding RNA) or displays a reproducible biochemical signature (for example, protein binding, or a specific chromatin structure).3/
This definition may be useful for the purpose of describing the size of ENCODE’s catalog of elements for later study, but it contrasts sharply with the notion of functional as affecting a nontrivial phenotype. The ENCODE papers show that 80% of the genome displays signs of certain types of biochemical activity–even though the activity may be insignificant, pointless, or unnecessary. This 80% includes all of the introns, for they are active in the production of pre-mRNA transcripts. But this hardly means that they are regulatory or otherwise functional.4/ Indeed, if one carries the ENCODE definition to its logical extreme, 100% of the genome is functional–for all of it participates in at least one biochemical process–DNA replication.
That the ENCODE project would not adopt the most extreme biochemical definition is understandable–that definition would be useless. But the ENCODE definition is still grossly overinclusive from the standpoint of evolutionary biology. From that perspective, most estimates of the proportion of “functional” DNA are well under 80%. Various biologists or related specialists have provided varying guesstimates:
- Under 50%: “About 1% … is coding. Something like 1-4% is currently expected to be regulatory noncoding DNA … . About 40-50% of it is derived from transposable elements, and thus affirmatively already annotated as “junk” in the colloquial sense that transposons have their own purpose (and their own biochemical functions and replicative mechanisms), like the spam in your email. And there’s some overlap: some mobile-element DNA has been co-opted as coding or regulatory DNA, for example. [¶] … Transposon-derived sequence decays rapidly, by mutation, so it’s certain that there’s some fraction of transposon-derived sequence we just aren’t recognizing with current computational methods, so the 40-50% number must be an underestimate. So most reasonable people (ok, I) would say at this point that the human genome is mostly junk (“mostly” as in, somewhere north of 50%).”5/
- 40%: “ENCODE biologist John Stamatoyannopoulos … said … that some of the activity measured in their tests does involve human genes and contributes something to our human physiology. He did admit that the press conference mislead people by claiming that 80% of our genome was essential and useful. He puts that number at 40%.”6/
- 20%: “[U]sing very strict, classical definitions of “functional” [to refer only to] places where we are very confident that there is a specific DNA:protein contact, such as a transcription factor binding site to the actual bases–we see a cumulative occupation of 8% of the genome. With the exons (which most people would always classify as “functional” by intuition) that number goes up to 9%. … [¶] In addition, in this phase of ENCODE we did [not] sample … completely in terms of cell types or transcription factors. [W]e’ve seen [at most] around 50% of the elements. … A conservative estimate of our expected coverage of exons + specific DNA:protein contacts gives us 18%, easily further justified (given our [limited] sampling) to 20%.”7/
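The arithmetic behind the last of these estimates can be made explicit with a minimal Python sketch. The percentages are the ones Birney reports; the rule that coverage scales linearly with the fraction of elements sampled is my reading of his reasoning, not a formula he states, and the names used here are purely illustrative.

```python
# Figures quoted from Birney's blog post (note 7).
observed_contacts = 0.08      # specific DNA:protein contacts
observed_with_exons = 0.09    # contacts plus exons
sampled_fraction = 0.50       # "around 50% of the elements" seen so far


def extrapolate(observed: float, sampled: float) -> float:
    """Scale an observed coverage up to full sampling, capped at 100%.

    ASSUMPTION: coverage grows in proportion to the share of elements
    sampled. This is an illustrative simplification.
    """
    return min(observed / sampled, 1.0)


estimate = extrapolate(observed_with_exons, sampled_fraction)
print(f"extrapolated coverage: {estimate:.0%}")
```

On this reading, doubling the observed 9% yields the 18% that Birney then rounds up to 20%.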
So why did the ENCODErs opt for the broadest arguable definition of “functional”? Birney’s answer is that it describes a quantity that the project could measure; that the larger number underscores that a lot is happening in the genome; that it would have confused readers to receive a range of numbers; and that the smaller number would not have counted the efforts of all the researchers.
Whether these are very satisfactory reasons for trumpeting a widely misunderstood number is a matter that biologists can debate. All I can say is that (1) I have been unable to extract a clear number–whatever one should make of it–for a percentage of the genome that constitutes the regulatory elements–the promoters, enhancers, silencers, ncRNA “genes,” and so on; (2) this number is almost surely less than the 80% figure that, at first glance, one might have thought ENCODE was reporting; and (3) “functional element” as defined by the ENCODE Project is not a term that has clear or direct implications for claims of the law enforcement community that the loci used in forensic identification are not coding and therefore not informative.
Of course, none of this means that the description of the information content of the CODIS STRs traditionally presented by law enforcement authorities is correct. It simply means that even after this phase of ENCODE, there are still a huge number of base pairs that might or might not be regulatory or influence regulation and, hence, gene expression. The CODIS STRs might or might not be among them. Published reports suggest that they are not,8/ but the argument that a DNA sequence conveys zero information about phenotype just because it is noncoding (and nonregulatory) is flawed. It overlooks the possibility of a correlation between the nonfunctional sequence and a phenotype (because the sequence sits next to an exon or a regulatory sequence).9/ Again, however, the published literature reviewing the CODIS STRs does not reveal any population-wide correlations that permit valid and strong inferences about disease status or propensity or other socially significant phenotypes.10/
Will this situation change? A thoughtful answer would take up a lot of space.11/ For now, I’ll just repeat the aphorism attributed to Yogi Berra, Niels Bohr, and Storm P: “It’s hard to make predictions, especially about the future.”
1. Ewan Birney, ENCODE: My Own Thoughts, Ewan’s Blog: Bioinformatician at Large, Sept. 5, 2012, http://genomeinformatician.blogspot.co.uk/2012/09/encode-my-own-thoughts.html.
2. David E. Comings, The Structure and Function of Chromatin, in 3 Advances in Human Genetics 237, 316 (H. Harris & K. Hirschhorn eds. 1972) (“Large spaces between genes may be a contributing factor to the observation that most recombination in eukaryotes is inter- rather than intragenic. Furthermore, if recombination tended to be sloppy with most mutational errors occurring in the process, it would [be] an obvious advantage to have it occur in intergenic junk.”). For more discussion of this paper, see T. Ryan Gregory, ENCODE (2012) vs. Comings (1972), Sept. 7, 2012, http://www.genomicron.evolverzone.com/2012/09/encode-2012-vs-comings-1972/.
3. Ian Dunham et al., An Integrated Encyclopedia of DNA Elements in the Human Genome, 489 Nature 57 (2012).
4. These regions do contain some RNA-coding sequences, and those small parts could be doing something interesting (producing RNAs that are regulatory or that defend against infection by viral DNA, for example), but this kind of activity does not exist in the bulk of the introns that are, under the ENCODE definition, 100% functional.
5. Sean Eddy, ENCODE Says What?, Sept. 8, 2012, http://selab.janelia.org/people/eddys/blog/?p=683. He adds that:
[A]s far as questions of “junk DNA” are concerned, ENCODE’s definition isn’t relevant at all. The “junk DNA” question is about how much DNA has essentially no direct impact on the organism’s phenotype–roughly, what DNA could I remove (if I had the technology) and still get the same organism. Are transposable elements transcribed as RNA? Do they bind to DNA-binding proteins? Is their chromatin marked? Yes, yes, and yes, of course they are–because at least at one point in their history, transposons are “alive” for themselves (they have genes, they replicate), and even when they die, they’ve still landed in and around genes that are transcribed and regulated, and the transcription system runs right through them.
6. Faye Flam, Skeptical Takes on Elevation of Junk DNA and Other Claims from ENCODE Project, Sept. 12, 2012, http://ksj.mit.edu/tracker/2012/09/skeptical-takes-elevation-junk-dna-and-o. Stamatoyannopoulos added that:
What the ENCODE papers … have to say about transposons is incredibly interesting. Essentially, large numbers of these elements come alive in an incredibly cell-specific fashion, and this activity is closely synchronized with cohorts of nearby regulatory DNA regions that are not in transposons, and with the activity of the genes that those regulatory elements control. All of which points squarely to the conclusion that such transposons have been co-opted for the regulation of human genes — that they have become regulatory DNA. This is the rule, not the exception.
7. Ewan Birney, ENCODE: My Own Thoughts, Ewan’s Blog: Bioinformatician at Large, Sept. 5, 2012, http://genomeinformatician.blogspot.co.uk/2012/09/encode-my-own-thoughts.html.
8. E.g., Sara H. Katsanis & Jennifer K. Wagner, Characterization of the Standard and Recommended CODIS Markers, J. Forensic Sci. (2012).
9. E.g., David H. Kaye, Two Fallacies About DNA Databanks for Law Enforcement, 67 Brook. L. Rev. 179 (2001).
10. E.g., Sara H. Katsanis & Jennifer K. Wagner, Characterization of the Standard and Recommended CODIS Markers, J. Forensic Sci. (2012).
11. For my earlier, and possibly dated, effort to evaluate the likelihood that the CODIS loci someday will prove to be powerfully predictive or diagnostic, see David H. Kaye, Please, Let’s Bury the Junk: The CODIS Loci and the Revelation of Private Information, 102 Nw. U. L. Rev. Colloquy 70 (2007), and Mopping Up After Coming Clean About “Junk DNA”, Nov. 23, 2007.
Last week I noted some of the hyperbolic headlines accompanying the coordinated publication of a large number of datasets from the ENCODE Project. The abstract of the top-level paper begins as follows:
The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions.1/
Hoping to decipher these sentences, I have been reading about gene regulation. This modest effort stems from more than academic curiosity. If the popular and even some of the scientific press is to be believed, ENCODE has exorcized “junk DNA” from the body of scientific knowledge.2/ The bright light suddenly shining on the “dark matter” of the genome (to introduce another sloppy metaphor)3/ raises a giant question mark for the criminal justice system. Law enforcement authorities have always insisted that the snippets of DNA used to generate DNA identification profiles are just nonfunctional “junk.”4/ Now, according to New York Times science correspondent Gina Kolata,
As scientists delved into the “junk” – parts of the DNA that are not actual genes containing instructions for proteins – they discovered a complex system that controls genes. At least 80 percent of this DNA is active and needed. … [¶] … The thought before the start of the project, said Thomas Gingeras, an Encode researcher from Cold Spring Harbor Laboratory, was that only 5 to 10 percent of the DNA in a human being was actually being used.5/
This juxtaposition of percentages suggests that the scientific community has shifted from the view that “only 5 to 10 percent” of the genome is functional (“needed” for the organism to function normally) to a sudden realization that 80% falls into this category.
But the more I read, the clearer it became that this description of a sudden phase transition in science is wildly inaccurate. Johns Hopkins biostatistician Steven Salzberg, in a provocative Simply Statistics podcast interview, describes the 80% figure touted in the ENCODE paper as irresponsible.6/ University of Toronto biochemist Laurence Moran saw it as a repeat of a similar, problematic performance five years ago, at the conclusion of the pilot phase of ENCODE.7/ Responding to criticism, ENCODE Project leader Ewan Birney explained the new knowledge this way:
After all, 60% of the genome with the new detailed manually reviewed (GenCode) annotation is either exonic or intronic, and a number of our assays (such as PolyA- RNA, and H3K36me3/H3K79me2) are expected to mark all active transcription. So seeing an additional 20% over this expected 60% is not so surprising.8/
“Not so surprising”? A whopping 60%–not a minor 5 or 10%–was already estimated to be “active”? What is going on here?
The answer lies in the definition of some key terms (like exons, introns, and transcription) and requires a rudimentary understanding of the fundamentals of gene expression and its regulation in human beings. This posting presents the essential terminology and concepts. A sequel will apply them to explain what ENCODE’s “assign[ing] biochemical functions for 80% of the genome” means. Anyone who knows what RNA transcripts and transcription factors do can skip this first part (or can read it to let me know of my inaccuracies).
To avoid suspense, I shall lay out my conclusions here and now: (1) if ENCODE gives a clear number for a percentage of the genome that regulates genes–the promoters, enhancers, silencers, ncRNA “genes,” and so on–I have yet to find it; (2) this number is almost surely less than the 80% figure reported for functionality; and (3) “functional element” as defined by the ENCODE Project is not a term that has clear or direct implications for claims of the law enforcement community that the loci used in forensic identification are not coding and therefore not informative. Those claims of zero information are somewhat exaggerated, but that is another story. For now, I merely describe some basics of gene expression and regulation.
Genes make proteins. But how? There are three big steps (with many activities within each step): transcription; post-transcription modification and transportation; and translation. All involve RNA, a single-stranded molecule related to DNA, and proteins. The basic picture is
- Transcription to precursor messenger RNA: DNA + proteins –> pre-mRNA (in nucleus)
- Post-transcriptional modification and transportation: pre-mRNA + proteins and RNAs –> mature mRNA (in cytoplasm)
- Translation to protein: mRNA + tRNA and proteins –> expressed protein (in cytoplasm)
In the first big step, the base pairs of the gene are transcribed jot-for-jot into an RNA molecule (precursor messenger RNA, or pre-mRNA). In the second major step, the transcript is modified at its ends and edited to remove parts that do not code for the protein that will be made (splicing), and the mature messenger RNA (mRNA) is moved outside the nucleus. In the third phase, molecules of another type of RNA (transfer RNA, or tRNA) deliver individual amino acids, which are stitched together in the order dictated by the mRNA transcript to form a protein, thereby translating the DNA sequence mirrored in the mRNA into the amino-acid order of the protein. Translation occurs on a kind of microscopic workbench (a ribosome) made of proteins and yet another RNA (ribosomal RNA, or rRNA).
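For readers who think more easily in code, the three steps can be mimicked with a toy Python sketch. Everything in it is invented for illustration: the 26-base “gene,” the exon coordinates, and the five-entry excerpt of the codon table (the codon assignments themselves are the standard ones). Real transcription, splicing, and translation involve far more machinery than string operations.

```python
# Toy gene: exon 1 (ATGGCC), an intron (GTAAGTTTCAG), exon 2 (TGGAAATAA).
# HYPOTHETICAL sequence and coordinates, chosen only to keep the example short.
GENE = "ATGGCCGTAAGTTTCAGTGGAAATAA"
EXONS = [(0, 6), (17, 26)]  # half-open (start, end) positions of the exons

# Small excerpt of the standard genetic code; None marks a stop codon.
CODON_TABLE = {"AUG": "Met", "GCC": "Ala", "UGG": "Trp", "AAA": "Lys", "UAA": None}


def transcribe(dna: str) -> str:
    """Step 1: copy the coding strand into pre-mRNA (T becomes U)."""
    return dna.replace("T", "U")


def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Step 2: drop the introns by joining only the exon segments."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna: str) -> list[str]:
    """Step 3: read codons three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid is None:
            break
        protein.append(amino_acid)
    return protein


pre_mrna = transcribe(GENE)
mature_mrna = splice(pre_mrna, EXONS)
protein = translate(mature_mrna)
print(pre_mrna, mature_mrna, "-".join(protein))
```

Note that the intron is present in the pre-mRNA, so it is “transcribed,” yet it contributes nothing to the protein.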
For all this to happen, the DNA, which lies tightly coiled in the chromosomes (in a protein-DNA matrix known as chromatin), must open up for transcription to occur. Thus, changes in the chromatin regulate transcription, and these changes can be brought about in a number of ways. Transcription factors (specialized proteins) bind to the DNA. The bound transcription factors then recruit an enzyme (RNA polymerase) that produces RNA. This occurs within a region of DNA, known as a promoter, near the start of the protein-coding DNA (the structural gene). The level of transcription is influenced by activator or repressor proteins that bind to still other small regions (enhancers and silencers, respectively) that also lie outside the structural gene. In short, transcription factors and chemical interactions that open or close the chromatin housing the DNA regulate the first step in the DNA-to-protein process.
In the past decade, other mechanisms of regulation or control of gene expression have been discovered. Many DNA sequences are not transcribed into messenger RNA, but they are transcribed into a variety of other RNAs. These non-protein-coding DNA sequences can be thought of as genes for RNA. Courting confusion, they usually are called “noncoding” (ncDNA)–because they do not code for protein–but they certainly code for RNAs that are crucial to translation–rRNA and tRNA–and for other RNAs that affect transcription, translation, and DNA replication. So it turns out that the genome is abuzz with transcription-to-RNA activity and other events that feed into the expression of the (protein-)coding DNA.
Yet, this hardly means that every biochemical event along the DNA is functionally important. Some, perhaps many, non-mRNA transcripts are just “noise.” They may float around for a while, but they may not do anything except wither away. In addition, large segments of the DNA transcribed in the course of making mRNA appear in the initial transcript (the pre-mRNA) but never make it into mature mRNA. These unused parts of the pre-mRNA transcripts correspond to long stretches of DNA, known as introns, that interrupt the smaller coding parts–the exons–that are translated into proteins. The initially transcribed intronic parts are removed from the pre-mRNA in a process called RNA splicing. Most of the RNA from introns probably just dissipates.9/
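A back-of-the-envelope Python sketch shows how far apart the “transcribed” and “coding” fractions of a stretch of DNA can be. The region length, gene span, and exon positions are all made-up numbers, not ENCODE data; the point is only the size of the gap.

```python
# HYPOTHETICAL 10,000-bp stretch of genome containing a single gene.
REGION_LENGTH = 10_000
GENE_SPAN = (2_000, 7_000)  # everything transcribed into pre-mRNA
EXONS = [(2_000, 2_150), (4_500, 4_620), (6_900, 7_000)]  # kept in mature mRNA


def covered(intervals: list[tuple[int, int]]) -> int:
    """Total bases covered by non-overlapping half-open intervals."""
    return sum(end - start for start, end in intervals)


transcribed_fraction = covered([GENE_SPAN]) / REGION_LENGTH
exonic_fraction = covered(EXONS) / REGION_LENGTH

print(f"transcribed: {transcribed_fraction:.1%}, exonic: {exonic_fraction:.1%}")
```

With these invented coordinates, half of the region counts as transcribed while under 4% ends up in mature mRNA; the difference is intronic sequence that is copied and then discarded.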
All these terms are a mouthful, but armed with this basic understanding of genes, RNA, and proteins, we can see why the 80% figure does not mean what one might think. We shall also see that the estimated proportion of the genome that encodes the structure of proteins or regulates gene expression has not jumped from 5 or 10% to 80%.
1. Ian Dunham et al., An Integrated Encyclopedia of DNA Elements in the Human Genome, 489 Nature 57 (2012).
2. E.g., Elizabeth Pennisi, ENCODE Project Writes Eulogy for Junk DNA, 337 Science 1159 (2012).
3. E.g., Gina Kolata, Bits of Mystery DNA, Far From ‘Junk,’ Play Crucial Role, N.Y. Times, Sept. 5, 2012. In one respect, the “dark matter” metaphor misrepresents dark matter. The presence of dark matter is inferred from its gravitational effects on visible matter. The presence of noncoding DNA is known from experiments that detect and characterize it just as they do coding DNA. Perhaps the metaphor means that the sequence of “dark matter” DNA cannot be deduced from the structure of a protein made in a cell. This, however, is like saying that dark matter is matter that cannot be seen with the naked eye. And that is not what astronomers mean by dark matter.
4. E.g., House Committee on the Judiciary, Report on the DNA Analysis Backlog Elimination Act of 2000, 106th Cong., 2d Sess., H.R. Rep. No. 106-900(1), at 27 (“the genetic markers used for forensic DNA testing … show only the configuration of DNA at selected ‘junk sites’ which do not control or influence the expression of any trait.”); New York State Law Enforcement Council, Legislative Priorities 2012: DNA at Arrest, at 5, http://nyslec.org/pdfs/2012/1_DNA_2012.pdf (“The pieces of DNA that are analyzed for the databank were specifically chosen because they are ‘junk DNA.’”).
5. Kolata, supra note 3.
6. Interview by Roger Peng with Steven Salzberg, podcast on Simply Statistics, Sept. 7, 2012, http://simplystatistics.org/post/31056769228/interview-with-steven-salzberg-about-the-encode (“Why do they feel a need to say that 80% of the genome is functional? … They know it’s not true. They shouldn’t say it. … You don’t distort the science to get into the headlines.”).
7. Laurence A. Moran, The ENCODE Data Dump and the Responsibility of Scientists, Sept. 6, 2012, http://sandwalk.blogspot.com/2012/09/the-encode-data-dump-and-responsibility_6.html (“This is, unfortunately, another case of a scientist acting irresponsibly by distorting the importance and the significance of the data.”).
8. Ewan Birney, ENCODE: My Own Thoughts, Ewan’s Blog: Bioinformatician at Large, Sept. 5, 2012, http://genomeinformatician.blogspot.co.uk/2012/09/encode-my-own-thoughts.html.
9. Post-splicing processing of a small fraction of the RNA from introns can produce noncoding RNAs that may regulate protein expression. L. Fedorova & A. Fedorov, Puzzles of the Human Genome: Why Do We Need Our Introns?, 6 Current Genomics 589, 592 (2005).
You have seen this week’s headlines:
- Bits of Mystery DNA, Far From ‘Junk,’ Play Crucial Role (New York Times)
- ‘Junk DNA’ Concept Debunked by New Analysis of Human Genome (Washington Post)
- ‘Junk DNA’ Debunked (Wall Street Journal)
- Breakthrough Study Overturns Theory of ‘Junk DNA’ in Genome (Guardian)
Or maybe you heard MSNBC report that the data from ENCODE “shows us living beyond our genes” (whatever that means), or listened to CBC intone that “‘Junk DNA’ has a purpose” (sounds divine), or saw the Independent’s mishugina announcement that “Scientists Debunk ‘Junk DNA’ Theory to Reveal Vast Majority of Human Genes Perform a Vital Function!” (as if we did not know that genes were functional and important)?
The level of hype here is phenomenal. (Some useful clarification can be found at the Nature News blog.) In the next few days, I hope to post some quick thoughts on what the ENCODE figures (like 80%) being bandied about for the “functional” or “biologically active” fraction of the human genome mean for the loci used in forensic DNA identification.
Cross-posted to Forensic Science, Statistics, and the Law
(If any readers have insights to share, send me an email at kaye at alum.mit.edu, and I’ll try to use them. I am still educating myself about some of the details of gene regulation and can use any help I can get.)