Trashing Junk DNA: The Notorious 80%

Last week I noted some of the hyperbolic headlines accompanying the coordinated publication of a large number of datasets from the ENCODE Project. The abstract of the top-level paper begins as follows:

The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions.1/

Hoping to decipher these sentences, I have been reading about gene regulation. This modest effort stems from more than academic curiosity. If the popular and even some of the scientific press is to be believed, ENCODE has exorcized “junk DNA” from the body of scientific knowledge.2/ The bright light suddenly shining on the “dark matter” of the genome (to introduce another sloppy metaphor)3/ raises a giant question mark for the criminal justice system. Law enforcement authorities have always insisted that the snippets of DNA used to generate DNA identification profiles are just nonfunctional “junk.”4/ Now, according to New York Times science correspondent Gina Kolata,

As scientists delved into the “junk” — parts of the DNA that are not actual genes containing instructions for proteins — they discovered a complex system that controls genes. At least 80 percent of this DNA is active and needed. … The thought before the start of the project, said Thomas Gingeras, an Encode researcher from Cold Spring Harbor Laboratory, was that only 5 to 10 percent of the DNA in a human being was actually being used.5/

This juxtaposition of percentages suggests that the scientific community has shifted from the view that “only 5 to 10 percent” of the genome is functional (“needed” for the organism to function normally) to a sudden realization that 80% falls into this category.

But the more I read, the clearer it became that this description of a sudden phase transition in science is wildly inaccurate. Johns Hopkins biostatistician Steven Salzberg, in a provocative Simply Statistics podcast interview, describes the 80% figure touted in the ENCODE paper as irresponsible.6/ University of Toronto biochemist Laurence Moran saw it as a repeat of a similar, problematic performance five years ago, at the conclusion of the pilot phase of ENCODE.7/ Responding to criticism, ENCODE Project leader Ewan Birney explained the new knowledge this way:

After all, 60% of the genome with the new detailed manually reviewed (GenCode) annotation is either exonic or intronic, and a number of our assays (such as PolyA- RNA, and H3K36me3/H3K79me2) are expected to mark all active transcription. So seeing an additional 20% over this expected 60% is not so surprising.8/

“Not so surprising”? A whopping 60%–not a minor 5 or 10%–was already estimated to be “active”? What is going on here?

The answer lies in the definition of some key terms (like exons, introns, and transcription) and requires a rudimentary understanding of the fundamentals of gene expression and its regulation in human beings. This posting presents the essential terminology and concepts. A sequel will apply them to explain what ENCODE’s “assign[ing] biochemical functions for 80% of the genome” means. Anyone who knows what RNA transcripts and transcription factors do can skip this first part (or can read it to let me know of my inaccuracies).

To avoid suspense, I shall lay out my conclusions here and now: (1) if ENCODE gives a clear number for a percentage of the genome that regulates genes–the promoters, enhancers, silencers, ncRNA “genes,” and so on–I have yet to find it; (2) this number is almost surely less than the 80% figure reported for functionality; and (3) “functional element” as defined by the ENCODE Project is not a term that has clear or direct implications for claims of the law enforcement community that the loci used in forensic identification are not coding and therefore not informative. Those claims of zero information are somewhat exaggerated, but that is another story. For now, I merely describe some basics of gene expression and regulation.

Genes make proteins. But how? There are three big steps (with many activities within each step): transcription; post-transcription modification and transportation; and translation. All involve RNA, a single-stranded molecule related to DNA, and proteins. The basic picture is

  • Transcription to precursor messenger RNA: DNA + proteins -> pre-mRNA (in nucleus)
  • Post-transcriptional modification and transportation: pre-mRNA + proteins and RNAs -> mature mRNA (in cytoplasm)
  • Translation to protein: mRNA + tRNA and proteins -> expressed protein (in cytoplasm)

In the first big step, the base pairs of the gene are transcribed jot-for-jot into an RNA molecule (precursor messenger RNA, or pre-mRNA). In the second major step, the transcript is modified at its ends and edited to remove parts that do not code for the protein that will be made (splicing), and the mature messenger RNA (mRNA) is moved outside the nucleus. In the third phase, another type of RNA (transfer RNA, or tRNA) delivers individual amino acids, which are stitched together in the order dictated by the mRNA transcript to form a protein, thereby translating the DNA sequence mirrored in the mRNA into the amino-acid order of the protein. Translation occurs on a kind of microscopic workbench (a ribosome) made of proteins and yet another RNA (ribosomal RNA, or rRNA).
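For readers who think in code, the three steps can be caricatured in a few lines of Python. This is only a toy sketch under loud assumptions: the DNA sequence, the exon positions, and the five-entry codon table are invented for illustration (the real genetic code has 64 codons), and transcription is simplified to copying the coding strand with U in place of T. None of it models the proteins and regulatory machinery involved.

```python
# Toy model of transcription, splicing, and translation.
# All sequences and coordinates below are hypothetical, chosen for illustration.

EXON_1 = "ATGGCT"           # invented coding segment
INTRON = "GTAAGTCCCTTTAG"   # invented intron (begins GT, ends AG, as many real introns do)
EXON_2 = "AAATGGTAA"        # invented coding segment ending in a stop codon
GENE = EXON_1 + INTRON + EXON_2

# Exon positions as (start, end) pairs, end exclusive.
EXON_SPANS = [
    (0, len(EXON_1)),
    (len(EXON_1) + len(INTRON), len(GENE)),
]

# A deliberately partial codon table; the real table has 64 entries.
CODON_TABLE = {"AUG": "Met", "GCU": "Ala", "AAA": "Lys", "UGG": "Trp", "UAA": "STOP"}


def transcribe(coding_dna):
    """Step 1 (simplified): the pre-mRNA matches the coding strand, with U for T."""
    return coding_dna.replace("T", "U")


def splice(pre_mrna, exon_spans):
    """Step 2 (simplified): keep the exons, discard everything between them."""
    return "".join(pre_mrna[start:end] for start, end in exon_spans)


def translate(mrna):
    """Step 3 (simplified): read three bases at a time until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein


pre_mrna = transcribe(GENE)
mature_mrna = splice(pre_mrna, EXON_SPANS)
protein = translate(mature_mrna)

print(len(pre_mrna), len(mature_mrna))  # 29 15
print(protein)                          # ['Met', 'Ala', 'Lys', 'Trp']
```

Note that all 29 bases of the toy gene are transcribed, but only 15 survive splicing, and only 12 of those specify amino acids.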

For all this to happen, the DNA, which lies tightly coiled in the chromosomes (in a protein-DNA matrix known as chromatin), must open up for transcription to occur. Thus, changes in the chromatin regulate transcription, and these changes can be brought about in a number of ways. Transcription factors (specialized proteins) bind to the DNA. The bound transcription factors then recruit an enzyme (RNA polymerase) that produces RNA. This occurs within a region of DNA, known as a promoter, near the start of the protein-coding DNA (the structural gene). The level of transcription is influenced by activator or repressor proteins that bind to still other small regions (enhancers and silencers, respectively) that also lie outside the structural gene. In short, chemical interactions that open or close the chromatin that houses the DNA and transcription factors regulate the first step in the DNA-to-protein process.

In the past decade, other mechanisms of regulation or control of gene expression have been discovered. Many DNA sequences are not transcribed into messenger RNA, but they are transcribed into a variety of other RNAs. These non-protein-coding DNA sequences can be thought of as genes for RNA. Courting confusion, they usually are called “noncoding” (ncDNA)–because they do not code for protein–but they certainly code for RNAs that are crucial to translation–rRNA and tRNA–and for other RNAs that affect transcription, translation, and DNA replication. So it turns out that the genome is abuzz with transcription-to-RNA activity and other events that feed into the expression of the (protein-)coding DNA.

Yet, this hardly means that every biochemical event along the DNA is functionally important. Some, perhaps many, non-mRNA transcripts are just “noise.” They may float around for a while, but they may not do anything except wither away. In addition, large segments of the DNA transcribed in the course of making mRNA appear in the initial transcript (the pre-mRNA) but never make it into mature mRNA. These unused parts of the pre-mRNA transcripts correspond to long stretches of DNA, known as introns, that interrupt the smaller coding parts–the exons–that are translated into proteins. The initially transcribed intronic parts are removed from the pre-mRNA in a process called RNA splicing. Most of the RNA from introns probably just dissipates.9/
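A little arithmetic shows why "transcribed" and "coding" can be such different quantities. The sketch below uses a made-up gene with invented coordinates (the 30,000-base span and the three exon positions are assumptions, not measurements), so the resulting percentages illustrate the bookkeeping only and say nothing about the actual human genome.

```python
# Hypothetical gene: every number here is invented for illustration.
TOY_GENE_SPAN = (0, 30_000)   # the whole region copied into pre-mRNA
TOY_EXONS = [(0, 200), (10_000, 10_150), (29_700, 30_000)]  # (start, end), end exclusive


def bases_covered(intervals):
    """Total length of a list of non-overlapping (start, end) intervals."""
    return sum(end - start for start, end in intervals)


transcribed = bases_covered([TOY_GENE_SPAN])  # everything in the initial transcript
exonic = bases_covered(TOY_EXONS)             # what survives splicing
intronic = transcribed - exonic               # what is spliced out

exonic_fraction = exonic / transcribed
intronic_fraction = intronic / transcribed

print(f"transcribed: {transcribed} bases")   # transcribed: 30000 bases
print(f"exonic: {exonic_fraction:.1%}")      # exonic: 2.2%
print(f"intronic: {intronic_fraction:.1%}")  # intronic: 97.8%
```

In this toy case an assay that marks "active transcription" would flag the entire 30,000-base span, even though only a sliver of it ends up in mature mRNA.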

All these terms are a mouthful, but armed with this basic understanding of genes, RNA, and proteins, we can see why the 80% figure does not mean what one might think. We shall also see that the estimated proportion of the genome that encodes the structure of proteins or regulates gene expression has not jumped from 5 or 10% to 80%.

Notes

1. Ian Dunham et al., An Integrated Encyclopedia of DNA Elements in the Human Genome, 489 Nature 57 (2012).

2. E.g., Elizabeth Pennisi, ENCODE Project Writes Eulogy for Junk DNA, 337 Science 1159 (2012).

3. E.g., Gina Kolata, Bits of Mystery DNA, Far From ‘Junk,’ Play Crucial Role, N.Y. Times, Sept. 5, 2012. In one respect, the “dark matter” metaphor misrepresents dark matter. The presence of dark matter is inferred from its gravitational effects on visible matter. The presence of noncoding DNA is known from experiments that detect and characterize it just as they do coding DNA. Perhaps the metaphor means that the sequence of “dark matter” DNA cannot be deduced from the structure of a protein made in a cell. This, however, is like saying that dark matter is matter that cannot be seen with the naked eye. And that is not what astronomers mean by dark matter.

4. E.g., House Committee on the Judiciary, Report on the DNA Analysis Backlog Elimination Act of 2000, 106th Cong., 2d Sess., H.R. Rep. No. 106-900(1), at 27 (“the genetic markers used for forensic DNA testing … show only the configuration of DNA at selected ‘junk sites’ which do not control or influence the expression of any trait.”); New York State Law Enforcement Council, Legislative Priorities 2012: DNA at Arrest, at 5, http://nyslec.org/pdfs/2012/1_DNA_2012.pdf (“The pieces of DNA that are analyzed for the databank were specifically chosen because they are ‘junk DNA.’”).

5. Kolata, supra note 3.

6. Interview by Roger Peng with Steven Salzberg, podcast on Simply Statistics, Sept. 7, 2012, http://simplystatistics.org/post/31056769228/interview-with-steven-salzberg-about-the-encode (“Why do they feel a need to say that 80% of the genome is functional? … They know it’s not true. They shouldn’t say it. … You don’t distort the science to get into the headlines.”).

7. Laurence A. Moran, The ENCODE Data Dump and the Responsibility of Scientists, Sept. 6, 2012, http://sandwalk.blogspot.com/2012/09/the-encode-data-dump-and-responsibility_6.html (“This is, unfortunately, another case of a scientist acting irresponsibly by distorting the importance and the significance of the data.”).

8. Ewan Birney, ENCODE: My Own Thoughts, Sept. 5, 2012.

9. Post-splicing processing of a small fraction of the RNA from introns can produce noncoding RNAs that may regulate protein expression. L. Fedorova & A. Fedorov, Puzzles of the Human Genome: Why Do We Need Our Introns?, 6 Current Genomics 589, 592 (2005).

I am grateful to Eileen Kane for explaining some of the molecular biology to me. This entry is cross-posted to the Forensic Science, Statistics, and the Law Blog.