Introduction. A new type of selfish DNA element first identified in yeast in the 1970s (reviewed in Refs [54,55]), homing endonucleases have been highly successful in invading eukaryotic, bacterial, archaeal, and even phage genomes despite their simple structure: a single open reading frame. Many (but not all) homing endonucleases identified to date recognize highly specific DNA target sequences found in self-splicing introns. Others, known as inteins, are translated along with a host protein and are posttranslationally processed in a manner that frees the homing endonuclease but leaves the host protein intact. Both strategies have the advantage of limiting detrimental effects to the host cell and its genome, enabling the spread of these selfish genes. Following the identification and biochemical characterization of many homing endonucleases in the 1980s and 1990s, researchers started to repurpose these simple elements for various genome manipulation applications. For example, the introduction of the target sequence recognized by the homing endonuclease I-SceI into the Drosophila genome enabled studies of DNA break repair and homologous recombination [56–59]. Homing endonucleases were relatively unknown to the vector biology community at this time; their adaptation began when Burt suggested that the primitive form of genome invasion used by homing endonucleases may be well-suited for genetic strategies to control malaria or other vector-borne diseases. Indeed, several homing endonucleases have since been used effectively to introduce site-specific DNA breaks in An. gambiae [61,62] and to excise genes from Ae. aegypti [63,64]. Through biochemical redesign, the homing endonucleases I-CreI and I-AniI have been modified to recognize targets in the An. gambiae genome directly. Transgenic expression of wild-type I-PpoI during spermatogenesis completely sterilized An. gambiae mosquitoes; large cage trials using these transgenic sterile mosquitoes indicate effectiveness at crashing populations of this mosquito. The expression of modified versions of I-PpoI in the testes of An. gambiae has led to the development of sex-distorter strains that produce almost all males, while I-SceI was shown to successfully invade large cage populations.
How they work. The term “homing” refers to their ability as mature proteins to return to the site of their mRNA “birth” and introduce a double-stranded break on the homologous chromosome lacking the homing endonuclease gene. Unlike TEs and recombinases, homing endonucleases do not catalyze any reactions beyond DNA cleavage. This extraordinarily simple mode of action thus relies entirely on the cellular DNA repair machinery to generate a duplicate version of the homing endonuclease gene, by using the original gene as a repair template. Homing endonucleases can be categorized by their mode of catalyzing DNA cleavage into at least four completely independent families, and an extensive body of literature is available concerning the structural basis for DNA-binding specificity and double-stranded DNA break formation (reviewed in Ref. ). Of particular interest is that the target sequences for most homing endonucleases are 18 bp or longer, meaning that even in large eukaryotic genomes the probability of finding an endogenous site is very small. For example, the target sequence of I-SceI is 18 bp, and the corresponding chance of finding an exact match in a random sequence is 1 in 6.87×10^10. Given the size of the An. gambiae (2.7×10^8 bp) and Ae. aegypti (1.3×10^9 bp) genomes, the corresponding chances of randomly finding an I-SceI target site are approximately 1:250 and 1:50, respectively.
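As a rough check of the arithmetic above, the following Python sketch reproduces the quoted odds. It assumes a uniformly random sequence with equal base frequencies, treats every genome position as a possible start site, and counts a single strand only; these simplifications, and the function names, are illustrative assumptions rather than anything specified in the text.

```python
# Sketch: chance of an exact 18-bp target site arising at random.
# Assumptions (ours, for illustration): each base has probability 0.25,
# positions are independent, and only one strand is considered.

def sequences_of_length(site_length_bp: int) -> int:
    """Number of distinct DNA sequences of the given length (4**n)."""
    return 4 ** site_length_bp


def expected_sites(genome_size_bp: float, site_length_bp: int) -> float:
    """Expected number of exact matches in a random genome of this size."""
    return genome_size_bp / sequences_of_length(site_length_bp)


# Genome sizes in bp, as quoted in the text.
GENOMES = {"An. gambiae": 2.7e8, "Ae. aegypti": 1.3e9}

SITE_LENGTH = 18  # length of the I-SceI target sequence in bp

print(f"Exact match at a given position: 1 in {sequences_of_length(SITE_LENGTH):.3g}")
for name, size in GENOMES.items():
    chance = expected_sites(size, SITE_LENGTH)
    print(f"{name}: roughly 1 in {1 / chance:.0f} chance of a random site")
```

Counting both strands would roughly double these chances for a non-palindromic site, so the figures are best read as order-of-magnitude estimates.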
Strengths. Homing endonuclease genes are relatively small, with dimeric nucleases such as I-CreI and I-PpoI both less than 500 bp; the larger monomeric nucleases such as I-SceI are still less than 1 kb. Thus, these genes and their resultant proteins are easy to manipulate, express, and purify. When integrated into a mosquito genome, their small size may also reduce the risks of acquiring deleterious mutations that can affect their function. Their extreme specificity can be expected to reduce or virtually eliminate the impacts of off-target cutting; the transgenic expression of I-SceI is well-tolerated by both Drosophila and An. gambiae [56,65]. Homing endonucleases are a mature technology that has been used to generate the first experimentally validated gene drive system in mosquitoes as well as transgenic sex-distorter strains ready to be deployed in SIT-type programs.
Weaknesses. The extreme specificity of homing endonucleases, which arises from extensive protein–DNA contacts at the target site, also makes reengineering these molecules to recognize new target sequences extremely difficult. While such modifications are possible [39,67–69], the cost (primarily human capital) associated with modifying homing endonucleases to recognize new and diverse target sites is the primary barrier to their widespread use as gene editing and modification tools. While large-scale sequencing of new microbial genomes and comparative genomic analyses have accelerated the discovery of new homing endonuclease genes, many of these have reduced or completely lost catalytic activity, meaning that each new nuclease must still be verified experimentally.
As someone who has studied the concept of “junk DNA” for over twenty years, I am dismayed by two statements that appear repeatedly on various blog sites discussing evolution. No, I am not referring to arguments of the form “the onion has six times more DNA than do mammals; therefore, there is no deity,” that are invariably followed by terms of disparagement hurled at anyone who even marginally departs from the Darwinian perspective. Rather, my consternation stems from a half-truth and a false fact that are recycled ad nauseam by those who apparently believe that, despite all the genomic and transcriptomic data that have been obtained only in this decade–data that have overturned a number of entrenched assumptions–a certain hypothesis published in 1980 is outside the purview of serious questioning.
The half-truth is the oft-read comment that goes something like this: “No one ever asserted that junk DNA is without function…it was long suspected that these sequences have important roles in the cells.” Now, to be fair, it is correct to say that models for, say, repetitive DNA-based operations in metazoan development have been proposed since the 1960s.1 It is also true that the evolutionary process of exaptation–the accidental acquisition of a function–has been used to explain how the odd transposon here or there along a chromosome can regulate a locus. Nonspecific effects of “extra” DNA on the cell have also been suggested for around three decades, if not longer. That said, the junk DNA hypothesis that one commonly reads as being an unassailable observation, as an incontrovertible empirical conclusion, presents as a clear prediction that the vast majority of non-gene sequences are devoid of any precise specificational role in ontogeny. Allow me to explain.
Two papers appeared back to back in the journal Nature in 1980: “Selfish Genes, the Phenotype Paradigm and Genome Evolution” by W. Ford Doolittle and Carmen Sapienza2 and “Selfish DNA: The Ultimate Parasite” by Leslie Orgel and Francis Crick.3 These laid the framework for thinking about nonprotein-coding regions of chromosomes, judging from how they are cited. What these authors effectively did was advance Dawkins’s 1976 selfish gene idea4 in such a way that all the genomic DNA evidence available up to that time could be accounted for by a plausible scenario. The thesis presented in both articles is that the only specific function of the vast bulk of “nonspecific” sequences, especially repetitive elements such as transposons, is to replicate themselves — this is the consequence of natural selection operating within genomes, beneath the radar of the cell. These junk sequences, it was postulated, can duplicate and disperse throughout chromosomes because they have little or no effect on the phenotype, save for the occasional mutation that results from their mobility. On the positive side, the C-value paradox, the longstanding puzzle that genome sizes have no correlation with perceived organismal complexity — a lily, for instance, can have twenty times more nuclear DNA than a mouse — was satisfactorily explained by the hypothesis. Also, the problem of repetitive elements of which the “variety and patterns of their interspersion with unique sequence DNA make no particular phylogenetic or phenotypically functional sense” 3 was argued to have a simple solution. Likewise, the finding in the late 1970s that protein-coding regions in eukaryotes are interrupted by nonprotein-coding “introns” could be understood…as perhaps the degenerate remains of old transposable sequences.
A careful reading of these papers reveals, though, in what ways these authors thought nonprotein-coding DNA functions to be likely. At the risk of being accused of quote-mining, let me first note the definitions of junk or selfish DNA:
A piece of selfish DNA, in its purest form, has two distinct properties:
(1) It arises when a DNA sequence spreads by forming additional copies of itself within the genome.
(2) It makes no specific contribution to the phenotype.
[W]e shall use the term selfish DNA in a wider sense, so that it can refer not only to obviously repetitive DNA but also to certain other DNA sequences which appear to have little or no function, such as much of the DNA in the introns of genes and parts of the DNA sequences between genes…The conviction has been growing that much of this extra DNA is ‘junk’, in other words, that it has little specificity and conveys little or no selective advantage to the organism…in the case of selfish DNA, the sequence which spreads makes no contribution to the phenotype of the organism, except insofar as it is a slight burden to the cell that contains it. Selfish DNA sequences may be transcribed in some cases and not in others. The spread of selfish DNA within the genome can be compared to the spread of a not-too-harmful parasite within its host.3
Natural selection operating within genomes will inevitably result in the appearance of DNAs with no phenotypic expression whose only ‘function’ is survival within genomes.2
Second, no prohibition was placed on relatively few selfish motifs modulating a gene in a way that they positively contributed to fitness, or on these elements en masse having nonspecific effects on the cell:
We do not deny that prokaryotic transposable elements or repetitive and unique-sequence DNAs not coding for protein in eukaryotes may have roles of immediate phenotypic benefit to the organism.2
It would be surprising if the host organism did not occasionally find some use for particular selfish DNA sequences, especially if there were many different sequences widely distributed over the chromosomes. One obvious use, as repeatedly stressed by Britten and Davidson, would be for control purposes at one level or another. This seems more than plausible.
A mechanism which scattered, more or less at random, many kinds of repeated sequences in many places in the genome would appear to be rather good for this purpose [of gene regulation]. Most sets of such sequences would be unlikely to find themselves in the right combination of places to be useful but, by chance, the members of one particular set might be located so that they could be used to turn on (or turn off) together a set of genes which had never been controlled before in a coordinated way. A neat way of doing this would be to use as control sequences not the many identical copies distributed over the genome, but a small subset of these which had mutated away from the master sequence in the same manner.
On this picture, each set of repeated sequences might be ‘tested’ from time to time in evolution by the production of a control macromolecule…to recognize those sequences. If this produced a favorable result, natural selection would confirm and extend the mechanism. If not, it would be selected against and discarded. Such a process implies that most sets of repeated sequences will never be of use since, on statistical grounds, their members will usually be in unsuitable places.
It thus seems unlikely that all selfish DNA has acquired a special function…
In some circumstances, the sheer bulk of selfish DNA may be used by the organism for its own purpose. That is, the selfish DNA may acquire a nonspecific function which gives the organism a selective advantage.3
In other words, the opinion expressed in these two works is that “excess” DNA is junk in the sense that it is largely devoid of phenotype-specifying information. This perspective was being discussed in the 1970s and it quickly became the consensus after this pair of papers appeared. Don’t take my word for it–follow the literature trail. Simply type in terms such as “junk DNA,” “selfish DNA,” “repetitive DNA,” “noncoding,” etc. using the PubMed search engine and read the articles. What should become obvious is that the view expounded by Orgel and Crick on the one hand, and Doolittle and Sapienza on the other, has been considered by many cellular and molecular biologists to be the correct explanation for much of genomic DNA until very recently.
So the oft-read claim on the web that the term “junk DNA” never implied developmentally “non-functional DNA” is one that is made either out of ignorance or disingenuousness.
That said, the success of the junk DNA proposal was based in part on the narrative it provided. But its acceptance was also due to definitions and presuppositions that remain with us today. Regarding the former, a gene was described in 1980 as a discrete section of the chromosome that encoded a protein or in some instances an RNA, with the “one gene, one enzyme” model exemplifying this concept. Sequences of DNA that do not specify a protein were labeled “noncoding” or, as we have seen, “nonspecific.” By connotation, then, almost all genomic regions of any given eukaryote lack coding potential, which was understood then and now to mean being a part of the “genetic program.” Linked to this definition of a gene was the assumption that cross-species conservation of a DNA string implies that it has been retained by natural selection, because it embodies some instructions that enhance the fitness of an organism. Since a large fraction of nonprotein-coding DNA is often restricted to members of a species or a genus or a family, it fails the conservation test and thus is said to be dispensable: the refuse of the duplication and transposition process. In short, the backdrop of the junk DNA hypothesis was the premise that sequences like repetitive elements are noncoding in the strictest way–encoding no proteins or RNAs other than those used in their own manipulative, lascivious, and licentious replication; and their evolutionary lability reflects this lack of coding potential.
This brings me to the false fact. It has been said that 90% of all genomic DNA (in eukaryotes) is junk. No taxon is mentioned; no reference is cited…the value is just repeated by those commenting on evo blogs. To be sure, tagging a percentage to such a claim is a lot better than simply saying that “most DNA is junk.” In lieu of an actual piece of research that demonstrated support for this proclamation, let’s critically examine the 90% junk figure by focusing on human genomic DNA. Only around 1.5% of our chromosomal sequences encode proteins, which entails that 98.5% of the genome is noncoding by the classical definition. If someone wanted to make the equation noncoding = junk, then lo and behold functional sequences in Homo sapiens drop far below the 10% value. But we know that this equation is not valid. A surprising finding of ENCODE and other transcriptome projects is that almost every nucleotide of human (and mouse) chromosomes is transcribed in a regulated way.5–15 Most of the RNAs produced are various nonprotein-coding transcripts that are copied from both strands in a cell type-, tissue type-, or developmental stage-specific manner.16–18 These RNAs belong to a number of different functional classes and new categories are being discovered all the time.19–24 Further, these nonprotein-coding transcriptional units extend into and arise from protein-coding segments. Many also map to the regions between protein-coding loci.25 The RNA map of the mammalian genome has moreover been demonstrated to be hierarchical and far from random.13,15,26
Clearly, the “gene” definition that provided the framework for the junk DNA hypothesis is defunct27,28, and much discussion now centers on providing an operational description.29–32 That is to say, the coding/noncoding distinction is being rethought. And if one considers functional DNA to be equivalent to transcription units that are developmentally expressed together with their regulatory regions, the fraction that can be dismissed as junk becomes startlingly small–this is what the results of recent studies imply.33
Indeed, if we accept the equation transcription units + control elements = developmentally functional DNA, then the number of loci in the human genome jumps from a paltry 20,000 to hundreds of thousands, and the percentage of non-junk DNA increases to well over 90%.
It could be argued that most of these RNA-encoding loci are really cellular “noise” due to transcription running amok, on the basis that so few are phylogenetically conserved–after all, didn’t Orgel and Crick foresee such a possibility in their definition of selfish DNA? Well, this line of argumentation doesn’t hold. Another counterintuitive result of the ENCODE project and other comparative genomic analyses is that known functional sections of the mammalian genome such as protein-coding segments appear to be diverging without constraint 5,34, whereas a host of “junk” sequences are under some type of selective pressure–including most human “noncoding” DNA stretches.35,36 The same has been repeatedly detected for the fruit fly genome, where most nonprotein-coding sequences appear to be under functional constraint–with the species-specific differences having the statistical hallmarks of being “adaptive” 37–40. Even the Y chromosome of the fruit fly, long presented as “exhibit A” in the gallery of garbage DNA, has been shown to have diverse effects on the phenotype of this insect.41 Such results are exactly the opposite of what Orgel and Crick and Doolittle and Sapienza predicted.
Instead of 90% of the human or fly genome being junk, it seems that 90% or more of chromosomal DNA has some kind of specific developmental function, given the available data. Indeed, the emerging picture is that the species-specific nonprotein-coding regions encode numerous RNAs that help to shape the phenotype in ways that we are only beginning to understand.42–46 This is especially true for the transposable element fraction of human chromosomes–about 50% of our DNA–much of which is arranged and expressed in a taxon-specific manner.33,47–49 Part of the reason why a human is not a chimp is not a cow is not a whale, then, is that each species has its own set of sui generis “genes”–genomic texts specifying unique RNAs or even proteins that are used in embryogenesis.
To put everything into perspective, I’ll mine another quote from a paper worth reading:
We now know that more of the DNA in eukaryotic cells is copied into RNA than previously had been thought. Many of these transcripts serve regulatory instead of template functions in gene readout. Some of these newly recognized RNAs come from regions of the genome that had heretofore been deemed “junk DNA,” yet no one could answer the obvious question: if “junk,” then why still around? Before memory fades, we should note that there were some reasonably well articulated ideas 30-40 years ago that anticipated these recent discoveries.1
Indeed, those were the very same well-articulated ideas that the selfish DNA hypothesis was supposed to have dispensed with, once and for all.
How things have changed since 1980.
1 Pederson T. 2009. The discovery of eukaryotic genome design and its forgotten corollary–the postulate of gene regulation by nuclear RNA. FASEB J. 23(7): 2019-2021.
2 Doolittle WF, Sapienza C. 1980. Selfish genes, the phenotype paradigm and genome evolution. Nature 284(5757): 601-603.
3 Orgel LE, Crick FH. 1980. Selfish DNA: the ultimate parasite. Nature 284(5757): 604-607.
4 Dawkins R. 1976. The Selfish Gene. Oxford University Press, New York, New York.
5 ENCODE Project Consortium, Birney E, et al. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447(7146): 799-816.
6 Frith MC, et al. 2006. Pseudo-messenger RNA: phantoms of the transcriptome. PLoS Genet. 2(4): e23.
7 Katayama S, et al. 2005. Antisense transcription in the mammalian transcriptome. Science 309(5740): 1564-1566.
8 Kapranov P, et al. 2007. RNA maps reveal new RNA classes and a possible function for pervasive transcription. Science 316(5830): 1484-1488.
9 Wu JQ, et al. 2008. Systematic analysis of transcribed loci in ENCODE regions using RACE sequencing reveals extensive transcription in the human genome. Genome Biol. 9(1): R3.
10 Furuno M, et al. 2006. Clusters of internally primed transcripts reveal novel long noncoding RNAs. PLoS Genet. 2(4): e37.
11 Amaral PP, et al. 2008. The eukaryotic genome as an RNA machine. Science 319(5871): 1787-1789.
12 Dinger ME, et al. In press. Pervasive transcription of the eukaryotic genome: functional indices and conceptual implications. Brief Funct Genomic Proteomic.
13 Kapranov P, et al. 2007. Genome-wide transcription and the implications for genomic organization. Nat Rev Genet. 8(6): 413-423.
14 Kapranov P, et al. 2005. Examples of the complex architecture of the human transcriptome revealed by RACE and high-density tiling arrays. Genome Res. 15(7): 987-997.
15 Carninci P, et al. 2008. Multifaceted mammalian transcriptome. Curr Opin Cell Biol. 20(3): 274-280.
16 Rinn JL, et al. 2007. Functional demarcation of active and silent chromatin domains in human HOX loci by noncoding RNAs. Cell 129(7): 1311-1323.
17 Amaral PP, Mattick JS. 2008. Noncoding RNA in development. Mamm Genome 19(7-8): 454-492.
18 Mercer TR, et al. 2008. Specific expression of long noncoding RNAs in the mouse brain. Proc Natl Acad Sci U S A 105(2): 716-721.
19 Taft RJ, et al. 2009. Small RNAs derived from snoRNAs. RNA 15(7): 1233-1240.
20 Taft RJ, et al. 2009. Tiny RNAs associated with transcription start sites in animals. Nat Genet. 41(5): 572-578.
21 Kawaji H, et al. 2008. Hidden layers of human small RNAs. BMC Genomics 9: 157.
22 Wilusz JE, et al. 2009. Long noncoding RNAs: functional surprises from the RNA world. Genes Dev. 23(13): 1494-1504.
23 Affymetrix ENCODE Transcriptome Project; Cold Spring Harbor Laboratory ENCODE Transcriptome Project. 2009. Post-transcriptional processing generates a diversity of 5′-modified long and short RNAs. Nature 457(7232): 1028-1032.
24 Borel C, et al. 2008. Mapping of small RNAs in the human ENCODE regions. Am J Hum Genet. 82(4): 971-981.
25 Khalil AM, et al. 2009. Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc Natl Acad Sci USA 106(28): 11667-11672.
26 Thurman RE, et al. 2007. Identification of higher-order functional domains in the human ENCODE regions. Genome Res. 17(6): 917-927.
27 Gerstein MB, et al. 2007. What is a gene, post-ENCODE? History and updated definition. Genome Res. 17(6): 669-681.
28 Gingeras TR. 2007. Origin of phenotypes: genes and transcripts. Genome Res. 17(6): 682-690.
29 Scherrer K, Jost J. 2007. Gene and genon concept: coding versus regulation. A conceptual and information-theoretic analysis of genetic storage and expression in the light of modern molecular biology. Theory Biosci. 126(2-3): 65-113.
30 Pesole G. 2008. What is a gene? An updated operational definition. Gene 417(1-2): 1-4.
31 Prohaska SJ, Stadler PF. 2008. “Genes”. Theory Biosci. 127(3): 215-221.
32 Stadler PF, et al. In press. Defining genes: a computational framework. Theory Biosci.
33 Faulkner GJ, Carninci P. 2009. Altruistic functions for selfish DNA. Cell Cycle 8(18): 2895-2900.
34 Margulies EH, et al. 2007. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res. 17(6): 760-774.
35 Asthana S, et al. 2007. Widely distributed noncoding purifying selection in the human genome. Proc Natl Acad Sci USA. 104(30): 12410-12415.
36 Eory L, et al. In press. Distributions of selectively constrained sites and deleterious mutation rates in the hominid and murid genomes. Mol Biol Evol.
37 Andolfatto P. 2005. Adaptive evolution of non-coding DNA in Drosophila. Nature 437(7062): 1149-1152.
38 Kondrashov AS. 2005. Evolutionary biology: fruitfly genome is not junk. Nature 437(7062): 1106.
39 Halligan DL, Keightley PD. 2006. Ubiquitous selective constraints in the Drosophila genome revealed by a genome-wide interspecies comparison. Genome Res. 16(7): 875-884.
40 Haddrill PR, et al. 2008. Positive and negative selection on noncoding DNA in Drosophila simulans. Mol Biol Evol. 25(9): 1825-1834.
41 Lemos B, et al. 2008. Polymorphic Y chromosomes harbor cryptic variation with manifold functional consequences. Science 319(5859): 91-93.
42 Glinsky GV. 2008. Phenotype-defining functions of multiple non-coding RNA pathways. Cell Cycle 7(11): 1630-1639.
43 Bond CS, Fox AH. 2009. Paraspeckles: nuclear bodies built on long noncoding RNA. J Cell Biol. 186(5): 637-644.
44 Barak M, et al. In press. Evidence for large diversity in the human transcriptome created by Alu RNA editing. Nucleic Acids Res.
45 Lee JT. 2009. Lessons from X-chromosome inactivation: long ncRNA as guides and tethers to the epigenome. Genes Dev. 23(16): 1831-1842.
46 Guttman M, et al. 2009. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458(7235): 223-227.
47 Faulkner GJ, et al. 2009. The regulated retrotransposon transcriptome of mammalian cells. Nat Genet. 41(5): 563-571.
48 Tay SK, et al. 2009. Global discovery of primate-specific genes in the human genome. Proc Natl Acad Sci U S A. 106(29): 12019-12024.
49 Walters RD, et al. 2009. InvAluable junk: the cellular impact and function of Alu and B2 RNAs. IUBMB Life. 61(8): 831-837.