Tag Archives: junk DNA

Maybe there really is junk DNA

Until about 20 years ago, molecular biology was incredibly protein-centric.  Consider the following terms — nonsense codon, noncoding DNA, junk DNA.  All are pejorative and arose from the view that all the genome does is code for protein.  Nonsense codon means one of the 3 termination codons, which tells the ribosome to stop making protein.  Noncoding DNA means not coding for protein (with the implication that DNA not coding for protein isn’t coding for anything).

The term Junk DNA goes back to the 60s, a time of tremendous hubris as the grand biochemical plan of life was being discovered. People were not embarrassed to use the term ‘central dogma’ which was DNA makes RNA makes protein. It therefore came as a shock once we had a better handle on the size of the genome to discover that less than 2% of it coded for protein. Since much of it was made of repetitive sequences it was called junk DNA.

I never bought it, thinking it very dangerous to dismiss as unimportant what you did not understand or could not measure. Probably this was influenced by my experience as an Air Force M.D. ’68 – ’70 during the Vietnam war.

But now comes a sure-to-be-contentious but well-reasoned paper arguing that junk DNA does exist, even though it is occasionally transcribed [ Cell vol. 183 pp. 1151 – 1161 ’20 ]. The paper discusses all RNAs in the cell other than ribosomal RNAs, small nucleolar RNAs (snoRNAs) and microRNAs.

They note that no enzyme is perfect, acting only on the substrate we think evolution optimized it for (they call this promiscuous behavior). So a transcription factor which binds to a particular promoter sequence will also bind to near-miss sequences. Moreover such near misses are constantly being generated in our genome by random mutation. This, they think, is why ENCODE (the ENCyclopedia Of DNA Elements) found that the entire genome is transcribed into RNA. The implication made by many is that this transcription must be functional.
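To get a feel for how many near-miss sites a promiscuous transcription factor might encounter, here is a rough sketch. It is not from the paper: it assumes a purely random genome with equal base frequencies, looks at one strand only, and uses hypothetical motif lengths of 8 and 12 nucleotides.

```python
from math import comb

# Assumption: human genome size as quoted elsewhere in these posts.
GENOME_SIZE = 3_200_000_000

def expected_sites(motif_length, max_mismatches=0, genome_size=GENOME_SIZE):
    """Expected number of positions in a random genome that match a motif
    with at most max_mismatches substitutions (equal base frequencies assumed)."""
    # Count the distinct sequences within the mismatch budget of the motif.
    variants = sum(comb(motif_length, m) * 3**m for m in range(max_mismatches + 1))
    # Each specific sequence of length k turns up with probability 1 / 4**k per position.
    return genome_size * variants / 4**motif_length

for k in (8, 12):  # hypothetical motif lengths
    exact = expected_sites(k)
    near = expected_sites(k, max_mismatches=1)
    print(f"{k}-mer: ~{exact:,.0f} exact sites, ~{near:,.0f} within one mismatch")
```

Under these assumptions, allowing a single mismatch multiplies the number of candidate sites by 1 + 3k, which is the sense in which near misses vastly outnumber perfect sites.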

However many random pieces of DNA can activate transcription [ Genes Dev. vol. 30 pp. 1895 – 1907 ’16 ] producing what the authors call transcriptional noise.

There is evidence that the cell has evolved a way to stop some of this. U1 snRNP recognizes the 5′ splice site motif. It is present in nuclei at levels an order of magnitude higher than other spliceosomal subcomplexes, so it can monitor for RNAs which have a 5′ splice site motif but which lack the 3′ splice site. These RNAs are subsequently destroyed, never making it out of the nucleus.

They think the primary function of lncRNA is chromatin remodeling affecting gene expression — this is certainly true of XIST which silences one of the two X chromosomes females carry.

There is a lot more very technical molecular biology and close reasoning in the paper, but this should be enough to whet your interest. It is well worth reading. Probably, like me, you’ll be mentally arguing with the authors as you read it, but that’s the sign of a good paper.

Now for a question which has always puzzled me. Consider the leprosy organism. It’s a mycobacterium (like the organism causing TB), but because it is essentially confined to man, and lives inside humans for most of its existence, it has jettisoned large parts of its genome, first by throwing about a quarter of it out (the genome is roughly 26% smaller than that of TB, from which it is thought to have diverged 66 million years ago), and second by mutation of many of its genes so protein can no longer be made from them. Why throw out all that DNA? The short answer is that it is metabolically expensive to produce and maintain DNA that you’re not using.

If you want a few numbers here they are:
Genome of M. TB 4,441,529 nucleotides
Genome of M. Leprae 3,268,203 nucleotides
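
The size of the reduction follows directly from the two genome sizes just quoted. A minimal sketch (the variable names are mine; only the two numbers come from the post):

```python
# Genome sizes in nucleotides, as quoted above.
M_TB = 4_441_529
M_LEPRAE = 3_268_203

lost = M_TB - M_LEPRAE          # nucleotides missing relative to M. TB
fraction_lost = lost / M_TB     # share of the M. TB genome that is gone

print(f"M. Leprae is {lost:,} nucleotides smaller ({fraction_lost:.1%} of the M. TB genome)")
```

With these sizes the reduction works out to a bit over a quarter of the M. TB genome.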

Clearly microorganisms are under high selective pressure, and the paper says that humans are under almost none, but it seems to me that multicellular organisms would have found a way to get rid of DNA they don’t need.

It may well be that all this DNA and the RNA transcribed from it is evolutionary potting soil, waiting for some new environmental stress to put it to use.

Junk that isn’t

The more we understand, the more we realize how little we’ve understood what we thought we understood.   Here is a double example.

We have 1,400,000 Alu elements in our genome. They are about 300 nucleotides long, and there is roughly one every 2,300 nucleotides in our 3,200,000,000 nucleotide genome. They don’t code for protein, and were widely thought to be junk, selfish genes whose only role was to ensure that the organism carrying them kept them along as it reproduced.
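
The Alu arithmetic can be checked in a few lines. This is only a back-of-envelope sketch using the three figures quoted above, treating every Alu as a full 300 nucleotides (an assumption, since many copies are imperfect):

```python
ALU_COUNT = 1_400_000          # copies in the genome
ALU_LENGTH = 300               # approximate nucleotides per copy
GENOME_SIZE = 3_200_000_000    # nucleotides

spacing = GENOME_SIZE / ALU_COUNT                    # average nucleotides per Alu
alu_fraction = ALU_COUNT * ALU_LENGTH / GENOME_SIZE  # share of the genome made of Alu

print(f"One Alu every ~{spacing:,.0f} nucleotides")
print(f"Alu elements make up ~{alu_fraction:.1%} of the genome")
```

So on these numbers Alu elements alone account for something like an eighth of the genome.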

This post contains a heavy dose of contemporary molecular biology.  If you’re a little shaky on some of it have a look at — https://luysii.wordpress.com/2010/07/07/molecular-biology-survival-guide-for-chemists-i-dna-and-protein-coding-gene-structure/ — and follow the links.

Not so says Proc. Natl. Acad. Sci. vol. 117 pp. 415 – 425  ’20.  They are part of several important physiologic processes (1) T lymphocyte activation (2) heat shock stress (3) endoplasmic reticulum stress.  All 3 cause transcription of Alu’s by RNA polymerase III (pol III).

All RNA levels increase with heat shock, including RNAs made from Alu elements. They bind directly and tightly (nanoMolar affinity) to RNA polymerase II (which transcribes protein coding genes) and co-occupy the promoters of repressed genes, preventing transcription of these genes and synthesis of their proteins. At least that was the state of play 11 years ago (PNAS 105 5569 – 5574 ’09).

This paper notes that Alu is not passive, but actually a self-cleaving ribozyme (an enzyme made of RNA), an entirely new role. When complexed with the protein EZH2 (a polycomb protein thought to be a transcriptional repressor using its lysine methylation activity), the rate of Alu self-cleavage increases by 40%.

So what?

In addition to stopping transcription, Alu also retards transcription elongation. So stress-induced increases in EZH2 cause Alu to cleave itself faster, turning off repression and improving the responses to the 3 types of stress above.

So we really understood neither Alu, which has been studied for years, nor EZH2, a polycomb protein (ditto). Alu is a self-cleaving ribozyme, and EZH2 doesn’t just turn off genes by its enzymatic activity (lysine trimethylation), but binds to an RNA so the RNA can cleave itself faster (i.e. it’s a cofactor).

Fascinating and humbling to see how much there is to know about things we thought we knew.  But it’s also exciting.  Who knows what else is out there to discover about the known, never mind the known unknowns.

It ain’t the bricks it’s the plan — take II

A recent review in Neuron (vol. 88 pp. 681 – 677 ’15) gives a possible new explanation of how our brains came to be so different from apes (if not our behavior of late).

You’ve all heard that our proteins are only 2% different than the chimp, so we are 98% chimpanzee. The facts are correct, the interpretation wrong. We are far more than the protein ‘bricks’ that make us up, and two current papers in Cell [ vol. 163 pp. 24 – 26, 66 – 83 ’15 ] essentially prove this.

This is like saying Monticello and Independence Hall are just the same because they’re both made out of bricks. One could chemically identify Monticello bricks as coming from the Virginia piedmont, and Independence Hall bricks coming from the red clay of New Jersey, but the real difference between the buildings is the plan.

It’s not the proteins, but where and when and how much of them are made. The control for this (plan if you will) lies outside the genes for the proteins themselves, in the rest of the genome (remember only 2% of the genome codes for the amino acids making up our 20,000 or so protein genes). The control elements have as much right to be called genes, as the parts of the genome coding for amino acids. Granted, it’s easier to study genes coding for proteins, because we’ve identified them and know so much about them. It’s like the drunk looking for his keys under the lamppost because that’s where the light is.

All the molecular biology you need to understand what follows is in the following post — https://luysii.wordpress.com/2010/07/07/molecular-biology-survival-guide-for-chemists-i-dna-and-protein-coding-gene-structure.

The Neuron paper is detailed and fascinating to a neurologist, but toward the end it begins to fry far bigger fish.

Until about 10 years ago, molecular biology was incredibly protein-centric. Consider the following terms — nonsense codon, noncoding DNA, junk DNA. All are pejorative and arose from the view that all the genome does is code for protein. Nonsense codon means one of the 3 termination codons, which tells the ribosome to stop making protein. Noncoding DNA means not coding for protein (with the implication that DNA not coding for protein isn’t coding for anything).

Well all that has changed. The ENCODE Consortium showed that well over half (and probably all) our DNA is transcribed into RNA — for details see https://en.wikipedia.org/wiki/ENCODE. This takes energy, and it is doubtful (to me at least) that organisms would waste this much energy if the products were not doing something useful.

I’ve discussed microRNAs elsewhere — for details please see — https://luysii.wordpress.com/2010/07/14/junk-dna-that-isnt-and-why-chemistry-isnt-enough/. They don’t code for protein either, but control how much of a given protein is made.

The Neuron paper concerns lncRNAs (long nonCoding RNAs). They don’t code for protein either and contain over 200 nucleotides. There are a lot of them (10,000 – 50,000 are known to be expressed in man). Amazingly 40% of them are expressed in the brain, and not just in adult life, but during embryonic development. Expression of some of them is restricted to specific brain areas. It is easier for an embryologist to tell what type a cell is during brain cortical development by looking at the lncRNAs expressed than by the proteins a given cell is making. The paper contains multiple examples of the lncRNAs controlling when and where a protein is made in the brain.

lncRNAs can contain multiple domains, each of which has a different affinity for a particular RNA (such as the mRNA for a protein), or DNA, or protein. In the nucleus they influence the DNA binding sites of transcription factors, RNA polymerase II, and the polycomb repressor complex. The review goes on with many specific examples of lncRNA function, among them synaptic plasticity and neurite extension.

Getting back to proteins, the vast majority are nearly the same in all mammals (this is where the 2% Chimpanzee argument comes from). Here is where it gets interesting. Roughly 1/3 of lncRNAs found in man are primate specific. This includes hundreds of lncRNAs found only in man. The paper gives evidence that hundreds of them have undergone positive selection in humans.

So the paper provides yet another mechanism (with far more detail than I’ve been able to provide here) for why our brains are so much larger, and different in many ways, than those of our nearest evolutionary relative, the chimpanzee. This is the largest molecular biological difference found so far for the human brain as opposed to every other brain. Fascinating stuff. Stay tuned. I think this is a watershed paper.

None dare call it junk

There has been a huge amount of controversy about whether all the DNA we carry about has some purpose to carry out, or not. Could some of it be ‘junk’?

At most 2% of our DNA actually codes for the amino acids comprising our proteins. Some (particularly the ENCODE consortium) have used the criterion of transcription of the DNA into RNA (a process which takes energy) as a sign that well over 50% of our genome is NOT junk. Others regard this transcription as the unused turnings from a lathe.

All agree however, that bacteria use a good deal of their small genomes to code for protein. The following paper http://www.pnas.org/content/112/14/4251.full quotes a figure of 84 – 89%.

Consider the humble leprosy organism. It’s a mycobacterium (like the organism causing TB), but because it is essentially confined to man, and lives inside humans for most of its existence, it has jettisoned large parts of its genome, first by throwing about a quarter of it out (the genome is roughly 26% smaller than that of TB, from which it is thought to have diverged 66 million years ago), and second by mutation of many of its genes so protein can no longer be made from them. Why throw out all that DNA? The short answer is that it is metabolically expensive to produce and maintain DNA that you’re not using.

If you want a few numbers here they are:
Genome of M. TB 4,441,529 nucleotides
Genome of M. Leprae 3,268,203 nucleotides
1,604 genes coding for protein
1,116 pseudoGenes (i.e. genes that look like they could code for proteins, but no longer can because of premature termination codons).

This brings us to the organism described in the paper, Trichodesmium erythraeum, a photosynthetic bacterium living in the ocean. When conditions are right it multiplies rapidly, causing a red algal bloom (even though it is a bacterium, not an alga). It’s probably how the Red Sea got its name.

The organism only uses 64% of its genome to code for its protein. The most interesting point is that 86% of the nonCoding (for protein anyway) DNA is transcribed into RNA.
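
Putting the two percentages together shows how much of the whole genome is transcribed without coding for protein. A small sketch using only the 64% and 86% figures above (the variable names are mine):

```python
coding_fraction = 0.64               # share of the genome coding for protein
transcribed_of_noncoding = 0.86      # share of the noncoding DNA that is transcribed

noncoding_fraction = 1 - coding_fraction
transcribed_noncoding = noncoding_fraction * transcribed_of_noncoding
silent_noncoding = noncoding_fraction - transcribed_noncoding

print(f"Noncoding but transcribed: {transcribed_noncoding:.1%} of the genome")
print(f"Noncoding and not transcribed: {silent_noncoding:.1%} of the genome")
```

In other words, close to a third of this bacterium's entire genome is copied into RNA that never becomes protein.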

The authors wrestle with the question of what the nonCoding DNA is doing.

“Because it is thought that many bacteria are deletion-biased (47, 77), stable maintenance of these elements from laboratory isolates to the natural samples suggest that they may be required in some fashion for growth both in culture and in situ.”

Translation: The nonCoding DNA probably isn’t junk.

They give it another shot.

“Others have hypothesized that the conserved repeat structures observed in some bacteria could function as recombination-dependent “promoter banks” for adaptation to new conditions, thereby allowing relatively quick “rewiring” of metabolism in subpopulations”

Plausible, but why waste the energy transcribing the DNA into RNA if it isn’t doing anything for the organism doing the transcribing?

Never assume that what you can’t measure or don’t understand is unimportant.

Scary stuff

While you were in your mother’s womb, endogenous viruses were moving around the genome in your developing brain according to [ Neuron vol. 85 pp. 49 – 59 ’15 ].

The evidence is pretty good. For a while half our genome was called ‘junk’ by those who thought they had molecular biology pretty well figured out. For instance 17% of our 3.2 gigaBase DNA genome is made of LINE1 elements. These are ‘up to’ 6 kiloBases long. Most are defective in the sense that they stay where they are in the genome. However some are able to be transcribed into RNA, the RNA translated into proteins, among which is a reverse transcriptase (just like the AIDS virus) and an integrase. The reverse transcriptase makes a DNA copy of the RNA, and the integrase puts it back into the genome in a different place.

Most LINE1 DNA transcribed into RNA has a ‘tail’ of polyAdenine (polyA) tacked onto the 3′ end. The number of A’s tacked on isn’t coded in the genome, so it’s variable. This allows the active LINE1’s (under 1/1,000 of the total) to be recognized when they move to a new place in the genome.

It’s unbelievable how far we’ve come since the Human Genome Project, which took over a decade and over a billion dollars to sequence a single human genome (still being completed by the way, filling in gaps etc. etc. [ Nature vol. 517 pp. 608 – 611 ’15 ] using a haploid human tumor called a hydatidiform mole). The Neuron paper sequenced the DNA of 16 single neurons. They found LINE1 movement in 4.

Once a LINE1 element has moved (something very improbable) it stays put, but all cells derived from it have the LINE1 element in the new position.

They found multiple lineages and sublineages of cells marked by different LINE1 retrotransposition events and subsequent mutation of polyA microsatellites within L1. One clone contained thousands of cells limited to the left middle frontal gyrus, while a second clone contained millions of cells distributed over the whole left hemisphere (did they do whole genome on millions of cells?).

There is one fly in the ointment. All 16 neurons were from the same ‘neurologically normal’ individual.

Mosaicism is a term used to mean that different cells in a given individual have different genomes. This is certainly true in everyone’s immune system, but we’re talking brain here.

Is there other evidence for mosaicism in the brain? Yes. Here it is

[ Science vol. 345 pp. 1438 – 1439 ’14 ] 8/158 kids with brain malformations with no genetic cause (as found by previous techniques) had disease causing mutations in only a fraction of their cells (hopefully not brain cells obtained by biopsy). Some mosaicism is obvious, the cafe au lait spots of McCune Albright syndrome for example. DNA sequencing takes the average of multiple reads (of the DNA from multiple cells?). Mutations found in only a few reads are interpreted as part of the machine’s inherent error rate. The trick was to use sequencing of candidate gene regions to a depth of 300 (rather than the usual 50 – 60).
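
Why depth matters can be illustrated with a simple binomial model. This is my own sketch, not the method of the Science paper: it assumes a heterozygous mutation present in 10% of cells, reads sampled independently, and a hypothetical rule that at least 5 variant reads are needed before a mutation is believed rather than written off as machine error.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))

def detection_probability(depth, mosaic_fraction, min_reads=5):
    """Chance of seeing at least min_reads variant reads at a given depth.

    min_reads is a hypothetical calling threshold, chosen for illustration only.
    """
    # A heterozygous mutation sits on one of two copies, so the expected
    # fraction of variant reads is half the fraction of mutated cells.
    variant_read_fraction = mosaic_fraction / 2
    return prob_at_least(min_reads, depth, variant_read_fraction)

for depth in (55, 300):  # the "usual" depth versus the deeper sequencing
    p = detection_probability(depth, mosaic_fraction=0.10)
    print(f"depth {depth}: P(detect) = {p:.3f}")
```

Under these assumptions the mosaic mutation is usually missed at ordinary depth and almost always seen at a depth of 300, which is the point of sequencing the candidate regions more deeply.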

It is possible that some genetically ‘normal’ parents who have abnormal kids are mosaics for the genetic abnormality.

[ Science vol. 342 pp. 564 – 565, 632 – 637 ’13 ] Our genomes aren’t perfect. Each human genome contains 120 protein gene inactivating variants, with 20/120 being inactivated in both copies.

The blood of ‘many’ individuals becomes increasingly clonal with age, and the expanded clones often contain large deletions and duplications, a risk factor for cancer.

Some cases of hemimegalencephaly are due to somatic mutations in AKT3.

30% of skin fibroblasts ‘may’ have somatic copy number variations in their genomes.

The genomes of 110 individual neurons from the frontal cortex of 3 people were sequenced. 45/110 of the neurons had copy number variations (CNVs), ranging in size from 3 megaBases to a whole chromosome. 15% of the neurons accounted for 73% of the CNVs. However, 59% of neurons showed no CNVs, while 25% showed only 1 or 2.
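
The percentages in that last paragraph hang together, as a quick check shows. This sketch uses only the counts quoted above:

```python
neurons_total = 110
neurons_with_cnv = 45

share_with = neurons_with_cnv / neurons_total
share_without = 1 - share_with           # should match the quoted 59%
heavy_neurons = 0.15 * neurons_total     # the 15% carrying 73% of the CNVs

print(f"With CNVs: {share_with:.0%}, without: {share_without:.0%}")
print(f"About {heavy_neurons:.1f} neurons account for 73% of the CNVs")
```

So a small handful of neurons, roughly 16 or 17 of the 110, carry most of the copy number changes.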

What junk DNA is doing

I’ve never bought the idea that the 98% of our 3.2 gigaBase genome not coding for protein is junk. Consider the humble leprosy organism. It’s a mycobacterium (like the organism causing TB), but because it is essentially confined to man, and lives inside humans for most of its existence, it has jettisoned large parts of its genome, first by throwing about a quarter of it out (the genome is roughly 26% smaller than that of TB, from which it is thought to have diverged 66 million years ago), and second by mutation of many of its genes so protein can no longer be made from them. Why throw out all that DNA? The short answer is that it is metabolically expensive to produce and maintain DNA that you’re not using.

Which brings us to Cell vol. 156 pp. 907 – 919 ’14. At least half of our genome is made of repetitive elements. We have some 520,000 (imperfect) copies of LINE1 elements — each up to 6,000 nucleotides long. There are 1,400,000 (imperfect) copies of Alu each around 300 nucleotides long. This stuff has been called junk for decades. However it has become apparent that over 50% of our entire genome is transcribed into RNA. This is also expensive metabolically.
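
The ‘up to’ in the LINE1 length matters, as a little arithmetic shows. This sketch combines the copy number and maximum length given here with the 17% of the genome figure quoted for LINE1 in the ‘Scary stuff’ post; combining them is my own back-of-envelope step, not something from the Cell paper.

```python
GENOME_SIZE = 3_200_000_000   # nucleotides
LINE1_COPIES = 520_000
LINE1_MAX_LENGTH = 6_000      # nucleotides, full-length element
LINE1_GENOME_SHARE = 0.17     # figure quoted in the 'Scary stuff' post

# If every copy were full length, LINE1 alone would nearly fill the genome.
upper_bound = LINE1_COPIES * LINE1_MAX_LENGTH
# What the 17% figure implies for the average copy.
line1_total = LINE1_GENOME_SHARE * GENOME_SIZE
mean_length = line1_total / LINE1_COPIES

print(f"All copies full length: {upper_bound:,} nucleotides "
      f"({upper_bound / GENOME_SIZE:.1%} of the genome)")
print(f"Implied average copy length: ~{mean_length:,.0f} nucleotides")
```

The implied average is around a sixth of the full length, which fits with most copies being truncated, defective remnants.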

Addendum 17 Mar: Just the cost of making a single nucleotide from scratch to hook into mRNA is 50 ATP molecules (according to an estimate I read). It also takes energy for the polymerase to hook two nucleotides together — but I can’t find out what it is (anyone know?). It’s hard to avoid teleology when thinking about biology — but why should a cell expend all this metabolic energy to copy half or more of its genome into RNA, if it weren’t getting something useful back?
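
For scale, here is what that estimate implies for a single pass over half the genome. This is a sketch only: the 50 ATP figure is the estimate mentioned above, and the calculation ignores the unknown polymerization cost and any recycling of nucleotides from degraded RNA.

```python
ATP_PER_NUCLEOTIDE = 50        # estimate quoted above, for making one nucleotide from scratch
GENOME_SIZE = 3_200_000_000    # nucleotides
FRACTION_TRANSCRIBED = 0.5     # "half or more" of the genome

nucleotides = int(GENOME_SIZE * FRACTION_TRANSCRIBED)
atp_cost = nucleotides * ATP_PER_NUCLEOTIDE

print(f"{nucleotides:,} nucleotides x {ATP_PER_NUCLEOTIDE} ATP = {atp_cost:.1e} ATP per pass")
```

That is the cost of copying each transcribed nucleotide just once; real transcripts are made many times over, so the true figure would be higher.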

Why hasn’t evolution got rid of this stuff, like the leprosy organism? Probably because it’s doing several important things we don’t understand. Here’s one of them. The Cell paper did something clever and obvious (now that someone else thought of it). C0T-1 DNA is placental DNA predominantly 50 – 300 nucleotides in size, very enriched in repetitive DNA sequences. It is used to block nonspecific hybridization in microarray screening for mRNA coding for protein. The authors used C0T-1 DNA to look at whole cells to find RNA transcribed from these repetitive elements, and more importantly, to find where in the cell it was located.

Guess what they found? RNA from repetitive DNA is associated big time with interphase (i.e. not undergoing mitosis) active chromatin (aka euchromatin). So RNA transcribed from Alu and LINE1 is a structural component of our chromosomes. Since the length of the 3.2 gigaBases of our genome, if stretched out, is 1 METER, a lot of our DNA occurs in very compact structures (heterochromatin) which are thought to be transcriptionally inactive. What happens when you use RNase (an enzyme breaking down RNA) to remove it? The chromosomes condense to heterochromatin. So the junk may be keeping our chromosomes in an ‘open’ state, a fairly significant function.
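
The 1 meter figure is easy to verify. The sketch below uses the standard rise of about 0.34 nanometers per base pair for ordinary (B-form) double-stranded DNA, a textbook value that is not given in the post:

```python
GENOME_SIZE = 3_200_000_000    # base pairs in one copy of the genome
RISE_PER_BASE_PAIR_M = 0.34e-9 # meters per base pair for B-form DNA (textbook value)

length_m = GENOME_SIZE * RISE_PER_BASE_PAIR_M
print(f"One copy of the genome stretched out: {length_m:.2f} meters")
```

Since most of our cells carry two copies of the genome, the nucleus of each is packing about twice that length.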

This is the exact opposite of XIST, a 17,000 nucleotide RNA transcribed from the X chromosome, which keeps one of the two X’s each female possesses inactive by coating it like the ecRNAs.

The authors conclude with “we are far from understanding genome expression and regulation.” Amen.

If some of this is a bit above your molecular biological pay grade — please see a series of articles “Molecular Biology Survival Guide for Chemists” — here’s a link to the first one — https://luysii.wordpress.com/2010/07/07/molecular-biology-survival-guide-for-chemists-i-dna-and-protein-coding-gene-structure/. There are 4 more.

We wouldn’t exist if retroviruses weren’t moving around in our genome.

Time for some of the excellent molecular biology I’ve put off writing about while I plow through the new Clayden. I reached the halfway point today (p. 590), exactly 2 months and 2 weeks after it arrived. The chemist might need some brushing up on DNA and messenger RNA before pushing on. Pretty much all the background needed is found in https://luysii.wordpress.com/2010/07/07/molecular-biology-survival-guide-for-chemists-i-dna-and-protein-coding-gene-structure/ and https://luysii.wordpress.com/2010/07/11/molecular-biology-survival-guide-for-chemists-ii-what-dna-is-transcribed-into/.

Everyone has heard of the AIDS virus. It has so far been impossible to cure because it hides in our DNA doing next to nothing. Tickle it in a variety of unknown ways, and its DNA is transcribed into messenger RNA (mRNA), the virus is assembled and goes on to wreak havoc with our immune system. How does the AIDS virus get into our DNA in the first place? Its genome is made of RNA, not DNA. It has an enzyme (reverse transcriptase) which transcribes its RNA into DNA, and another enzyme (the integrase, which is actually a complex of proteins) which patches the DNA copy (called cDNA) into our genome. That’s why we can’t get rid of it. That’s also why it’s called a retrovirus (because of retrograde transcription of its RNA into cDNA).

Well, sorry to say, but at least 10% of our DNA is made of retrovirus remnants.  The vast majority of them have been crippled by mutation so their reverse transcriptases  don’t work any more, or there is something wrong with their integrase, etc. etc.  Some of them do make RNA copies of themselves however, but the copies are mutated enough that infectious virus doesn’t form.  But the RNA copies can be reverse transcribed  into cDNA and reinserted back into our DNA, and in a new site to boot.  This is why they are called retrotransposons.

The whole bunch of retroviruses, retrotransposons, and other repetitive elements of DNA have been called ‘junk’ by eminent authority.  Another epithet for them is the selfish gene — which exists only to reproduce itself.  Humans are said to be machines for reproducing human DNA.

Enter [ Cell vol. 150 pp. 7 – 9, 29 – 38 ’12 ]. Now it’s time for some very human biology. The fetus represents an immunologically different graft to the mother. Half its antigens are tolerated because they are maternal, the paternal half are not likely to be. Allogeneic means a transplant from a different member of the same species, so the fetus is regarded as semiallogeneic.

So why doesn’t our immune system attack the placenta surrounding the fetus, which expresses the paternal proteins?  There’s probably a lot more to it but a class of immune cell called a regulatory T cell (Treg) shuts down the immune response wherever they are found, and the placenta has lots of them.

Different cells express different proteins, and Tregs are no exception. A transcription factor is something that binds to the DNA in front of a gene, turning on transcription of the gene, ultimately increasing production of the protein the gene codes for. Specificity is obtained by the transcription factor binding to particular sequences of DNA, which are found only in front of a subset of genes.

The transcription factor which turns on genes necessary to turn an immune cell into a Treg is called Foxp3. Foxp3 is a protein, and to have lots of it around the gene for it must be turned on so its mRNA can be made. Guess what? This means that other transcription factors must bind in front of the Foxp3 gene.
Here’s Jonathan Swift on the subject:
“So nat’ralists observe, a flea
Hath smaller fleas that on him prey,
And these have smaller fleas that bite ’em,
And so proceed ad infinitum.”

An important protein like Foxp3 is highly controlled. There are 3 distinct regions in front of the gene where other transcription factors and repressors of transcription bind. They are called conserved nonCoding sequences (CNSs), an oxymoron, because they are clearly coding for something quite important. The 3 sequences are called CNS1, CNS2 and CNS3. Technology has progressed to the point where we can remove just about any DNA sequence from the mouse genome we wish (the resultant mice are called knockout mice).

Anyway if you knock out CNS1 the mice resorb semiallogeneic fetuses (where the father and the mother aren’t genetically related), but not syngeneic fetuses (where the genomes of the father and the mother are pretty much the same due to inbreeding). It’s possible to trace Foxp3 far back in evolution. Only animals with placentas (eutherians) have CNS1 in addition to CNS2 and CNS3. Marsupials, which don’t have placentas, just have CNS2 and CNS3.

So where do retrotransposons come in? The structure of CNS1 shows that it is a retrotransposon which moved in front of the Foxp3 gene. It mutated enough for a new and different set of transcription factors to bind to it and turn on Foxp3 expression in the placenta, allowing survival of the fetus. Some junk DNA indeed!