When I study math books, I’m always amazed at how much the reader is expected to internalize and retain. A theorem proved 100 pages or so ago is referred to in the course of a proof without further ado. The pure chemist reading this longest of posts, with minimal exposure to modern molecular biology, may feel the same way. You’ll need all 4 articles of https://luysii.wordpress.com/category/molecular-biology-survival-guide/, and all 6 articles of https://luysii.wordpress.com/category/the-cell-nucleus-on-a-human-scale/ at your fingertips to get through this one. The stuff is at my mental fingertips because I’ve been learning and thinking about it for decades. Perhaps mathematicians are the same way, or perhaps they really are smarter than everyone else.
The article assumes you have a solid chemical background. I find it somewhat sad that only a chemist with a decent molecular biological background can fully understand the elegance and beauty of what is to follow. I hope this post and the 10 above provide enough background for what is to follow.
Recall that eukaryotic RNA polymerase II (pol II) is really a complex of 12 distinct proteins in man with a total mass of 550 kiloDaltons. The RPB1 subunit is the largest of the 12 and contains a truly fascinating carboxy terminal domain (CTD), to be discussed in some detail later in this post. The function of pol II is transcription of a protein coding gene into messenger RNA (mRNA). Pol II binds to DNA upstream of (5′ to) the DNA which actually codes for the amino acids making up the protein. Just binding there (this site is called the promoter) is far from enough for gene transcription to actually begin. 5 general transcription factors (pol II transcription factors B, D, E, F, H, aka TFIIB, etc.) are required. All 5 general transcription factors are actually multiprotein complexes. Then there is the mediator complex, a complex of more than 20 proteins which allows communication between transcriptional activators (enhancers) and repressors found elsewhere in the DNA. So the whole gemish contains 60 proteins with a mass of 3,500,000 Daltons. The heaviest atoms found in any quantity in all this are phosphorus and sulfur (about 31 and 32 Daltons), so this means at least 100,000 atoms are involved. Have a look at Science vol. 288 pp. 632 – 633, 640 – 649 ’00; it’s old but good and written by Kornberg fils who won his Nobel for this work.
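For anyone who wants to check the atom count, here is a minimal Python sketch of the estimate. It assumes the lower bound is obtained by pretending every atom is as heavy as the heaviest common one, and it uses rounded atomic masses (31 for phosphorus, 32 for sulfur, which turns up in cysteine and methionine). The function name is mine, purely for illustration.

```python
# Rough lower bound on the number of atoms in the initiation complex.
# Assumption: every atom weighs as much as the heaviest common atom,
# so the true count (mostly H, C, N, O) must be considerably larger.

COMPLEX_MASS_DA = 3_500_000  # total mass quoted in the text

def min_atoms(total_mass_da, heaviest_atom_da):
    """Whole atoms that fit if each one had the heaviest mass."""
    return total_mass_da // heaviest_atom_da

print(min_atoms(COMPLEX_MASS_DA, 31))  # phosphorus: 112903
print(min_atoms(COMPLEX_MASS_DA, 32))  # sulfur: 109375
```

Either way the bound stays above 100,000, and since hydrogen, carbon, nitrogen and oxygen are all much lighter, the real number is several times that.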
I’ve mentioned some of the processing that goes on after the section of the DNA actually coding for amino acids is transcribed into RNA (splicing, the polyA tail, etc. etc.). There is also some modification of the 5′ end of the RNA (called the cap), requiring a variety of binding proteins and enzymes to occur.
Just binding to the promoter, separating the two strands of DNA and starting to copy (transcribe) one of them into RNA is not enough. This happens all the time, but after making RNAs 5 – 10 nucleotides long, pol II pauses, releases the RNA just made and pops back to the promoter (which it really never left). The other proteins of the 3.5 megaDalton initiation complex hold onto pol II keeping it there.
Here is where the carboxy terminal domain of the largest subunit of pol II comes in. It is a fascinating structure, which can only be completely understood by the chemist. It is made of 52 imperfect repeats of 7 amino acids. Here is the consensus repeat (listed from the amino terminal end to the carboxy terminal end, as protein sequences are always presented).
Tyrosine Serine Proline Threonine Serine Proline Serine
What should strike the biochemically oriented chemist is that the 3 (out of 20) amino acids with hydroxyl groups account for 5/7ths of the structure. This means that all 5 of those positions can be phosphorylated. The two prolines are hardly dull, because they make it impossible for classic alpha helices to form (sometimes they are called helix breakers). The OH groups mean that the heptad is quite hydrophilic. Phosphorylation of any two OHs of the heptad means that the chain will be pretty much stretched straight out due to charge-charge repulsion. The number of distinct phosphorylated states of even one heptad is 2^5 = 32; that for the whole CTD is 32^52.
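The counting can be spelled out in a few lines of Python. This is only a sketch under a stated idealization: it treats all 52 repeats as identical copies of the consensus heptad (the real repeats are imperfect) and counts each hydroxyl as simply phosphorylated or not. The variable names are mine.

```python
# Count the possible phosphorylation patterns of the CTD.
# Assumption: all 52 repeats match the consensus heptad exactly.

HEPTAD = ["Tyr", "Ser", "Pro", "Thr", "Ser", "Pro", "Ser"]
HYDROXYL_RESIDUES = {"Ser", "Thr", "Tyr"}  # the 3 amino acids with an OH
N_REPEATS = 52

# Each hydroxyl-bearing position is a site that is either phosphorylated or not.
sites_per_heptad = sum(res in HYDROXYL_RESIDUES for res in HEPTAD)
states_per_heptad = 2 ** sites_per_heptad
states_whole_ctd = states_per_heptad ** N_REPEATS

print(sites_per_heptad)            # 5
print(states_per_heptad)           # 32
print(len(str(states_whole_ctd)))  # number of decimal digits in 32**52
```

Since 32^52 is the same as 2^260, the total comes to a 79 digit number, somewhere near 10^78.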
Chemists more familiar with biochemistry know that phosphorylation and dephosphorylation of serine, threonine and tyrosine are extensively used by the cell to control protein/protein interactions. That’s why our genome codes for 518 different protein kinases (which esterify hydroxyls with phosphate despite the rather weird name) and 137 phosphatases.
So the phosphorylation state (how much, which ones) of the carboxy terminal domain determines which proteins bind to it. Here is where the fun begins.
Just to give a glimpse of what is going on in our cells all the time, here are the gory details of formation of the cap at the 5′ end of mRNA. You don’t have to read the details between the asterisks to follow the rest of the post.
***
[ Proc. Natl. Acad. Sci. vol. 86 pp. 5795 – 5799 ’89 ] All cellular cytoplasmic mRNAs have a 7 methyl guanylate cap attached to their 5′ ends. The cap structure is added early during the transcription of mRNA by RNA polymerase II in the nucleus (after the first 25 nucleotides of a given mRNA are formed).
Three enzymes are involved in mRNA cap formation:
(1) an RNA triphosphatase which cleaves the 5′ triphosphate terminus of the primary transcript to a 5′ diphosphate terminated RNA
(2) a guanyltransferase, which caps the structure with GMP — forming a 5′ – 5′ linkage
(3) a methyl transferase which adds a methyl group to the nitrogen at position #7 of guanine (see the structure of 7 methyl guanosine).
In addition, the cap structure can then be further methylated by a fourth enzyme, a ribose 2′-O-methyltransferase.
***
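For readers who think better in code, the capping steps above can be followed as a toy sequence of string rewrites on the 5′ end of the transcript. This is only bookkeeping, not chemistry: the shorthand (pppN, ppN, GpppN, m7GpppN, m7GpppNm) is the usual notation for these ends, and the function names are mine, chosen for illustration.

```python
# Toy bookkeeping of the 5' end of a new transcript during capping.
# "N" stands for the first transcribed nucleotide, "p" for a phosphate.

def rna_triphosphatase(end):
    """pppN -> ppN: remove the terminal phosphate."""
    if not end.startswith("ppp"):
        raise ValueError("expected a 5' triphosphate end")
    return end[1:]

def guanylyltransferase(end):
    """ppN -> GpppN: add GMP through a 5'-5' linkage."""
    if not end.startswith("pp") or end.startswith("ppp"):
        raise ValueError("expected a 5' diphosphate end")
    return "Gp" + end

def n7_methyltransferase(end):
    """GpppN -> m7GpppN: methylate N7 of the cap guanine."""
    if not end.startswith("Gppp"):
        raise ValueError("expected an unmethylated cap")
    return "m7" + end

def ribose_2o_methyltransferase(end):
    """m7GpppN -> m7GpppNm: methylate the 2' O of the first ribose."""
    if not end.startswith("m7Gppp"):
        raise ValueError("expected a 7 methyl guanylate cap")
    return end + "m"

end = "pppN"
for step in (rna_triphosphatase, guanylyltransferase,
             n7_methyltransferase, ribose_2o_methyltransferase):
    end = step(end)
    print(step.__name__, "->", end)
```

Each function refuses to act on the wrong substrate, which mirrors the fact that the steps have to happen in order.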
The 3 capping enzymes bind to the phosphorylated carboxy terminal domain of pol II, so they can grab the newly formed 5′ end of the mRNA as it emerges from a tunnel in pol II. Not only that, but the enzymes bind to a specific pattern of phosphorylation of the tail (namely serine #5 by a kinase called Cdk7).
An intricate mechanism exists to stop transcription from proceeding too far, so the 5′ end of the emerging RNA is properly processed. During the formation of the transcription initiation complex (or soon after initiation) DRB sensitivity inducing factor (DSIF) is recruited to the transcription complex (by binding to the CTD). Additionally, after initiation of transcription, the negative elongation factor (NELF) is recruited through interaction with DSIF. This results in the arrest of the transcription complex before it enters into productive elongation. DSIF/NELF mediated arrest is then relieved by means of phosphorylation of the carboxy terminal domain on serine #2 by positive transcription elongation factor b (P-TEFb) and the transcription complex resumes elongation. This causes DSIF and NELF (both are proteins) to drop away from the CTD.
Even so, pol II is still linked to the initiation complex at the promoter. How does it get started again and move away from the promoter? The process is called promoter clearance or promoter escape. Another phosphorylation of the CTD is involved — this time on serine #5 by a kinase called Cdk7, which is found in one of the general transcription factor complexes (TFIIH).
Eventually a whole bunch of proteins (called the super elongation complex) binds to the CTD allowing not just escape, but movement down DNA. The complex includes the P-TEFb, ELL2, AFF4, AFF1, ENL and AF9 proteins. So now pol II is chugging down DNA adding a new base every 50 milliSeconds or so. A whole other group of kinases modifies the CTD so different proteins can bind to it after the terminal codon is reached and finish processing the mRNA. I’m going to skip this as you have the general idea, but rest assured it is just as complicated as putting on the 5′ cap described above.
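To get a feel for what 50 milliSeconds per base means, here is a small Python sketch. The only number taken from the text is the 50 milliSeconds; the 100,000 base gene length is a made-up round figure for illustration, and the calculation assumes a constant speed with no pausing (which, as the next paragraph notes, is not what happens at nucleosomes).

```python
# How long would pol II take to transcribe a gene at a steady pace?
# Assumptions: constant speed, no pausing, hypothetical gene length.

MS_PER_BASE = 50  # elongation time per base, from the text

def bases_per_second(ms_per_base=MS_PER_BASE):
    return 1000 / ms_per_base

def transcription_seconds(gene_length_bases, ms_per_base=MS_PER_BASE):
    return gene_length_bases * ms_per_base / 1000

EXAMPLE_GENE_LENGTH = 100_000  # hypothetical gene, in bases

print(bases_per_second())                               # 20.0
print(transcription_seconds(EXAMPLE_GENE_LENGTH))       # 5000.0 seconds
print(transcription_seconds(EXAMPLE_GENE_LENGTH) / 60)  # same time in minutes
```

So even at 20 bases a second, a gene of that size would keep the polymerase busy for well over an hour.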
Now for the exquisite mechanisms described in Proc. Natl. Acad. Sci. vol. 108 pp. 14717 – 14718 ’11. In the previous post (https://luysii.wordpress.com/2011/09/18/the-cell-and-its-nucleus-on-a-human-scale-vi-untwisting-the-linguini/) I wondered how the large pol II enzyme transcribes DNA wound twice around the nucleosome (I really haven’t found an answer that satisfies me). Work has shown that pol II slows down when it reaches a nucleosome (it incorporates fewer nucleotides into the growing mRNA per second).
“95% of human multiexon protein coding genes are alternatively spliced” [ Nature vol. 465 pp. 16 – 17 ’10 ]. So how is the decision made between two alternative exons by the splicing machinery? It turns out that pol II is involved here as well. There is no logical reason it has to be. The whole mRNA could be formed by the polymerase and then it could move elsewhere in the nucleus to the splicing machinery. But in this one well studied case, alternative splicing occurs as pol II is transcribing one particular gene (which is mutated in type I neurofibromatosis).
Now for a side trip to neurology. There is an awful disease called paraneoplastic encephalomyelitis. The brain is subject to an immune attack in some patients with cancer (and in some it can be the first symptom) with resultant dementia, convulsions, incoordination and death. For years we wondered what the immune system was attacking. Now we know it is any of three proteins (HuB, HuC, HuD) found only in the brain. They bind to messenger RNA. Why the immune system sometimes chooses them for attack and how cancer sometimes triggers this isn’t known for sure. One of the theories is that the cancer cells produce something that immunologically looks like the Hu proteins, which the immune system regards as foreign. Fortunately it is fairly rare, but I did see a few cases.
Also recall that the nucleosome is only the first stage of the 100,000 fold compaction of DNA required to fit it into the nucleus. The higher order arrangement of nucleosomes has been the subject of decades of intense study which unfortunately hasn’t reached a conclusion, but there is no question that nucleosomes are close together in the nucleus, whether or not they are arranged in the 30 nanoMeter fiber packing 6 or so nucleosomes per level of the fiber.
So the 3 Hu’s are yet another set of proteins binding to the carboxy terminal domain (CTD) of the large subunit of pol II. So what? They interact with histone deacetylase 2 (HDAC2) which removes the acetyl group from the epsilon amino group of lysine, changing an amide to an amine and increasing the positive charge on the nitrogen. This has the effect of compacting DNA as the protonated amine can then bind to the zillions of negatively charged phosphates of the DNA backbone. Here’s another place where you simply must know chemistry to understand what’s going on.
So a protein bound to the CTD of pol II recruits another protein which chemically modifies another protein around which DNA is wrapped. This has the remarkable effect of directly linking the epigenetic machinery to the transcription machinery. Epigenetics had been thought of as determining which proteins were made in a given cell (e.g. an on/off effect) rather than how they were spliced.
How does this work? The theory is advanced that certain splicing signals are stronger than others. This means if the transcription machinery is slowed down (say by more chromosome compaction), it will have a chance to splice at the weaker splicing signal.
Things are even more complicated. Back in the day, newsreels were shown before movies (rather than the hideous trailers of today). They sometimes amused American audiences by showing sped up films of crazed foreigners playing the sport of curling — see http://en.wikipedia.org/wiki/Curling. A (very heavy) stone is essentially slid on ice toward a target. In front of the stone are two guys sweeping furiously, to alter the surface of the ice, so the stone lands where they want it to. With sped up film, they look like idiots.
The PNAS article proposes that something like that happens during transcription: preceding the pol II complex are enzymes called histone acetyl transferases (HATs), the yang to the yin of the HDACs. They acetylate the epsilon amino group of lysines on the histones making up the nucleosomes (making it harder for lysine to bind to the phosphates of DNA). This presumably opens up compacted DNA, letting pol II (which is pretty large itself at 5 x 5 x 7 nanoMeters) get through the chromatin and easing transcription. These are the sweepers of curling. Then along comes pol II. Near the end of its run along the gene, it recruits Hu proteins which recruit HDAC2 which closes up chromatin again.
Elegant yes? Incredible, no?
Hopefully, a few readers have actually made it this far. For questions, critiques, ambiguities, errors of fact, etc. etc., just post a comment.
Now for some philosophy. You can’t really understand any of this without knowing a fair amount of organic chemistry and some protein chemistry as well. Chemistry explains how all this happens. It is totally useless in explaining why. As soon as you ask just what the CTD, the Hu proteins, HDACs, HATs, pol II or anything else in the cell are for, you are in the land of Aristotle, where everything had an innate purpose and function. You have crossed the Cartesian divide between the physical and the world of ideas, a place where chemistry can no longer help you.
Still, it is a magnificent thing to have the background to contemplate all this. Even so, I’m sure our knowledge is far from complete. No one said it better than Pascal — “Man is but a reed, the most feeble thing in nature, but he is a thinking reed.”