Category Archives: Molecular Biology Survival Guide

The Bach Fugue of the Genome

There are more things in heaven and earth, Horatio,
Than are dreamt of in your philosophy.
– Hamlet (1.5.167-8), Hamlet to Horatio

Just when you thought we’d figured out what genomes could do, the virusoid of rice yellow mottle virus performs a feat of dense coding I’d have thought impossible. The following work requires a fairly sophisticated understanding of molecular biology, for which the articles in “Molecular Biology Survival Guide for Chemists” might provide the background. Give it a shot. This is fascinating stuff. If the following seems incomprehensible, start with https://luysii.wordpress.com/2010/07/07/molecular-biology-survival-guide-for-chemists-i-dna-and-protein-coding-gene-structure/ and then follow the links forward.

Virusoids are single stranded circular RNAs which are dependent on a virus for replication. They are distinct from viroids because viroids need no helper virus to replicate. Neither the virusoid nor the viroid was thought to code for protein (until now). Virusoids are usually found inside the protein shells of plant viruses.

[ Proc. Natl. Acad. Sci. vol. 111 pp. 14542 - 14547 '14 ] Viroids and virusoids (viroid like satellite RNAs) are small (220 – 450 nucleotide) covalently closed circular RNAs. They are the smallest known replicating circular RNA pathogens. They replicate via a rolling circle mechanism to produce larger concatemers which are then processed into monomeric forms by a self-splicing hammerhead ribozyme, or by cellular enzymes.

The rice yellow mottle virus (RYMV) contains a virusoid which is a covalently closed circular RNA of a mere 220 nucleotides. A 16 kiloDalton basic protein is made from it. How can this be? Figure the average molecular mass of an amino acid at 100 Daltons, and 3 nucleotides (one codon) per amino acid. This means that 220 nucleotides can code for 73 amino acids at most (i.e. for a 7 – 8 kiloDalton protein).

So far the RYMV virusoid is the only RNA among viroids and virusoids which actually codes for a protein. The virusoid sequence contains an internal ribosome entry site (IRES) of the following form: UGAUGA. Initiation starts at the AUG, and since 220 isn’t an integral multiple of 3 (the size of amino acid codons), translation continues around the circle in another reading frame until it gets to one of the UGAs (termination codons) in UGAUGA, either the first three nucleotides or the last three. Termination codons can be ignored (leaky codons) to obtain larger read through proteins. So this virusoid is a circular RNA with no NONcoding sequences which codes for a protein in either 2 or 3 of the 3 possible reading frames. Notice that UGAUGA contains UGA in both of the alternate reading frames! So it is likely that the same nucleotide is being read 2 or 3 ways. Amazing!!!
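For readers who like to see the bookkeeping, here is a minimal Python sketch of a ribosome walking around a 220 nucleotide circle. The sequence is NOT the real virusoid: it is a made-up circle consisting of the UGAUGA motif followed by stop-free filler, and the function name `translate_circle` is my own. It only illustrates how a circle whose length is not a multiple of 3 shifts the reading frame on every lap.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def translate_circle(rna, start, readthrough=0, max_codons=1000):
    """Count codons read from `start` around a circular RNA before termination.

    `readthrough` is the number of stop codons the ribosome may ignore
    (a leaky stop is counted as one more residue added to the chain).
    """
    n = len(rna)
    pos, count = start, 0
    while count < max_codons:
        # Read three nucleotides, wrapping around the end of the circle.
        codon = "".join(rna[(pos + i) % n] for i in range(3))
        if codon in STOP_CODONS:
            if readthrough == 0:
                return count
            readthrough -= 1
        count += 1
        pos = (pos + 3) % n
    return count  # no termination within max_codons

# Hypothetical 220-nt circle: the UGAUGA motif plus filler with no stop codons.
circle = "UGAUGA" + "C" * 214
start = circle.index("AUG")          # the AUG inside UGAUGA

linear_limit = len(circle) // 3      # 73 codons if the RNA were read once
residues = translate_circle(circle, start)
leaky = translate_circle(circle, start, readthrough=1)
print(linear_limit, residues, leaky)
```

In this toy circle the ribosome goes around twice, in two different frames, before it meets the first UGA in frame: 146 residues rather than 73. At the round figure of 100 Daltons per amino acid used above that is about 14.6 kiloDaltons; at the commonly quoted average of about 110 Daltons it is about 16 kiloDaltons. Ignoring the first UGA once carries the ribosome only one codon further, to the second UGA.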

It isn’t clear what function the virusoid protein performs for the virus when the virus has infected a cell. Perhaps there isn’t any, and the only function of the protein is to help the virusoid continue existence inside the virus.

Talk about information density. The RYMV virusoid is the Bach Fugue of the genome. Bach sometimes inverts the fugue theme, and sometimes plays it backwards (a musical palindrome if you will).

It is unfortunate that more people don’t understand the details of molecular biology so they can appreciate mechanisms of this elegance. Whether you think understanding it is an esthetic experience, is up to you. I do. To me, this resembles the esthetic experience that mathematics offers.

A while back I wrote a post, wondering if the USA was acquiring brains from the MidEast upheavals, the way we did from Europe because of WWII. Here’s the link https://luysii.wordpress.com/2014/09/28/maryam-mirzakhani/.

Clearly Canada has done just that. Here are the authors of the PNAS paper above and their affiliations. Way to go Canada !

Mounir Georges AbouHaidar (a)
Srividhya Venkataraman (a)
Ashkan Golshani (b)
Bolin Liu (a)
Tauqeer Ahmad (a)

(a) Department of Cell and Systems Biology, University of Toronto, Toronto, ON, Canada M5S 3B2
(b) Biology Department, Carleton University, Ottawa, ON, Canada K1S 5B6

A primer on prions

Actually Kurt Vonnegut came up with the basic idea behind prions in his 1963 novel “Cat’s Cradle”. Instead of proteins, it involved a form of water (Ice-9) which had never been seen before, but one which was solid at room temperature. Unfortunately, it also solidified all liquid water it came in contact with, effectively ending life on earth.

Now for some history.

The first Xray crystallographic structures of proteins were incredibly seductive intellectually, much as false color functional magnetic resonance (fMRI) images are today. It was hard not to think of them as the structure of the protein.

Nowadays we know that lots of proteins have at least one intrinsically disordered (translation: unstructured) segment of 30 amino acids or more. [ Nature vol. 411 pp. 151 - 153 '11 ] says 40%, and also that 25% of all human proteins are likely to be disordered (translation: unstructured) from end to end, based on a bioinformatics program.

I’ve always been amazed that any protein has only a few shapes, purely on the basis of the chemistry — read this if you have the time — http://luysii.wordpress.com/2010/08/04/why-should-a-protein-have-just-one-shape-or-any-shape-for-that-matter/. Clearly the proteins making us up do have a relatively limited number of shapes (or we’d all be dead).

The possible universe of proteins from which our proteins are selected is enormously large. In fact the whole earth doesn’t have enough mass (even if it were made entirely of hydrogen, carbon, nitrogen, oxygen and sulfur) to make just one copy of each of the 20^100 possible proteins of length 100. For the calculation please see http://luysii.wordpress.com/2009/12/20/how-many-proteins-can-be-made-using-the-entire-earth-mass-to-do-so/ if you have the time.
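A back-of-the-envelope version of that calculation can be sketched in a few lines of Python. The assumptions are mine: 100 Daltons per amino acid (the same round figure used elsewhere in these posts), the standard value of the Dalton in kilograms, and an earth mass of about 5.97 x 10^24 kilograms. The linked post may have done the arithmetic differently.

```python
DALTON_KG = 1.66054e-27        # one Dalton in kilograms
EARTH_MASS_KG = 5.972e24       # approximate mass of the earth

n_sequences = 20 ** 100                        # every protein of length 100
mass_per_protein_kg = 100 * 100 * DALTON_KG    # 100 residues x ~100 Da each (assumed)

total_kg = n_sequences * mass_per_protein_kg   # one copy of each sequence
ratio = total_kg / EARTH_MASS_KG               # how many earths that would take

print(f"{n_sequences:.2e} sequences, {total_kg:.2e} kg, {ratio:.1e} earth masses")
```

Under these assumptions one copy of each sequence would weigh more than 10^82 earths, so the exact mass per amino acid hardly matters.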

So, even though it is a meaningful question philosophically just how common proteins with a few shapes are in this universe, we’ll never be able to carry out the experiment. Popper would say it’s a scientifically meaningless question, because it can’t be experimentally decided. Bertrand Russell would not.

Again, if you have time, take a look at http://luysii.wordpress.com/2010/08/08/a-chemical-gedanken-experiment/

Which, at long last, brings us to prions.

They were first discovered in yeast, and were extremely hard to figure out as they represented something in the cytoplasm which contained no DNA and yet which was heritable. The first prion was discovered nearly 50 years ago. It was called [PSI+] and it produced a lot of new proteins in yeast containing it (which is how its effects were measured). Mating [PSI+] with [psi-] (i.e. yeast cells without [PSI+]) converted the [psi-] to [PSI+]. It couldn’t be mapped to any known genetic element. Also [PSI+] was lost at a higher rate than would be expected for a DNA mutation. The first clue that [PSI+] was a protein was that it was lost faster when yeast were grown in the presence of protein denaturants (such as guanidine).

It turned out that [PSI+] was an aggregated form of the Sup35 protein, which basically functioned to keep the ribosome from reading through the stop codon. If you need background on what was just said please see https://luysii.wordpress.com/2010/07/07/molecular-biology-survival-guide-for-chemists-i-dna-and-protein-coding-gene-structure/ and the subsequent 4 posts. This is why [PSI+] yeast produced longer proteins. Things began to get exciting when Sup35 was dissected so domains could be found which induced [PSI+] formation. Amazingly these domains spontaneously formed visible fibers in vitro resembling amyloid in some respects (binding the dye Congo Red for one). Then they found that preformed fibers greatly accelerated fiber formation by unpolymerized Sup35. Beginning to sound a bit like Ice-9, doesn’t it? Yeasts have many other prions, but the best studied and most informative is the one formed from Sup35.

So that’s how prions were found (in yeast) and what they are: an aggregated form of a given protein in a slightly different shape, which can cause another molecule of the same protein to adopt the prion protein’s new shape. Amazingly, we have prions within us. But that’s the subject of the next post.

Molecular Biology Survival Guide for Chemists — V: The Ribosome

The ribosome is where the rubber meets the road (in the protein-centric view of the cell). It is a monstrously large molecular machine 200 – 300 Angstroms in diameter. Remember that the diameter of the double helix is only 20 Angstroms. It takes messenger RNA (mRNA) and, using it as a code, translates the sequence of nucleotides into a sequence of amino acids (i.e. a protein). Get a copy of the 16 December ’11 issue of Science, and stare at the cover for a while. It’s a picture of the eukaryotic (yeast) ribosome in all its glory. The details are to be found in [ Science vol. 334 pp. 1524 - 1539 '11 ]. If you have an issue hanging around, also look at pp. 1509 – 1510, as some ribosomal background is required before a post on that subject.

The article gives the structure of the Saccharomyces cerevisiae ribosome at 3 Angstroms resolution.  Quite a feat.  It comes in two parts, a large subunit which sediments at 60 Svedberg units, and a ‘small’ one at 40S.

The large subunit contains 3 RNA molecules and 46 proteins, the small one contains 1 RNA and 33 proteins. Total molecular mass is around 2.5 megadaltons. It’s maddening, but I can’t seem to find out just how many nucleotides our ribosomal RNAs (rRNAs) contain in toto. It is well over 5,000 however. So the number of atoms in the RNAs alone is over 200,000. There must be many more atoms than that contained in the associated proteins, as the phosphates have a mass of 98, the ribose 115, the pyrimidines around 100. So they don’t account for more than 40% of the total ribosomal mass. If anyone can give me exact numbers, I’ll update this.

The actual catalysis is not accomplished by the 79 proteins, but by the RNAs themselves.  This is thought to be a living relic of an RNA world where life actually began.  The proteins are mostly found on the surface of the ribosome.

There are a gigantic number of things to say about the ribosome, but I’m just going to put in the facts needed so pure chemist types can read other posts. This post will be expanded as necessary when further background is needed.

Amino acids are linked together by the beast at a rate of only 2 – 6 per second. This is OK as the average cell has over 10 million ribosomes (neurons probably have more). The article above notes that most of the changes between the ribosome of bacteria and that of nucleated organisms (eukaryotes) make our ribosomes bigger. The proteins are bigger, the rRNAs are longer.

The actual synthesis of proteins takes place deep in the center of the ribosome, where the two subunits come together. How does the protein get out? It is extruded (like sausage) through the exit tunnel, which is 100 Angstroms long in the E. coli ribosome, where its diameter varies between 10 and 20 Angstroms. Since the alpha helix is 11 Angstroms wide, this means that little if any other secondary structure (beta turns, beta sheets) and no tertiary structure at all can form within it. It’s probably longer (and possibly wider) in our ribosomes.

The tail of RNA polymerase II and the limits of chemical explanation

When I study math books, I’m always amazed at how much the reader is expected to internalize and retain.  A theorem proved 100 pages or so ago is referred to in the course of a proof without further ado.  The pure chemist reading this longest of posts, with minimal exposure to modern molecular biology, may feel the same way.  You’ll need all 4 articles of https://luysii.wordpress.com/category/molecular-biology-survival-guide/, and all 6 articles of https://luysii.wordpress.com/category/the-cell-nucleus-on-a-human-scale/ at your fingertips to get through this one.  The stuff is at my mental fingertips because I’ve been learning and thinking about it for decades.  Perhaps mathematicians are the same way, or perhaps they really are smarter than everyone else.

The article assumes you have a solid chemical background.  I find it somewhat sad that only a chemist with a decent molecular biological background can fully understand the elegance and beauty of what is to follow. I hope this post and the 10 above provide enough background for what is to follow.

Recall that eukaryotic RNA polymerase II (pol II) is really a complex of 12 distinct proteins in man with a total mass of 550 kiloDaltons. The RPB1 subunit is the largest of the 12 and contains a truly fascinating carboxy terminal domain (CTD), to be discussed in some detail later in this post. The function of pol II is transcription of a protein coding gene into messenger RNA (mRNA). Pol II binds to DNA upstream of (5′ to) the DNA which actually codes for the amino acids making up the protein. Just binding there (this site is called the promoter) is far from enough for gene transcription to actually begin. 5 general transcription factors (pol II transcription factors B, D, E, F, H, aka TFIIB, etc.) are required. All 5 general transcription factors are actually multiprotein complexes. Then there is the mediator complex, a complex of more than 20 proteins which allows communication between transcriptional activators (enhancers) and repressors found elsewhere in the DNA. So the whole gemish contains 60 proteins with a mass of 3,500,000 Daltons. The heaviest atom in all this is phosphorus (atomic mass 31), so this means at least 100,000 atoms are involved. Have a look at Science vol. 288 pp. 632 – 633, 640 – 649 ’00. It’s old but good and written by Kornberg fils who won his Nobel for this work.

I’ve mentioned some of the processing that goes on after the section of the DNA actually coding for amino acids is transcribed into RNA (splicing, the polyA tail, etc. etc.).  There is also some modification of the 5′ end of the RNA (called the cap), requiring a variety of binding proteins and enzymes to occur.

Just binding to the promoter, separating the two strands of DNA and starting to copy (transcribe) one of them into RNA is not enough.  This happens all the time, but after making  RNAs 5 – 10 nucleotides long, pol II pauses, releases the RNA just made and pops back to the promoter (which it really never left).  The other proteins of the 3.5 megaDalton initiation complex hold onto pol II keeping it there.

Here is where the carboxy terminal domain of the largest subunit of pol II comes in. It is a fascinating structure, which can only be completely understood by the chemist. It is made of 52 imperfect repeats of 7 amino acids. Here is the consensus repeat (listed from the amino terminal end to the carboxy terminal end, as protein sequences are always presented).

Tyrosine Serine Proline Threonine Serine Proline Serine

What should strike the biochemically oriented chemist is that the 3 (out of 20) amino acids with hydroxyl groups account for 5/7ths of the structure. This means that all five of them can be phosphorylated. The two prolines are hardly dull, because they make it impossible for classic alpha helices to form (sometimes they are called helix breakers). The OH groups mean that the heptad is quite hydrophilic. Phosphorylation of any two OHs of the heptad means that the chain will be pretty much straight out due to charge-charge repulsion. The number of distinct phosphorylated states of even one heptad is 2^5 = 32; that for the whole CTD is 32^52.
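The counting can be checked by brute force. The Python sketch below enumerates the on/off phosphorylation patterns of the consensus heptad and then raises that count to the 52nd power. It assumes, as the count above does, that every repeat matches the consensus and that each hydroxyl is either phosphorylated or not. The variable names are mine.

```python
from itertools import product

# Consensus heptad, amino terminal to carboxy terminal.
HEPTAD = ["Tyr", "Ser", "Pro", "Thr", "Ser", "Pro", "Ser"]
HYDROXYL = {"Tyr", "Ser", "Thr"}   # residues with a phosphorylatable OH

# Positions in the heptad that carry a hydroxyl group.
sites = [i for i, aa in enumerate(HEPTAD) if aa in HYDROXYL]

# Every on/off pattern across those sites is one phosphorylation state.
states = list(product([False, True], repeat=len(sites)))

heptad_states = len(states)        # 2^5
ctd_states = heptad_states ** 52   # 32^52 = 2^260

print(len(sites), heptad_states, f"{ctd_states:.2e}")
```

32^52 is 2^260, a number with 79 digits. The real CTD repeats are imperfect, so the true figure differs, but the order of magnitude makes the point.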

Chemists more familiar with biochemistry know that phosphorylation and dephosphorylation of serine, threonine and tyrosine is extensively used by the cell to control protein/protein interactions. That’s why our genome codes for 518 different protein kinases (which esterify hydroxyls with phosphate despite the rather weird name) and 137 phosphatases.

So the phosphorylation state (how much, which ones) of the carboxy terminal domain determines which proteins bind to it. Here is where the fun begins.

Just to give a glimpse of what is going on in our cells all the time, here are the gory details of formation of the cap at the 5′ end of mRNA. You don’t have to read the details between the asterisks to follow the rest of the post.

***

[ Proc. Natl. Acad. Sci. vol. 86 pp. 5795 - 5799 '89 ] All cellular cytoplasmic mRNAs have a 7 methyl guanylate cap attached to their 5′ ends. The cap structure is added early during the transcription of mRNA by RNA polymerase II in the nucleus (after the first 25 nucleotides of a given mRNA are formed).
Three enzymes are involved in mRNA cap formation:
(1) an RNA triphosphatase which cleaves the 5′ triphosphate terminus of the primary transcript to a 5′ diphosphate terminated RNA
(2) a guanylyltransferase, which caps the structure with GMP, forming a 5′ – 5′ linkage
(3) a methyl transferase which adds a methyl group to the nitrogen at position #7 of guanine (see the structure of 7 methyl guanosine).
The cap structure can then be further methylated by a fourth enzyme, a ribose 2′-O methyltransferase.
*** 

The 3 capping enzymes bind to the phosphorylated carboxy terminal domain of pol II, so they can grab the newly formed 5′ end of the mRNA as it emerges from a tunnel in pol II.  Not only that, but the enzymes bind to a specific pattern of phosphorylation of the tail (namely serine #5 by a kinase called Cdk7).

An intricate mechanism exists to stop transcription from proceeding too far, so the 5′ end of the emerging RNA is properly processed. During the formation of the transcription initiation complex (or soon after initiation) DRB sensitivity inducing factor (DSIF) is recruited to the transcription complex (by binding to the CTD). Additionally, after initiation of transcription, the negative elongation factor (NELF) is recruited through interaction with DSIF. This results in the arrest of the transcription complex before it enters into productive elongation. DSIF/NELF mediated arrest is then relieved by means of phosphorylation of the carboxy terminal domain on serine #2 by positive transcription elongation factor b (P-TEFb) and the transcription complex resumes elongation. This causes DSIF and NELF (both are proteins) to drop away from the CTD.

Even so, pol II is still linked to the initiation complex at the promoter. How does it get started again and move away from the promoter? The process is called promoter clearance or promoter escape. Another phosphorylation of the CTD is involved, this time on serine #5 by a kinase called Cdk7, which is found in one of the general transcription factor complexes (TFIIH).

Eventually a whole bunch of proteins (called the super elongation complex) binds to the CTD allowing not just escape, but movement down DNA. The complex includes the P-TEFb, ELL2, AFF4, AFF1, ENL and AF9 proteins. So now pol II is chugging down DNA adding a new base every 50 milliseconds or so. A whole other group of kinases modifies the CTD so different proteins can bind to it after the terminal codon is reached and finish processing the mRNA. I’m going to skip this as you have the general idea, but rest assured it is just as complicated as putting on the 5′ cap described above.
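The logic of the last few paragraphs (which phosphates are on the tail decides who is bound to it) can be caricatured as a lookup. This is a cartoon, not a model: the function name `bound_factors` and the labels "Ser5-P" and "Ser2-P" are my own, the real CTD has 52 heptads each with its own state, and only the interactions described above are included.

```python
def bound_factors(marks):
    """Return the factors bound to the CTD for a given set of phosphorylation marks.

    Encodes only the rules described in the text:
    - phosphoserine #5 (put on by Cdk7 in TFIIH) recruits the capping enzymes
    - DSIF (and NELF through DSIF) stay bound until serine #2 is
      phosphorylated by P-TEFb, after which they drop away
    """
    factors = set()
    if "Ser5-P" in marks:
        factors.add("capping enzymes")
    if "Ser2-P" not in marks:
        factors.update({"DSIF", "NELF"})
    return factors

# Walk through the three stages described above.
paused = bound_factors(set())
capping = bound_factors({"Ser5-P"})
elongating = bound_factors({"Ser5-P", "Ser2-P"})
print(paused, capping, elongating)
```

Even this two-mark cartoon has several distinct states. With five hydroxyls per heptad and 52 heptads, the real tail has room for a far richer code.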

Now for the exquisite mechanisms described in Proc. Natl. Acad. Sci. vol. 108 pp. 14717 – 14718 ’11. In the previous post at https://luysii.wordpress.com/2011/09/18/the-cell-and-its-nucleus-on-a-human-scale-vi-untwisting-the-linguini/ I wondered how the large pol II enzyme transcribes DNA wound twice around the nucleosome (I really haven’t found an answer that satisfies me). Work has shown that pol II slows down when it reaches a nucleosome (it incorporates fewer nucleotides into the growing mRNA per second).

“95% of human multiexon protein coding genes are alternatively spliced” [ Nature vol. 465 pp. 16 - 17 '10 ] So how is the decision made between two alternative exons by the splicing machinery? It turns out that pol II is involved here as well. There is no logical reason it has to be. The whole mRNA could be formed by the polymerase and then it could move elsewhere in the nucleus to the splicing machinery. But in this one well studied case, alternative splicing occurs as pol II is transcribing one particular gene (which is mutated in type I neurofibromatosis).

Now for a side trip to neurology. There is an awful disease called paraneoplastic encephalomyelitis. The brain is subject to an immune attack in some patients with cancer (and in some it can be the first symptom) with resultant dementia, convulsions, incoordination and death. For years we wondered what the immune system was attacking. Now we know it is any of three proteins (HuB, HuC, HuD) found only in the brain. They bind to messenger RNA. Why the immune system sometimes chooses them for attack and how cancer sometimes triggers this isn’t known for sure. One of the theories is that the cancer cells produce something that immunologically looks like the Hu proteins, which the immune system regards as foreign. Fortunately it is fairly rare, but I did see a few cases.

Also recall that the nucleosome is only the first stage of the 100,000 fold compaction of DNA required to fit it into the nucleus. The higher order arrangement of nucleosomes has been the subject of decades of intense study which unfortunately hasn’t reached a conclusion, but there is no question that nucleosomes are close together in the nucleus, whether or not the 30 nanometer fiber (packing 6 or so nucleosomes per level of the fiber) actually exists.

So the 3 Hu’s are yet another set of proteins binding to the carboxy terminal domain (CTD) of the large subunit of pol II. So what? They interact with histone deacetylase 2 (HDAC2) which removes the acetyl group from the epsilon amino group of lysine, changing an amide to an amine and so increasing the positive charge on the nitrogen. This has the effect of compacting DNA as the protonated amine can then bind to the zillions of negatively charged phosphates of the DNA backbone. Here’s another place where you simply must know chemistry to understand what’s going on.

So a protein bound to the CTD of pol II recruits another protein which chemically modifies another protein around which DNA is wrapped.  This has the remarkable effect of directly linking the epigenetic machinery to the transcription machinery.  Epigenetics had been thought of as determining which proteins were made in a given cell (e.g. an on/off effect) rather than how they were spliced.

How does this work? The theory is advanced that certain splicing signals are stronger than others. This means if the transcription machinery is slowed down (say by more chromosome compaction), it will have a chance to splice at the weaker splicing signal.

Things are even more complicated.  Back in the day, newsreels were shown before movies (rather than the hideous trailers of today). They sometimes amused American audiences by showing sped up films of crazed foreigners playing the sport of curling — see http://en.wikipedia.org/wiki/Curling.  A (very heavy) stone is essentially slid on ice toward a target.  In front of the stone are two guys sweeping furiously, to alter the surface of the ice, so the stone lands where they want it to.  With sped up film, they look like idiots.

The PNAS article proposes that something like that happens during transcription. Preceding the pol II complex are enzymes called histone acetyl transferases (HATs), the yang to the yin of the HDACs. They acetylate the epsilon amino group of lysines on the histones making up the nucleosomes (making it harder for lysine to bind to the phosphates of DNA). This presumably opens up compacted DNA, letting pol II (which is pretty large itself at 5 x 5 x 7 nanometers) get through the chromatin, easing transcription. These are the sweepers of curling. Then along comes pol II. Near the end of its run along the gene, it recruits Hu proteins which recruit HDAC2 which closes up chromatin again.

Elegant yes?  Incredible, no?

Hopefully, a few readers have actually made it this far.  For questions, critiques, ambiguities, errors of fact, etc. etc., just post a comment.

Now for some philosophy. You can’t really understand any of this without knowing a fair amount of organic chemistry and some protein chemistry as well.   Chemistry explains how all this happens.  It is totally useless in explaining why.  As soon as you ask just what the CTD, the Hu proteins, HDACs, HATs, pol II or anything else in the cell are for, you are in the land of Aristotle, where everything had an innate purpose and function.  You have crossed the Cartesian divide between the physical and the world of ideas, a place where chemistry can no longer help you.

Still, it is a magnificent thing to have the background to contemplate all this. Even so, I’m sure our knowledge is far from complete. No one said it better than Pascal: “Man is but a reed, the most feeble thing in nature, but he is a thinking reed.”

Molecular Biology Survival Guide for Chemists – IV Epigenetics

Cells in the body look incredibly different. Here are some pictures of a type of neuron found in the brain — the Purkinje cell http://www.google.com/images?client=safari&rls=en&q=purkinje+cell&ie=UTF-8&oe=UTF-8&oi=image_result_group&sa=X and several views of some liver cells — http://www.google.com/images?client=safari&rls=en&q=hepatocyte&ie=UTF-8&oe=UTF-8&oi=image_result_group&sa=X.

How can this be?   Back in the day, long before the human genome project, Dolly, and induced Pluripotent Stem Cells (iPSCs), I wondered if they didn’t simply jettison the parts of their genome they didn’t need.   This turned out to be true in a very limited sense for cells making antibodies, but the 3 billion or so basepairs of the Purkinje cell and the liver cell (hepatocyte) are identical.

We know that different cells express different parts of their genome.  Proteins are the best and longest studied genome products, but there is good evidence the microRNAs and other ‘noncoding’ (for protein) parts of the genome are also differentially expressed.   Large parts of the genome are essentially locked up by a set of proteins binding to specific regions of the genome.  This occurs on such a large scale that it was visible at the light microscopic level using dyes to stain it.   The locked up portion is called heterochromatin, the unlocked part is called euchromatin.  All very nice, but this just moves the problem one step back.  Why do different cells have different distributions of heterochromatin?

For any sort of explanation we must turn to epigenetics.  These are changes in DNA and/or the proteins which wrap it up so it fits in the cell which are not inherited from parent to child but which are inherited in some way from parent cell to child cell.  Just how this is done isn’t known completely, but there is no question that it occurs.

Presently we know of essentially two types of epigenetic change — modification of the DNA nucleotides themselves and modification of histones.

The easiest to understand is modification of cytosine, one of the 4 bases making up DNA.  A methyl group can be placed at the 5 position of cytosine — http://ndbserver.rutgers.edu/education/education_RNA.html.  Note that the 5 position is on the opposite side of cytosine from the side involved in base pairing.  This means that proteins binding to the double helix can get at it.  Since it adds bulk and mass to the outside of the DNA it also means that other proteins normally binding to cytosine can’t get at it.  In general, methylated cytosine in front of (5′ to) a protein coding gene, means it won’t be made into its gene product (a protein).

We have 3 enzymes which put methyl groups on cytosine at the 5 position (DNMT1, 3a, 3b), and we even know how this is passed on from parent cell to daughter cell, when DNA is copied and copies given to the two daughter cells, but that would take us too far afield. The more introspective among you may wonder why all cells don’t have identical methylation patterns if this is so, and how differential cytosine methylation gets established. The short answer is, we don’t really know.

There is another cytosine modification which is quite similar: hydroxymethyl cytosine. We don’t know much, but it appears to be important for genome stability and a variety of other effects: X chromosome inactivation, imprinting and repression of repetitive genomic sequences (important as they constitute well over half of our genome).

The really big epigenetic effects are chemical modifications of histone proteins. For a description of what they are see https://luysii.wordpress.com/2010/04/16/the-cell-nucleus-and-its-dna-on-a-human-scale-iv/. Actually there is a series of 5 posts on just how crowded and complicated things are in the nucleus. For all 5 (probably soon to be 6) posts see https://luysii.wordpress.com/category/the-cell-nucleus-on-a-human-scale/.

The skinny about them is that they are a way to compact DNA so it fits into the nucleus. Our genome with its 3 billion basepairs actually has a length of 1 meter! The thickness of the aromatic rings of the nucleotides is 3.4 Angstroms (3.4 x 10^-10 meters). It must be mushed down to 10 microns (or 10^-5 meters). 8 histones form a squat cylinder called a nucleosome with a diameter of 110 Angstroms and height of 60 Angstroms. 147 nucleotides wrap around the nucleosome twice. The net effect is to shorten the overall length of DNA. This results in DNA compaction by a factor of 10.
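The arithmetic behind the 1 meter figure, and the compaction it implies, takes only a few lines of Python. The numbers are the ones given above; the variable names are mine.

```python
BASE_PAIRS = 3e9          # basepairs in the genome
RISE_M = 3.4e-10          # 3.4 Angstroms per stacked basepair, in meters
NUCLEUS_M = 1e-5          # 10 microns, in meters

length_m = BASE_PAIRS * RISE_M        # stretched-out length of the DNA
fold = length_m / NUCLEUS_M           # linear compaction needed to fit

print(f"{length_m:.2f} m of DNA, about {fold:,.0f}-fold compaction")
```

So the stretched-out DNA is just over a meter long, and squeezing it into 10 microns takes roughly a 100,000 fold compaction. The factor of 10 from wrapping around nucleosomes is only the first step.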

Histones are quite basic with lots of arginines and lysines which are positively charged at physiologic pH allowing them to counteract the negative phosphates holding DNA together.  They also have lots of serines and threonines in the parts of the nucleosome.

So what?  Histone proteins have been shown to have a large number of different chemical modifications.  Some are methylation. Lysine can have from 1 to 3 methyl groups, arginine 2.  One of the most important modifications is acetylation of the lysine amino group converting an amine to an amide with subsequent loss of basicity and its positive charge, so that it binds to the negatively charged phosphates less well.  This would loosen up DNA nucleosome binding, making DNA more accessible for transcription by the humungously sized RNA polymerase II complex.  For details, see The cell nucleus and its DNA on a human scale – VI — not written as of 15 Sep ’11 (but hopefully soon).

At least 10 different chemical modifications of histones are known. The first 3 have been mentioned above. The serines and threonines can be phosphorylated. Some are ubiquitinated, some are sumoylated, some ADP ribosylated. Then there is cis trans proline isomerization, deimination (e.g. arginine to citrulline) and NEDDylation.

The real complexity comes with just the first 3 (at least these are the best studied ones). You will see the term ‘histone tail’ used to refer to the amino terminal or carboxy terminal ends protruding from the nucleosome cylinder, and not involved in binding DNA. This is where the modifications take place, and the combinatorial possibilities are enormous. Of the amino terminal 36 amino acids of histone H3, half are modifiable (arg, lys, ser and thr). Statistically it should be 20%, since these are 4 of the 20 amino acids.

Again, so what? Phosphorylation is used all over the cell to determine which proteins bind and where they bind. The proteins binding to the nucleosomes determine what can or can’t be done with DNA. This has nothing to do with the nucleotide sequence of DNA, so it is epigenesis par excellence.

I leave it to your imagination (or your research project) how these changes are inherited when DNA is duplicated prior to mitosis. Somehow they are, but the mechanisms are unclear.  Somehow the 8 histones of the nucleosome must be disassembled, 8 more with similar modifications produced, to produce a nucleosome for each newly replicated DNA strand and its partner.

That’s plenty to digest for chemists unfamiliar with the material.  Persevere and you will be exposed to some incredibly elegant chemical and molecular biological gymnastics as the cell goes about its business.

Molecular Biology survival guide for Chemists — III: Codons, synonymous and not

Chemistry wouldn’t be what it is without quantum mechanics.  No, I’m not talking about solving the Schrodinger equation, or the approximations we must use for any minimally complicated molecule.  The fact that the energy levels of each element are quantized, means that each element acts exactly the same way, so the carbon atom at the edge of the universe has exactly the  same energy levels as the carbon atoms in the 10 billion bacteria in each gram of the stuff sitting in your colon.

What about codons?

Each of the amino acids found in proteins is one of 20 possibilities, each position of DNA (a nucleotide) is one of four possibilities, so 2 consecutive nucleotides aren’t enough (16 possibilities) while 3 are too many (44 too many in fact). Each of the 64 possible combinations of 4 nucleotides taken 3 at a time is called a codon.  3 of the 64 don’t code for an amino acid at all — they are (inappropriately) called nonsense codons.  Their function, however, is vital.  They tell the cellular machinery making a protein (e.g. the ribosome) to stop adding amino acids to the chain.  41 extra codons is a lot of redundancy, so that some amino acids (leucine for example) have 6 different codons which code for them — the 6 are called synonymous codons. Other amino acids (methionine) have just one codon for them.  Each choice of 3 nucleotides (a codon) codes for one and only one amino acid.
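For those who like to check arithmetic by machine, the counting above takes only a few lines of Python. This is a sketch, not anything official: the 64 letter string is the standard genetic code written in the conventional T, C, A, G base order (DNA letters, so read U for T in RNA), and the variable names are my own.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"  # conventional ordering for the standard genetic code table
# One letter per codon; '*' marks a stop codon (standard genetic code).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
codon_table = dict(zip(codons, AMINO_ACIDS))

stop_codons = [c for c, aa in codon_table.items() if aa == "*"]
synonym_counts = Counter(aa for aa in codon_table.values() if aa != "*")

print(len(BASES) ** 2)                    # doublets: too few for 20 amino acids
print(len(codons))                        # triplets
print(len(codons) - len(synonym_counts))  # triplets beyond the 20 needed
print(sorted(stop_codons))                # the three stop codons
print(synonym_counts["L"], synonym_counts["M"])  # leucine vs methionine
```

Subtracting the 3 stop codons and the 20 amino acids from 64 leaves the 41 redundant codons mentioned above.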

Codons are therefore either synonymous or nonsynonymous.  So changing one nucleotide for another in a codon may lead to a change in the amino acid it was coding for, or it may not.  If it doesn’t, the thinking until a few years ago was that natural selection shouldn’t care, as the amino acid sequence of the protein remained unchanged (and proteins were thought to be the only thing DNA codes for back then).  Since changing one synonymous codon to another (say by mutation) doesn’t change the protein made, these were called neutral mutations.
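To make the distinction concrete, here is a rough sketch that takes every codon specifying an amino acid, changes one nucleotide at a time, and sorts the result into synonymous, nonsynonymous, or a change to a stop codon. It assumes the standard genetic code and, quite unrealistically, that every base change is equally likely; the function name classify is my own invention.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify(codon, position, new_base):
    """Label one single-nucleotide change as synonymous, nonsynonymous or stop."""
    mutant = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutant]
    if after == before:
        return "synonymous"
    return "stop" if after == "*" else "nonsynonymous"

tally = {"synonymous": 0, "nonsynonymous": 0, "stop": 0}
for codon, aa in CODON_TABLE.items():
    if aa == "*":
        continue  # only start from codons that specify an amino acid
    for position in range(3):
        for new_base in BASES:
            if new_base != codon[position]:
                tally[classify(codon, position, new_base)] += 1

print(tally)
```

The split it prints depends entirely on the equal-likelihood assumption; real mutation is biased, so treat the numbers as a baseline rather than a measurement.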

Much evolutionary hay was made using these concepts.  People attempted to measure the rate of natural selection acting on proteins using synonymous and nonsynonymous codons in the same protein in different organisms (hemoglobin for example).  Positive selection is measured as the rate of nonsynonymous nucleotide substitution (Ka) per nonsynonymous site, relative to the underlying ‘neutral mutation’ rate given by the rate of synonymous substitution per synonymous site (Ks).  Usually Ka is much less than Ks (as most new mutations aren’t helpful or are actually harmful; this is negative selection).  Positive selection is implied by a Ka/Ks ratio greater than 1.  However, strictly by chance the ratio of nonsynonymous to synonymous nucleotide substitutions is somewhere between 2:1 and 3:1, depending on the mutation model assumed.
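The ratio itself is simple division. The sketch below assumes someone has already counted the substitutions and the sites for you; the counts are made up for illustration, and real methods add corrections (for multiple changes at the same site, for example) that I am ignoring here.

```python
def ka_ks(nonsyn_subs, nonsyn_sites, syn_subs, syn_sites):
    """Return (Ka, Ks, Ka/Ks) from raw substitution and site counts."""
    ka = nonsyn_subs / nonsyn_sites
    ks = syn_subs / syn_sites
    return ka, ks, ka / ks

# Hypothetical counts for one gene compared between two species.
ka, ks, ratio = ka_ks(nonsyn_subs=6, nonsyn_sites=300, syn_subs=20, syn_sites=100)

if ratio > 1:
    verdict = "positive selection"
elif ratio < 1:
    verdict = "negative selection"
else:
    verdict = "no net selection detected"

print(f"Ka={ka:.3f} Ks={ks:.3f} Ka/Ks={ratio:.2f} -> {verdict}")
```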

However, there are several very well documented examples of synonymous codons acting very differently.  That’s for the next post.

One last technical point.  Each of the 61 codons that specify an amino acid has a transfer RNA (tRNA) associated with it, along with an enzyme (tRNA synthase, aka tRNA synthetase) which takes one specific amino acid, and plunks it onto the tRNA specific for a particular codon.  The possibilities for error are enormous.  Just look how close chemically and structurally serine and threonine are, or phenylalanine and tyrosine, or glutamic and aspartic acid.  tRNA synthases contain proofreading capacity to make sure that the right amino acid gets linked to the right tRNA.  The error rate is impressively low: a mistake in selecting the amino acid occurs once in every 10,000 to 100,000 tries, and a mistake in the selection of the tRNA occurs once in every 1,000,000 [ Cell vol. 103 pp. 877 - 894 '00 ].  Remember the synthetase has to grab the correct tRNA and the correct amino acid and then stitch them together.  It is thought that the error rate between synthase and tRNA is so low because both the enzyme and the tRNA molecules are large, allowing a large number of contacts to be formed (correctly) between the two of them, providing a lot of ways to detect a mismatch.
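To get a feel for what those error rates buy, here is a back of the envelope sketch. It assumes each amino acid choice is an independent event, and the 400 residue protein is a made-up size of mine, not a figure from the paper.

```python
def chance_error_free(length, error_rate):
    """Probability that `length` independent amino acid choices are all correct."""
    return (1.0 - error_rate) ** length

PROTEIN_LENGTH = 400  # hypothetical protein size, not from the text
for error_rate in (1 / 10_000, 1 / 100_000):
    p = chance_error_free(PROTEIN_LENGTH, error_rate)
    print(f"error rate {error_rate:.0e}: {p:.4f} of chains have no wrong amino acid")
```

Try a longer chain and the case for proofreading becomes obvious.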

Well, that’s the background.  Now to see what nature (or something) has made of all this.

Molecular Biology survival guide for Chemists – II: What DNA is transcribed into

We have 3 RNA polymerases which transcribe DNA into RNA.  Transcription starts at the 3′  end of one of the members of the DNA helix and proceeds toward the 5′ end.  However the RNA produced starts at the 5′ end and proceeds toward the 3′ end.  Why transcribe you might ask?  Because the chemical language is the same — DNA and RNA are both polynucleotides.  The Guanine in DNA codes for Cytosine in RNA, etc. etc.
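The directionality is easier to see in code than in words. In this small sketch the template strand is written 3′ to 5′, the way the polymerase reads it, so the RNA comes out 5′ to 3′ simply by pairing base for base. The template sequence is made up.

```python
# Pairing rules for transcription: DNA template base -> RNA base.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """Build the RNA (written 5'->3') from a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)

template = "TACGGCATT"  # hypothetical template strand, written 3'->5'
print(transcribe(template))
```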

RNA polymerase I (Pol I to you) transcribes the genes for the RNA found in the ribosome (ribosomal RNA also known as rRNA), RNA polymerase II (Pol II) transcribes the genes for proteins into messenger RNA (mRNA), while RNA polymerase III (Pol III) transcribes the genes for transfer RNA (tRNA) and a lot more. Med students love mnemonics, so here’s one — I makes rRNA, II makes mRNA, III makes tRNA — so the polymerases and the products are in (semi) alphabetical order.

The ribosome is an incredible molecular machine: it contains several RNAs (called rRNAs) containing in total about 4,500 nucleotides and about 50 proteins.  The molecular mass is about 2,500,000 Daltons.  Its job, and its only job as far as we know, is to translate the mRNA into protein.  Why translate? Because polynucleotides and proteins are chemically quite different. So information is being translated from one language to another.  Transfer RNAs (tRNAs) are involved. Each different tRNA brings just one specific amino acid to the ribosome, which then stitches the amino acid to the growing protein.  Since we have 64 possible codons (that’s 4^3), 61 of which code for amino acids, we have an abundance of tRNA genes in our DNA, well over 400.

Now it’s time to speak of mRNA or, actually, pre-mRNA.  The previous post noted that most genes come in pieces, parts coding for amino acids (called exons) and parts between the exons, called the introns.  Pol II knows nothing of them, just as the CPU knows nothing of the series of bits it is fed in a program.  It just starts transcribing DNA at a certain point, making mRNA willy nilly, intron and exon, and finally quitting.

As mentioned in the previous post, dystrophin has over 2 million nucleotides in its DNA, all of which are transcribed into RNA.  The part of the RNA actually coding for amino acids is under 15,000 nucleotides long, so all the introns must be spliced out.  This is the function of the spliceosome, another huge molecular machine. It contains 5 RNAs (called small nuclear RNAs, aka snRNAs), along with 50 or so proteins with a total molecular mass again of around 2,500,000 Daltons.  Splicing out introns is a tricky process which is still being worked on.  Mistakes are easy to make, and different tissues will splice the same pre-mRNA in different ways.  All this happens in the nucleus before the mRNA is shipped outside where the ribosome can get at it.
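Stripped of all the chemistry, splicing is cutting and pasting. Here is a toy sketch, assuming the exon boundaries are already known (which is precisely the hard part for the cell). The sequence and coordinates are invented; exons are in upper case and introns in lower case only so you can see them.

```python
def splice(pre_mrna, exons):
    """Join exon segments, given as (start, end) half-open index pairs."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Hypothetical toy transcript: exons in upper case, introns in lower case.
pre_mrna = "AUGGCC" + "guaaguuuuuag" + "AAAUUU" + "guccccccag" + "GGGUAA"
exons = [(0, 6), (18, 24), (34, 40)]

mature = splice(pre_mrna, exons)
print(mature)
print(f"{len(mature) / len(pre_mrna):.0%} of the transcript kept")
```

Passing a different list of exons to the same function is a crude picture of how different tissues can make different mRNAs from one pre-mRNA.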

There are some incredible fail safe mechanisms here.  The spliceosome associates a few proteins with the spliced together exon/exon junction, so that if and when the mRNA is read (translated) by the ribosome, if a termination codon occurs too early in the gene, truncating the protein prematurely, a process called nonsense mediated decay destroys the defective mRNA.

The mature mRNA just before it is ready to leave the nucleus has several parts.  From the 5′ end it has a bunch of nucleotides prior to the first codon for the protein (always an AUG which codes for methionine). This is called the 5′ UnTranslated Region (5′ UTR).  U, by the way, stands for Uridine which is the nucleotide in RNA corresponding to thymine in DNA.  Then there is the protein coding part, then there is the 3′ part which is not translated into protein (called the 3′ UnTranslated Region, 3′ UTR).  When Pol II is finished transcribing the gene, a long stretch of adenines (polyAdenine aka polyA) is added somewhere in the 3′ UTR.  It is added about 30 nucleotides downstream (3′ to) an AAUAAA sequence found in the 3′ UTRs of most protein coding genes.  There are some 20 – 260 adenines in a row in the polyA tract.  Addition is important, as polyA protects the mRNA from degradation: very few things in the cell hang around forever.  Each time the ribosome translates the mRNA into protein some adenines are lost, so for those of you familiar with computer programming, you can regard the polyA as a loop counter.
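The layout just described can be sketched as a little parser. It assumes, as above, that the first AUG is the start codon and that the coding part ends at the first stop codon in the same reading frame. The message is a toy of my own making, and the spacing between AAUAAA and the polyA tail is not to scale.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def split_mrna(mrna):
    """Return (5' UTR, coding region, 3' UTR) using the first AUG as the start."""
    start = mrna.find("AUG")
    if start == -1:
        raise ValueError("no AUG start codon found")
    # Walk codon by codon, staying in the reading frame set by the AUG.
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3  # include the stop codon in the coding region
            return mrna[:start], mrna[start:end], mrna[end:]
    raise ValueError("no in-frame stop codon found")

# Hypothetical toy message; real UTRs and tails are far longer.
mrna = "GGCACC" + "AUGGCUAAAUGA" + "CCUAAUAAAGGCU" + "A" * 20
utr5, cds, utr3 = split_mrna(mrna)

print(utr5, cds, utr3, sep="\n")
print("AAUAAA signal at 3' UTR index", utr3.find("AAUAAA"))
```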

The 3′ UTR also contains sites where yet another type of RNA (called microRNA) binds.  Genes for microRNA  are also transcribed by Pol II.  Their precursor (pre-microRNA) is then extensively processed (I’ll spare you the gory details)  to form mature microRNAs, which, as the name implies, are rather short — only 20 – 22 nucleotides.  MicroRNAs represent one of the many forms of control on the amount of a given protein that a cell contains. They basepair with complementary sequences in the 3′ UTR of mRNAs and either (1) inhibit protein synthesis of the mRNA by the ribosome or (2) cause degradation of the mRNA.  It’s important to note that a given microRNA can control the levels of many different proteins, if the complementary region is present in their 3′ UTRs.  Also the 3′ UTR of a given mRNA can have regions complementary to many different microRNAs.
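Here is a deliberately oversimplified sketch of how one might look for a microRNA binding site in a 3′ UTR. It assumes perfect Watson-Crick pairing over whatever stretch you ask for; the match_length parameter, both function names and both sequences are hypothetical, and real microRNA pairing is a good deal sloppier than this.

```python
PAIR = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Sequence that base-pairs with `rna`, written 5'->3'."""
    return "".join(PAIR[base] for base in reversed(rna))

def find_sites(utr, micro_rna, match_length=None):
    """Start indices in `utr` that pair perfectly with the 5' end of `micro_rna`.

    `match_length` is how many microRNA bases must pair (default: all of them).
    """
    if match_length is None:
        match_length = len(micro_rna)
    target = reverse_complement(micro_rna[:match_length])
    return [i for i in range(len(utr) - len(target) + 1)
            if utr[i:i + len(target)] == target]

micro_rna = "ACGUUGCAAGGCUUAACCGA"  # hypothetical 20-mer
utr = "GGAUCC" + reverse_complement(micro_rna) + "CCAAUAAAGG"  # hypothetical 3' UTR

print(find_sites(utr, micro_rna))
print(find_sites(utr, micro_rna, match_length=8))
```

Run the same microRNA against several UTRs, or several microRNAs against one UTR, and you have the many-to-many control described above.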

That’s quite a bit to throw at you.  I’ve omitted a lot of the complexity, to make the goings on as simple and clear as possible.  Hopefully, I haven’t violated Einstein’s dictum “Everything should be made as simple as possible, but not simpler”.  I think what I’ve said is quite accurate, but comments and corrections are always welcome.

The more I know about the goings on inside our cells, the more impressed I become, and the greater the leap of faith I must make to accept that this all arose by chance.

Molecular Biology survival guide for Chemists – I: DNA and protein coding gene structure

You can’t really understand molecular biology without knowing what the major players (DNA, RNA, protein) look like.  Perhaps not their detailed chemistry, but certainly their chemical structure.  I’ll assume that you know what a protein is, and what the double helix of DNA (or a DNA RNA or RNA RNA double helix) looks like.  The (so far incomplete) series The Cell Nucleus on a Human Scale http://luysii.wordpress.com/category/the-cell-nucleus-on-a-human-scale/ attempts to describe them physically (e.g. how they all fit into a nucleus).  For great pictures I suggest the third edition of my friends’ book “Biochemistry” by Don and Judy Voet.  I realize that I’m hopelessly last century by doing so, but Voet and Voet is definitely the most chemically oriented of all the biochemistry texts out there. Get the third edition, not the dumbed down one (Fundamentals of Biochemistry); if you’re real chemists you should want the hardcore stuff, anyway.  If any of the web mavens out there have favorite sites, post a comment.

Even knowing the chemistry and their structures isn’t enough.  For the cell to function, DNA, RNA and proteins must functionally interact, and the functionally significant parts have names, as well as the processes linking them together.   You may know what B and Z DNA look like, but do you know what a promoter, an enhancer, microRNA, transcription, translation, introns, exons, intronic and exonic enhancers and repressors, and the mediator complex are?  My guess is that most organic chemists don’t.  This is the place where all this stuff will be explained (or at least defined).  This post will be a work in progress, and added to as other posts need some of the background here.

So let’s start with DNA.  It has two chains of nucleotides (Adenine, Cytosine, Guanine and Thymine) each attached to a five carbon sugar (deoxyribose) by basically an acetal.  The sugars are linked together by a phosphate group forming an ester with the 3′ hydroxyl group of one sugar and the 5′ hydroxyl of the next.  This means that each chain has a definite direction (the 5′ end is different than the 3′ end).  When the nucleotides pair (A with T and G with C) on different chains, the chains run in opposite directions, so one blunt end of a DNA helix has a free 3′ hydroxyl on one chain and a free 5′ hydroxyl on the other chain; the other end of the helix is the same way.

So far pretty simple (to the chemist anyway).  I’ll assume you understand the hydrogen bonding between A and T, G and C permitting the existence of the double helix and the way proteins and enzymes can bind to specific sequences of nucleotides.

The fun begins when DNA is transcribed into another chain of nucleotides (polynucleotides), either DNA or RNA.  Distinguish transcription from translation (which will be defined later).  DNA transcription into more DNA (properly called replication) is accomplished by one of several (man has at least 16) enzymes called DNA polymerase, incredibly complicated machines.  DNA is transcribed into RNA by one of 3 RNA polymerases.  The polymerase which transcribes DNA into the RNA which codes for protein (messenger RNA, aka mRNA) is RNA polymerase II, usually abbreviated Pol II.

For most of the past 50 years, it was thought that just about all the genes we had were those coding for protein.  We know better now, but the genes in DNA coding for proteins are the best understood by far (because the most work has been done on them).

So what does the gene for a protein look like? I’ll assume you know that any group of 3 nucleotides (triplet) is known as a codon, and that each codon codes for an amino acid (except the 3 codons which don’t, and which have been called nonsense codons, a pejorative name if there ever was one, stop codon is better).  All proteins begin with methionine, for which the codon is AUG (in RNA language, Uridine (U) is the stand in for thymine).  Then there follows amino acid after amino acid (in the DNA codon after codon) until we get to the stop codon.  So that’s the end of the gene for a protein.  Correct?

Wrong, very wrong.  If only it were that simple. The stretch of DNA coding for a protein, doesn’t all code for amino acids.  It is interrupted by stretches of nucleotides called introns, which don’t code for protein at all.   Why introns are there at all (bacteria mostly don’t have them) is a mystery.  Some think that they are there as a mechanism of diversity generation for natural selection to work on.  Theories abound. The part of a protein gene actually coding for amino acids is called an exon.  I’ll talk more about getting rid of introns in the next post in the series which will concern mRNA.

As a former director of a muscular dystrophy clinic, I still find this amazing.  The gene defective in one of the most common forms of muscular dystrophy (Duchenne) is called dystrophin.  The protein is large (3685 amino acids), so the gene should have at least 3 * 3685 = 11,055 nucleotides.  It has far more, 2,220,233 nucleotides to be exact.  The parts of the gene coding for amino acids are split into 79 exons. How the 2,220,233 nucleotides are transcribed into RNA over and over in all of us without fault is remarkable.  It’s miraculous that we’re not all in wheelchairs.
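The disproportion is worth putting into numbers. A quick sketch using only the figures above (it ignores the stop codon and everything else in the transcript that isn't an amino acid codon):

```python
AMINO_ACIDS = 3685        # length of dystrophin, from the text
GENE_LENGTH = 2_220_233   # nucleotides in the gene, from the text
EXON_COUNT = 79           # number of exons, from the text

coding_nucleotides = 3 * AMINO_ACIDS
coding_fraction = coding_nucleotides / GENE_LENGTH

print(coding_nucleotides)
print(f"{coding_fraction:.2%} of the gene codes for amino acids")
print(f"about {coding_nucleotides / EXON_COUNT:.0f} coding nucleotides per exon on average")
```

Well under 1% of what Pol II transcribes here ends up specifying an amino acid.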

But protein coding genes don’t begin with AUG (the initiator codon) and end in a stop codon.  Pol II binds to a stretch of DNA 5′ to the AUG codon, called the promoter. Now here is where chemistry begins to be not enough for a full understanding of molecular biology. The promoter is defined as what Pol II binds to.  Certain combinations of nucleotides are found in promoters, but there is no rigid code for them (in the way that AUG always codes for methionine).  The promoter is 5′ to (upstream from) the actual transcription start site, which is 3′ to (downstream from) the promoter but upstream from the AUG.

I should mention that in a protein the amino acid with the free amino group is called the amino terminal (N-terminal) amino acid, and that with the free carboxyl group is the carboxy terminal (C-terminal) amino acid.  Proteins are always made starting at the amino terminal end, with amide bonds formed between the carboxyl group of the existing peptide and the amino group of the incoming amino acid.  Proteins are numbered from the amino terminal amino acid (it’s #1) to the carboxy terminal amino acid.  So proteins are made amino terminal to carboxy terminal and the product of DNA transcription is made 5′ to 3′.

It’s not enough to have pol II bound to the promoter, it needs help.  This is where enhancers come in.  They are functionally defined stretches of DNA to which specific enhancer proteins bind which then loop over to the promoter and help pol II bind and start transcribing DNA into RNA.  Enhancers can be found THOUSANDS of nucleotides 5′ to the promoter.  The specific combinations of proteins binding to enhancers, is what allows different cells to make different proteins.

When pol II gets to the stop codon, it doesn’t stop, but continues on making RNA copies, sometimes for thousands of nucleotides until it quits (sometimes it needs help to do so).  So that’s the structure of  DNA coding for protein.  Some parts can be defined chemically, others only functionally.

Next up, the structure of the RNA transcript of a protein coding gene (pre-mRNA)

The next article in the series — https://luysii.wordpress.com/2010/07/11/molecular-biology-survival-guide-for-chemists-ii-what-dna-is-transcribed-into/
