Category Archives: Molecular Biology Survival Guide

Molecular Biology survival guide for Chemists — III: Codons, synonymous and not

Chemistry wouldn’t be what it is without quantum mechanics. No, I’m not talking about solving the Schrödinger equation, or the approximations we must use for any minimally complicated molecule. The fact that the energy levels of each element are quantized means that every atom of an element acts exactly the same way, so the carbon atom at the edge of the universe has exactly the same energy levels as the carbon atoms in the 10 billion bacteria in each gram of the stuff sitting in your colon.

What about codons?

Each amino acid found in proteins is one of 20 possibilities, while each position of DNA (a nucleotide) is one of four, so 2 consecutive nucleotides aren’t enough (16 possibilities) while 3 are too many (44 too many in fact). Each of the 64 possible combinations of 4 nucleotides taken 3 at a time is called a codon. 3 of the 64 don’t code for an amino acid at all; they are (inappropriately) called nonsense codons. Their function, however, is vital. They tell the cellular machinery making a protein (e.g. the ribosome) to stop adding amino acids to the chain. That leaves 61 codons for 20 amino acids, and 41 extra codons is a lot of redundancy, so some amino acids (leucine for example) have 6 different codons which code for them; the 6 are called synonymous codons. Other amino acids (methionine) have just one codon. Apart from the 3 nonsense codons, each choice of 3 nucleotides (a codon) codes for one and only one amino acid.
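The counting above is easy to check. Here is a minimal sketch that builds the standard genetic code as a lookup table and tallies it. The table is written in DNA letters (T rather than U), and the variable names are mine, chosen for illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed with the third base varying fastest
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a nonsense (stop) codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

counts = Counter(CODON_TABLE.values())
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
n_amino_acids = len(counts) - 1  # do not count '*' as an amino acid
extra = len(CODON_TABLE) - len(stop_codons) - n_amino_acids

print(len(BASES) ** 2, len(BASES) ** 3)  # 16 64
print(stop_codons)                       # ['TAA', 'TAG', 'TGA']
print(counts["L"], counts["M"])          # 6 1  (leucine, methionine)
print(extra)                             # 41
```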

Codons are therefore either synonymous or nonsynonymous. So changing one nucleotide for another in a codon may lead to a change in the amino acid it was coding for, or it may not. If it doesn’t, the thinking until a few years ago was that natural selection shouldn’t care, as the amino acid sequence of the protein remained unchanged (and proteins were thought to be the only thing DNA codes for back then). Since changing one synonymous codon to another (say by mutation) doesn’t change the protein made, these were called neutral mutations.

Much evolutionary hay was made using these concepts. People attempted to measure the rate of natural selection acting on proteins using synonymous and nonsynonymous codons in the same protein in different organisms (hemoglobin for example). Positive selection is measured as the rate of nonsynonymous nucleotide substitution per nonsynonymous site (Ka), relative to the underlying ‘neutral mutation’ rate given by the rate of synonymous substitution per synonymous site (Ks). Usually Ka is much less than Ks (as most new mutations aren’t helpful or are actually harmful; this is negative selection). Positive selection is implied by a Ka/Ks ratio greater than 1. However, strictly by chance the ratio of nonsynonymous (Ka) to synonymous (Ks) nucleotide substitutions is 2:1.
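To get a feel for the ‘strictly by chance’ baseline, here is a rough sketch that takes each of the 61 amino acid-coding codons, makes every possible single-nucleotide change, and classifies the result. It assumes all substitutions are equally likely, which real mutation does not obey, so treat the output as a crude baseline only. Under that assumption the ratio comes out nearer 3:1 than the 2:1 quoted above; a bias toward some kinds of substitution over others would shift it.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, third base varying fastest; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

synonymous = missense = nonsense = 0
for codon, aa in CODON_TABLE.items():
    if aa == "*":
        continue  # start only from codons that code for an amino acid
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue  # not a change
            mutant = codon[:pos] + base + codon[pos + 1:]
            new_aa = CODON_TABLE[mutant]
            if new_aa == aa:
                synonymous += 1
            elif new_aa == "*":
                nonsense += 1   # the change creates a stop codon
            else:
                missense += 1   # the change alters the amino acid

total = synonymous + missense + nonsense
print(synonymous, missense, nonsense, total)  # 134 392 23 549
print(round(missense / synonymous, 2))        # 2.93
```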

However, there are several very well documented examples of synonymous codons acting very differently.  That’s for the next post.

One last technical point. Each of the 64 possible codons has a transfer RNA (tRNA) associated with it, along with an enzyme (tRNA synthase, aka tRNA synthetase) which takes one specific amino acid and plunks it onto the tRNA specific for a particular codon. The possibilities for error are enormous. Just look how close chemically and structurally serine and threonine are, or phenylalanine and tyrosine, or glutamic and aspartic acid. tRNA synthases contain proofreading capacity to make sure that the right amino acid gets linked to the right tRNA. The error rate is impressively low: a mistake in selecting the amino acid occurs once in every 10,000 to 100,000 selections, and a mistake in selecting the tRNA occurs about once in every 1,000,000 [ Cell vol. 103 pp. 877 – 894 ’00 ]. Remember the synthetase has to grab the correct tRNA and the correct amino acid and then stitch them together. It is thought that the error rate between synthase and tRNA is so low because both the enzyme and the tRNA molecules are large, allowing a large number of contacts to be formed (correctly) between the two of them, providing a lot of ways to detect a mismatch.
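To see what those error rates mean for a whole protein, here is a back-of-the-envelope sketch. It assumes each amino acid selection is an independent event and uses a hypothetical protein of 400 amino acids; both are my assumptions for illustration, not figures from the paper cited.

```python
def error_free_fraction(length, error_rate):
    """Probability that none of `length` independent amino acid choices is wrong."""
    return (1.0 - error_rate) ** length

PROTEIN_LENGTH = 400  # hypothetical protein length, chosen for illustration

# The two ends of the amino acid selection error range quoted above.
for rate in (1e-4, 1e-5):
    fraction = error_free_fraction(PROTEIN_LENGTH, rate)
    print(f"error rate {rate:g}: {fraction:.3f} of chains have no wrong amino acid")
```

Even at the worse end of the range, the great majority of chains of this length would come out without a single misloaded amino acid.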

Well, that’s the background.  Now to see what nature (or something) has made of all this.

Here’s the next article in the series

Molecular Biology survival guide for Chemists – II: What DNA is transcribed into

We have 3 RNA polymerases which transcribe DNA into RNA. Transcription starts at the 3′ end of one of the members of the DNA helix and proceeds toward the 5′ end. However, the RNA produced starts at the 5′ end and proceeds toward the 3′ end. Why transcribe, you might ask? Because the chemical language is the same: DNA and RNA are both polynucleotides. The Guanine in DNA codes for Cytosine in RNA, etc. etc.
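The directionality is easier to see in a small sketch. The template strand is written 3′ to 5′, the way the polymerase reads it, so the RNA comes out already written 5′ to 3′. The nine-nucleotide template is made up for illustration.

```python
# Pairing rules for copying a DNA template into RNA (U replaces T in RNA).
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5):
    """Return the RNA (written 5'->3') made from a template strand written 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)

template = "TACGGCATT"  # hypothetical template strand, written 3'->5'
rna = transcribe(template)
print("template 3'-" + template + "-5'")
print("RNA      5'-" + rna + "-3'")  # AUGCCGUAA
```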

RNA polymerase I (Pol I to you) transcribes the genes for the RNA found in the ribosome (ribosomal RNA also known as rRNA), RNA polymerase II (Pol II) transcribes the genes for proteins into messenger RNA (mRNA), while RNA polymerase III (Pol III) transcribes the genes for transfer RNA (tRNA) and a lot more. Med students love mnemonics, so here’s one — I makes rRNA, II makes mRNA, III makes tRNA — so the polymerases and the products are in (semi) alphabetical order.

The ribosome is an incredible molecular machine. It contains several RNAs (called rRNAs) containing in total about 4,500 nucleotides, and about 50 proteins. The molecular mass is about 2,500,000 Daltons. Its job, and its only job as far as we know, is to translate the mRNA into protein. Why translate? Because polynucleotides and proteins are chemically quite different. So information is being translated from one language to another. Transfer RNAs (tRNAs) are involved. Each different tRNA brings just one specific amino acid to the ribosome, which then stitches the amino acid to the growing protein. Since we have 64 possible codons for amino acids (that’s 4^3), we have an abundance of tRNA genes in our DNA, well over 400.
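Stripped of all the machinery, translation is a table lookup, three nucleotides at a time. Here is a minimal sketch using the standard genetic code in RNA letters. It assumes translation begins at the first AUG and runs to the first in-frame stop codon; the function name and the toy mRNA are mine.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, third base varying fastest; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so nothing is made
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: the chain is finished
        protein.append(amino_acid)
    return "".join(protein)

print(translate("GGAUGUUUCUGUGGUAAGC"))  # MFLW
```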

Now it’s time to speak of mRNA or, actually, pre-mRNA. The previous post noted that most genes come in pieces, parts coding for amino acids (called exons) and parts between the exons, called the introns. Pol II knows nothing of them, just as the CPU knows nothing of the series of bits it is fed in a program. It just starts transcribing DNA at a certain point, making mRNA willy nilly, intron and exon, and finally quitting.

As mentioned in the previous post, dystrophin has over 2 million nucleotides in its DNA, all of which are transcribed into RNA. The parts of the RNA actually coding for amino acids are under 15,000 nucleotides long in total, so all the introns must be spliced out. This is the function of the spliceosome, another huge molecular machine. It contains 5 RNAs (called small nuclear RNAs, aka snRNAs), along with 50 or so proteins, with a total molecular mass again of around 2,500,000 Daltons. Splicing out introns is a tricky process which is still being worked on. Mistakes are easy to make, and different tissues will splice the same pre-mRNA in different ways. All this happens in the nucleus before the mRNA is shipped outside where the ribosome can get at it.
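As a toy picture of what splicing accomplishes (not how the spliceosome does it), here is a sketch that joins exons given their coordinates. The sequence, the single intron and the coordinates are all invented for illustration.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of a pre-mRNA.

    `exons` is a list of (start, end) positions, end exclusive,
    listed in 5'->3' order.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)

# exon 1, a hypothetical 12-nucleotide intron, exon 2
pre_mrna = "AUGGCU" + "GUAAGUUUUCAG" + "UGGUAA"
exons = [(0, 6), (18, 24)]

mature = splice(pre_mrna, exons)
print(mature)                       # AUGGCUUGGUAA
print(len(pre_mrna) - len(mature))  # 12 nucleotides of intron removed
```

Handing the same pre-mRNA a different exon list is the toy equivalent of different tissues splicing it in different ways.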

There are some incredible fail safe mechanisms here.  The spliceosome associates a few proteins with the spliced together exon/exon junction, so that if and when the mRNA is read (translated) by the ribosome, if a termination codon occurs too early in the gene, truncating the protein prematurely, a process called nonsense mediated decay destroys the defective mRNA.

The mature mRNA just before it is ready to leave the nucleus has several parts. From the 5′ end it has a bunch of nucleotides prior to the first codon for the protein (always an AUG which codes for methionine). This is called the 5′ UnTranslated Region (5′ UTR). The U in AUG, by the way, stands for Uridine, which is the nucleotide in RNA corresponding to thymine in DNA. Then there is the protein coding part, then there is the 3′ part which is not translated into protein (called the 3′ UnTranslated Region, 3′ UTR). When Pol II is finished transcribing the gene, a long stretch of adenines (polyAdenine aka polyA) is added somewhere in the 3′ UTR. It is added about 30 nucleotides downstream of (3′ to) an AAUAAA sequence found in the 3′ UTRs of most protein coding genes. There are some 20 – 260 adenines in a row in the polyA tract. Addition is important, as polyA protects the mRNA from degradation; very few things in the cell hang around forever. Each time the ribosome translates the mRNA into protein some adenines are lost, so for those of you familiar with computer programming, you can regard the polyA as a loop counter.
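The layout just described can be sketched as a function that cuts an mRNA into its three regions. It assumes the coding part starts at the first AUG and ends at the first in-frame stop codon, and the toy sequence is made up and far shorter than any real mRNA (so the 30-nucleotide spacing to the polyA is not reproduced).

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}

def split_mrna(mrna):
    """Split an mRNA (written 5'->3') into 5' UTR, coding part and 3' UTR.

    Assumes translation starts at the first AUG, a simplification.
    """
    start = mrna.find("AUG")
    if start == -1:
        raise ValueError("no AUG start codon found")
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3  # keep the stop codon with the coding part
            return mrna[:start], mrna[start:end], mrna[end:]
    raise ValueError("no in-frame stop codon found")

# hypothetical 5' UTR + coding part + 3' UTR + polyA tract
toy = "GGCACC" + "AUGGCUUGGUAA" + "CCGAAUAAAGCU" + "A" * 20
five_utr, coding, three_utr = split_mrna(toy)
signal_at = three_utr.find("AAUAAA")

print(five_utr)   # GGCACC
print(coding)     # AUGGCUUGGUAA
print(signal_at)  # 3 (position of AAUAAA within the 3' UTR)
```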

The 3′ UTR also contains sites where yet another type of RNA (called microRNA) binds.  Genes for microRNA  are also transcribed by Pol II.  Their precursor (pre-microRNA) is then extensively processed (I’ll spare you the gory details)  to form mature microRNAs, which, as the name implies, are rather short — only 20 – 22 nucleotides.  MicroRNAs represent one of the many forms of control on the amount of a given protein that a cell contains. They basepair with complementary sequences in the 3′ UTR of mRNAs and either (1) inhibit protein synthesis of the mRNA by the ribosome or (2) cause degradation of the mRNA.  It’s important to note that a given microRNA can control the levels of many different proteins, if the complementary region is present in their 3′ UTRs.  Also the 3′ UTR of a given mRNA can have regions complementary to many different microRNAs.
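The many-to-many relationship between microRNAs and 3′ UTRs can be sketched as a search for complementary stretches. The sketch below demands perfect base pairing with a short stretch of the microRNA, which is a simplification of my own; the sequences and names are invented for illustration.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Return the sequence that base pairs with `rna`, written 5'->3'."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))

def find_sites(mirna_fragment, utr):
    """Return every position in `utr` perfectly complementary to the fragment."""
    target = reverse_complement(mirna_fragment)
    return [
        i for i in range(len(utr) - len(target) + 1)
        if utr.startswith(target, i)
    ]

mirna_fragment = "GAGGUAG"  # hypothetical 7-nucleotide stretch of a microRNA
utrs = {
    "mRNA_1": "CCCUACCUCAAGGCUACCUCUU",  # made-up 3' UTRs
    "mRNA_2": "AAGGCUAGCCUUAGG",
}

for name, utr in utrs.items():
    print(name, find_sites(mirna_fragment, utr))
# mRNA_1 [2, 13]
# mRNA_2 []
```

Running several fragments against several UTRs gives the picture in the text: one microRNA can hit many mRNAs, and one 3′ UTR can carry sites for many microRNAs.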

That’s quite a bit to throw at you.  I’ve omitted a lot of the complexity, to make the goings on as simple and clear as possible.  Hopefully, I haven’t violated Einstein’s dictum “Everything should be made as simple as possible, but not simpler”.  I think what I’ve said is quite accurate, but comments and corrections are always welcome.

The more I know about the goings on inside our cells, the more impressed I become, and the greater the leap of faith I must make to accept that this all arose by chance.


The next article in the series —

Molecular Biology survival guide for Chemists – I: DNA and protein coding gene structure

You can’t really understand molecular biology without knowing what the major players (DNA, RNA, protein) look like. Perhaps not their detailed chemistry, but certainly their chemical structure. I’ll assume that you know what a protein is, and what the double helix of DNA (or a DNA RNA or RNA RNA double helix) looks like. The (so far incomplete) series The Cell Nucleus on Human Scale attempts to describe them physically (e.g. how they all fit into a nucleus). For great pictures I suggest the third edition of my friends’ book “Biochemistry” by Don and Judy Voet. I realize that I’m hopelessly last century by doing so, but Voet and Voet is definitely the most chemically oriented of all the biochemistry texts out there. Get the third edition, not the dumbed down one (Fundamentals of Biochemistry); if you’re real chemists you should want the hardcore stuff, anyway. If any of the web mavens out there have favorite sites, post a comment.

Even knowing the chemistry and their structures isn’t enough.  For the cell to function, DNA, RNA and proteins must functionally interact, and the functionally significant parts have names, as well as the processes linking them together.   You may know what B and Z DNA look like, but do you know what a promoter, an enhancer, microRNA, transcription, translation, introns, exons, intronic and exonic enhancers and repressors, and the mediator complex are?  My guess is that most organic chemists don’t.  This is the place where all this stuff will be explained (or at least defined).  This post will be a work in progress, and added to as other posts need some of the background here.

So let’s start with DNA. It has two chains of nucleotides (Adenine, Cytosine, Guanine and Thymine), each attached to a five carbon sugar (deoxyribose) by basically an acetal. The sugars are linked together by a phosphate group forming an ester with the 3′ hydroxyl group of one sugar and the 5′ hydroxyl of the next. This means that each chain has a definite direction (the 5′ end is different than the 3′ end). When the nucleotides pair (A with T and G with C) on different chains, the chains run in opposite directions, so one blunt end of a DNA helix has a free 3′ hydroxyl on one chain and a free 5′ hydroxyl on the other chain, and the other end of the helix is the same way.
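The opposite directions of the two chains are easy to lose track of, so here is a small sketch that builds the partner of a strand and prints the two aligned. The nine-nucleotide strand is made up for illustration.

```python
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def partner_strand(strand_5_to_3):
    """Return the pairing strand, also written 5'->3'."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(strand_5_to_3))

top = "ATGCCGTAA"  # hypothetical strand, written 5'->3'
bottom = partner_strand(top)

print(bottom)  # TTACGGCAT
# Print the helix with the bottom strand reversed so the pairs line up.
print("5'-" + top + "-3'")
print("3'-" + bottom[::-1] + "-5'")
```

Each end of the printed helix shows a 5′ end on one chain sitting opposite a 3′ end on the other.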

So far pretty simple (to the chemist anyway).  I’ll assume you understand the hydrogen bonding between A and T, G and C permitting the existence of the double helix and the way proteins and enzymes can bind to specific sequences of nucleotides.

The fun begins when DNA is transcribed into another chain of nucleotides (polynucleotides), either DNA or RNA. Distinguish transcription from translation (which will be defined later). DNA transcription into more DNA (usually called replication) is accomplished by one of several (man has at least 16) enzymes called DNA polymerases, which are incredibly complicated machines. DNA is transcribed into RNA by one of 3 RNA polymerases. The polymerase which transcribes DNA into the RNA which codes for protein (messenger RNA, aka mRNA) is RNA polymerase II, usually abbreviated Pol II.

For most of the past 50 years, it was thought that just about all the genes we had were those coding for protein.  We know better now, but the genes in DNA coding for proteins are the best understood by far (because the most work has been done on them).

So what does the gene for a protein look like? I’ll assume you know that any group of 3 nucleotides (triplet) is known as a codon, and that each codon codes for an amino acid (except the 3 codons which don’t, and which have been called nonsense codons, a pejorative name if there ever was one, stop codon is better).  All proteins begin with methionine, for which the codon is AUG (in RNA language, Uridine (U) is the stand in for thymine).  Then there follows amino acid after amino acid (in the DNA codon after codon) until we get to the stop codon.  So that’s the end of the gene for a protein.  Correct?

Wrong, very wrong.  If only it were that simple. The stretch of DNA coding for a protein, doesn’t all code for amino acids.  It is interrupted by stretches of nucleotides called introns, which don’t code for protein at all.   Why introns are there at all (bacteria mostly don’t have them) is a mystery.  Some think that they are there as a mechanism of diversity generation for natural selection to work on.  Theories abound. The part of a protein gene actually coding for amino acids is called an exon.  I’ll talk more about getting rid of introns in the next post in the series which will concern mRNA.

As a former director of a muscular dystrophy clinic, I still find this amazing. The gene defective in one of the most common forms of muscular dystrophy (Duchenne) is called dystrophin. The protein is large (3685 amino acids), so the gene should have at least 3 * 3685 = 11,055 nucleotides. It has far more, 2,220,233 nucleotides to be exact. The parts of the gene coding for amino acids are split into 79 exons. How the 2,220,233 nucleotides are transcribed into RNA over and over in all of us without fault is remarkable. It’s miraculous that we’re not all in wheelchairs.
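The arithmetic is worth spelling out, since it shows how little of the gene actually codes for amino acids. This sketch uses only the two numbers given above and ignores the stop codon.

```python
PROTEIN_LENGTH = 3685    # amino acids in dystrophin, from the text
GENE_LENGTH = 2_220_233  # nucleotides in the gene, from the text

coding_nucleotides = 3 * PROTEIN_LENGTH  # one codon per amino acid
coding_fraction = coding_nucleotides / GENE_LENGTH

print(coding_nucleotides)        # 11055
print(f"{coding_fraction:.2%}")  # 0.50%
```

So roughly one nucleotide in 200 of the gene ends up specifying an amino acid.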

But protein coding genes don’t begin with AUG (the initiator codon) and end in a stop codon. Pol II binds to a stretch of DNA 5′ to the AUG codon, called the promoter. Now here is where chemistry begins to be not enough for a full understanding of molecular biology. The promoter is defined as what Pol II binds to. Certain combinations of nucleotides are found in promoters, but there is no rigid code for them (in the way that AUG always codes for methionine). The promoter is 5′ to (upstream from) the actual transcription start site, which is 3′ to (downstream from) the promoter but upstream from the AUG.

I should mention that in a protein the amino acid with the free amino group is called the amino terminal (N-terminal) amino acid, and that with the free carboxyl group is the carboxy terminal (C-terminal) amino acid. Proteins are always made starting at the amino terminal end, with amide bonds formed between the carboxyl group of the existing peptide and the amino group of the incoming amino acid. Proteins are numbered from the amino terminal amino acid (it’s #1) to the carboxy terminal amino acid. So proteins are made amino terminal to carboxy terminal and the product of DNA transcription is made 5′ to 3′.

It’s not enough to have pol II bound to the promoter; it needs help. This is where enhancers come in. They are functionally defined stretches of DNA to which specific enhancer proteins bind, which then loop over to the promoter and help pol II bind and start transcribing DNA into RNA. Enhancers can be found THOUSANDS of nucleotides 5′ to the promoter. The specific combination of proteins binding to enhancers is what allows different cells to make different proteins.

When pol II gets to the stop codon, it doesn’t stop, but continues on making RNA copies, sometimes for thousands of nucleotides until it quits (sometimes it needs help to do so).  So that’s the structure of  DNA coding for protein.  Some parts can be defined chemically, others only functionally.

Next up, the structure of the RNA transcript of a protein coding gene (pre-mRNA)

The next article in the series —