Molecular Biology survival guide for Chemists – I: DNA and protein coding gene structure

You can’t really understand molecular biology without knowing what the major players (DNA, RNA, protein) look like.  Perhaps not their detailed chemistry, but certainly their chemical structure.  I’ll assume that you know what a protein is, and what the double helix of DNA (or a DNA-RNA or RNA-RNA double helix) looks like.  The (so far incomplete) series The Cell Nucleus on  Human Scale attempts to describe them physically (e.g. how they all fit into a nucleus).  For great pictures I suggest the third edition of my friends’ book “Biochemistry” by Don and Judy Voet.  I realize that I’m hopelessly last century by doing so, but Voet and Voet is definitely the most chemically oriented of all the biochemistry texts out there.  Get the third edition, not the dumbed down one (Fundamentals of Biochemistry); if you’re real chemists you should want the hardcore stuff anyway.  If any of the web mavens out there have favorite sites, post a comment.

Even knowing the chemistry and their structures isn’t enough.  For the cell to function, DNA, RNA and proteins must interact, and the functionally significant parts have names, as do the processes linking them together.  You may know what B and Z DNA look like, but do you know what a promoter, an enhancer, microRNA, transcription, translation, introns, exons, intronic and exonic enhancers and repressors, and the mediator complex are?  My guess is that most organic chemists don’t.  This is the place where all this stuff will be explained (or at least defined).  This post will be a work in progress, and will be added to as other posts need some of the background here.

So let’s start with DNA.  It has two chains of nucleotides.  Each nucleotide contains one of four bases (adenine, cytosine, guanine or thymine) attached to a five carbon sugar (deoxyribose) by what is basically an N,O-acetal (the N-glycosidic bond to C1′ of the sugar).  The sugars are linked together by a phosphate group forming an ester with the 3′ hydroxyl group of one sugar and another with the 5′ hydroxyl of the next (a phosphodiester).  This means that each chain has a definite direction (the 5′ end is different from the 3′ end).  When the bases pair (A with T and G with C) on different chains, the chains run in opposite directions (they are antiparallel), so one blunt end of a DNA helix has the 3′ end of one chain and the 5′ end of the other chain; the other end of the helix is the same way.
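The antiparallel pairing described above can be sketched in a few lines of Python. This is only an illustration: the function name and the six-letter sequence are made up for the example, and both strands are written 5′ to 3′ by the usual convention.

```python
# Watson-Crick partners for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand: str) -> str:
    """Return the partner strand, also written 5'->3'.

    Because the two chains run in opposite directions, the partner is
    read off by walking the input backwards and swapping each base
    for its pairing partner.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))

top = "ATGGCC"                    # hypothetical sequence, 5'->3'
bottom = reverse_complement(top)  # its partner, also 5'->3'
print(bottom)                     # GGCCAT
```

Applying the function twice gives back the original strand, which is just another way of saying that each chain fully specifies the other.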

So far pretty simple (to the chemist anyway).  I’ll assume you understand the hydrogen bonding between A and T, G and C permitting the existence of the double helix and the way proteins and enzymes can bind to specific sequences of nucleotides.

The fun begins when DNA is copied into another chain of nucleotides (a polynucleotide), either DNA or RNA.  Copying DNA into more DNA is called replication, and is accomplished by one of several (man has at least 16) enzymes called DNA polymerases, which are incredibly complicated machines.  Copying DNA into RNA is called transcription; distinguish it from translation (which will be defined later).  In the nucleus, DNA is transcribed into RNA by one of 3 RNA polymerases.  The polymerase which transcribes DNA into the RNA which codes for protein (messenger RNA, aka mRNA) is RNA polymerase II, usually abbreviated Pol II.
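As a rough sketch of what copying DNA into RNA amounts to at the level of sequence (and nothing more, since the polymerase itself is left out entirely), here is a toy Python version. The function name and the template sequence are invented for the example; the only biology assumed is base pairing, with U in RNA where DNA has T, and the fact that the RNA is made 5′ to 3′ while the template is read 3′ to 5′.

```python
# Which RNA base pairs with each DNA template base.
RNA_PARTNER = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_5to3: str) -> str:
    """Return the RNA copied from a DNA template strand.

    The template is given 5'->3' (the usual way of writing it), but it is
    read 3'->5', so we walk the string backwards. The RNA comes out 5'->3'.
    """
    return "".join(RNA_PARTNER[base] for base in reversed(template_5to3))

template = "TTAGGCCAT"      # hypothetical template strand, 5'->3'
print(transcribe(template))  # AUGGCCUAA
```

Note that the RNA has the same sequence as the other (non-template) DNA strand, with U in place of T.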

For most of the past 50 years, it was thought that just about all the genes we had were those coding for protein.  We know better now, but the genes in DNA coding for proteins are the best understood by far (because the most work has been done on them).

So what does the gene for a protein look like?  I’ll assume you know that any group of 3 nucleotides (triplet) is known as a codon, and that each codon codes for an amino acid (except the 3 codons which don’t, and which have been called nonsense codons, a pejorative name if there ever was one; stop codon is better).  All proteins begin with methionine when first made, for which the codon is AUG (in RNA language, uracil (U) is the stand-in for thymine, so in the DNA it reads ATG).  Then there follows amino acid after amino acid (in the DNA, codon after codon) until we get to the stop codon.  So that’s the end of the gene for a protein.  Correct?
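The naive picture just described (start at AUG, read triplets, quit at a stop codon) is easy to write down in Python. Take this as a sketch under simplifying assumptions: it uses the standard genetic code, starts at the first AUG it finds (real start-site selection is more subtle), and the function name and test sequence are made up for the example.

```python
from itertools import product

# Standard genetic code, one-letter amino acids, '*' = stop.
# Codons are enumerated in UCAG order: UUU, UUC, UUA, UUG, UCU, ...
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no initiator codon, so no protein
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("GGAUGGCCAAAUAAGG"))  # MAK
```

The table has 64 entries, of which exactly 3 are stops (UAA, UAG, UGA), and only AUG codes for methionine.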

Wrong, very wrong.  If only it were that simple.  The stretch of DNA coding for a protein doesn’t all code for amino acids.  It is interrupted by stretches of nucleotides called introns, which don’t code for protein at all.  Why introns are there at all (bacteria mostly don’t have them) is a mystery.  Some think that they are there as a mechanism of diversity generation for natural selection to work on.  Theories abound.  A part of a protein gene actually coding for amino acids is called an exon.  I’ll talk more about getting rid of introns in the next post in the series, which will concern mRNA.
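To make the exon and intron bookkeeping concrete, here is a toy Python sketch. The sequences and coordinates are entirely made up, and this says nothing about how the cell actually removes introns; it only shows that the coding message is the exons joined together in order.

```python
# A made-up gene: two short exons interrupted by one intron.
exon1, intron, exon2 = "ATGGCC", "GTAAGTTTCAG", "AAATAA"
gene = exon1 + intron + exon2

# Exons recorded as (start, end) positions in the gene, end exclusive.
exons = [
    (0, len(exon1)),
    (len(exon1) + len(intron), len(gene)),
]

def join_exons(gene: str, exons: list[tuple[int, int]]) -> str:
    """Concatenate the exon segments in order, dropping everything between them."""
    return "".join(gene[start:end] for start, end in exons)

coding = join_exons(gene, exons)
print(coding)                  # ATGGCCAAATAA
print(len(gene), len(coding))  # 23 12
```

Even in this tiny example nearly half the gene is intron; in real genes the proportion can be far more lopsided.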

As a former director of a muscular dystrophy clinic, I still find this amazing.  The gene defective in one of the most common forms of muscular dystrophy (Duchenne) is called dystrophin.  The protein is large (3685 amino acids), so the gene should have at least 3 * 3685 = 11,055 nucleotides.  It has far more, 2,220,233 nucleotides to be exact.  The parts of the gene coding for amino acids are split into 79 exons.  How the 2,220,233 nucleotides are transcribed into RNA over and over in all of us without fault is remarkable.  It’s miraculous that we’re not all in wheelchairs.
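The dystrophin numbers above are worth running through explicitly. The Python below uses only the three figures quoted in the text; the variable names are mine, and the calculation ignores the stop codon and any non-coding parts of the exons.

```python
AMINO_ACIDS = 3685          # length of the dystrophin protein
GENE_LENGTH_NT = 2_220_233  # length of the dystrophin gene in nucleotides
EXON_COUNT = 79

coding_nt = 3 * AMINO_ACIDS                   # one codon (3 nt) per amino acid
coding_fraction = coding_nt / GENE_LENGTH_NT  # share of the gene that codes
mean_coding_per_exon = coding_nt / EXON_COUNT

print(coding_nt)                  # 11055
print(f"{coding_fraction:.2%}")   # 0.50%
print(round(mean_coding_per_exon))  # 140
```

So about half a percent of the gene codes for amino acids, in pieces averaging roughly 140 nucleotides each; the remaining 99.5% is transcribed and then discarded.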

But protein coding genes don’t simply begin with AUG (the initiator codon) and end in a stop codon.  Pol II binds to a stretch of DNA 5′ to the AUG codon, called the promoter.  Now here is where chemistry begins to be not enough for a full understanding of molecular biology.  The promoter is defined functionally, as what Pol II binds to.  Certain combinations of nucleotides are found in promoters, but there is no rigid code for them (the way AUG always codes for methionine).  The promoter is 5′ to (upstream from) the actual transcription start site, which in turn is upstream from the AUG.

I should mention that in a protein the amino acid with the free amino group is called the amino terminal (N-terminal) amino acid, and that with the free carboxyl group is the carboxy terminal (C-terminal) amino acid.  Proteins are always made starting at the amino terminal end, with amide bonds formed between the carboxyl group of the existing peptide and the amino group of the incoming amino acid.  Proteins are numbered from the amino terminal amino acid (it’s #1) to the carboxy terminal amino acid.  So proteins are made amino terminal to carboxy terminal and the product of DNA transcription is made 5′ to 3′.

It’s not enough to have pol II bound to the promoter; it needs help.  This is where enhancers come in.  They are functionally defined stretches of DNA to which specific proteins bind; the DNA then loops over so that these proteins reach the promoter and help pol II bind and start transcribing DNA into RNA.  Enhancers can be found THOUSANDS of nucleotides 5′ to the promoter.  The specific combination of proteins binding to enhancers is what allows different cells to make different proteins.

When pol II gets to the stop codon, it doesn’t stop, but continues on making RNA, sometimes for thousands of nucleotides, until it quits (sometimes it needs help to do so).  So that’s the structure of DNA coding for protein.  Some parts can be defined chemically, others only functionally.
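For those who think better in data structures, the left-to-right (5′ to 3′) order of the landmarks described in this post can be written down as a small Python record. The class, its field names and the coordinates are all hypothetical, chosen only to show the ordering; enhancers are left out because they have no fixed position relative to the rest.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GeneLayout:
    """Positions (in nucleotides, 5'->3') of the landmarks of a protein coding gene."""
    promoter_start: int
    transcription_start: int  # where pol II starts making RNA
    start_codon: int          # the AUG (ATG in the DNA)
    stop_codon: int
    transcript_end: int       # where pol II finally quits

    def is_consistent(self) -> bool:
        # Promoter, then transcription start, then AUG, then stop codon,
        # with transcription running on past the stop codon.
        return (self.promoter_start < self.transcription_start
                < self.start_codon < self.stop_codon < self.transcript_end)

toy = GeneLayout(promoter_start=0, transcription_start=100,
                 start_codon=150, stop_codon=900, transcript_end=1500)
print(toy.is_consistent())  # True
```

Of the five landmarks, only the two codons can be found by reading the sequence; the other three are defined by what the polymerase does.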

Next up: the structure of the RNA transcript of a protein coding gene (pre-mRNA).
