
A synonymous codon that isn’t

Molecular biology is simply too elegant and beautiful to be left to the molecular biologists.  So I’m going to present the intriguing result of a recent paper as I would take notes on it for myself, and then unpack it explaining the various terms contained as I go along.

If you’re really adventurous, start reading a series of 5 posts I wrote, beginning with https://luysii.wordpress.com/2010/07/07/molecular-biology-survival-guide-for-chemists-i-dna-and-protein-coding-gene-structure/ and follow the links.

It should explain everything in the paper below.

The paper itself is Nature vol. 602 pp. 335 – 342 ’22 — https://www.nature.com/articles/s41586-022-04451-4.pdf.

The unvarnished result:  Just mutating glutamine to lysine at position 61 of the KRAS oncogene (Q61K) isn’t enough to make KRAS resistant to an anticancer drug that attacks it (Osimertinib).  The codon for glycine at position 60 must also be switched to one of its synonymous codons.

OK:  let’s unpack this starting with synonymous codon.

The DNA making up our genome is a string of elements (nucleotides also known as bases) strung together.  Similarly, our proteins are strings of elements (amino acids).  The order is crucial; just as it is with the 26 letters making up words. Consider the two words united and untied.

Bases come in 4 varieties (A, T, G and C).  Amino acids come in twenty varieties, of which three are glycine (G), glutamine (Q) and lysine (K). The one letter abbreviations don’t make much sense, but that’s the way it is.

Since the order of both bases and amino acids is important, it’s clear that A T and T A are different. 2 bases can only code for 16 amino acids (4 x 4).  Go up to 3 bases and you can code for 64 (4 x 4 x 4), which is overkill for 20 amino acids.  A sequence of 3 bases is called a codon. All 64 codons code for an amino acid (except for three of them, about which much more later).  This means that there must be several codons coding for the same amino acid; these are the synonymous codons.

The number of codons for a given amino acid ranges from 1 (methionine M) to 6 (Leucine L).  Here are the 4 synonymous codons for glycine — GGA, GGC, GGG and GGT.  Note how similar they are.
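For anyone who wants to check the counting, here is a minimal Python sketch that builds the standard genetic code and groups the codons by amino acid. The names (`CODON_TABLE`, `synonymous_codons`) are my own for illustration; the table is written with T rather than U, and `*` stands for a stop codon.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, codons enumerated in TCAG order.
# '*' marks the three stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each of the 4 x 4 x 4 = 64 codons to its amino acid (or '*').
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_codons(table):
    """Group codons by the amino acid they code for."""
    groups = defaultdict(list)
    for codon, aa in table.items():
        groups[aa].append(codon)
    return {aa: sorted(codons) for aa, codons in groups.items()}


groups = synonymous_codons(CODON_TABLE)
print("total codons:", len(CODON_TABLE))
print("glycine codons:", groups["G"])
print("methionine codons:", groups["M"])
print("number of leucine codons:", len(groups["L"]))
print("stop codons:", groups["*"])
```

The same table answers the question of which codons are synonymous for any amino acid, not just glycine.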

Now the human genome has 3,200,000,000 bases strung together divided into 46 pieces (the chromosomes).  If placed end to end (Dorothy Parker fashion) they would be 3 feet 3 inches (1 meter) long.  All this is in a cell so small it is invisible to the naked eye.   If this is too much to get your head around, you might enjoy the following series of 6 posts — start here and follow the links https://luysii.wordpress.com/2010/03/22/the-cell-nucleus-and-its-dna-on-a-human-scale-i/

Any 3 bases linked together code for an amino acid, but there are many different ways to ‘read’ the genome. Among the many proteins our genome codes for are the transcription factors (1,639 of them as of 2018) which bind to stretches of 10 or more bases, to activate certain genes.   That’s 4^10 possibilities (over a million) allowing a unique binding site for the 1,639.  So transcription factors read the genome in groups of 10 or so not 3.

There is yet another way to read the genome, and this has to do with the fact that the genes coding for proteins are much longer (have more bases) than 3 times the number of amino acids they code for.  The classic example is dystrophin, the gene mutated in Duchenne muscular dystrophy.  It codes for a monster protein with 3,685 amino acids, so it needs 3,685 * 3 = 11,055 bases in a row to code for them at 3 bases/amino acid.  The dystrophin gene, however, stretches for 2,220,223 bases.  So the protein coding parts of the gene (the exons) come in 79 different pieces separated by parts that don’t code for amino acids (the introns).
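The dystrophin arithmetic is easy to redo. The sketch below uses only the three numbers quoted above; the variable names are mine, and the per-exon figure is a rough average that ignores any non-coding bases the exons carry.

```python
# Numbers quoted in the text for dystrophin.
AMINO_ACID_COUNT = 3_685
GENE_LENGTH_BASES = 2_220_223
EXON_COUNT = 79

# Three bases per amino acid.
coding_bases = AMINO_ACID_COUNT * 3

# Fraction of the gene that actually codes for amino acids.
coding_fraction = coding_bases / GENE_LENGTH_BASES

# Rough average number of coding bases per exon.
mean_coding_bases_per_exon = coding_bases / EXON_COUNT

print(f"coding bases: {coding_bases:,}")
print(f"coding fraction of the gene: {coding_fraction:.2%}")
print(f"average coding bases per exon: {mean_coding_bases_per_exon:.0f}")
```

In other words, well under 1% of the gene codes for protein; the rest is mostly intron.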

I’m skipping a lot here, but the introns must be spliced out of a copy of the gene (the mRNA).  This is done by yet another machine that reads the genome, the spliceosome, which removes introns from newly formed mRNA.  The spliceosome is a huge molecular machine containing 5 RNAs (called small nuclear RNAs, aka snRNAs), along with 50 or so proteins, with a total molecular mass of around 2,500,000 Daltons (a carbon atom is 12 Daltons).  Most genes have introns and exons, and most proteins exist in multiple forms due to alternative splicing.  The spliceosome reads the mRNA in 6 – 8 base chunks looking for sites (splice sites) to bind and begin splicing out introns. Yet another way to ‘read’ a sequence of bases.   Exon sequences which promote or repress the use of alternative splice sites are known (these are called ESEs = exonic splicing enhancers, and ESSs = exonic splicing silencers).

And now, at very long last, we get to the four synonymous codons of glycine which aren’t functionally synonymous at all.  This isn’t trivial: they determine the base sequence a mutated gene must have to produce cancer.

Here’s the unvarnished result once again — Just mutating glutamine to lysine at position 61 of the KRAS oncogene (Q61K) isn’t enough to make KRAS resistant to an anticancer drug that attacks it (Osimertinib).  One of the synonymous codons for glycine at position 60 must be switched to another.

What is KRAS?  A protein which gets its name from a virus causing cancer in rats.  Kirsten RAt Sarcoma virus.  KRAS, when active, relays signals from outside the cell to the nucleus to make the cell proliferate.  The protein exists in active and inactive forms.  Humans have KRAS, and 3 similar proteins.  Mutations causing  members of the protein family to remain in constantly active form are found in 1/3 of all cancers.  In the case of KRAS some activating mutations occur at positions 60 and 61 of the 189 amino acid protein.  That’s all it takes.

The codon for glutamine at position 61 in KRAS is CAA.  To change it to the codon for lysine requires a change of just one base e.g. from CAA (glutamine) to AAA (lysine) and now you have  a KRAS which is always active producing cancer.

Recall that glycine has 4 codons (GGA, GGC, GGG and GGT).  The one found in unmutated KRAS is GGT.  This codon is never found in the KRAS Q61K mutant seen in tumors.  Why?  Because GGTAAA forms a splice site, which the splicing machinery uses to splice the mRNA differently.  The resulting exon contains one of the 3 codons mentioned above that don’t code for an amino acid.  They are called termination codons or stop codons, and tell the machinery making protein from mRNA (the ribosome) to quit.   This means that the full mutated KRAS with its 188 amino acids is never made.  So tumor producing KRAS has GGGAAA or GGAAAA or GGCAAA at positions 60 and 61 and never GGTAAA.
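Here is a toy Python sketch of the logic. It is only an illustration: real splice site recognition depends on much more sequence context than this. The one assumption I’m making (labeled in the code) is that the 4 base string GTAA can stand in for the splice site, chosen because introns normally begin with GT and because GTAA is present in GGTAAA but not in the unmutated GGTCAA.

```python
GLYCINE_CODONS = ["GGA", "GGC", "GGG", "GGT"]  # position 60
GLUTAMINE_CODON = "CAA"                        # position 61, unmutated
LYSINE_CODON = "AAA"                           # position 61, Q61K mutant

# ASSUMPTION (toy model): treat the substring GTAA as the splice site.
# This is a stand-in for illustration, not how the spliceosome actually decides.
TOY_SPLICE_MOTIF = "GTAA"


def has_toy_splice_site(bases):
    """Return True if the toy splice motif occurs anywhere in the bases."""
    return TOY_SPLICE_MOTIF in bases


# All four glycine codons give the same protein next to the lysine codon,
# but only one of them creates the toy splice site.
for glycine in GLYCINE_CODONS:
    hexamer = glycine + LYSINE_CODON
    verdict = "splice site" if has_toy_splice_site(hexamer) else "no splice site"
    print(hexamer, verdict)

# The unmutated gene (GGT followed by CAA) does not match either.
print("GGT" + GLUTAMINE_CODON, has_toy_splice_site("GGT" + GLUTAMINE_CODON))
```

Four hexamers, one protein sequence (glycine then lysine), and only GGTAAA trips the motif.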

So the synonymous glycine codons have very nonsynonymous effects.  Now you know.  Elegant, isn’t it?




hed oga tet hec atw hoa tet her atw hob ith erp aw

Say what?  It’s a simple sentence made of 3 letter words, frameshifted by one letter (the leading T is lost):

The dog ate the cat who ate the rat who bit her paw
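The word game is easy to reproduce. A minimal Python sketch (the function name `reframe` is my own): strip the spaces, drop the first `shift` letters, and cut what is left into groups of three.

```python
def reframe(sentence, shift, size=3):
    """Drop the first `shift` letters, then regroup the rest in chunks of `size`."""
    letters = sentence.replace(" ", "").lower()[shift:]
    chunks = [letters[i:i + size] for i in range(0, len(letters), size)]
    return " ".join(chunks)


SENTENCE = "The dog ate the cat who ate the rat who bit her paw"

for shift in range(3):
    print(shift, reframe(SENTENCE, shift))
```

A shift of 0 gives back the sentence, a shift of 1 gives the gibberish above, and a shift of 3 would restore real words again, minus the first one.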

Codons are read as groups of three nucleotides, and frameshifting has always been thought to totally destroy the meaning of a protein, as an entirely different protein is made.

Not so, says PNAS vol. 117 pp. 5907 – 5912 ’20. Normally a frameshifted protein has only 7% sequence identity with the original.  This is about what one would expect given that there are 20 amino acids, since chance coincidence alone would give 5% (1 in 20).  But there are ways for proteins to be similar without being identical.  One can classify our amino acids in several ways: charged vs. uncharged, aromatic vs. nonaromatic, hydrophilic vs. hydrophobic, etc.

The authors looked at 2,900 human proteins, frameshifted each by +1, and compared the hydrophobicity profiles of the original and the frameshifted version.  Amazingly there was a correlation of 0.7 between the two, despite sequence identity of 7%.  Similarly, frameshifting didn’t disturb the chance of intrinsic disorder.  So tolerance of frameshifting is embedded in the structure of the universal genetic code, and may have actually contributed to its shaping.  Frameshifting could be an evolutionary mechanism for generating proteins with similar attributes (hydrophobicity, intrinsic order vs. disorder, etc.) but with vastly different sequences.  Then evolution, aka natural selection, aka deus ex machina, aka God, could muck about with the ready made protein and find something new for it to do.   A remarkable concept.

The gag-pol precursor p180 of the AIDS virus is derived from the gag-pol mRNA by translation involving ribosomal frameshifting within the gag-pol overlap region.  The overlap is 241 nucleotides with pol in the -1 phase with respect to gag (that’s an amazing 80 amino acids).  I was amazed at the efficiency of coding of two different proteins (one an enzyme and one structural), but perhaps they aren’t that different in terms of hydrophobicity (or something else).

I’d love to see the hydropathy profile of the overlap of the two proteins, but I don’t know how to get it.
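For what it’s worth, here is a rough Python sketch of how such a comparison could be done once the two amino acid sequences are in hand. It uses the Kyte-Doolittle hydropathy scale with a sliding window average and a plain Pearson correlation; that choice is my assumption, and the PNAS authors may well have used a different scale or method. The two peptides at the bottom are made up purely to show the calls; they are not gag or pol.

```python
import math

# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=5):
    """Average hydropathy over each run of `window` consecutive amino acids."""
    scores = [KYTE_DOOLITTLE[aa] for aa in protein]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def pearson(xs, ys):
    """Pearson correlation of two profiles, truncated to the shorter length."""
    n = min(len(xs), len(ys))
    xs, ys = xs[:n], ys[:n]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical peptides, for illustration only.
peptide_a = "MKLVIAGLLEDRKS"
peptide_b = "GAVLIFWKDERNQS"

profile_a = hydropathy_profile(peptide_a)
profile_b = hydropathy_profile(peptide_b)
print("correlation:", round(pearson(profile_a, profile_b), 2))
```

Feeding in the 80 or so amino acids of the overlap as read in the gag frame and in the pol frame would give the number I’m curious about.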

The incredible information economy of frameshifting

Her fox and dog ate our pet rat

H erf oxa ndd oga teo urp etr at

He rfo xan ddo gat eou rpe tra t

The last two lines make no sense at all, but (neglecting the spaces) they have identical letter sequences.

Here are similar sequences of nucleotides making up the genetic code as transcribed into RNA




Again, in our genome there are no spaces between the triplets. But all the triplets you see are meaningful in the sense that they each code for one of the twenty amino acids (except for TAA which says stop). ATG codes for methionine (the purists will note that all the T’s should be U). I’m too lazy to look the rest up, but the ribosome doesn’t care, and will happily translate all 3 sequences into the sequential amino acids of a protein.

Both sets of sequences have undergone (reading) frame shifts.
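To make the point concrete, here is a small Python sketch that translates one stretch of bases in all three reading frames. The 15 base sequence is invented for the illustration (it is not the sequence from the post), the names are mine, and `*` marks a stop codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate every complete codon starting at offset `frame` (0, 1 or 2)."""
    codons = [dna[i:i + 3] for i in range(frame, len(dna) - 2, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


# Hypothetical sequence, chosen only for illustration.
SEQUENCE = "ATGGCATTAGGCTAA"

for frame in range(3):
    print(frame, translate(SEQUENCE, frame))
```

Same bases, three different strings of amino acids, and the ribosome would be equally happy with any of them until it hits a stop.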

A previous post https://luysii.wordpress.com/2014/10/13/the-bach-fugue-of-the-genome/ marveled about how something too small even to be called a virus coded for a protein whose amino acids were read in two different frames.

Frameshifting is used by viruses to get more mileage out of their genomes. Why? There is only so much DNA you can pack into the protein coat (capsid) of a virus.

[ Proc. Natl. Acad. Sci. vol. 111 pp. 14675 – 14680 ’14 ] Usually DNA density in cell nuclei or bacteria is 5 – 10% of volume. However, in viral capsids it is 55% of volume. The pressure inside the viral capsid can reach ten atmospheres. Ejection is therefore rapid (60,000 basepairs/second).

The AIDS virus (HIV-1) relies on frameshifting of its genome to produce viable virus. The genes for two important proteins (gag and pol) have 240 nucleotides (80 amino acids) in common. Frameshifting occurs to allow the 240 nucleotides to be read by the cell’s ribosomes in two different frames (not at once). Granted, there are 61 three-nucleotide combinations coding for only 20 amino acids, so some redundancy is built in, but the 80 amino acids coded by the two frames are usually quite different.

That the gag and pol proteins function at all is miraculous.

The phenomenon is turning out to be more widespread. [ Proc. Natl. Acad. Sci. vol. 111 pp. E4342 – E4349 ’14 ] KSHV (Kaposi’s Sarcoma HerpesVirus) causes (what else?) Kaposi’s sarcoma, a tumor quite rare until people with AIDS started developing it (due to their lousy immune system being unable to contend with the virus). Open reading frame 73 (ORF73) codes for a major latency associated nuclear antigen 1 (LANA1). It has 3 domains: a basic amino terminal region, an acidic central repeat region (divisible into CR1, CR2 and CR3) and another basic carboxy terminal region. LANA1 is involved in maintaining KSHV episomes, regulation of viral latency, and transcriptional regulation of viral and cellular genes.

LANA1 is made of multiple high and lower molecular weight isoforms — e.g. a LANA ladder band pattern seen in immunoblotting.

This work shows that LANA1 (and also Epstein Barr Nuclear antigen 1) undergoes highly efficient +1 and -2 programmed frameshifting, to generate previously undescribed alternative reading frame proteins in their repeat regions. Programmed frameshifting to generate multiple proteins from one RNA sequence can increase coding capacity, without increasing the size of the viral capsid.

The presence of similar repeat sequences in human genes (such as huntingtin, the defective gene in Huntington’s chorea) implies that we should look for frameshifting translation in ourselves as well as in viruses. In the case of mutant huntingtin, frameshifting in the abnormally expanded CAG tracts produces proteins containing polyAlanine or polySerineArginine tracts.

Well, G, A, T and C are the 1’s and 0’s of the way genetic information is stored in our genomic computer. It really isn’t surprising that the genome can be read in alternate frames. In the old days, textual information in bytes had parity bits to make sure the 1’s and 0’s were read in the correct frame. There is nothing like that in our genome (except for the 3 stop codons).

What is truly surprising is that reading in an alternate frame produces ‘meaningful’ proteins. This gets us into philosophical waters. Clearly

Erf oxa ndd oga teo urp etr at

Rfo xan ddo gat eou rpe tra t

aren’t meaningful to us. Yet gag and pol are quite meaningful (even life and death meaningful) to the AIDS virus. So meaningful, in the biologic sense, means able to function in the larger context of the cell. That really is the case for linguistic meaning as well. You have to know a lot about the world (and speak English) for the word cat to be meaningful to you. So meaning can never be defined by the word itself. Probably the same is true for concepts as well, but I’ll leave that to the philosophers, or any who choose to comment on this.

The death of the synonymous codon – IV

The coding capacity of our genome continues to amaze. The redundancy of the genetic code has been put to yet another use. Depending on how much you know, skip the following three links and read on. Otherwise all the background to understand the following is in them.




There really was no way around it. If you want to code for 20 different amino acids with only four choices at each position, two positions (4^2 = 16) won’t do. You need three positions, which gives you 4^3 = 64 possibilities (61 after the three stop codons are taken into account) and the redundancy that comes with it. The previous links show how the redundant codons for some amino acids aren’t redundant at all but are used to code for the speed of translation, or for exonic splicing enhancers and inhibitors. Different codons for the same amino acid can produce wildly different effects while leaving the amino acid sequence of a given protein alone.

If anything will figure out a way to use synonymous codons for its own ends, it’s cancer. [ Cell vol. 156 pp. 1129 – 1131, 1324 – 1335 ’14 ] analyzed protein coding genes in cancer. Not just a few cases, but the parts of the genome coding for the exons of a mere 3,851 cases of cancer. In addition they did whole genome sequencing in 400 cases of 19 different tumor types.

There are genes which suppress cancer (which cancer often knocks out, such as the retinoblastoma gene or the ubiquitous p53), and genes which when mutated promote it (oncogenes like ras). They found a 1.3 fold enrichment of synonymous mutations in oncogenes (which would tend to activate them) relative to the tumor suppressors. The synonymous mutations accounted for 20 – 40 % of somatic mutations found in cancer exomes.

Unfortunately, synonymous mutations have been used to estimate the background mutation frequency for evolutionary analysis, on the theory that they are neutral (i.e. because they don’t change protein structure, they are assumed not to change how the gene for the protein functions). Wrong. Wrong. They can change how much of a protein is made, or where, or which exons are included in the final product.