Category Archives: Molecular Biology

Catching God’s dice being thrown

Einstein famously said “Quantum theory yields much, but it hardly brings us close to the Old One’s secrets. I, in any case, am convinced He does not play dice with the universe.”  Astronomers have caught the dice being thrown (at least as far as the origin of life is concerned).

This post will contain a lot more background than most, as I expect some readers won’t have much scientific background.  The technically inclined can read the article on which this is based —

To cut to the chase — astronomers have found water, a simple sugar, and a compound containing carbon, hydrogen, oxygen and nitrogen around newly forming stars and planets.  You need no more than these 4 elements to build the bases making up the DNA of our genes, all our sugars and carbohydrates, and 18 of the 20 amino acids that make up our proteins. Throw in sulfur and you have all 20 amino acids.  Add phosphorus and you have DNA and its cousin RNA (neither has been found around newly forming stars so far).

These are the ingredients of life itself. Here’s a quote from the article — “What I can definitively say is that the ingredients needed to make biogenic molecules like DNA and RNA are found around every forming protostar. They are there at an early stage, incorporating into bodies at least as large as comets, which we know are the building blocks of terrestrial planets. Whether these molecules survive or are delivered at the late stage of planet formation, that’s the part of it we don’t know very well.”

So each newly formed star and planetary system is a throw of God’s/Nature’s/Soulless physics’ dice for the creation of life.

As of 1 July 2018, there are 3,797 confirmed planets around 2,841 stars, with 632 having more than one (Wikipedia).  And that’s just in the stars close enough to us to study.  Our galaxy, the Milky Way, contains some 400,000,000,000 stars.

Current estimates have some 100,000,000,000 galaxies in the universe.  That’s a lot of tosses for life to arise.

Suppose that some day life is found on one such planet.  Does this invalidate Genesis, the Koran?  Assume that they are the word of God somehow transmitted to man.  If the knowledge we have about astronomy (above), biology etc. etc. were imparted to Jesus, Mohammed, Abraham, Moses — it never would have been believed.  The creator had to start with something plausible.




How many more metabolites like this are out there?

3′ deoxy 3′ 4′ didehydro cytidine triphosphate — doesn’t roll ‘trippingly on the tongue’, does it? (Hamlet Act 3, scene 2, 1–4).  Can you draw the structure?  It is the product of another euphoniously named enzyme — Viperin.  Abbreviated ddhCTP, it is just cytidine triphosphate with a double bond between carbons 3 and 4 of the sugar.

Viperin is an enzyme induced by interferon which inhibits the replication of a variety of viruses. [ Nature vol. 558 pp. 610 – 614 ’18 ] describes a  beautiful sequence  of reactions for ddhCTP’s formation using S-Adenosyl Methionine (SAM).  ddhCTP acts as a chain terminator for the RNA dependent RNA polymerases of multiple members of Flaviviruses (including Zika).

However the paper totally blows it for not making the point that ddhCTP is extremely close to a drug which has been used against AIDS for years — Zalcitabine (Hivid) — which is just ddC.  ddhCTP is almost the same as ddC — except that ddC has no triphosphate on the 5′ hydroxyl (which enzymes in the body add), and instead of a double bond between carbons 3 and 4 of the sugar, both carbons are fully reduced (CH2 and CH2).  So ddhCTP is Nature’s own Zalcitabine.

It is worth reflecting on just how many other metabolites are out there acting as ‘natural’ drugs that we just haven’t found yet.

Remember entropy – take III

Pop quiz.  How would you make an enzyme in a cold dwelling organism (0 Centigrade) as catalytically competent as its brothers living in us at 37 C?

We know that reactions go faster the hotter it is, because there is more kinetic energy of the reactants to play with.  So how do you make an enzyme move more when it’s cold and there is less kinetic energy to play with?

Well for most cold tolerant enzymes (psychrophilic enzymes — a great scrabble word), evolution mutates surface amino acids to glycine.  Why glycine?  Well it’s lighter, and there is no side chain to get in the way when the backbone moves.  The mutations aren’t in the active site but far away.  This means more wiggling of the backbone — which means more entropy of the backbone.

The following papers [ Nature vol. 558 pp. 195 – 196, 324 – 328 ’18 ] studied adenylate kinase, an enzyme found in most eukaryotes which catalyzes

ATP + AMP ⇌ 2 ADP.

They studied the enzyme from E. coli which happily lives within us at 37 C, mutated a few surface valines and isoleucines to glycine, lowered the temperature and found the enzyme works as well (the catalytic rate of the mutated enzyme at 0 C is the same as the rate of the unmutated enzyme at 37).

Chemists have been studying transition state theory since the days of Eyring, and reaction rates fall off exponentially with the amount of free energy (not enthalpy) needed to raise the enzyme to the transition state.

F = H – TS (Free energy = enthalpy – Temperature * Entropy).

So to increase speed, decrease the enthalpy of activation (deltaH) or increase the entropy of activation (deltaS).

It is possible to separately measure enthalpy and entropies of activation, and the authors did just that (figure 4 p. 326) and showed that the enthalpy of activation of the mutated enzyme (glycine added) was the same as the unmutated enzyme, but that the free energy of activation of the mutated enzyme was less because of an increase in entropy (due to unfolding of different parts of the enzyme).

Determining these two parameters takes an enormous amount of work (see the story from grad school at the end). You have to determine rate constants at various temperatures, plot the logarithm of the rate constant divided by temperature against the reciprocal of the temperature, and then measure the slope of the line you get to obtain the enthalpy of activation.  Activation entropy is determined by the intercept of the straight line (which hopefully IS straight) with the Y axis.  Determining the various data points is incredibly tedious and uninteresting.
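For readers who want to see the bookkeeping, here is a minimal Python sketch of that Eyring analysis. It is not the authors' procedure or data: the activation parameters (TRUE_DH, TRUE_DS) and the temperatures are made-up values used only to generate synthetic rate constants, and the function names are mine. The fit then recovers the enthalpy of activation from the slope and the entropy of activation from the Y intercept.

```python
import math

R = 8.314462618       # gas constant, J/(mol*K)
K_B = 1.380649e-23    # Boltzmann constant, J/K
H = 6.62607015e-34    # Planck constant, J*s

def eyring_rate(temp_k, dH, dS):
    """Rate constant (1/s) from the Eyring equation:
    k = (kB*T/h) * exp(-(dH - T*dS) / (R*T))."""
    return (K_B * temp_k / H) * math.exp(-(dH - temp_k * dS) / (R * temp_k))

def fit_eyring(temps_k, rates):
    """Least-squares line through ln(k/T) versus 1/T.

    slope = -dH/R and Y intercept = ln(kB/h) + dS/R.
    Returns (dH in J/mol, dS in J/(mol*K)).
    """
    xs = [1.0 / t for t in temps_k]
    ys = [math.log(k / t) for k, t in zip(rates, temps_k)]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    dH = -slope * R
    dS = (intercept - math.log(K_B / H)) * R
    return dH, dS

# Hypothetical activation parameters, used only to make synthetic data.
TRUE_DH = 50_000.0   # J/mol
TRUE_DS = -40.0      # J/(mol*K)
temps = [273.15, 283.15, 293.15, 303.15, 310.15]   # 0 C up to 37 C
rates = [eyring_rate(t, TRUE_DH, TRUE_DS) for t in temps]

fit_dH, fit_dS = fit_eyring(temps, rates)
print(f"fitted dH = {fit_dH:.1f} J/mol, fitted dS = {fit_dS:.2f} J/(mol*K)")
```

With real measurements every entry in rates is a separate kinetic experiment with its own error, which is where the tedium comes from.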

So cold tolerant organisms are using entropy to make their enzymes work.

Grad school story — back in the day, studies of organic reaction mechanisms were very involved with kinetic measurements (that’s where Sn1 and Sn2 actually come from).  I saw the following happen several times, and resolved never to get sucked into having to actually do kinetic measurements.  Some hapless wretch would present his kinetic data to a seminar, only to have Frank Westheimer think of something else and suggest another 6 months of kinetic measurements, so back he went to the lab for yet more drudgery.



Molecular biology’s oxymoron

Dear reader.  What does a gene do?  It codes for something.  What does a nonCoding Gene do?  It also codes for something, just RNA instead of protein. It’s molecular biology’s very own oxymoron, a throwback to the heroic protein-centric early days of molecular biology. The term has been enshrined by usage for so long that it’s impossible to get rid of.  Nonetheless, the latest work found even more nonCoding genes than genes actually coding for protein.

An amusing article from Nature (vol. 558 pp. 354 – 355 ’18) has the current state of play.  The latest estimate is from GTEx which sequenced 900 billion RNAs found in various human tissues, matched them to the sequence(s) of the human genome and used computer algorithms to determine which of them were the product of genes coding for proteins and which the product of genes coding for something else.

The report from GTEx (Genotype-Tissue Expression Project) found 21,306 protein-coding genes and 21,856 non-coding genes — amazingly there are more nonCoding genes than protein coding ones.  This is many more genes than found in the two most widely used human gene databases. The GENCODE gene set, maintained by the EBI, includes 19,901 protein-coding genes and 15,779 non-coding genes. RefSeq, a database run by the US National Center for Biotechnology Information (NCBI), lists 20,203 protein-coding genes and 17,871 non-coding genes.

Stay tuned.  The fat lady hasn’t sung.

Chemistry and Biochemistry can’t answer the important questions but without them we are lost

The last two posts — one concerning the histone code and cerebral embryogenesis and the other concerning PVT1 enhancers promoters and cancer — would be impossible without chemical and biochemical knowledge and technology, but the results they produce and the answers they seek lie totally outside both disciplines.

In fact they belong outside the physical realm in the space of logic, ideas, function — i.e. in the other half of the Cartesian dichotomy — the realm of ideas and spirit.  Certainly the biological issues are instantiated physically in molecules, just as computer memory used to be instantiated in magnetic cores, rather than transistors.

Back when I was starting out as a grad student in Chemistry in the early 60s, people were actually discovering the genetic code, poly U coded for phenylalanine etc. etc.  Our view was that all we had to do was determine the structure of things and understanding would follow.  The first xray structures of proteins (myoglobin) and Anfinsen’s result on ribonuclease showing that it could fold into its final compact form all by itself reinforced this. It also led us to think that all proteins had ‘a’ structure.

This led to people thinking that the only difference between us and a chimpanzee was a few amino acid differences in our proteins (remember the slogan that we were 98% chimpanzee).

So without chemistry and biochemistry we’d be lost, but the days of crude reductionism of the 60s and 70s are gone forever.  Here’s another example of chemical and biochemical impotence from an earlier post.

The limits of chemical reductionism

“Everything in chemistry turns blue or explodes”, so sayeth a philosophy major roommate years ago.  Chemists are used to being crapped on, because it starts so early and never lets up.  However, knowing a lot of organic chemistry and molecular biology allows you to see very clearly one answer to a serious philosophical question — when and where does scientific reductionism fail?

Early on, physicists said that quantum mechanics explains all of chemistry.  Well it does explain why atoms have orbitals, and it does give a few hints as to the nature of the chemical bond between simple atoms, but no one can solve the equations exactly for systems of chemical interest.  Approximate the solution, yes, but this is hardly a pure reduction of chemistry to physics.  So we’ve failed to reduce chemistry to physics because the equations of quantum mechanics are so hard to solve, but this is hardly a failure of reductionism.

The last post “The death of the synonymous codon – II” puts you exactly at the nidus of the failure of chemical reductionism to bag the biggest prey of all, an understanding of the living cell and with it of life itself.  We know the chemistry of nucleotides, Watson-Crick base pairing, and enzyme kinetics quite well.  We understand why less transfer RNA for a particular codon would mean slower protein synthesis.  Chemists understand what a protein conformation is, although we can’t predict it 100% of the time from the amino acid sequence.  So we do understand exactly why the same amino acid sequence using different codons would result in slower synthesis of gamma actin than beta actin, and why the slower synthesis would allow a more leisurely exploration of conformational space allowing gamma actin to find a conformation which would be modified by linking it to another protein (ubiquitin) leading to its destruction.  Not bad.  Not bad at all.

Now ask yourself, why the cell would want to have less gamma actin around than beta actin.  There is no conceivable explanation for this in terms of chemistry.  A better understanding of protein structure won’t give it to you.  Certainly, beta and gamma actin differ slightly in amino acid sequence (4/375) so their structure won’t be exactly the same.  Studying this till the cows come home won’t answer the question, as it’s on an entirely different level than chemistry.

Cellular and organismal molecular biology is full of questions like that, but gamma and beta actin are the closest chemists have come to explaining the disparity in the abundance of two closely related proteins on a purely chemical basis.

So there you have it.  Physicality has gone as far as it can go in explaining the mechanism of the effect, but has nothing to say whatsoever about why the effect is present.  It’s the Cartesian dualism between physicality and the realm of ideas, and you’ve just seen the junction between the two live and in color, happening right now in just about every cell inside you.  So the effect is not some trivial toy model someone made up.

Whether philosophers have the intellectual cojones to master all this chemistry and molecular biology is unclear.  Probably no one has tried (please correct me if I’m wrong).  They are certainly capable of mounting intellectual effort — they write book after book about Godel’s proof and the mathematical logic behind it. My guess is that they are attracted to such things because logic and math are so definitive, general and nonparticular.

Chemistry and molecular biology aren’t general this way.  We study a very arbitrary collection of molecules, which must simply be learned and dealt with. Amino acids are of one chirality. The alpha helix turns one way and not the other.  Our bodies use 20 particular amino acids not any of the zillions of possible amino acids chemists can make.  This sort of thing may turn off the philosophical mind which has a taste for the abstract and general (at least my roommates majoring in it were this way).

If you’re interested in how far reductionism can take us  have a look at

Were my two philosopher roommates still alive, they might come up with something like “That’s how it works in practice, but how does it work in theory?”


Marshall McLuhan rides again

Marshall McLuhan famously said “the medium is the message”. Who knew he was talking about molecular biology?  But he was, if you think of the process of transcription of DNA into various forms of RNA as the medium and the products of transcription as the message.  That’s exactly what this paper [ Cell vol. 171 pp. 103 – 119 ’17 ] says.

T cells are a type of immune cell formed in the thymus.  One of the important transcription factors which turns on expression of the genes which make a T cell a T cell is called Bcl11b.  Early in T cell development it is sequestered away near the nuclear membrane in highly compacted DNA. Remember that you must compress your 1 meter of DNA down by 100,000fold to have it fit in the nucleus which is 1/100,000th of a meter (10 microns).

What turns it on?  Transcription of nonCoding (for protein) RNA called ThymoD.  From my reading of the paper, ThymoD doesn’t do anything, but just the act of opening up compacted DNA near the nuclear membrane produced by transcribing ThymoD is enough to cause this part of the genome to move into the center of the nucleus where the gene for Bcl11b can be transcribed into RNA.

There’s a lot more to the paper,  but that’s the message if you will.  It’s the act of transcription rather than what is being transcribed which is important.

Here’s more about McLuhan —

If some of the terms used here are unfamiliar — look at the following post and follow the links as far as you need to.

Well that was an old post.  Here’s another example [ Cell vol. 173 pp. 1318 – 1319, 1398 – 1412 ’18 ].  It concerns a gene called PVT1 (Plasmacytoma Variant Translocation 1) found 25 years ago.  It was the first gene coding for a long nonCoding (for proteins) RNA (lncRNA) found as a recurrent breakpoint in Burkitt’s lymphoma, which sadly took a friend (Nick Cozzarelli) far too young (he edited PNAS for 10 years).

So PVT1 is involved in cancer.  The translocation turns on expression of the myc oncogene, something that has been studied out the gazoo and we’re still not sure of how it causes cancer. I’ve got 60,000 characters of notes on the damn thing, but as someone said 6 years ago “Whatever the latest trend in cancer biology — cell cycle, cell growth, apoptosis, metabolism, cancer stem cells, microRNAs, angiogenesis, inflammation — Myc is there regulating most of the key genes”

We do know that the lncRNA coded by PVT1 in some way stabilizes the myc protein [ Nature vol. 512 pp. 82 – 87 ’14 ].  However the cell experiments knocked out the lncRNA of PVT1 and myc expression was still turned on.

PVT1 resides 53 kiloBases away from myc on chromosome #8.  That’s about 175% of the diameter of the average nucleus (10 microns) if the DNA is stretched out into the B-DNA form seen in all the textbooks.  Since each base is 3.3 Angstroms thick that’s 175,000 Angstroms = 17,500 nanoMeters = 17.5 microns.  You can get an idea of how compacted DNA is in the nucleus when you realize that there are 3,200,000,000/53,000 = 60,000 such segments in the genome all packed into a sphere 10 microns in diameter.
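The unit conversions are easy to slip on, so here is the arithmetic as a small Python sketch. The inputs are the figures used in the post (3.3 Angstroms per base, 53 kiloBases, a 10 micron nucleus, a 3.2 billion base genome); the constant and function names are mine. Working it through, 53 kiloBases of stretched B-DNA comes to about 17.5 microns, and the whole genome to about 1 meter.

```python
ANGSTROM_PER_BASE = 3.3            # rise per base in B-DNA, figure used in the post
ANGSTROMS_PER_MICRON = 10_000      # 1 micron = 1,000 nanometers = 10,000 Angstroms
NUCLEUS_DIAMETER_MICRONS = 10.0
GENOME_BASES = 3_200_000_000
PVT1_TO_MYC_BASES = 53_000

def stretched_length_microns(n_bases):
    """Length of n_bases of fully extended B-DNA, in microns."""
    return n_bases * ANGSTROM_PER_BASE / ANGSTROMS_PER_MICRON

segment_microns = stretched_length_microns(PVT1_TO_MYC_BASES)
fraction_of_nucleus = segment_microns / NUCLEUS_DIAMETER_MICRONS
segments_in_genome = GENOME_BASES / PVT1_TO_MYC_BASES
genome_meters = stretched_length_microns(GENOME_BASES) / 1_000_000

print(f"PVT1 to myc, stretched out: {segment_microns:.2f} microns")
print(f"as a fraction of the nuclear diameter: {fraction_of_nucleus:.2f}")
print(f"number of such segments in the genome: {segments_in_genome:,.0f}")
print(f"whole genome, stretched out: {genome_meters:.3f} meters")
```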

To cut to the chase, within the PVT1 gene there are at least 4 enhancers (use the link above to find what all the terms to be used actually mean).  Briefly enhancers are what promoters bind to to help turn on the transcription of the genes in DNA into RNA (messenger and otherwise).  This means that the promoter of PVT1 binds one or more of the enhancers, preventing the promoter of the myc oncogene from binding.

Just how they know that there are 4 enhancers in PVT1 is a story in itself.  They cut various parts of the PVT1 gene (which itself has 306,721 basepairs) out, place them in front of a reporter gene and see if transcription increases.

The actual repressor of myc is the promoter of PVT1 according to the paper (it binds to the enhancers present in the gene body preventing the myc promoter from doing so).  Things may be a bit more complicated as the PVT1 gene also codes for a cluster of 7 microRNAs and what they do isn’t explained in the paper.

So it’s as if the sardonic sense of humor of ‘nature’, ‘evolution’, ‘God’, (call it what you will) has set molecular biologists off on a wild goose chase, looking at the structure of the gene product (the lncRNA) to determine the function of the gene, when actually it’s the promoter in front of the gene and the enhancers within which are performing the function.

The mechanism may be more widespread, as 4/36 lncRNA promoters silenced by CRISPR techniques subsequently activated genes in a 1 megaBase window (possibly by the same mechanism as PVT1 and myc).

Where does McLuhan come in?  The Cell paper also notes that lncRNA gene promoters are more evolutionarily conserved than their gene bodies.  So the medium (promoter, enhancer) is the message once again (rather than what we thought the message was).


The other uses of amyloid (not all bad)

Neurologists and drug chemists pretty much view amyloid as a bad thing.  It is the major component of the senile plaque of Alzheimer’s disease, and when deposited in nerve causes amyloidotic polyneuropathy.  A recent paper and editorial cast amyloid in a different light [ Cell vol. 173 pp. 1068 – 1070, 1244 – 1253 ’18 ].  However if amyloid is so bad, why do cytomegalovirus, herpes simplex viruses and E. coli make proteins to prevent a type of amyloid from forming?

Cell death isn’t what it used to be.  Back in the day, cells just died when things didn’t go well.  Now we know there are a variety of ways that cells die, and all of them have rather specific mechanisms.  Apoptosis (aka programmed cell death) is a mechanism of cell death used widely during embryonic development.  It allows the cell to die very quietly without causing inflammation.  Necroptosis is entirely different: it is another type of programmed cell death, designed to cause inflammation — bringing the immune system in to attack invading pathogens.

Two proteins (Receptor Interacting Protein Kinase 1 — RIPK1, and RIPK3) bind to each other forming amyloid that looks for all the world like typical amyloid — it binds Congo Red, shows crossBeta diffraction and has a filamentous appearance.  Fascinating chemistry aside, the amyloid formed is crucial for necroptosis to occur, which is why various bugs try to prevent it.

The paper above describes the structure of the amyloid formed — unusual in itself, because until now amyloid was thought to involve the aggregation of a single protein.

The proteins are large: RIPK1 contains 671 amino acids, and RIPK3 contains 518.  They both contain RHIMs (Receptor interacting protein Homotypic Interaction Motifs) which are fairly large themselves (amino acids 496 – 583 of RIPK1 and 388 – 518 of RIPK3).  Yet the amyloid the two proteins form uses very small stretches (amino acids 532 – 543 from RIPK1 and 451 – 462 from RIPK3).  How the rest of these large proteins pack around the beta strands of the 12 amino acid stretches isn’t discussed in the paper.  Even within these stretches, it is two consensus tetrapeptides (IQIG from RIPK1, and VQVG from RIPK3) that do most of the binding.

Even if you assume that I (Isoleucine) Q (glutamine) G (glycine) V (valine) occur at a frequency of 5%, in our proteome of 20,000 proteins assuming a length of amino acids IQIG and VQVG should occur 10 times each.  This may explain why 300/20,000 of our proteins contain a 100 amino acid  segment called BRICHOS which acts as a chaperone preventing amyloid formation. For details see —
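The estimate above can be sketched in a few lines of Python. The sentence does not give the protein length it assumed, so the sketch treats average length as a free parameter and tries a few hypothetical values (100, 300 and 500 amino acids are my picks, not the post's). It also assumes every residue is drawn independently at the stated 5% frequency, which real proteins certainly don't obey. The function name is mine.

```python
def expected_motif_count(n_proteins, mean_length, residue_freq, motif_length=4):
    """Expected number of exact matches to one specific motif in a proteome.

    Assumes independent residues, each motif residue occurring with
    probability residue_freq, and mean_length - motif_length + 1
    possible start positions per protein.
    """
    positions_per_protein = max(mean_length - motif_length + 1, 0)
    p_match = residue_freq ** motif_length
    return n_proteins * positions_per_protein * p_match

# Hypothetical average protein lengths; the post does not state one.
for hypothetical_length in (100, 300, 500):
    count = expected_motif_count(20_000, hypothetical_length, 0.05)
    print(f"mean length {hypothetical_length}: about {count:.1f} copies of IQIG (or VQVG)")
```

Whatever length you pick, the point stands: by chance alone each tetrapeptide should turn up somewhere in the proteome tens of times.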

Just another reason to take up the research idea in the link and find out just what other things amyloid is doing within our cells in the course of their normal functioning.


Cultural appropriation, neuroscience division

If Deng Xiaoping can have Socialism with Chinese Characteristics, I can have a Chinese saying with neuroscientific characteristics — “The axon and the dendrite are long and the nucleus is far away” mimicking “The mountains are high and the Emperor is far away”. The professionally offended will react to the latest offense du jour — cultural appropriation  — of course.  But I’m entitled and I spoke to my Chinese daughter in law, and people over there found it flattering and admiring of Chinese culture that the girl in Utah wore a Chinese cheongsam dress to her prom.

Back to the quote.  “The axon and the dendrite are long and the nucleus is far away”.  Well, neuronal ends are far away from the cell body — the best examples are axons from the sacral spinal cord which in an NBA player can be a yard long.  But forget that, let’s talk about the ends of dendrites which are much closer to the cell body than that.

Presumably neurons have different types of dendrites so they can respond to different types of inputs. Why should dendrites respond identically if their inputs are different? They don’t.  A dendrite responding to acetyl choline will express neurotransmitter receptors distinct from those of another dendrite on the same neuron responding to dopamine.  The protein cohorts of axons and dendrites are different.  How does this come about?  Because the untranslated part of mRNA on the 3′ end (3’UTR) contains a sequence called a zipcode which binds to specific proteins which then move the mRNA to a specific location in the neuron (axon or dendrite).  Presumably all dendrites initially had the same complement of mRNA.

So depending on what’s happening at a particular dendrite on a neuron, more or less of a given protein is made.   This is way too abstract.  Suppose you want to strengthen a synapse.  You’d make more of a neurotransmitter receptor or an ion channel for whatever transmitter that dendrite is getting.

It is well established that axons and dendrites store mRNAs and make proteins from them far from the nucleus (aka the emperor).  If you think about it, just how a receptor for dopamine gets to a dendrite receiving dopamine and not to a dendrite (on the same neuron) getting glutamic acid as a transmitter, is far from clear.  There are zipcodes distinguishing axons from dendrites, but I’m unaware that there are zipcodes for dopamine dendrites distinct from other types of dendrites.

If that weren’t enough consider [ Neuron vol. 98 pp. 495 – 511 ’18 ].  Even for an mRNA coding for the same protein (presumably transcribed from just one gene), there can be more than one type of 3’UTR (and this in the same cell).  Note also that 3’UTRs are longer in neurons than in other tissues.

So the authors looked at the mRNAs in dendrites — they did this by choosing a tissue (the hippocampus) where rows of cell bodies are well separated from their dendrites.  They found that for a given dendritic mRNA there was more than one 3’UTR, and that the mRNAs with longer 3’UTRs had longer halflives.  Even more exquisitely, neuronal activity altered the proportion of the different 3’UTR isoforms. The phenomenon is quite general — over 50% of all genes and over 70% of genes enriched in neurons showed multiple 3′ UTRs.

So there is a whole control system built into the dendritic system, and it varies with what is happening locally.

The emperor emits directives (mRNAs) but what happens locally is anyone’s guess.


Very sad — Nature vol. 557 p. 144 ’18 (10 May) “PNAS resignation On 1 May, Inder Verma, a cancer researcher at the Salk Institute for Biological Sciences in La Jolla, California, resigned as editor-in-chief of the journal Proceedings of the National Academy of Sciences. The move comes after the publication of an investigation by Science, in which several female researchers who were either at the institute or had ties to it between 1976 and 2016 allege that Verma harassed them. Verma, who served on powerful committees at the institute, vehemently denied the allegations in a statement to Nature. The Salk Institute suspended him on 21 April while it investigates the claims.”

Why sad?  Because my late Princeton classmate and good friend Nick Cozzarelli edited PNAS for 10 years.  He died far too soon at 68 of Burkitt’s lymphoma after doing great work on DNA gyrase.  From the Wiki about him: “In 1995, Cozzarelli was invited to become the editor-in-chief of the Proceedings of the National Academy of Sciences. He took the position because he felt that the journal had great unrealized potential as a scientific publication. During his tenure, he expanded the editorial board from 26 to more than 140 and created a second track to allow scientists to submit manuscripts directly.”

Nick was credited for strongly increasing the quality and influence of PNAS.  This was recognized by the journal in the form of the Cozzarelli prizes established a year after his death.  There are 6 chosen from the more than 3,200 research articles appearing in the journal each year, representing the six broadly defined classes under which the National Academy of Sciences is organized.

A social note:  Although Princeton University was the home of many bluebloods in the late 50s, this was not true of all.  Nick went through Princeton on scholarship (waiting on tables in commons etc. etc.).  He was the son of an immigrant shoemaker from Jersey City.  Hopefully Princeton is still doing this.

Addendum 10 May — a friend said:

“Your blog post seems to be one big non sequitur. I doubt that harassment victims are ‘sad’ that their complaints are finally getting heard and acted on. The fact that Verma’s behavior was allowed to continue all these years reflects poorly on the Salk Institute, but I don’t see how it reflects poorly on PNAS, where he was simply an editor and has now resigned. Essentially, Verma received PNAS submissions while sitting at his desk (at the Salk Institute) and declared ‘yes’ or ‘no.’  I don’t see how your late friend Nick’s PNAS legacy has been sullied by any of that.”

To which I replied:

“No, it’s sad because of what Verma’s behavior (at Salk and likely as PNAS editor) would have meant to Nick (and how he loved PNAS), given the type of guy Nick was.  My late father (an attorney) and uncle (a judge) took things the same way when a lawyer got disbarred for some malfeasance or other, e.g. as a reflection on the institution of the legal profession.  They took it personally as a reflection on them.  Perhaps illogically, but that’s the way they and Nick were.”

How Badly are Thy Genomes, Oh Humanity — take II

With apologies to Numbers 24:5, “How goodly are thy tents, Oh Jacob” — a recent paper shows how shockingly error ridden our genomes actually are [ Science vol. 360 pp. 327 – 331 ’18 ].  I’d written about this in 2012 (see the end), but technology has marched on.  Back then only the parts of the genome coding for protein (the exome) were sequenced.  The present work did whole genome sequencing (WGS) to a mean coverage of 40+ (i.e. they also sequenced the other 98 percent of the genome).

The authors were studying families in which one or more children had autism spectrum disorder to find genome abnormalities which might have caused the ASD. They were looking for structural variants (SVs) by which they mean “biallelic deletion, tandem duplications, inversions, four classes of complex SV, and four families of mobile element insertions”.

Why?  Because studying proteins alone doesn’t tell you how they are controlled.  That’s in the DNA surrounding them.  Structural variants are more likely to affect control elements than the proteins themselves.

Showing how technology has marched on they determined the whole genomes of 9274 subjects from 2600 families affected by ASD.

The absolutely mindboggling point in the article is the following direct quote “An average of 3746 SVs were detected per individual”.  That’s simply incredible (assuming the above isn’t a misprint).

Here’s the older post

How Badly Are Thy Genomes, Oh Humanity

With apologies to Numbers 24:5, “How goodly are thy tents, Oh Jacob” — a recent paper shows how shockingly error ridden our genomes actually are [ Science vol. 337 pp. 64 – 69 ’12 ].  The authors sequenced roughly three quarters of the genes coding for proteins in some 2,439 people — i.e. 15,585 protein coding genes.  This left 98% of the genome untouched, primarily because we really don’t know what it does or how it does it, despite the fact that it controls when, where and how much of each protein is made.  So they basically looked at the bricks from which we are built (the proteins) and not the plans (the 98%).

The subjects came from two groups: 1,351 Europeans and 1,088 Africans (the latter, because genetic diversity is far higher among Africans as that’s where humanity arose, and where mutations have had the longest time to accumulate).

The news is not very good. First, some background.

Recall that each nucleotide is one of four possibilities (A, T, G, C), and that each 3 nucleotides therefore has 4^3 = 64 possibilities.  61/64 combinations code for amino acids which, since we have only 20, gives a certain redundancy to the famed genetic code.  The other 3 combinations code for no amino acid (usually) and tell the machinery making proteins to stop.  Although crucial to our existence, these are called nonsense codons.

The genetic code is therefore 3fold degenerate (on average).  However, some amino acids are coded for by just 1 combination of 3 nucleotides while others are coded by as many as 6.  So some single nucleotide variants (SNVs) leave the amino acid coded for the same (these are the synonymous SNVs), while others change the amino acid (nonSynonymous SNVs), and possibly protein function.
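If you’d like to check those counts yourself, here is a short Python sketch. It writes out the standard genetic code as the usual 64-letter string in TCAG order (one-letter amino acid codes, * for stop) and tallies how many codons each amino acid gets. The variable names are mine.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base T
    "LLLLPPPPHHQQRRRR"   # first base C
    "IIIMTTTTNNKKSSRR"   # first base A
    "VVVVAAAADDEEGGGG"   # first base G
)

codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stop_codons = sorted(c for c, aa in codon_table.items() if aa == "*")
codons_per_aa = Counter(aa for aa in codon_table.values() if aa != "*")
mean_degeneracy = sum(codons_per_aa.values()) / len(codons_per_aa)
single_codon = sorted(aa for aa, n in codons_per_aa.items() if n == 1)
six_codons = sorted(aa for aa, n in codons_per_aa.items() if n == 6)

print(f"{len(codon_table)} codons, {len(stop_codons)} of them stops: {stop_codons}")
print(f"mean codons per amino acid: {mean_degeneracy:.2f}")
print(f"amino acids with 1 codon: {single_codon}, with 6 codons: {six_codons}")
```

The single-codon amino acids are methionine (M) and tryptophan (W); leucine (L), arginine (R) and serine (S) each get six.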

Ask some one with sickle cell anemia how much trouble just one nonSynonymous SNV can cause — it’s only 1 amino acid out of 147.  Even worse, ask someone with cystic fibrosis where just one of 1,480 amino acids is missing.

Here’s the bad news.  In the population as a whole, they found 500,000 single nucleotide variants (SNVs).  If you’re still not sure what is meant by this, the 5 articles in should be all the background you need.

More than 400,000 of the variants were previously unknown.  Also more than 400,000 of them were found either in Africans or Europeans but not both.  If you divide 500,000 by 2,439 you get 205 variants per person.  However, SNVs are far more common than that, and each individual contains an average of 14,000.

Well, how many of the 500,000 or so SNVs they found are nonSynonymous? One would think about 1/3 statistically.  However, they found more than half (292,125/500,000 — nearly 60%) were nonSynonymous.

It gets worse: 6,165 of the nonSynonymous variants are nonSense codons.  This means that the protein coded for by such a gene terminates prematurely, meaning that it can terminate anywhere.  On average one would expect that half of these nonsense codons result in a protein of less than half the normal length.  This would very likely obliterate whatever function the protein had.

Obviously, they couldn’t test all 500,000 SNVs to see how they affected protein function (and we really only have a decent idea of what half our 20,000 or so proteins are doing).  They had to guess.  They came up with a figure of 2 – 4% of the 14,000 SNVs being functionally significant — That’s 280 – 560 significant mutations per individual.
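For anyone who wants to rerun the numbers in the last few paragraphs, here they are as a Python sketch. Every input is a figure quoted above from the paper; only the variable names are mine.

```python
TOTAL_SNVS = 500_000          # variants found across the whole study population
SUBJECTS = 2_439
NONSYNONYMOUS = 292_125
NONSENSE = 6_165
SNVS_PER_PERSON = 14_000      # average carried by one individual

# Naive division of distinct variants by subjects (far below the real per-person load,
# because most variants are shared by many people).
naive_per_person = TOTAL_SNVS / SUBJECTS
nonsynonymous_fraction = NONSYNONYMOUS / TOTAL_SNVS
nonsense_fraction_of_nonsyn = NONSENSE / NONSYNONYMOUS

# 2 to 4 percent of each person's SNVs guessed to be functionally significant.
significant_low = 0.02 * SNVS_PER_PERSON
significant_high = 0.04 * SNVS_PER_PERSON

print(f"naive variants per person: {naive_per_person:.0f}")
print(f"nonSynonymous fraction: {nonsynonymous_fraction:.1%}")
print(f"nonSense as a share of nonSynonymous: {nonsense_fraction_of_nonsyn:.1%}")
print(f"significant SNVs per person: {significant_low:.0f} to {significant_high:.0f}")
```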

Clearly, despite the horrible examples of cystic fibrosis and sickle cell anemia above, most of these can’t be doing very much, because these were normal people being studied.

There are all sorts of implications of this work.  One is the subject of a future post — how hard this diversity makes drug discovery.  Another reiterates the Tolstoy theme mentioned earlier about the genetic defects causing schizophrenia and autism — “Happy families are all alike; every unhappy family is unhappy in its own way”.  Thus beginneth Anna Karenina.

For details please see  and

A third is that this shows that the 1000 fold expansion of the human population has pretty much obviated much natural selection eliminating these variants.  I’ll leave it to the geneticists to figure out what this means for the eventual survival of the species, as these mutants continue to accumulate.

The paper is fascinating, and sure to change our conception of what a ‘normal’ genome actually is.  Nonetheless, all they did was follow Yogi Berra’s dictum — “You can observe a lot by watching.”   It certainly wasn’t creative or ingenious in any sense.  Sometimes grunt work like this wins the day.