Tag Archives: Exome

Another research idea yours for the taking

How many of our 20,000 or so protein coding genes are essential for human existence?  There is a way to find out with no human experimentation whatsoever.  Even better, probably all the data is out there.  Looking at it the right way, finding and collating it is where you come in.  Be warned, it would be a lot of work.

Previous work [ Science vol. 350 pp. 1028 – 1029, 1092 – 1096, 1096 – 1101 ’15 ] came up with the idea that only 2,000 or so of our protein coding genes were truly essential.  The authors cleverly looked at a ‘near haploid’ chronic myelogenous leukemia cell line (KBM7).  Then because only one copy of a gene was present, they systematically knocked out gene after gene using CRISPR and looked at viability.

Similar work in yeast stated that only 1,000 of its 6,000 protein coding genes were essential.

But this is single cell stuff.  What about living breathing people?

Where is this data?  How should it be interrogated?  See if you can figure it out before reading further.

Probably more has been done since Science vol. 337 pp. 64 – 69 ’12 sequenced just the portion of our genome coding for proteins (the exomes) in 1,351 Europeans and 1,088 Africans.  Each individual had 35 premature termination codons, meaning that the gene likely didn’t produce a functional protein.  The average person also had 13,595 single nucleotide polymorphisms (differences from the standard genome), and probably some of them produced a less than functional protein.

Do you see how you could use this sort of thing to find out which genes are essential to our existence?

People sequence exomes because it’s easy and because the exome accounts for only 2% of our genome.

My guess is that probably a million exomes have been sequenced thus far, if not more.

So all you have to do is look at all million exome sequences and all 20,000 protein coding genes, and see —

In one of the Sherlock Holmes stories the following dialog appears

Gregory (Scotland Yard): “Is there any other point to which you would wish to draw my attention?”
Holmes: “To the curious incident of the dog in the night-time.”
Gregory: “The dog did nothing in the night-time.”
Holmes: “That was the curious incident.”

The curious incident would be a gene which never (or rarely) had a premature termination codon in the 1,000,000 or so exomes.  That would imply that it was essential for the existence of a living breathing human being.
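The dog-that-didn't-bark screen can be sketched in a few lines. This is only a toy illustration under stated assumptions: the gene names, sample IDs and the `never_truncated` helper are all hypothetical, and the input is assumed to be an already-collated list of (person, gene) pairs, one per premature termination codon observed.

```python
def never_truncated(all_genes, ptc_calls, max_carriers=0):
    """Return genes with at most `max_carriers` distinct people carrying
    a premature termination codon (PTC) in that gene."""
    carriers = {}
    for sample_id, gene in ptc_calls:
        # Track distinct carriers so one person with two PTCs in a gene counts once.
        carriers.setdefault(gene, set()).add(sample_id)
    return sorted(g for g in all_genes if len(carriers.get(g, ())) <= max_carriers)


# Hypothetical toy data: four genes, four sequenced people.
all_genes = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
ptc_calls = [
    ("s1", "GENE_A"), ("s2", "GENE_A"),
    ("s1", "GENE_C"), ("s3", "GENE_C"),
    ("s4", "GENE_D"),
]

never = never_truncated(all_genes, ptc_calls)                   # the silent dogs
rarely = never_truncated(all_genes, ptc_calls, max_carriers=1)  # "never or rarely"
print(never, rarely)
```

A real analysis would also have to allow for gene length and cohort size, since a short gene can show no truncations purely by chance.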

Cute !  Well I’m a retired neurologist with no academic affiliation — take the idea and run with it.

Addendum 31 Mar ’19 – I received the following comment from Bryan

You may be interested in reading this pre-print on the topic:
Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes https://www.biorxiv.org/content/10.1101/531210v2

To which I replied
    • Bryan– thanks for the link. It was a good enough idea that the people at the Broad Institute had thought of it and carried it out. As people in grad school used to say when they got scooped on a paper — at least we were thinking well.

      It was hard to tell from reading the preprint whether there were genes with no pLoF (predicted loss of function), which would prove them essential. They do say that the 678 genes essential for human cell viability (characterized by CRISPR screening) were ‘depleted’ for pLoF.



The plural of anecdote IS data

Five years ago I wrote a post on the perils of implicating a gene as the cause of a disease because one or two people with the disease had a mutation there (see the bottom). That is now back in spades with a new report from the Exome Aggregation Consortium (ExAC) [ Nature vol. 536 pp. 249, 277 – 278, 285 – 291 ’16 ].

What they did was to aggregate sequence data from 60,704 people on the parts of their genomes coding for the amino acids making up proteins (the exome; see https://en.wikipedia.org/wiki/Exome). The paper has 80+ authors. The data is publicly available and is planned to grow to 120,000 exomes and 20,000 whole genomes in the next year. Both are orders of magnitude larger than any individual exome study so far. So study enough anecdotes (small studies) and pretty soon you have real data.

The articles state that over a million people have now had either their exomes or their whole genomes sequenced ! ! !

The amount of variation in the human genome is simply incredible. Some 7,404,909 variants in the exome were described, of which 54% had never been seen before. These account for 1/8 of all the sites in all our exomes, implying that the exome comprises 60 megaBases of the 3200 megaBase human genome (1.8%). Most of the variants were single amino acid changes due to changes in a single nucleotide, but there were 317,381 insertions or deletions (95% shorter than 6 nucleotides).

99% of all variants had a frequency of under 1% (i.e. not found in more than 607 people), with half being found only once in the 60,704. 8% of the sites with variation contain more than one variant (consistent with what you’d expect of a Poisson distribution).
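The arithmetic behind the exome size and the Poisson remark can be checked with a short sketch. The assumption here (mine, not the paper's) is that variants fall independently and uniformly over the exome, so the number of variants at a site is Poisson with mean equal to variants per site. Under that assumption roughly 6% of variable sites would carry more than one variant, which is in the same ballpark as the 8% reported.

```python
import math

variants = 7_404_909           # variants reported in the 60,704 exomes
variant_fraction = 1 / 8       # "1/8 of all the sites in all our exomes"
genome_sites = 3_200_000_000   # 3200 megaBases

# Exome size implied by the two figures above (about 59 megaBases).
exome_sites = variants / variant_fraction
exome_share = exome_sites / genome_sites

# Poisson model (an assumption): mean number of variants per exome site.
lam = variants / exome_sites
p_none = math.exp(-lam)
p_one = lam * p_none
p_at_least_one = 1 - p_none
p_two_or_more = p_at_least_one - p_one

# Of the sites showing any variation, the share expected to show more than one variant.
multi_share = p_two_or_more / p_at_least_one

print(f"exome ~ {exome_sites / 1e6:.1f} Mb, {exome_share:.2%} of the genome")
print(f"variable sites with more than one variant ~ {multi_share:.1%}")
```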

What is so remarkable is that the average participant has 54 variants previously classified as responsible for a genetic disorder. Not only that, 183 of 192 variants thought to cause a rare hereditary disease were found in many healthy people, implying that they were incidental findings (anecdotes) rather than causal. It shows you what happens when you have adequate data.

They are pretty sure that their work will stand, because the exomes were sequenced many times over (deeply sequenced, in the lingo): more than 10x in over 80% of the cohort.

I’d also written earlier about how full of errors our genomes are — see https://luysii.wordpress.com/2012/07/31/how-badly-are-thy-genomes-oh-humanity/

A lot of the variants produced termination codons in the body of the exome, so a full-length protein couldn’t be produced from the gene (these are called truncation variants) — some 179,774 in the 7,404,909. Most occurred just once. Even so this means that most of the cohort had at least one or two. Even this rather negative knowledge was useful — since we have about 20,000 protein coding genes, they found 3,230 in which truncation variants NEVER occurred, implying that the protein is crucial to survival.


We’ve found the mutation causing your disease — not so fast, says this paper (posted 17 July 2011)

This post takes a while to get to the main points, but hang in there, the results are striking (and disturbing).

First: a bit of history. In the bad old days (any time over about 30 years ago) there was basically only one way to look for a disc in the spinal canal pressing on a nerve producing symptoms (usually pain, followed by numbness and weakness). It was the myelogram, where a spinal tap was done, an oily substance (containing iodine which Xrays don’t penetrate well) was injected into the spinal canal, and Xrays taken. The disc showed up as a defect in the column of dye (not really a dye as any chemist can see). This usually led to surgery if a disc was found, even if it was one or two spinal levels from where clinicians thought it should be based on their examination and other tests such as electromyography (EMG). This was usually put down to anatomic variability. Results were less than perfect.

Myelography was a rather stressful procedure, and I usually brought patients into the hospital the night before and got a cardiogram (to make sure their heart could take it, and that they hadn’t had a silent heart attack). Then came the myelography itself, which wasn’t painful as the radiologist put the needle in under fluoroscopy so they could see exactly where to go. However many people got severe post-spinal headaches (invariably doctors’ wives), sometimes requiring a blood patch to plug the hole where the (large) needle used to inject the ‘dye’ went; it had to be large because the ‘dye’ was rather oily (viscous). The bottom line was that you didn’t subject a patient to a myelogram unless they were having a significant problem. Only very symptomatic people had the test, and usually when nonsurgical therapy had been tried and failed.

Fast forward to the MRI (Magnetic Resonance Imaging) era (nuclear magnetic resonance to the chemist, but radiologists were smart enough to get the word nuclear removed so patients would submit to the test). A painless technique, but stressful for some because of the close quarters in the MRI machine. You could look at the whole spinal canal, and see far more anatomic detail, because you actually see the disc (rather than its impression on a column of dye) and the surrounding bones, ligaments etc. etc.

What did we find? There were tons of people with discs where they shouldn’t be (e.g. herniated discs) who were having no problems at all. This led to a lot more careful assessment of patients, with far better correlation of anatomic defect and clinical symptoms.

What in the world does this all have to do with the genetics of disease? Patience; you’re about to find out.

There’s an interesting interview with Eric Lander (of Human Genome Project fame) in the current PNAS (p. 11319). He notes that in 1990 sequencing a single genome cost $3,000,000,000. He thinks that at some time in the next 5 years we’ll be able to do this for $1,000, a 3 million-fold improvement in cost. The genome has around 3,000,000,000 positions to sequence. As things stand now, it’s literally nothing to determine the sequence of a few million positions in DNA.

On to Cell vol. 145 pp. 1036 – 1048 ’11 which sequenced some 9,000,000 positions of DNA. This didn’t make a big splash (but its implications might). Just a single paper, buried in the middle of the 24 June ’11 Cell — it didn’t even rate an editorial. Now, as chemists, if you’re a bit shaky on what follows, all the background you need can be found in the series of articles found here –https://luysii.wordpress.com/category/molecular-biology-survival-guide/

As a neurologist, I treated a lot of patients with epilepsy (recurrent convulsions, recurrent seizures). 2% of children and 1% of adults have it (meaning that half of the kids with it will outgrow it, as did the wife of an old friend I saw this afternoon). Some forms of epilepsy run in families with strict inheritance (like sickle cell anemia or cystic fibrosis). 20 such forms have been tied down to single nucleotide polymorphisms (SNPs) in 20 different genes coding for protein (there are other kinds of genes; all is explained in the background material above). 17/20 of these SNPs are in a type of protein known as an ion channel. These channels are present in all our cells, but in neurons they are responsible for the maintenance of a potential across the cell membrane, which has the ability to change abruptly, causing a nerve cell to fire an impulse. In a very simplistic way, one can regard a convulsion (epileptic seizure) as nerve cells gone wild, firing impulses without cease, until the exhausted neurons shut down and the seizure ends.

However, the known strictly hereditary forms of epilepsy account for at most 1 – 2% of all people with epilepsy. The 9,000,000 determinations of DNA sequence were performed on 237 ion channel genes, but just those parts of the genes actually coding for amino acids (these are the exons). They studied 152 people with nonhereditary epilepsy (also known as idiopathic epilepsy) and, most importantly, they looked at the same channels in 139 healthy normal people with no epilepsy at all.

Looking at the 17/237 ion channels known to cause strictly hereditary epilepsy they found that 96% of cases of nonhereditary (idiopathic) epilepsy had one or more missense mutations (an amino acid at a given position different than the one that should be there). Amazingly, 70% of normal people also had missense mutations in the 17. Looking at the broader picture of all 237 channels, they found 300 different mutations in the 139 normals, of which 23 were in the 17. Overall they found 989 SNPs in all the channels in the whole group, of which 415 were nonsynonymous.

Well what about mutational load? Suppose you have more than one mutation in the 17 genes. 77% of the cases with idiopathic epilepsy had 2 or more mutations in the 17, but so did 30% of the people without epilepsy at all.
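To see how little such a finding tells you about one person, here is a back-of-the-envelope Bayes calculation. It is a sketch under loud assumptions: the 1% adult prevalence quoted above (which covers all epilepsy) is used as a stand-in for idiopathic epilepsy, and the study's percentages are treated as if they held for the whole population. The function name is mine. On those assumptions, finding a mutation in the 17 genes moves the chance of epilepsy from 1% to about 1.4%, and finding two or more moves it to about 2.5%.

```python
def prob_epilepsy_given_finding(rate_in_cases, rate_in_normals, prevalence):
    """Bayes' rule: chance that a person with the genetic finding has epilepsy."""
    with_finding_and_epilepsy = rate_in_cases * prevalence
    with_finding_no_epilepsy = rate_in_normals * (1 - prevalence)
    return with_finding_and_epilepsy / (with_finding_and_epilepsy + with_finding_no_epilepsy)


prevalence = 0.01  # assumed: the 1% adult figure, applied to idiopathic epilepsy

# One or more missense mutations in the 17 genes: 96% of cases, 70% of normals.
any_mutation = prob_epilepsy_given_finding(0.96, 0.70, prevalence)

# Two or more mutations in the 17 genes: 77% of cases, 30% of normals.
two_or_more = prob_epilepsy_given_finding(0.77, 0.30, prevalence)

print(f"one or more mutations: {any_mutation:.1%}")
print(f"two or more mutations: {two_or_more:.1%}")
```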

The relation between myelography and early genetic work on disease should be clear. Back then, a lot was taken as abnormal as only the severely afflicted could be studied, due to time, money and technological constraints. As the authors note “causality cannot be assigned to any particular variant”. Many potentially pathogenic genetic variants in known dominant channel genes are present in normals.

What was not clear to me from reading the paper is whether any of the previously described mutations in the 17 that are thought to cause strictly hereditary epilepsy were present in the 139 normals.

A very interesting point is how genetically diverse the human population actually is (and they only studied Caucasians and Hispanics, apparently no Blacks). No individual was free of SNPs. No two individuals (in the 139 + 152) had the same set of SNPs. Since they found 989 SNPs in the combined group, even in this small sample of proteins (237 of 20,000) this averages out to more than 3 per individual. Well, are there ‘good’ SNPs in the asymptomatic group, and ‘bad’ SNPs in the patients with idiopathic epilepsy? Not really, the majority of the SNPs were present in both groups.

I leave it to your imagination what this means for ‘personalized medicine’. We’re literally just beginning to find out what’s out there. This is the genetic analog of the asymptomatic disc. We may not know all we thought we knew about genetics and disease. Heisenberg must be smiling, wherever he is.