The plural of anecdote IS data

Five years ago I wrote a post on the perils of implicating a gene as the cause of a disease because one or two people with the disease had a mutation there (see the bottom). That is now back in spades with a new report from the Exome Aggregation Consortium (ExAC) [ Nature vol. 536 pp. 249, 277 – 278, 285 – 291 ’16 ].

What they did was to aggregate sequence data from 60,704 people on the parts of their genomes coding for the amino acids making up proteins (the exome). The paper has 80+ authors. The data is publicly available and is planned to grow to 120,000 exomes and 20,000 whole genomes in the next year. Both are orders of magnitude larger than any individual exome study so far. So study enough anecdotes (small studies) and pretty soon you have real data.

The articles state that over a million people have now had either their exomes or their whole genomes sequenced!!!

The amount of variation in the human genome is simply incredible. Some 7,404,909 variants in the exome were described, of which 54% had never been seen before. These account for 1/8 of all the sites in all our exomes, implying that the exome comprises about 60 megaBases of the 3200 megaBase human genome (about 1.9%). Most of the variants were single amino acid changes due to changes in a single nucleotide, but there were 317,381 insertions or deletions (95% shorter than 6 nucleotides).
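The genome-fraction arithmetic above can be checked in a few lines of Python. This is only a sketch of the back-of-envelope reasoning: the variable names are mine, and it assumes (as the paragraph does) that one exome site in eight is variant and that the genome is 3200 megabases.

```python
# Back-of-envelope check of the exome size implied by the variant count.
# Assumption (from the text): variants occupy about 1 in 8 exome sites.
variants = 7_404_909
sites_per_variant = 8
genome_megabases = 3200  # genome size used in the text

exome_megabases = variants * sites_per_variant / 1e6
exome_fraction = exome_megabases / genome_megabases

print(f"exome ~ {exome_megabases:.1f} Mb")
print(f"fraction of genome ~ {exome_fraction:.2%}")
```

The unrounded product is a bit over 59 megabases; rounding up to 60 gives 60/3200 = 1.875%.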

99% of all variants had a frequency of under 1% (i.e. found in no more than 607 people), with half being found only once in the 60,704. 8% of the sites with variation contain more than one variant (consistent with what you’d expect of a Poisson distribution).
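Here is a rough sketch of why 8% is in the right range. It assumes variants land on sites independently, so that the number of variants per site is Poisson. The rate is my own back-calculation from the 1-in-8 figure above, not something the paper states.

```python
import math

# Assumption: variants per site ~ Poisson(lam), with lam chosen so that
# the chance a site has at least one variant is 1/8 (figure from the text).
p_variant_site = 1 / 8
lam = -math.log(1 - p_variant_site)

p_zero = math.exp(-lam)
p_one = lam * math.exp(-lam)
p_two_or_more = 1 - p_zero - p_one

# Among sites that vary at all, the share carrying more than one variant.
share_multi = p_two_or_more / p_variant_site
print(f"lambda = {lam:.4f}")
print(f"share of variant sites with >1 variant = {share_multi:.1%}")
```

Under these assumptions the share comes out near 6.5%, the same ballpark as the reported 8%. A simple model like this ignores the fact that some sites mutate far more readily than others, which would push the real figure up.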

What is so remarkable is that the average participant has 54 variants previously classified as responsible for a genetic disorder. Not only that, 183 of 192 variants thought to cause a rare hereditary disease were found in many healthy people, implying that they were incidental findings (anecdotes) rather than causal. It shows you what happens when you have adequate data.

They are pretty sure that their work will stand, because the exomes were sequenced many times over (deeply sequenced, in the lingo): more than 10x in over 80% of the cohort.

I’d also written earlier about how full of errors our genomes are — see

A lot of the variants produced termination codons in the body of the exome, so a full-length protein couldn’t be produced from the gene (these are called truncation variants) — some 179,774 in the 7,404,909. Most occurred just once. Even so this means that most of the cohort had at least one or two. Even this rather negative knowledge was useful — since we have about 20,000 protein coding genes, they found 3,230 in which truncation variants NEVER occurred, implying that the protein is crucial to survival.


We’ve found the mutation causing your disease — not so fast, says this paper (posted 17 July 2011)

This post takes a while to get to the main points, but hang in there, the results are striking (and disturbing).

First: a bit of history. In the bad old days (any time over about 30 years ago) there was basically only one way to look for a disc in the spinal canal pressing on a nerve producing symptoms (usually pain, followed by numbness and weakness). It was the myelogram, where a spinal tap was done, an oily substance (containing iodine which Xrays don’t penetrate well) was injected into the spinal canal, and Xrays taken. The disc showed up as a defect in the column of dye (not really a dye as any chemist can see). This usually led to surgery if a disc was found, even if it was one or two spinal levels from where clinicians thought it should be based on their examination and other tests such as electromyography (EMG). This was usually put down to anatomic variability. Results were less than perfect.

Myelography was a rather stressful procedure, and I usually brought patients into the hospital the night before, got a cardiogram (to make sure their heart could take it, and that they hadn’t had a silent heart attack). Then the myelography itself, which wasn’t painful as the radiologist put the needle in under fluoroscopy so they could see exactly where to go. However many people got severe post-spinal headaches (invariably doctors’ wives), sometimes requiring a blood patch to plug the hole where the (large) needle used to inject the ‘dye’ went; it had to be large because the ‘dye’ was rather oily (viscous). The bottom line was that you didn’t subject a patient to a myelogram unless they were having a significant problem. Only very symptomatic people had the test, and usually when nonsurgical therapy had been tried and failed.

Fast forward to the MRI (Magnetic Resonance Imaging) era (nuclear magnetic resonance to the chemist, but radiologists were smart enough to get the word nuclear removed so patients would submit to the test). A painless technique, but stressful for some because of the close quarters in the MRI machine. You could look at the whole spinal canal, and see far more anatomic detail, because you actually see the disc (rather than its impression on a column of dye) and the surrounding bones, ligaments etc. etc.

What did we find? There were tons of people with discs where they shouldn’t be (i.e. herniated discs) who were having no problems at all. This led to a lot more careful assessment of patients, with far better correlation of anatomic defect and clinical symptoms.

What in the world does this all have to do with the genetics of disease? Patience; you’re about to find out.

There’s an interesting interview with Eric Lander (of Human Genome Project fame) in the current PNAS (p. 11319). He notes that in 1990 sequencing a single genome cost $3,000,000,000. He thinks that at some time in the next 5 years we’ll be able to do this for $1,000, a 3 million-fold improvement in cost. The genome has around 3,000,000,000 positions to sequence. As things stand now, it’s literally nothing to determine the sequence of a few million positions in DNA.

On to Cell vol. 145 pp. 1036 – 1048 ’11 which sequenced some 9,000,000 positions of DNA. This didn’t make a big splash (but its implications might). Just a single paper, buried in the middle of the 24 June ’11 Cell — it didn’t even rate an editorial. Now, as chemists, if you’re a bit shaky on what follows, all the background you need can be found in the series of articles found here –

As a neurologist, I treated a lot of patients with epilepsy (recurrent convulsions, recurrent seizures). 2% of children and 1% of adults have it (meaning that half of the kids with it will outgrow it, as did the wife of an old friend I saw this afternoon). Some forms of epilepsy run in families with strict inheritance (like sickle cell anemia or cystic fibrosis). 20 such forms have been tied down to single nucleotide polymorphisms (SNPs) in 20 different genes coding for protein (there are other kinds of genes; all is explained in the background material above). 17/20 of these SNPs are in a type of protein known as an ion channel. These channels are present in all our cells, but in neurons they are responsible for the maintenance of a potential across the cell membrane, which has the ability to change abruptly, causing a nerve cell to fire an impulse. In a very simplistic way, one can regard a convulsion (epileptic seizure) as nerve cells gone wild, firing impulses without cease, until the exhausted neurons shut down and the seizure ends.

However, the known strictly hereditary forms of epilepsy account for at most 1 – 2% of all people with epilepsy. The 9,000,000 determinations of DNA sequence were performed on 237 ion channel genes, but just those parts of the genes actually coding for amino acids (these are the exons). They studied 152 people with nonhereditary epilepsy (also known as idiopathic epilepsy) and, most importantly, they looked at the same channels in 139 healthy normal people with no epilepsy at all.

Looking at the 17/237 ion channels known to cause strictly hereditary epilepsy they found that 96% of cases of nonhereditary (idiopathic) epilepsy had one or more missense mutations (an amino acid at a given position different than the one that should be there). Amazingly, 70% of normal people also had missense mutations in the 17. Looking at the broader picture of all 237 channels, they found 300 different mutations in the 139 normals, of which 23 were in the 17. Overall they found 989 SNPs in all the channels in the whole group, of which 415 were nonsynonymous.

Well, what about mutational load? Suppose you have more than one mutation in the 17 genes. 77% of the cases with idiopathic epilepsy had 2 or more mutations in the 17, but so did 30% of the people without epilepsy at all.

The relation between myelography and early genetic work on disease should be clear. Back then, a lot was taken as abnormal as only the severely afflicted could be studied, due to time, money and technological constraints. As the authors note “causality cannot be assigned to any particular variant”. Many potentially pathogenic genetic variants in known dominant channel genes are present in normals.

What was not clear to me from reading the paper is whether any of the previously described mutations in the 17 that are thought to cause strictly hereditary epilepsy were present in the 139 normals.

A very interesting point is how genetically diverse the human population actually is (and they only studied Caucasians and Hispanics; apparently no Blacks). No individual was free of SNPs. No two individuals (in the 139 + 152) had the same set of SNPs. Since they found 989 SNPs in the combined group, even in this small sample of proteins (237 of 20,000) this averages out to more than 3 per individual. Well, are there ‘good’ SNPs in the asymptomatic group, and ‘bad’ SNPs in the patients with idiopathic epilepsy? Not really, the majority of the SNPs were present in both groups.

I leave it to your imagination what this means for ‘personalized medicine’. We’re literally just beginning to find out what’s out there. This is the genetic analog of the asymptomatic disc. We may not know all we thought we knew about genetics and disease. Heisenberg must be smiling, wherever he is.



  • Imaging guy  On August 23, 2016 at 2:51 pm

    60 million base pairs (Mb or mega-basepairs) figure you obtained from multiplying 7,404,909 variants with 8 is based on diploid genome which has 6400 Mb. So exome occupies ~ 1% of human genome. Since the authors talk about homozygous individuals and haploinsufficiency, it is quite clear that they have sequenced both chromosomes. (What I don’t understand is why 3 billion base pairs (3000 Mb) figure is frequently given as the size of human genome in many papers and on websites. 3000 Mb is just for haploid genome).

    You wrote that “some 179,774 in the 7,404,909. Most occurred just once. Even so this means that most of the cohort had at least one or two”. That is true only for singleton protein truncating variants or PTVs (121,309). Non-singleton PTVs (58,435 out of 121,309 PTVs) are quite common. They said, “this corresponds to an average of 85 heterozygous and 35 homozygous PTVs per individual”. So in this cohort each individual has 35 genes completely knocked out and 85 partially knocked out. Previously MacArthur (1) reported, “we estimate that human genomes typically contain ~100 genuine LoF variants with ~20 genes completely inactivated”.

    1) A Systematic Survey of Loss-of-Function Variants in Human Protein-Coding Genes (PMID: 22344438)

  • luysii  On August 23, 2016 at 7:08 pm

    Imaging guy, thanks for commenting. I saw the same sentence that you did about 58,435 out of 121,309 PTVs being common. It clashed with the following sentence. If 60,704 people had 85 heterozygous PTV mutations, that’s 5,159,840 in the total group and these are being picked from a group of 58,435 common ones. So EACH of these 58,435 PTVs must occur roughly 90 times in the group (on average). This didn’t make a lot of sense to me so I left it out.

    If you look at the link in the post — — you’ll find that it comments on work similar to MacArthur’s.

    With respect to the 3200 megabase genome, this is for basePAIRS which is 6,400 nucleotide positions as you note.

  • Imaging guy  On August 24, 2016 at 5:06 am

    Here I am not talking about nucleotide positions or just bases. The size of human genome in a normal diploid cell is 6.4 billion base pairs. In sperm and egg cells which are haploid, the size is 3.2 billion base pairs. But when you google “the size of human genome” or “human genome”, many websites give a figure of 3 billion though Wikipedia page of “Human Genome” gives the correct figure in its info box.
    1) This Nature website also gives the correct figure. “The haploid human genome contains approximately 3 billion base pairs of DNA packaged into 23 chromosomes. Of course, most cells in the body (except for female ova and male sperm) are diploid, with 23 pairs of chromosomes. That makes a total of 6 billion base pairs of DNA per cell.”

    2) This is what “Mapping and Sequencing the Human Genome” published by US National Research Council said, “The diploid human genome is thus composed of 46 DNA molecules of 24 distinct types. Because human chromosomes exist in pairs that are almost identical, only 3 billion nucleotide pairs (the haploid genome) need to be sequenced to gain complete information concerning a representative human genome. The human genome is thus said to contain 3 billion nucleotide pairs, even though most human cells contain 6 billion nucleotide pairs”.

    1) Annunziato, A. (2008) DNA Packaging: Nucleosomes and Chromatin. Nature Education 1(1):26

  • Imaging guy  On August 24, 2016 at 6:38 am

    “I saw the same sentence that you did about 58,435 out of 121,309 PTVs being common. It clashed with the following sentence. If 60,704 people had 85 heterozygous PTV mutations, that’s 5,159,840 in the total group and these are being picked from a group of 58,435 common ones. So EACH of these 58,435 PTVs must occur roughly 90 times in the group (on average). This didn’t make a lot of sense to me so I left it out.”

    Actually it makes sense. Just think of this way. You can get 5× 10^276 combinations of 85 PTVs from 58,435 PTVs. Even if you have 90 PTVs, you can get 44 million combinations of 85 PTVs and if you have 88 PTVs, you can get 109,736 combinations of 85 PTVs, which exceeds the size of the cohort.

    The legend in figure 5b says, “Across all populations, most PTVs found in a given individual are common (> 5% allele frequency)”.
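The counts traded in this exchange are easy to check. Here is a minimal sketch using Python's standard `math.comb`. The variable names are illustrative, and the figures (60,704 people, 85 heterozygous PTVs each, 58,435 non-singleton PTVs) are the ones quoted in the comments above.

```python
import math

cohort = 60_704
ptvs_per_person = 85
common_ptvs = 58_435  # non-singleton PTVs, as quoted in the comments

# Average number of carriers each non-singleton PTV would need.
total_observations = cohort * ptvs_per_person
avg_per_ptv = total_observations / common_ptvs
print(total_observations, round(avg_per_ptv, 1))

# Number of distinct 85-PTV sets that can be drawn from pools of various sizes.
print(math.comb(88, 85))   # 109736
print(math.comb(90, 85))   # 43949268
print(f"{math.comb(common_ptvs, 85):.1e}")
```

The last line prints a number with an exponent of 276, in line with the figure given in the comment. Even a pool of only 88 PTVs yields more distinct sets of 85 than there are people in the cohort.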

    • luysii  On August 24, 2016 at 7:09 am

      Thanks — do you suppose most of them are in olfactory receptor genes? I know that a lot of them are nonfunctional.
