
A visual proof of the theorem egregium of Gauss

Nothing better illustrates the difference between the intuitive understanding that something is true and being convinced by logic that something is true  than the visual proof of the theorem egregium of Gauss found in “Visual Differential Geometry and Forms” by Tristan Needham and  the 9 step algebraic proof in  “The Geometry of Spacetime” by Jim Callahan.

Mathematicians attempt to tie down the Gulliver of our powerful appreciation of space with Lilliputian strands of logic.

First: some background on the neurology of vision and our perception of space and why it is so compelling to us.

In the old days, we neurologists figured out what the brain was doing by studying what was lost when parts of the brain were destroyed (usually by strokes, but sometimes by tumors or trauma).  This wasn’t terribly logical, as pulling the plug on a lamp plunges you in darkness, but the plug has nothing to do with how the lightbulb or LED produces light.  Even so,  it was clear that the occipital lobe was important — destroy it on both sides and you are blind — but the occipital lobe accounts for only 10% of the gray matter of the cerebral cortex.

The information flowing into your brain from your eyes is enormous. The optic nerve connecting the eyeball to the brain has a million fibers, and they can fire up to 500 times a second. If each firing (nerve impulse) is a bit, then that’s an information flow into your brain of a gigabit/second. This information is highly processed by the neurons and receptors in the 10 layers of the retina. Over 30 retinal cell types in our retinas are known, each responding to a different aspect of the visual stimulus. For instance, there are cells responding to color, to movement in one direction, to a light stimulus turning on, to a light stimulus turning off, etc. etc.
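As a back-of-the-envelope check on that figure, here is a minimal sketch in Python. It keeps the simplification above that each nerve impulse carries one bit, and it assumes the total counts both optic nerves (one per eye). The variable names are mine, purely for illustration.

```python
# Rough upper bound on visual information flow, assuming 1 bit per impulse.
FIBERS_PER_OPTIC_NERVE = 1_000_000   # figure quoted in the text
MAX_FIRING_RATE_HZ = 500             # "up to 500 times a second"
OPTIC_NERVES = 2                     # assumption: count both eyes

bits_per_second = FIBERS_PER_OPTIC_NERVE * MAX_FIRING_RATE_HZ * OPTIC_NERVES
gigabits_per_second = bits_per_second / 1e9

print(f"{gigabits_per_second:.1f} gigabit/second")
```

This is an upper bound only, since not every fiber fires at its maximum rate all the time.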

So how does the relatively small occipital lobe deal with this? It doesn’t. At least half of your brain responds to visual stimuli. How do we know? It’s complicated, but something called functional Magnetic Resonance Imaging (fMRI) is able to show us increased neuronal activity primarily by the increase in blood flow it causes.

Given that half of your brain is processing what you see, it makes sense to use it to ‘see’ what’s going on in Mathematics involving space.  This is where Tristan Needham’s books come in.

I’ve written several posts about them.

and Here —



OK, so what is the theorem egregium? Look at any object (say a banana). You can see how curved it is by just looking at its surface (i.e. how it looks in the 3 dimensional space of our existence). Gauss showed that you don’t have to even look at an object in 3 space, just perform local measurements on the surface (using the distance between surface points, i.e. the metric, a.k.a. the metric tensor). Curvature is intrinsic to the surface itself, and you don’t have to get outside of the surface (as we are) to find it.



The idea (and mathematical machinery) has been extended to the 3 dimensional space we live in (something we can’t get outside of).  Is our  universe curved or not? To study the question is to determine its intrinsic curvature by extrapolating the tools Gauss gave us to higher dimensions and comparing the mathematical results with experimental observation. The elephant in the room is general relativity which would be impossible without this (which is why I’m studying the theorem egregium in the first place).


So how does Callahan phrase and prove the theorem egregium? He defines curvature as the ratio of the area of a patch on the unit sphere (the one traced out by the unit normals of a small patch on the surface) to the area of that small patch on the surface. If you took some vector calculus, you’ll know that the area of the parallelogram spanned by two non-collinear vectors is the magnitude of their cross product.
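For anyone whose vector calculus is rusty, here is a minimal sketch in Python of the area fact just mentioned. The function names are my own and the vectors are made up for illustration; nothing here is taken from Callahan's book.

```python
import math

def cross(u, v):
    """Cross product of two 3-dimensional vectors given as tuples."""
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )

def parallelogram_area(u, v):
    """Area of the parallelogram spanned by u and v: the length of u x v."""
    return math.sqrt(sum(c * c for c in cross(u, v)))

# A 1 by 2 rectangle lying in the xy plane has area 2.
print(parallelogram_area((1, 0, 0), (0, 2, 0)))
# Collinear vectors span no area at all.
print(parallelogram_area((1, 1, 1), (2, 2, 2)))
```

The curvature definition then amounts to computing two such areas, one on the unit sphere and one on the surface, and taking their ratio.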



The vectors Callahan needs for the cross product are the normal vectors to the surface.  Herein beginneth the algebra. Callahan parameterizes the surface in 3 space from a region in the plane, uses the metric of the surface to determine a formula for the normal vector to the surface  at a point (which has 3 components  x , y and z,  each of which is the sum of 4 elements, each of which is the product of a second order derivative with a first order derivative of the metric). Forming the cross product of the normal vectors and writing it out is an algebraic nightmare.  At this point you know you are describing something called curvature, but you have no clear conception of what curvature is.  But you have a clear definition in terms of the ratio of areas, which soon disappears in a massive (but necessary) algebraic fandango.



On pages 258 – 262 Callahan breaks down the proof into 9 steps involving various mathematical functions of the metric and its derivatives such as  Christoffel symbols,  the Riemann curvature tensors etc. etc.  It is logically complete, logically convincing, and shows that all this mathematical machinery arises from the metric (intrinsic to the surface) and its derivatives (some as high as third order).



For this we all owe Callahan a great debt. But unfortunately, although I believe it, I don’t see it. This certainly isn’t to denigrate Callahan, who has helped me through his book, and whom I consider a friend, as I’ve drunk beer with him and his wife while listening to Irish music in a dive bar north of Amherst.



Callahan’s proof is the way Gauss himself did it, and Callahan told me that Gauss didn’t have the notational tools we have today, making the theorem even more outstanding (egregious).


Well now, on to Needham’s geometrical proof. Disabuse yourself of the notion that it won’t involve much intellectual work on your part, even though it uses the geometric intuition you were born with (the green glasses of Immanuel Kant).


Needham’s definition of curvature uses angular excess of a triangle.  Angles are measured in radians, which is the ratio of the arc subtended by the angle to the radius of the circle (not the circumference as I thought I remembered).  Since the circumference of a circle is 2*pi*radius, radian measure varies from 0 to 2*pi.   So a right angle is pi/2 radians.


Here is a triangle with angular excess. Start with a sphere of radius r. Go to the north pole and drop a longitude down to the equator. It meets the equator at a right angle (pi/2). Go back to the north pole, form an angle of pi/2 with the first longitude, and drop another longitude at that angle which meets the equator at an angle of pi/2. The two points on the equator and the north pole form a triangle, with total internal angles of 3*(pi/2). In plane geometry we know that the angles of a triangle sum to 2*(pi/2) = pi. (Interestingly this depends on the parallel postulate. See if you can figure out why). So the angular excess of our triangle is pi/2. Nothing complicated to understand (or visualize) here.


Needham defines the curvature of the triangle (and any closed area) as the ratio of the angular excess of the triangle to its area.



What is the area of the triangle? Well, the volume of a sphere is (4/3) pi * r^3, and its area is the derivative of that with respect to r (4 pi * r^2). The area of the northern hemisphere is 2 pi * r^2, and the area of the triangle just made is a quarter of that, 1/2 * pi * r^2.



So the curvature of the triangle is (pi/2) / (1/2 * pi * r^2) = 1 / r^2.   More to the point, this is the curvature of a sphere of radius r.
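If you want to check the arithmetic for yourself, here is a minimal sketch in Python of the calculation just done. The function name is mine, and the code only handles this one triangle (north pole plus two equator points a quarter turn apart), not triangles in general.

```python
import math

def octant_triangle_curvature(r):
    """Angular excess divided by area for the triangle described above."""
    angle_sum = 3 * (math.pi / 2)            # three right angles
    excess = angle_sum - math.pi             # flat triangles sum to pi
    sphere_area = 4 * math.pi * r ** 2
    triangle_area = sphere_area / 8          # a quarter of one hemisphere
    return excess / triangle_area

for r in (1.0, 2.0, 10.0):
    print(r, octant_triangle_curvature(r), 1 / r ** 2)
```

The last two columns agree for every radius: the bigger the sphere, the flatter it is.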



At this point you should have a geometric intuition of just what curvature is, and how to find it.  So when you are embroiled in the algebra in higher dimensions trying to describe curvature there, you will have a mental image of what the algebra is attempting to describe, rather than just the symbols and machinations of the algebra itself (the Lilliputian strands of logic tying down the Gulliver of curvature).


The road from here to the Einstein gravitational field equations (p. 326 of Needham), one I haven’t so far traversed, is presently about 50 pages. Just to get to this point, however, you have been exposed to comprehensible geometrical expositions of geodesics, holonomy, parallel transport and vector fields, and you should have mental images of them all. Interested? Be prepared to work, and to reorient how you think about these things if you’ve met them before. The 3 links mentioned above will give you a glimpse of Needham’s style. You probably should read them next.

New light on protein folding

Henry Eyring would have loved this paper [ Proc. Natl. Acad. Sci. vol. 119 e2112372118 ’22  —  ] He developed transition state theory.  You can read about it and what Eyring was actually like here —

The paper also gives an excellent history of the intellectual twists and turns of the protein folding problem. It starts with Anfinsen’s work on Ribonuclease (RNAase), which is a rather simple protein. He noted that even when unfolded (denatured), RNAase would spontaneously fold to its native structure. Thus was born the thermodynamic hypothesis of protein structure. Because the native form occurred spontaneously it had to have the lowest free energy of all the possible conformations. This was long before we knew about protein chaperones.

This was followed by the molten globule idea. It was modeled on solid formation from a gas in which a metastable liquid phase precedes solid formation during gas deposition. The molten globule has a high degree of secondary structure (alpha helices, beta sheets), but no fixed arrangement of them relative to each other (i.e. no tertiary structure). To be considered a molten globule, the protein must have an expanded structure relative to the native fully folded protein.

This was followed by the energy landscape theory of protein folding, something I never liked because I never saw a way to calculate the surface of the landscape.  Proteins fold by following the landscape to a lower potential energy, the way a skier follows the mountain down hill. It seems like a high falutin’ way of saying proteins fold, the same way docs say you have idiopathic something or other instead of saying we don’t know what caused what you have.  In the energy landscape theory molten globule intermediates are not necessary.

Then there is the foldon hypothesis: proteins fold following a unique pathway by the cooperative and sequential formation of native structure domains (i.e. the foldons). Folding amounts to the productive tinkering of amino acids and foldons rather than the diffusion of a protein in a funnel-like energy landscape.

The paper studied Barnase, a 110 amino acid protein which degrades RNA (so much like the original protein Anfinsen studied years ago).  Barnase is highly soluble and very stable making it one of the E. Coli’s of protein folding studies.

The new wrinkle of the paper is that they were able to study the folding and unfolding and the transition state of single molecules of Barnase at different temperatures (an experiment which would have been unlikely for Eyring to even think about doing in 1935 when he developed transition state theory, and yet this is exactly the sort of thing he was thinking about, though not for proteins, whose structure was unknown back then).

The work alluded to in the link to another post above did something similar, except that they used DNA instead of a protein. Here is the relevant part of it.

A polynucleotide hairpin of DNA was connected to double stranded DNA handles in optical traps where it could fluctuate between folded (hairpin) and unfolded (no hairpin) states. They could measure just how far apart the handles were, and in the hairpin state the length appears to be 100 Angstroms (10 nanometers) shorter than the unfolded state.

So they could follow the length vs. time and measure the 50 microseconds or so it took to make the journey across the free energy maximum (i.e. the transition state). A mere 323,495 different transition paths were studied.

For Barnase, the single-molecule measurements allowed them to determine not just the change in free energy (delta G) between the unfolded state (U), the transition state (TS) and the native state (N), but also the changes in enthalpy (delta H) and entropy (delta S) between U and TS and between N and TS.

Remember delta G = delta H - T delta S. A process will occur spontaneously if delta G is negative, which is why an increase in entropy is favorable, and why the decrease in entropy between U and TS is unfavorable.
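To see how the two terms fight each other, here is a minimal sketch in Python. The numbers are hypothetical, chosen by me to make the point; they are not values from the paper.

```python
def delta_g(delta_h, temperature, delta_s):
    """Gibbs free energy change: delta G = delta H - T * delta S."""
    return delta_h - temperature * delta_s

# Hypothetical folding step: favorable enthalpy, unfavorable entropy.
dH = -200.0   # kJ/mol (made-up value)
dS = -0.6     # kJ/(mol*K) (made-up value); negative = more ordered

for T in (298.0, 350.0):   # temperatures in Kelvin
    dG = delta_g(dH, T, dS)
    verdict = "spontaneous" if dG < 0 else "not spontaneous"
    print(f"T = {T} K: delta G = {dG:.1f} kJ/mol ({verdict})")
```

With these made-up numbers the entropy penalty grows with temperature until it outweighs the enthalpy gain, which is one reason measuring at several temperatures lets you separate delta H from delta S.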

Almost all of the entropy decrease between U and N occurs between U and TS. Which makes sense as the transition state is a lot more ordered than the unfolded state. Most of the change in enthalpy occurs on the TS -> N transition.

The results are most consistent with both the energy landscape of Wolynes and the molten globule. They describe the transition state as like a golf course, where there are many positions for the ball (the molten globule), but only one place to go down to the native state. Once the hole is found the protein zooms down to the native state through the potential energy funnel.

Fascinating stuff.

Are the antiVaxers already relatively immune?

As delta, omicron and god knows what other Greek letter variant of the pandemic virus marches through our population, it is time to find out how many of the unvaccinated have actually been infected asymptomatically.  It could well be most of them.  Studies done July 2020, a year and a half ago in New York State (before we even had vaccines) showed high levels of antibodies to the virus.

Do distinguish what an antibody to the virus means from a positive PCR or antigen test.  A positive antibody test means you’ve been infected with the virus at some point — almost certainly it’s long gone (the footprint of the bear is not the actual bear — sounds like Zen).  A positive antigen or PCR test means that the virus is within you now.

At a clinic in Corona, a working-class neighborhood in Queens, more than 68 percent of people tested positive for antibodies to the new coronavirus. At another clinic in Jackson Heights, Queens, that number was 56 percent. But at a clinic in Cobble Hill, a mostly white and wealthy neighborhood in Brooklyn, only 13 percent of people tested positive for antibodies.

Note the date — July of 2020.   Clearly most of these people did not require hospitalization.

So it’s time to look for antibodies in the never vaccinateds (there is no point in looking at the already vaccinated as they should have them).  If the never vaccinated antibody rate is as high as I think it is (>80%), it’s time to stop the lockdowns, the maskings, and the school closures.  Why — because they’ve already been infected and fought off the virus.

It is clear that vaccination will not keep you out of the hospital.

At the end of 25 January the state of Massachusetts had 2,617 people in the hospital with COVID19, 405 in the ICU and 248 intubated. Half of them are described by the Department of Health as ‘fully vaccinated’ (which just means 2 shots as of their definition of September 2021; clearly this should be updated). Probably almost all of these are with omicron.

Although this is bad, it represents a drop from 3,192, 466 and 290 just a week ago.

I doubt that such a study will be done, but it would be useful.

Addendum  Science 28 January 2022  p. 387.  Well how wrong could I be . “A serosurvey he led in Gauteng province, home to one-quarter of South Africa’s population, showed close to 70% of unvaccinated people carried SARS-CoV-2 antibodies at the start of the Omicron wave. In the next survey, he expects that number to have gone up to at least 85%, a level that should prepare South Africa for a post-Omicron future.”

Second Addendum 28 January 2022: Here’s a link to (and a bit more about) the paper on which the preceding paragraph was based. It’s not peer-reviewed yet, but it’s from the S. African Medical Research Council, so it is likely to be valid.

The serosurvey was recent (22 October – 9 December 2021) and just before omicron hit the country. Only 1,319 of the 7,010 people in the study were vaccinated. An amazing 70% of the unvaccinated were seropositive (had antibodies to the pandemic virus). As expected, the vaccinated had a higher seropositive rate (93%), but I thought it would be higher.

This is exactly the sort of study carried out in New York, where the testers went out and grabbed people to get a sample of what was actually going on in the population at large which makes it highly likely to be valid.  We should do a similar study in the USA.

What would really be terrific (but I don’t think it exists yet) would be a test for antibodies to omicron which distinguishes them from antibodies to older SARS-CoV-2 variants.

There is some evidence that vaccination protects against severe infections with omicron, so leave that to the unvaccinated, and leave the rest of us alone.

A premature book review and a 60 year history with complex variables in 4 acts

“Visual Differential Geometry and Forms” (VDGF) by Tristan Needham is an incredible book.  Here is a premature review having only been through the first 82 pages of 464 pages of text.

Here’s why.

While mathematicians may try to tie down the visual Gulliver with Lilliputian strands of logic, there is always far more information in visual stimuli than logic can appreciate. There is no such thing as a pure visual percept (a la Bertrand Russell), as visual processing begins within the 10 layers of the retina and continues on from there. Remember: half your brain is involved in processing visual information. Which is a long winded way of saying that Needham’s visual approach to curvature and other visual constructs is an excellent idea.
Needham loves complex variables and geometry and his book is full of pictures (probably on 50% of the pages).

My history with complex variables goes back over 60 years and occurs in 4 acts.


Act I:  Complex variable course as an undergraduate. Time late 50s.  Instructor Raymond Smullyan a man who, while in this world, was definitely not of it.  He really wasn’t a bad instructor but he appeared to be thinking about something else most of the time.


Act II: Complex variable course at Rocky Mountain College, Billings Montana. Time early 80s. The instructor, an MIT PhD, was excellent. Unfortunately I can’t remember his name. I took complex variables again, because I’d been knocked out for probably 30 minutes the previous year and wanted to see if I could still think about the hard stuff.


Act III: 1999 The publication of Needham’s first book — Visual Complex Analysis.  Absolutely unique at the time, full of pictures with a glowing recommendation from Roger Penrose, Needham’s PhD advisor.  I read parts of it, but really didn’t appreciate it.


Act IV 2021 the publication of Needham’s second book, and the subject of this partial review. Just what I wanted after studying differential geometry with a view to really understanding general relativity, so I could read a classmate’s book on the subject. It is just like VCA, and I got through 82 pages or so before I realized I should go back and go through the relevant parts (several hundred pages) of VCA again, which is where I am now. Euclid is all you need for the geometry of VCA, but any extra math you know won’t hurt.


I can’t recommend both strongly enough, particularly if you’ve been studying differential geometry and physics.  There really is a reason for saying “I see it” when you understand something.


Both books are engagingly and informally written, and I can’t recommend them enough (well at least the first 82 pages of VDGF).


An intentional social and epidemiological experiment

Back in the day, mutations causing disease were called experiments of nature, something I thought cruel because as a budding chemist I regarded experiments as something intentional, and mutations occurring outside the lab are anything but intentional.

Massachusetts entered an intentional social and epidemiological experiment today. I hope it turns out well, but I seriously doubt it.

Unvaccinated people are ‘urged’ to wear masks. The state “advises all unvaccinated residents to continue to wear masks in indoor settings and when they can’t socially distance.” Lots of luck with that.

So I went to the working class cafe where I get coffee every day. A place where irony is unknown. In a 25 x 25 foot space (my guess) were 50 people in 6 booths and about 6 tables, none wearing masks. I doubt that all were vaccinated. Fortunately the cafe staff has all been vaccinated.

The cafe is in an old building with at most a 9 foot ceiling. Service was likely slow because of the crowd. So they’d likely spend 30 – 60 minutes breathing each other’s air, a perfect way to transmit the virus if any of the 50 were infected. For details please see

The clientele is not a healthy lot, and I’d estimate that 40% of those present had BMIs over 30.

For some reason the classic editor won’t let me put in links. So I’m publishing this as is.

Minorities on course to win the Darwin awards

While the plural of anecdote is not data, two episodes this week have me very depressed about the spread of the pandemic virus in the minority community (particularly in Blacks). The first occurred with a very intelligent Black woman who worked in tech support at Comcast and helped us when our internet connection went down. You do not get a job like that unless you’re smart. She’s heard a lot about vaccine side effects and isn’t going to get it. The next was a National Guard woman working for AAA, who won’t get the vaccine unless it’s a military requirement.

At 3 visits to our vaccination site in a town where 45% of the population is Puerto Rican we saw nary a one (except for the guy disinfecting the chairs). I talked to one of the nurses, who said that our experience is typical of what she sees day after day.

One way to make a dent in this is to force hospitals, when reporting COVID19 deaths, to state whether the patient was vaccinated or not. Granted, most COVID19 deaths will be among the unvaccinated at our current levels of vaccination, but as this doesn’t change with increasing vaccination levels, perhaps they will be convinced (but unfortunately after a lot of unnecessary deaths).

This is not written with the old WordPress Editor, but with the new one which I hate. It doesn’t seem to let you put in tabs.

You now have to pay up to get Premium edition to install the classic editor. Although I was initially angered, I’ve been using it for a decade absolutely free, and it’s time to pay up.

The past year

The past year was exactly what practicing clinical neurology from ’67 -’00 was like. Fascinating intellectual material along with impotence in the face of horrible suffering.

Force in physics is very different from the way we think of it

I’m very lucky (and honored) that a friend asked me to read and comment on the galleys of his book. He’s trying to explain some very advanced physics to laypeople (e.g. me). So he starts with force fields, gravitational, magnetic etc. etc. The physicist’s idea of force is so far from the way we usually think of it. Exert enough force long enough and you get tired, but the gravitational force never does, despite moving planets, stars and whole galaxies around.

Then there’s the idea that the force is there all the time whether or not it’s doing something a la Star Wars. Even worse is the fact that force can push things around despite going through empty space where there’s nothing to push on, action at a distance if you will.

You’re in good company if the idea bothers you. It bothered Isaac Newton who basically invented action at a distance. Here he is in a letter to a friend.

“That gravity should be innate inherent & {essential} to matter so that one body may act upon another at a distance through a vacuum without the mediation of any thing else by & through which their action or force {may} be conveyed from one to another is to me so great an absurdity that I beleive no man who has in philosophical matters any competent faculty of thinking can ever fall into it. “

So physicists invented the ether which was physical, and allowed objects to push each other around by pushing on the ether between them. 

But action at a distance without one atom pushing on the next etc. etc. is exactly what an incredible paper found [ Proc. Natl. Acad. Sci. vol. 117 pp. 25445 – 25454 ’20 ].

Allostery is an abstract concept in protein chemistry, far removed from everyday life. Far removed except if you like to breathe, or have ever used a benzodiazepine (Valium, Librium, Halcion, Ativan, Klonopin, Xanax) for anything. Breathing? Really? Yes. Hemoglobin, the red in red blood cells, is really 4 separate proteins bound to each other. Each of the four can bind one oxygen molecule. Binding of oxygen to one of the 4 proteins produces a subtle change in the structure of the other 3, making it easier for another oxygen to bind. This produces another subtle change in structure of the others, making it easier for a third oxygen to bind. Etc.

This is what allostery is: binding of a molecule to one part of a protein causing changes in structure all over the protein.

Neurologists are familiar with the benzodiazepines, using them to stop continuous seizure activity (status epilepticus), treat anxiety (Xanax), or seizures (Klonopin). They all work the same way, binding to a complex of 5 proteins called the GABA receptor, which when it binds Gamma Amino Butyric Acid (GABA) in one place causes negative ions to flow into the neuron, inhibiting it from firing. The benzodiazepines bind to a completely different site, making the receptor more likely to open when it binds GABA. 

The assumption about all allostery is that something binds in one place, pushing the atoms around, which push on other atoms which push on other atoms, until the desired effect is produced. This is the opposite of action at a distance, where an effect is produced without the necessity of physical contact.

The paper studied TetR, a protein containing 203 amino acids. If you’ve ever thought about it, almost all the antibiotics we have come from bacteria, which they use on other bacteria. Since we still have bacteria around, the survivors must have developed a way to resist antibiotics, and they’ve been doing this long before we appeared on the scene. 

TetR helps bacteria resist tetracycline, an antibiotic produced by bacteria. When tetracycline binds to TetR it causes other parts of the protein to change so it binds DNA causing the bacterium, among other things, to make a pump which moves tetracycline out of the cell. Notice that the site where tetracycline binds on TetR is not the business end where TetR binds DNA, just as where the benzodiazepines bind the GABA receptor is not where the ion channel is.

This post is long enough already without describing the cleverness which allowed the authors to do the following. They were able to make TetRs containing every possible mutation of all 203 positions. How many is that? 203 x 19 = 3857 different proteins. Why 19? Because we have 20 amino acids, so there are 19 possible distinct changes at each of the 203 positions in TetR.

Some of the mutants didn’t bind to DNA, implying they were non-functional. The 3 dimensional structure of TetR is known, and they chose 5 of the nonfunctional mutants. Interestingly these were distributed all over the protein.

Then, for each of the 5 mutants they made another 3838 mutants (19 changes at each of the 202 other positions), to see if a mutation in another position would make the mutant functional again. You can see what a tremendous amount of work this was.
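The counting behind these libraries is easy to sketch in Python. This is just my own bookkeeping, assuming 203 positions and 20 amino acids as stated above; the function names are hypothetical and nothing here comes from the authors' methods.

```python
AMINO_ACIDS = 20

def single_mutant_count(positions, alphabet=AMINO_ACIDS):
    """Every position changed to each of the other amino acids, one at a time."""
    return positions * (alphabet - 1)

def second_site_count(positions, alphabet=AMINO_ACIDS):
    """Single changes at every position other than the one already mutated."""
    return (positions - 1) * (alphabet - 1)

positions = 203          # length of TetR as given in the text
dead_mutants = 5         # nonfunctional mutants chosen for rescue

print("first library:", single_mutant_count(positions))
print("rescue library per dead mutant:", second_site_count(positions))
print("rescue mutants in total:", dead_mutants * second_site_count(positions))
```

Either way you count, it is thousands of distinct proteins per round, each of which had to be made and tested.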

Here is where it gets really interesting. The restoring mutants (revertants if you want to get fancy) were all over the protein and up to 40 – 50 Angstroms away from the site of the dead mutation. Recall that 1 Angstrom is the size of a hydrogen atom, and a turn of the alpha helix is 5.4 Angstroms long and contains 3.6 amino acids. The revertant mutants weren’t close to the part of the protein binding tetracycline or the part binding to DNA.

Even worse the authors couldn’t find a contiguous path of atom pushing atom pushing atom, to explain why TetR was able to bind DNA again. So there you have it — allosteric action at a distance.

There is much more in the paper, but after all the work they did it’s time to let the authors speak for themselves. “Several important insights emerged from these results. First, TetR exhibits a high degree of allosteric plasticity evidenced by the ease of disrupting and restoring function through several mutational paths. This suggests the functional landscape of allostery is dense with fitness peaks, unlike binding or catalysis where fitness peaks are sparse. Second, allosterically coupled residues may not lie along the shortest path linking allosteric and active sites but can occur over long distances”

But there is still more to think about, particularly for drug development. Normally, in developing a drug for X, we have a particular site on a particular protein as a target, say the site on a neurotransmitter receptor where a neurotransmitter binds. But the work shows that sites far removed from the actual target might have the same effect.

Natural selection yes, but for what?

Groups across the political spectrum don’t like the idea that natural selection operates on us. The left because of the monstrosities produced by social Darwinism and eugenics. The devout because we have supposedly been formed by the creator in his image and further perfection is blasphemous.

Like it or not, there is excellent evidence for natural selection occurring in humans. One of the best is natural selection for the lactase gene.

People with lactose intolerance have nothing wrong with the gene for lactase, which breaks down the sugar lactose found in milk. Babies have no problem with breast milk. The enzyme (lactase) produced from the gene is quite normal in all of us; no mutations are found in the lactase protein. However 10,000 years ago and earlier, cattle were not domesticated, so there was no dietary reason for a human weaned from the breast to make the enzyme. In fact continuing to use energy to make an enzyme that would never get anything to act on is wasteful. The genomes of our ancient ancestors had figured this out. The control region (lactase enhancer) for the lactase gene is 14,000 nucleotides upstream from the gene itself, and back then it shut off after age 8. After domestication of cattle 10,000 or so years ago, a mutation arose changing cytosine to thymine in the enhancer, so that people could digest milk their entire lives. It spread like wildfire because back then our ancestors were in a semi-starved state most of the time, and carriers of the mutation had better nutrition.

Well that was the explanation until a recent paper [ Cell vol. 183 pp. 684 – 701 ’20 ]. It was thought that lacking the mutation you couldn’t use milk past age 8 or so. However sequencing of sites of the herdsmen of the steppes showed that they were using milk a lot (making cheese and yogurt) 8,000 years ago. Our best guess is that the mutation arose 4,000 years ago.

So possibly, the reason it spread wasn’t milk digestion, but something else. Nothing has changed the million nucleotide segment of our genome since the mutation arose — this implies that it was under strong positive natural selection. But for what?

Well, a million nucleotides codes for a lot of stuff, not just the lactase enzyme. Also there is evidence that the mutation is linked to metabolic abnormalities and diseases associated with decreased energy expenditure, such as obesity and type II diabetes, as well as abnormal blood metabolites and lipids.

The region codes for a microRNA (miR-128-1). Knocking it out in mice results in increased energy expenditure and improvement in high fat diet obesity. Glucose tolerance is also improved.

So it is quite possible that what was being selected for was the ‘thrifty gene’ miR-128-1, which would have let our semi-starved ancestors expend less energy and store whatever calories they met as fat.

In cattle a similar (syntenic) genomic region near miR-128-1 has also been under positive selection (by breeders) for feed efficiency and intramuscular fat.

So a mutation producing a selective advantage in one situation is harmful in another.

Another example —

The mutation which allows Tibetans to adapt to high altitude causes a hereditary form of blindness (Leber’s optic atrophy) in people living at sea level. 25% of Tibetans have the mutation. Another example of natural selection operating on man.

Neural nets

The following was not written by me, but by a friend now retired from Bell Labs. It is so good that it’s worth sharing.

I asked him to explain the following paper, which I found incomprehensible despite reading about neural nets for years. The paper tries to figure out why neural nets work so well. The authors note that we lack a theoretical foundation for how neural nets work (or why they should!).

Here’s a link

Here’s what I got back

Interesting paper. Thanks.

I’ve had some exposure to these ideas and this particular issue, but I’m hardly an expert.

I’m not sure what aspect of the paper you find puzzling. I’ll just say a few things about what I gleaned out of the paper, which may overlap with what you’ve already figured out.

The paper, which is really a commentary on someone else’s work, focuses on the classification problem. Basically, classification is just curve fitting. The curve you want defines a function f that takes a random example x from some specified domain D and gives you the classification c of x, that is, c = f(x).

Neural networks (NNs) provide a technique for realizing this function f by way of a complex network with many parameters that can be freely adjusted. You take a (“small”) subset T of examples from D where you know the classification and you use those to “train” the NN, which means you adjust the parameters to minimize the errors that the NN makes when classifying the elements of T. You then cross your fingers and hope that the NN will show useful accuracy when classifying examples from D that it has not seen before (i.e., examples that were not in the training set T). There is lots of empirical hocus-pocus and rules of thumb concerning what techniques work better than others in designing and training neural networks. Research to place these issues on a firmer theoretical basis continues.

You might think that the best way to train a NN doing the classification task is simply to monitor the classifications it makes on the training set vectors and adjust the NN parameters (weights) to minimize those errors. The problem here is that classification output is very granular (discontinuous): cat/dog, good/bad, etc. You need to have a more nuanced (“gray”) view of things to get the hints you need to gradually adjust the NN weights and home in on their “best” setting. The solution is a so-called “loss” function, a continuous function that operates on the output data before it’s classified (while it is still very analog, as opposed to the digital-like classification output). The loss function should be chosen so that lower loss will generally correspond to lower classification error. Choosing it, of course, is not a trivial thing. I’ll have more to say about that later.
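The difference between the granular classification output and a continuous loss can be seen in a small sketch. Everything here is an assumption made for illustration: the six-point training set, the one-weight "network" sigmoid(w*x), and the choice of cross-entropy as the loss. It is not the setup used in the paper.

```python
import math

# Hypothetical 1-D training set: (x, label) pairs with label 0 or 1.
TRAIN = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classification_error(w, data):
    """Fraction of examples whose hard 0/1 prediction is wrong."""
    wrong = sum(1 for x, y in data if (1 if w * x > 0 else 0) != y)
    return wrong / len(data)

def cross_entropy_loss(w, data):
    """Mean cross-entropy of the analog output sigmoid(w * x)."""
    total = 0.0
    for x, y in data:
        p = sigmoid(w * x)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

# Compare the two measures as the single weight w is varied.
for w in (0.1, 0.5, 1.0, 2.0, 4.0):
    err = classification_error(w, TRAIN)
    loss = cross_entropy_loss(w, TRAIN)
    print(f"w={w:<4} error={err:.2f} loss={loss:.4f}")
```

For every positive w in this toy case the classification error is already zero, so it gives no hint about which way to move w. The loss keeps shrinking as w grows, which is the kind of “gray” signal a training procedure can follow.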

One of the supposed truisms of NNs in the “old days” was that you shouldn’t overtrain the network. Overtraining means beating the parameters to death until you get 100% perfect classification on the training set T. Empirically, it was found that overtraining degrades performance: Your goal should be to get “good” performance on T, but not “too good.” Ex post facto, this finding was rationalized as follows: When you overtrain, you are teaching the NN to do an exact thing for an exact set T, so the moment it sees something that differs even a little from the examples in set T, the NN is confused about what to do. That explanation never made much sense to me, but a lot of workers in the field seemed to find it persuasive.

Perhaps a better analogy is the non-attentive college student who skipped lectures all semester and has gained no understanding of the course material. Facing a failing grade, he manages by chicanery to steal a copy of the final exam a week before it’s given. He cracks open the textbook (for the first time!) and, by sheer willpower, manages to ferret out of that wretched tome what he guesses are the correct, exact answers to all the questions in the exam. He doesn’t really understand any of the answers, but he commits them to memory and is now glowing with confidence that he will ace the test and get a good grade in the course.

But a few days before the final exam date the professor decides to completely rewrite the exam, throwing out all the old questions and replacing them with new ones. The non-attentive student, faced with exam questions he’s never seen before, has no clue how to answer these unfamiliar questions because he has no understanding of the underlying principles. He fails the exam badly and gets an F in the course.

Relating the analogy of the previous two paragraphs to the concept of overtraining NNs, the belief was that if you train a NN to do a “good” job on the training set T but not “too good” a job, it will incorporate (in its parameter settings) some of the background knowledge of “why” examples are classified the way they are, which will help it do a better job when it encounters “unfamiliar” examples (i.e., examples not in the training set). However, if you push the training beyond that point, the NN starts to enter the regime where its learning (embodied in its parameter settings) becomes more like the rote memorization of the non-attentive student, devoid of understanding of the underlying principles and ill prepared to answer questions it has not seen before. As I said, I was never sure this explanation made a lot of sense, but workers in the field seemed to like it.
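
The student analogy can be made concrete with a deliberately extreme toy. This is only a sketch under made-up assumptions: the domain (integers classified by whether they are at least 50), the two example lists, and the two "learners" are all hypothetical, and neither learner is a neural network.

```python
def true_label(x):
    """Ground truth for a made-up domain D: is the integer x at least 50?"""
    return 1 if x >= 50 else 0

T = [3, 12, 27, 41, 48, 52, 60, 75, 88, 99]   # training examples
unseen = [0, 20, 45, 49, 50, 55, 70, 95]      # examples from D that are not in T

# "Rote" learner: memorizes the answer for each training example and
# falls back to a fixed guess (0) for anything it has never seen.
memory = {x: true_label(x) for x in T}

def rote(x):
    return memory.get(x, 0)

# "Principled" learner: places a threshold halfway between the largest
# class-0 and the smallest class-1 training example.
largest_zero = max(x for x in T if true_label(x) == 0)
smallest_one = min(x for x in T if true_label(x) == 1)
threshold = (largest_zero + smallest_one) / 2

def rule(x):
    return 1 if x >= threshold else 0

def accuracy(predict, xs):
    return sum(predict(x) == true_label(x) for x in xs) / len(xs)

print("rote on T:     ", accuracy(rote, T))
print("rote on unseen:", accuracy(rote, unseen))
print("rule on T:     ", accuracy(rule, T))
print("rule on unseen:", accuracy(rule, unseen))
```

Both learners are perfect on T, but only the one that extracted a rule stays perfect on the unseen examples. The memorizer is right on just half of them, and only because its fixed guess happens to match the class-0 cases.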

That brings us to “deep learning” NNs, which are really just old-fashioned NNs but with lots more layers and, therefore, lots more complexity. So instead of having just “many” parameters, you have millions. For brevity in what follows, I’ll often refer to a “deep learning NN” as simply a “NN.”

Now let’s refer to Figure 1 in the paper. It illustrates some of the things I said above. The vertical axis measures error, while the horizontal axis measures training iterations. Training involves processing a training vector from T, looking at the resulting value of the loss function, and adjusting the NN’s weights (from how you set them in the previous iteration) in a manner that’s designed to reduce the loss. You do this with each training vector in sequence, which causes the NN’s weights to gradually change to values that (you hope) will result in better overall performance. After a certain predetermined number of training iterations, you stop and measure the overall performance of the NN: the overall error on the training vectors, the overall loss, and the overall error on the test vectors. The last are vectors from D that were not part of the training set.
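
The training procedure just described can be sketched in a few lines of plain Python. The model here is a single logistic unit with three weights, not a deep network, and the data (two Gaussian blobs), the learning rate, the epoch count, and all function names are assumptions chosen for illustration. The point is only the shape of the loop: one training vector per iteration, a small weight adjustment that reduces the loss, then three measurements at the end.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_examples(rng, n_per_class):
    """Draw labeled points from D: two Gaussian blobs (an assumed toy domain)."""
    data = []
    for label, center in ((0, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            x = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
            data.append((x, label))
    rng.shuffle(data)
    return data

def output(weights, x):
    """Analog (pre-classification) output of the model, between 0 and 1."""
    w1, w2, b = weights
    return sigmoid(w1 * x[0] + w2 * x[1] + b)

def mean_loss(weights, data):
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in data:
        p = min(max(output(weights, x), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)

def error_rate(weights, data):
    wrong = sum(1 for x, y in data if (output(weights, x) > 0.5) != (y == 1))
    return wrong / len(data)

def train(train_set, epochs=20, lr=0.1):
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in train_set:             # one training vector per iteration
            grad = output(weights, x) - y  # d(loss)/d(z) for cross-entropy
            weights[0] -= lr * grad * x[0]
            weights[1] -= lr * grad * x[1]
            weights[2] -= lr * grad
    return weights

rng = random.Random(0)
T = make_examples(rng, 40)         # training set
held_out = make_examples(rng, 40)  # test vectors from D, never used in training
w = train(T)
print("training error:", error_rate(w, T))
print("training loss: ", round(mean_loss(w, T), 4))
print("test error:    ", error_rate(w, held_out))
```

Recording these three numbers after each pass through T, rather than only at the end, would give curves of the kind plotted in Figure 1. A model this small should not be expected to reproduce the late drop in test error that the paper discusses.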

Figure 1 illustrates the overtraining phenomenon. Initially, more training gives lower error on the test vectors. But then you hit a minimum, with more training after that resulting in higher error on the test set. In old-style NNs, that was the end of the story. With deep-learning NNs, it was discovered that continuing the training well beyond what was previously thought wise, even into the regime where the training error is at or near zero (the so-called Terminal Phase of Training, or TPT), can produce a dramatic reduction in test error. This is the great mystery that researchers are trying to understand.

You can read the four points in the paper on page 27071, which are posited as “explanations” of—or at least observations of interesting phenomena that accompany—this unexpected lowering of test error. I read points 1 and 2 as simply saying that the pre-classification portion of the NN [which executes z = h(x, theta), in their terminology] gets so fine-tuned by the training that it is basically doing the classification all by itself, with the classifier per se being left to do only a fairly trivial job (points 3 and 4).

To me, this “explanation” misses the point. Here is my two cents’ worth: I think the whole success of this method is critically dependent on the loss function. The latter has to embody, with good fidelity, the “wisdom” of what constitutes a good answer. If it does, then overtraining the deep-learning NN like crazy on that loss function will cause its millions of weights to “learn” that wisdom. That is, the NN is not just learning what the right answer is on a limited set of training vectors, but it is learning the “wisdom” of what constitutes a right answer from the loss function itself. Because of the subtlety and complexity of that latent loss function wisdom, this kind of learning became possible only with the availability of modern deep-learning NNs with their great complexity and huge number of parameters.