Null hacking — Reproducibility and its Discontents — take II

Most scientific types have heard about p hacking, but not null hacking.

Start with p hacking.  It’s just running statistical test after statistical test on your data until you find something that would occur by chance less than 5% of the time (a p value of .05), making it worthy of publication (or at least discussion).
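To see why this works, here’s a minimal Monte Carlo sketch (the function name and numbers are my own, purely illustrative): run 20 independent tests on pure noise, and count how often at least one comes out “significant” by chance alone.

```python
import random

def p_hack_demo(n_tests=20, alpha=0.05, trials=10_000, seed=0):
    """Fraction of experiments in which at least one of n_tests
    null tests comes out 'significant' purely by chance.
    Under the null, each p value is uniform on [0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / trials

print(p_hack_demo())  # close to 1 - .95**20, i.e. roughly 64%
```

So with 20 shots at the data, a “publishable” result turns up in nearly two-thirds of experiments even when nothing is going on.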

It’s not that hard to do, and I faced it day after day as a doc and had to give worried patients a quick lesson in statistics.  The culprit was something called a chem-20, which measured 20 different things (sodium, potassium, cholesterol, kidney tests, liver tests, you name it).  Each of the 20 items had a normal range, defined so that 95% of the values from a bunch of (presumably) normal people would fall within it.  This of course means that 2.5% of all results would be outside the range on the low side, and 2.5% would be outside it on the high side.

Before I tell you, how often would you expect to get a panel in which all 20 tests were normal?

The chance of a single test being normal is .95; two tests, .95 * .95 = .90; 4 tests, .90 * .90 = .81; 8 tests, .81 * .81 = .66; 16 tests, .66 * .66 = .44; 20 tests (16 + 4), .44 * .81 = .36.

Only about a third of the time.
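The chain of squarings is just a quick way of computing .95 raised to the 20th power, which a one-liner confirms:

```python
# Probability that all 20 chem-20 panel results fall in their "normal"
# ranges, assuming each test independently has a 95% chance of being normal.
p_all_normal = 0.95 ** 20
print(f"Chance of an entirely normal chem-20: {p_all_normal:.3f}")  # 0.358
```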

That’s p hacking.  It has been vigorously investigated in the past few years in psychology, because a lot of widely cited results in supposedly high quality journals couldn’t be reproduced.  See the post of 6/16 at the end for the initial work.

It arose because negative results don’t win you fame and fortune and don’t get published as easily.

So there has been a very welcome and salutary effort to see if results could be confirmed — only 39% were — see the copy of the old post at the end.

So all is sweetness and light with the newly found rigor.  Not so fast, says Proc. Natl. Acad. Sci. vol. 116 pp. 25535 – 25545 ’19.  The same pressures that lead investigators to p hack their results into something significant and publishable lead the replicators to null hack their results to win fame and fortune by toppling a psychological statue.

At this point it’s time for a Feynman quote “The first principle is that you must not fool yourself and you are the easiest person to fool.”

The paper talks about degrees of freedom available to the replicator, which in normal language just means how closely you have to match the conditions of the study you are trying to replicate.

Obviously this is impossible for one of the studies and its replication that they discuss: whether the choice of language used in a mailing urging people to vote in an election had any effect on whether they actually voted.  You can’t arrange to rerun the two hard-fought elections of the initial study, in which there was a lot of interest.  The replicators instead chose a bunch of primaries in which interest and turnout were low, casting doubt on their failure to replicate the original result (which was that language DID make a difference in voter turnout).

Then the authors of the PNAS paper reanalyzed the data of the replicators a different way, and found that the original study was replicated.  This is the second large degree of freedom, the choice of the way to analyze the raw data — the same as the original authors or differently — “reasonable people may differ” about these matters.

There’s a lot more in the paper, including something called the Bayesian Causal Forest, a new method of data analysis which the authors favor (and which I confess I don’t understand).

Here’s the old post  of 6/16

Reproducibility and its discontents

“Since the launch of the clinicaltrials.gov registry in 2000, which forced researchers to preregister their methods and outcome measures, the percentage of large heart-disease clinical trials reporting significant positive results plummeted from 57% to a mere 8%”. I leave it to you to speculate why this happened, but my guess is that probably the data were sliced and diced until something of significance was found. I’d love to know what the comparable data is on anti-depressant trials. The above direct quote is from Proc. Natl. Acad. Sci. vol. 113 pp. 6454 – 6459 ’16. The article looked at the 100 papers published in ‘top’ psychology journals, about which much has been written — here’s the reference to the actual paper — Open Science Collaboration (2015) Psychology. Estimating the reproducibility of psychological science. Science 349(6251):aac4716.

The sad news is that only 39% of these studies were reproducible. So why beat a dead horse? The authors came up with something quite useful — they looked at how sensitive to context each of the 100 studies actually was. By context they mean the time of the study (e.g., pre- vs. post-Recession), culture (e.g., individualistic vs. collectivistic culture), the location (e.g., rural vs. urban setting), or the population (e.g., a racially diverse population vs. a predominantly White or Black or Latino population). Their conclusions were that the contextual sensitivity of the research topic was associated with replication success (e.g. the more context sensitive, the less likely it was that the study could be reproduced). This was even after statistically adjusting for several methodological characteristics (e.g., statistical power, effect size, etc. etc). The association between contextual sensitivity and replication success did not differ across psychological subdisciplines.

Addendum 15 June ’16 — Sadly, the best way to say this is — The more likely a study is to be true (replicable), the less likely it is to be generally applicable (i.e., useful).

So this is good. Up to now the results of psychology studies have been reported in the press as of general applicability (particularly those which reinforce the writer’s preferred narrative). Caveat emptor is two millennia old. Carl Sagan said it best — “Extraordinary claims require extraordinary evidence.”