Functional MRI research is a scientific sewer — take 2

You’ve heard of P-hacking, slicing and dicing your data until you get a statistically significant result.  I wrote a post about null-hacking – https://luysii.wordpress.com/2019/12/22/null-hacking-reproducibility-and-its-discontents-take-ii/.  Welcome to the world of pipeline hacking.  Here is a brief explanation of the highly technical field of functional magnetic resonance imaging (fMRI).  Skip to the **** if you know this already.

Chemists use MRI all the time, but they call it Nuclear Magnetic Resonance. Docs and researchers quickly changed the name to MRI because no one would put their head in something with Nuclear in the name.

There are now noninvasive methods to study brain activity in man. The most prominent one is called BOLD (Blood Oxygen Level Dependent), and is based on the fact that blood flow increases way past what is needed with increased brain activity. This was actually noted by Wilder Penfield operating on the brain for epilepsy in the 1930s. When a patient had a seizure on the operating table (they could keep things under control by partially paralyzing the patient with curare) the veins in the area producing the seizure turned red. Recall that oxygenated blood is red while the deoxygenated blood in veins is darker and somewhat blue. This implied that more blood was getting to the convulsing area than it could use.

BOLD depends on slight differences in the way oxygenated hemoglobin and deoxygenated hemoglobin interact with the magnetic field used in magnetic resonance imaging (MRI). The technique has had a rather checkered history, because very small differences must be measured, and there is lots of manipulation of the raw data (never seen in papers) to be done. Ten years ago functional magnetic resonance imaging (fMRI) was called pseudocolor phrenology.

Some sort of task or sensory stimulus is given and the parts of the brain showing increased oxygenated hemoglobin are mapped out. As a neurologist as far back as the 90s, I was naturally interested in this work. Very quickly, I smelled a rat. The authors of all the papers always seemed to confirm their initial hunch about which areas of the brain were involved in whatever they were studying.

****

Well now we know why.  The data produced by an MRI are so extensive and complex that computer programs (pipelines) must be used to make those pretty pictures.  The brain has a volume of about 1,200 cubic centimeters (1,200,000 cubic millimeters).  Each voxel of an MRI (like a pixel on your screen, but about 1 cubic millimeter in volume) basically gives you a number for how much energy is absorbed in that voxel.  The pipelines process those numbers into the pretty pictures you see.

Enter Nature vol. 582 pp. 36 – 37, 84 – 88 ’20 and the Neuroimaging Analysis Replication and Prediction Study (NARPS).  70 different teams were given the raw data from 108 people, each of whom was performing one or the other of two versions of a task designed to study decision making under risk.  The teams were asked to analyze the data to test 9 different hypotheses about what part of the brain should light up in relation to a specific feature of the task.

Now when a doc orders a hemoglobin from the lab, he’s pretty sure that different labs will all give the same result, because they determine hemoglobin by the same method.  Not so for functional MRI.  All 70 teams analyzed the data using different pipelines and workflows.

Was there agreement?  20% of the teams reported a result different from most teams; random would be 50%.  Remember, they all got the same raw data.

From the News and Views commentary on the paper:

“It is unfortunately common for researchers to explore various pipelines to find the ver­sion that yields the ‘best’ results, ultimately reporting only that pipeline and ignoring the others.”

This explains why I smelled a rat 30 years ago.  I call this pipeline hacking.
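
To make “pipeline” concrete, here is a minimal toy sketch (my own illustration, not one of the NARPS pipelines) of how the same simulated voxel data can yield different numbers of “active” voxels depending on an arbitrary analysis choice, in this case how much spatial smoothing is applied before testing.  The data, the smoothing values, and the threshold are all invented for illustration.

```python
# Toy illustration of "pipeline hacking" (not a real fMRI pipeline).
# The same simulated data, run through two pipelines that differ only in the
# amount of spatial smoothing, give different numbers of "active" voxels.
import numpy as np
from scipy import stats
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)

n_subjects, n_voxels = 20, 200
effect = np.zeros(n_voxels)
effect[90:110] = 0.4                                    # weak true effect in a small patch
data = rng.normal(0.0, 1.0, (n_subjects, n_voxels)) + effect

def pipeline(data, smooth_sigma, alpha=0.001):
    """Two-step pipeline: spatial smoothing, then a per-voxel one-sample t-test."""
    smoothed = gaussian_filter1d(data, sigma=smooth_sigma, axis=1)  # smooth across voxels
    _, p = stats.ttest_1samp(smoothed, 0.0, axis=0)                 # test each voxel across subjects
    return p < alpha                                                # voxels declared "active"

for sigma in (1.0, 4.0):
    print(f"smoothing sigma = {sigma}: {pipeline(data, sigma).sum()} voxels called active")
```

The counts typically differ across settings even though the underlying data are identical.  Report only the setting that matches your hunch and you have pipeline hacking.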

Further infelicities in the field can be found in the following posts:

1. It was shown in 2014 that 70% of people having functional MRIs (fMRIs) were asleep during the test, and that until then fMRI researchers hadn’t checked for it. For details please see
https://luysii.wordpress.com/2014/05/18/how-badly-are-thy-researchers-o-default-mode-network/. You don’t have to go to med school to know that the brain functions quite differently in wake and sleep.

2. A devastating report in [ Proc. Natl. Acad. Sci. vol. 113 pp. 7699 – 7600, 7900 – 7905 ’16 ] showed that certain common settings in 3 software packages (SPM, FSL, AFNI) used to analyze fMRI data gave false positive results ‘up to’ 70% of the time. Some 3,500 of the 40,000 fMRI studies in the literature over the past 20 years used these settings. The paper also notes that a bug (now corrected after being used for 15 years) in one of them also led to false positive results.  For details see — https://luysii.wordpress.com/2016/07/17/functional-mri-research-is-a-scientific-sewer/

In fairness to the field, the new work and #1 and #2 represent attempts by workers in fMRI to clean it up.   They’ve got a lot of work to do.

Null hacking — Reproducibility and its Discontents — take II

Most scientific types have heard about p hacking, but not null hacking.

Start with p hacking.  It’s just running statistical test after statistical test on your data until you find something that would occur by chance less than 5% of the time (a p under .05), making it worthy of publication (or at least discussion).
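
Here is a minimal sketch of the idea (a toy simulation of my own, with made-up group sizes and number of tests): run enough tests on pure noise and a few of them will clear p < .05 by chance alone.

```python
# Toy p-hacking simulation: 100 t-tests on data with NO real effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n_tests, n_per_group = 100, 30
false_positives = 0
for _ in range(n_tests):
    a = rng.normal(0.0, 1.0, n_per_group)   # both groups drawn from the same distribution,
    b = rng.normal(0.0, 1.0, n_per_group)   # so any "significant" difference is pure chance
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_tests} tests came out 'significant' with no real effect")
```

Expect roughly 5 of the 100 to come out “significant”.  Report only those and you have a publishable finding about nothing.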

It’s not that hard to do, and I faced it day after day as a doc and had to give worried patients a quick lesson in statistics.  The culprit was something called a chem-20, which measured 20 different things (sodium, potassium, cholesterol, kidney tests, liver tests, you name it).  Each of the 20 items had a normal range in which 95% of the values from a bunch of (presumably) normal people would fall.  This of course means that 2.5% of all results would be outside the range on the low side, and 2.5% would be outside the range on the high side.

Before I tell you, how often would you expect all 20 tests to come back normal?

The chance of a single test being normal is .95, two tests .95 * .95 = .90, 4 tests .90 * .90 = .81, 8 tests .81 * .81 = .66, 16 tests .66 * .66 = .44, 20 tests .44 * .81 = .36.

Only about a third of the time.
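
A couple of lines of arithmetic confirm this (assuming, as above, that the 20 tests are independent):

```python
# Chance that all 20 independent tests fall inside their 95% normal ranges
p_all_normal = 0.95 ** 20
print(f"all 20 normal:           {p_all_normal:.2f}")      # about 0.36
print(f"at least one 'abnormal': {1 - p_all_normal:.2f}")  # about 0.64
```

So roughly two out of three completely healthy people will have at least one “abnormal” value on a chem-20.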

That’s p hacking.  It has been vigorously investigated in the past few years in psychology, because a lot of widely cited results in supposedly high quality journals couldn’t be reproduced.  See the post of 6/16 at the end for the initial work.

It arose because negative results don’t win you fame and fortune and don’t get published as easily.

So there has been a very welcome and salutary effort to see if results could be confirmed — only 39% were — see the copy of the old post at the end.

So all is sweetness and light with the newly found rigor.  Not so fast, says Proc. Natl. Acad. Sci. vol. 116 pp. 25535 – 25545 ’19.  The same pressures that lead investigators to p hack their results to get something significant and publishable lead the replicators to null hack their results to win fame and fortune by toppling a psychological statue.

At this point it’s time for a Feynman quote “The first principle is that you must not fool yourself and you are the easiest person to fool.”

The paper talks about degrees of freedom available to the replicator, which in normal language just means how closely you have to match the conditions of the study you are trying to replicate.

Obviously this is impossible for one of the studies and replications they discuss — whether the choice of language used in a mailing to urge people to vote in an election had any effect on whether they actually voted.  You can’t arrange to rerun the two hard-fought, high-interest elections of the initial study.  Instead the replicators chose a bunch of primaries in which interest and turnout were low, casting doubt on their failure to replicate the original result (which was that language DID make a difference in voter turnout).

Then the authors of the PNAS paper reanalyzed the data of the replicators a different way, and found that the original study was replicated.  This is the second large degree of freedom, the choice of the way to analyze the raw data — the same as the original authors or differently — “reasonable people may differ” about these matters.

There’s a lot more in the paper, including something called the Bayesian Causal Forest, a new method of data analysis which the authors favor (and which I confess I don’t understand).

Here’s the old post of 6/16:

Reproducibility and its discontents

“Since the launch of the clinicaltrials.gov registry in 2000, which forced researchers to preregister their methods and outcome measures, the percentage of large heart-disease clinical trials reporting significant positive results plummeted from 57% to a mere 8%”. I leave it to you to speculate why this happened, but my guess is that probably the data were sliced and diced until something of significance was found. I’d love to know what the comparable data is on anti-depressant trials. The above direct quote is from Proc. Natl. Acad. Sci. vol. 113 pp. 6454 – 6459 ’16. The article looked at the 100 papers published in ‘top’ psychology journals, about which much has been written — here’s the reference to the actual paper — Open Science Collaboration (2015) Psychology. Estimating the reproducibility of psychological science. Science 349(6251):aac4716.

The sad news is that only 39% of these studies were reproducible. So why beat a dead horse? The authors came up with something quite useful — they looked at how sensitive to context each of the 100 studies actually was. By context they mean the time of the study (e.g., pre- vs. post-Recession), culture (e.g., individualistic vs. collectivistic culture), the location (e.g., rural vs. urban setting), or the population (e.g., a racially diverse population vs. a predominantly White or Black or Latino population). Their conclusions were that the contextual sensitivity of the research topic was associated with replication success (e.g. the more context sensitive, the less likely it was that the study could be reproduced). This was even after statistically adjusting for several methodological characteristics (e.g., statistical power, effect size, etc. etc). The association between contextual sensitivity and replication success did not differ across psychological subdisciplines.

Addendum 15 June ’16 — Sadly, the best way to say this is — The more likely a study is to be true (replicable) the more likely it is to be not generally applicable (e.g. useful).

So this is good. Up to now the results of psychology studies have been reported in the press as of general applicability (particularly those which enforce the writer’s preferred narrative). Caveat emptor is two millennia old. Carl Sagan said it best — “Extraordinary claims require extraordinary evidence.”

For an example of data slicing and dicing, please see — https://luysii.wordpress.com/2009/10/05/low-socioeconomic-status-in-the-first-5-years-of-life-doubles-your-chance-of-coronary-artery-disease-at-50-even-if-you-became-a-doc-or-why-i-hated-reading-the-medical-literature-when-i-had-to/