tl;dr: They were instructed to p-hack because their instructors didn't know any better. With p-hacking, you can find statistically significant evidence of anything you want, and often, researchers don't even realize they're p-hacking!
What is "p-hacking"? Wikipedia has a good overview of course:
Now, onto the actual topic:
Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi: Comment on Bem (2011) (I'll refer to it with "Wagenmakers et al. (2011)") thoroughly and clearly exposes the flawed recommendations. The authors break the issues down into 3 problems: Exploration Instead of Confirmation, Fallacy of the Transposed Conditional, and p Values Overstate the Evidence Against the Null. The first problem is most relevant to this post.
Wagenmakers et al. (2011) include a quote from Bem (2000):
Wagenmakers et al.:
A similar concept is the multiple comparisons problem. You can understand this concept quite easily: imagine a large dataset where you can generate hundreds or thousands of hypotheses (in this example, you have generated your hypotheses without looking at the data, contrary to the previous paragraph). The original effect you test will have a valid p-value where the probability of finding a "fake" effect is .05 as desired. But if you add another test after the first one failed to find a significant result, you now have a 1 - (1 - .05)^2 = 0.0975 probability of finding a false significant result. The error rate grows as 1 - (1 - .05)^n where n is the number of statistical tests performed. The probability of finding a fake effect asymptotically grows to 1 as you add more tests, essentially guaranteeing that you will find some statistically significant effect eventually. There is an important distinction in the case where you have multiple comparisons, but none of them are testing hypotheses that you generated by looking at the data. In the latter case, you can't fix it as I mentioned above. In the former, you can use corrections, such as the Bonferroni correction, to adjust for the multiple comparisons problem and make valid statistical conclusions.
This ended up being a much longer post than anticipated, and there is still more to be said, such as examples of bad things to look out for, good things to look out for (pre-registration and Bayesian analysis), and who knows what else. However, I'm tired, so I'll continue in a reply eventually. I'll also include commentary from another paper that mentions Bem (2011) - The garden of forking paths: Why multiple comparisons can be a problem, even when there is no fishing expedition or p-hacking and the research hypothesis was posited ahead of time by Andrew Gelman and Eric Loken (2013).
Boring background; skip if you already know what p-hacking is
I originally started writing this as a response to another MB post, Mediumship — triple blind study, but I decided to make a separate thread since the information is so widely applicable across the woo and pseudoscience world (as well as the real science world, sadly, though to a lesser degree; see Replication crisis).

What is "p-hacking"? Wikipedia has a good overview, of course:

External Quote:
Data dredging, also known as data snooping or p-hacking[1][a] is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives. This is done by performing many statistical tests on the data and only reporting those that come back with significant results.[2] Thus data dredging is also often a misused or misapplied form of data mining.

P-hacking can be boiled down to any statistical analysis that increases the likelihood of, or even guarantees, finding "statistically significant" results - the coveted p < 0.05. To avoid getting into the weeds, I'll just use the colloquial definition of statistically significant: the results you have found are likely "real" and not just a happy accident due to random variation. A simple example shows how easy it is to commit p-hacking unintentionally: you have a hypothesis (the effect of X on outcome Y is greater than 0); you collect your data, which includes independent variable X, dependent variable Y, as well as another independent variable Z; you perform your statistical test on both X and Z because modern software makes it so easy; you look at X's effect and, heartbreakingly, you get p > .05. But you also performed the test on Z, and hooray, that got p < 0.05! You write your paper with Z's effect as the focus, and all is good. Unfortunately, that's p-hacking, because that wasn't your plan, and it is an example of the multiple comparisons problem.
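To see how much that shortcut inflates the error rate, here's a minimal simulation sketch (my own illustration, not from any of the papers discussed here; the variable names, sample size, and choice of correlation tests are all arbitrary). Both X and Z are pure noise with no real effect on Y, yet reporting a result whenever either test comes back significant produces "discoveries" far more often than the advertised 5%:

```python
# Minimal sketch: two candidate predictors (X and Z), neither truly related to Y.
# Counting a "discovery" whenever either test is significant inflates the
# false-positive rate from 5% to roughly 1 - 0.95**2, i.e. about 9.75%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 10_000, 50
false_positives = 0

for _ in range(n_sims):
    x = rng.normal(size=n)  # independent variable X (no true effect)
    z = rng.normal(size=n)  # the extra variable Z (no true effect)
    y = rng.normal(size=n)  # outcome Y, unrelated to both

    _, p_x = stats.pearsonr(x, y)
    _, p_z = stats.pearsonr(z, y)

    # The p-hacked decision rule: report whichever test happened to "work".
    if min(p_x, p_z) < 0.05:
        false_positives += 1

print(false_positives / n_sims)  # about 0.10 rather than the nominal 0.05
```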
Now, onto the actual topic:
Why are p-hacking and other bad stats so common in psychic, psi, NDE, mediumship, etc. research?
Because researchers were literally instructed to do so! One of the primary culprits propagating this in the psi-related research field is Daryl J. Bem (sometimes referred to as D. J. Bem). Bem is well known and influential in the field, including a very popular publication, Feeling the future: Experimental evidence for anomalous retroactive influences on cognition and affect (the "Bem (2011)" referenced below), which found statistically significant evidence for psi abilities. Bem also wrote two popular articles on methodology: Writing an empirical article in 2000 and Writing the empirical article in 2003. Those articles have serious flaws, and they'll be the focus of this post. (2003 is mostly a rehashing of 2000, including copy/pasting entire paragraphs, with some minor additions. Since the important quotes are present in both versions, I will cite the 2000 version exclusively. 2003 does not fix any of the conceptual mistakes made in 2000.)

Why Psychologists Must Change the Way They Analyze Their Data: The Case of Psi: Comment on Bem (2011) (which I'll refer to as "Wagenmakers et al. (2011)") thoroughly and clearly exposes the flawed recommendations. The authors break the issues down into three problems: Exploration Instead of Confirmation, Fallacy of the Transposed Conditional, and p Values Overstate the Evidence Against the Null. The first problem is the most relevant to this post.
Wagenmakers et al. (2011) include a quote from Bem (2000):
External Quote:
To compensate for this remoteness from our participants, let us at least become intimately familiar with the record of their behavior: the data. Examine them from every angle. Analyze the sexes separately. Make up new composite indexes. If a datum suggests a new hypothesis, try to find further evidence for it elsewhere in the data. If you see dim traces of interesting patterns, try to reorganize the data to bring them into bolder relief. If there are participants you don't like, or trials, observers, or interviewers who gave you anomalous results, place them aside temporarily and see if any coherent patterns emerge. Go on a fishing expedition for something—anything—interesting. (Bem, 2000, pp. 4–5)

None of the above recommendations are bad, per se. In fact, it's great practice to slice and dice your data to extract useful insights. As Wagenmakers et al. (2011) explain, the issues arise with how you use the information, which the authors denote as "exploratory" or "confirmatory". If you use this exploratory data analysis (EDA) to generate new ideas for future experiments, that's great! However, if you present these findings in an article as if they were your intent all along, i.e. as confirmatory, you have just done some bad stats. You cannot use the same data to both generate your hypothesis AND test your hypothesis. Your p-values are invalid in that case.
Wagenmakers et al.:
External Quote:
Instead of presenting exploratory findings as confirmatory, one should ideally use a two-step procedure. First, in the absence of strong theory, one can explore the data until one discovers an interesting new hypothesis. But this phase of exploration and discovery needs to be followed by a second phase, one in which the new hypothesis is tested against new data in a confirmatory fashion.

Bem (2000) does actually attempt to warn against this but immediately follows with a fatal mistake:

External Quote:
If you still plan to report the current data, you may wish to mention the new insights tentatively, stating honestly that they remain to be tested adequately. Alternatively, the data may be strong enough and reliable enough to justify recentering your article around the new findings and subordinating or even ignoring your original hypotheses.

The first sentence is good. The second is back to bad stats. The data can never be "strong enough and reliable enough" to justify "recentering your article around the new findings and subordinating or even ignoring your original hypotheses." Since you would be testing a hypothesis on the same data from which you generated it, there is no way to make the p-values valid. The p-values are inherently wrong, and you cannot adjust for that. Further, ignoring your original hypothesis is a terrible recommendation, as it can mislead the reader into believing that the presented hypothesis was the original hypothesis all along.
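The exploratory-versus-confirmatory distinction can also be demonstrated with a short simulation (again my own sketch with arbitrary numbers, not anything from Bem or Wagenmakers et al.). It fishes through a pile of pure-noise predictors for the most promising one, then "confirms" it two ways: on the same data, as Bem recommends, and on a freshly collected sample, as the two-step procedure requires. Only the fresh sample keeps the false-positive rate near 5%:

```python
# Hypothetical sketch: hypotheses generated from the data must be confirmed on NEW data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n, n_predictors = 5_000, 50, 20
alpha = 0.05
same_data_hits = new_data_hits = 0

for _ in range(n_sims):
    # "Exploration" dataset: 20 candidate predictors, none truly related to y.
    X = rng.normal(size=(n, n_predictors))
    y = rng.normal(size=n)

    # Go fishing: pick the predictor with the smallest p-value.
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(n_predictors)]
    best = int(np.argmin(pvals))

    # "Confirming" on the same data just re-reports the fished p-value.
    if pvals[best] < alpha:
        same_data_hits += 1

    # Confirming on genuinely new data: collect a fresh sample, test only the chosen predictor.
    X_new = rng.normal(size=(n, n_predictors))
    y_new = rng.normal(size=n)
    if stats.pearsonr(X_new[:, best], y_new)[1] < alpha:
        new_data_hits += 1

print("same-data false-positive rate:", same_data_hits / n_sims)  # about 0.64 (1 - 0.95**20)
print("new-data false-positive rate: ", new_data_hits / n_sims)   # about 0.05
```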
A closely related concept is the multiple comparisons problem. You can understand it quite easily: imagine a large dataset where you can generate hundreds or thousands of hypotheses (in this example, you have specified your hypotheses without looking at the data, unlike in the previous paragraphs). The first effect you test has a valid p-value: if there is no real effect, the probability of a false positive is .05, as desired. But if you add another test after the first one failed to find a significant result, you now have a 1 - (1 - .05)^2 = 0.0975 probability of finding a falsely significant result. The error rate grows as 1 - (1 - .05)^n, where n is the number of statistical tests performed. The probability of finding a fake effect approaches 1 as you add more tests, essentially guaranteeing that you will eventually find some statistically significant effect. There is an important distinction, though, between running multiple comparisons on hypotheses specified in advance and testing hypotheses that you generated by looking at the data. In the latter case, you can't fix the p-values, as I mentioned above. In the former, you can use corrections, such as the Bonferroni correction, to adjust for the multiple comparisons problem and draw valid statistical conclusions.
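For the record, here is the arithmetic above spelled out (a small sketch; the particular values of n are arbitrary), along with what the Bonferroni correction does: it divides the per-test significance threshold by the number of tests, which pulls the family-wise error rate back to at or below .05:

```python
# Family-wise error rate for n independent tests at alpha = .05,
# uncorrected vs. Bonferroni-corrected (per-test threshold of alpha / n).
alpha = 0.05
for n in (1, 2, 5, 10, 100):
    uncorrected = 1 - (1 - alpha) ** n      # grows toward 1 as n increases
    bonferroni = 1 - (1 - alpha / n) ** n   # stays at or below roughly 0.05
    print(f"n={n:3d}  uncorrected={uncorrected:.3f}  bonferroni={bonferroni:.3f}")
```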
This ended up being a much longer post than anticipated, and there is still more to be said, such as examples of bad things to look out for, good things to look out for (pre-registration and Bayesian analysis), and who knows what else. However, I'm tired, so I'll continue in a reply eventually. I'll also include commentary from another paper that mentions Bem (2011) - The garden of forking paths: Why multiple comparisons can be a problem, even when there is no fishing expedition or p-hacking and the research hypothesis was posited ahead of time by Andrew Gelman and Eric Loken (2013).