Meta-science uses scientific methods to study science; meta-psychology uses scientific methods to study psychological science. Although meta-psychology emerged only recently, psychologists have reflected on their science since its beginning (Cohen, 1961; Tversky & Kahneman, 1971).
From 1960 to 1990, the most influential meta-psychologist was Jacob Cohen. Cohen made two important observations about statistical practices in psychology. First, Cohen observed that psychologists often conducted studies without considering the a priori probability that their study would produce the correct result; that is, a statistically significant result that is based on an actual relationship between two variables. Apparently, they were as happy to conduct studies with a 20% chance of getting significant results as to conduct studies with an 80% chance of a significant result (Tversky & Kahneman, 1971). Second, Cohen conducted the first meta-analysis of statistical power in psychology, which suggested an average power of 50% (Cohen, 1961). This finding has been replicated in other studies, and power does not seem to have increased from 1960 to 2010 (Maxwell, 2004; Sedlmeier & Gigerenzer, 1989; Schimmack, 2012; Smaldino & McElreath, 2016).
In the 1990s, the American Psychological Association assembled a task force to examine statistical practices in psychological science. A key concern was the high number of false negative results. In response to this concern, Psychological Science tried to abandon the use of statistical significance as a criterion for publication (Cutting, 2006). Although another solution to the problem would have been to demand higher-powered studies, power remained neglected (Maxwell, 2004).
The 2010s saw a dramatic shift in meta-scientific concerns about research practices in psychological science. A new concern was that many hypotheses in psychology might be false; that is, the nil-hypothesis is actually true, and many published results are false positive results. A similar concern had been raised somewhat earlier for medical research (Ioannidis, 2005), but psychologists only started worrying about false positive results in 2011. The crisis of confidence was triggered by a social psychological article that claimed implicit priming effects work even when the prime is presented after the behavior (Bem, 2011). Few psychologists were willing to accept time-reversed causality, and many considered Bem’s results to be false positive results. This sensational finding triggered questions about the credibility of other findings that were produced with the same research methods. Suddenly, a major concern was that many, if not most, published results are false positive results.
Concerns about false positive results were stoked by demonstrations that a few statistical tricks can help researchers to produce significance even when the nil-hypothesis is true (Simmons, Nelson, & Simonsohn, 2011). These tricks are called questionable research practices (QRPs) because they undermine the objectivity of data analysis (John et al., 2012). Concerns about false positive results are intimately tied to concerns about questionable research practices because the proper use of significance testing produces a significant result in only 1 out of 20 attempts when the nil-hypothesis is true, and in only 1 out of 40 attempts if the sign of the effect also has to be consistent with the prediction, with the standard criterion of alpha = 5%, two-sided (z = 1.96). It seems implausible that a single researcher would conduct 40 studies to produce 1 significant result. Even if independent researchers are testing the same hypothesis, it is implausible that a field with dozens of significant results in a meta-analysis could have produced these results if the nil-hypothesis were true. This logic is consistent with Rosenthal’s fail-safe N statistic for meta-analysis, which computes the number of non-significant results that would be required to nullify a significant meta-analytic result. Typically, fail-safe N is much higher than any reasonable number of unpublished studies. Thus, it is irrational to postulate a high false positive risk without assuming the use of questionable research practices that increase the chance of producing false positive results. In sum, the 2010s saw a shift in concerns about research practices in psychological science. The new concern was that psychological science is not very different from other human attempts to understand the world.
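The 1-in-20 and 1-in-40 base rates in this argument are easy to verify with a quick Monte Carlo simulation (a minimal sketch, not from the article, assuming standard normal test statistics under the nil-hypothesis):

```python
# Rate of "successful" tests when the nil-hypothesis is true,
# with and without requiring the predicted sign.
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1_000_000)  # test statistics under the nil-hypothesis

# Two-sided test at alpha = .05: significant in either direction.
sig_two_sided = np.abs(z) > 1.96
# Requiring the predicted sign as well halves the success rate.
sig_right_sign = z > 1.96

print(round(sig_two_sided.mean(), 3))   # ~0.05  (1 in 20)
print(round(sig_right_sign.mean(), 3))  # ~0.025 (1 in 40)
```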
“Instead of a naïve scientist entering the environment in search of the truth, we find the rather unflattering picture of a charlatan trying to make the data come out in a manner most advantageous to his or her already-held theories” (Fiske & Taylor, 1984).
The past decade has demonstrated conclusively that questionable research practices inflate the percentage of significant results in psychology journals. First, statistical comparisons of power and success rates show that success rates exceed statistical power (Francis, 2012; 2014; Schimmack, 2012; 2020). Second, success rates of replication studies without QRPs are lower than success rates in original articles (Open Science Collaboration, 2015). Third, success rates in published articles based on dissertations are higher than success rates in the dissertations (reference). Thus, there is no doubt that questionable research practices are used and increase the risk of false positive results. However, it remains unclear whether QRPs merely inflate true effect sizes or whether they often produce false positive results.
At present, concerns about a high rate of false positive results in psychology rest mostly on speculation. For example, Smaldino and McElreath (2016) speculate that most effect sizes are small, d = .2, which leads to the assumption that power is low (24%), and that this explains why “false discoveries are common” (p. 5). However, with low power it is impossible to distinguish between true small effects and true nil-hypotheses, and there is no empirical evidence to support the claim that false discoveries are common. The key problem with these speculations is that it is practically impossible to distinguish between false hypotheses, where the nil-hypothesis is true, and true hypotheses, where the effect size is very small and may not be practically significant (Brunner & Schimmack, 2020). Thus, even a high rate of replication failures does not warrant the conclusion that most published results are false positives (Schimmack, 2020).
Dreber et al. (2015) speculated that the false discovery risk is high because psychologists often test false hypotheses. Based on a prediction market of replication outcomes, they estimated that only 9 out of 100 hypotheses in psychology are true. Along with low power, this would suggest that the actual rate of discoveries in psychological laboratories is below 10% (90 × .05 + 10 × .50 = 9.5 out of 100). That is, whenever a psychological researcher presses a key to get a p-value, 9 out of 10 times the result is a p-value greater than .05. Researchers then use QRPs to inflate the rate of significant results in publications to 95% (Sterling, 1959; Sterling et al., 1995).
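The expected discovery rate in this scenario follows from simple mixture arithmetic, using the round 10%/90% split from the calculation above (a sketch of the reasoning, not Dreber et al.’s code):

```python
# Expected discovery rate when most tested hypotheses are false:
# false hypotheses succeed at the alpha rate, true ones at the power rate.
prop_true = 0.10   # proportion of tested hypotheses that are true
power = 0.50       # power when the hypothesis is true
alpha = 0.05       # significance rate when the nil-hypothesis is true

discovery_rate = prop_true * power + (1 - prop_true) * alpha
print(round(discovery_rate * 100, 1))  # 9.5 significant results per 100 tests
```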
Without strong empirical data, meta-scientists are at risk of being no different from other scientists. That is, they will fit their assumptions and simulations to predict the state of affairs they believe to be true. To avoid this problem, I present the results of objective analyses of empirical data. The empirical data are test results from articles published in psychology journals that cover a wide range of psychological disciplines. The dataset also covers all years from 2010 to 2020, which makes it possible to examine any changes in research practices over the past decade. The test statistics were automatically extracted from text files of the published articles. This ensures that the coding of articles is objective (i.e., not influenced by my own motivated biases) and yields a large sample size, which helps to obtain precise parameter estimates. The method also has some limitations that are discussed in detail in the discussion section, but the dataset is superior to previous attempts to estimate false discovery risks in psychology. For example, Dreber et al.’s (2015) estimate was based on 44 tests. In contrast, the present analysis is based on over 1 million test statistics.
To provide quantitative information about the use of questionable research practices, power, and false positive risks, I used a new statistical tool called z-curve 2.0 (Bartos & Schimmack, 2021; Brunner & Schimmack, 2020). To standardize information from various test statistics, z-curve first transforms test statistics into exact p-values (or log p-values), and then converts the two-sided p-values into absolute z-scores. For example, a p-value of .05 (two-sided) is converted into a z-score of 1.96. Each observed z-score has two components. One component is the non-central z-score that is expected based on the sample size and effect size of a test. The other component is sampling error. In reality, each z-score can have a different non-central z-score. However, the distribution of z-scores can be approximated with a finite mixture model that has only a few non-central z-scores as components. Z-curve approximates the observed distribution of z-scores by assigning different weights to each component to optimize fit between the observed distribution and the distribution that is predicted by the z-curve model. Each component of the model corresponds to a power to reject the nil-hypothesis. For example, the component z = 2 has approximately 50% power to reject the nil-hypothesis. The key parameters that are estimated by z-curve 2.0 are the average power of all studies (before selection for significance) and the average power of studies with a significant result (after selection for significance). The focus here is on the estimate of power before selection for significance. This parameter is called the estimated discovery rate (EDR).
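The two transformations described above can be sketched in a few lines (the helper names are mine, not the z-curve package API; only the standard normal distribution is assumed):

```python
# Sketch of the p-to-z conversion and the power of a mixture component.
from scipy.stats import norm

def p_to_z(p_two_sided):
    """Convert a two-sided p-value into an absolute z-score."""
    return norm.ppf(1 - p_two_sided / 2)

def component_power(ncz, alpha=0.05):
    """Power of a component with non-central z-score `ncz` (two-sided test)."""
    z_crit = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = .05
    return (1 - norm.cdf(z_crit - ncz)) + norm.cdf(-z_crit - ncz)

print(round(p_to_z(0.05), 2))           # 1.96, as in the example above
print(round(component_power(1.96), 2))  # ~0.50: a just-significant ncz has ~50% power
```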
The EDR is crucial to answer three questions about research practices in psychology. First, the EDR is an estimate of the typical power of psychological studies. This power estimate is different from a priori power estimates that postulate a (non-zero) effect size. Here, power is an estimate of average power across a set of studies, not the power of a single study. Moreover, because it is not possible to determine the number of false hypotheses that were tested, the average also includes an unknown percentage of studies with power equivalent to alpha. Thus, power is not conditioned on the presence of an effect. To distinguish it from the typical use of power, I call it unconditional power. Unconditional power provides an estimate of how often psychologists press a key and get a significant result, p < .05. Unconditional power can be low for two reasons. Either psychologists test many false hypotheses, for which power is 5%, or they test many true hypotheses with low conditional power.
An estimate of the discovery rate can be used to quantify the influence of questionable research practices that inflate the observed discovery rate. For example, if psychology journals publish 95% significant results with only 20% unconditional power, we see a large discrepancy between observed power and estimated power that can only be explained with QRPs. However, if the observed discovery rate is 60% and the estimated discovery rate is 50%, the influence of QRPs is milder. Finally, if the observed discovery rate matches the expected discovery rate, there would be no evidence that QRPs were used. Thus, the discrepancy between observed and expected discovery rates provides quantitative information about the use of QRPs.
The estimated discovery rate is also useful to assess the risk that published significant results are false positive results. The reason is that the risk of false positive results decreases as the discovery rate increases (Soric, 1989). Figure 1 shows the relationship between the discovery rate and the maximum rate of false positive results, which I call the false discovery risk.
The relationship is non-linear. With a discovery rate of just 10%, the false discovery risk is about 50%. Thus, claims that most published results are false require fewer than 10 out of 100 significant results (Dreber et al., 2015; Ioannidis, 2005). On the other hand, if just over 50% of all tests are significant, the risk that significant results are false positives is only 5%. In contrast to speculations about the number of true or false hypotheses that are being tested, Soric’s focus on observable discovery rates makes it possible to study false discovery risks empirically. The only problem is that observed discovery rates in psychology journals are not credible estimates of the actual discovery rates because they are inflated to an unknown degree by the use of QRPs. Z-curve makes it possible to answer questions about QRPs and false discovery risk because it provides estimates of unconditional power that correct for the influence of QRPs.
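Soric's (1989) bound can be written as FDR_max = (1/DR − 1) × α/(1 − α). A minimal sketch of the relationship plotted in Figure 1 (the function name is mine, not from any package):

```python
# Soric's upper bound on the false discovery risk for a given discovery rate.
def soric_fdr(discovery_rate, alpha=0.05):
    """Maximum proportion of false positives among significant results."""
    return (1 / discovery_rate - 1) * (alpha / (1 - alpha))

print(round(soric_fdr(0.10), 2))   # 0.47: roughly 50% with a 10% discovery rate
print(round(soric_fdr(0.51), 3))   # 0.051: ~5% when just over half of tests succeed
```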
Data
PDF files of published articles from 120 psychology journals (journal list) from 2010 to 2020 were downloaded and converted into text files using the commercial software PDFzilla. The text files were searched for F-tests, t-tests, z-tests, and chi-square test statistics that were reported in the text. Results presented in other forms or in tables were not included. Chi-square tests were limited to tests with up to 5 degrees of freedom to exclude model fit results from structural equation modeling studies. Results from t-tests that were reported without degrees of freedom were treated as z-tests. The data are available to reproduce the results (data.file).
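An extraction step like the one described above can be illustrated with regular expressions (an illustrative sketch only; the patterns and names here are my assumptions, not the actual extraction code used for this dataset):

```python
# Pull APA-style test statistics such as "t(38) = 2.41" out of article text.
import re

TEST_PATTERNS = {
    "t": re.compile(r"\bt\s*\(\s*(\d+)\s*\)\s*=\s*(-?\d+\.?\d*)"),
    "F": re.compile(r"\bF\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)\s*=\s*(\d+\.?\d*)"),
}

text = "The effect was significant, t(38) = 2.41, and F(1, 38) = 5.81."
for name, pattern in TEST_PATTERNS.items():
    for match in pattern.finditer(text):
        print(name, match.groups())
```

In practice, published articles report statistics in many typographic variants, so real extraction code needs considerably more robust patterns than this sketch.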
Results and Discussion
Figure 2 shows the z-curve plot for all test statistics. First, the percentage of results that are significant with alpha = .05 (two-sided) is 69%. This is notably lower than previous estimates (Sterling, 1959; Sterling et al., 1995). The main reason for this discrepancy is that journal articles are often written around a significant result as the key finding (Kerr, 1998). Thus, non-significant results are not entirely missing from published articles, but the focus is on significant results because they can be interpreted, whereas non-significant results are inconclusive in the standard nil-hypothesis framework.
The second observation is that the EDR is 47% with a fairly tight confidence interval ranging from 42% to 53%. This finding is consistent with Cohen’s assumption that most studies in psychology test true hypotheses and that average power to detect true effects is around 50%. In contrast, an EDR around 50% is inconsistent with previous claims that average power is 23%, which would limit the EDR to 23% even if only true hypotheses were tested. An EDR of 47% implies that the false discovery risk is limited to 7%, with a fairly tight confidence interval ranging from 5% to 7%. This finding is inconsistent with Ioannidis’s speculation that most published results in science are false. A similar analysis of medical journals also failed to support Ioannidis’s prediction (Jager & Leek, 2014). Finally, the results are inconsistent with Dreber et al.’s estimate that psychologists test only 10% true hypotheses, which would limit the predicted EDR to 14.5%.
The third observation is that the estimated discovery rate is only 47%. Given the large sample size, the 95%CI around this estimate is fairly tight ranging from 42% to 53%. The discrepancy between the observed discovery rate and the estimated discovery rate shows clear evidence that the success rate in journals is inflated by QRPs. The discrepancy is not as large as some have feared (Simmons et al., 2011), but it is not trivial either. These results corroborate the findings from anonymous survey studies that QRPs are being used to produce significant results (John et al., 2012; Nosek et al., 2021).
In conclusion, a z-curve analysis of statistical results in psychology journals provides some clarification about the prevalence of false positives and false negative results in psychology. The results strongly support the concerns of the 1990s that psychological studies are often underpowered and bear a high risk to produce false negative results. In contrast, the risk of false positive results is relatively small. Psychologists are much more likely to miss detecting statistically small effects than to produce significant results without a true effect. Before the implications of these new findings are discussed in detail, it is important to examine potential moderator variables.
Time Effects
The choice of 2010 as the starting year for the analyses is arbitrary. With some additional work it would be possible to extend the dataset to earlier years. However, it is unlikely that results changed much before 2010 because critical reflections on research methods showed little effect (Sedlmeier & Gigerenzer, 1989; Smaldino & McElreath, 2016). The more interesting question is whether research practices changed in response to the crisis of confidence (Nosek et al., 2021). The present study provides the first empirical answer to this question. To do so, I estimated the ODR and EDR separately for each year.
The results in Figure 3 show that the ODR decreased over time and the EDR increased over time. The ordinal trends are highly significant, p < .001. An analysis with quadratic trends suggests that the increase in the EDR is not linear. Visual inspection suggests that the EDR did not change much from 2010 to 2015, but did increase after 2015. This would be consistent with a delayed response to the credibility crisis that emerged in 2011.
In absolute terms the changes are not that impressive. The ODR decreased from 72% to 68%. The EDR increased from 45% to 53%. Thus, the general results hold even for 2010. Average power was around 50% and the false discovery risk was only 6%.
Discipline as Moderator
The credibility crisis has focused heavily on social and cognitive psychology (Open Science Collaboration, 2015). In contrast, research areas where recruitment of participants is difficult and costly have seen fewer replication studies. As a result, little is known about variability in the use of QRPs and false discovery risks across scientific disciplines.
I first focus on the major areas that have multiple journals, namely social psychology (k = 21), cognitive (k = 16), developmental (k = 14), clinical (k = 14), biological (k = 13), personality (k = 6), and applied (k = 17).
The results show that all disciplines use QRPs, as the ODR is always significantly higher than the EDR. The confidence intervals are 83%CIs to allow easier comparisons of EDRs across disciplines, but the typical 95%CI also never included the ODR. There is statistically significant variation in the ODRs, which are estimated with high precision, but in terms of effect sizes the variation is small, ranging from 65% for clinical to 72% for personality psychology. Variation in the EDR is more substantial, ranging from 31% for applied to 54% for cognitive psychology. Most confidence intervals overlap, but cognitive psychology has a higher EDR than social psychology and applied psychology. In addition, applied psychology differed significantly from personality and developmental psychology.
These results qualitatively validate the finding of the reproducibility project that replication studies without QRPs produced more significant results in cognitive psychology than in social psychology. The present results also suggest that this difference is driven by both disciplines: social psychology is less replicable, and cognitive psychology is more replicable, than other fields.
However, quantitatively the present results are not consistent with the OSC reproducibility project. While the success rate for cognitive psychology (50%) is in line with the EDR estimate of 54%, the success rate for social psychology was lower (25%) than the EDR (40%). A number of factors can contribute to this discrepancy, including the low number of studies in the reproducibility project and some problems with conducting actual replication studies in social psychology (van Bavel et al., 2016; Inbar, 2016). A larger sample of actual replication studies is needed to explore this discrepancy further.
Overall, the results show only modest heterogeneity in research practices across disciplines in psychology. The most notable exception is the set of results published in applied psychology journals. This is a novel and concerning finding, as applied research has more impact on individuals’ lives. However, even for applied psychology, the 95%CI for the false discovery risk ranges only from 6% to 17%. Thus, there is no indication that most published results in psychological science are false in any of the major disciplines.
Time by Discipline Interactions
Most of the discussion about research practices has occurred among social psychologists (Nosek et al., 2021; Schimmack, 2020), and social psychologists have been overrepresented in the Open Science movement. It is plausible to assume that changes in research practices among social psychologists disproportionally contribute to the overall increase in the EDR. It is less clear, whether other disciplines have changed over the past decade. To answer this question, I also examined time trends for each of the major disciplines.
All disciplines, except cognitive psychology, show an increasing trend. Confidence intervals are not shown because they are too wide to show annual differences. Another way to compare time trends of disciplines is to regress the EDR and to compare the slopes.
The wider 95%CIs show that all disciplines except cognitive and developmental psychology improved. However, the CIs are wide and it is possible that some increases occurred that are not yet statistically reliable. Using the narrower 83% confidence intervals shows that only the large increases for social and personality psychology are statistically significantly different from cognitive psychology. In terms of practical significance, the changes for social and personality psychology are meaningful and show some robust improvements in these two disciplines.
The next figure shows the results after EDR scores are transformed into FDR scores. The results are predicted scores based on a linear regression of EDR scores on year; the regression was carried out on EDR scores, and the predicted scores and confidence interval boundaries were then transformed into FDR scores. The most important finding is that the false discovery risk is much lower than meta-scientists feared (Ioannidis, 2005; Simmons et al., 2011). Even in social psychology in the 2010s, the FDR is only 14%, and the upper limit of the 95%CI is 18%.
In 2020, several areas have point estimates below 5%, but the 95%CI include 5%, making it impossible to conclude that the FDR in any area is below 5%. However, all areas except applied psychology have an FDR below 10%. Overall, these results are consistent with Cohen’s assumption that psychologists infrequently test true nil-hypotheses.
Adjusting Alpha to Control the Journal-wide False Discovery Risk
In the final empirical part of this article, I focus on one specific journal to illustrate how z-curve analyses can be used to examine research published in a single journal. The focus on journals is useful because journal editors have control over the research that is published in their journals. So far, most editors have treated all p-values below .05 as equivalent. However, z-curve analyses of significant p-values show that not all p-values below .05 are equal. Some provide stronger evidence for a hypothesis than others, and while some just-significant p-values are expected, too many just-significant results are a sign that QRPs were used (Lindsay, 2013).
To illustrate the use of z-curve to assess individual journals, I picked the journal Psychological Science for several reasons. First, it is the flagship journal of the Association for Psychological Science and publishes articles from all areas of psychology. Second, the journal was chosen for the reproducibility project and showed a low success rate in replication attempts without QRPs (Open Science Collaboration, 2015). Third, the journal has responded to the crisis of confidence by implementing reforms such as awarding badges for open science practices. Fourth, it is one of two journals that showed a significant increase in the EDR from 2010 to 2020 with alpha = .01. Thus, it is one of the few journals that shows signs of increasing unconditional power (the ERR shows improvement for more journals because its estimates are more reliable). I extended the analysis to all years since 2000. I did not use the first volumes in the 1990s because they had relatively few test statistics to analyze.
A simple plot of the median z-score shows that the signal/noise ratio was somewhat elevated at the beginning of the 2000s. It is noteworthy that Psychological Science implemented a new statistic, p-rep, in 2005 and softened the strict p < .05 criterion to claim discoveries. This policy coincided with a reduction in median z-scores. However, even when this policy was abandoned in 2010, evidence remained weak and only strengthened again with a new editor in 2015. Still, this increase is relatively mild, and the median z-score remains between 2 and 2.8, which corresponds to p = .05 and p = .005, respectively.
To increase the robustness of results, I conducted z-curve analyses for the years 2000-2006, 2007-2016, and 2017-2020, which roughly mirrors the pattern of median-z scores in the previous figure.
The results for 2000-2006 show clear evidence that QRPs were used because the 95% confidence interval of the EDR, 21% to 54%, does not include the observed discovery rate of 77%. With a point-estimate of the EDR of 32%, the results imply that up to 11% of the significant results could be false positives, but the upper limit of the 95%CI suggests that no more than 20% of significant results are false positives.
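These bounds follow from Soric's formula, FDR_max = (1/EDR − 1) × α/(1 − α), which is cited earlier for the same purpose; a quick arithmetic check (assuming alpha = .05):

```python
# False discovery risk implied by the 2000-2006 EDR estimates reported above.
alpha_odds = 0.05 / 0.95

for edr in (0.32, 0.21):  # point estimate and lower 95%CI bound of the EDR
    fdr_max = (1 / edr - 1) * alpha_odds
    print(f"EDR {edr:.0%} -> FDR at most {fdr_max:.0%}")
```

The point estimate of 32% yields a maximum FDR of about 11%, and the lower CI bound of 21% yields about 20%, matching the values in the text.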
The next results show that the small dip in the median z-scores has no practical relevance for the credibility of published results. The estimated discovery rate is 33% compared to 32% for the previous years. It is interesting to compare the EDR of 33% to the success rate of actual replication studies in the reproducibility project for studies published in Psychological Science. The actual success rate was 38%, which is fairly close to the EDR, but much lower than the estimated replication rate (62%), which will be discussed in the General Discussion section.
One way to use the information about the EDR and the implied false discovery risk is to lower the criterion for significance (alpha-level) to reduce the false discovery risk. With alpha = .01, the false discovery risk is 4%. Thus, readers of older Psychological Science articles can adjust alpha to .01 (t ~ 2.5, F ~ 8), to minimize the risk of interpreting a false positive result. This adjustment is empirically justified and less stringent than the adjustment suggested by a large group of meta-scientists who called for alpha = .005 to reduce the risk of false positives (Benjamin et al., 2017).
The results for the last four years show an increase in the EDR from 33% to 54%, which lowers the false discovery risk to 4%. Thus, alpha = .05 offers the same protection against false discoveries as alpha = .01 did in previous years. The adjustment of alpha according to empirically estimated false discovery risks provides an incentive for journal editors to maintain a reasonably high discovery rate. However, even the results for 2017-2020 show that QRPs are still being used. Thus, readers may still be cautious in interpreting just significant results in Psychological Science and editors should aim to further reduce the gap between the observed discovery rate and the estimated discovery rate.
In sum, z-curve analyses of individual journals can be used to examine the type of research that a journal publishes. The results for all 120 journals are published in the annual replicability rankings (Schimmack, 2021). Here I showed how this information can be used to create an incentive for editors to reduce publication bias and false discovery risk by maintaining an unbiased discovery rate of at least 50%.
General Discussion
With a few exceptions (Cohen, 1989; Sedlmeier & Gigerenzer, 1989), meta-psychologists have relied on assumptions to speculate about research practices and false discoveries in psychology. Survey studies demonstrated that researchers use questionable practices, but it remained unclear how much these practices undermine the credibility of published results (John et al., 2012). Empirical investigations with replication studies produced the alarming finding that only 25% of significant results in social psychology could be replicated, but it remained unclear whether this low replication rate is limited to social psychology or also holds for other research areas. As a result, there remains a lot of uncertainty and room for speculation among meta-psychologists. This study provides much needed empirical data based on an analysis of over 1 million test statistics from over 100 journals that cover a broad range of psychological disciplines. David Funder observed that some data are often better than no data. Based on this simple heuristic, the increase from N = 0 to N > 1 million is a step in the right direction. Although the present study is not without limitations, it provides much needed empirical evidence about the use of QRPs and the false discovery risk in psychological research across a broad range of disciplines. Below I provide a more detailed discussion of the key findings.
Questionable Research Practices
The first finding is that there is clear evidence that questionable research practices are common in all disciplines of psychology. This finding supports survey results that show not only that psychologists use these practices, but also do not see them as problematic (John et al., 2012). However, survey results have been questioned and are subject to numerous biases that may lead to an underestimation of the prevalence of QRPs. Here I used an objective method to quantify the impact of QRPs on the credibility of published results. The evidence is conclusive that journals publish too many significant results. The only remaining question is why QRPs are being used.
One common claim among meta-psychologists is that QRPs are often used without awareness when researchers explore their data. However, this explanation is unsatisfactory for two reasons. First, data exploration also produces many non-significant results. Thus, researchers must be actively choosing to report the significant results and to omit the non-significant ones to produce the excess of significant results in journals. It makes little sense to postulate that these choices occur without awareness. A more plausible explanation is that psychologists are aware of how they analyze their data, but lack (or at least lacked) awareness that their practices are questionable. In support of this hypothesis, many researchers considered questionable practices (except fraud) to be acceptable (John et al., 2012).
One possible explanation for this nonchalant attitude towards QRPs is that many psychologists may falsely believe that statistical significance ensures a low false discovery risk (reference). It is therefore important to educate psychologists that alpha is not equivalent to the false discovery risk and that the false discovery risk increases when they fish for significance in their data and find only a few significant results in a large number of tests. This does not mean that fishing expeditions are wrong, but reporting the results of these exploratory studies as if only a few confirmatory tests were conducted is wrong. Honest reporting of all tests is a fundamental aspect of a sound science. This message was also emphasized by Bem (2000) in his tutorial for graduate students.
“the integrity of the scientific enterprise requires the reporting of disconfirming results.”
Thus, there is no excuse for the use of QRPs that inflate the rate of confirming results. Z-curve analyses of published studies make it possible to detect the use of QRPs, and researchers who use QRPs should be held responsible for practices that threaten scientific integrity.
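The point that alpha is not the false discovery risk can be made concrete with a short calculation. The sketch below is my own illustration, not part of the original analysis; the specific proportions (10% true hypotheses, 50% power) are hypothetical values chosen only to show how fishing for significance inflates the risk far above the nominal alpha of .05.

```python
# Illustration (hypothetical numbers): why alpha is not the false discovery risk.
# When only a fraction of the tested hypotheses are true effects, the share of
# false positives among the *significant* results can greatly exceed alpha.

def false_discovery_rate(prop_true, power, alpha=0.05):
    """Expected share of significant results that are false positives."""
    true_pos = prop_true * power            # true effects that reach significance
    false_pos = (1 - prop_true) * alpha     # true nulls that reach significance
    return false_pos / (true_pos + false_pos)

# Fishing expedition: only 10% of tested hypotheses are true, 50% power.
print(round(false_discovery_rate(prop_true=0.10, power=0.50), 2))  # 0.47

# Mostly true hypotheses: 50% of tested hypotheses are true, 50% power.
print(round(false_discovery_rate(prop_true=0.50, power=0.50), 2))  # 0.09
```

Despite a constant alpha of .05, nearly half of the significant results are false positives in the first scenario, which is the situation the paragraph above describes.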
Average Power of Studies in Psychology
A seminal study of statistical power by Cohen (1961) suggested an average power of around 50%. Most commentators have argued that 50% power is low, and Nobel laureate Kahneman and his equally famous colleague Tversky called it “ridiculously low” (Tversky & Kahneman, 1971, p. 107). They echoed Cohen’s recommendation to conduct a priori power analyses and suggested that researchers should not conduct studies if power is less than 50%: “We refuse to believe that a serious investigator will knowingly accept a .50 risk of failing to confirm a valid research hypothesis” (p. 110). Yet, Kahneman admitted that he ignored his own advice and placed too much faith in studies with small samples (Kahneman, 2016). The present results suggest that Kahneman is not alone. Psychologists across many disciplines seem to be happy to conduct studies with an average success rate of around 50%. A number of meta-psychologists have speculated about this neglect of statistical power (Maxwell, 2004; Schimmack, 2012; Sedlmeier & Gigerenzer, 1989).
Many areas of psychology still publish articles that report the results of a single study. For solo-study articles, a simple explanation for the neglect of power is that studies are often complex enough to allow for multiple hypothesis tests. With an average power of 50%, the chance of finding at least one significant result increases considerably as the number of tests increases, especially when these tests are independent. For example, three independent tests with 50% power each provide an 87.5% chance of obtaining at least one significant result. This logic implies that non-significant results are treated as type-II errors and that results will be inconsistent across studies. In this scenario, questionable research practices are particularly damaging to a field because there is no real difference between significant and non-significant results, which are merely distinguished by the random flip of a coin. If, at the very least, all results were published, meta-analyses could use all the data to reveal that the inconsistent results from study to study are due to chance, and could produce consistent and unbiased estimates. Thus, honest reporting of all results would solve the problem that is created by selecting for significance in studies that have low power.
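The 87.5% figure above follows from the complement rule for independent tests. A minimal sketch of the arithmetic (my own illustration, not from the article):

```python
# Chance of at least one significant result among k independent tests,
# each conducted with the same power.
def at_least_one(power, k):
    return 1 - (1 - power) ** k

print(at_least_one(0.5, 1))  # 0.5
print(at_least_one(0.5, 3))  # 0.875 -- the 87.5% figure in the text
```

With five independent tests at 50% power each, the chance of at least one significant result already exceeds 96%, which helps explain why complex single studies rarely come up empty-handed.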
Schimmack (2012) pointed out that low power is a bigger problem for articles that report multiple tests of the same hypothesis across several studies, like Bem’s article with nine demonstrations of time-reversed cognitive abilities. Here, low power may seem counterproductive because internal replication failures weaken the evidence in favor of a hypothesis. Most people would be more impressed if all studies are successful than if only 50% of the reported studies are successful. Not surprisingly, most published internal replication studies report significant results. I could not find an explanation for low-powered internal replication studies. However, a simulation study showed that it can be more efficient to run many small studies and report only the significant ones than to run studies with larger samples to produce a package of successful replication studies (Finkel, @@@). For example, a large sample (N = 788) is required to have 80% power to detect a small standardized mean difference of d = .2. To produce 4 significant results, researchers would need to conduct 5 studies, which implies a total sample size of N = 3,940 participants. Instead, researchers could test the same effect in severely underpowered studies (10% power) with just 46 participants per study. Ten percent power implies that they need to run 40 studies to get 4 significant ones, but they need only 1,840 participants in total. Thus, it is more efficient to allocate resources to many studies with low power than to a few studies with high power. The problem with this strategy is that statistical significance does not ensure that significant results are true positive results. When we have 4 significant results out of 40 attempts, the discovery rate is only 10%, and the false discovery risk is about 50%. However, when we have 4 out of 5 significant results, the discovery rate is 80%, and the false discovery risk is about 1%.
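The numbers in this comparison can be reproduced with two standard formulas: a normal approximation to the power of a two-sided, two-sample t-test, and Soric’s (1989) upper bound, which converts a discovery rate into a maximum false discovery risk. The sketch below is my own verification of the arithmetic, not code from the original analysis; the normal approximation is slightly optimistic for small samples.

```python
from math import erf, sqrt

def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def power_two_sample(d, n_per_group, z_crit=1.96):
    """Approximate power of a two-sided two-sample test (normal approximation)."""
    return phi(d * sqrt(n_per_group / 2) - z_crit)

def soric_fdr(discovery_rate, alpha=0.05):
    """Soric's (1989) upper bound on the false discovery risk."""
    return (1 / discovery_rate - 1) * alpha / (1 - alpha)

# High-power strategy: N = 788 (394 per group) yields ~80% power for d = .2.
print(round(power_two_sample(0.2, 394), 2))   # ~0.80
# Low-power strategy: N = 46 (23 per group) yields ~10% power.
print(round(power_two_sample(0.2, 23), 2))    # ~0.11
# 4 of 5 significant -> 80% discovery rate; 4 of 40 -> 10% discovery rate.
print(round(soric_fdr(0.80), 3))              # ~0.013
print(round(soric_fdr(0.10), 3))              # ~0.474
```

The bound reproduces the contrast in the text: a 10% discovery rate allows a false discovery risk near 50%, whereas an 80% discovery rate caps it at roughly 1%.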
Thus, conducting many studies with low power creates more significant results, but at the cost of a higher risk of false discoveries. This risk was ignored because QRPs masked the low discovery rate. With real-time information about the actual discovery rate in psychology journals, editors and researchers should have a real incentive to increase their discovery rates.
The present results showed that several areas of psychology, especially those where recruitment of participants is fairly easy and inexpensive, have increased statistical power over the past decade. These increases have also reduced the false discovery risk. In light of these positive trends, it would be counterproductive to lower the significance criterion to .005 (Benjamin et al., 2017). This recommendation was based on speculations about extremely high false positive rates in psychology. The present results show that these fears are unfounded. Moreover, the false discovery risk can be reduced further by increasing statistical power, which has the added advantage of also reducing the percentage of false negative results. Any adjustments to alpha should be based on empirical evidence that the false discovery risk in a discipline is too high.
It is also important to distinguish between previous estimates of conditional power and the present examination of unconditional power. Previous estimates of power assumed specific population effect sizes. In contrast, the present results are based on the actual effect sizes and include an unknown proportion of studies in which the nil-hypothesis was at least approximately true. Ideally, we would want researchers to sometimes test false hypotheses because making risky new predictions is an important part of scientific progress. The presence of some studies in which the nil-hypothesis was true implies that the power estimates presented here are lower-bound estimates of conditional power for studies in which the effect size is at least small. Some models have tried to estimate the percentage of actual nil-hypotheses, which would make it possible to estimate conditional power given the presence of an effect (Jager & Leek, 2014). This is an interesting avenue for future research. Meanwhile, the present results should be considered conservative estimates of power for studies in which an effect size is at least small.
On the other hand, the present estimates of power may overestimate power for focal hypothesis tests in confirmatory studies. The reason is that power estimates are based on all test statistics that are reported in the text of an article, which often includes manipulation checks and auxiliary tests. A comparison of automatically extracted test statistics and hand-coded focal hypothesis tests tends to show lower EDRs for focal tests. The solution to this important caveat is to conduct studies with hand-coding of focal hypothesis tests (Motyl et al., 2016). Thus, the way forward is to collect more data, not to return to armchair speculation and criticism.
False Discovery Risks in Psychological Science
Concerns about false discoveries in psychology are not new (Rosenthal, 1979). The main method to address these concerns was meta-analysis. Meta-analysts were well aware that questionable research practices, typically discussed as publication bias, undermine the validity of meta-analyses. However, psychologists relied heavily on the fail-safe N statistic to convince themselves that publication bias was often not a major problem. A large fail-safe N was even used to argue that extrasensory perception is real (Bem & Honorton, 1994). The problem is that meta-analyses do not properly correct for publication bias. As a result, hundreds of meta-analyses are invalid and provide no evidence about whether an effect is real, let alone credible effect size estimates.
The sanguine belief that most results in psychology are real was shattered when Simmons et al. (2011) published an article titled “False Positive Psychology.” The article has been cited thousands of times to suggest that producing false positive results is easy and common. However, once more this claim was based purely on simulations and a demonstration of how massive use of questionable research practices can produce an implausible significant result. The present results show that the simulated scenario does not represent the typical behavior of psychological researchers. If it did, we would have expected a much higher estimate of the false discovery risk. The huge impact of the false positive article shows how easily opinions can be swayed in the absence of hard facts. The present results provide clear evidence that researchers use QRPs, but they do not support the claim that most published results are false positives. In this way, the present results confirm Jager and Leek’s findings for medical journals, where the estimated false discovery rate was 15% and not 50% as predicted by Ioannidis (2005). It is therefore important to correct the widespread belief that scientists are just (p-value) hacks. While there is need for improvement, social scientists are producing knowledge that is grounded in empirical observations. We can improve, and we need to protect our science from practices that undermine it, but a few stunning replication failures in social psychology do not imply that psychology as a whole is not trustworthy.
Outlook
Psychology as a science has come a long way since it took off in North America after the Second World War. Pioneers worked with limited resources and were probably well aware of the limitations of their studies. The study of human behavior is becoming easier as technology advances and more and more of human behavior is mediated by computers. In contrast, statistical practices in psychology have not changed and still follow Fisher’s original approach of significance testing with alpha = .05. This approach to data analysis has been criticized for decades, but not much has changed. The reason is that the approach is not fundamentally flawed. Real effects, especially those with meaningful effect sizes, are more likely to produce a p-value less than .05 than nil-hypotheses or trivial effect sizes. Thus, the call for radical reforms seems misguided. Instead, psychologists could benefit from focusing more on effect sizes, conducting a priori power analyses, and, most important, reporting all of their results. The honest reporting of results has become much easier with unlimited publication space, online supplements, and online sharing of data. While vanity journals may continue to limit publication space like exclusive restaurants and night clubs, science is best served by journals that merely ensure scientific integrity. The emergence of journals that are free for authors and free for readers creates an infrastructure that makes honest reporting of research results possible. The only remaining barriers are motivated biases by researchers to protect their theories from falsification and their original discoveries from replication failures. However, the past decade has shown that it is no longer possible to suppress replication failures. I am therefore optimistic that a real reform of psychology is underway, and the positive trends in the EDR support this.
This optimistic conclusion may surprise some readers who have followed my work over the past decade, but it is informed by an empirical assessment of the evidence. While my first analyses of trends in 2016 showed little evidence that things were changing, some journals and disciplines have since shown real improvements. The present results will hopefully accelerate this change because no editor wants their journal to be the last refuge for p-hackers.
It is also important to realize that rejecting a false nil-hypothesis is only the beginning of a scientific exploration. Defenders of nil-hypothesis testing have pointed out that a correct rejection of the nil-hypothesis with a two-sided test provides at least information about the direction of an effect. This is more than nothing. Knowing that an intervention, on average, reduces depression rather than increasing it is valuable information. However, the next steps have to be the estimation of effect sizes and the specification of boundary conditions. Moreover, demonstrating real effects requires valid measurement, and psychology has neglected the validation of its measures. Thus, a lot of work remains to be done to improve psychology, but at least we are not starting from ground zero.
References
Bem, D. J., & Honorton, C. (1994). Does psi exist? Replicable evidence for an anomalous process of information transfer. Psychological Bulletin, 115(1), 4–18. https://doi.org/10.1037/0033-2909.115.1.4
Smaldino, P. E., & McElreath, R. (2016). The natural selection of bad science. Royal Society Open Science, 3, 160384. https://doi.org/10.1098/rsos.160384