replicationindex.com

Lies in Disguise: Cheating Detection in Academia

Fischbacher, U., & Föllmi-Heusi, F. (2013). Lies in disguise: An experimental study on cheating. Journal of the European Economic Association, 11(3), 525–547. https://doi.org/10.1111/jeea.12014

Summary

Experimental economists have developed experimental paradigms to detect cheating. The paradigm is simple: participants roll a die, report the result, and receive rewards based on the reported outcome. It is unknown whether a specific participant lied, but the distribution of reported results can be compared to the known probabilities of a fair die to detect lying.

The same logic can be used to detect cheating by researchers in the real world. We can compare the outcome of studies to the probability of obtaining significant results. The only difference is that it is a bit more difficult, but not impossible, to estimate these probabilities, known as statistical power.

Pek et al. (2024) claim that this use of probabilities is an ontological error. Their argument is invalid and would imply that experimental cheating studies are invalid because it would be false to apply a priori probabilities to the results of completed studies. The only plausible explanation for Pek et al.’s confusion about the use of probabilities is that they are motivated to protect academic cheaters from the detection that comes from asking how they can obtain significant results all the time. Nobody should expect 95% significant results without cheating (Sterling et al., 1995), just as we should not expect 35% of participants in Fischbacher and Föllmi-Heusi (2013) to have rolled a 5, which produced the highest reward.

Introduction

Fischbacher and Föllmi-Heusi (2013) introduced a simple experiment to study cheating. Participants are asked to roll a die and report the result without the experimenter seeing it. The result determines a reward: numbers from 1 to 5 earn the corresponding value in Swiss Francs (roughly 1:1 with the US dollar), whereas a 6 earns no reward. The task was not designed to measure cheating by a single participant. After all, there is no way of knowing whether a single participant lied to get the maximum reward or actually rolled a 5. In the aggregate, however, cheating can be studied with this task because the probabilities of the possible outcomes are known. If participants were completely honest, only 1 out of 6 (16.7%) would receive the maximum reward (with some sampling error) and 1 out of 6 would earn no reward. The results showed that some participants were not entirely honest. Only 6.5% reported a 6 and received no reward, whereas 35% claimed the maximum reward. Other participants cheated a little and claimed 4 Francs rather than the maximum of 5, perhaps thinking that claiming the maximum would make them look guilty.
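The aggregate logic of the paradigm can be illustrated with a small simulation. This is only a sketch: the cheating rate of 22% is a hypothetical value chosen so that the share of reported fives matches the observed 35%.

```python
import random

random.seed(1)

def report(cheat_prob):
    """One participant rolls a fair die; with probability cheat_prob,
    a roll other than 5 is misreported as 5 (the maximum reward)."""
    roll = random.randint(1, 6)
    if roll != 5 and random.random() < cheat_prob:
        return 5
    return roll

def share_of_fives(n, cheat_prob):
    """Fraction of n participants who report a 5."""
    return sum(report(cheat_prob) == 5 for _ in range(n)) / n

# With full honesty, about 1 in 6 (16.7%) of reports should be a 5.
honest = share_of_fives(100_000, 0.0)

# If roughly 22% of non-five rollers lie, the expected share of fives
# is 1/6 + (5/6) * 0.22 = 0.35, matching the reported 35%.
cheating = share_of_fives(100_000, 0.22)
```

No individual report identifies a liar, yet the aggregate shares separate the honest scenario from the cheating scenario cleanly.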

The response to this finding may differ depending on the audience. Fischbacher and Föllmi-Heusi are economists and were surprised by the honesty of some respondents, because economic theories predict that people will maximize rewards and act honestly only out of fear of punishment. Humanistic psychologists may be dismayed by the evidence of dishonesty or blame it on socialization in capitalistic societies that corrupts humans who are fundamentally good and pro-social. Personality psychologists will focus on the fact that some people were honest and others were not and point to ample evidence that honesty is a trait. While situational factors play a role, some people are more likely to be honest than others, not only in lab experiments but also in the real world.

M. Dufwenberg and M. A. Dufwenberg (2018) list a few real-world examples of cheating behavior such as tax evasion or, to get to the real topic, scientific misconduct; their words, not mine. In psychology, we make a distinction between two types of cheating. Fabricating data is blatant cheating and considered misconduct with severe consequences (e.g., the Stapel case). In contrast, mild forms of data manipulation that increase the chances of a publishable significant result are not considered misconduct and have no consequences for researchers who use these questionable practices. It is also difficult to detect the use of these practices in a single published study, just as it is impossible to know whether somebody really rolled a 5 or not. However, it is easy to notice cheating in the aggregate when researchers report far more wins (p < .05) than losses (p > .05).

A classic article by Sterling (1959) noticed that psychology journals publish over 90% wins (p < .05), a finding that was replicated by Sterling and colleagues in 1995 and in the major replicability project in 2015 (Open Science Collaboration, 2015). This is where the analogy with dice ends. The outcome of a hypothesis test is more like a coin flip (p < .05 = win; p > .05 = loss), but unlike a coin toss, the probability of the desirable outcome is not known in advance.

Uncertainty about the probability of the outcome of a specific study, however, is not necessary to detect cheating. There are two ways to estimate the percentage of significant results that a set of studies should produce. One approach is to redo the reported studies in exactly the same way with the same sample sizes. The only difference between the published studies and the exact replication studies is that the outcome of the replication studies is a new random event, determined by a new selection of participants from the same population. This is equivalent to Fischbacher and Föllmi-Heusi rolling the dice in their study themselves to verify that the dice are not loaded. The percentage of significant results in the published studies and the replication studies should be the same (unless the replication studies are themselves selected for significance). So, if we randomly pick 100 original published studies, find 90% significant results, and there is no cheating, we should also get 90% significant results in the replication studies. If we obtain considerably fewer significant results, it suggests that some of the original published results were obtained with cheating, to use Fischbacher and Föllmi-Heusi’s terminology, or with questionable research practices, to use the euphemistic term preferred by psychologists.
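The selection mechanism can be sketched in a minimal simulation. The 60% average power is a hypothetical value, and treating each study as a single Bernoulli trial is a deliberate simplification:

```python
import random

random.seed(2)

TRUE_POWER = 0.60  # hypothetical average power of the original studies

def run_study(power):
    """One study as a Bernoulli trial: significant (p < .05)
    with probability equal to its power."""
    return random.random() < power

# Run many studies, but publish only the significant ones.
originals = [run_study(TRUE_POWER) for _ in range(10_000)]
published = [result for result in originals if result]

# Exact replications of the published studies: new random data,
# same power, and no selection for significance.
replications = [run_study(TRUE_POWER) for _ in published]

published_rate = sum(published) / len(published)          # 100% by selection
replication_rate = sum(replications) / len(replications)  # ~60%, the true power
```

Selection guarantees that the published record shows only wins, while honest replications fall back to the true power, which is the gap the cheating test exploits.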

A highly publicized finding from a replication project showed that 97% of original results were statistically significant, whereas only 36% of the replication results were significant, a difference that is itself statistically significant. Based on the simple logic of the cheating-detection paradigm, this difference strongly suggests that some of the published results were obtained with cheating, p < .05.

One argument against this conclusion, made by psychologists who do not like the finding, was that replication studies are never exact and are often conducted by researchers who are much less competent than the original researchers at top universities. Fortunately, we can avoid the “my studies are better than yours” debate that has no ending and use the statistical results of the original studies to examine cheating. This is possible because significance testing is a simple dichotomous decision based on continuous information about the probability of the results if there were no real effect (i.e., if the null hypothesis were true). Using this continuous information, it is possible to estimate the average probability that the studies would produce a significant result and compare it with the observed rate of significant outcomes. The advantage of this approach is that the evidence for the cheating test comes from the published studies themselves. Thus, incompetent replicators cannot mess up the results. The results are based on top-notch research by top researchers at top universities published in top journals.
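The core of this approach can be sketched with per-study "observed power" computed from reported z-values. This is the naive per-study version; z-curve additionally corrects for selection bias, and the z-values below are hypothetical examples rather than data from any study.

```python
from statistics import NormalDist

nd = NormalDist()
CRIT = nd.inv_cdf(0.975)  # z ~ 1.96, the two-sided p < .05 criterion

def observed_power(z):
    """Probability that an exact replication would again reach p < .05,
    taking the observed z-value as the true effect (naive estimate)."""
    return nd.cdf(z - CRIT) + nd.cdf(-z - CRIT)

# hypothetical z-values extracted from a handful of published studies
z_values = [2.0, 2.3, 2.6, 3.1, 4.2]
expected_replication_rate = sum(map(observed_power, z_values)) / len(z_values)
```

A study that just crossed the significance threshold (z = 1.96) has only about a 50% chance of replicating, which is why a literature piled up just above z = 2 predicts a low replication rate.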

The following figures show the results of a z-curve analysis for the original studies in the Open Science Collaboration project. The first figure assumes that there is no cheating and tries to fit a model to the data. Visual inspection is sufficient to see that the model does not fit: there are too few non-significant results and an excess of significant results just above z = 2, which corresponds to the criterion for claiming significance, p < .05, and a potential publication.

The second figure shows the results for a model that assumes cheating occurred and corrects for it.

The extended dotted blue line shows how many non-significant results there should be if researchers simply repeated a study until it produced a significant result. In reality, far fewer attempts are needed because other questionable research practices also help push p-values below .05.

The key finding is the expected replication rate. This is the hypothetical percentage of significant results that we would expect if the original researchers redid their studies exactly, with the same sample sizes but a new set of participants and without cheating. The estimate is 60%, which is higher than the percentage estimated with the actual replication studies. However, given the small number of studies, it could be as low as 42% or as high as 78%. Even 78% is still less than the actual 92% of results that were reported as significant at p < .05, not counting the studies that were called marginally significant with p-values greater than .05, another questionable practice that is at least transparent in that the failure to achieve significance is documented.

While a lot of these issues were controversial during the past decade known as the replication crisis or the crisis of confidence in published results, it is now widely accepted that many significant results were obtained with cheating. However, a vocal minority tries to discredit this evidence.

Pek et al. (2024) “remind the reader using observed power calculations (based on collected data) to make statements about the power of tests in those completed studies is problematic because such an application of power is an ontological impossibility of frequentist probability (Goodman & Berlin, 1994; Greenland, 2012; McShane et al., 2020)” (p. 12).

According to this argument, it would be incorrect to evaluate the observed outcome that 35% of participants reported a 5 against the 16.7% probability that we would expect from a fair die. This argument makes no sense because probabilities determine the long-run frequencies of outcomes. So, either the die is not fair and produces an abnormally high frequency of fives, or participants were cheating. Evidently, the outcomes of random events tell us something about the probability of those events.
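The comparison is an ordinary significance test on a long-run frequency, which can be sketched as an exact binomial test. The sample size of 200 is hypothetical, not the actual n of the published study.

```python
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of observing at
    least k reported fives if the die is fair and everyone is honest."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# hypothetical sample: 200 participants, 35% (70 of them) report a five
n_participants, fives = 200, 70
p_value = binom_tail(n_participants, fives, 1 / 6)
# p_value is tiny: honest reporting of a fair die cannot explain 35% fives
```

The test uses the a priori probability of 1/6 to evaluate a completed set of reports, exactly the application of frequentist probability that the ontological-impossibility argument would rule out.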

Pek et al.’s (2024) pseudo-philosophical claim of an ontological error is clearly counterintuitive, but it has convinced reviewers and editors to publish their claims. So, it is fair to ask: am I just a victim of the Dunning-Kruger effect, unable to realize my own incompetence in probability theory? ChatGPT doesn’t think so (chat about Pek et al.), but you have to make up your own mind. What explains that the success rate of researchers is much lower in replication studies than the 90% success rate reported in psychology journals?
