In 2015, the journal Science published the results of a so-called ‘replication study’ that generated a great deal of attention.
A large number of researchers had replicated one hundred psychology studies that had previously been published in top-ranked journals. The researchers used the original methods but with new and larger samples. The results were surprising: just one third of the findings could be reproduced.
“It is, quite simply, that a large proportion of published research results do not stand up”, says Anna Dreber Almenberg, who is professor of economics at Stockholm School of Economics and one of the 270 researchers behind the study.
Last year, together with colleagues, she published another, similar study in Science. Here, they examined 18 experimental studies in economics that had also been published in prestigious journals. The experiments were run in the same way as they had originally been conducted, but with more participants. They found that just eleven of the 18 results – around 60 percent – could be reproduced.
More examples can be found in other fields. In 2013, for example, the pharmaceutical company Amgen attempted to repeat 53 preclinical cancer studies. Just one in ten of the repeated studies produced the same results as the original article. This study was followed by several others within cancer research, which likewise demonstrated a low degree of reproducibility.
“I don’t believe that researchers are deliberately being misleading – on the contrary, I’m sure that most are attempting to conduct good research. There are a number of alternative explanations.”
Anna Dreber Almenberg mentions several examples. One is that the studies can give a so-called ‘false positive’ result because they are based upon a sample size that is too small. Another explanation is that it is difficult to publish studies that do not indicate any connection – for example, where a particular action does not have a positive effect.
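The small-sample problem can be made concrete with a simulation. The sketch below is not from the article; all of its numbers (10 percent of tested hypotheses being real, a modest effect size, group sizes of 20 versus 200) are invented for illustration. The point it demonstrates is that each individual test keeps its nominal 5 percent error rate, but small samples detect so few of the real effects that false positives can end up outnumbering true ones among the ‘significant’ results.

```python
import math
import random

random.seed(1)

def p_value(xs, ys):
    """Two-sided p-value for a two-sample z-test (normal approximation)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))

def run_studies(n_per_group, n_studies=2000, share_true=0.1, effect=0.5):
    """Simulate many studies; only 10% of tested hypotheses are truly real
    (an assumed number, chosen purely for illustration)."""
    false_pos = true_pos = 0
    for _ in range(n_studies):
        real = random.random() < share_true
        delta = effect if real else 0.0
        xs = [random.gauss(delta, 1) for _ in range(n_per_group)]
        ys = [random.gauss(0.0, 1) for _ in range(n_per_group)]
        if p_value(xs, ys) < 0.05:  # counted as 'statistically significant'
            if real:
                true_pos += 1
            else:
                false_pos += 1
    return false_pos, true_pos

for n in (20, 200):
    fp, tp = run_studies(n_per_group=n)
    print(f"n={n} per group: {fp} false positives vs {tp} true positives")
```

With 20 people per group, most of the real effects go undetected, so a large share of the significant results are false; with 200 per group, the true positives dominate.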
Above all else, however, she emphasises the role played by the many decisions that researchers have to make during their statistical analysis, which can lead to the results being misinterpreted.
Large degree of freedom
The first of these is called p-hacking: including or excluding different variables or observations in the analysis of the data until a sufficiently low so-called ‘p-value’ has been achieved.
The p-value is, roughly, the probability of obtaining a result at least as extreme as the observed one purely by chance, if there were in fact no real effect. Results with a p-value below 0.05 are usually regarded as statistically significant.
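Why p-hacking is so effective can be shown with a short simulation (a sketch, not from the article; the choice of ten candidate outcomes and 30 people per group is an assumption made for illustration). Every outcome below is pure noise, yet picking the best of ten tests yields a ‘significant’ p-value far more often than the 5 percent the threshold promises.

```python
import math
import random

random.seed(7)

def p_value(xs, ys):
    """Two-sided p-value for a two-sample z-test (normal approximation)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))

def smallest_p(n_per_group=30, n_outcomes=10):
    """Test several pure-noise outcomes and keep only the best p-value."""
    ps = []
    for _ in range(n_outcomes):
        xs = [random.gauss(0, 1) for _ in range(n_per_group)]
        ys = [random.gauss(0, 1) for _ in range(n_per_group)]
        ps.append(p_value(xs, ys))
    return min(ps)

trials = 1000
honest = sum(smallest_p(n_outcomes=1) < 0.05 for _ in range(trials)) / trials
hacked = sum(smallest_p(n_outcomes=10) < 0.05 for _ in range(trials)) / trials
print(f"one pre-specified test: {honest:.1%} significant")
print(f"best of ten tests:      {hacked:.1%} significant")
```

The single pre-specified test stays near 5 percent, as designed; cherry-picking among ten tests pushes the false-positive rate to roughly 1 − 0.95¹⁰ ≈ 40 percent.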
The second pitfall is known as ‘forking’, whereby the researcher allows the results to determine how the analysis should be performed. If the researcher is not able to find any connection within the whole of the group being examined, they can continue their search in subgroups.
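Forking can be simulated in the same spirit (again a sketch with invented numbers, not taken from the article: one made-up binary covariate splitting 100 people into two subgroups). The data contain no effect at all, yet the rule “test the whole sample, and if that fails, test each subgroup” finds ‘significance’ somewhere well over 5 percent of the time.

```python
import math
import random

random.seed(3)

def p_value(xs, ys):
    """Two-sided p-value for a two-sample z-test (normal approximation)."""
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return math.erfc(abs(z) / math.sqrt(2))

def forked_analysis(n=100):
    """Pure noise: the outcome is unrelated to treatment and to the covariate."""
    rows = [(random.random() < 0.5,   # treated?
             random.random() < 0.5,   # covariate, e.g. a demographic group
             random.gauss(0, 1))      # outcome
            for _ in range(n)]

    def test(subset):
        xs = [y for treated, _, y in subset if treated]
        ys = [y for treated, _, y in subset if not treated]
        return p_value(xs, ys) if min(len(xs), len(ys)) > 1 else 1.0

    if test(rows) < 0.05:             # whole-sample test
        return True
    # fork: no effect overall, so keep searching within each subgroup
    return any(test([r for r in rows if r[1] == g]) < 0.05 for g in (True, False))

trials = 2000
rate = sum(forked_analysis() for _ in range(trials)) / trials
print(f"'significant' somewhere: {rate:.1%} (nominal rate is 5%)")
```

Because the analysis path depends on the data, the effective number of tests grows, and the reported p-values no longer mean what they appear to mean.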
“These phenomena make it difficult to interpret the results.”
Anna Dreber Almenberg believes that the solution is what she refers to as ‘tying yourself to the mast’.
“It is best to decide in advance exactly which tests are to be conducted, whether or not to look at subgroups, how the variables are to be defined, and so on. Everything must be decided before the analysis begins in order to ensure that the interpretation of the results will be meaningful”, she explains.
A large part of her research concerns examining whether research results are reliable. Together with colleagues, she joined the large-scale replication project in psychology that was eventually published in Science.
Reviews top journals
A further replication study is due to be published during the winter. It examines 21 social-science studies that were published in Nature and Science between 2010 and 2015. There are 23 researchers behind this work, including the colleagues from Stockholm School of Economics.
The next project involves investigating the reproducibility of articles published in the American journal PNAS – Proceedings of the National Academy of Sciences.
“We choose the high-profile journals because they have such a major impact within their subject fields and sometimes even at policy level”, says Anna Dreber Almenberg.
Wants to have an influence
She points to a number of risks involved in published results which do not stand up. One problem is that researchers waste both time and resources on the wrong things. Another is that politicians who want to base policy reforms on research results can make decisions that are informed by data that is unreliable.
Primarily, Anna Dreber Almenberg wants to influence other researchers.
“I want them to think more about the kind of statistical analysis they perform. In my opinion, most of the false results arise from good researchers, with good intentions, ending up with misleading p-values. Publishing a false positive result does not necessarily mean that the researcher is incompetent or unethical – we can all be led to believe that we have found something that later fails to withstand scrutiny. It is significant, however, if the researcher chooses to defend results that have repeatedly been shown to not stand up. Low reproducibility is a problem for science in general.”
She also believes that a change is on the way.
“Researchers in the field of psychology are leading the way in updating the methods used. Economists have also begun the process.”