The many decisions that researchers have to make during their statistical analysis can lead to results being misinterpreted.

Results do not stand up due to researchers’ decisions

2017-11-22

More and more reports are stating that research results cannot be repeated. Professor Anna Dreber Almenberg believes this is a problem for science in general. In several studies, she has examined the reproducibility of results published in high-ranking journals.

In 2015, the journal Science published the results of a so-called ‘replication study’ that attracted a great deal of attention.

A large number of researchers had replicated one hundred scientific studies in psychology, all previously published in top-ranked journals. The researchers used the original methods but with new and larger samples. The results were surprising: only about one third of the findings could be reproduced.

Anna Dreber Almenberg

“It is, quite simply, that a large proportion of published research results do not stand up”, says Anna Dreber Almenberg, who is professor of economics at Stockholm School of Economics and one of the 270 researchers behind the study.

Last year, together with colleagues, she published another, similar study in Science. This time they examined 18 experimental studies in economics, also published in prestigious journals. The experiments were conducted in the same way as the originals, but with more participants. They found that just eleven of the results – around 60 percent – could be reproduced.

More examples

More examples can be found in other fields. In 2013, for example, the pharmaceutical company Amgen attempted to repeat 53 preclinical cancer studies. Just one in ten produced the same results as the original article. This study was followed by several others within cancer research, which also demonstrated a low degree of reproducibility.

“I don’t believe that researchers are deliberately being misleading – on the contrary, I’m sure that most are attempting to conduct good research. There are a number of alternative explanations.”

Anna Dreber Almenberg mentions several examples. One is that the studies can give a so-called ‘false positive’ result because they are based upon a sample size that is too small. Another explanation is that it is difficult to publish studies that do not indicate any connection – for example, where a particular action does not have a positive effect.

Above all else, however, she emphasises the role played by the many decisions that researchers have to make during their statistical analysis, which can lead to the results being misinterpreted.

Large degree of freedom

The first pitfall is known as ‘p-hacking’: including or excluding different variables or observations in the analysis of the data until a sufficiently low so-called ‘p-value’ has been achieved.

The p-value is a statistical measure of how likely a result at least this strong would be if there were in fact no real effect. Results with a p-value below 0.05 are usually regarded as statistically significant.
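To see why this matters, consider a small simulation (a sketch of my own in Python, not taken from any of the studies described here; it assumes NumPy and SciPy are available). The ‘treatment’ in the data has no real effect, yet an analyst who tries ten candidate outcome variables and reports only the smallest p-value will find something ‘significant’ far more often than the nominal five percent of the time.

```python
# Sketch: p-hacking on pure noise. There is NO real effect in the data,
# yet keeping the best of several analysis choices yields "significance" often.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def hacked_study(n=100, n_choices=10):
    """Try n_choices candidate outcome variables and keep the smallest p-value."""
    treated = rng.normal(size=(n_choices, n))   # candidate outcomes, treatment group
    control = rng.normal(size=(n_choices, n))   # same outcomes, control group
    return min(stats.ttest_ind(t, c).pvalue for t, c in zip(treated, control))

n_studies = 2000
honest = sum(
    stats.ttest_ind(rng.normal(size=100), rng.normal(size=100)).pvalue < 0.05
    for _ in range(n_studies)
)
hacked = sum(hacked_study() < 0.05 for _ in range(n_studies))

print(f"one pre-specified test:     {honest / n_studies:.1%} false positives")  # about 5%
print(f"best of ten analysis paths: {hacked / n_studies:.1%} false positives")  # about 40%
```

The exact numbers vary from run to run, but the pattern is stable: the more analysis paths that are tried and silently discarded, the less the reported p-value means.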

The second pitfall is known as ‘forking’, whereby the researcher allows the results to determine how the analysis should be performed. If the researcher is not able to find any connection within the whole of the group being examined, they can continue their search in subgroups.

“These phenomena make it difficult to interpret the results.”
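The same logic applies to forking. In the sketch below (again my own illustration in Python, not the researchers’ code), the treatment has no effect at all, but an analyst who tests the full sample and then each of eight subgroups will nevertheless report some ‘effect’ in roughly a third of studies.

```python
# Sketch: "forking" on pure noise. If the full-sample test comes up empty,
# the analyst keeps searching within subgroups until something clears p < 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def forked_analysis(n=400, n_subgroups=8):
    """Return True if p < 0.05 is found in the full sample or in any subgroup."""
    outcome = rng.normal(size=n)                       # pure noise: no treatment effect
    treated = rng.integers(0, 2, size=n).astype(bool)  # random treatment assignment
    subgroup = rng.integers(0, n_subgroups, size=n)    # e.g. age bands or regions

    if stats.ttest_ind(outcome[treated], outcome[~treated]).pvalue < 0.05:
        return True                                    # "significant" in the full sample
    for g in range(n_subgroups):                       # fork: try each subgroup in turn
        mask = subgroup == g
        if stats.ttest_ind(outcome[mask & treated], outcome[mask & ~treated]).pvalue < 0.05:
            return True
    return False

n_studies = 2000
hits = sum(forked_analysis() for _ in range(n_studies))
print(f"studies reporting some 'effect' despite none existing: {hits / n_studies:.1%}")
# typically well over 30%, against a nominal 5%
```

Looking at subgroups is not wrong in itself; the problem is deciding to look only after the main result disappoints, which is exactly what the ‘tying yourself to the mast’ approach described below is meant to prevent.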

Detailed plan

Anna Dreber Almenberg believes that the solution is what she refers to as ‘tying yourself to the mast’.

“It is best to decide in advance exactly which tests are to be conducted, whether or not to look at subgroups, how the variables are to be defined, and so on. Everything must be decided before the analysis begins in order to ensure that the interpretation of the results will be meaningful”, she explains.

A large part of her research concerns examining whether or not research results are reliable. Together with colleagues, she was drawn into the large-scale replication project in psychology that was eventually published in Science.

Reviews top journals

A further replication study is planned for publication this winter. It examines 21 social science studies that were published in Nature and Science between 2010 and 2015. Twenty-three researchers are behind this work, including her colleagues from Stockholm School of Economics.

The next project involves investigating the reproducibility of articles published in the American journal PNAS – Proceedings of the National Academy of Sciences.

“We choose the high-profile journals because they have such a major impact within their subject fields and sometimes even at policy level”, says Anna Dreber Almenberg.

Wants to have an influence

She points to a number of risks when published results do not stand up. One problem is that researchers waste both time and resources on the wrong things. Another is that politicians who want to base policy reforms on research results risk making decisions informed by unreliable data.

Primarily, Anna Dreber Almenberg wants to influence other researchers.

“I want them to think more about the kind of statistical analysis they perform. In my opinion, most of the false results arise from good researchers, with good intentions, ending up with misleading p-values. Publishing a false positive result does not necessarily mean that the researcher is incompetent or unethical – we can all be led to believe that we have found something that later fails to withstand scrutiny. It is significant, however, if the researcher chooses to defend results that have repeatedly been shown to not stand up. Low reproducibility is a problem for science in general.”

She also believes that a change is on the way.

“Researchers in the field of psychology are leading the way in updating the methods used. Economists have also begun the process.”

Text: Siv Engelmark

4 comments


  • Swaraj Paul

    This is not only a problem with economics and social science studies; we have the same problems in the technology field. There are so many journals, and for peer-reviewed journals the publication price rises with the impact rating, so they cannot be too tough in their scrutiny. Moreover, as mentioned above, journals are never interested in publishing negative results. I fully agree with the author, and I hope serious action is taken so that such results can be published! Somebody will have to read these articles as well!

    2017.11.23

  • Christoph

    Interesting, but it would have been nice to learn a bit more about why "p-hacking" and "forking" "make it difficult to interpret the results". For the innocent user of statistics (and apparently also many reviewers), nothing seems wrong with analyzing sub-groups or choosing particular variables rather than others.

    It seems to me that the interesting question here is to what extent these findings have to do with the inherent limits of statistical methods more generally. The "tying yourself to the mast" solution (which is not about statistics) seems to point in that direction. But my hunch is that the discussion has to go much further than that.

    2017.11.23

  • Tommy Gärling

    I think "pre-registration" is a good thing if it turns the clock back to the time when not fast publication but publication of reliable results were important. It would hopefully make researchers to increase their preparations (doing pilot studies, simulations, analyses of statistical power, theoretical development) before "pre-reigister" and conducting the study (if accepted based on pre-registration).

    2017.11.23

  • Lars-Göran Johansson

    I think the most plausible explanation is that the great majority of tested hypotheses are false. Suppose that 10% of tested hypotheses are in fact true (after all, you don't test trivial hypotheses, so plausibly most are in fact false), that the chosen significance level is 5% and the power 40%. The result is that only 47% of the accepted hypotheses (i.e. those where the null hypothesis is rejected) are in fact true. If we require higher power, say 80%, we still get 36% false hypotheses among those accepted. The obvious thing to do is to require stronger significance. (Adapted from The Economist, 19 October 2013)

    2017.11.23
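The arithmetic in the last comment can be checked directly. A minimal sketch in Python, using the figures assumed in the comment itself (a 10% base rate of true hypotheses, a 5% significance level, and 40% or 80% power):

```python
# Sketch: share of "accepted" hypotheses (null rejected) that are actually true,
# given a base rate of true hypotheses, a significance level and a test power.
def share_true_among_accepted(base_rate=0.10, alpha=0.05, power=0.40):
    true_positives = base_rate * power           # true hypotheses that pass the test
    false_positives = (1 - base_rate) * alpha    # false hypotheses that pass by chance
    return true_positives / (true_positives + false_positives)

print(f"{share_true_among_accepted(power=0.40):.0%} true at 40% power")  # about 47% true
print(f"{share_true_among_accepted(power=0.80):.0%} true at 80% power")  # 64% true, 36% false
```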