Open data – it cannot just be presented

2017-01-19

Curie has spoken to researchers who share language data, water models and DNA sequences globally. The fields face different practical challenges, but a common requirement is that the data must be prepared – it cannot just be presented.

Many researchers now have years of experience of the practical implications of open data. Between five and ten years ago, SMHI (the Swedish Meteorological and Hydrological Institute) took the conscious step of sharing its data and analytical tools. These are used not only for research but also for social planning relating to, for example, flooding and drinking water.

Berit Arheimer, who leads the hydrological research at SMHI, explains:

“It can be used, for example, by environmental consultants in Spain who need to find out whether there will be sufficient water flow to irrigate fields in a particular area when the temperature increases in the future. By using the web-based Swicca climate service, they are able to calculate what the future situation is likely to be.”

Data needs to be presented

SMHI provides Swicca and several other tools as part of the EU’s major Copernicus project. A great deal of work is required to display the data in map form and to make it presentable.

“It needs to be usable. For civic consultants, it needs to be presented in the form of a simple table – not the complex data formats used by climate scientists.”

A lot of consideration is necessary when producing metadata (information about data) and when estimating the credibility of scenarios.

“If you don’t know enough about the data, you will probably not dare to use it. You need to know what lies behind the data – how the calculations were made, about uncertainties, licences and much more.”

Not all data can be trusted

SMHI understands the need to establish certainty, as it uses open data from other organisations and researchers itself, such as flow measurements and satellite images. That data has to be cleansed before it can be used, because not all measurements are plausible.

“One measuring point could, for example, be located in the middle of the ocean.”
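As a minimal sketch of what such a plausibility check could look like, the Python below drops measuring points that fall outside an expected area or report a negative flow. The record fields, the bounding box and the function names are illustrative assumptions, not SMHI's actual cleansing procedure, and a real check would use a proper land mask rather than a rectangle.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    # Hypothetical record layout, for illustration only.
    station_id: str
    lat: float
    lon: float
    flow_m3s: float


# Assumed bounding box roughly covering Sweden (illustrative values).
LAT_RANGE = (55.0, 69.5)
LON_RANGE = (10.5, 24.5)


def is_plausible(m: Measurement) -> bool:
    """Return True if the point lies inside the box and the flow is non-negative."""
    in_box = (LAT_RANGE[0] <= m.lat <= LAT_RANGE[1]
              and LON_RANGE[0] <= m.lon <= LON_RANGE[1])
    return in_box and m.flow_m3s >= 0.0


def cleanse(measurements: list[Measurement]) -> list[Measurement]:
    """Keep only the measurements that pass the plausibility check."""
    return [m for m in measurements if is_plausible(m)]
```

In this sketch, a station reported at latitude 0 and longitude -30, in the middle of the Atlantic, would be filtered out, while a station near Stockholm with a positive flow would be kept.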

The more open the format of data presentation, the greater the risk that the data will be unreliable. In the Virtual Water-Science Laboratory – a forum for water researchers – people do sometimes upload dubious data, such as measurements of their bathwater.

At the same time, open data has enabled hydrology to develop in a way that would not otherwise have been possible.

“Hydrology is about flows of fresh water. Previously, hydrologists would mainly conduct research on their local stream or river. Now we are able to exchange information across the entire world, and together we are learning more about the processes that affect water flows.”

Large amounts of data require long-term thinking

Erik Kjellström leads SMHI’s work with climate modelling. For him, open data is a prerequisite for this work.

“We are extremely data-dependent. In order to be able to further develop models for climate prediction, we need both the data we produce and the data we receive from other agents.”

SMHI also makes its own data available to other researchers. One practical problem in this field is that the data sets needed for climate modelling are so large that countries with poorer internet access can find it difficult to download them.

Another important issue is how to guarantee a long-term approach to data management. This requires personnel with the skills to manage servers and to maintain the data (including metadata) even after the individual research projects have been completed.

Erik Kjellström questions how the long-term storage of data can be ensured when it is financed by a research project.

“Research projects seldom have long-term financing.”

Språkbanken deals with copyright issues

In the international Clarin network, collections of text and speech can be shared around the world. A language researcher can, for example, study court records held by a library in Germany – something that was not previously possible.

Lars Borin, professor of linguistic data processing at the University of Gothenburg, works with Språkbanken (the Swedish Language Bank) in Clarin and discusses the practical challenges of publishing language collections.

“Copyright issues for new texts need to be handled in some way, and we do this by rearranging the texts at random. This doesn’t matter to the language researchers as they don’t need to see the text structure – they are mostly interested in the choice of words or in grammatical phenomena.”

Presenting information in a usable form also takes a lot of time and requires specialist knowledge.

“It must be searchable and annotated, so it is possible to see where the text comes from.”

He notes that an increasing number of language collections are being published, but that there is also some resistance to this development. There may be concerns about privacy, and researchers do not always want to give away their material before they have had the chance to complete their analysis and publish it. Collecting language material often involves a great deal of work, especially when it comes to interviews.

Greater basis for brain research

Gustav Nilsonne works with brain research at Stockholm University. In the field of medicine, many small studies are carried out that cannot then be replicated.

He points out that routinely making study data freely available would make it easier for him and for other researchers to perform meta-analyses, where data from several studies is combined in order to provide a greater knowledge base. Open data also makes it easier to detect errors or fraud. In addition, the data can be used to its full potential.

“As part of a study of sleep deprivation, we have performed brain imaging. When these images are made available, other researchers with, for example, different technical expertise will be able to conduct more advanced analyses of the images.”

Facilitates quality control

Another researcher working in medicine is Jens Hjerling-Leffler of the Karolinska Institute, who studies nerve cells. Part of his research involves the mapping of individual cells using a process called single-cell sequencing.

Within this field, it is routine to make data available in conjunction with publication, and this leads to more citations for the article. This is perhaps because an article with open data allows other researchers to use the data for their own analysis and publication, but also because the data makes quality control possible.

“We have sometimes seen from the data that the quality is not as high as the article had led us to believe.”

At the same time, he would like to see a more nuanced approach to the presentation of research data.

“It would be a disadvantage if all data sets were to be made available. Data needs to be processed – we don’t present raw data, we only present data that can be interpreted.”

He explains that the raw data file they produce consists of zeros and ones, and that it must be processed for a week on a supercomputer before it can be read.

Jens Hjerling-Leffler is a member of the Young Academy of Sweden, which campaigns for a differentiation of the open data guidelines for different research fields.

“This means that researchers from different fields need to be present when the guidelines are established.”

National responsibility

In the government’s research bill, the Swedish Research Council is given responsibility for the national coordination of work with open research data. But this does not mean that the Research Council currently requires open data as a condition for awarding research grants.

“It is the data management plans with which we are primarily concerned, not the publishing plans. Universities are trying to work out how all the data should be managed, and it is only when this has been decided that the question of making the data available arises. The submission of data management plans will be required for certain types of grant – possibly during 2017 or otherwise in 2018,” says Maria Thuveson, from the Swedish Research Council’s department for research funding.

On the other hand, the Research Council does require, as a condition of its research grants, that scientific results are made available online – open access – within six months of publication. This deadline may be extended in certain cases, for example for publications in the humanities and social sciences (HS).

Text: Anja Castensson
Photo: Helena Larsson / Naturfotograferna / IBL Bildbyrå
