Data publication process
Persistent identifiers as a special feature of research data
How do I publish my own data?
What do "open" and "machine-readable" formats mean?

Data publication process

What is a publication process and what does Open Access publishing mean?
The National Research Data Initiative (TTA) enterprise architecture (version 0.94) defines the research data publication process as follows:

"Purpose: Processed, quality-assured data with metadata is published for general or restricted use. The use of data located elsewhere (e.g. in a "mandatory" international storage location for a specific discipline) is also made possible.

Included sub-processes:
listing of data in a catalogue
assigning an identifier to the data.

End result:
Data that can be reused. Data is publicly available or at least made available to as large a number of users as possible in a clear, user-friendly form".

The Finnish Open Access Working Group explains the open availability of research results as follows: "The open access of research results is divided into two categories: the open availability of research publications and research data. In simplest terms, the open access publishing of research publications (articles, reports, monographs) involves the online publishing of publications "on the other end of a click": the right to freely read, copy, print and link entire scientific publications. Research results can be published online in "open access journals", or research results published elsewhere can be stored in open digital archives, i.e. they are parallel published in digital form" (www.finnoa.fi).

Although the open availability of research data is important to scientific openness, it is also worth thinking about how availability is managed, for example through licenses. Publications must comply with the principles of research ethics, and ownership and copyright issues concerning the data and publications should be settled in advance. If necessary, researchers must retain sufficient copyrights for themselves in order to ensure open availability. Open access scientific publishing does not, however, compromise on quality assurance: the quality of scientific publications is ensured by means of peer review. "It must also be possible to trust openly available research data" (www.finnoa.fi).

Persistent identifiers (PIDs) as a special feature of research data

A distinctive feature of publishing research data, as compared to, for example, administrative data, is the ability to use persistent identifiers when making references between data and publications. An essential part of publishing data is making it as easy to find as possible, so that it reaches those who are interested in precisely that data. Findability can be improved by technical solutions as well as through communication, i.e. by telling potential re-users that the data exists (Poikola, Kola, Hintikka: Public Data, Ministry of Transport and Communications 2010).


How do I publish my own data?

The publishing process can be divided into the following phases:

1.    Choose and prepare your data
-    When choosing data to be published, start with data that is already in an easily publishable form or for which you know there is already interest.
-    If desired, you can also contact the potential re-users of your data in advance. This can help you to choose the data to be published. (This is recommended in, for example, the Open Knowledge Foundation Open Data Handbook.)
-    It is not necessary to publish all of your data at the same time.
-    Check the quality and accuracy of your data.
-    Specify and record metadata as early as possible. Read more on metadata here.
-    Convert your data into an open and, if possible, machine-readable form. This makes the data easier to locate, use and link. Whether machine-readability is achievable is, however, field-specific, and not all data can be converted.

2.  Get an open license for your data
Licensing specifies the rights of the data author and the user, thus considerably improving the possibilities for data reuse. Terms of use which are not clearly explained or are completely missing can limit the use of your data far more than you intend (Poikola et al. 2010).
Licensing guidelines can be found on the Data management planning checklist.
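As an illustration of how a license can travel with the data, the following minimal Python sketch stores the license in a small machine-readable metadata record and checks that no basic field is missing. The field names, the helper functions and the example license identifier "CC-BY-4.0" are assumptions made for this example only; they are not a metadata schema or a license prescribed by this guide or by any particular service.

```python
import json


def build_metadata_record(title, creator, license_id, identifier=None):
    """Collect basic descriptive fields into one dictionary.

    The field names are hypothetical; a real service defines its own schema.
    """
    record = {"title": title, "creator": creator, "license": license_id}
    if identifier is not None:
        record["identifier"] = identifier
    return record


def missing_fields(record, required=("title", "creator", "license")):
    """Return the required fields that are absent or empty."""
    return [field for field in required if not record.get(field)]


# Example values are invented for illustration.
record = build_metadata_record(
    title="Example survey data 2014",
    creator="Example Researcher",
    license_id="CC-BY-4.0",  # example license identifier, not a recommendation
)

# JSON is one open, text-based way to store the record next to the data.
print(json.dumps(record, indent=2))
print("Missing fields:", missing_fields(record))
```

A record without a license would be reported by missing_fields, which is a simple way to notice unclear terms of use before publishing.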

3.    Save your data in a reliable storage location
It is a good idea to choose a storage location that is reliable and stable and that allows for open access: the data, or at least its metadata, can be openly viewed and is freely available in a machine-readable and easily downloadable format for anyone to find and use. Storage services are presented on the Services page.

4.    List the data in a data catalogue and assign it a persistent identifier (PID)
-    Persistent identifiers are a special feature of research data.
-    Persistent identifiers, such as a URN or DOI, are internationally unique and remain the same even if the location of the data changes.
-    Persistent identifiers allow references to be made between publications and data, which helps researchers gain merit for the data they publish.
-    Some services, such as Etsin and IDA, automatically generate a URN (Uniform Resource Name). The user can also assign an identifier received elsewhere when adding data to a service.
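The idea that an identifier stays the same while the location of the data changes can be sketched in a few lines of Python. The PidRegistry class, the identifier and the addresses below are all hypothetical and exist only for illustration; real URN and DOI resolution is handled by the resolver services behind those identifier schemes, not by code like this.

```python
class PidRegistry:
    """Toy lookup table from persistent identifiers to current locations.

    Hypothetical illustration only; it does not model any real resolver.
    """

    def __init__(self):
        self._locations = {}

    def register(self, pid, url):
        # Each identifier may be assigned only once, so it stays unique.
        if pid in self._locations:
            raise ValueError(f"identifier already registered: {pid}")
        self._locations[pid] = url

    def move(self, pid, new_url):
        # The location is updated; the identifier itself never changes.
        if pid not in self._locations:
            raise KeyError(pid)
        self._locations[pid] = new_url

    def resolve(self, pid):
        return self._locations[pid]


registry = PidRegistry()
pid = "urn:example:dataset-001"  # invented identifier
registry.register(pid, "https://old-archive.example.org/data/001")

# The data moves to another server; citations using the PID still work.
registry.move(pid, "https://new-archive.example.org/datasets/001")
print(registry.resolve(pid))
```

A publication that cites the identifier rather than the address does not need to be corrected when the data is moved.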

5.    After publication: Retrieve, use and combine data, and refer to it. Inform reusers of the existence of your openly published data and ask for feedback!
-    The goal is that your data is as openly available and easily found as possible, in a clear, easily reusable form.
-    Tools for using data include, for example, the AVAA open access publishing platform.

What do "open" and "machine-readable" formats mean?

An open format generally means a non-proprietary format whose use does not require any proprietary program. For example, plain-text (ASCII) .txt files can be opened with any text editor on any operating system, whereas Microsoft Word .docx files may not open or display correctly without proprietary Word software. Similarly, tabular data should be stored and shared as a text-based CSV file, not as an Excel file. In other words, the data should, wherever possible, be saved in a format that can be used on as many different operating systems as possible without requiring any proprietary software.
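As a minimal sketch of saving tabular data in an open format, the following Python example writes a small table as CSV text using only the standard library and reads it back. The column names and values are invented for illustration, and the helper to_csv_text is a hypothetical name.

```python
import csv
import io

# Invented example rows; in practice these would come from your own data.
rows = [
    {"site": "A", "year": 2013, "temperature_c": 4.2},
    {"site": "B", "year": 2013, "temperature_c": 5.1},
]


def to_csv_text(rows):
    """Serialise a list of dictionaries as CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = to_csv_text(rows)
print(csv_text)

# Any CSV-aware tool can read the text back; values are returned as strings.
restored = list(csv.DictReader(io.StringIO(csv_text)))
print(restored[0]["site"], restored[0]["temperature_c"])
```

Note that CSV stores every value as text, so units and data types are best described in the accompanying metadata.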

Machine-readable formats are often XML-based. In many fields, there are standard XML-based formats for the transfer of data specific to that field: for example, the format used in the field of geographic information is GML. The software used in these specialised fields, such as geographic information programs, usually includes the possibility of storing data in the given format or a conversion function for that format. For more information on machine-readability, click on the following links: Helsinki Region Infoshare and Open Data Handbook.
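To show what machine-readability means in practice, the sketch below writes a few observations as XML with Python's standard library and then lets a program read the values back without any manual interpretation. The element and attribute names are invented for this example; they are not GML or any other field-specific standard, which would define its own structure.

```python
import xml.etree.ElementTree as ET


def observations_to_xml(observations):
    """Serialise observations as XML text (hypothetical element names)."""
    root = ET.Element("observations")
    for obs in observations:
        item = ET.SubElement(root, "observation", site=obs["site"])
        ET.SubElement(item, "year").text = str(obs["year"])
        ET.SubElement(item, "temperature_c").text = str(obs["temperature_c"])
    return ET.tostring(root, encoding="unicode")


def xml_to_observations(xml_text):
    """Parse the XML text back into dictionaries with typed values."""
    root = ET.fromstring(xml_text)
    return [
        {
            "site": item.get("site"),
            "year": int(item.findtext("year")),
            "temperature_c": float(item.findtext("temperature_c")),
        }
        for item in root.findall("observation")
    ]


# Invented example data.
observations = [
    {"site": "A", "year": 2013, "temperature_c": 4.2},
    {"site": "B", "year": 2013, "temperature_c": 5.1},
]

xml_text = observations_to_xml(observations)
print(xml_text)
print(xml_to_observations(xml_text))
```

Because the structure is explicit, another program can extract, for instance, every temperature value without a person having to interpret the file first.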

You can find additional links on the publication process in the link list.