30 November 2017

Let There Be Digital Preservation – A View from the Data Archive

For the most part, November 6 this year was just an ordinary day. What made it different was a tiny SIP [1], neatly wrapped in a METS [2] container, that travelled digitally from the Finnish Social Science Data Archive to the National Long Term Digital Preservation Service.

Confused? No worries. So were we, a number of times, before we got this far. Taking our digital preservation to the level described above required planning and hard work from both research data curators and programmers. In addition to simply preserving bits in a reliable way, we aim to make sure that the digital objects will also be understandable by humans and machines in the future. This requires collecting metadata, harmonising file formats, managing versions, and preparing for change.

I have often said that preserving digital research data is like preserving a moving train. You cannot stop it. If you do, you are not preserving a moving train anymore, only a snapshot. There is a lot to preserve and it may hit you hard, and a lot quicker than you think. The train also has different cars: one can add more, or take some away. Like file formats, cars differ too, from passenger cars to freight cars. There are containers, with hundreds of objects. You need to know what is in each one, and who should have access to them. Some cars contain goods, some people. While goods may last for a long time, people need to be refreshed regularly or they will not survive the journey.

The long road to preservation is paved with obstacles and opportunities

The train analogy should show that digital preservation is an active duty. You cannot put a lid on it and wait until someone asks what is in the box. Because by then, you do not know anymore. Everyone in the preservation business recognises this. At the Data Archive, we preserve research data for long-term access. That means that we actively keep adding new information too. We make the metadata better, we may find errors in the data and fix them, or at the very least we add information on where the data has been used. Moving train, remember!

Since 2008, we have been involved in building a national digital preservation solution for cultural heritage materials and research data. For our purposes, a secure, highly reliable document store is a crucial element for building a sustainable and scalable long-term preservation solution. It will add a further preservation layer for the data we keep for our users. In a country about the size of Finland, it is feasible to provide a preservation platform nationally to a number of organisations.

We started piloting the service in 2015, and in November we finally transferred our first packages to the preservation service. It has been a long road. We have yet to pop the sparkling wine, since there are a number of short-term goals to address. Piloting a service means that there have been moments when the envisioned services were not yet fully operational, specifications needed tweaking before one could proceed, or something simply appeared out of the blue.

Tools are needed to handle the data deluge

The greatest benefit of the exercise thus far has been the internal harmonisation of file formats and data processing workflows. The Data Archive has been around since 1999. While that is a relatively short time, it is a lifetime for many file formats or their versions. We have combed through most of our holdings, about 50,000 files, and defined what will be preserved and which file formats are acceptable. While this is good, it is crystal clear that a constant technology watch is needed in the future. It is also apparent that very soon the magnitude of this will get out of hand. We cannot manually keep an eye on all files, versions and processes.

Therefore, we have built a specific data processing pipeline. It is a collection of tools that fulfil the requirements of the National Long Term Digital Preservation Service specification. Its individual parts are responsible for standardising the character sets of all files to UTF-8, combining technical metadata with study-level metadata, creating a METS document, creating a submission information package (SIP), and sending it to the preservation service provider.
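To make the steps above a little more concrete, here is a minimal sketch of such a pipeline in Python. It is not our production code and does not follow the national specification's actual METS profile. The function names, the `latin-1` fallback encoding, the list of text suffixes, the tar packaging and the study identifier `FSD0001` are all assumptions made for illustration. The transfer to the preservation service is left out, because it depends on the service's own interface.

```python
import hashlib
import tarfile
import xml.etree.ElementTree as ET
from pathlib import Path

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)

# Assumption: which suffixes count as text files that need UTF-8 conversion.
TEXT_SUFFIXES = {".txt", ".csv", ".xml"}


def to_utf8(path, source_encoding="latin-1"):
    """Rewrite a text file as UTF-8 unless it already decodes as UTF-8."""
    raw = path.read_bytes()
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: legacy files use a single known encoding.
        path.write_bytes(raw.decode(source_encoding).encode("utf-8"))


def build_mets(study_id, title, files, root):
    """Combine study-level metadata (id, title) with per-file technical
    metadata (checksum, size) in one simplified METS document."""
    mets = ET.Element(f"{{{METS_NS}}}mets", {"OBJID": study_id, "LABEL": title})
    file_sec = ET.SubElement(mets, f"{{{METS_NS}}}fileSec")
    file_grp = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    struct_map = ET.SubElement(mets, f"{{{METS_NS}}}structMap")
    div = ET.SubElement(struct_map, f"{{{METS_NS}}}div")
    for i, path in enumerate(files):
        file_id = f"file-{i}"
        file_el = ET.SubElement(file_grp, f"{{{METS_NS}}}file", {
            "ID": file_id,
            "CHECKSUM": hashlib.sha256(path.read_bytes()).hexdigest(),
            "CHECKSUMTYPE": "SHA-256",
            "SIZE": str(path.stat().st_size),
        })
        ET.SubElement(file_el, f"{{{METS_NS}}}FLocat", {
            "LOCTYPE": "URL",
            f"{{{XLINK_NS}}}href": path.relative_to(root).as_posix(),
        })
        ET.SubElement(div, f"{{{METS_NS}}}fptr", {"FILEID": file_id})
    return ET.ElementTree(mets)


def make_sip(study_dir, study_id, title, out_dir):
    """Normalise encodings, write mets.xml and pack everything into a tar file.

    study_dir should be a working copy: text files are converted in place.
    out_dir must lie outside study_dir.
    """
    files = sorted(
        p for p in study_dir.rglob("*") if p.is_file() and p.name != "mets.xml"
    )
    for path in files:
        if path.suffix.lower() in TEXT_SUFFIXES:
            to_utf8(path)
    # Checksums are computed here, after conversion, so they match the packed files.
    tree = build_mets(study_id, title, files, study_dir)
    tree.write(str(study_dir / "mets.xml"), encoding="utf-8", xml_declaration=True)
    sip_path = out_dir / f"{study_id}.tar"
    with tarfile.open(sip_path, "w") as tar:
        tar.add(str(study_dir), arcname=study_id)
    return sip_path
```

Calling `make_sip(Path("work/FSD0001"), "FSD0001", "Example study", Path("outbox"))` would leave a `FSD0001.tar` in the hypothetical `outbox` directory. A real pipeline would also carry richer descriptive metadata and validate the package against the service's specification before sending it.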

Commit to constantly challenge the current practices

It is often the case that the ideal format for digital preservation may not be ideal for scientific use. This is no new dilemma. We need to carefully assess not only the formats and their feasibility for digital preservation, but also the costs of maintaining the system of archive formats and actively used formats. Any organisation that joins the national digital preservation service must have an interest in challenging the current best practices and bringing their specific user perspective into the discussion. Because in the end, everything is kept for future use, not for storage only.

The other cornerstone is commitment. Once you start with digital preservation, you cannot easily stop. It means the know-how and resources need to be there in the future too. We believe that a national solution will be beneficial for us. We are able to transfer some of our know-how requirements to the digital preservation specialists, and focus on serving researchers better. However, we do need to keep monitoring the specialists' performance as we do our own. No outsourced activity in the digital preservation chain can be allowed to become the weakest link. Therefore, further standardisation and auditing are crucial steps in the future.

Notes:
[1] Submission Information Package (SIP): information sent from the producer to the preservation service.
[2] The Metadata Encoding and Transmission Standard (METS): a container format and metadata standard for encoding descriptive, administrative, and structural metadata regarding objects.

Why today?
» This year, the first ever International Digital Preservation Day on 30th November 2017 will draw together individuals and institutions from across the world to celebrate the collections preserved, the access maintained and the understanding fostered by preserving digital materials. The aim of the day is to create greater awareness of digital preservation that will translate into a wider understanding which permeates all aspects of society – business, policy making, personal good practice.

Further reading:
» The National Digital Library - Digital Preservation
» Digital Preservation Solution for Research Data (PAS)

Tuomas J. Alaterä
IT Services Specialist
firstname.surname [at] uta.fi


