UVC-based preservation
A Universal Virtual Computer (UVC) is a virtual machine (VM) designed specifically for the preservation of digital objects such as those held by libraries, archives and similar institutions. The method is based on emulation, but because it was developed for data archiving rather than program archiving, it does not require full emulation. Instead, the concept combines emulation and migration: emulation through the UVC, which acts as an independent platform on top of the underlying hardware and software, and migration through the conversion of specific file formats into a universal, technology-independent, XML-like format. Raymond A. Lorie, during his employment at IBM Research Centre Almaden, initiated the development of a UVC-based solution to long-term digital preservation [1]. He describes the approach as ‘Universal’ because its definition is so basic that it will endure forever, ‘Virtual’ because it will never have to be physically built, and a ‘Computer’ because of its functionality.
UVC in context
The preservation problem
Preservation of digital resources is of paramount importance for deposit libraries, research libraries, archives, government agencies, and indeed most organizations [2]. The dominant approach to digital preservation is migration. Migration entails periodically transforming archived information, and it carries a notable risk of data loss as well as possible loss of the original functionality or the ‘look and feel’ of the original format.
Role of the National Library of the Netherlands (Koninklijke Bibliotheek (KB))
The KB played a major role in demonstrating that emulation based on the UVC concept is a viable option for long-term digital preservation.
In 2000, the emulation advocate Jeff Rothenberg participated in a study with the KB to test and evaluate the feasibility of using emulation as a long-term preservation strategy. His method was to use software emulation to reproduce the behaviour of obsolete computing platforms on newer platforms, offering a way of running a digital document’s original software in the far future and thereby recreating the content, behaviour, and ‘look and feel’ of the original document [3]. Rothenberg was criticized for trying to preserve the wrong thing by proposing to emulate the functionality of the software program. To overcome this, Raymond A. Lorie introduced a novel approach to data archiving using a ‘Universal Virtual Computer’ [1]. The concept of the UVC-based preservation strategy was implemented by the KB and tested on PDF files as part of a KB/IBM ‘Long Term Preservation’ (LTP) study [4]. The emulation-based approach also resulted in the UVC becoming one of the permanent access tools for JPEG/TIFF images within the Preservation Subsystem of the KB’s e-Depot [5]. Further developments delivered a durable x86 component-based computer emulator: Dioscuri, the first modular emulator for digital preservation [6].
UVC-based preservation method
The Universal Virtual Computer is part of a broader concept, called the UVC-based preservation method. This method allows digital objects (such as text documents, spreadsheets, images, and sound waves) to be reconstructed in their original appearance at any time in the future.
The Universal Virtual Computer is a program containing a set of instructions rather than a physical computer. It offers emulation in the sense that it aims to resemble the original data format. It is also conversion in the sense that a conversion program translates the original form of the data into an easy-to-understand Logical Data View (LDV) [2]. The UVC will run as a software application on a future platform. Because we do not know today which hardware will be available in the future, the UVC must be created at the time we want to access a particular document from the repository. The UVC forms the platform on which programs written specifically for the UVC can run.
The UVC-based preservation strategy distinguishes between data archiving, which does not require full emulation, and program archiving, which does. For archiving data, the UVC is used to archive methods that interpret the stored data stream [1]. These methods are programs written in the machine language of the UVC. A UVC program is completely independent of the architecture of the computer on which it runs.
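As a rough illustration of what "a program written in the machine language of a virtual computer" means, the following minimal Python sketch interprets a toy register machine. The opcode names (`LOADC`, `ADD`, `OUT`, `HALT`) and the tuple encoding are hypothetical and chosen only for this example; they are not the actual UVC instruction set.

```python
# Toy register machine. The opcodes are hypothetical and are NOT the real
# UVC instruction set; they only illustrate the idea of a virtual computer.
def run(program, max_steps=10_000):
    """Execute a list of (opcode, *operands) tuples and return the output list."""
    registers = {}
    output = []
    pc = 0  # program counter
    for _ in range(max_steps):
        op, *args = program[pc]
        pc += 1
        if op == "LOADC":    # LOADC dst, constant
            registers[args[0]] = args[1]
        elif op == "ADD":    # ADD dst, a, b
            registers[args[0]] = registers[args[1]] + registers[args[2]]
        elif op == "OUT":    # OUT src
            output.append(registers[args[0]])
        elif op == "HALT":
            return output
        else:
            raise ValueError(f"unknown opcode: {op}")
    raise RuntimeError("step limit exceeded")


program = [
    ("LOADC", "r1", 2),
    ("LOADC", "r2", 40),
    ("ADD", "r0", "r1", "r2"),
    ("OUT", "r0"),
    ("HALT",),
]
result = run(program)
```

The program itself is plain data, so the same instruction list can be executed by any interpreter that implements these few rules, whatever hardware that interpreter runs on.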
Data archiving methodology
The data is stored in the bit stream in an internal representation of logical data elements that obey a certain schema in a certain data model. A decoding algorithm (method) extracts the various data elements from the internal representation and returns them tagged according to the schema. Information about the schema is stored with the data in the same way: it is structured according to an additional schema (a schema to read schemas) and accompanied by a method to decode it.
Logical Data Model
The logical data model is kept simple in order to minimize the amount of description accompanying the data and to make the structure of the data easier to understand. The data model chosen for the UVC-based preservation method linearizes the data elements into a hierarchical structure using an XML-like approach.
Tagged data elements
The data elements are extracted from the data stream of the digital file and returned tagged according to the logical data model specified above. The tag specifies the role that the data element plays in the data structure. The element tags hold the specific information about the content of the data in a technology-independent manner.
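The extraction of tagged elements can be sketched as follows. This is a minimal Python illustration, not a UVC program: the byte layout (two big-endian 16-bit integers followed by one byte per pixel) and the tag names are assumptions made up for the example.

```python
import struct


def decode(bit_stream: bytes):
    """Return (tag, value) pairs for a hypothetical 1-byte-per-pixel image format.

    Assumed layout: width and height as big-endian 16-bit integers,
    followed by width * height pixel bytes.
    """
    width, height = struct.unpack_from(">HH", bit_stream, 0)
    pixels = bit_stream[4:4 + width * height]
    if len(pixels) != width * height:
        raise ValueError("truncated pixel data")
    elements = [("image", None), ("width", width), ("height", height)]
    for value in pixels:
        elements.append(("pixel", value))
    return elements


stream = struct.pack(">HH", 2, 2) + bytes([0, 255, 255, 0])
tagged = decode(stream)
```

Each value leaves the decoder with a tag naming its role, so the consumer never needs to know the byte layout of the original stream.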
The schema (format decoder)
More information about the various data elements is needed for a human to understand what each element means: the place of the tags in the hierarchy, the type of data (numeric, character), and some information on the semantics of the data. For example, an image has two attributes, width and height, indicating that width times height pixels follow; but are these pixels stored line by line or column by column? Or, for colored pictures, how should the RGB values be interpreted in order to recreate the right color? This extra information is also called metadata. The schema is clearly application-dependent, as it describes the structure and meaning of the tags as parts of a specific information type.
The data elements tagged according to the schema are returned to the client in a Logical Data View (LDV).
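To make the role of the schema concrete, here is a small Python sketch. The dictionary layout, the field names (`parent`, `type`, `meaning`) and the helper `to_rows` are illustrative assumptions, not the schema format used by the UVC method; the point is only that the "line by line" convention must be recorded somewhere for the pixels to be arranged correctly.

```python
# Hypothetical schema for a greyscale image: for each tag, its place in the
# hierarchy, its data type, and a short statement of its meaning.
SCHEMA = {
    "image":  {"parent": None,    "type": "group",   "meaning": "a greyscale raster image"},
    "width":  {"parent": "image", "type": "numeric", "meaning": "pixels per line"},
    "height": {"parent": "image", "type": "numeric", "meaning": "number of lines"},
    "pixel":  {"parent": "image", "type": "numeric",
               "meaning": "grey level 0-255, stored line by line"},
}


def to_rows(width, height, pixels):
    """Arrange a flat pixel list into lines, following the line-by-line convention."""
    if len(pixels) != width * height:
        raise ValueError("pixel count does not match width * height")
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


rows = to_rows(3, 2, [1, 2, 3, 4, 5, 6])
```

Had the schema stated "column by column" instead, the same six values would describe a different picture, which is why this information has to travel with the data.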
Schema to read the schemas (Logical Data Schema (LDS))
If a future user receives the tagged data elements, he or she will generally not understand the meaning of the data and the relationships between them, and will need additional information on the logical structure. In other words, a schema to read the metadata schema is needed. A simple solution adopted for the UVC approach is to treat the schema in the same way as the data: the schema information is stored in an internal representation and accompanied by a method to decode it.
At this point, what will be included in the archive is: the data itself, the metadata, a UVC program to decode the data, and a UVC program to decode the metadata.
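The four archived parts can be pictured as a single record. The following Python sketch is only an illustration; the class name `ArchivedObject` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivedObject:
    """Hypothetical container for the four parts kept in the archive."""
    data: bytes            # the data itself, in its internal representation
    data_decoder: bytes    # UVC program that decodes the data
    schema: bytes          # the metadata (schema information), internal representation
    schema_decoder: bytes  # UVC program that decodes the metadata

    def is_complete(self) -> bool:
        """True only if none of the four parts is empty."""
        parts = (self.data, self.data_decoder, self.schema, self.schema_decoder)
        return all(len(part) > 0 for part in parts)


complete = ArchivedObject(b"data", b"P", b"schema", b"Q")
incomplete = ArchivedObject(b"data", b"", b"schema", b"Q")
```

An object missing any of the four parts could not be fully interpreted in the future, so a completeness check of this kind would be a natural ingest-time safeguard.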
UVC-based preservation methodology
Data archiving is only one part of the UVC-based preservation methodology, whose central idea rests on five components. These are:
- Universal Virtual Computer
- UVC program (format decoder)
- Logical Data Schema (LDS) with information type description
- Restoration program
- Logical data viewer
The UVC program decodes the file format of a digital object. This format decoder runs on the UVC, which is the platform-independent layer and is unaffected by future hardware and software changes. Executing the format decoder delivers the tagged elements. These elements build the Logical Data View (LDV) of the data, which is quite similar to XML. The LDV is an instantiation of the LDS, which describes the structure and meaning of the tags as parts of a specific information type.
All these components are controlled by a Logical Data Viewer, simply called the viewer. For reconstruction, the viewer starts the UVC and feeds the data of the digital object to a format decoder running on top of the UVC. In return it retrieves an LDV and reconstructs a specific representation of the original object’s meaning.
Together with the original data, these components make it possible to reconstruct the meaning of each particular digital object. The UVC can be seen as the heart of the system. Like the Java Virtual Machine and the Common Language Runtime, the UVC is in effect an emulator for hardware that does not physically exist.
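The interplay between viewer, UVC and format decoder can be sketched in Python as follows. Everything here is a stand-in: `Viewer`, `render_ldv` and `fake_uvc` are hypothetical names, and the "decoder program" is an ordinary Python function rather than real UVC code.

```python
def render_ldv(elements):
    """Serialize (tag, value) pairs as XML-like lines."""
    return "\n".join(f"<{tag}>{value}</{tag}>" for tag, value in elements)


class Viewer:
    """Hypothetical viewer: starts the UVC, feeds it the data, retrieves the LDV."""

    def __init__(self, uvc_emulator):
        # uvc_emulator: callable (decoder_program, data) -> list of (tag, value)
        self.uvc = uvc_emulator

    def view(self, decoder_program, data):
        ldv = self.uvc(decoder_program, data)  # run the decoder on top of the UVC
        return render_ldv(ldv)                 # reconstruct a representation


def fake_uvc(decoder_program, data):
    """Stand-in emulator: here the 'program' is simply a Python callable."""
    return decoder_program(data)


def length_decoder(data):
    """Stand-in format decoder that reports only the length of the data."""
    return [("length", len(data))]


text = Viewer(fake_uvc).view(length_decoder, b"abc")
```

The viewer never inspects the bytes of the object itself; it relies entirely on the decoder running on the emulator and on the tags that come back.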
What needs to be done
Different steps must be taken at archiving time (present) and at retrieval time (future).
At archiving time
Step 1 - Define the appropriate logical schema for a given application
Step 2 - Choose an internal representation and associate a UVC program P with the data. This is part of the normal design of an application
Step 3 - Write the UVC program for data interpretation
Step 4 - Archive the schema information by storing an internal representation of the schema information in the bit stream together with a UVC program Q to decode it. Since the structure of the schema is the same for all applications, a schema to read schemata is chosen once and for all
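One way to picture the outcome of the archiving steps is a single bit stream holding the data, program P, the schema information and program Q. The Python sketch below uses a length-prefixed layout (a 4-byte big-endian length before each section); this layout is an assumption made for illustration and is not the container format defined by the UVC method.

```python
import struct


def pack_sections(sections):
    """Concatenate byte sections, each preceded by a 4-byte big-endian length."""
    out = bytearray()
    for section in sections:
        out += struct.pack(">I", len(section)) + section
    return bytes(out)


def unpack_sections(stream):
    """Inverse of pack_sections: split a stream back into its sections."""
    sections, offset = [], 0
    while offset < len(stream):
        (length,) = struct.unpack_from(">I", stream, offset)
        offset += 4
        if offset + length > len(stream):
            raise ValueError("truncated section")
        sections.append(stream[offset:offset + length])
        offset += length
    return sections


# Placeholder contents: data, program P, schema information, program Q.
archived = pack_sections([b"data", b"program P", b"schema", b"program Q"])
restored = unpack_sections(archived)
```

A self-describing layout of this kind keeps the data and both decoders together, so nothing has to be located separately at retrieval time.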
At retrieval time
Step 1 - Create an emulator on the current platform. Because of the simplicity of the UVC concept, it is fairly easy for skilled software developers to construct a UVC emulator for a particular platform of the time
Step 2 - Develop a Logical Data Viewer (a restore program to restore the data). This is an application program that reads the UVC object code and the bit stream and invokes the emulator to execute the UVC program, i.e. the program controls the UVC and all input/output interaction with it
Step 3 - Write a restore program to restore the schema. Since the logical view for the schema information is fixed, a single restore program may actually support all applications. If the future client already knows the logical view for the documents being restored, then the schema does not necessarily need to be retrieved. Furthermore, the schema only needs to be requested once for a collection of documents of the same type
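The observation in Step 3 that the schema only needs to be requested once per document type amounts to simple caching. The following Python sketch illustrates it; `SchemaCache` and the `restore_schema` callable are hypothetical names, and the stand-in restore function merely decodes ASCII text.

```python
class SchemaCache:
    """Hypothetical cache: restore each document type's schema only once."""

    def __init__(self, restore_schema):
        self._restore = restore_schema  # callable: schema bytes -> decoded schema
        self._cache = {}
        self.restore_calls = 0          # counts how often a schema was really restored

    def get(self, doc_type, schema_stream):
        if doc_type not in self._cache:
            self.restore_calls += 1
            self._cache[doc_type] = self._restore(schema_stream)
        return self._cache[doc_type]


# Stand-in restore program: decodes the schema bytes as ASCII text.
cache = SchemaCache(lambda raw: raw.decode("ascii"))
first = cache.get("image", b"image-schema")
second = cache.get("image", b"image-schema")
other = cache.get("text", b"text-schema")
```

For a large collection of documents of the same type, the restore program for the schema therefore runs once, while the data decoder still runs for every document.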
References
- Lorie, R. A., 2001. Long term preservation of digital information. Proceedings of the 1st ACM/IEEE-CS joint conference on Digital libraries, Roanoke, Virginia, United States. 24-28 June 2001. New York, NY: Association for Computing Machinery. pp. 346-352
- Lorie, R. A., 2002. A Methodology and System for Preserving Digital Data. Proceedings of the 2nd ACM/IEEE-CS joint conference on Digital libraries, Portland, Oregon, USA. 14-18 July 2002. New York, NY: Association for Computing Machinery. pp. 312-319
- Rothenberg, J., 2000. Experiment in using digital emulation to preserve digital publications: NEDLIB Report Series 1 [online] Den Haag: National Library of the Netherlands http://nedlib.kb.nl/results/emulationpreservationreport.pdf
- Lorie, R. A., 2002. The UVC: a Method for Preserving Digital Documents – Proof of Concept. IBM/KB Long-term Preservation Study. Amsterdam: IBM Netherlands
- Wijngaarden, H., Oltmans, E., 2004. Digital Preservation and Permanent Access: The UVC for Images. Proceedings of the Imaging Science & Technology Archiving Conference, San Antonio, USA. 23 April 2004. pp. 254-259 http://www.kb.nl/hrd/dd/dd_links_en_publicaties/publicaties/uvc-ist.pdf
- van der Hoeven, J. R., Lohman, B., Verdegem, R., 2007. Emulation for Digital Preservation in Practice: The Results. The International Journal of Digital Curation, 2(2), pp. 123-132
External links
- UVC demonstration tool - Freely available UVC demonstration tool from IBM.
- National Library of the Netherlands - Link to KB's "e-depot and digital preservation" page.
- Dioscuri software - Open source software for any individual or institution that would like to work with their older digital documents again