Introduction

This document provides guidance on implementation of the [[OCFL-Specification]] for how clients should behave when operating on OCFL Objects.

Digital Preservation

Rebuildability

A key goal of the OCFL is the rebuildability of a repository from an OCFL storage root without additional information resources. Consequently, a key implementation consideration should be to ensure that OCFL objects contain all the data and metadata required to achieve this. With reference to the [[OAIS]] model, this would include all the descriptive, administrative, structural, representation, and preservation metadata relevant to the object.

Additionally, as an aid to those who may need to recover OCFL objects in the future, it is recommended that a copy of the [[OCFL-Specification]] is stored in the top level of the OCFL storage root. The OCFL ignores files other than the conformance declaration at the top level so it is a good location to store documentation that may be useful for recovery.

A more complete approach would be to create a specific OCFL object that contains this documentation and to have a pointer to its location in the storage root. This documentation object would then be subject to OCFL validation and any other digital preservation processes that might be implemented without requiring special handling.

Fixity

The digests in the manifest are used by the OCFL for content addressability rather than fixity, but they are suitable for use as part of a fixity regime, and the manifest block usefully identifies all the files in an object. OCFL validation also requires that digests and files match. However, while the characteristics of digest algorithms that make them suitable for fixity checking and content addressing are closely related, they are not identical. In particular, fixity against malicious tampering requires that a digest computation is hard to reverse, which is not a requirement for content addressing. It is this aspect which is the most frequent target for cryptanalytic attack.

Consequently, it is sensible to allow additional or alternative fixity algorithms to be used. Digests from these algorithms may be recorded in a fixity block, which has the same layout as a manifest block but permits a broader range of algorithms. The OCFL will consider a fixity block valid if all the files referenced in the block exist, but the OCFL does not validate digests for all possible algorithms. The fixity block does not have to include all the files in an object; this permits legacy fixity to be imported without requiring continued use of obsolete digest algorithms.
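
As an illustration, a client-side check of a fixity block might look like the following minimal sketch. It assumes the fixity block has been parsed into a dictionary that maps algorithm names to digest-to-path maps (the same layout as a manifest block) and that paths are relative to the object root. The names check_fixity_block and file_digest are hypothetical, and only algorithms that Python's hashlib recognises by name are checked; any others are reported as skipped rather than validated.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm, chunk_size=1 << 20):
    """Digest a file in chunks so large files need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def check_fixity_block(object_root, fixity):
    """Return (mismatches, missing, skipped_algorithms) for a fixity block.

    `fixity` is assumed to look like {algorithm: {digest: [path, ...]}}.
    """
    mismatches, missing, skipped = [], [], []
    for algorithm, digests in fixity.items():
        supported = algorithm in hashlib.algorithms_available
        if not supported:
            skipped.append(algorithm)
        for expected, paths in digests.items():
            for rel_path in paths:
                full = Path(object_root) / rel_path
                if not full.is_file():
                    # Existence is checked even when the digest cannot be.
                    missing.append(rel_path)
                elif supported and file_digest(full, algorithm) != expected.lower():
                    mismatches.append((algorithm, rel_path))
    return mismatches, missing, skipped
```

Checking file existence separately from digest comparison mirrors the distinction drawn above: every referenced file must exist, but digests are only compared where the client supports the algorithm.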

Storage

Object Contents

The OCFL separates the existing file path of stored files from the logical file path of these files' content in OCFL object versions. This is a key feature that allows previous versions of objects to remain immutable while still permitting deduplication, forward delta differencing, and easy file renaming. Consequently, the OCFL only requires that files added to any version of an OCFL object must be stored somewhere within the relevant version directory, with a corresponding entry in the manifest block. An entry in the state block determines the path and name of the file within that version by referencing the manifest entry, not the actual path on disk.
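
The indirection between the state and manifest blocks can be sketched as follows. This is a minimal illustration that assumes the inventory has been parsed into a dictionary with a "manifest" block and per-version "state" blocks, each mapping a digest to a list of paths. The function name resolve_content_path, the placeholder digest "abc123", and the example paths are all hypothetical.

```python
def resolve_content_path(inventory, version, logical_path):
    """Map a logical path in a version to a stored path within the object."""
    state = inventory["versions"][version]["state"]
    for digest, logical_paths in state.items():
        if logical_path in logical_paths:
            # Every path listed under a digest holds the same bytes,
            # so the first manifest entry is sufficient.
            return inventory["manifest"][digest][0]
    raise KeyError(f"{logical_path!r} is not part of {version}")


# The same stored file appears under a different logical name in v2:
# a rename that touches only the state block, not the file on disk.
inventory = {
    "manifest": {"abc123": ["v1/content/0001.dat"]},
    "versions": {
        "v1": {"state": {"abc123": ["reports/2019/annual-report.pdf"]}},
        "v2": {"state": {"abc123": ["reports/annual-report-2019.pdf"]}},
    },
}
```

Calling resolve_content_path(inventory, "v2", "reports/annual-report-2019.pdf") returns the path stored under v1, which is how a later version can rename a file without altering the earlier version.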

The most transparent approach is to have the path used to store the file on disk the same as the path of the file within the object when accessioned. This is readily understandable in terms of visual inspection of the physical filesystem.

However, this is not always possible. For example, complex objects with deep file hierarchies may encounter issues if they come from a filesystem that allows longer paths than are supported by the target OCFL system. In this case, the decoupling between existing file paths and logical file paths in OCFL objects allows the use of truncated paths for storage while the full paths can be preserved in state block entries which are not length constrained.

Another use case is importing content from other repository systems that rename files on ingest and store them in a flat hierarchy. These can be imported, as is, and the original paths and file names recorded through suitable state block entries rather than reconstructing a physical file layout. Of course, the OCFL supports ongoing use of such a methodology.

Data and Metadata

OCFL object versions are composed of a series of files/bitstreams, but the OCFL does not make any distinction between different types of files other than those reserved for OCFL functionality: the inventory, its digest file, and conformance declaration files. It is possible, for example, to create separate data and metadata directories within each version to help organize material, but all files are treated equally for the purpose of OCFL validation and management.

Deduplication

The OCFL supports optional deduplication if a client ensures that each digest in the manifest block refers to a single file path on disk. This entry is created the first time file content is stored in an OCFL Object. Subsequent references to that file content should then occur in the state block only. Whether content is already stored can be determined by computing the digests of incoming files and checking whether they already exist in the manifest block.

If deduplication is carried out within an object then, for consistency, it is expected that Forward Delta differencing will also be used between object versions so subsequent references to duplicated content should also refer back to the original manifest entry rather than updating it to include additional references.
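
The deduplication check described above can be sketched as follows. This is an illustration only: it assumes a parsed inventory dictionary with "manifest" and per-version "state" blocks, that new content is stored in a directory named content under the version directory, and that file content is small enough to pass in memory. The function name add_file and its parameters are hypothetical.

```python
import hashlib


def add_file(inventory, version, logical_path, data, algorithm="sha512"):
    """Record `data` at `logical_path` in `version`, storing it only if new.

    Returns the path the caller should write `data` to, or None when
    identical content is already in the manifest and nothing is stored.
    """
    digest = hashlib.new(algorithm, data).hexdigest()
    # The state block always gains a reference to the content.
    state = inventory["versions"][version]["state"]
    state.setdefault(digest, []).append(logical_path)
    if digest in inventory["manifest"]:
        # Duplicate content: refer back to the original manifest entry
        # rather than adding a second stored copy.
        return None
    content_path = f"{version}/content/{logical_path}"
    inventory["manifest"][digest] = [content_path]
    return content_path
```

Because the manifest entry is left untouched when the digest is already present, the same function covers duplicates within one version and unchanged content carried forward from an earlier version.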

Filesystem metadata

Filesystem metadata (e.g. permissions, access, and creation times) are not considered portable between filesystems or preservable through file transfer operations. Nor can these attributes be validated in terms of fixity in a consistent manner. As such, the OCFL neither explicitly supports nor expects that these attributes remain consistent. If retaining this metadata is important then files should either be encapsulated in a filesystem image format that preserves this information, or the metadata extracted and stored explicitly in an additional file.
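
One way to extract such metadata into an additional file is sketched below. It is a minimal example under stated assumptions: the choice of JSON, the recorded fields, and the function names capture_fs_metadata and write_fs_metadata are all illustrative, and creation time is omitted because its availability differs between platforms.

```python
import json
import os
import stat


def capture_fs_metadata(root):
    """Collect mode, modification time and size for every file under root."""
    records = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            # Use forward slashes so the record is portable across platforms.
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            records[rel] = {
                "mode": stat.filemode(st.st_mode),
                "mtime_ns": st.st_mtime_ns,
                "size": st.st_size,
            }
    return records


def write_fs_metadata(root, out_path):
    """Write the captured metadata to a JSON file that can be stored as content."""
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(capture_fs_metadata(root), f, indent=2, sort_keys=True)
```

The resulting file is ordinary content as far as the OCFL is concerned, so it is covered by the same digests and validation as every other file in the object.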

Empty Directories

The OCFL preserves files and their content, with directories serving as a useful organizational convention. An empty directory consists only of filesystem metadata and therefore, as noted above, is not amenable to direct preservation in OCFL objects. If the preservation of empty directories is considered essential then the suggested route is to insert a zero length file named .keep into the directory which will ensure directories are preserved as part of the file's path.

Note that .keep files are not considered special by the OCFL in any way and are treated exactly the same way as other files. As such, a non-zero length .keep file is not considered invalid.
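
A client could insert these files before accessioning with something like the following sketch. The function name add_keep_files is hypothetical, and the sketch assumes it is run on a staging copy of the content rather than inside an existing OCFL object.

```python
import os


def add_keep_files(root):
    """Create a zero-length .keep file in every empty directory under root."""
    created = []
    for dirpath, dirnames, filenames in os.walk(root):
        # A directory is empty when it has neither subdirectories nor files.
        if not dirnames and not filenames:
            keep = os.path.join(dirpath, ".keep")
            open(keep, "wb").close()
            created.append(keep)
    return created
```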

Objects with Many Small Files

Objects that contain a large number of files can pose performance problems if they are stored in a filesystem as-is. Fixity checks, object validation and version creation can require an OCFL client to process all the files in an object, which can be time consuming. Additionally, most storage systems have a minimum block size for allocation to files, so a large number of small files can end up occupying a volume of storage significantly larger than the sum of the individual file sizes. In this case, assuming that the majority of the files are relatively static data that is unlikely to change between object versions, it is sensible to package the static files together in a single, larger file (zip is recommended). This can be parsed to extract individual files if necessary but can significantly improve the efficiency of basic OCFL client and storage operations.
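
Packaging static files into a single zip, and reading one member back without unpacking the rest, might be sketched as follows. The function names package_static_files and read_member are hypothetical, and the sketch assumes the zip is written to a location outside the directory being packaged so that it does not include itself.

```python
import os
import zipfile


def package_static_files(source_dir, zip_path):
    """Pack every file under source_dir into one zip; return the member count."""
    count = 0
    with zipfile.ZipFile(zip_path, "w") as zf:
        # Sorting gives a stable member order across runs.
        for dirpath, _dirnames, filenames in sorted(os.walk(source_dir)):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                arcname = os.path.relpath(full, source_dir)
                zf.write(full, arcname)
                count += 1
    return count


def read_member(zip_path, member):
    """Return the bytes of a single member without extracting the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        return zf.read(member)
```

The OCFL client then sees one file to digest and validate, while individual members remain reachable through read_member using forward-slash member names.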

Storage Root Hierarchy

Strictly speaking, the OCFL only requires that an OCFL Storage Root contains OCFL Objects in directories, distributed in some manner in the underlying filesystem. In turn, an OCFL object is identified purely by the presence of a [[NAMASTE]] conformance file in the object root. The presence and correctness of inventory files and version directories are a validation rather than an identification concern.

These definitions allow a lot of freedom as to how objects are arranged beneath an OCFL Storage Root and, while there is no strict requirement for all OCFL Objects to be arranged according to the same system, it is nevertheless considered good practice to do so. In addition, in the interests of rebuildability, it would be prudent to include an indication of the details of this arrangement alongside the OCFL specification as described in the Rebuildability section.

In the interests of transparency, it makes sense for an object's URI, its unique identifier and its location under the OCFL Storage Root to be aligned and simply derivable from each other. Good examples include:

Filesystem Features

In order to be portable across as many filesystems as possible, the OCFL makes use of a subset of filesystem features that are very broadly supported. It is therefore strongly advised to not use additional features in OCFL Storage Roots since OCFL clients and other filesystem tools that need to operate between different filesystems may exhibit unpredictable behaviour when feature sets do not match. In particular, using features such as hard and soft (symbolic) links for deduplication can work at odds with the OCFL's own mechanisms and should be avoided.

Consideration should also be given to calculations of storage usage when migrating between filesystems. Many back-end filesystem features, which are essentially invisible to user-space code, can have a significant impact on the actual consumption of storage space compared with a simple sum of file sizes. Compression, extents and block sub-allocation are examples of such features which, while providing benefits in terms of storage efficiency, do require care when considering issues of capacity planning or migration.

Client Behaviors

Basic File Operations

The OCFL and its inventory structure are designed to support and capture the following file operations that create OCFL versions, regardless of whether optional features, such as deduplication, are used. The OCFL is not concerned with the process of creating versions but only the final outcome in terms of the differences with the previous version that need to be recorded and preserved.

Versioning

Version Numbering

Version numbers should start at 1 and be sequential positive integers. Version names start with a lower case v. The numbers may be zero padded to the left to give a fixed length but, if padding is used, zero padded numbers must always retain at least one leftmost zero. All versions in an object must use the same version numbering layout, which can be easily determined by looking at one existing version: if the digit following v is a zero then the number format is zero padded to fixed length, otherwise it is simply an integer.
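
The rule for detecting the numbering layout from an existing version can be sketched as below. The function name next_version_name is hypothetical, and raising an error when a zero padded scheme runs out of numbers is an assumption about how a client might choose to react.

```python
import re

_VERSION_RE = re.compile(r"v(\d+)")


def next_version_name(existing):
    """Return the name of the version that follows the given version names."""
    if not existing:
        return "v1"
    numbers = []
    for name in existing:
        match = _VERSION_RE.fullmatch(name)
        if match is None:
            raise ValueError(f"not a version name: {name!r}")
        numbers.append(int(match.group(1)))
    digits = existing[0][1:]
    nxt = max(numbers) + 1
    if not digits.startswith("0"):
        # No zero after the v: plain integer layout.
        return f"v{nxt}"
    # Zero padded layout: keep the width and at least one leading zero.
    candidate = f"v{nxt:0{len(digits)}d}"
    if len(candidate) != len(digits) + 1 or candidate[1] != "0":
        raise ValueError("zero padded version numbers are exhausted")
    return candidate
```

For example, an object whose versions are v0001 and v0002 yields v0003, while an object at v09 cannot receive another version under the same layout because v10 would lose its leftmost zero.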

Version Immutability

Previous versions of an object should be considered immutable since the composition of later versions of an object may be dependent on them. In addition, the assumption of immutability ensures that copies of different versions of an object remain consistent with each other, avoiding issues with identifying canonicity and reconciliation.

One key consequence of this immutability is that manifest entries should never be deleted. New entries may be created, and, if not deduplicating file content, additional references to copies of stored content may be added.
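
A client could guard against accidental deletions by comparing the manifest it is about to write with the one already stored, as in this minimal sketch. The function name removed_manifest_entries is hypothetical, and both manifests are assumed to be dictionaries mapping digests to lists of paths with digests written in a consistent case.

```python
def removed_manifest_entries(old_manifest, new_manifest):
    """List (digest, path) pairs present in old_manifest but not in new_manifest."""
    removed = []
    for digest, paths in old_manifest.items():
        kept = set(new_manifest.get(digest, []))
        removed.extend((digest, path) for path in paths if path not in kept)
    return removed
```

An empty result means the new manifest only adds entries or references, which is consistent with the immutability of earlier versions.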

File Purging

Sometimes a file needs to be deleted from all versions of an object, perhaps for legal reasons. Doing this to an OCFL Object breaks the previous version immutability assumption and is not supported directly. The correct way to do this is to create a new object that excludes the offending file, with a revised version history taking this into account. The original object can then be deleted in its entirety. Creating the new object first is good practice as it avoids any risk of data loss that may occur if an object were to be deleted before the new object is created.

The new object need not have the same identifier as the original object. In this case, the deleted object can be replaced by a "stub" object with the original identifier and location in the OCFL Storage Root. This object is a standard OCFL object that just contains brief information that redirects users and software to the new version - possibly with an indication of why the new object was created, if appropriate. The OCFL does not define this stub object in any way - its structure, interpretation and handling are entirely client dependent, but ideally some elements should be human readable for rebuildability.

Log Information

There may be a need to record some actions on objects that do not result in changes to the object content, for example copying the object to new storage or validating fixity and finding nothing amiss. The log directory is the location in an OCFL object where such events can be recorded. The OCFL does not make any assumptions about the contents of this directory but, if it exists, then its contents will not be subject to any validation processes.

Forward Delta

Forward delta differencing is a key, though optional, feature of the OCFL that means that parts of an OCFL object version that are unchanged from a previous version are not stored again. This has the potential to significantly improve storage efficiency when objects have multiple versions, whether through ongoing curatorial action or the accessioning of updated material.

When a new version of an OCFL Object is created from an earlier version and a client wishes to implement forward delta differencing, then the possible file operations are handled in the following manner (with reference to the state and manifest blocks of the OCFL object's inventory file):