This document provides guidance on implementation of the [[OCFL-Specification]] for how clients should behave when operating on OCFL Objects.
A key goal of the OCFL is the rebuildability of a repository from an OCFL storage root without additional information resources. Consequently, a key implementation consideration should be to ensure that OCFL objects contain all the data and metadata required to achieve this. With reference to the [[OAIS]] model, this would include all the descriptive, administrative, structural, representation, and preservation metadata relevant to the object.
Additionally, as an aid to those who may need to recover OCFL objects in the future, it is recommended that a copy of the [[OCFL-Specification]] is stored in the top level of the OCFL storage root. The OCFL ignores files other than the conformance declaration at the top level so it is a good location to store documentation that may be useful for recovery.
A more complete approach would be to create a specific OCFL object that contains this documentation and to have a pointer to its location in the storage root. This documentation object would then be subject to OCFL validation and any other digital preservation processes that might be implemented without requiring special handling.
The digests in the manifest are used by the OCFL for content addressability rather than fixity, but they are suitable for use as part of a fixity regime, and the manifest block usefully identifies all the files in an object. OCFL validation also requires that digests and files match. However, while the characteristics of digest algorithms that make them suitable for fixity checking and content addressing are closely related, they are not identical. In particular, fixity against malicious tampering requires that a digest computation is hard to reverse, which is not a requirement for content addressing. This is the property most frequently targeted by cryptanalytic attacks.
Consequently, it is sensible to allow additional or alternative fixity algorithms to be used. These digests may be recorded in a fixity block, which has the same layout as a manifest block but permits a broader range of algorithms. The OCFL will consider a fixity block valid if all the files referenced in the block exist, but the OCFL does not validate digests for all possible algorithms. The fixity block does not have to include all the files in an object, which allows legacy fixity information to be imported without requiring continued use of obsolete digest algorithms.
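As an illustration of how a manifest-shaped block can support a fixity regime, the following is a minimal sketch rather than a reference implementation. The function names `file_digest` and `check_block` are hypothetical, and the sketch assumes the block maps each digest to a list of file paths relative to the object root and that the named algorithm is available in Python's `hashlib`.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha512") -> str:
    """Return the hex digest of a file, reading in chunks to bound memory use."""
    h = hashlib.new(algorithm)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def check_block(object_root: Path, block: dict, algorithm: str) -> list:
    """Check a manifest-shaped block (digest -> list of stored file paths).

    Returns a list of (path, problem) tuples; an empty list means every
    referenced file exists and matches its recorded digest.
    """
    problems = []
    for expected, stored_paths in block.items():
        for rel in stored_paths:
            target = object_root / rel
            if not target.is_file():
                problems.append((rel, "missing"))
            elif file_digest(target, algorithm) != expected.lower():
                # Hex digests are compared case-insensitively.
                problems.append((rel, "digest mismatch"))
    return problems
```

The same function could be applied to the manifest block with the object's own digest algorithm, or to a fixity block with whichever legacy algorithm it records, provided `hashlib` supports that algorithm on the local platform.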
The OCFL separates the existing file path of stored files from the logical file path of these files' content in OCFL object versions. This is a key feature that allows previous versions of objects to remain immutable while still permitting deduplication, forward delta differencing, and easy file renaming. Consequently, the OCFL only requires that files added to any version of an OCFL object must be stored somewhere within the relevant version directory, with a corresponding entry in the manifest block. An entry in the state block determines the path and name of the file within that version by referencing the manifest entry, not the actual path on disk.
The most transparent approach is to have the path used to store the file on disk the same as the path of the file within the object when accessioned. This is readily understandable in terms of visual inspection of the physical filesystem.
However, this is not always possible. For example, complex objects with deep file hierarchies may encounter issues if they come from a filesystem that allows longer paths than are supported by the target OCFL system. In this case, the decoupling between existing file paths and logical file paths in OCFL objects allows the use of truncated paths for storage, while the full paths can be preserved in state block entries, which are not length constrained.
Another use case is importing content from other repository systems which rename files on ingest and store them in a flat hierarchy. These can be imported as is, with the original paths and file names recorded through suitable state block entries rather than by reconstructing a physical file layout. Of course, the OCFL supports ongoing use of such a methodology.
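The decoupling of logical file paths from stored file paths can be illustrated with a minimal sketch. It assumes an inventory already parsed into a Python dictionary, with a `manifest` block mapping digests to stored paths and a per-version `state` block mapping digests to logical paths. The function name `resolve_stored_path`, the shortened digest, and the example paths are hypothetical.

```python
def resolve_stored_path(inventory: dict, version: str, logical_path: str) -> str:
    """Map a logical file path in a given version to a stored path on disk."""
    state = inventory["versions"][version]["state"]
    for digest, logical_paths in state.items():
        if logical_path in logical_paths:
            # Any stored path listed for the digest holds the same content;
            # the first one is used here.
            return inventory["manifest"][digest][0]
    raise KeyError(f"{logical_path!r} is not present in {version}")


# Hypothetical inventory fragment: a long logical path stored under a short name.
example_inventory = {
    "manifest": {"ab12": ["v1/content/0001"]},
    "versions": {
        "v1": {"state": {"ab12": ["very/deep/original/hierarchy/report.pdf"]}}
    },
}
```

With this fragment, `resolve_stored_path(example_inventory, "v1", "very/deep/original/hierarchy/report.pdf")` returns `"v1/content/0001"`.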
OCFL object versions are composed of a series of files/bitstreams, but the OCFL does not make any distinction between different types of files other than those reserved for OCFL functionality: the inventory, its digest file, and conformance declaration files. It is possible, for example, to create separate data and metadata directories within each version to help organize material, but all files are treated equally for the purpose of OCFL validation and management.
The OCFL supports optional deduplication, provided a client ensures that each digest in the manifest block refers to a single file path on disk. This entry is created the first time the file content is stored in an OCFL Object. Subsequent references to that file content should then occur in the state block only. Whether content is already present can be determined by computing the digests of incoming files and checking whether they already exist in the manifest block.
If deduplication is carried out within an object then, for consistency, it is expected that Forward Delta differencing will also be used between object versions so subsequent references to duplicated content should also refer back to the original manifest entry rather than updating it to include additional references.
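The deduplication check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a prescribed client behaviour: the function name `add_file` is hypothetical, content is passed as in-memory bytes for brevity, and new content is assumed to be stored under a `content` directory inside the version directory.

```python
import hashlib


def add_file(inventory: dict, version: str, logical_path: str,
             data: bytes, algorithm: str = "sha512") -> bool:
    """Record a file in the given version, deduplicating by digest.

    Returns True if the content is new and the caller must write it to disk,
    or False if an existing manifest entry already covers this content.
    """
    digest = hashlib.new(algorithm, data).hexdigest()
    manifest = inventory["manifest"]
    is_new_content = digest not in manifest
    if is_new_content:
        # First occurrence: create the single manifest entry for this digest.
        manifest[digest] = [f"{version}/content/{logical_path}"]
    # Every occurrence, new or duplicate, is recorded in the state block only.
    state = inventory["versions"][version]["state"]
    state.setdefault(digest, []).append(logical_path)
    return is_new_content
```

A client would write the bytes to the returned manifest location only when the function returns `True`; duplicates cost a state block entry and nothing more.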
Filesystem metadata (e.g. permissions, access, and creation times) are not considered portable between filesystems or preservable through file transfer operations. Nor can these attributes be validated in terms of fixity in a consistent manner. As such, the OCFL neither explicitly supports nor expects that these attributes remain consistent. If retaining this metadata is important then files should either be encapsulated in a filesystem image format that preserves this information, or the metadata extracted and stored explicitly in an additional file.
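Where extracting filesystem metadata into an additional file is the chosen route, a client might do something along the following lines. This is only a sketch: the function name `capture_fs_metadata` and the choice of recorded attributes (mode, size, modification time) are assumptions, and which attributes matter will depend on the source filesystem.

```python
import json
import stat
from pathlib import Path


def capture_fs_metadata(root: Path) -> dict:
    """Collect selected filesystem attributes for every file under root."""
    records = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            st = path.stat()
            records[path.relative_to(root).as_posix()] = {
                "mode": stat.filemode(st.st_mode),  # e.g. "-rw-r--r--"
                "size": st.st_size,
                "mtime": st.st_mtime,  # seconds since the epoch
            }
    return records


def write_fs_metadata(root: Path, output_file: Path) -> None:
    """Serialize the captured attributes to a JSON file stored with the object."""
    output_file.write_text(json.dumps(capture_fs_metadata(root), indent=2))
```

The resulting JSON file is then an ordinary file in the object and is covered by the manifest like any other content.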
The OCFL preserves files and their content, with directories serving as a useful organizational convention. An empty directory consists only of filesystem metadata and therefore, as noted above, is not amenable to direct preservation in OCFL objects. If the preservation of empty directories is considered essential then the suggested route is to insert a zero length file named .keep into the directory which will ensure directories are preserved as part of the file's path. .keep files are not considered special by the OCFL in any way and are treated exactly the same way as other files. As such, a non-zero length .keep file is not considered invalid.
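A pre-ingest step that applies the .keep convention might look like the following minimal sketch. The function name `add_keep_files` is hypothetical, and the sketch assumes it runs on a working copy of the content before the files are added to an object version.

```python
from pathlib import Path


def add_keep_files(root: Path) -> list:
    """Create a zero length .keep file in every empty directory below root.

    Returns the list of .keep files that were created.
    """
    created = []
    # Materialize the directory list first so that creating files
    # does not interfere with the traversal.
    directories = sorted(p for p in root.rglob("*") if p.is_dir())
    for directory in directories:
        if not any(directory.iterdir()):
            keep = directory / ".keep"
            keep.touch()
            created.append(keep)
    return created
```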
Objects that contain a large number of files can pose performance problems if they are stored in a filesystem as-is. Fixity checks, object validation and version creation can require an OCFL client to process all the files in an object, which can be time consuming. Additionally, most storage systems have a minimum block size for allocation to files, so a large number of small files can end up occupying a volume of storage significantly larger than the sum of the individual file sizes. In this case, assuming that the majority of the files are relatively static data that is unlikely to change between object versions, it is sensible to package the static files together in a single, larger file (zip is recommended). This can be parsed to extract individual files if necessary but can significantly improve the efficiency of basic OCFL client and storage operations.
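Packaging static files into a single zip file can be sketched with Python's standard `zipfile` module. The function name `package_static_files` is hypothetical, and storing entries uncompressed (`ZIP_STORED`) is an assumption made here for simplicity, not a recommendation from the OCFL.

```python
import zipfile
from pathlib import Path


def package_static_files(source_dir: Path, archive_path: Path) -> int:
    """Pack every file under source_dir into one zip archive.

    archive_path should lie outside source_dir so that the archive does not
    try to include itself. Returns the number of files packaged.
    """
    count = 0
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for path in sorted(source_dir.rglob("*")):
            if path.is_file():
                # Keep paths relative so the archive mirrors the original layout.
                zf.write(path, arcname=path.relative_to(source_dir).as_posix())
                count += 1
    return count
```

Individual files can later be read back with `zipfile.ZipFile(archive_path).read(name)` without unpacking the whole archive.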
Strictly speaking, the OCFL only requires that an OCFL Storage Root contains OCFL Objects in directories, distributed in some manner in the underlying filesystem. In turn, an OCFL object is identified purely by the presence of a [[NAMASTE]] conformance file in the object root. The presence and correctness of inventory files and version directories are a validation rather than an identification concern.
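Because an object is identified purely by its conformance file, locating objects under a storage root reduces to a directory walk. The following is a minimal sketch; the function names are hypothetical, and it assumes the conformance file name begins with `0=ocfl_object_`, as in the examples in this document.

```python
from pathlib import Path

OBJECT_DECLARATION_PREFIX = "0=ocfl_object_"


def is_object_root(directory: Path) -> bool:
    """True if the directory contains an OCFL object conformance file."""
    return any(
        entry.is_file() and entry.name.startswith(OBJECT_DECLARATION_PREFIX)
        for entry in directory.iterdir()
    )


def find_object_roots(storage_root: Path) -> list:
    """Return all object roots beneath storage_root, whatever the layout."""
    found = []
    pending = [storage_root]
    while pending:
        directory = pending.pop()
        for child in sorted(directory.iterdir()):
            if not child.is_dir():
                continue
            if is_object_root(child):
                # Identification only: do not descend into the object itself.
                found.append(child)
            else:
                pending.append(child)
    return sorted(found)
```

Note that this identifies objects only; checking inventories and version directories remains a separate validation step.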
These definitions allow a lot of freedom as to how objects are arranged beneath an OCFL Storage Root and, while there is no strict requirement for all OCFL Objects to be arranged according to the same system, it is nevertheless considered good practice to do so. In addition, in the interests of rebuildability, it would be prudent to include an indication of the details of this arrangement alongside the OCFL specification as described in the Rebuildability section.
In the interests of transparency, it makes sense for an object's URI, its unique identifier and its location under the OCFL Storage Root to be aligned and simply derivable from each other. Good examples include:
[storage_root]
├── 0=ocfl_1.0
├── ocfl_1.0.html (optional copy of the OCFL specification)
├── d45be626e024
|   ├── 0=ocfl_object_1.0
|   ├── inventory.json
|   ├── inventory.json.sha512
|   └── v1...
├── d45be626e036
|   ├── 0=ocfl_object_1.0
|   ├── inventory.json
|   ├── inventory.json.sha512
|   └── v1...
├── 3104edf0363a
|   ├── 0=ocfl_object_1.0
|   ├── inventory.json
|   ├── inventory.json.sha512
|   └── v1...
└── ...

[storage_root]
├── 0=ocfl_1.0
├── ocfl_1.0.html (optional copy of the OCFL specification)
├── d4
|   └── 5b
|       └── e6
|           └── 26
|               └── e0
|                   ├── 24
|                   |   └──d45be626e024
|                   |       ├── 0=ocfl_object_1.0
|                   |       └── ...
|                   └── 36
|                       └──d45be626e036
|                           ├── 0=ocfl_object_1.0
|                           └── ...
├── 31
|   └── 04
|       └── ed
|           └── f0
|               └── 36
|                   └── 3a
|                       └── 3104edf0363a
|                           ├── 0=ocfl_object_1.0
|                           └── ...
└── ...

[storage_root]
├── 0=ocfl_1.0
├── ocfl_1.0.html (optional copy of the OCFL specification)
├── d45
|   └── be6
|       └── 26e
|           ├──d45be626e024
|           |   ├── 0=ocfl_object_1.0
|           |   └── ...
|           └──d45be626e036
|               ├── 0=ocfl_object_1.0
|               └── ...
├── 310
|   └── 4ed
|       └── f03
|           └── 3104edf0363a
|               ├── 0=ocfl_object_1.0
|               └── ...
└── ...
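The three arrangements shown above can each be derived mechanically from an object identifier. The following sketch uses hypothetical function names and assumes identifiers are already filesystem-safe strings such as the hexadecimal examples above; identifiers containing other characters would need an encoding step that is not shown here.

```python
def flat_path(identifier: str) -> str:
    """Object directory sits directly under the storage root."""
    return identifier


def split_tree_path(identifier: str, width: int = 2) -> str:
    """Split the whole identifier into fixed-width segments, then append the
    full identifier as the final directory (second example above)."""
    segments = [identifier[i:i + width] for i in range(0, len(identifier), width)]
    return "/".join(segments + [identifier])


def truncated_ntuple_path(identifier: str, width: int = 3, levels: int = 3) -> str:
    """Use only the first `levels` segments of `width` characters, then append
    the full identifier (third example above)."""
    segments = [identifier[i * width:(i + 1) * width] for i in range(levels)]
    return "/".join(segments + [identifier])
```

For the identifier `d45be626e024`, `split_tree_path` gives `d4/5b/e6/26/e0/24/d45be626e024` and `truncated_ntuple_path` gives `d45/be6/26e/d45be626e024`, matching the trees above.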
In order to be portable across as many filesystems as possible, the OCFL makes use of a subset of filesystem features that are very broadly supported. It is therefore strongly advised to not use additional features in OCFL Storage Roots since OCFL clients and other filesystem tools that need to operate between different filesystems may exhibit unpredictable behaviour when feature sets do not match. In particular, using features such as hard and soft (symbolic) links for deduplication can work at odds with the OCFL's own mechanisms and should be avoided.
Consideration should also be given to calculations of storage usage when migrating between filesystems. Many back-end filesystem features, which are essentially invisible to user-space code, can have a significant impact on the actual consumption of storage space compared with a simple sum of file sizes. Compression, extents and block sub-allocation are examples of such features which, while providing benefits in terms of storage efficiency, do require care when considering issues of capacity planning or migration.
The OCFL and its inventory structure are designed to support and capture the file operations that create OCFL versions (inheritance, addition, updating, renaming, deletion, and reinstatement), regardless of whether optional features, such as deduplication, are used. The OCFL is not concerned with the process of creating versions but only with the final outcome, in terms of the differences from the previous version that need to be recorded and preserved.
Version numbering should start with 1 and use positive sequential integers. Version names start with a lower case v. The numbers may be zero padded to the left to give a fixed length but, if padding is used, zero padded numbers must always retain at least one leftmost zero. All versions in an object must use the same version numbering layout, which can easily be determined by looking at one existing version: if the digit immediately after the v is a zero then the number format is zero padded to fixed length, otherwise it is simply an integer.
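The numbering rule above is enough to derive the next version name from the most recent one. The following is a minimal sketch with a hypothetical function name; it treats running out of padded numbers (no leftmost zero would remain) as an error, which is one possible client policy.

```python
import re


def next_version_name(latest: str) -> str:
    """Return the name of the version that follows `latest` (e.g. "v3")."""
    match = re.fullmatch(r"v(\d+)", latest)
    if match is None:
        raise ValueError(f"not a version name: {latest!r}")
    digits = match.group(1)
    number = int(digits) + 1
    if not digits.startswith("0"):
        # Unpadded layout: a plain integer of any length.
        return f"v{number}"
    # Zero padded layout: keep the same width and at least one leading zero.
    padded = str(number).zfill(len(digits))
    if len(padded) > len(digits) or not padded.startswith("0"):
        raise ValueError(f"padded numbering exhausted after {latest}")
    return f"v{padded}"
```

For example, `next_version_name("v0009")` gives `"v0010"`, whereas `next_version_name("v09")` raises an error because `v10` would no longer retain a leftmost zero.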
Previous versions of an object should be considered immutable since the composition of later versions of an object may be dependent on them. In addition, the assumption of immutability ensures that copies of different versions of an object remain consistent with each other, avoiding issues with identifying canonicity and reconciliation.
One key consequence of this immutability is that manifest entries should never be deleted. New entries may be created and, if not deduplicating file content, additional references to copies of stored content may be added.
Sometimes a file needs to be deleted from all versions of an object, perhaps for legal reasons. Doing this to an OCFL Object breaks the previous version immutability assumption and is not supported directly. The correct way to do this is to create a new object that excludes the offending file, with a revised version history taking this into account. The original object can then be deleted in its entirety. Creating the new object first is good practice as it avoids any risk of data loss that may occur if an object were to be deleted before the new object is created.
The new object need not have the same identifier as the original object. In this case, the deleted object can be replaced by a "stub" object with the original identifier and location in the OCFL Storage Root. This object is a standard OCFL object that just contains brief information that redirects users and software to the new version - possibly with an indication of why the new object was created, if appropriate. The OCFL does not define this stub object in any way - its structure, interpretation and handling are entirely client dependent, but ideally some elements should be human readable for rebuildability.
There may be a need to record some actions on objects that do not result in changes to the object content, for example copying the object to new storage or validating fixity and finding nothing amiss. The log directory is the location in an OCFL object where such events can be recorded. The OCFL does not make any assumptions about the contents of this directory but, if it exists, its contents will not be subject to any validation processes.
Forward delta differencing is a key, though optional, feature of the OCFL that means that parts of an OCFL object version that are unchanged from a previous version are not stored again. This has the potential to significantly improve storage efficiency when objects have multiple versions, whether through ongoing curatorial action or the accessioning of updated material.
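At the level of the inventory, forward delta differencing means that a new version starts from a copy of the previous version's state block and only the differences are then applied. The following minimal sketch shows inheritance, deletion and renaming on a state block held as a Python dictionary (digest mapped to a list of logical paths). All function names are hypothetical and the digests in the usage note are shortened placeholders.

```python
import copy


def inherit_state(previous_state: dict) -> dict:
    """Start a new version with an independent copy of the previous state."""
    return copy.deepcopy(previous_state)


def remove_path(state: dict, logical_path: str) -> str:
    """Remove a logical path from a state block and return its digest.

    The manifest is untouched: the stored content remains part of the
    earlier, immutable versions.
    """
    for digest, paths in list(state.items()):
        if logical_path in paths:
            paths.remove(logical_path)
            if not paths:
                del state[digest]
            return digest
    raise KeyError(logical_path)


def rename_path(state: dict, old_path: str, new_path: str) -> None:
    """Associate an existing digest with a new logical path."""
    digest = remove_path(state, old_path)
    state.setdefault(digest, []).append(new_path)
```

Starting from `{"aaa": ["a.txt"], "bbb": ["b.txt"]}`, renaming `a.txt` to `c.txt` and removing `b.txt` yields `{"aaa": ["c.txt"]}` for the new version, while the previous version's state block is left unchanged and no new content is stored.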
When a new version of an OCFL Object is created from an earlier version and a client wishes to implement forward delta differencing, then the possible file operations are handled in the following manner (with reference to the state and manifest blocks of the OCFL object's inventory file):
Inheritance: files carried over unchanged from the previous version are recorded in the state block of the new version. These entries will be identical to the corresponding entries in the previous version's state block. No changes to the manifest block are required. When a new OCFL version of an OCFL Object is created, the starting point against which changes are made should be a copy of the entire state block of the previous version, thus inheriting all the files and content from the previous version.
Addition: newly added files appear as new entries in the state block of the new version. The file should be stored and an entry for the new content must be made in the manifest block of the object's inventory. The new digest from the manifest block can then be used to create the new state block entry. If the file content, as determined by its digest, corresponds to an existing manifest entry then, technically, this is a reinstatement operation rather than addition and should be flagged to prevent the operation being recorded incorrectly in preservation logs.
Updating: updated files appear as changed entries in the state block of the new version - with new digests associated with existing file paths. The updated file should be stored and a new entry for the updated content must be made in the manifest block of the object's inventory. The new digest can then be used to replace the digest for the old content in the relevant state block entry. If the file content, as determined by its digest, corresponds to an existing manifest entry then, technically, this is a reinstatement operation rather than updating and should be flagged to prevent the operation being recorded incorrectly in preservation logs.
Renaming: renamed files appear as changed entries in the state block of the new version - with existing digests associated with new file paths. No changes to the manifest block are required.
Deletion: the entries for deleted files are removed from the state block of the new version.
Reinstatement: content from an earlier version is referenced again through its existing digest, so no changes to the manifest block are required. Reinstated entries in the state block should replace any entries with the same path inherited from the previous version. If the file paths are unchanged then these entries will be identical to the corresponding entries in the earlier version's