Infrastructure

The DataNet Federation Consortium promotes the application of policy-based data management systems to handle massive data collections. The approach is based on the following concepts:

Purpose – reason a collection is assembled. The reason drives the choice of the types of digital records that will be included, the types of operations that will be performed upon the records, the context provided with each record, and the policies that will be enforced. Reasons for assembling a collection are typically tied to a specific stage of the scientific data life cycle. If the collection is being assembled for local use, then assumptions can be made about local knowledge and less context is required. If the collection is being assembled for use by a broader community, then much more complete contextual knowledge is needed to interpret the data correctly.

Properties – assertions that will be made about the collection. The group assembling a collection typically makes assertions about completeness (the collection contains all of the experimental data), authoritativeness (the collection contains data that come from a known source), or consistency (the collection contains data that have been calibrated, registered onto a coordinate system, and converted to physical units). Desired properties might also include the context required for each digital record (appropriate descriptive metadata), data integrity (management of checksums and replicas), and chain of custody (audit trails of all operations performed upon the data).

Policies – controls for enforcing desired properties. Policies define when and where procedures are executed to enforce desired properties. It is not sufficient to apply policies only on ingestion and access; policies are also needed to govern all administrative tasks. This implies the need for policy enforcement points within the data management infrastructure that control all data management actions. The iRODS data grid manages 74 policy enforcement points, corresponding to actions such as deleting a user, ingesting a file, depositing metadata, deleting a file, adding a storage resource, and selecting a storage resource when ingesting files. At each enforcement point, a policy is typically applied in one of three ways: to control the action itself, to do pre-processing before the action occurs, or to do post-processing after the action completes. The set of policy enforcement points is expected to grow slowly over time as new application communities decide that additional actions need to be controlled by a policy.
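The control, pre-processing, and post-processing roles of a policy can be illustrated with a minimal sketch. This is not the iRODS rule engine or its API: the class PolicyEnforcementPoint, the delete_file action, and the in-memory storage and audit_trail objects are all hypothetical names chosen for illustration.

```python
from typing import Callable, List


class PolicyEnforcementPoint:
    """Hypothetical enforcement point wrapping one data management action."""

    def __init__(self, name: str, action: Callable[[dict], object]) -> None:
        self.name = name
        self.action = action
        self.control: List[Callable[[dict], bool]] = []  # may veto the action
        self.pre: List[Callable[[dict], None]] = []      # run before the action
        self.post: List[Callable[[dict], None]] = []     # run after the action

    def run(self, ctx: dict) -> object:
        # Control policies decide whether the action may proceed at all.
        for allowed in self.control:
            if not allowed(ctx):
                raise PermissionError(f"policy denied action '{self.name}'")
        # Pre-processing policies run before the action.
        for hook in self.pre:
            hook(ctx)
        result = self.action(ctx)
        # Post-processing policies run only after the action completes.
        for hook in self.post:
            hook(ctx)
        return result


# Illustrative in-memory stand-ins for a storage resource and an audit log.
audit_trail: List[str] = []
storage = {"/zone/home/alice/run1.dat": b"raw data"}


def delete_file(ctx: dict) -> bool:
    """Remove the file named in ctx; report whether it existed."""
    return storage.pop(ctx["path"], None) is not None


pep = PolicyEnforcementPoint("delete file", delete_file)
# Control: only the owner may delete (an assumed example policy).
pep.control.append(lambda ctx: ctx["user"] == ctx["owner"])
# Pre- and post-processing: record a chain-of-custody entry around the action.
pep.pre.append(lambda ctx: audit_trail.append(f"about to delete {ctx['path']}"))
pep.post.append(lambda ctx: audit_trail.append(f"deleted {ctx['path']}"))
```

Calling pep.run with a context in which the user is the owner deletes the file and leaves two audit entries; a context with a different user raises PermissionError before any hook or action runs.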

Procedures – functions that implement the policies. The iRODS data grid chains basic functions (micro-services) together to implement workflows. Each workflow is applied at the remote storage location under the control of a policy, and the policy is executed as a rule within a distributed rule engine. Successful chaining of basic functions therefore requires the exchange of structured information. The iRODS data grid manages the exchange of information through standard structures in memory, through parameter passing, through metadata in a catalog, through files, and through network communication. For operations that are executed across multiple storage locations, the in-memory structures are serialized and sent over the network to the remote location, where they are unpacked and reconstituted as in-memory structures. This approach ensures that workflows function the same whether they are executed entirely within a single storage platform or across multiple storage platforms. The iRODS data grid currently provides 250 micro-services, and about 67 standard in-memory data structures are sufficient to handle data interchange between them.
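The idea of chaining functions that exchange a structured record, and of packing that record when the workflow crosses to another storage location, can be sketched as follows. Everything here is an assumption made for illustration: the micro-services compute_checksum and describe are hypothetical, and JSON stands in for whatever packing format the data grid actually uses.

```python
import hashlib
import json
from typing import Callable, List, Optional

# A micro-service here is any function that takes and returns a structured record.
MicroService = Callable[[dict], dict]


def compute_checksum(state: dict) -> dict:
    """Hypothetical micro-service: add a SHA-256 checksum of the data."""
    digest = hashlib.sha256(state["data"].encode("utf-8")).hexdigest()
    return {**state, "checksum": digest}


def describe(state: dict) -> dict:
    """Hypothetical micro-service: add simple descriptive metadata."""
    size = len(state["data"].encode("utf-8"))
    return {**state, "metadata": {"size_bytes": size}}


def pack(state: dict) -> bytes:
    """Serialize the in-memory structure for the network (JSON as a stand-in)."""
    return json.dumps(state, sort_keys=True).encode("utf-8")


def unpack(payload: bytes) -> dict:
    """Reconstitute the in-memory structure at the remote location."""
    return json.loads(payload.decode("utf-8"))


def run_chain(
    services: List[MicroService],
    state: dict,
    hop_after: Optional[int] = None,
) -> dict:
    """Run services in order; optionally simulate a network hop after one step."""
    for index, service in enumerate(services):
        state = service(state)
        if index == hop_after:
            # Simulated hand-off to another storage location.
            state = unpack(pack(state))
    return state


local = run_chain([compute_checksum, describe], {"data": "abc"})
remote = run_chain([compute_checksum, describe], {"data": "abc"}, hop_after=0)
```

In this sketch local and remote compare equal, which mirrors the property described above: the workflow produces the same result whether or not a serialization step occurs between micro-services.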

Persistent state information – results of applying the procedures. The iRODS data grid manages 205 state information attributes about users, records, collections, storage systems, and rules. The attributes also include information about quotas, the load on the system, outstanding or deferred operations, audit trails, rule versions, and rule execution.

Assessment criteria – validation that the state information conforms to the desired properties. The assessment criteria can be evaluated through periodic rules that verify a well-defined property. This evaluation must directly address the issue of scale, as data grids may contain petabytes of data and hundreds of millions of files. Support for bulk operations is provided within the storage system, the metadata catalog, the rule engine, and the messaging system that is used to track progress.
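A periodic assessment of one well-defined property, data integrity, might look like the sketch below. It assumes a catalog that maps record paths to expected SHA-256 checksums and a read_bytes callable that fetches record contents; both are hypothetical, and the batching only gestures at the bulk operations a real data grid would need at scale.

```python
import hashlib
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def batches(
    items: Iterable[Tuple[str, str]], size: int
) -> Iterator[List[Tuple[str, str]]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def assess_integrity(
    catalog: Dict[str, str],
    read_bytes: Callable[[str], bytes],
    batch_size: int = 1000,
) -> List[str]:
    """Return the paths whose current checksum differs from the catalog entry."""
    failures: List[str] = []
    for chunk in batches(catalog.items(), batch_size):
        # Each batch could be dispatched as one bulk operation in a real system.
        for path, expected in chunk:
            actual = hashlib.sha256(read_bytes(path)).hexdigest()
            if actual != expected:
                failures.append(path)
    return failures


# Illustrative in-memory store and the catalog recorded at ingestion time.
store = {"rec/a": b"first record", "rec/b": b"second record"}
catalog = {path: hashlib.sha256(data).hexdigest() for path, data in store.items()}
```

A rule engine would run assess_integrity(catalog, store.__getitem__) on a schedule; an empty result means the state information still conforms to the integrity property, and any returned path identifies a record that needs repair from a replica.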

Federation – controlled sharing of logical name spaces. Given the wide range of failure modes, long-term data management requires replication across independently managed data grids. If a data management environment fails for any reason, an independently managed environment is needed to minimize the risk of data loss. The sources of risk include media failure, software and hardware failures, operational errors, natural disasters, and security violations. Federation is also used to repurpose a collection by integrating digital holdings across multiple sources under new policies and procedures.