The DataNet Federation Consortium (DFC) provides a new approach for implementing data management infrastructure that transcends technology, social networks, space, and time through federation-based sustainability models. DFC will support and manage data collections and services that span individual, institutional, regional, national and international repositories, and will demonstrate collaborative research, education and outreach on shared collections. The diversity of applications and services needed by the participating disciplines, together with the heterogeneity of their data resources, makes it challenging to provide a seamless integrated system that covers infrastructure, policy management, administrative organization and long-term sustainability models.
The DFC federation technology is based on the Integrated Rule-Oriented Data System (iRODS, irods.org), a second-generation data grid. iRODS provides a stable virtual data platform, based on policy-oriented data management, for federating data collections across distributed heterogeneous resources. It offers a rich interface that supports a range of client applications, from discipline-centric analysis and visualization tools to emerging social networking web applications. It scales to hundreds of millions of data objects in petabyte-scale storage systems and supports high-speed data transport.
The technology development has been driven by the following three over-arching science-oriented goals of the DFC.
1. Promote Information-based Science and Engineering (help share, discover & access)
2. Enable automation of analyses and workflows (capture provenance & aid reanalysis)
3. Provision long-term access to data and metadata (factor obsolescence & enable repurposing)
Goals from our science and engineering (S&E) partners:
1. Our engineering partner wants DFC to make it easier for engineers to share, discover and access information: to move engineering applications from stand-alone systems and silos to the cloud, and to make the data available for educational purposes. The challenge is handling the wide diversity of types, formats and models used by the engineering community.
2. Our hydrology partners want DFC to make it easier for them to execute (and re-execute) complex workflows; help manage data by gathering and staging files to appropriate computing platforms for use in a workflow; capture provenance information about the entire workflow process; and finally help publish the results so as to enable other scientists to perform reanalysis and apply (tweak) workflows with changed components or different data sets.
3. Our marine partners want DFC to help maintain continuous, long-term access to data generated by marine sensor platforms. The challenge is handling sensor packet streams and HD video, which will form the bulk of the marine data.
4. Our biology partners want DFC to enable the mechanisms needed to scale data management systems to tens of thousands of users, hundreds of millions of files, and petabytes of data.
5. Our cognitive science partners want DFC to simplify the clients needed to access the collaboration environment. There is a strong need for clients that provide ways to easily share data.
6. Our social science partners want DFC to implement the preservation policies that assure effective management of archived data.
Although the six main requirements come from six diverse S&E domains, the solutions will be applicable to all of them, and to other S&E domains as well.
The DFC platforms are built from nine integrated components, among them the five listed below. An individual component may support more than one of the three over-arching goals:
a) Data Life-cycle Management services (needed for Goals 1, 2 and 3)
b) Data Analysis Services for the marine, climate and hydrology disciplines (needed for Goal 2)
c) Storage and Compute access in the distributed partnership (needed for Goals 1 and 3)
d) Science Collaboratories (marine, hydrology, plant biology, cognitive science) integration (needed for Goals 2 and 3)
e) Engineering Model management (needed for Goal 1)
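The component-to-goal mapping above can be written down as a small lookup table, which makes it easy to ask which components a given goal depends on. The following is only an illustrative sketch: the dictionary name, the shortened component labels and the helper function are assumptions introduced here, not part of any DFC or iRODS software.

```python
# Component-to-goal mapping transcribed from the lettered list above.
# Labels are shortened; all names here are illustrative, not DFC identifiers.
COMPONENT_GOALS = {
    "Data Life-cycle Management services": {1, 2, 3},
    "Data Analysis Services": {2},
    "Storage and Compute access": {1, 3},
    "Science Collaboratories integration": {2, 3},
    "Engineering Model management": {1},
}


def components_for_goal(goal):
    """Return the components that support the given goal, in list order."""
    return [name for name, goals in COMPONENT_GOALS.items() if goal in goals]


if __name__ == "__main__":
    for goal in (1, 2, 3):
        print(f"Goal {goal}: {', '.join(components_for_goal(goal))}")
```

Inverting the table this way shows, for example, that Data Life-cycle Management services is the only listed component needed by all three goals.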
The iRODS data system that forms the backbone of the DFC Platform scales in both size and degree of interconnection. A DFC Platform can be implemented on a single laptop by an individual researcher, peered with other servers to accommodate a growing community, and federated to integrate multiple communities. Each DFC Platform will implement a software stack customized to community needs using the iRODS micro-service architecture.
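As a rough illustration of how a community-specific stack might be assembled from small, composable services, the sketch below chains plain Python functions. It is not the iRODS rule language and does not use any iRODS API; the record layout and the names `add_checksum`, `tag_domain` and `build_stack` are hypothetical and chosen only for this example.

```python
import hashlib

# In this sketch a "micro-service" is any function that takes a record
# (a dict describing one data object) and returns an updated copy.
# This is an assumption for illustration, not the iRODS definition.


def add_checksum(record):
    """Attach a SHA-256 checksum of the record's content."""
    digest = hashlib.sha256(record["content"]).hexdigest()
    return {**record, "checksum": digest}


def tag_domain(domain):
    """Build a micro-service that labels records with a science domain."""
    def _tag(record):
        return {**record, "domain": domain}
    return _tag


def build_stack(services):
    """Compose micro-services into one callable, applied in order."""
    def run(record):
        for service in services:
            record = service(record)
        return record
    return run


# A hypothetical stack customized for a hydrology community.
hydrology_stack = build_stack([add_checksum, tag_domain("hydrology")])
result = hydrology_stack({"name": "gauge.csv", "content": b"stage,flow\n"})
```

A different community would reuse the same generic services and swap in its own domain-specific ones, which is the kind of customization the paragraph above describes.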
Software and applications integrated into the DFC platform will include both generic and domain-specific functions. Initially, generic capabilities will be integrated to support oceanography, hydrology, and engineering. These initial capabilities include support for scientific data formats, third-party authentication and authorization, common scientific languages and databases, and access to cloud computing. Capabilities required by other domains will be integrated during the last three years of the project; this second class of applications will be specific to the plant biology, cognitive science, and social science domains.