Report on Open Archives work in progress at Virginia Tech

Hussein Suleman (hussein@vt.edu)
Edward A. Fox (fox@cs.vt.edu)
Dave Watkins (dwatkins@cs.vt.edu)
Robert France (france@vt.edu)
Marcos Andre Goncalves (mgoncalv@vt.edu)



Introduction to the Open Archives Initiative

The Open Archives Initiative (OAI) promotes and encourages author self-archiving solutions (commonly called e-print systems) by developing the technical mechanisms and organizational structures needed to make e-print archives interoperable. Such interoperability can stimulate the transition of e-print systems into genuine building blocks of a transformed scholarly communication model.

The inaugural meeting of the OAI in October 1999 resulted in the agreement now known as the Santa Fe Convention. This is a set of relatively simple but potentially quite powerful interoperability agreements that facilitate the creation of mediator services, including both free and commercial incarnations. These services combine and process information from individual archives and offer increased functionality to support discovery, presentation and analysis of data originating from compliant archives.

The Santa Fe Convention is a combination of organizational principles and technical specifications to facilitate a minimal but potentially highly functional level of interoperability among scholarly e-print archives. The convention gives data providers -- individual archives -- relatively easy-to-implement mechanisms for making information in their archives externally available. This external availability then makes it possible for service providers to build higher levels of functionality, mediator services, using the information made available from scholarly archives that adopt the convention.

Virginia Tech's Involvement in the OAI

Virginia Tech has been involved with this process from the early stages and continues to contribute to the protocols and standards documents that comprise the Santa Fe Convention. Professor Edward Fox was one of the original participants, representing the NDLTD (Networked Digital Library of Theses and Dissertations) project. At that first meeting Virginia Tech made a commitment to integrate NDLTD into the OAI project. Since then, the CSTC (Computer Science Teaching Center) and the W3C Web Characterization Repository have been added as further contributing digital libraries from Virginia Tech. Work has also been done on testing the compliance of archives and on defining metadata transport formats.

MARC XML-DTD

Robert France (france@vt.edu)

US-MARC was included as one of the initial metadata formats because of its widespread usage in library and library-related systems, like the MARIAN system being developed at Virginia Tech. An XML transport format is being defined for MARC so that MARC records can be exchanged using the OAI protocols. This work has three parts:

The MarcRecord class serves as a base class for the MarcDocument class used in MARIAN and related systems for internal representation and end-user presentation of MARC records. Once the MarcRecord and re-engineered MarcDocument classes are tested and published, we plan to enhance MarcDocument to directly output "short descriptions" in the OAI metadata standard. This enhancement will also be Web-published and freely available.
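To make the idea of an XML transport format for MARC concrete, the following is a minimal Python sketch that serializes a MARC-like record to XML. It is an illustration only, not the MARC XML-DTD or the MarcRecord class described above: the class names (SketchMarcRecord, DataField), the function marc_to_xml and the element names (marc, leader, controlfield, datafield, subfield) are all assumptions chosen for this example. It relies only on the general shape of a MARC record: a leader, control fields that carry a bare value, and data fields that carry two indicators and a list of coded subfields.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class DataField:
    """One MARC data field: a tag, two indicators and coded subfields."""
    tag: str
    ind1: str
    ind2: str
    subfields: list  # list of (code, value) pairs, order preserved


@dataclass
class SketchMarcRecord:
    """Hypothetical in-memory record; not the authors' MarcRecord class."""
    leader: str
    control_fields: dict = field(default_factory=dict)  # tag -> value
    data_fields: list = field(default_factory=list)     # list of DataField


def marc_to_xml(record):
    """Serialize a record to an XML string using assumed element names."""
    root = ET.Element("marc")
    ET.SubElement(root, "leader").text = record.leader
    # Control fields have no indicators or subfields, only a value.
    for tag, value in sorted(record.control_fields.items()):
        ET.SubElement(root, "controlfield", {"tag": tag}).text = value
    # Data fields keep their indicators as attributes and their
    # subfields as child elements, in the original order.
    for df in record.data_fields:
        node = ET.SubElement(
            root, "datafield", {"tag": df.tag, "ind1": df.ind1, "ind2": df.ind2}
        )
        for code, value in df.subfields:
            ET.SubElement(node, "subfield", {"code": code}).text = value
    return ET.tostring(root, encoding="unicode")
```

A record built with a title field (tag 245, subfield a) would be passed to marc_to_xml and the resulting string sent as the metadata payload. Keeping indicators and subfield codes as attributes means the receiving side can rebuild the record without loss.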

CSTC

Dave Watkins (dwatkins@cs.vt.edu)
Description of Project

The Computer Science Teaching Center is a digital library of peer-reviewed teaching resources for computer science educators. Teachers can go to the CSTC and submit resources that they use in their classrooms. Once the material has been submitted, an editor assigns reviewers to review it. Once the reviewers have completed their reviews, a decision is made whether or not to include the material as an official resource.

Other teachers can then go to the CSTC to find resources that they could use in their classrooms. They can browse and search the official material in a number of ways, including by author, by date and by subject category.

A new ACM journal is also being spun off from the CSTC. The ACM Journal of Educational Resources In Computing (JERIC) will be tightly coupled with the CSTC. All articles in the journal will be taken from resources submitted to the CSTC. The actual logistics of this are still to be determined.

The system is implemented on a UNIX system. All of the CGI scripts are written in Perl. The CSTC also uses an mSQL database to store all of the metadata.

Current State of OAI Compliance

The OAI implementation has already been completed. The CSTC only needs to be modified as the standards change.

W3C Web Characterization Repository

Hussein Suleman (hussein@vt.edu)
Description of Project

The W3C Web Characterization Repository is an online database of metadata for resources in the field of Web characterization. This includes links to publications, tools and data files. The focus of the repository has been on providing validated trace files, while at the same time providing the facilities to include any type of resource that practitioners wish to disseminate.

This project has been steered by the Network Research Group at Virginia Tech, as a part of the World-Wide-Web Consortium's Web Characterization Activity working group. The database is manipulated by a set of Perl CGI scripts which interface with a MySQL database.

Current State of OAI Compliance

The OAI implementation has been completed on a test server. Once the production server has a permanent location, it will also be fully compliant.

NDLTD

Marcos Andre Goncalves (mgoncalv@vt.edu)
Description of Project

The Networked Digital Library of Theses and Dissertations is an international project that supports the creation, archiving and exchange of digital versions of theses and dissertations. Virginia Tech has been at the forefront of this project for many years, encouraging other universities to adopt and support electronic submission policies. There are currently 78 member universities worldwide, 5 of which already mandate electronic submission.

Now, with the advent of the OAI, work has begun on the next phase of the project: linking together the various university archives to create the union of all the internationally scattered NDLTD collections. To this end, a collaborative effort between Virginia Tech and universities in Germany was launched and funded by the NSF to investigate federated search, multilingual access and interoperability issues, using the MARIAN system developed at Virginia Tech as a testbed.

Current State of OAI Compliance

Different partners of NDLTD use different systems for archiving and searching. For example, the German collections allow interoperability by means of Harvest, as opposed to Dienst (which is used by the OAI). Other sources use Z39.50-based repositories. To address these disparities, we have developed several wrappers and data format converters that handle all of these formats. This part of the project has been completed. MARIAN is now being extended to work as a mediator system over those wrappers to allow complete integration of information. The "spider" technology of the Harvest system is also being leveraged to facilitate harvesting of NDLTD collections from remote international sources. It remains for the harvested data to be parsed and integrated into a central MARIAN database and to make MARIAN an OAI-conformant repository.
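The wrapper-and-mediator arrangement described above can be illustrated with a small Python sketch. It is not the MARIAN code, and the two source formats are simplified stand-ins rather than real Harvest or Z39.50 output; every class name here (SourceWrapper, KeyValueTextWrapper, TupleListWrapper, Mediator) is hypothetical. The point is only that each wrapper converts its own source format into one shared record shape, so that the mediator can search across all sources without knowing how each one stores its data.

```python
from abc import ABC, abstractmethod


class SourceWrapper(ABC):
    """Hypothetical wrapper: exposes one source's records in a shared shape."""

    @abstractmethod
    def records(self):
        """Yield dicts with 'id', 'title' and 'source' keys."""


class KeyValueTextWrapper(SourceWrapper):
    """Wraps a source that returns 'key: value' lines, with a blank
    line between records (a stand-in format for this example)."""

    def __init__(self, name, raw_text):
        self.name = name
        self.raw_text = raw_text

    def records(self):
        for block in self.raw_text.strip().split("\n\n"):
            fields = dict(line.split(": ", 1) for line in block.splitlines())
            yield {
                "id": fields["identifier"],
                "title": fields["title"],
                "source": self.name,
            }


class TupleListWrapper(SourceWrapper):
    """Wraps a source that returns (id, title) tuples (another stand-in)."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = rows

    def records(self):
        for record_id, title in self.rows:
            yield {"id": record_id, "title": title, "source": self.name}


class Mediator:
    """Searches every wrapped source through the shared record shape."""

    def __init__(self, wrappers):
        self.wrappers = list(wrappers)

    def search(self, term):
        term = term.lower()
        return [
            record
            for wrapper in self.wrappers
            for record in wrapper.records()
            if term in record["title"].lower()
        ]
```

Adding a new partner collection in this sketch means writing one more wrapper; the mediator itself does not change.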

Repository Explorer

Hussein Suleman (hussein@vt.edu)
Description of Project

Since the OAI is still in its infancy, there are not many implementations in existence. Virginia Tech has been at the forefront of implementing OAI compliance, with the W3C Repository and CSTC being amongst the earliest projects. When comparing implementation notes, we discovered that the specifications could easily be read in different ways, as the two implementations had subtle differences. As one solution to this problem, the Repository Explorer was built to serve as a compliance test.

This program allows a person to browse through any archive which is supposedly compliant with the OAI, using only the protocol defined in the specifications. All aspects of the protocol can be tested, and the results of queries are checked for strict compliance with the expected syntax. The results are then parsed to present the user with a browsable interface.

To allow for maximum portability, the software is written in the form of a C++ CGI script which interrogates the archive, parses results and passes them on to the user. Thus users of the software need only access the central website instead of installing a client.
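The kind of syntax check described above can be sketched in a few lines. The Repository Explorer itself is a C++ CGI script; the Python below is only an illustration of the checking step, working on a response that has already been fetched. The function name check_response and the element names used in the example (records, record, identifier, title) are assumptions for this sketch and are not taken from the OAI specifications.

```python
import xml.etree.ElementTree as ET


def check_response(xml_text, expected_root, required_children):
    """Return a list of problems found in a response; empty means it passed.

    Checks three things: the text is well-formed XML, the root element
    has the expected name, and every item under the root contains each
    of the required child elements.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Nothing else can be checked if the response does not parse.
        return [f"not well-formed XML: {err}"]

    problems = []
    if root.tag != expected_root:
        problems.append(f"root element is <{root.tag}>, expected <{expected_root}>")
    for index, item in enumerate(root):
        for name in required_children:
            if item.find(name) is None:
                problems.append(f"item {index} is missing <{name}>")
    return problems
```

Returning a list of problems rather than stopping at the first one lets a tester report every deviation in a single pass, which is useful when two implementations differ in several small ways.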

Current State of OAI Compliance

The project is complete apart from a few unresolved issues in the protocol; for these, the most popular interpretation has been adopted. As these issues are resolved, the Explorer is being updated to reflect the most current specifications.

Future Work at Virginia Tech

The general vision of the OAI is to build upon the foundations of archives whose metadata may be harvested in order to create higher level services. Virginia Tech will play an active role in defining such services on the data supersets that will be collected from open archives. Discussions have already been initiated to consider the problems of merging data streams, creating research workspaces, archiving of large collections of data, searching through large full-text collections, etc.

The MARC-DTD and its related implementation will be incorporated into the OAI standards as soon as they are tested and ratified.

The NDLTD OAI project will be expanded to include other partners, thus moving towards the goal of a global digital library for theses and dissertations.

CSTC and the W3C Repository will serve as sources of non-standard data for higher-level services. Since neither of them is restricted to publications, they will help to maintain some generality in the definition of protocols and services for the OAI.

As the protocol is updated, the Repository Explorer will be enhanced so that it continues to serve as a basic compliance test.

Websites

Open Archives Initiative - http://www.openarchives.org

MARC XML-DTD - http://dlrl.cc.vt.edu/projects/OpenArchives/oa_marc.html

Computer Science Teaching Center - http://www.cstc.org

Networked Digital Library of Theses and Dissertations - http://www.ndltd.org

W3C Web Characterization Repository - http://purl.org/net/repository

OAI Repository Explorer - http://purl.org/net/explorer