Set Orthogonality
Advocates: Hussein Suleman,
Mohammad Zubair
Status: Open
Last Updated: 19 October 2001
Description
There is no way to determine all the sets that an identifier belongs to.
This is typically referred to as set orthogonality because the protocol allows
a harvester to find out which identifiers belong to a particular set but
not vice versa.
This is not as much of a problem for a flat space of archives, but organizations
like NDLTD and NCSTRL have already started to create hierarchical catalogs
based on OAI and existing set information is lost at the very first level.
Also, the Internet2 Distributed Storage Initiative wants to work on replication
of OAs - this will mean harvesting every set and dealing with duplicates.
Can we do this in a way that is more efficient without adding to the complexity
?
Scenarios
- A union catalog is an archive that attempts to collect the metadata
from multiple archives and republish it in the form of a single collection
- this is extremely useful for providing centralized services to distributed
communities (e.g. NDLTD, NCSTRL). Current attempts to create union catalogs
have been plagued by the problem of set membership not being transmitted
by the standard ListRecords and ListIdentifiers requests. In practice this
information is either lost or the mirroring algorithms are much more complex
than typical harvesting algorithms. If set information was available along
with the individual metadata records, this would make replication much simpler
and more complete without the added effort.
- Sets could conceivably be used for browsing through the contents of
an archive. However, given a single identifier, it is not possible to find
out what sets the identifier belongs to, hence, even if the sets corresponded
to browse categories it would not be possible to get from the identifier
to the set in which it is contained.
Issues
- Set orthogonality requires either a new verb or an extension of the
response syntax to include this information. Either way it will require
more effort on the part of the data providers.
- Set membership information is not always easily available. In many
typical databases, generating all the sets for a particular identifier may
require scanning through the whole database while the inverse does not require
this. To alleviate this more resources may be required in order to create
additional indices. Will data providers be willing to make this optimization
? Can we require orthogonality or should it be optional ?
Possible Solutions
- Extend the syntax of "ListSets" to parallel the "ListMetadataFormats"
verb. In particular, if an identifier is provided, then the response should
indicate the list of sets that contain that identifier.
- If the "about" container may be repeatable, allow one of these to contain
a set listing. For example:
...
<about>
<sets xmlns="setnamespace" xsi:schemaLocation="setschema>
<setSpec>set12</setSpec>
<setSpec>set42</setSpec>
</sets>
</about>
...
- Add set information to the headers
...
<header>
<identifier>oai:anArchive:12345</identifier>
<datestamp>2000-01-01</datestamp>
<set>set12</set>
<set>set42</set>
</about>
...
- A combination of (1) and (3)