MARC XML


MARC as an Open Archives Metadata Standard

The Open Archives Initiative "recognizes that archives will use specific metadata sets and formats that suit the need of their communities and the types of data they handle" [OA core document]. In the case where participating archives recognize an existing metadata standard, they are encouraged to provide metadata in both the Open Archives Metadata Set and the existing standard. For a great many library systems, that existing standard will be MARC records.

MARC records have their own native transport format, but this format ("MARC communications format" or among old-timers "tape format") requires specialized parsers and makes use of some fairly arcane conventions. The VT DLRL has undertaken to create an XML Schema to support wider distribution of MARC records within the OA community.
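For readers who have never met the communications format, a minimal sketch of its record layout may help. The layout below (a 24-byte leader, 12-byte directory entries, and the 0x1E/0x1D terminators) follows the ISO 2709 / ANSI Z39.2 family of standards on which MARC transport is based; the demo record, class, and method names are our own illustration, not part of the MARIAN code.

```java
import java.util.ArrayList;
import java.util.List;

// A toy reader for the MARC communications format ("tape format").
// Layout assumed (per ISO 2709 / ANSI Z39.2): a 24-byte leader whose
// bytes 0-4 give the record length and bytes 12-16 the base address of
// the data; then 12-byte directory entries (3-byte tag, 4-byte field
// length, 5-byte start offset) ended by a field terminator (0x1E);
// then the fields themselves, each 0x1E-terminated; then a record
// terminator (0x1D).
public class MarcDirectoryDemo {
    static final char FT = 0x1E; // field terminator
    static final char RT = 0x1D; // record terminator

    /** Walk the directory and return one "tag -> data" string per field. */
    static List<String> parse(String record) {
        List<String> fields = new ArrayList<>();
        int base = Integer.parseInt(record.substring(12, 17)); // base address of data
        for (int p = 24; record.charAt(p) != FT; p += 12) {
            String tag = record.substring(p, p + 3);
            int len = Integer.parseInt(record.substring(p + 3, p + 7));
            int off = Integer.parseInt(record.substring(p + 7, p + 12));
            // Field data runs len bytes from base+off, including its terminator.
            fields.add(tag + " -> " + record.substring(base + off, base + off + len - 1));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Build a tiny synthetic one-field record and read it back.
        String field = "Sample title" + FT;
        String directory = "245" + "0013" + "00000" + FT; // tag, length, offset
        int base = 24 + directory.length();
        int recLen = base + field.length() + 1; // +1 for the record terminator
        String leader = String.format("%05d", recLen) + "nam  22"
                + String.format("%05d", base) + "   4500";
        String record = leader + directory + field + RT;
        parse(record).forEach(System.out::println); // prints: 245 -> Sample title
    }
}
```

Real records of course carry many directory entries, subfield delimiters (0x1F) inside the field data, and ANSEL bytes; this sketch shows only why a specialized parser is needed at all.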

The current version of the MARC XML Schema can be found at:

http://www.openarchives.org/OAI/oai_marc.xsd
This version has been frozen for all of 2001 so developers won't have to track a moving target. Open questions still include how best to ensure that significant spaces in fixed fields stay fixed.


Java Implementation

The Virginia Tech DLRL has agreed to provide a freely available set of Java classes to handle translations between MARC communications format and OAI XML. Our design is to provide two layers of classes: a MarcRecord class that can read and write both MARC communications format and the OAI MARC XML format, and a MarcDocument subclass that can provide additional translations, for instance to Open Archives Metadata Standard (OAMS) records and to pretty-printed HTML. Each layer also includes classes for MarcFixField, MarcVarField, MarcSubField and so forth.
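As a rough illustration of the two-layer design, hypothetical stubs might look like the following. The method names here are invented for the sketch and need not match the actual edu.vt.marian classes.

```java
// Hypothetical stubs illustrating the two-layer design described above.
// The real edu.vt.marian classes will differ; method names are invented.
public class MarcLayersSketch {
    /** Lower layer: holds a record and round-trips transport formats. */
    static class MarcRecord {
        final String raw; // MARC communications format, as read
        MarcRecord(String communicationsFormat) { this.raw = communicationsFormat; }
        String toOaiXml() { return "<oai_marc><!-- fields --></oai_marc>"; }
    }

    /** Upper layer: adds presentation translations (HTML, OAMS, ...). */
    static class MarcDocument extends MarcRecord {
        MarcDocument(String cf) { super(cf); }
        String toLongHtml() { return "<p><!-- long description --></p>"; }
        String toOams()     { return "<!-- OAMS record in Santa Fe XML -->"; }
    }

    public static void main(String[] args) {
        MarcDocument doc = new MarcDocument("...raw MARC record...");
        System.out.println(doc.toOaiXml());   // inherited transport translation
        System.out.println(doc.toLongHtml()); // presentation-layer translation
    }
}
```

The point of the layering is that transport concerns (communications format, OAI XML) live in the base class, while display and metadata translations are added by the subclass.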

The MarcDocument object can also produce short and long descriptions in ASCII or ANSEL, long descriptions in HTML, and something approximating OAMS metadata records in the XML transport defined in the Santa Fe Convention. This is in keeping with the class's other life as a presentation object in the MARIAN digital library system.

A "Beta" version of both layers can be found in this directory. This version has been tested on over 150,000 MARC records, moving from communications format to XML and back to communications format without losing a character. Thanks to Dean Wilder for the test set. That said, the code should be considered neither finished nor beautiful to read.

The Java files think that they are (part of) the package edu.vt.marian.Document. References to other edu.vt.marian packages can be found via the MARIAN Java Re-Engineering project page on the DLRL site. If you are interested in running the code locally, we recommend that you

  1. build directories in your Java class directory for
    • edu
    • vt
    • marian
    each inside the preceding, and within marian build two directories called common and Document.
  2. download all the .java files from
    http://dlrl.cc.vt.edu/projects/MarianJava/edu/vt/marian/common/
    into the common directory and all the .java files from
    http://dlrl.cc.vt.edu/projects/MarianJava/edu/vt/marian/Document/
    into the Document directory.
  3. compile the code in the common, then the Document directory.
You are now in a position to use MARIAN classes in your own code or to run, for instance, edu.vt.marian.Document.MarcFilter as a stand-alone Java application.

Update 15 February 2001: By mistake, an earlier version of the EntityMap.java class was on the site in the first release. If you downloaded code prior to 15 February, please update this file. Also updated are MarcFilter.java (now accepts input from System.in as well as from named files) and MarcDocument.java (now gives better information messages for malformed input).


Samples and Examples

Here are some examples of records translated by the Java class:

286 interesting records from the VT Library, selected because each has an 856u (URL) field:
vtlib.urls.mrc (Original communications format records)
vtlib.urls.xml (Validated OAI_MARC XML stream)
vtlib.urls.html (Long descriptions in ASCII and HTML)

A selection of music manuals from the LoC American Memory project (208 records; thanks to Caroline Arms for this set):
musdibib.mrc (Original communications format records)
musdibib.xml (Validated OAI_MARC XML stream)
musdibib.html (Long descriptions in ASCII and HTML)

A very large sample (> 115,000 records) of non-English-language records from the Library of Congress (thanks to Dean Wilder):
file14f.mrc (Original communications format records)

The first 24 records in the above set:
file14f.samp.mrc (Original communications format records)
file14f.samp.html (Long descriptions in HTML only)
file14f.samp.xml (Validated OAI_MARC XML stream)

4430 public Virginia Tech ETDs generated from the VTDL databases:
etds.mrc (Original communications format records)
etds.xml (Validated OAI_MARC XML stream)
etds.oams (OAMS short records, in XML)


For each example, we have provided some subset of the full alternatives available from the Java translation software. Others can be loaded on request. These alternatives are:
  1. Full versions — all information present; intended for transport between systems
    1. the communications format version of the record, perforce in ANSEL
    2. the OAI XML for the complete record
  2. Long versions — most information present; intended for display to library end-users
    1. a simple formatted description of the record in ASCII
    2. an HTML description of the record as used in MARIAN
  3. Short versions — minimal information; intended e.g. for lists of search results
    1. a one-line Author / Title / Publication version in ASCII
    2. the OAMS metadata translation in XML

Other alternatives are possible, up to a full 3x5 grid of {full, long, short} vs. {XML, HTML, SGML, ASCII, ANSEL}. Also, other XML, HTML, or SGML translations are possible, although the Java can currently only handle one schema or DTD per language. Most of the alternatives are unlikely to happen soon, but we would like some day to produce a full description in either ASCII or HTML that shows all the data in the record in human-readable form.


Character Sets

One of the trickiest problems with translating MARC records has been handling non-ASCII characters. MARC uses two systems for characters beyond the few Anglo-American letters defined in the ASCII standard. One is based on Unicode, and is not treated explicitly in either the OAI MARC XML standard or the processing software available here. We hope and believe that Java software and XML transport standards will handle Unicode without additional assistance from us.

The other, more commonly used standard for representation of non-ASCII characters in MARC records is the ANSEL character set defined at MARC Specifications for Character Sets: Latin. This set of characters, unique to the library community, is widely used but to the best of our knowledge has no simple correspondence to any other international standard or set of standards. The primary exception is the correspondence to Unicode encodings detailed at MARC-21 Specifications for Character Sets: Latin.

After considerable discussion, we have opted in the OAI MARC XML standard to use numerically encoded entity references, as described by Unicode, Inc.

Finally, the treatment of diacritics in ANSEL allows for a wider range of modified characters than the composite characters (e.g. i-breve, e-tilde) defined in either ISO 8879 or Unicode. When faced with such a character combination, we have chosen to output a combination of character and modifier. This has the advantage of being unambiguous and close to the original ANSEL version; it has the disadvantage that browser plug-ins generally cannot cope. It also complicates the translator code noticeably, since in ANSEL modified characters are represented by a combining diacritic character followed by the modified character, while XML takes the more difficult alternative and places the combining diacritic after the character to be modified. Still, if the world were logical, we wouldn't need computers, now would we?

The Java translator uses a table of mappings for XML and HTML translation. This makes reconfiguration easy, even to the point of completely substituting another set of entities. The complete set of mappings between ANSEL characters and numeric UNICODE entity references can be found in the table:

ansel_uni_comb.map

An earlier set of mappings using ISO 8879 entities plus a supplementary set (marcadds.ent) can be found in the table:
oai_xml.map

Finally, a set using no modified characters but only separate combining diacritic entities can be found in:
ansel_unicode.map

The first is the actual table used to drive the Java MarcRecord object in translating to OAI XML.

Each table is made up of lines, each of which specifies a mapping for one non-ASCII character as follows:

characterValue    <TAB>    characterType    <TAB>    entity
where
  • characterValue is the numeric value of the character
  • characterType is either 'e' for a simple entity or 'd' for a combining diacritic
  • entity is the XML entity name (minus the & and ;)
Each combining diacritic is followed by a set of offset lines that specify precomposed entities using the combining mark:
   <TAB>    characterValue    <TAB>    entity
where
  • characterValue is the numeric value of the (ASCII) character being modified
  • entity is the XML entity name (minus the & and ;) of the resulting "precomposed" entity
Character values may be either in base 10 or (with leading "x" or "0x") in hexadecimal.
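A reader for this table format can be sketched as follows, assuming only the line layout described above; the class and method names are our own illustration, not the MARIAN code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of a reader for the mapping-table format described above:
// header lines are "characterValue TAB type TAB entity", and each
// combining diacritic ('d') is followed by offset lines of the form
// "TAB characterValue TAB entity" naming its precomposed entities.
public class AnselMapReader {
    /** One mapped ANSEL character. */
    static class Mapping {
        char type;                // 'e' = simple entity, 'd' = combining diacritic
        String entity;            // entity name minus the '&' and ';'
        Map<Integer, String> precomposed = new HashMap<>();
    }

    /** Parse "x.." / "0x.." as hexadecimal, anything else as decimal. */
    static int charValue(String s) {
        if (s.startsWith("0x")) return Integer.parseInt(s.substring(2), 16);
        if (s.startsWith("x"))  return Integer.parseInt(s.substring(1), 16);
        return Integer.parseInt(s);
    }

    static Map<Integer, Mapping> read(List<String> lines) {
        Map<Integer, Mapping> table = new LinkedHashMap<>();
        Mapping current = null;
        for (String line : lines) {
            String[] f = line.split("\t");
            if (line.startsWith("\t")) {
                // Offset line: TAB characterValue TAB entity (f[0] is empty)
                current.precomposed.put(charValue(f[1]), f[2]);
            } else {
                // Header line: characterValue TAB type TAB entity
                current = new Mapping();
                current.type = f[1].charAt(0);
                current.entity = f[2];
                table.put(charValue(f[0]), current);
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Map<Integer, Mapping> t = read(Arrays.asList(
                "0xE2\td\t#x0301", "\t65\t#x00C1", "\t97\t#x00E1"));
        System.out.println(t.get(0xE2).precomposed.get(97)); // prints: #x00E1
    }
}
```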

For instance, the combining acute accent is encoded in ANSEL by the (8-bit) character with numeric value 226 (0xE2). The table "ansel_uni_comb.map" contains a section for this diacritic that includes in part the lines:

0xE2	d	#x0301
	65	#x00C1
	67	#x0106
	...
	97	#x00E1
	99	#x0107
	...
A Java MarcRecord object using this table would translate upper- and lower-case A and C with acute accents into single entities. Acute-accented characters not defined in the section would be translated into sequences consisting of the character to be accented followed by the combining diacritic &#x0301;. Thus the character sequence for a lower-case 'a' with an acute accent in ANSEL (:226:97: or in hexadecimal :E2:61:) would be mapped to the entity &#x00E1; in OAI XML. An ANSEL string containing a lower-case 'b' with an acute accent (:226:98: or in hexadecimal :E2:62:), a character for which there is no "precomposed" character defined in UNICODE, would use the combining diacritic on the first line in the section, producing the sequence b&#x0301; in OAI XML.
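The translation step itself can be sketched as follows, using just the acute-accent section excerpted above; the names are our own illustration rather than the MARIAN code.

```java
import java.util.Map;

// Sketch of the translation step, using only the acute-accent section of
// "ansel_uni_comb.map" excerpted above.
public class AnselToXmlDemo {
    // Precomposed entities for ANSEL 0xE2 (combining acute), per the table.
    static final Map<Integer, String> ACUTE_PRECOMPOSED = Map.of(
            65, "#x00C1",   // A acute
            67, "#x0106",   // C acute
            97, "#x00E1",   // a acute
            99, "#x0107");  // c acute

    /** Translate one ANSEL pair (combining diacritic, then base character). */
    static String translate(int diacritic, int baseChar) {
        if (diacritic != 0xE2)
            throw new IllegalArgumentException("demo handles the acute accent only");
        String pre = ACUTE_PRECOMPOSED.get(baseChar);
        if (pre != null)
            return "&" + pre + ";"; // a single precomposed entity
        // No precomposed form: emit the base character and then the combining
        // mark -- note the order reversal relative to ANSEL.
        return (char) baseChar + "&#x0301;";
    }

    public static void main(String[] args) {
        System.out.println(translate(0xE2, 0x61)); // :E2:61: -> &#x00E1;
        System.out.println(translate(0xE2, 0x62)); // :E2:62: -> b&#x0301;
    }
}
```

The two branches correspond exactly to the two cases in the text: a precomposed entity when the table defines one, and character-plus-combining-mark (in the XML order) when it does not.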


The Contact Person for this page is Robert France (Email france@vt.edu).


Back to VT-OAI Home Page     Back to DLRL Home Page     Back to Open Archives Home Page