| MARC XML |
|
The Open Archives Initiative "recognizes that archives will use specific metadata sets and formats that suit the need of their communities and the types of data they handle" [OA core document].
In the case where participating archives recognize an existing metadata
standard, they are encouraged to provide metadata in both the
Open Archives Metadata
Set and the existing standard. For a great many library systems,
that existing standard will be MARC records.
MARC records have their own native transport format, but this format ("MARC communications format" or among old-timers "tape format") requires specialized parsers and makes use of some fairly arcane conventions. The VT DLRL has undertaken to create an XML Schema to support wider distribution of MARC records within the OA community.
The current version of the MARC XML Schema can be found at:
|
|
The Virginia Tech DLRL has agreed to provide a freely available set
of Java classes to handle translations between MARC communications format and OAI
XML. Our design is to provide two layers of classes: a MarcRecord class
that can read and write both MARC communications format and the OAI MARC XML format, and a
MarcDocument subclass that can provide additional translations, for instance
to Open Archives Metadata Standard (OAMS) records and to pretty-printed HTML.
Each layer also includes classes for MarcFixField, MarcVarField, MarcSubField
and so forth.
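To give a feel for the "arcane conventions" of the communications format mentioned above, here is a minimal sketch of reading a record's directory. It is not taken from the edu.vt.marian code: the class name MarcDirectorySketch and its methods are hypothetical, and the layout it assumes (a 24-byte leader with the base address of data in bytes 12-16, then 12-byte directory entries holding a 3-character tag, a 4-digit field length and a 5-digit starting offset, closed by the field terminator 0x1E) is the MARC 21 record structure as we understand it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of the DLRL MarcRecord classes. */
final class MarcDirectorySketch {
    static final byte FIELD_TERMINATOR = 0x1E;
    static final int LEADER_LENGTH = 24;   // assumed fixed leader size
    static final int ENTRY_LENGTH = 12;    // assumed: tag(3) + length(4) + start(5)

    /** One directory entry: the field tag plus where its data lives. */
    static final class Entry {
        final String tag;
        final int length;  // field length, including the field terminator
        final int start;   // offset relative to the base address of data
        Entry(String tag, int length, int start) {
            this.tag = tag;
            this.length = length;
            this.start = start;
        }
    }

    /** Reads a zero-padded decimal number stored as ASCII digits. */
    private static int number(byte[] rec, int offset, int digits) {
        return Integer.parseInt(new String(rec, offset, digits, StandardCharsets.US_ASCII));
    }

    /** Leader bytes 12-16 give the offset at which field data begins. */
    static int baseAddress(byte[] rec) {
        return number(rec, 12, 5);
    }

    /** Walks the directory from the end of the leader to its terminator. */
    static List<Entry> readDirectory(byte[] rec) {
        List<Entry> entries = new ArrayList<>();
        for (int pos = LEADER_LENGTH; rec[pos] != FIELD_TERMINATOR; pos += ENTRY_LENGTH) {
            String tag = new String(rec, pos, 3, StandardCharsets.US_ASCII);
            entries.add(new Entry(tag, number(rec, pos + 3, 4), number(rec, pos + 7, 5)));
        }
        return entries;
    }

    /** Returns a field's raw data without its terminator, one char per byte. */
    static String fieldData(byte[] rec, Entry e) {
        // ISO-8859-1 keeps every byte intact; ANSEL bytes are NOT decoded here.
        return new String(rec, baseAddress(rec) + e.start, e.length - 1,
                StandardCharsets.ISO_8859_1);
    }
}
```

Given the bytes of one record, readDirectory lists its fields and fieldData pulls out the raw data for any of them. Indicators, subfield delimiters and character translation are all left out of this sketch.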
The MarcDocument object can also produce short and long descriptions in ASCII or ANSEL, long descriptions in HTML, and something approximating OAMS metadata records in the XML transport defined in the Santa Fe Convention. This is in keeping with the class's other life, as a presentation object in the MARIAN digital library system.
A "Beta" version of both layers can be found in this directory. This version has been tested on over 150,000 MARC records, moving from communications format to XML and back to communications format without losing a character. Thanks to Dean Wilder for the test set. Despite this, the code is neither necessarily finished nor made beautiful to read.
The Java files think that they are (part of) the package edu.vt.marian.Document. References to other edu.vt.marian packages can be found via the MARIAN Java Re-Engineering project page on the DLRL site. If you are interested in running the code locally, we recommend that you
Update 15 February 2001: By mistake, an earlier version of the EntityMap.java class was on the site in the first release. If you downloaded the code before 15 February, please update this file. Also updated are MarcFilter.java (now accepts input from System.in as well as from named files) and MarcDocument.java (now gives more informative messages for malformed input).
|
Here are some examples of records translated by the Java class:
286 interesting records from VT Library, selected because each has an 856u (URL) field:
A selection of music manuals from the LoC American Memory project (208 records; thanks to Caroline Arms for this set):
A very large sample (> 115,000 records) of non-English-language records from the Library of Congress:
The first 24 records in the above set:
4430 public Virginia Tech ETDs generated from the VTDL databases:
For each example, we have provided some subset of the full alternatives available from the Java translation software. Others can be loaded on request. These alternatives are:
Other alternatives are possible, up to a full 3x5 grid of {full, long, short} vs. {XML, HTML, SGML, ASCII, ANSEL}. Also, other XML, HTML, or SGML translations are possible, although the Java can currently only handle one schema or DTD per language. Most of the alternatives are unlikely to happen soon, but we would like some day to produce a full description in either ASCII or HTML that shows all the data in the record in human-readable form.
|
|
One of the trickiest problems with translating MARC records has been handling non-ASCII characters. MARC uses two systems for characters beyond the few Anglo-American letters defined in the ASCII standard. One is based on Unicode, and is not treated explicitly in either the OAI MARC XML standard or the processing software available here. We hope and believe that Java software and XML transport standards will handle Unicode without additional assistance from us.
The other, more commonly used standard for representation of non-ASCII characters in MARC records is the ANSEL character set defined at MARC Specifications for Character Sets: Latin. This set of characters, unique to the library community, is widely used but to the best of our knowledge has no simple correspondence to any other international standard or set of standards. The primary exception is the correspondence to Unicode encodings detailed at MARC-21 Specifications for Character Sets: Latin. After considerable discussion, we have opted in the OAI MARC XML standard to use numerically encoded entity references, as described by Unicode, Inc.
Finally, the treatment of diacritics in ANSEL allows for a wider range of modified characters than the composite characters (e.g. i-breve, e-tilde) defined in either ISO 8879 or Unicode. When faced with such a character combination, we have chosen to output a combination of character and modifier. This has the advantage of being unambiguous and close to the original ANSEL version; it has the disadvantage that browser plug-ins generally cannot cope. It also complicates the translator code noticeably, since in ANSEL modified characters are represented by a combining diacritic character followed by the modified character, while XML takes the more difficult alternative and places the combining diacritic after the character to be modified. Still, if the world were logical, we wouldn't need computers, now would we?
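The reordering described above, with the diacritic first in ANSEL and last in XML, can be sketched as follows. This is an illustration under stated assumptions, not the MarcRecord implementation: the class AnselReorderSketch is hypothetical, only three combining marks are listed (0xE1 grave, 0xE2 acute, 0xE4 tilde, which we believe match the ANSEL assignments), and base characters are assumed to be plain ASCII.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration of moving ANSEL diacritics behind their base character. */
final class AnselReorderSketch {
    // Partial, illustrative subset of ANSEL combining marks -> Unicode combining marks.
    private static final Map<Integer, Character> COMBINING = new HashMap<>();
    static {
        COMBINING.put(0xE1, (char) 0x0300); // combining grave accent
        COMBINING.put(0xE2, (char) 0x0301); // combining acute accent
        COMBINING.put(0xE4, (char) 0x0303); // combining tilde
    }

    /** Converts ANSEL bytes (diacritic first) to a Unicode string (diacritic last). */
    static String toUnicode(byte[] ansel) {
        StringBuilder out = new StringBuilder();
        StringBuilder pending = new StringBuilder(); // diacritics waiting for a base
        for (byte b : ansel) {
            int code = b & 0xFF;
            Character mark = COMBINING.get(code);
            if (mark != null) {
                pending.append(mark.charValue());
            } else {
                // Assumption: the base character is ASCII. ANSEL special
                // characters above 0x7F would need their own table entries.
                out.append((char) code).append(pending);
                pending.setLength(0);
            }
        }
        out.append(pending); // a trailing diacritic with no base is kept as-is
        return out.toString();
    }
}
```

Holding the diacritics in a small buffer until the base character arrives also copes with several diacritics stacked on one letter, since they are emitted in their original order.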
The Java translator uses a table of mappings for XML and HTML translation.
This makes reconfiguration easy, even to the point of completely substituting
another set of entities.
The complete set of mappings between ANSEL characters and numeric Unicode entity references can be found in the table:
An earlier set of mappings using ISO 8879 entities plus a supplementary set (marcadds.ent) can be found in the table:
Finally, a set using no modified characters but only separate combining diacritic entities can be found in:
The first is the actual table used to drive the Java MarcRecord object in translating to OAI XML. Each table is made up of lines, each of which specifies a mapping for one non-ASCII character as follows:
For instance, the combining acute accent is encoded in ANSEL by the
(8-bit) character with numeric value 226 (0xE2).
The table "ansel_uni_comb.map" contains a section for this diacritic that
includes in part the lines:
    0xE2 d #x0301
    65 #x00C1
    67 #x0106
    ...
    97 #x00E1
    99 #x0107
    ...
A Java MarcRecord object using this table would translate upper- and lower-case A and C with acute accents into single entities. Acute-accented characters not defined in the section would be translated into sequences consisting of the character to be accented followed by the combining diacritic &#x0301;. Thus the character sequence for a lower-case 'a' with an acute accent in ANSEL (:226:97: or in hexadecimal :E2:61:) would be mapped to the entity &#x00E1; in OAI XML. An ANSEL string containing a lower-case 'b' with an acute accent (:226:98: or in hexadecimal :E2:62:), a character for which there is no "precomposed" character defined in Unicode, would use the combining diacritic on the first line in the section, producing the sequence b&#x0301; in OAI XML.
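The table-driven lookup in this example can be sketched roughly as below. The class AnselEntitySketch is hypothetical and is not the DLRL EntityMap or MarcRecord code. It also assumes a table layout inferred only from the sample lines: a section opens with a line of the form "code d entity" for the diacritic, and each following two-column line gives a base character's decimal code and its precomposed entity.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a table-driven ANSEL-to-entity translator. */
final class AnselEntitySketch {
    // ANSEL diacritic code -> combining entity, e.g. 0xE2 -> "#x0301"
    private final Map<Integer, String> combining = new HashMap<>();
    // ANSEL diacritic code -> (base character code -> precomposed entity)
    private final Map<Integer, Map<Integer, String>> precomposed = new HashMap<>();

    /** Parses table text in the assumed layout; unrecognized lines are skipped. */
    AnselEntitySketch(String table) {
        int current = -1;
        for (String line : table.split("\n")) {
            String[] f = line.trim().split("\\s+");
            if (f.length == 3 && f[1].equals("d")) {
                current = Integer.decode(f[0]);          // accepts "0xE2"
                combining.put(current, f[2]);
                precomposed.put(current, new HashMap<Integer, String>());
            } else if (f.length == 2 && current >= 0) {
                precomposed.get(current).put(Integer.parseInt(f[0]), f[1]);
            }
        }
    }

    /** Translates ANSEL bytes to text with numeric entity references. */
    String translate(byte[] ansel) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < ansel.length; i++) {
            int code = ansel[i] & 0xFF;
            if (combining.containsKey(code) && i + 1 < ansel.length) {
                int base = ansel[++i] & 0xFF;            // diacritic precedes its base
                String single = precomposed.get(code).get(base);
                if (single != null) {
                    out.append('&').append(single).append(';');
                } else {
                    // No precomposed form: base character, then the combining entity.
                    out.append((char) base).append('&').append(combining.get(code)).append(';');
                }
            } else {
                out.append((char) code);                 // simplification: pass through
            }
        }
        return out.toString();
    }
}
```

For brevity the sketch handles only one diacritic per base character and passes every other byte through unchanged. Because the mappings live in the table text and not in the code, a different set of entities can be swapped in without recompiling.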
|
The Contact Person for this page is Robert France (Email france@vt.edu).