2005

MODS, MARC, and Metadata Interoperability PART 1

MARC Formats Interest Group (LITA/ALCTS)
Monday, June 27, 1:30 pm – 5:30 pm

Description from LITA site: Libraries face challenges in integrating descriptive metadata for electronic resources with traditional cataloging data. This program will address the repurposing of MARC data and metadata interoperability in a broader context. It will then introduce the Library of Congress’ Metadata Object Description Schema (MODS) and present specific project applications of MODS. Finally, the program will offer scenarios for coordinating MARC and non-MARC metadata processes in an integrated metadata management design and introduce tools for simplifying interoperability.
Speakers: Dr. William Moen, University of North Texas SLIS; Rebecca Guenther, Library of Congress; Ann Caldwell, Brown University; Marty Kurth, Cornell University; Terry Reese, Oregon State University

This was an extremely dense but immensely useful session; PowerPoint presentations will be available online at the ALCTS site some time soon (as of June 28 they are not yet linked).

Speaker 1: William Moen, Texas Center for Digital Knowledge, University of North Texas
Summary from Claire: Moen put into very succinct and very clear language the reasons why we (librarians but more specifically catalogers) have to begin to know standards other than our own.

Speaking on metadata interaction, integration and interoperability

Problem statement … is there a problem? We used to think of interoperability as a systems problem; we now understand that there are different levels to the problem. There are many metadata schema, some well-documented and well-known (AACR2), others less so. Ditto for content standards. There are also a variety of syntaxes (MARC and XML, for example). Lorcan Dempsey calls this our “vital and diverse metadata ecology.” We don’t really have a problem UNLESS we expect these various standards to interact, which of course we do.

So we are moving from a systems-oriented definition of interoperability to a user-oriented definition, Moen suggests a preliminary framework to help scope the work. Look at communities of practice: who is our community? Libraries, archives and museums are fairly tightly-knit communities with a good understanding of standards. As we try to cross into other communities, however, the costs of interoperability go up.

Communities of practice, two types:
-Networks of professionals (librarians, etc.) have similar language and shared meanings
-Information communities are looser organizations , and include the creators of information, managers of information (librarians/catalogers), and users.

Godfrey Rust (complete citation for this and other references will be in Moen’s ppt preso when it goes online) divides things into: PEOPLE, STUFF and AGREEMENTS.

Interoperability cost vs. functionality. William Arms’ curve of cost v functionality (graph & cite in ppt). OAI harvesting, for example, has lightweight requirements, so it is easy to implement but less functional. Federated searching/Z39.50 is highly functional but more costly to implement.

The library has developed very sophisticated structures over time. In the larger scheme of things, over time, probably these structures will not be as broadly adopted. The time is now: this is our opportunity to act if we want to try to see our standards adopted more broadly.

There probably will never be ONE canonical metadata scheme BUT we may all be able to agree on XML, which is a great step forwards. Some apparently simple schemes like Dublin Core turn out not to be so easy to implement in actual practice. We do not want to be further marginalized, we want to (have to) learn to play with others and have to get over the “not invented here” syndrome.

Mechanisms to address interoperability (with the fundamental assumption that there will NOT be one basic standard):
-Crosswalks/mapping
-Application profiles
-Registries
-RDF

Crosswalks and mapping. Mapping is the intellectual process of analyzing the standards and making matches. The crosswalk is the instantiation of the map. 1998 NISO white paper on crosswalks. This activity is successful when accomplished by someone who really knows the standards on both ends of the map: catalog librarians who know AACR2 will be responsible for becoming knowledgeable about other standards so that they can lead the mapping/crosswalking activity.

Difficult decisions to be made while mapping include: should it be one-way only or reversible? Reversible/round-trip: MARCXML < -> MARC. MARC -> MODS, however, is not round-trip, there is some loss of data, albeit perhaps slight. So is the mapping one-to-one, one-to-many, many-to-one, etc.? Other difficulties include vocabularies: how to go from controlled to uncontrolled? For example, how does one indicate in Dublin Core that the subject is an LC heading?

Mapping to an interoperable core. OCLC is working on this problem, trying to come up with something rich enough to act as a core: all things map to the core and then out again to other forms. They’ve been looking at MARC as the possible basis [note: see Terry Reese’s presentation on MarcEdit; he was the last speaker in this program]

Application profiles: same elements used in different ways, and with different meanings. These uses can refind the standard definition of the element as long as the fundamental meaning is unchanged.

Registries are necessary for application profiles to be successful. Ex: UK Schemas, EU Cores, others (see ppt)

RDF is the foundation of the semantic web and is a grammar for expressing terms, semantics. Moen admits his difficulty with RDF. Is important, but struggles to explain it.

Conclusions: Libraries are just ONE of the communities, we do not have a central role, but we may have a priveleged role thanks to our long experience. Some librarians continue to think that cataloging is different from metadata generation. We have to think about interacting with other communities. The challenge is to develop tools to hide the differences between formats (hide them from users of our systems). See Roy Tennant’s recent article about transparency. Moen demoe’d an SRW search on LOC which can show the data in MODS format or in XML, or in DC, etc. This is a good example of transparency: give the data to the user in a format that they can use.

Speaker 2: Marty Kurth, Cornell University Metadata Services

Provides services to faculty and others on campus. Interested in repurposing the library’s MARC. Metadata management design. What does all of this metadata mean for our shops and how do we set up systems and services that support interoperability over time? His preso is based on an article for Library HiTech that he co-authored 2004 (22:2).

Explains what is meant by ‘repurposing MARC data:’ being able to reuse MARC outside of the library catalog. Example collections: Making of America (MOA), Historical Math monographs, HEARTH home ec. collection, May anti-slavery, Historical literature of agriculture. All 5 of these dl projects had print counterparts and thus MARC to build on.

Metadata processing involves: mapping, defining relationships between schemas; transformation, the process of moving between schemes; and management, coordinating the tasks and the resources.

Metadata management challenges: workflows are not yet well established. Mapping and transformation is not happening all in one place, it is happening all over the library and may not be well documented, or if it is, the documentation may be scattered. Goal was to move from projects to process.

Why is repurposing MARC a logical place to begin? Firstly, we’ve got lots of it. Allows them to maximize the potential of the data. MARC mapping can be expensive; cost goes down as tools are developed. Typically this work is done by specialized staff for whom opportunity costs are expensive. It can be messy and difficult, it probably will generate multiple versions of data and records, etc. Thus, a good challenge.

Collection-specific mapping variations are inevitable. MOA, May, HEARTH all involve TEI. Handling of date transformation between MARC and TEI, for example, varied between the MOA and the May collections. The mapping was further complicated because each project was delivered with a different platform (DLXS, EnCompass, and DPub). Each project had slightly different needs. Work was performed in different areas of the library.

MARC mapping models. How to deal with the collection specificity? Looked at LOC MARC-> DC, but made local decisions on additional fields. Sought feedback on this library-wide.

Managing transformations. Transformations also vary from collection to collection. Some were performed by vendors. Scripting and XSLT trans. were later implemented. The library catalog is still the database of record. The scripted approach to transformation extracts the MARC, transforms it into XML, and combines it with other data including admin and technical md, OCR’ed text, etc. The XSLT approach involved writing transformations to accomodate the possible entirety of any MARC recrod; the metadata staff then customize the XSLT for their particular collections. It is easier to tweak and modify as the project unfolds. Documentation is critical and had been lacking in the past. It is a key component in management of metadata over time.

Metadata management: coordinating the intellectual work AND managing the tools and files that are products. The tools and process are resources to be managed. Important to know the user community for these tools and their needs for using and accessing them.

Strategies: inventory existing relationships and processes (this is not something Cornell has specifically done). Identify the staff who will be responsible over time and who will mentor. Requires strategic buy-in. Important to communicate the importance of this more than once. [Marty’s ppt. here gives a useful example of such an inventory]

Concrete next steps: how do we build a culture to embrace this? Develop reusable transformation tools. Build library consensus on mapping. Create a culture and a practice of sharing and revising. External stakeholder discussions, library-wide. Talk about the risks of NOT managing tools. Think about creating a repository for metadata management tools that is searchable.

Speaker 3: Rebecca Guenther, Library of Congress

Rich descriptive metadata in XML: MODS. Overview: background on MARC & XML, MODS intro, MODS’s relationship to other schemes.

MARC and XML. We have large investments in MARC. Cataloging is an early form of metadata. Trying to retool to exploit flexibility of XML. Also trying to anticipate receiving metadata in other formats in XML or as part of a digital object.

Evolution of MARC21. Until now, MARC has been both a syntax and an element set. In current environment, XML is being used more and more and more tools are available. Diagram shogin transformation from MARC21 to XML. First transform to MARCXML in order to be able to do other things (validation, etc.)

MARC 21 in XML. MARCXML is lossless and capable of round-trip to MARC. Once it is in XML, we can then use stylesheets/XSLT to present in different environments/interfaces.

MODS is a derivative of MARC. It uses XML Schema. It was initially thought of for library applications, but they are seeing other uses and implementations.

Why bother? There is an emerging initiative to reuse metadata in XML: SRU/SRW, METS, OAI, etc. Looking for something richer than Dublin Core. Before MODS, not much in between MARC and Dublin Core. MODS is a core element set for convergence between MARC and non-MARC XML.

Advantages of MODS: it is compatible with existing library database descriptions. Richer than d.c., simpler than MARC, partly because the language is more readable than numerical tags. The hierarchical structure more readily supports rich description of complex objects.

Features of MODS. Uses language-based tags which share definitions with MARC. Description is rul agnostic. Elements are reusable and not limited as to number of sub-element. For example, the name tag can be used throughout the record, in author fields but also as part of related item-subject. Redundant elements can be repackaged more efficiently [Rebecca’s ppt will be useful here to clarify these points]

Status of MODS. Started a MODS listserv in 2002. #.0 has been stable for about a year. 3.1 is coming out soon, doesn’t change anything in 3.0 but has been reordered to be compatible with MADS (Marc Authority). Registered with NISO.

Relationship to other schema. General-purpose and compatible with MARC. More broad than many other formats (EAD, ONYX, etc.) Difference between MODS and Dublin Core: MODS has structure, DC is flat. Can more precisely modify/qualify fields in MODS, for example, publication info can be related to date in MODS, can’t in DC. MODS is more compatible with library data. MODS can include record management information.

MARCXML vs MODS. Demo’ed music records in MARC, MARCXML, MODS. May not be exactly the same specificity when converting from MARC-MODS but most of the record converts.

LC uses of MODS. Using to describe electronic resources (AV project, web archiving). METS. SRU/SRW implementation offers records in MODS (this is one of the available choices).

MINERVA web archiving project. Exploring born-digital materials. Used MODS native (vs. creating as MARC and then converting to MODS); perhaps will some day put into the library catalog, but perhaps now. For web archiving, created 1 collection-level record, individual MODS records for each object.

Election 2002 web archiving: webarchivist.org cataloged the datea, creating MODS records for each site, some of which were captured more than one time. Other web archiving projects, yet to be cataloged: 9/11, 107th Congress, 2004 election.

Demo’ed 2002 election archive. Used XSLT to transform MODS to HTML. Link to the archived site. Showing MODS in XML – date captured data includes start and end points for capture. Decided not to link to the live site, which in many cases disappeared almost immediately after the election anyhow.

107th congress website archiving. Did in-house (MODS cataloging at LC). Used XMLSPY to catalog. Built own search and browse. Browse has drop-down menus to select the house or senate ctte.

Iraq war. Now have an input form for the catalogers to use as they catalog w/drop-down menus, etc.

I Hear America Singing project. METS + FEDORA w/MODS. METS packages all metadata and all digital objects, including sounds, CD covers and other images, etc.

Other MODS projects. MusicAustralia and Screen Sound Australia are using MODS as an exchange format.

Directions for MODS. Continue to explore interactions with METS. Continue to use for digital library projects @ LC. Richer linking capabilities than MARC. Website archiving. Looking at MODS tools, looking at using it with OAI as an alternative to D.C>

Q&A for the first three speakers
Q. When will MODS 3.1 be out?
R.G. Had hoped last week, but within next few weeks. 4.0 will be a complete rewrite and is in the workds but will take more time, require broader discussion, etc.

Q. As Cornell attempts to shift from a projects-oriented approach to a program-oriented approach, what will happen with the collection-specific approach, and have they talked about using MODS?
M.K. Talk about it all the time but there is some political drag to this idea.

Q. About LC web archiving; are any of the keywords or other data automatically extracted from web sites as they are archived/cataloged?
R.G. Yes, worked with their IT folks who extracted from the HTML. For the Milstein project (music project from I Hear America Singing) the metadata was all manually created, not extracted.

Q. Will MINERVA records go into library catalog?
R.G. Initially, though ILS was where all the records had to go, but with emergence of federated search, are no longer thinking this is the case.

Q. MARC records are dynamic and maintenance is possible (update an authority record, all records linking to it are updated)
M.K. Still consider library catalog to be the catalog of record. Haven’t established periodicity for refresh but it is possible to do this, built in to their design.

END OF PART ONE