Application of JPEG2000 in Archives & Libraries

Application of JPEG2000 in Archives & Libraries
Peter Murray
Concurrent session #1, LITA National Forum 2005
September 30, 2005

Started out with questions; who is thinking of it as an access technology? Who is thinking of it as a preservation technology?

Contributions from the audience: what are people interested in?
-Someone who just bought a JPEG2000 product is wondering how to use it.
-People who are interested in archiving issues
-People who are using ContentDM are using JPEG2000, there are a few of those folks here.
-Can we use it for newspapers?
-VidiPax has customers who are asking about Motion JPEG2000 …
-What are the performance issues
-LuraTech
-ExLibris, who OEM’s the AWARE product
-Endeavor

Attributes the presentation in part to Robert Buckley, Research Fellow at Xerox; some of Peter’s slides are from Buckley’s presentation.

Key Messages:
Will begin with an intro to the format.
Talk about the “value proposition”
Opportunities for collaboration

What is JPEG2000?
Wavelet-based image compression standard. The same ISO committee that worked on JPEG worked on this; 2000 was the year that ISO officially passed Part 1 of the standard.

Conception:
-Improve the performance of JPEG
-Add features and capabilities not available with baseline JPEG compression.

What is required to adopt a new technology?
1) Knowledge

JPEG2000 is one standard, but it has an evolving number of parts.
—–
Part 1: the core image coding system, passed in 2000
Part 2: extensions
Part 3: Motion JPEG2000
Part 4: conformance testing
Part 5: reference software
Part 6: compound image file format
…plus more parts beyond those.

Image codestream compression architecture: PART ONE

Wavelet Transform: see slides from XEROX (Peter is trying to get the rights to redistribute those). The format divides image data up into discrete blocks by size, by resolution, etc. so that parts of the codestream can be accessed to get derivatives of the image at various sizes, resolutions, etc. very efficiently. When all of the pieces, or blocks, are reassembled, the original image results.

Can deliver JPEG 2000 images:
Progressively by size
Progressively by resolution
Progressively by Quality

JPEG2000 optimizes compression across the entire image, rather than by spatial blocks as JPEG does.

Color management in Part 1 is based on the sRGB color space. The people who were at the table when those discussions were happening felt sRGB was good enough.

Part 2 (JPX) is more capable: it supports other color spaces and full ICC profiles.
—-
Image components

Part 1 (JP2) supports 1- or 3-component images, plus optional masks, all JPEG2000 compressed. One component would be black-and-white only; three components would be RGB.

Part 2 (JPX) supports anything for which there is a color spec, for example multispectral photography. Getting beyond just the Red, Green and Blue spectrum.

FILE FORMAT ARCHITECTURE

Initially, the JPEG group only specified the compression and didn't address the file format. There were negative outcomes, with a proliferation of different JPEG file formats.

This time, they decided to address the file format as well.

A JPEG2000 file is a sequence of boxes with 3 fields each:
-length L
-type T
-data D

With such a file format, an application can read a box, figure out how long it is, skip past the length field to the type to see what kind of info it holds, and, if it isn't interested in that data, skip over the box entirely (a minimal parsing sketch follows the box list below).

BASIC JPEG2000 file:
-Begins with a JPEG2000 signature box (declares itself as a member of the JPEG2000 file family)
-File type box
-Header box (image and color parameters)
-Codestream box (actual image data)
-Metadata
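
[Not from the talk: a minimal Python sketch of walking the top-level boxes of a JP2 file using just the length/type/data layout described above; the filename is a placeholder.]

import struct

def walk_jp2_boxes(path):
    """Print the top-level boxes of a JP2 file: read length (L) and type (T), skip the data (D)."""
    with open(path, "rb") as f:
        while True:
            header = f.read(8)          # 4-byte length + 4-byte type
            if len(header) < 8:
                break
            length, box_type = struct.unpack(">I4s", header)
            if length == 1:             # extended length stored in the next 8 bytes
                length = struct.unpack(">Q", f.read(8))[0]
                data_len = length - 16
            elif length == 0:           # box runs to the end of the file
                print(box_type.decode("ascii", "replace"), "(to end of file)")
                break
            else:
                data_len = length - 8
            print(box_type.decode("ascii", "replace"), length)
            f.seek(data_len, 1)         # not interested in the data? skip the whole box

walk_jp2_boxes("example.jp2")           # expect boxes such as the signature, ftyp, jp2h, jp2c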

METADATA
You can pretty much put anything you like in it; allows for two types:
-XML box, any XML-formatted metadata
-Any other kind of data (UUID boxes): voice annotations, TIFFs, PDFs, etc.

JPEG2000 FILE FORMAT FAMILY
-JP2 (JPEG 2000 Core)
-JPX (Extensions)
-MJ2 (timed sequence of JPEG2000 images). Not coded with interframe differences
-JPM (JPEG2000 Multi-layer). Documents where different parts of the image might be coded differently; for example, a newspaper article where the text can be bitonal but the photograph RGB.

Motion JPEG2000 was recently adopted by the Digital Cinema Initiative: this will be the way movie content is delivered to theaters.

JP2 HEADER BOX: TECHNICAL METADATA LIKELY TO BE ENCODED
-image header
-Bits per component

There has already been an initiative to map the JP2 headers to TIFF (see the American Memory site for info on TIFF headers).
Some things that don't have direct mappings from the TIFF header to the JP2 header can possibly be mapped to Dublin Core instead.

Protection
-Security. JPX introduces a digital signature Box, containing a checksum or digital signature
-Part 8 supports selective encryption and conditional access for the codestream.
For example, you could password-protect a certain layer of your JPEG2000 file. This may not necessarily be advisable for long-term archiving, but could perhaps be useful for secure transit between archives.

Error resilience
Variable length coders like JPEG2000 are vulnerable to errors that cause loss of synchronization. In Part 1, optional start of packet (SOP) synchronization markers are defined, so that an application reading in the file could resynchronize.
-Part 11, which deals with JPEG2000 for wireless, defines methods for protecting the codestream from errors in noisy environments.

Losing a certain amount of data from a JPEG2000 image will yield a loss of some kind, but it will not be as catastrophic as losing data from the middle of, say, a JPEG or an uncompressed TIFF or a TIFF with compression. Compression in JPEG2000 is applied across the entire image, so you don't have chunks of data that correspond to what we think of as chunks of an image, that is, a block with X,Y coordinates.

JPM – Multilayer JPEG2000 for compound document images.

JPSearch aims to provide a clear understanding of the image retrieval process. The library community should be active here.

——————–

JPEG2000 Practice in Archives and Libraries

What is required to adopt a new technology?
2) Is JPEG2000 a better-enough technology? This is the key question that we should be asking ourselves.

WHY USE JPEG2000?
-Open standard; royalty-free use. Write an encoder and decoder and pay royalties to no one. There are no patent issues for encode/decode. Vendors can license their software. Writing an encoder is harder than writing a decoder.
-One asset supports multiple derivatives; one file for both lossless and lossy data.
-Region-of-interest (ROI) on coding and access. Can specify that certain parts of the image are very important and should be encoded at higher quality, for example.
-Easily handles large images. (Peter's example: ER Mapper brought very large disk packs with a 10-terabyte image and browsed it as a JPEG2000 file.)
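
[Not from the talk: a rough sketch of the "one asset supports multiple derivatives" point above, using the open-source glymur library; the filename is a placeholder, and reduced-resolution decoding assumes the file was encoded with enough resolution levels.]

import glymur  # open-source JPEG2000 library built on OpenJPEG

jp2 = glymur.Jp2k("master.jp2")   # hypothetical archival master file

full = jp2[:]            # decode the full-resolution image
half = jp2[::2, ::2]     # decode at the next resolution level (half size)
thumb = jp2[::8, ::8]    # decode a small derivative, if enough decomposition levels exist

print(full.shape, half.shape, thumb.shape)   # one stored asset, several delivery sizes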

Architecture for access and archiving with JPEG2000
-Part 9
Peter is working on the architecture piece: capture and management of JPEG2000.

JPEG2000 in use:
National Digital Newspaper Program (NDNP)
-Objectives and constraints

UConn’s Charles Colson project
Annotated Melville’s manuscripts
Received a grant for preservation treatment and digitization.
Have embedded various types of metadata: TEI headers (XML data), PDF (UUID data), and the entire EAD finding aid to provide context, so the user can tell where this came from.

—–
What remains?

What is required to adopt a new technology?
3) Confirmation: dialog, and people to test whether this is where we should be going. Peter thinks this is better enough than what we have been doing; Harvard and LC are also starting to do this.

Final questions
Has anyone endorsed this as a standard? Library of Congress has put it on par with TIFF as a storage standard.

Archive groups haven’t endorsed it yet.

Is this replacing EXIF data? Will camera vendors do JPEG2000? Yes, there will be some new digital cameras this Christmas with JPEG2000 support.

Custom metasearch services using an XML API

Concurrent session 3
Custom metasearch services using an XML API
LITA National Forum 2005
Saturday, October 1, 2005, 10:50am

Roy Tennant, CDL

Breaking out of the box means using an API to create your own interface; this allows deeper integration of other types of activities (querying a dictionary service to check spelling, for example).

Why do it? Greater interface flexibility; can do things the vendor doesn't support. The customized interface can remain unchanged when new versions of the application are released.

Today's talk from CDL is about the MetaLib X-Server, but there are other products that offer similar functionality.

CDL's vision is of many search portals. No one-stop shopping. Many services for many different audiences. This is often very problematic with a vendor product out of the box, particularly because the tool ships with a lot of little fragments of XML, etc. that make up the interface, so making even minor changes requires a lot of hunting and correcting.

Also, the code was crap. Very substandard.

At CDL, metasearching is not just about searching databases, but about searching other things as well, including OAI-harvested data, crawled earth-sciences websites, RSS feeds of new articles and resources, etc.

Showed a wireframe of a metasearch app. Showing examples of image search, etc.

Diagram of applications … see PowerPoint from this preso.

———-
Michael McKenna CDL: Using SOAP and XML to access Metalib

Use the HTML interface from MetaLib as little as possible.

Currently, the X-Server supports basic services: an XML interface (basic request/response, modeled on web services), a core interface to MetaLib to do querying of databases, and data presentation to other apps.

Basic services:
-Login (this is an application connection)
-User authentication (different from login, which is an application accessing rather than a person)
-Retrieve resources
-Search resources
-Retrieve search status report
-Combine results sets
-Retrieve search results
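
[Not shown in the session: a hedged Python sketch to give a feel for this request/response style. The host, operation names, and parameters below are placeholders, not the actual MetaLib X-Server API; the real operation names and arguments are in the vendor documentation.]

import requests                      # assumes the requests package is available
import xml.etree.ElementTree as ET

X_SERVER = "http://metalib.example.edu/X"   # hypothetical XML-server endpoint

def x_request(op, **params):
    """Send one operation to the XML server and return the parsed XML response."""
    resp = requests.get(X_SERVER, params={"op": op, **params})
    resp.raise_for_status()
    return ET.fromstring(resp.content)

# Illustrative flow only; operation and parameter names are placeholders:
session = x_request("login", user="app_user", password="secret")
x_request("search", session_id="...", base="SOME_DB", query="dublin core")
results = x_request("retrieve_results", session_id="...", set_number="...")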

Metalib’s two interfaces: /v (vanilla) and /x (XServer):
-When there is an upgrade, there is a certain amount of fixing involved: with one /v upgrade, 196 files changed, 2,354 files changed across all campuses. Not just the 10 campuses, but multiple departments and libraries within those. The upgrade script for /v saves and marks changed files. Would have to "diff" to find out what those changes are.
-With /x, never ever have to change the UI. May require mods to the Common Framework (CF) layer (more on this bit later in preso).

Development Methodology
Over 60 people on staff (more than 20 of them developers) are working on this. Potentially anyone could make a change and it would affect all. Like to use source control to track this.
-With /v, cannot use CVS easily. Had to keep very close tabs on the changes
-With /x, much easier to test and prototype, synch with checkin server without making any changes to the back end.

For example, to customize simple search, the following has to change:
-/v: many fragments of docs to be updated. MANY MANY fragments
-/x: the interface will go to JSP pages; edit one .jsp file.

Common Framework
Integrated interface for CDL managed info services. Packaged for internal use by the various campuses.

See ppt for the diagrams of the following layers:
-Application layer
-Manager layer
-Service layer
-Client layer
-User interface

CF integration
Used existing modules as templates and modified them to act as metasearch interfaces. The result sits alongside other applications at every layer (see list above): a complete MetaLib search "slice."

Logic Flow:
-Connect
-Authorize site (once person has logged in to a site, the site tells other apps about what the user has access to, so they can just authorize the site)

Michael took us on a detailed walk through the layers, using the diagram of the system layers and slices.

Issues, which they are working with the vendor on:
-Limited buffer sizes. Being fixed by ExLibris
-Limited number of databases that can be searched. Need to balance speed vs. coverage
-Issues with response time and timeouts

Future
-Finish service implementation
-usability studies
-write client layer
-write UI
-Beta rollout will be fall/winter 2005

Other thoughts about future
What happens when MetaLib releases entirely new Web Services? Or if ExLibris fails?
Make some changes at the management layer

———-
David Walker
Web dev. librarian @ CSU San Marcos

Actual preso and the handout are radically different: is now talking about RSS Creator

The problem with Journals & RSS
-RSS ideal for journals? Timely content, high interest among faculty. A TOC alerting service, whether email based or other, is a traditional type of service.
-Problem is that few aggregators or publishers offer these. If they do, they still have to be discovered, collected and maintained. If they do exist, they probably link back to the publisher’s site.

Showed an example of a feed from the Journal of Toxicology, a link back to the publisher’s page. This doesn’t take into account whether or not the library subscribes to or otherwise has access to the ejournal already.

RSS Creator
-TOC alerting service: wanted to create this, but there was no budget and no staff time to maintain it. The focus at San Marcos is on undergrad activity; this is probably of more interest to faculty.

Diagram of network model: SFX, Databases, Metalib

In SFX, there is a data export tool. Could export all of the journal subscriptions, which db’s have full text, etc.

This knowledge base is exported. Then, the RSS creator ingests the file, breaks it apart, makes a db of the holdings.

When a user visits the RSS feed, MetaLib passes a request to the database or databases, gets the journal info back as XML, transforms it, and creates RSS. Creating a feed for the first time takes 5-10 seconds. The user gets an RSS feed, and the info is cached on the server. Cron jobs on a Windows server go back to the database to check for new articles.
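
[Not David's code: a minimal sketch of the last step in that flow, turning article records already parsed from the database XML into an RSS 2.0 feed. The field names and the SFX link pattern are assumptions for illustration.]

import xml.etree.ElementTree as ET

def build_rss(journal_title, articles, link_resolver="https://sfx.example.edu/sfx_local"):
    """Build an RSS 2.0 feed; each item links through the link resolver (OpenURL-style)."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Latest articles: {journal_title}"
    ET.SubElement(channel, "link").text = link_resolver
    ET.SubElement(channel, "description").text = "Table-of-contents alerting feed"

    for art in articles:                       # e.g. dicts parsed from the database XML
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = art["title"]
        # Send the user through the link resolver so local holdings/ILL options appear:
        ET.SubElement(item, "link").text = (
            f"{link_resolver}?issn={art['issn']}&volume={art['volume']}"
            f"&issue={art['issue']}&spage={art['spage']}"
        )
        ET.SubElement(item, "description").text = art.get("abstract", "")

    return ET.tostring(rss, encoding="unicode")

print(build_rss("Journal of Toxicology",
                [{"title": "Example article", "issn": "0000-0000",
                  "volume": "12", "issue": "3", "spage": "45"}]))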

When a user clicks on an article link, they get the standard SFX interface. Includes links to ILL options, since the journals that CSU San Marcos subscribes to are limited to supporting the curriculum; the interests of faculty are potentially much broader.

Advantages: BIG — all the content in databases. Can represent 20,000 – 40,000 feeds. EASY —
[sorry, missed the next two advantages]

Challenges:
-MULTIPLE DATABASES: indexing for journals that are abstracted in more than one.
-SELECTIVE INDEXING: for db providers that index unevenly, some content from some journals, everything from others.
-TIMING AND UPDATING: how often do you go back and refresh?
-SFX KB LIMITATIONS: the KB doesn't contain everything, and contains minimal data; enough to create a feed, but not sure that it's enough to find the journal to begin with. It has title info, etc., but not necessarily subject data to enhance discovery of journal titles.
-REQUIRES A SECURE LOGIN: these are subscription databases, and the service sits as an intermediary between user and subscription content. A lot of RSS clients are not set up to handle authentication, so they can't get in.
-BIGGEST CHALLENGE is faculty use of/knowledge of RSS. The RSS client solves a huge problem (it alleviates the need to write an email disseminator, for example), but getting faculty to start using an RSS client, and one that handles the authentication, is a struggle.

Future developments:
First beta release in October; will promote to select faculty. Also exploring topical feeds, which actually use metasearch. Shibboleth and authentication to other CSU campuses.

————-
Raymond Yee
UC Berkeley

Live demo of Scholar’s Box at Berkeley: not possible due to no wireless access in Conv. Center

Problem to solve: giving scholars seamless access to research material: any type of content from any research source; package it up and go.

Attempting to demonstrate whether or not solving this problem is useful and, if it is, how difficult it would be.

Scholar’s Box is a desktop app written in Python.

First presented with a search screen and an option to select repositories to search, and terms to enter.

Once search results are returned, you can drag and drop them into your own collection. Copying and pasting on the web traditionally results in a loss of the metadata. The goal of this project is to make it easier to gather the materials along with the metadata … attribution, etc.

Showing results of a MetaLib search/integration. Hope to provide end users with easy access to the scholarly research.

Have hooked up to Flickr and can mine content from the Flickr site. Raymond is very interested in the idea of personal digital repositories, so he is using Flickr to build one and has about 11,000 images in Flickr at this point. Uses the XML interface to Flickr to mine and remix data.
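
[Not Raymond's code: a small sketch of what mining Flickr's XML interface looks like against the public REST API. The api_key and user_id values are placeholders, and the method/parameter names reflect the current public API, not necessarily what Scholar's Box used in 2005.]

import requests
import xml.etree.ElementTree as ET

FLICKR_REST = "https://api.flickr.com/services/rest/"

def flickr_photo_titles(api_key, user_id, per_page=10):
    """Fetch one page of a user's public photos via the REST API and return their titles."""
    resp = requests.get(FLICKR_REST, params={
        "method": "flickr.people.getPublicPhotos",   # a documented Flickr API method
        "api_key": api_key,
        "user_id": user_id,
        "per_page": per_page,
    })
    resp.raise_for_status()
    root = ET.fromstring(resp.content)                # default response format is XML
    return [photo.get("title") for photo in root.iter("photo")]

# print(flickr_photo_titles("YOUR_API_KEY", "12345678@N00"))   # placeholder credentials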

The Scholars Box can then export the data in any number of different packages: an OpenOffice presentation, etc.

——–
Q&A

Question about the upgrade process: how do you keep up with release notes and make enhancements so that you don't have an application that's stuck, not taking advantage of new functionality?

Answer from Mike: they have a number of checks they run to look for changes in the interface, running some sample queries to see what changes there have been (# of databases, # of results returned, etc.).

Comment/question from the ExLibris guy: goal is to make it possible for the institutions to do what they want, integrate with own services, etc. This will be an idea that is expanded with other products, including DigiTool, etc.

Comment/Question: has been watching metasearch evolve over years, at first wondered if this would take off as an idea at libraries. Now believes that it will and that it will be very important, equal in importance to catalog. Question about whether it will be easy for other schools to gather and adapt this code, given the existing difficulties, even for large schools, to comment and share code. David: yes, and will use SourceForge to share, will make sure it’s documented. Raymond: not sure; ScholarsBox is still a thought piece, not sure how useful it will be to others. Mike: for Common Framework, thinking of Open Sourcing it but it’s pretty large. XTF full text index/search tool has been pushed to SourceForge already. Thinking about breaking off other pieces and sharing them.

MODS, MARC, and Metadata Interoperability PART 2

Speaker 4 (first speaker of second half): Ann Caldwell, Brown University

Overview of digital initiatives @ Brown. The CDI was created in Oct. '01; the metadata specialist position (Ann's) was created in October 2002.

Brown metadata model: Ann’s position includes all metadata, not just descriptive. Using METS to package, chose MODS over DC. Their model enables both shallow and deep discovery. For example: an art image can be searched in native VRA format in Luna, but in central repository as MODS. Everything has a MODS record.

Early projects. Were at first only dealing with library materials – sheet music, etc. Used existing MARC-MODS tools. Still have no metadata creation staff but got interns from the Univ. of R.I. library school. 150 hours/sem = 3 credit hours. Many students are on second careers and very focused on their work. Began using NoteTab Pro, which they had also been using for EAD creation.

Broadening the base. Word got around campus very quickly that this was going on. Faculty and other groups began coming in with very creative projects, some hybrid of own materials and library materials.

CDI dropped their current work to help faculty. The Scholarly Technology Group (part of IT) was contacted to be sure the CDI was not duplicating their efforts. It wasn't; STG wasn't doing any metadata work to speak of.

How to build MODS records: some from MARC, some from scratch, some extracted from other dbs (FMPro) and converted to XML.

NoteTab Pro: cheap. Downloaded the EAD "clip library" and modified it. Very flexible. All MODS and METS records are built in NTP. Programmed it to prompt the user through a series of templates. Constantly making changes to this.

VRA & EAD records are mapped to MODS and transformed with XSLT. VRA records are exported from the image cataloging system (FMPro-based). Not all elements are retained from VRA -> MODS ("subjects in VRA get a little squishy"). EAD component-level content is captured and converted to MODS on a 1-to-1 basis.
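
[Not Brown's actual scripts: a rough sketch of that transform step, applying an XSLT crosswalk with lxml in Python. The stylesheet and input filenames are placeholders, e.g. one of the Library of Congress MARCXML-to-MODS stylesheets or a locally written VRA-to-MODS sheet.]

from lxml import etree

def crosswalk(source_xml_path, xslt_path, output_path):
    """Apply an XSLT crosswalk (e.g. VRA or MARCXML to MODS) and write the result."""
    source = etree.parse(source_xml_path)
    transform = etree.XSLT(etree.parse(xslt_path))
    mods = transform(source)
    with open(output_path, "wb") as out:
        out.write(etree.tostring(mods, pretty_print=True,
                                  xml_declaration=True, encoding="UTF-8"))

# Placeholder filenames:
crosswalk("record_vra.xml", "vra2mods.xsl", "record_mods.xml")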

What’s in the records? Have established a bare minimum, every MODS record validated against stylesheet for minimum and also certain local requirements. Don’t have subject analysis on all records.

Storage and display: records mapped into PHP/MySQL (homegrown). All mapped into relational tables to enable the cross-collection searching. Records retrieved through search are displayed with stylesheets.

Ann had several examples of table displays and a schematic diagram of the system [see her ppt.] She demo’ed searching the Brown repository.

Current status. July ’04 added 1.66 professional position and some additional paraprofessional staff (didn’t catch the number). Still no additional staff for the metadata component. 20+ active projects now. Have started to work with audio and video. Audio hasn’t been a problem but video serving is still being addressed at the university level.

There are now some main Technical Services staff generating MODS.

Future directions: NoteTab works OK for some but some users (particularly outside the library) really want a web interface. The scientific/medical communities at Brown are very interested in adding content but don’t have time for description. Looking at TEI this summer; the STG group have had great success training students to do TEI encoding. Looking at the overall staffing, looking for efficiency opportunities. Digital backlog is now larger than the analog backlog.
Brown digital library site.

DEMO: NoteTab Pro. Showed MODS tools (building MODS through prompts), using NTB to create METS record and package. THIS WAS A VERY COOL DEMO! Can’t really do it justice here in the notes.

Speaker 5 (second speaker from second half): Terry Reese, Oregon State University

Terry is the Digital Production Unit Head @ OSU and was named a 2005 “mover and shaker” by Library Journal. Terry has a software dev. background.

Started by giving some background on metadata interoperability and metadata tools: proliferation of metadata schemes; differences in best practices are also a source of some problems. Cited the Indiana study that showed metadata creation costs of about $3/book for copy cataloging, $27/book for original cataloging, $20/thesis.

In some cases, things are being cataloged more than once: things that go into DSpace or ContentDM. Now, they only create one record and derive/repurpose for other uses.

Challenges of interoperability: one-to-many, many-to-one transformations (this is the problem of going from less to greater semantic richness, or vice versa, same problem Moen touched on in his talk). Other problems include different hierarchies and “spare parts” – leftover content that doesn’t fit anywhere. It may be better to discard than to try to make non-fitting data fit?

The MARCEdit crosswalking tool uses MARCXML as the control schema to facilitate transformations. Due to the nature of its design (network, or star), no more than two transformations will take place (it looks like a wheel).

DEMO of MARCEdit. Transformed an EAD record to MARC. It also has a MARC editor for people who aren’t comfortable editing MODS directly.

Also has an OAI harvester built in to grab OAI records and transform them into MARC. They use it at OSU to grab DSpace records and input into the library catalog.

This was a great Demo and there is a lot more to Terry’s presentation than I’m reflecting in these notes. His PPT will have more detail. It was a very impressive tool and a wonderful way to end this long session; it gave me the sense that non-programmers could get their hands on some tools and actually do some transforming. A great way to become familiar with these various schema. See Terry’s site for links to MARCedit and other goodies.

MODS, MARC, and Metadata Interoperability PART 1

MARC Formats Interest Group (LITA/ALCTS)
Monday, June 27, 1:30 pm – 5:30 pm

Description from LITA site: Libraries face challenges in integrating descriptive metadata for electronic resources with traditional cataloging data. This program will address the repurposing of MARC data and metadata interoperability in a broader context. It will then introduce the Library of Congress’ Metadata Object Description Schema (MODS) and present specific project applications of MODS. Finally, the program will offer scenarios for coordinating MARC and non-MARC metadata processes in an integrated metadata management design and introduce tools for simplifying interoperability.
Speakers: Dr. William Moen, University of North Texas SLIS; Rebecca Guenther, Library of Congress; Ann Caldwell, Brown University; Marty Kurth, Cornell University; Terry Reese, Oregon State University

This was an extremely dense but immensely useful session; PowerPoint presentations will be available online at the ALCTS site some time soon (as of June 28 they are not yet linked).

Speaker 1: William Moen, Texas Center for Digital Knowledge, University of North Texas
Summary from Claire: Moen put into very succinct and very clear language the reasons why we (librarians but more specifically catalogers) have to begin to know standards other than our own.

Speaking on metadata interaction, integration and interoperability

Problem statement … is there a problem? We used to think of interoperability as a systems problem; we now understand that there are different levels to the problem. There are many metadata schema, some well-documented and well-known (AACR2), others less so. Ditto for content standards. There are also a variety of syntaxes (MARC and XML, for example). Lorcan Dempsey calls this our “vital and diverse metadata ecology.” We don’t really have a problem UNLESS we expect these various standards to interact, which of course we do.

So we are moving from a systems-oriented definition of interoperability to a user-oriented definition. Moen suggests a preliminary framework to help scope the work. Look at communities of practice: who is our community? Libraries, archives and museums are fairly tightly-knit communities with a good understanding of standards. As we try to cross into other communities, however, the costs of interoperability go up.

Communities of practice, two types:
-Networks of professionals (librarians, etc.) have similar language and shared meanings
-Information communities are looser organizations, and include the creators of information, managers of information (librarians/catalogers), and users.

Godfrey Rust (complete citation for this and other references will be in Moen’s ppt preso when it goes online) divides things into: PEOPLE, STUFF and AGREEMENTS.

Interoperability cost vs. functionality. William Arms’ curve of cost v functionality (graph & cite in ppt). OAI harvesting, for example, has lightweight requirements, so it is easy to implement but less functional. Federated searching/Z39.50 is highly functional but more costly to implement.

The library has developed very sophisticated structures over time. In the larger scheme of things, over time, probably these structures will not be as broadly adopted. The time is now: this is our opportunity to act if we want to try to see our standards adopted more broadly.

There probably will never be ONE canonical metadata scheme BUT we may all be able to agree on XML, which is a great step forwards. Some apparently simple schemes like Dublin Core turn out not to be so easy to implement in actual practice. We do not want to be further marginalized, we want to (have to) learn to play with others and have to get over the “not invented here” syndrome.

Mechanisms to address interoperability (with the fundamental assumption that there will NOT be one basic standard):
-Crosswalks/mapping
-Application profiles
-Registries
-RDF

Crosswalks and mapping. Mapping is the intellectual process of analyzing the standards and making matches. The crosswalk is the instantiation of the map. 1998 NISO white paper on crosswalks. This activity is successful when accomplished by someone who really knows the standards on both ends of the map: catalog librarians who know AACR2 will be responsible for becoming knowledgeable about other standards so that they can lead the mapping/crosswalking activity.

Difficult decisions to be made while mapping include: should it be one-way only or reversible? Reversible/round-trip: MARCXML <-> MARC. MARC -> MODS, however, is not round-trip; there is some loss of data, albeit perhaps slight. So is the mapping one-to-one, one-to-many, many-to-one, etc.? Other difficulties include vocabularies: how to go from controlled to uncontrolled? For example, how does one indicate in Dublin Core that the subject is an LC heading?
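
[My example, not Moen's: a small sketch of the round-trip point using the pymarc library. Binary MARC converts to MARCXML and back without loss, which is what makes MARCXML a safe intermediate step, whereas a MARC-to-MODS mapping is one-way in practice. The filename is a placeholder.]

import io
from pymarc import MARCReader, parse_xml_to_array, record_to_xml

with open("records.mrc", "rb") as fh:              # placeholder filename
    original = next(iter(MARCReader(fh)))          # read the first binary MARC record

xml_data = record_to_xml(original)                 # MARC -> MARCXML serialization
if isinstance(xml_data, str):                      # return type varies by pymarc version
    xml_data = xml_data.encode("utf-8")
round_tripped = parse_xml_to_array(io.BytesIO(xml_data))[0]   # MARCXML -> MARC again

# The round trip preserves the record; a field-by-field comparison should match.
print(original["245"])
print(round_tripped["245"])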

Mapping to an interoperable core. OCLC is working on this problem, trying to come up with something rich enough to act as a core: all things map to the core and then out again to other forms. They’ve been looking at MARC as the possible basis [note: see Terry Reese’s presentation on MarcEdit; he was the last speaker in this program]

Application profiles: same elements used in different ways, and with different meanings. These uses can refine the standard definition of the element as long as the fundamental meaning is unchanged.

Registries are necessary for application profiles to be successful. Ex: UK Schemas, EU Cores, others (see ppt)

RDF is the foundation of the semantic web and is a grammar for expressing terms and semantics. Moen admits his difficulty with RDF: it is important, but he struggles to explain it.

Conclusions: Libraries are just ONE of the communities; we do not have a central role, but we may have a privileged role thanks to our long experience. Some librarians continue to think that cataloging is different from metadata generation. We have to think about interacting with other communities. The challenge is to develop tools to hide the differences between formats (hide them from users of our systems). See Roy Tennant's recent article about transparency. Moen demo'ed an SRW search on LOC which can show the data in MODS format or in XML, or in DC, etc. This is a good example of transparency: give the data to the user in a format that they can use.

Speaker 2: Marty Kurth, Cornell University Metadata Services

Provides services to faculty and others on campus. Interested in repurposing the library’s MARC. Metadata management design. What does all of this metadata mean for our shops and how do we set up systems and services that support interoperability over time? His preso is based on an article for Library HiTech that he co-authored 2004 (22:2).

Explains what is meant by ‘repurposing MARC data:’ being able to reuse MARC outside of the library catalog. Example collections: Making of America (MOA), Historical Math monographs, HEARTH home ec. collection, May anti-slavery, Historical literature of agriculture. All 5 of these dl projects had print counterparts and thus MARC to build on.

Metadata processing involves: mapping, defining relationships between schemas; transformation, the process of moving between schemes; and management, coordinating the tasks and the resources.

Metadata management challenges: workflows are not yet well established. Mapping and transformation is not happening all in one place, it is happening all over the library and may not be well documented, or if it is, the documentation may be scattered. Goal was to move from projects to process.

Why is repurposing MARC a logical place to begin? Firstly, we’ve got lots of it. Allows them to maximize the potential of the data. MARC mapping can be expensive; cost goes down as tools are developed. Typically this work is done by specialized staff for whom opportunity costs are expensive. It can be messy and difficult, it probably will generate multiple versions of data and records, etc. Thus, a good challenge.

Collection-specific mapping variations are inevitable. MOA, May, HEARTH all involve TEI. Handling of date transformation between MARC and TEI, for example, varied between the MOA and the May collections. The mapping was further complicated because each project was delivered with a different platform (DLXS, EnCompass, and DPub). Each project had slightly different needs. Work was performed in different areas of the library.

MARC mapping models. How to deal with the collection specificity? Looked at LOC MARC-> DC, but made local decisions on additional fields. Sought feedback on this library-wide.

Managing transformations. Transformations also vary from collection to collection. Some were performed by vendors. Scripting and XSLT transformations were later implemented. The library catalog is still the database of record. The scripted approach to transformation extracts the MARC, transforms it into XML, and combines it with other data including admin and technical md, OCR'ed text, etc. The XSLT approach involved writing transformations to accommodate the possible entirety of any MARC record; the metadata staff then customize the XSLT for their particular collections. It is easier to tweak and modify as the project unfolds. Documentation is critical and had been lacking in the past. It is a key component in management of metadata over time.

Metadata management: coordinating the intellectual work AND managing the tools and files that are products. The tools and process are resources to be managed. Important to know the user community for these tools and their needs for using and accessing them.

Strategies: inventory existing relationships and processes (this is not something Cornell has specifically done). Identify the staff who will be responsible over time and who will mentor. Requires strategic buy-in. Important to communicate the importance of this more than once. [Marty’s ppt. here gives a useful example of such an inventory]

Concrete next steps: how do we build a culture to embrace this? Develop reusable transformation tools. Build library consensus on mapping. Create a culture and a practice of sharing and revising. External stakeholder discussions, library-wide. Talk about the risks of NOT managing tools. Think about creating a repository for metadata management tools that is searchable.

Speaker 3: Rebecca Guenther, Library of Congress

Rich descriptive metadata in XML: MODS. Overview: background on MARC & XML, MODS intro, MODS’s relationship to other schemes.

MARC and XML. We have large investments in MARC. Cataloging is an early form of metadata. Trying to retool to exploit flexibility of XML. Also trying to anticipate receiving metadata in other formats in XML or as part of a digital object.

Evolution of MARC21. Until now, MARC has been both a syntax and an element set. In the current environment, XML is being used more and more, and more tools are available. Diagram showing the transformation from MARC21 to XML. First transform to MARCXML in order to be able to do other things (validation, etc.).

MARC 21 in XML. MARCXML is lossless and capable of round-trip to MARC. Once it is in XML, we can then use stylesheets/XSLT to present in different environments/interfaces.

MODS is a derivative of MARC. It uses XML Schema. It was initially thought of for library applications, but they are seeing other uses and implementations.

Why bother? There is an emerging initiative to reuse metadata in XML: SRU/SRW, METS, OAI, etc. Looking for something richer than Dublin Core. Before MODS, not much in between MARC and Dublin Core. MODS is a core element set for convergence between MARC and non-MARC XML.

Advantages of MODS: it is compatible with existing library database descriptions. Richer than d.c., simpler than MARC, partly because the language is more readable than numerical tags. The hierarchical structure more readily supports rich description of complex objects.

Features of MODS. Uses language-based tags which share definitions with MARC. Description is rule agnostic. Elements are reusable and not limited as to number of sub-elements. For example, the name tag can be used throughout the record, in author fields but also as part of related item-subject. Redundant elements can be repackaged more efficiently. [Rebecca's ppt will be useful here to clarify these points]

Status of MODS. Started a MODS listserv in 2002. 3.0 has been stable for about a year. 3.1 is coming out soon; it doesn't change anything in 3.0 but has been reordered to be compatible with MADS (MARC Authority). Registered with NISO.

Relationship to other schemas. General-purpose and compatible with MARC. Broader than many other formats (EAD, ONIX, etc.). Difference between MODS and Dublin Core: MODS has structure, DC is flat. Can more precisely modify/qualify fields in MODS; for example, publication info can be related to date in MODS, but can't in DC. MODS is more compatible with library data. MODS can include record management information.

MARCXML vs MODS. Demo'ed music records in MARC, MARCXML, MODS. May not be exactly the same specificity when converting from MARC to MODS, but most of the record converts.

LC uses of MODS. Using to describe electronic resources (AV project, web archiving). METS. SRU/SRW implementation offers records in MODS (this is one of the available choices).

MINERVA web archiving project. Exploring born-digital materials. Used MODS natively (vs. creating as MARC and then converting to MODS); perhaps will some day put into the library catalog, but perhaps not. For web archiving, created 1 collection-level record, with individual MODS records for each object.

Election 2002 web archiving: webarchivist.org cataloged the data, creating MODS records for each site, some of which were captured more than once. Other web archiving projects, yet to be cataloged: 9/11, 107th Congress, 2004 election.

Demo’ed 2002 election archive. Used XSLT to transform MODS to HTML. Link to the archived site. Showing MODS in XML – date captured data includes start and end points for capture. Decided not to link to the live site, which in many cases disappeared almost immediately after the election anyhow.

107th congress website archiving. Did in-house (MODS cataloging at LC). Used XMLSPY to catalog. Built own search and browse. Browse has drop-down menus to select the house or senate ctte.

Iraq war. Now have an input form for the catalogers to use as they catalog w/drop-down menus, etc.

I Hear America Singing project. METS + FEDORA w/MODS. METS packages all metadata and all digital objects, including sounds, CD covers and other images, etc.

Other MODS projects. MusicAustralia and Screen Sound Australia are using MODS as an exchange format.

Directions for MODS. Continue to explore interactions with METS. Continue to use for digital library projects @ LC. Richer linking capabilities than MARC. Website archiving. Looking at MODS tools; looking at using it with OAI as an alternative to DC.

Q&A for the first three speakers
Q. When will MODS 3.1 be out?
R.G. Had hoped last week, but within the next few weeks. 4.0 will be a complete rewrite and is in the works but will take more time, require broader discussion, etc.

Q. As Cornell attempts to shift from a projects-oriented approach to a program-oriented approach, what will happen with the collection-specific approach, and have they talked about using MODS?
M.K. Talk about it all the time but there is some political drag to this idea.

Q. About LC web archiving; are any of the keywords or other data automatically extracted from web sites as they are archived/cataloged?
R.G. Yes, worked with their IT folks who extracted from the HTML. For the Milstein project (music project from I Hear America Singing) the metadata was all manually created, not extracted.

Q. Will MINERVA records go into library catalog?
R.G. Initially, thought the ILS was where all the records had to go, but with the emergence of federated search, they are no longer thinking this is the case.

Q. MARC records are dynamic and maintenance is possible (update an authority record, all records linking to it are updated)
M.K. Still consider library catalog to be the catalog of record. Haven’t established periodicity for refresh but it is possible to do this, built in to their design.

END OF PART ONE

Greenstone Digital Libraries: Installation to Production

Sunday, June 26th, 10:30am – 12:00pm
Session descr. from the LITA site: Greenstone digital library software is a comprehensive, multilingual open-source system for constructing, presenting, and maintaining digital collections. Greenstone developer Ian H. Witten will introduce Greenstone and demonstrate installation and collection building. Washington Research Library Consortium and University of Chicago Library representatives will discuss Greenstone implementations at their organizations, including software requirements and selection, collection and interface customization and use of METS-encoded metadata. Laura Sheble will present results from the 2004 Greenstone User Survey.

[Note from Claire: sorry, everyone, my laptop died so I don’t have complete notes on this session; hopefully my co-blogger has a more complete record]

Speaker 1: Ian Witten, University of Waikato, developers of the Greenstone library system

Goals of Greenstone have been:
-to be able to present collections of digital material and to support custom presentation of these colls.
-large scale support, up to several Gb text
-support associated/linked images, movies, etc.
-serve on web or publish to CD
-run anywhere, on any platform, and with support for many languages
-non-exclusive as to format
-non-prescriptive as to metadata, etc.

Easy to install, supports full text or fielded search. Extensible.

FACTS
-Open source (SourceForge)
-5,000 copies downloaded each year
-supports 38 languages
-Supported by some important international agencies; UNESCO distributes and provides Greenstone training

Ian did a demo of the Greenstone system (I believe he said he was showing version 2):
Running the librarian interface, demoed creation of a new collection with these main steps
“Gather” – drag/drop images and other Beatles miscellany into a collection window. Greenstone detects mime types, prompts to install plugins for mime types not previously encountered (MP3 and MARC)

“Enrich” – optional step to add metadata, which Ian skipped for demo purposes

“Design” to create indexes. Uses any available extracted metadata if metadata not explicitly provided in the Enrich step (titles from MP3 and HTML files, etc.)

“Build” to build the collection

Demo’ed a search for “love” in full text & title. Shows thumbnails of images, which it creates as the image files are imported in the “Gather” phase.

Building a more sophisticated collection of Beatles miscellany took about 1.5; this involved adding a MIDI plugin, adding metadata for the objects, adding DC classifiers, and adding a browse-by-media-type function.

Greenstone 3 is a complete rewrite and is in the works; can be downloaded in beta form now but not recommended for production use. V2 is still the supported/recommended product. Changes coming in 3: generates XML rather than HTML, METS is the foundation and underlying collection format, JAVA-based and uses SOAP.

Speaker 2: Alison Zhang, Washington Research Library Consortium

WRLC is 8 academic libraries in the DC area

In 2002, received an IMLS grant to provide dig. collections in a consortial environment

Needed power and flexibility from a digital library delivery system. Features sought:
User interface: good browse, powerful search, customizable, collection-based indexing and labeling, linkable digital objects & metadata, multipage object display (books or other complex text objects), support for multiple formats (MD?), support for standard schema, federated search
Staff interface: ease of use, support for Dublin Core, support master and derivative vers. of objects, templates, direct view of digital objects, allow search edit and delete of records, support global changes/updates, local authority control.

None of the software evaluated met all requirements, so decided to customize two open source packages: DCDot for metadata creation and Greenstone for display/user int. Neither supports federated search or multipage object view.

Most of staff interface is DCDot-based, customized. Created own multipage viewer.

Example collections, for which customized HTML templates were built (17 dig. collections built since 2002 using Greenstone): Art images Collection, Finding Aids collection (EAD-based, first Greenstone customer to do this).

Delved a bit into the details of how to customize Greenstone, referred us to the doc. she wrote which is linked to from the Greenstone site: “Customizing the Greenstone User Interface”

Customizing DCDot – most customization involved Perl. Created templates, implemented a drop-down authority list that updates dynamically as additions are made.

Created own collection management system to tie everything together and are in the process of replacing DCDot with another management interface, possibly DSpace.


Speaker 3: Tod Olson, University of Chicago Library

Chopin Scores project: over 400 scores from Chopin’s early period.

Tabbed user interface display, choose to view bibliographic desc. or the document itself, which has a multipage browse feature.

Built this project on AACR2 MARC from library catalog. Preservation scans and structural metadata were input into a relational database. MARC was transformed to MODS, which were then combined with images and structural md to create a METS record. METS transformed via XSLT into the Greenstone structure. Tod explained in some detail which bits of the METS structmap, etc. were mapped to the Greenstone format.

Features of Greenstone3 that U of C looks forward to: support for Lucene or MG/MGPP (Greenstone internal indexing component), METS as internal structure, MySQL support, XML/XSLT for presentation, continued support for existing Greenstone2 data.

Proof-of-concept Music Information Retrieval (MIR) component:
Scores in the collection are matched to existing MIDI examples. Pitch intervals are encoded as text, which is added to the document metadata.

User can input a tune into a keyboard. This MIDI file is similarly encoded as text, then a search looks for matches in the document metadata. It actually works!
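
[A toy sketch of the pitch-interval idea, mine rather than the U of C implementation: encode a melody's successive pitch intervals as a text string, do the same for the keyed-in MIDI query, and then any plain-text search can match them regardless of what key the user played in.]

def intervals_as_text(midi_notes):
    """Encode successive pitch intervals (in semitones) as a space-separated text token string."""
    intervals = [b - a for a, b in zip(midi_notes, midi_notes[1:])]
    # Prefix with 'u'/'d'/'r' (up/down/repeat) so tokens are search-friendly text
    return " ".join(
        ("u" if i > 0 else "d" if i < 0 else "r") + str(abs(i)) for i in intervals
    )

# A scale passage from the score vs. a query the user played on a keyboard:
score_melody = [60, 62, 64, 65, 67, 65, 64, 62]   # MIDI note numbers derived from the score
user_query   = [67, 69, 71, 72, 74]               # what the user played (different key)

score_text = intervals_as_text(score_melody)      # this string is stored in the document metadata
query_text = intervals_as_text(user_query)
print(score_text)
print(query_text, "->", "match" if query_text in score_text else "no match")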

Chopin Early Editions

Speaker 4: Laura Sheble, Wayne State
Greenstone User survey

Created a user survey to get feedback on Greenstone support mechanisms

[Session notes cut off here, sorry – Claire] [no problem, great post! — kgs]