Custom metasearch services using an XML API

Concurrent session 3
LITA National Forum 2005
Saturday, October 1, 2005, 10:50am

Roy Tennant, CDL

Breaking out of the box means using an API to create your own interface; this allows deeper integration of other types of activities (querying a dictionary service to check spelling, for example).

Why do it? Greater interface flexibility; can do things the vendor doesn’t support. Customizations to the interface can remain unchanged when new versions of the application are released.

CDL’s work, and the talk today, is about the MetaLib X-Server, but there are other products that offer similar functionality.

CDL’s vision is of many search portals. No one-stop shopping. Many services for many different audiences. This is often very problematic with a vendor product out of the box. Particularly difficult because the tool out of the box ships with a lot of little fragments of XML, etc. that make up the interface, so making even minor changes requires a lot of hunting and correcting.

Also, the code was crap. Very substandard.

At CDL, metasearching is not just about searching databases, but about searching other things, including OAI-harvested data, crawled earth sciences websites, RSS feeds of new articles and resources, etc.

Showed a wireframe of a metasearch app. Showing examples of image search, etc.

Diagram of applications … see PowerPoint from this preso.

Michael McKenna CDL: Using SOAP and XML to access Metalib

Use the HTML interface from MetaLib as little as possible.

Currently, X-Server supports basic services: an XML interface (basic request/response, modeled on web services), a core interface to MetaLib for querying databases, and data presentation to other apps.

Basic services:
-Login (this is an application connection)
-User authentication (different from login, which is an application accessing rather than a person)
-Retrieve resources
-Search resources
-Retrieve search status report
-Combine results sets
-Retrieve search results
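
The services above follow a basic request/response pattern over XML. A minimal sketch of what one call might look like from the application side. Everything specific here is an assumption for illustration: the endpoint URL, the `op` parameter, the `search_resources` operation name and the shape of the status response are invented, since the notes do not record the real X-Server names or XML.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint; the real X-Server URL is not given in these notes.
BASE_URL = "https://metasearch.example.edu/X"


def build_request(operation, **params):
    """Build a GET URL for one service call (operation names are placeholders)."""
    query = {"op": operation}
    query.update(params)
    return BASE_URL + "?" + urlencode(query)


def parse_search_status(xml_text):
    """Pull per-database hit counts out of an invented status response."""
    root = ET.fromstring(xml_text)
    return {
        db.get("name"): int(db.findtext("hits", default="0"))
        for db in root.iter("database")
    }


# Invented sample response, only to show the parsing step.
SAMPLE_STATUS = """
<search_status>
  <database name="db_one"><hits>12</hits></database>
  <database name="db_two"><hits>0</hits></database>
</search_status>
"""

url = build_request("search_resources", session="abc123", query="water quality")
status = parse_search_status(SAMPLE_STATUS)
```

The point is only that the application builds a request, gets XML back, and decides for itself how to present it.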

MetaLib’s two interfaces: /v (vanilla) and /x (X-Server):
-When there is an upgrade, there is a certain amount of fixing involved. With a /v upgrade, 196 files changed; 2,354 files changed across all campuses. That is not just the 10 campuses, but multiple departments and libraries within those. The upgrade script for /v saves and marks changed files, but you would have to “diff” them to find out what those changes are.
-With /x, never ever have to change the UI. May require mods to the Common Framework (CF) layer (more on this bit later in preso).
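
To make the “diff” step for /v concrete, here is a small sketch using Python’s standard `difflib`. The two fragments are invented stand-ins for one of the many interface files; the function name `changed_lines` is hypothetical and nothing here is CDL’s actual tooling.

```python
import difflib


def changed_lines(saved_text, upgraded_text, name="fragment"):
    """Return a unified diff between a saved local copy and the upgraded file."""
    diff = difflib.unified_diff(
        saved_text.splitlines(),
        upgraded_text.splitlines(),
        fromfile=name + ".saved",
        tofile=name + ".new",
        lineterm="",
    )
    return list(diff)


# Invented interface fragments, standing in for one of the many /v files.
saved = "<td>Search</td>\n<td>Our campus logo</td>"
upgraded = "<td>Search</td>\n<td>Vendor default logo</td>"

for line in changed_lines(saved, upgraded):
    print(line)
```

Multiply this by every changed file on every campus and the appeal of never touching the vendor UI becomes clear.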

Development Methodology
Over 60 people on staff (more than 20 are developers) who are working on this. Potentially anyone could make a change and it would affect all. Like to use source control to track this.
-With /v, cannot use CVS easily. Had to keep very close tabs on the changes.
-With /x, much easier to test and prototype, and to sync with the check-in server without making any changes to the back end.

For example, to customize simple search, have to change the following:
-/v many fragments of docs to be updated. MANY MANY fragments
-/x interface will go to jsp pages; edit one .jsp file.

Common Framework
Integrated interface for CDL managed info services. Packaged for internal use by the various campuses.

See ppt for the diagrams of the
-Application layer
-Manager layer
-Service layer
-Client layer
-User interface

CF integration
Used existing modules as templates and modified them to act as metasearch interfaces. The result sits alongside other applications at every layer (see list, above): a complete MetaLib search “slice”.

Logic Flow:
-Authorize site (once person has logged in to a site, the site tells other apps about what the user has access to, so they can just authorize the site)

Michael took us on a detailed walk through the layers, using the diagram of the system layers and slices.

Issues, which they are working with the vendor on:
-Limited buffer sizes. Being fixed by ExLibris
-Limited number of databases that can be searched. Need to balance speed vs. coverage
-Issues with response time and timeouts
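
The speed vs. coverage trade-off can be sketched as a parallel search with a deadline: ask every target at once, keep whatever answers in time, and report the rest as timed out. This is a generic illustration, not how MetaLib or the Common Framework handles it; the `fast_db` and `slow_db` functions are stand-ins for remote databases, and the 0.2 second deadline is arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def search_all(searchers, query, timeout_seconds):
    """Query every target in parallel; keep whatever answers before the deadline."""
    executor = ThreadPoolExecutor(max_workers=max(1, len(searchers)))
    futures = {executor.submit(fn, query): name for name, fn in searchers.items()}
    done, not_done = wait(futures, timeout=timeout_seconds)
    # Do not block on stragglers; they finish in the background and are ignored.
    executor.shutdown(wait=False)
    results = {futures[f]: f.result() for f in done if f.exception() is None}
    timed_out = sorted(futures[f] for f in not_done)
    return results, timed_out


# Stand-in targets: one fast, one slow. Real targets would be remote databases.
def fast_db(query):
    time.sleep(0.01)
    return [query + " result A"]


def slow_db(query):
    time.sleep(0.5)
    return [query + " result B"]


results, timed_out = search_all({"fast": fast_db, "slow": slow_db}, "tides", 0.2)
```

A shorter deadline gives a faster answer from fewer databases; a longer one gives better coverage at the cost of response time.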

Next steps:
-Finish service implementation
-Usability studies
-Write client layer
-Write UI
-Beta rollout will be fall/winter 2005

Other thoughts about future
What happens when MetaLib releases entirely new Web Services? Or if ExLibris fails?
In either case, make some changes at the manager layer.
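
The idea that a vendor change only touches one middle layer can be sketched as follows. The class names are hypothetical and the backends return canned strings; this shows the general layering technique, not the Common Framework’s actual code.

```python
from abc import ABC, abstractmethod


class SearchBackend(ABC):
    """What the upper layers are allowed to know about any metasearch engine."""

    @abstractmethod
    def search(self, query):
        """Return a list of record titles for the query."""


class CurrentVendorBackend(SearchBackend):
    def search(self, query):
        # A real version would call the vendor's XML API here.
        return ["vendor hit for " + query]


class ReplacementBackend(SearchBackend):
    def search(self, query):
        # A different engine, or the same vendor's new web services.
        return ["replacement hit for " + query]


class SearchManager:
    """Stand-in for the middle layer: the only place that picks a backend."""

    def __init__(self, backend):
        self.backend = backend

    def run(self, query):
        return self.backend.search(query)


def render_results(manager, query):
    """Stand-in for the UI: it talks to the manager and never to the vendor."""
    return "\n".join(manager.run(query))
```

Swapping `CurrentVendorBackend` for `ReplacementBackend` changes one constructor argument; `render_results` is untouched.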

David Walker
Web dev. librarian @ CSU San Marcos

Actual preso and the handout are radically different: is now talking about RSS Creator

The problem with Journals & RSS
-RSS ideal for journals? Timely content, high interest among faculty. A TOC alerting service, whether email based or other, is a traditional type of service.
-Problem is that few aggregators or publishers offer these. If they do, they still have to be discovered, collected and maintained. If they do exist, they probably link back to the publisher’s site.

Showed an example of a feed from the Journal of Toxicology, a link back to the publisher’s page. This doesn’t take into account whether or not the library subscribes to or otherwise has access to the ejournal already.

RSS Creator
-TOC alerting service: wanted to create this but there was no budget and no staff time to maintain it. Focus is on undergrad activity @ San Marcos; this is probably of more interest to faculty.

Diagram of network model: SFX, Databases, Metalib

In SFX, there is a data export tool. Could export all of the journal subscriptions, which db’s have full text, etc.

This knowledge base is exported. Then, the RSS creator ingests the file, breaks it apart, makes a db of the holdings.
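
A rough sketch of the ingest step, assuming the export is a tab-delimited file with `title`, `issn` and `db_name` columns. That layout, the column names and the sample rows are all invented; the notes do not describe the real SFX export format.

```python
import csv
import io
import sqlite3

# Invented export sample; the real SFX export layout is not described in the notes.
EXPORT = (
    "title\tissn\tdb_name\n"
    "Journal of Examples\t1234-5678\tDatabase One\n"
    "Journal of Examples\t1234-5678\tDatabase Two\n"
    "Sample Studies\t8765-4321\tDatabase One\n"
)


def load_holdings(export_text):
    """Break the export apart and store one row per (journal, database) pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE holdings (title TEXT, issn TEXT, db_name TEXT)")
    reader = csv.DictReader(io.StringIO(export_text), delimiter="\t")
    conn.executemany(
        "INSERT INTO holdings VALUES (:title, :issn, :db_name)", reader
    )
    conn.commit()
    return conn


def databases_for(conn, issn):
    """List the databases that carry a given journal."""
    rows = conn.execute(
        "SELECT db_name FROM holdings WHERE issn = ? ORDER BY db_name", (issn,)
    )
    return [row[0] for row in rows]
```

With the holdings in a table, the feed generator can look up which database or databases to query for any journal.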

When a user visits the RSS feed, MetaLib passes off a request to the database or databases and passes back the journal info as XML, which is then transformed into RSS. The first time a feed is created, it takes 5-10 seconds. User gets an RSS feed. Info is cached on the server. Cron jobs on a Windows server go back to the db to check for new articles.
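
The transform-and-cache step might look roughly like this. It is a sketch under assumptions: the article records are plain dicts rather than MetaLib’s XML, the URLs are placeholders, and the one-day cache lifetime is invented (the talk did not give a refresh interval).

```python
import time
import xml.etree.ElementTree as ET

CACHE = {}  # issn -> (timestamp, rss_text); stands in for the server-side cache
CACHE_SECONDS = 24 * 60 * 60  # assumed refresh interval, not from the talk


def build_rss(journal_title, articles):
    """Turn a list of article dicts into a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = journal_title
    ET.SubElement(channel, "link").text = "https://library.example.edu/"
    ET.SubElement(channel, "description").text = "Recent articles"
    for article in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = article["title"]
        ET.SubElement(item, "link").text = article["link"]
    return ET.tostring(rss, encoding="unicode")


def get_feed(issn, journal_title, fetch_articles, now=None):
    """Serve from cache when fresh; otherwise do the slow search and cache it."""
    now = time.time() if now is None else now
    cached = CACHE.get(issn)
    if cached and now - cached[0] < CACHE_SECONDS:
        return cached[1]
    feed = build_rss(journal_title, fetch_articles(issn))
    CACHE[issn] = (now, feed)
    return feed
```

Only the first request for a journal pays the cost of the search; later requests within the cache window get the stored feed.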

When a user clicks on an article link, they get the standard SFX interface. Includes links to ILL options, since the journals that CSU San Marcos subscribes to are limited to supporting the curriculum. Interests of faculty are potentially much broader.

Advantages: BIG — all the content in databases. Can represent 20,000 – 40,000 feeds. EASY —
[sorry, missed the next two advantages]

Challenges:
-MULTIPLE DATABASES: indexing for journals that are abstracted in more than one.
-SELECTIVE INDEXING: some db providers index unevenly, with some content from some journals and everything from others.
-TIMING AND UPDATING: how often do you go back and refresh?
-SFX KB doesn’t contain everything, and contains minimal data: enough to create a feed, but not sure that it’s enough to find the journal to begin with; has title info, etc., but not necessarily subject data to enhance discovery of journal titles.
-REQUIRES A SECURE LOGIN: these are subscription databases. Sitting as an intermediary between user and subscription content. A lot of RSS clients are not set up to handle authentication so they can’t get in.
-BIGGEST CHALLENGE IS FACULTY USE OF/knowledge of RSS. The RSS client solves a huge problem and alleviates the need to write an email disseminator, for example, but getting faculty to start using an RSS client, and one that handles the authentication, is a struggle.

Future developments:
First beta release in October, will promote to select faculty. Are also exploring topical feeds, which actually use metasearch. Shibboleth and auth. to other CSU campuses.

Raymond Yee
UC Berkeley

Live demo of Scholar’s Box at Berkeley: not possible due to no wireless access in Conv. Center

Problem to solve: giving scholars seamless access to research material: any type of content from any research, package up and go.

Attempting to demonstrate whether or not solving this problem is useful, and if it is, how difficult would it be?

Scholar’s Box is a desktop app written in Python.

First presented with a search screen and an option to select repositories to search, and terms to enter.

Once search results are returned, can drag and drop them into own collection. Copying and pasting traditionally on the web results in a loss of the metadata. The goal of this project is to make it easier to gather the materials along with the metadata … attribution, etc.
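
One way to picture “materials along with the metadata”: each collected item is stored as a record that carries its attribution with it, so an export cannot lose it. This is only an illustration; the class and field names are hypothetical and are not taken from the Scholar’s Box code.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class CollectedItem:
    """A gathered object plus the metadata that plain copy-and-paste loses."""
    url: str
    title: str
    creator: str
    source_repository: str
    rights: str = ""


@dataclass
class Collection:
    name: str
    items: list = field(default_factory=list)

    def add(self, item):
        self.items.append(item)

    def export_json(self):
        """Package the collection so attribution travels with every item."""
        return json.dumps(asdict(self), indent=2)
```

JSON is used here only because it is easy to show; the same records could feed any export package.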

Showing results of a MetaLib search/integration. Hope to provide end users with easy access to the scholarly research.

Have hooked up to Flickr and can mine content from the Flickr site. Raymond is very interested in the idea of personal digital repositories, so is using Flickr to build one and has about 11,000 images in Flickr at this point. Uses the XML interface to Flickr to mine and remix data.

The Scholar’s Box can then export the data in any number of different packages: an OpenOffice presentation, etc.


Question about the upgrade process: how do you keep up with release notes and make enhancements so that you don’t have an application that’s stuck, not taking advantage of new functionality?

Answer from Mike: have a number of checks they run to check for changes in the interface. Checking some sample queries to see what has changed (# of databases, # of results returned, etc.).
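
A check of that kind could be as simple as comparing today’s numbers for a few sample queries against a saved baseline. The sketch below is a guess at the general shape, not CDL’s actual checks; the function name, the dict layout and the 20% tolerance are all assumptions.

```python
def compare_to_baseline(baseline, current, tolerance=0.2):
    """Flag sample queries whose numbers moved more than `tolerance` (a fraction)."""
    warnings = []
    for query, expected in baseline.items():
        observed = current.get(query)
        if observed is None:
            warnings.append(query + ": no response")
            continue
        for key in ("databases", "results"):
            before, after = expected[key], observed[key]
            if abs(after - before) > tolerance * max(before, 1):
                warnings.append(f"{query}: {key} went from {before} to {after}")
    return warnings


baseline = {"tides": {"databases": 5, "results": 100}}
after_upgrade = {"tides": {"databases": 3, "results": 100}}
for warning in compare_to_baseline(baseline, after_upgrade):
    print(warning)
```

Run after each upgrade, a list of warnings like this points to where the back end’s behavior has shifted.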

Comment/question from the ExLibris guy: goal is to make it possible for the institutions to do what they want, integrate with own services, etc. This will be an idea that is expanded with other products, including DigiTool, etc.

Comment/Question: has been watching metasearch evolve over the years, and at first wondered if it would take off as an idea at libraries. Now believes that it will and that it will be very important, equal in importance to the catalog. Question about whether it will be easy for other schools to gather and adapt this code, given how difficult it is, even for large schools, to comment and share code. David: yes, will use SourceForge to share and will make sure it’s documented. Raymond: not sure; Scholar’s Box is still a thought piece, not sure how useful it will be to others. Mike: for the Common Framework, thinking of open sourcing it, but it’s pretty large. The XTF full-text index/search tool has been pushed to SourceForge already. Thinking about breaking off other pieces and sharing them.