Speaker: Mark Phillips, University of North Texas
Mark discussed the experience he has had as manager of the Digital Projects Unit at the University of North Texas Libraries. Their projects include the Portal to Texas History, a multi-institutional repository of approximately 20,000 items relating to Texas history; the CyberCemetery, a collection of websites from defunct government agencies; Congressional Research Service reports; and other digital collections from the UNT libraries. Altogether, he said, they manage approximately 70,000 items, with a total of around 500,000 pages, and they expect to double that number within the next year.
Mark described their technical environment. They use an open-source asset management system, Keystone from Index Data, which they have heavily customized. They use the same modified Dublin Core metadata in all their collections, which has allowed them to maintain consistency of cataloging, but he admitted that many of their controlled vocabularies and descriptive processes were developed independently and do not follow standard cataloging practices. In the future, he believes that they will be experimenting with other types of metadata, including MODS, which will make some of their work more challenging.
The Metadata Analysis Tool (MAT) they use was developed in-house. It has not been released as open source, but UNT is planning to re-platform most of its projects, and the MAT may be documented and made available for wider use at that point. Mark’s slides were not included in the session materials, and he demonstrated the MAT live, but screenshots from the tool can be found in an older presentation available from UNT.
The metadata analysis begins by indexing across the various “silos” of content they manage. They use Solr for indexing and a Python script to push the data into the MAT. At that point, they can examine the metadata from several angles. They can look for missing required elements and quickly modify records. They can list the terms used in various elements and scan for typographic or other errors; a similarity analysis, which is set to flag terms with 90% or greater overlap with a previous term, allows Digital Projects staff to identify problems quickly. Other tools incorporated in the MAT are a term cloud, which displays term use graphically based on frequency, and various graphs and charts showing records added by date and by coverage elements.
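To make these checks more concrete, here is a minimal Python sketch of three of them: finding missing required elements, flagging near-duplicate terms, and counting term frequencies of the kind a term cloud would display. It is not the MAT's code, which has not been released. The element names in `REQUIRED`, the function names, and the sample records are all hypothetical, and the use of `difflib.SequenceMatcher` as the similarity measure is an assumption, since the session did not say how the MAT computes overlap.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical required elements; the real list used at UNT was not given.
REQUIRED = ("title", "creator", "date", "subject")


def missing_required(record, required=REQUIRED):
    """Return the required element names that are absent or empty."""
    return [name for name in required if not record.get(name)]


def similar_terms(terms, threshold=0.9):
    """Return (term, earlier_term, score) for distinct terms whose
    similarity ratio meets the threshold (assumed measure: difflib)."""
    flagged = []
    seen = []
    for term in sorted(set(terms)):
        for earlier in seen:
            score = SequenceMatcher(None, term.lower(), earlier.lower()).ratio()
            if score >= threshold:
                flagged.append((term, earlier, score))
        seen.append(term)
    return flagged


def term_frequencies(records, element):
    """Count how often each term appears in one element across records."""
    counts = Counter()
    for record in records:
        counts.update(record.get(element, []))
    return counts


# Made-up sample records for illustration only.
records = [
    {"title": "Map of Austin", "creator": "", "date": "1890",
     "subject": ["Texas History", "Railroads"]},
    {"title": "Letter", "creator": "Unknown", "date": "1901",
     "subject": ["Texas Histroy"]},
]

for record in records:
    print(record["title"], "is missing:", missing_required(record))

all_subjects = [term for record in records for term in record["subject"]]
for term, earlier, score in similar_terms(all_subjects):
    print(f"{term!r} resembles {earlier!r} ({score:.2f})")
```

Comparing every term against every earlier term is quadratic in the number of distinct terms, so a collection with a large vocabulary would probably need to group candidates first, for example by leading letters or by a Solr facet query.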