The MARC data structure, and the AACR2 rules that usually accompany it, are strange beasts. Every once in a while I’m asked why I get so frustrated with them, and I explain that there are things — strange things — that I have to deal with by writing lots of code when I could be spending my time trying to improve relevancy ranking or extending the reporting tools my librarians use to make decisions that affect patrons and their access.
This is one of those tales.
I’m a systems librarian, which in my case means that I deal with MARC metadata pretty much all day, every day. Coming from outside the library world, it took me a while to appreciate the MARC format and how we store data in it, where “appreciate” can be read as “hate hate hate hate hate.”
I find it frustrating to deal with data typed into free-text fields all willy-nilly with never a thought for machine readability, where a question like “what is the title?” is considered a complicated trap, and where the word “unique,” when applied to identifiers, has to have air quotes squeezing it so hard that the sarcasm drips out of the bottom of the ‘q’ in a sad little stream of liquid defeat.
One of the most frustrating things, though, is when a cataloger has clearly worked hard to determine useful information about a work and then has nowhere to put those data. To wit: date of publication.
Many programmers have to deal with timestamps, with all the vagaries of time zones, leap years, leap seconds, etc. In contrast, you’d think that the year in which something was published wouldn’t be fraught with ambiguity and intrigue, but you’d be wrong. Dates are spread out over MARC records in several places, often in unparsable free-text minefields (I’m looking at you, enumeration/chronology) and occasionally in different calendars.
The most “reliable” dates (see? there are those air-quotes again!) live in the 008 fixed field. Of course, they mean different things depending on format determination and so on, but generally you get four bytes to put down four ASCII characters representing the year. When you don’t know all the digits of the year exactly, you substitute a u for each unknown digit.
- 1982 — published in 1982
- 198u — published sometime in the 1980s
- 19uu — published between 1900 and 1999
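To make that concrete, here is a minimal Python sketch of what a program has to do to turn one of these values back into a range of years. The name `year_range` is made up for this post, and the sketch assumes only the simple case described above: exactly four characters, digits first, with u standing in for each unknown trailing digit. Real 008 fields carry more than this (a second date, plus a code saying how to interpret the pair), which the sketch ignores.

```python
import re

def year_range(masked):
    """Expand a four-character year like '198u' into (earliest, latest).

    Hypothetical helper: assumes ASCII digits followed by zero or more
    'u' characters, four characters in total.
    """
    if len(masked) != 4 or not re.fullmatch(r"[0-9]*u*", masked):
        raise ValueError(f"not a four-character masked year: {masked!r}")
    # Each unknown digit could be anything from 0 to 9.
    earliest = int(masked.replace("u", "0"))
    latest = int(masked.replace("u", "9"))
    return earliest, latest

for value in ("1982", "198u", "19uu"):
    print(value, year_range(value))
```

So `year_range("198u")` gives `(1980, 1989)`: a whole decade, whether or not the cataloger knew something narrower.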
So, that’s fine. Except that it isn’t. It’s dumb. It made sense to someone at the time to only allow four bytes, because bytes were expensive. But those days have been gone for decades, and we still encode dates like this, despite the fact that having actual start and end points for a known range would be better in every way.
Look at what we lose!
- 1982 or 1983 — 198u (ten years vs. two)
- Between 1978 and 1982 — 19uu (one-hundred years vs. five)
- Between the Civil War and WWI — 1uuu (one-thousand years vs. about fifty)
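To put numbers on that loss, here is a small Python sketch that goes the other direction: given the range a cataloger actually knew, it works out the four-character value they would be forced to record. The names `mask_for_range` and `years_covered` are hypothetical, and the 1865 and 1914 endpoints for the last example are my own stand-ins for “between the Civil War and WWI.”

```python
def mask_for_range(start, end):
    """Return the four-character masked year that covers start..end.

    Hypothetical helper: keeps the leading digits both years share and
    replaces everything after the first difference with 'u'.
    """
    if not 0 <= start <= end <= 9999:
        raise ValueError("expected 0 <= start <= end <= 9999")
    s, e = f"{start:04d}", f"{end:04d}"
    shared = 0
    for a, b in zip(s, e):
        if a != b:
            break
        shared += 1
    return s[:shared] + "u" * (4 - shared)

def years_covered(masked):
    """Number of years a masked value could mean (10 per unknown digit)."""
    return 10 ** masked.count("u")

# (start, end) pairs; the last pair is an assumed stand-in, see above.
for start, end in [(1982, 1983), (1978, 1982), (1865, 1914)]:
    masked = mask_for_range(start, end)
    print(f"{start}-{end}: knew {end - start + 1} years, "
          f"stored {masked} ({years_covered(masked)} years)")
```

The interesting part is how little it takes to lose a digit: a range only has to cross a decade or century boundary, as 1978 to 1982 does, for the mask to widen by a factor of ten.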
The other day, in fact, I came across this date:
20uu
Yup. The work was published sometime between 2000 and 2099. My guess is that it was narrowed down to, say, 2009-2011 and this is what we were stuck with. I’d bet big money that its date of publication isn’t, say, after 2016, unless time travel gets invented in the next few years.
But the MARC format works against us, and once again we throw data away because we don’t have a good place to store it, and I’m spending my time trying to figure out a reasonable maximum based on the current date or the date of cataloging or whatnot when it could have just been entered at the time.
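For what it’s worth, the kind of workaround I mean looks roughly like the Python sketch below. It is only one possible heuristic, not a recommendation: it assumes a work can’t have been published after the year it was cataloged (or, failing that, after the current year), which isn’t strictly true for things cataloged ahead of publication. The function name `plausible_range` and its parameters are invented for illustration.

```python
from datetime import date

def plausible_range(masked, cataloged_year=None, today=None):
    """Expand a masked year like '20uu', capping the upper bound.

    Hypothetical heuristic: the latest plausible year is the cataloging
    year if we have one, otherwise the current year.
    """
    earliest = int(masked.replace("u", "0"))
    latest = int(masked.replace("u", "9"))
    if cataloged_year is not None:
        ceiling = cataloged_year
    else:
        ceiling = (today or date.today()).year
    # Never let the cap push the upper bound below the lower bound.
    return earliest, max(earliest, min(latest, ceiling))

print(plausible_range("20uu", cataloged_year=2011))
print(plausible_range("19uu", cataloged_year=2011))
```

With a cataloging year of 2011, a century-wide 2000 to 2099 shrinks to 2000 to 2011, while a value for the 1900s is left alone because the cap never comes into play. That is better than nothing, but it is still a guess standing in for information somebody once had.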
As much as we’d like to pretend otherwise, no one is ever going to go back and re-catalog everything. I can almost stomach the idea that we did this thirty years ago. It drives me crazy that we’re still doing it today.
How about it, library-nerd-types? What do you spend your time dealing with that should have been dealt with at another place in the workflow?