Presidential End of Term Web Harvest: Lessons Learned by Mark Phillips

In a meeting room far, far away…

Mark Phillips from the University of North Texas Libraries spoke about web harvesting of government information to a small gathering of LITA librarians who found their way to the remote Convention Center Meeting Room C1+C4. If you imagine that it is a simple thing to do, you are wrong!

Why would you even consider harvesting data from government websites? 96 percent of federal government information is now digital and much of it is not archived; much of it is disappearing at the direction of bureaucrats who do not know or follow any archiving directives.

The University of North Texas Libraries (UNT Libraries) was contracted by the Government Printing Office (GPO) in 1997 to begin harvesting the web pages of government commissions that were filing final reports and agencies whose functions were ending. The result is the CyberCemetery, which archives the websites of 42 defunct agencies and makes them available for public use.

In theory, the GPO tells UNT Libraries when a commission or agency needs a final harvest so nothing will be lost. In reality, many bodies have disappeared without notice; their web pages, which were stored on private industry servers, often disappeared before anyone at GPO or UNT was aware of their demise. UNT tries to recover the data from secondary sources such as the Internet Archive, but much is often lost.

In 2004 the National Archives and Records Administration (NARA) approached UNT Libraries about conducting an end-of-presidential-term harvest of federal information, with the results to be sent to the California Digital Library. UNT first declined but then accepted when asked a second time. The project was very time sensitive. Phillips’s department had only a month to prepare and another month to do the harvesting. NARA provided UNT with a list of URLs that any good government documents librarian could tell was incomplete, so Phillips’s department had to go to other sources to collect URLs. They also had to get software for the harvest, set up computers, and decide on procedures. Phillips described this operation.

Problems began to crop up as soon as the harvesting began. NARA had promised that necessary notices and permissions would be given; notices may have gone to management administrators in government agencies, but many server administrators knew nothing until they discovered their files being massively copied. Phillips got many angry calls ordering him to cease and desist. There were technical problems as well. By the time of the presidential inauguration, UNT Libraries had captured only about one-third of the federal web data, but it was still more than NARA had said they would find.

Phillips did not have a digital presentation, but numerous documents on his department’s work can be found at

I have been reading for years about the problems of disappearing federal information. Phillips’s presentation gave it a whole new twist. We have plenty of reasons to worry.