Archiving & Preserving the Web

Kristine Hanna was the main speaker for this session, and Linda Frueh also contributed. Both are from the Internet Archive.

The session opened with a brief outline of the history of the Internet Archive. The organization was founded in 1996 and is a nonprofit dedicated to, well, archiving the Internet. They crawl two billion pages a month, plus other media files like audio clips. These snapshots are then stored and made available online. Currently the archive holds 55 billion pages from 55 million sites! To put this in perspective, Kristine estimated that, if printed out, the pages would reach to the moon and back 19 times.

IA makes no distinction between what should be archived and what shouldn’t – the web is so ephemeral that they’re focused on just grabbing the data for now. All software used in the process is open source and developed through partnerships between IA and other organizations. This includes their crawler, the “Wayback Machine” method of displaying the stored sites, a search engine, and the file format of the archives.

In my mind I had always pictured the Internet Archive as a behemoth of an organization. But in reality they have just forty employees! As someone pointed out, that means that 5% of their entire organization was here today. They are completely nonprofit, and even services that have a fee (such as custom archives for organizations like the Library of Congress) are done at cost.

I was also unaware of all the special projects IA takes on. They’ve branched out a bit from the general archive, and also create special collections around big events like Hurricane Katrina.

As I mentioned earlier, IA works with a number of clients on special projects as well. Users include the Virginia and North Carolina state governments. Others are much broader than just one state – working with France, they crawled and archived the entire .fr domain! Same with .au in Australia!

For a relatively small organization, safe backup of all this information is an important issue. IA follows the Lots of Copies Keep Stuff Safe philosophy, running mirror servers in places like Egypt in addition to the main California facility. Because this is such a huge amount of data and IA doesn’t have access to the higher bandwidth of Internet2, the backups are actually physically shipped around the world on massive racks of hard drives.

Both Kristine and Linda emphasized that they are not librarians. Instead, they say that the Internet Archive works only as “technical partners” to existing organizations and their expertise. And their services are “…only good because we get lots of user feedback.” In some cases entire projects are suggested by users, including a new archive of topographical maps of the United States.

During Q&A, the presenters noted that if any content owner would like their sites removed from the archive, they need only ask. Also, the IA crawler obeys robots.txt files and will skip a server if directed to. Internet Archive isn’t large enough and doesn’t have enough money to get into the legal area necessary to clarify these issues. But Kristine also mentioned that part of “archiving it all” means getting the “bad” stuff along with the good – pornography, ads, etc.
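
For anyone curious how a crawler "obeys" robots.txt, here is a minimal sketch using Python's standard `urllib.robotparser` module. This is not the Internet Archive's actual crawler code; the robots.txt contents, the `example-archiver` user-agent name, and the `may_fetch` helper are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one (made-up) archiving crawler is shut out
# entirely, while every other crawler is only kept out of /private/.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("example-archiver", "other-bot"):
        for url in ("http://example.com/page.html",
                    "http://example.com/private/notes.html"):
            print(agent, url, may_fetch(ROBOTS_TXT, agent, url))
```

A polite crawler would run a check like this before requesting each page and simply skip anything the site owner has disallowed.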

The Internet Archive’s book scanning project was also brought up during Q&A. So far they’ve scanned 80,000 books, and the main barrier to moving faster is a lack of money to build more “scribe” scanning machines. All books scanned so far are public domain.

The session closed with brief mentions of two upcoming projects from the Internet Archive:

  • Searching the archive by a method other than a known URL is in the works.
  • The archive of 1996-2000, the so-called “historical web”, will be broadened.