A Linked Data Journey: Beyond the Honeymoon Phase


Image courtesy of Grant MacDonald under a CC BY-NC 2.0 license.

Introduction

I feel this series is getting a little long in the tooth, so this will be my last post in it. The full series is aggregated under the following tag: linked data journey.

After spending a good amount of time playing with RDF technologies, reading the authoritative literature, and engaging with other linked data professionals and enthusiasts, I have concluded that linked data, like any other technology, isn’t perfect. The honeymoon phase is over! In this post I hope to present a high-level, pragmatic assessment of linked data. I will begin by detailing the main strengths of RDF technologies. Next I will note some of the primary challenges that come with RDF. Finally, I will give my thoughts on how the Library/Archives/Museum (LAM) community should move forward to make Linked Open Data a reality in our environment.

Strengths

Modularity. Modularity is a huge advantage RDF modeling has over modeling in other technologies such as XML or relational databases. First, you’re not bound to a single vocabulary, such as Dublin Core; you can describe a single resource using multiple descriptive standards (Dublin Core, MODS, BIBFRAME). Second, you can extend existing vocabularies. Maybe Dublin Core is perfect for your needs, except that you need a more specific “date”. Well, you can create that more specific term and declare it a subproperty of dc:date. Third, you can say anything about anything: RDF is self-describing. This means that not only can you describe resources, you can describe existing and new vocabularies, as well as create complex versioning data for vocabularies and controlled terms (see this ASIST webinar). Finally, with SPARQL and reasoning, you can crosswalk metadata from one vocabulary to another without the need for technologies such as XSLT. Of course, this approach has its limits (e.g. you can’t crosswalk a broader term to a narrower one).

Linking. Linking data is the biggest selling point of RDF, and it is especially attractive for the LAM community: instead of maintaining cross-references, institutions can link their data together directly. Eventually, when there’s enough linked data in the LAM community, those links will connect our data across institutions, forming a web of knowledge.

Challenges

Identifiers. Uniform Resource Identifiers (URIs) are a double-edged sword when it comes to RDF. URIs help us uniquely identify every resource we describe, making it possible to link resources together. They also make it much less complicated to aggregate data from multiple data providers. However, minting a URI for every resource and maintaining stable URIs (which I think will be a requirement if we’re going to pull this off) can be cumbersome for a data provider, as well as rather costly.
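One common pattern for keeping URIs stable is to mint opaque identifiers rather than “meaningful” ones, since opaque URIs survive title changes and re-cataloging. A minimal sketch (the data.example.org namespace is hypothetical):

```python
import uuid

# Hypothetical institutional namespace; in practice this would be a
# persistent domain the institution commits to maintaining.
BASE = "https://data.example.org/resource/"

def mint_uri(base: str = BASE) -> str:
    """Mint an opaque, stable URI for a newly described resource.

    Opaque identifiers carry no descriptive meaning, so they never need
    to change when the resource's title, owner, or location changes.
    """
    return base + str(uuid.uuid4())

u = mint_uri()
print(u)
```

The hard (and costly) part isn’t the minting, of course; it’s the institutional commitment to keep the namespace resolving for decades.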

Duplication. I have been dreaming of the day when we could just link our data together across repositories, meaning we wouldn’t need to ingest external data into our local repositories. This would relieve the duplication challenges we currently face. Well, we’re going to have to wait a little longer. While there are mechanisms out there that could tackle the problem of data duplication, they are unreliable. For example, with SPARQL you can run what is called a “federated query”. A federated query queries multiple SPARQL endpoints, which offers the potential to deduplicate data by accessing it at its original source. However, I’ve been told by linked data practitioners that public SPARQL endpoints are delicate and can crash when too much stress is exerted on them. Public SPARQL endpoints and federated querying are great for individuals doing research and small-scale querying; not so much for robust, large-scale data access. For now, best practice is still to ingest external data into local repositories.

Moving forward

Over the past few years I have dedicated a fair amount of research time to developing my knowledge of linked data. During this time I have formed some thoughts on moving forward with linked data in the LAM community. These thoughts are my own and should be compared to others’ opinions and recommendations.

Consortia-level data models. Being able to fuse vocabularies together for resource description is amazing. However, it brings a new level of complexity to data sharing. One institution might use dc:title, dc:date, and schema:creator. Another institution might use schema:name (the dc:title equivalent), dc:date, and dc:creator. Even though both institutions are pulling from the same vocabularies, they’re using different terms. This poses a problem when trying to aggregate data from both institutions. I still see consortia such as the Open Archives Initiative forming their own requirements for data sharing. This can be seen now in the Digital Public Library of America (DPLA) and Europeana data models (here and here, respectively).

LD best practices. Linked data in the LAM community is in the “wild west” stages of development. We’re experimenting, researching, presenting primers to RDF, etc. However, RDF and linked data have been around for a while (a public draft of RDF appeared in 1997, seen here). As such, the larger linked data and semantic web community has established best practices for creating RDF data models and publishing linked data. To integrate seamlessly into that larger community, we will need to adopt and adhere to these best practices.

Linked Open Data. Linked data is not inherently “open”, meaning data providers have to make the effort to put the “open” in Linked Open Data. To get the most out of linked data, and to follow the “open” movement in libraries, I feel there needs to be an emphasis on data providers publishing completely open and accessible data, regardless of format and publishing strategy.

Conclusion

Linked data is the future of data in the LAM community. It’s not perfect, but it is an upgrade to existing technologies and will help the LAM community promote open and shared data.

I hope you enjoyed this series. I encourage you to venture forward; start experimenting with linked data if you haven’t. There are plenty of resources out there on the topic. As always, I’d like to hear your thoughts, so please feel free to reach out in the comments below or through Twitter. Until next time.