Linked Data for Open and Distance Learning – Part 4
Continued from Linked Data for Open and Distance Learning – Part 3
Adopting Linked Data: Technologies and Tools
The goal here is not to provide a complete guide to these technologies (books such as Heath and Bizer (2011) serve that purpose), but to give an overview of the types of tools and formats that one is likely to encounter when adopting linked data, whether as an application developer or as a publisher of open data.
RDF (Resource Description Framework) is the base format for representing linked data on the web. Since it is meant to represent data that can connect across different sources on the web, RDF naturally follows a graph model, where data is represented as nodes connected through edges. Nodes are either resources or literals (i.e. values such as a string or a number), and edges connect these resources or literals. An important aspect here is that each resource and each relationship is identified by a URI, i.e. a web address that points to the information about the corresponding entity. Taking the example of a unit of open educational content from the OpenLearn repository, we can represent the fact that it is a document, has a title (“Machines, minds and computers”), is published by the Open University and relates to a course called “M366: Natural and Artificial Intelligence”, through a set of triples:
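In Turtle notation, such a set of triples might look as follows. This is only a sketch: the subject and object URIs are illustrative rather than the actual OpenLearn identifiers, though the properties (from Dublin Core and FOAF) are commonly used for exactly this kind of statement.

```turtle
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

# Illustrative URIs for the OpenLearn unit, its publisher and the course.
<http://example.org/openlearn/machines-minds-computers>
    rdf:type          foaf:Document ;
    dcterms:title     "Machines, minds and computers" ;
    dcterms:publisher <http://example.org/organization/open-university> ;
    dcterms:relation  <http://example.org/course/m366> .
```

Each line after the subject expresses one triple: the node identified by the subject URI is connected, through the edge named by the property, to another resource or to a literal value.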
While these triples might relate information from various sources across the web, they need to be stored and managed within dedicated systems. Triple stores are the equivalent, for RDF, of database management systems for relational databases: software systems that provide functionalities to load, store, update and query data in RDF. They are called triple stores because their underlying data model is a graph made of RDF triples. Unlike conventional database systems, they interact with external applications through linked data standards (RDF and SPARQL, described next), thereby hiding the details of their specific implementation and making it possible to use different triple stores in a homogeneous way. Existing systems differ mostly in their hardware requirements, their scalability and performance, and in whether they provide additional features beyond what is strictly required to implement the linked data standards.
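The data model underlying a triple store can be sketched in a few lines of Python. This is of course not how production systems are implemented (they add indexing, persistence and full SPARQL support), but it illustrates the core idea: a collection of (subject, predicate, object) triples queried by pattern matching, with a wildcard for any position. All names here are hypothetical.

```python
class TripleStore:
    """Minimal in-memory sketch of a triple store's data model."""

    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        """Store one (subject, predicate, object) triple."""
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return all triples matching the pattern; None matches anything."""
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]


store = TripleStore()
store.add("ex:m366_1", "dcterms:title", "Machines, minds and computers")
store.add("ex:m366_1", "dcterms:publisher", "ex:open-university")

# All statements about ex:m366_1 (both triples above):
results = store.match(s="ex:m366_1")
```

Pattern matching over such triples is exactly what the "triple patterns" of SPARQL, described next, express at the query language level.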
SPARQL (SPARQL Protocol and RDF Query Language) is the query language for RDF and, more generally, linked data. It plays a similar role for triple stores to that of SQL for relational database management systems. A major difference from SQL, however, is that SPARQL is designed to query graph-based data representations. It is therefore essentially based on the definition of “triple patterns” corresponding to particular conditions on the data graph. For example, the following query (adapted from the examples at http://data.nature.com/query) gives the title and first author of up to 25 articles that contain the word “an” in their title:
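A query of this shape might look as follows. The vocabulary prefixes are assumptions for the sake of the example (the actual Nature endpoint uses its own publishing vocabulary), and authorship is simplified to a single `dc:creator` value standing in for the first author:

```sparql
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?title ?author
WHERE {
  ?article dc:title   ?title ;
           dc:creator ?author .
  FILTER regex(?title, "an", "i")
}
LIMIT 25
```

Each line in the WHERE clause is a triple pattern: variables such as `?article` and `?title` are bound to whichever nodes of the data graph make all the patterns true at once, and the FILTER then restricts the matches to titles containing “an”.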
Another important aspect of SPARQL is that it includes not only a query language, but also a protocol for creating data endpoints on the web. The idea is that a SPARQL endpoint (such as the one from nature.com mentioned above) should only require standard web mechanisms in order to be accessed and used; in other words, one only needs a web connection to query a SPARQL endpoint. Results are also given in standard web formats such as XML, or even RDF, making them web accessible and linkable. Finally, the SPARQL language includes several different types of queries that can be used either to interrogate the data in a triple store (“Select” queries as above, or “Ask” queries, which return a yes/no answer) or to extract a sub-graph from the data (“Construct” or “Describe” queries).
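As an illustration of the second kind of query, the following “Construct” query extracts a sub-graph rather than a table of results. The vocabulary prefixes are again only assumptions for the example:

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX foaf:    <http://xmlns.com/foaf/0.1/>

# Extract, as a new RDF graph, every document together with its title.
CONSTRUCT { ?doc dcterms:title ?title . }
WHERE {
  ?doc a foaf:Document ;
       dcterms:title ?title .
}
```

Whereas a “Select” query returns variable bindings (rows), this query returns RDF triples, so its result can itself be loaded into a triple store or linked to from elsewhere.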
These technologies represent the basic, lower-level layer of linked data, dedicated to data management, which is necessary to understand and possibly adopt linked data in an open and distance learning context, or any other context where it might be relevant. Above this basic data management layer, however, sits the layer of vocabularies, which concerns conventions and schemas for modelling linked data in a way that makes them easily interoperable. Indeed, while in principle data on the web, and the RDF format, do not require a schema in the sense of a relational database system, vocabularies and ontologies in linked data play a similar role, with the aim of defining shared ways to structure and meaningfully organise data in RDF so that these data can be reused. The base format for representing such vocabularies is RDF Schema, which makes it possible to declare the types of objects (the classes) and the relationships (the properties) that might apply between objects of different types. OWL (the Web Ontology Language) goes a step further, making it possible to define these classes and properties using logical statements. Such logical statements should represent the shared meaning, the semantics, of these classes and properties, therefore providing a way to attach this meaning to the data for others to reuse, and possibly to reason upon.
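The difference between the two levels can be sketched as follows, using a hypothetical `ex:` vocabulary. The first part uses only RDF Schema; the second adds an OWL logical statement refining the meaning of the class:

```turtle
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix ex:   <http://example.org/vocab#> .

# RDF Schema: declare two classes and a property linking them.
ex:Course       a rdfs:Class .
ex:LearningUnit a rdfs:Class .
ex:hasUnit      a rdf:Property ;
    rdfs:domain ex:Course ;
    rdfs:range  ex:LearningUnit .

# OWL: a logical statement -- every course has at least one unit.
ex:Course rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty     ex:hasUnit ;
    owl:minCardinality 1
] .
```

A reasoner can exploit the OWL statement (for example, to flag a course described without any unit), while the RDF Schema declarations alone already tell data consumers which properties to expect between which types of resources.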
Several such web vocabularies and ontologies exist that are specifically dedicated to representing data about education, learning and teaching (see the vocabularies page of linkeduniversities.org for details). For example, AIISO (the Academic Institution Internal Structure Ontology) is dedicated to representing information about departments, faculties and other divisions of an educational institution, as well as their relationships with courses and qualifications. LRMI (the Learning Resource Metadata Initiative) is another such example. It has been designed as an extension of Schema.org, and as such can be used as a linked data vocabulary. Its aim is to provide a common representation schema for metadata related to educational resources.
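A description of an educational resource using LRMI terms from the Schema.org namespace might look like the sketch below. The resource URI and literal values are hypothetical; `learningResourceType` and `educationalUse` are among the properties LRMI contributed to Schema.org:

```turtle
@prefix schema: <http://schema.org/> .

<http://example.org/resource/intro-ai> a schema:CreativeWork ;
    schema:name                 "Introduction to Artificial Intelligence" ;
    schema:learningResourceType "lecture" ;
    schema:educationalUse       "self study" .
```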
One important set of tools comprises those that help transform existing, legacy data into linked data. Some platforms specifically dedicated to education and eLearning already provide features to export metadata about their content in RDF. For example, in Moodle, the RDF description of the content of a page can be obtained by adding “.rdf” at the end of its address. The Fedora Content Management System, commonly used for multimedia repositories, includes a similar feature, and is actually based on a triple store providing SPARQL-based querying mechanisms. The EPrints open source digital repository also includes features to export data in a variety of formats, including RDF. When not provided directly by the platform that originally holds the information to be extracted and transformed, tools to carry out these tasks might be available separately. For example, Triplify and D2RQ are two of the most popular tools used to create mappings between relational databases and specific RDF representations. Other tools such as OpenRefine (a tabular data cleaning and manipulation tool) also include extensions to export data originating from a large variety of tabular formats (Excel, Google Spreadsheets, comma-separated value files) into RDF.
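The kind of mapping these tools automate can be sketched in plain Python: each row of a table becomes a resource, and each column becomes a property. The base URIs, column names and mapping convention below are all illustrative assumptions, not the behaviour of any particular tool:

```python
import csv
import io

# Illustrative namespaces for the generated subjects and properties.
BASE = "http://example.org/course/"
VOCAB = "http://example.org/vocab#"

def csv_to_ntriples(csv_text):
    """Convert a CSV table to N-Triples: one resource per row,
    one triple per non-empty cell, the 'code' column as identifier."""
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for row in reader:
        subject = f"<{BASE}{row['code'].lower()}>"
        for column, value in row.items():
            if column == "code" or not value:
                continue
            # Literal values are quoted; escape embedded quotes.
            literal = '"%s"' % value.replace('"', '\\"')
            triples.append(f"{subject} <{VOCAB}{column}> {literal} .")
    return "\n".join(triples)

data = "code,title\nM366,Natural and Artificial Intelligence\n"
print(csv_to_ntriples(data))
```

Real tools such as D2RQ go much further, letting the publisher declare which columns become resources rather than literals, which external URIs to reuse, and how tables join into a single graph.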
Last, but most definitely not least, an important aspect of adopting linked data is, of course, linking. Links between different datasets can take two forms: reusing external URIs, or declaring that some resources in the considered data are the same as resources in others. The first approach is in some ways the most natural, creating links by simply relating resources in one source to resources somewhere else on the web using a dedicated property. For example, the descriptions of courses at the Open University include information about the countries in which they are available; these countries are referenced using URIs from the Geonames dataset. When this is not suitable, one can use the second approach, relying on the ‘owl:sameAs’ property of the OWL language to declare that two different linked data entities (having two different URIs) represent the same real-world object. For example, in the linked data platform of the University of Southampton, entities representing bus-stops (for example http://id.southampton.ac.uk/busstop/SN120684) are linked through ‘owl:sameAs’ to corresponding entities in the UK government’s transport linked dataset (for example http://transport.data.gov.uk/id/stop-point/1980SN120684). In both cases, if not created manually, the links are often generated through ad-hoc scripts and tools; however, generic tools such as Silk and LIMES exist that can be set up to automatically discover such links between datasets.
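Expressed in Turtle, the Southampton bus-stop example above amounts to a single triple (the two URIs are the ones quoted in the text):

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://id.southampton.ac.uk/busstop/SN120684>
    owl:sameAs <http://transport.data.gov.uk/id/stop-point/1980SN120684> .
```

Once this statement is published, a linked data application that encounters either URI can follow the link and merge everything said about the bus-stop in both datasets.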