Alignments for data interlinking: Analysed systems

Six systems took part in the experiment. We are aware of two other systems not analysed in this study [sais:2008,hogan:2007].

Interlinking tools are mainly based on string matching techniques. Some also use more advanced techniques that rely on linguistic tools or take into account the graph structure of web data sets.

Tools

We give here only a succinct presentation of the tools we analysed. A more detailed description of the tool formats is on a separate page.

RKB-CRS

The co-reference resolution system (CRS) of the RKB knowledge base [jaffri:2008b] is built around URI equivalence lists. These lists are built using an ad-hoc Java program working on the specific domain of universities and conferences. A new program needs to be written for each pair of data sets to integrate. Each program selects the resources to compare and compares them using string similarity on the values of their properties.
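The idea of equivalence lists built from string similarity on property values can be illustrated with a minimal sketch. This is not the RKB-CRS code (which is written in Java and is specific to each pair of data sets); the function names, the use of `difflib`, the 0.9 threshold and the example URIs are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def equivalence_lists(left, right, threshold=0.9):
    """Group URIs whose compared property values are similar enough.

    `left` and `right` map a URI to the property value being compared.
    Returns the groups (sets of URIs) containing more than one URI.
    """
    # Union-find structure: every URI starts in its own group.
    parent = {uri: uri for uri in list(left) + list(right)}

    def find(uri):
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path compression
            uri = parent[uri]
        return uri

    # Compare every pair across the two data sets and merge matching groups.
    for l_uri, l_val in left.items():
        for r_uri, r_val in right.items():
            if similarity(l_val, r_val) >= threshold:
                parent[find(l_uri)] = find(r_uri)

    groups = {}
    for uri in parent:
        groups.setdefault(find(uri), set()).add(uri)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical data: URI -> name of the institution.
left = {
    "http://a.example/p1": "University of Southampton",
    "http://a.example/p2": "Open University",
}
right = {
    "http://b.example/x": "university of southampton",
    "http://b.example/y": "INRIA",
}
bundles = equivalence_lists(left, right)
```

With this data, `bundles` contains a single group holding `http://a.example/p1` and `http://b.example/x`. The all-pairs comparison is quadratic, which is acceptable for a sketch but would need a resource selection step, as the description above mentions, on real data sets.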

LD-Mapper

LD-Mapper [raimond:2008] is a data set integration tool working on the music domain. This tool is based on a similarity aggregation algorithm that takes into account the similarity of a resource's neighbours in the graph describing it. It requires little user configuration but only works with data sets described with the Music Ontology [raimond:2007]. LD-Mapper is implemented in Prolog.
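To show why the neighbourhood matters, here is a minimal sketch of similarity aggregation over graph neighbours. It is not LD-Mapper's actual algorithm (which is implemented in Prolog); the weighting scheme, the function names and the example identifiers are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def label_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (illustrative metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def aggregated_sim(r1, r2, labels1, labels2, neighbours1, neighbours2, weight=0.5):
    """Combine the similarity of two resources with that of their neighbours.

    `labels*` map a resource to its label; `neighbours*` map a resource to
    the list of resources it is connected to in its own graph.
    """
    own = label_sim(labels1[r1], labels2[r2])
    n1 = neighbours1.get(r1, [])
    n2 = neighbours2.get(r2, [])
    if not n1 or not n2:
        return own  # no neighbourhood evidence available
    # For each neighbour of r1, keep its best match among the neighbours of r2.
    best = [max(label_sim(labels1[a], labels2[b]) for b in n2) for a in n1]
    return (1 - weight) * own + weight * sum(best) / len(best)


# Hypothetical data: two tracks with the same title but different artists.
labels1 = {"t1": "Intro", "a1": "Radiohead"}
labels2 = {"u1": "Intro", "u2": "Intro", "b1": "Radiohead", "b2": "Muse"}
neighbours1 = {"t1": ["a1"]}
neighbours2 = {"u1": ["b1"], "u2": ["b2"]}

same_artist = aggregated_sim("t1", "u1", labels1, labels2, neighbours1, neighbours2)
other_artist = aggregated_sim("t1", "u2", labels1, labels2, neighbours1, neighbours2)
```

Both candidate tracks have identical titles, so a label-only comparison cannot tell them apart. Once the artist neighbour is taken into account, `same_artist` scores higher than `other_artist`.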

ODD-Linker

ODD-Linker [hassanzadeh:2009] is an interlinking tool implemented on top of a tool that mines for equivalent records in relational databases. ODD-Linker uses SQL queries for identifying and comparing resources. The tool translates link specifications expressed in LinQL, a dedicated language originally developed for duplicate record detection in relational databases, into SQL queries. Its usage in the context of linked data is thus limited to relational databases exposed as linked data, that is, exported in RDF. LinQL is nonetheless an expressive formalism for link specifications: the language supports many string matching algorithms, hyponyms and synonyms, and conditions on attribute values. The tool was used to interlink the Linked Internet Movie Database to DBpedia.

RDF-AI

RDF-AI [scharffe:2009] is an architecture for data set matching and fusion. It generates an alignment that can be later used either to generate a linkset, or to merge two data sets. The interlinking parameters of RDF-AI are given in a set of XML files corresponding to the different steps of the process. The data set structure and the resources to match are described in two files. This description corresponds to a small ontology containing only resources of interest and the properties to use in the matching process. A pre-processing file describes operations to perform on resources before matching. Translation of properties and name reordering are performed before looking for links. A matching configuration file describes which techniques should be used for which resources. A threshold for generating the linkset from the alignment can be specified. Additionally, when data sets need to be merged, a configuration file describes the fusion method to use. The prototype works with a local copy of the data sets and is implemented in Java.
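The sequence of pre-processing, matching and thresholding described above can be sketched as follows. This is only an illustration of the steps, not RDF-AI's implementation (which is written in Java and configured through XML files); the function names, the `difflib` metric, the 0.9 threshold and the example identifiers are assumptions.

```python
from difflib import SequenceMatcher


def reorder_name(name: str) -> str:
    """Pre-processing: turn 'Last, First' into 'First Last'."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {last}"
    return name.strip()


def align(names_a, names_b):
    """Matching: score every pair of resources on their pre-processed names.

    Returns the alignment as a list of (uri_a, uri_b, score) triples.
    """
    alignment = []
    for uri_a, name_a in names_a.items():
        for uri_b, name_b in names_b.items():
            score = SequenceMatcher(
                None, reorder_name(name_a).lower(), reorder_name(name_b).lower()
            ).ratio()
            alignment.append((uri_a, uri_b, score))
    return alignment


def linkset(alignment, threshold=0.9):
    """Keep only the correspondences whose score reaches the threshold."""
    return [(a, b) for a, b, score in alignment if score >= threshold]


# Hypothetical data: the same author written in two different name orders.
names_a = {"a:1": "Euzenat, Jérôme"}
names_b = {"b:1": "Jérôme Euzenat", "b:2": "François Scharffe"}

links = linkset(align(names_a, names_b))
```

Without the reordering step, "Euzenat, Jérôme" and "Jérôme Euzenat" would score poorly under a plain string comparison. Keeping the alignment separate from the linkset mirrors the description above, where the same alignment can later feed either link generation or a merge.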

Silk

Silk [bizer:2009] is an interlinking tool parametrised by a link specification language: the Silk Link Specification Language (Silk LSL). The user specifies the type of resources to link and the comparison techniques to use. Data sets are referenced by giving the URI of the SPARQL endpoint from which they are accessible. A named graph can be specified in order to link only resources belonging to this graph. Resources to be linked are specified using their type, or the RDF path to access them. Silk uses many string comparison techniques, numerical and date similarity measures, concept distances in a taxonomy, and set similarities. A condition specifies the matching algorithm used to match resources. Matching algorithms can be combined using a set of operators (MAX, MIN, AVG), and literals can be transformed before the comparison by specifying a transformation function, concatenating or splitting resources. Regular expressions can be used to preprocess resources. Silk takes as input two web data sets, each accessible behind a SPARQL endpoint. When resources are matched with a confidence over a given threshold, the tool outputs sameAs links or any other RDF predicate specified by the user. Silk is implemented in Python.

This description corresponds to Silk 2009, but Silk has been reimplemented since then.
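The combination of several comparison scores with the MAX, MIN and AVG operators, followed by a confidence threshold, can be sketched in a few lines. This is not Silk code and does not use Silk LSL syntax; the function names, the default threshold of 0.8 and the example scores are assumptions made for illustration.

```python
# Aggregation operators named in the description of Silk above.
AGGREGATORS = {
    "MAX": max,
    "MIN": min,
    "AVG": lambda scores: sum(scores) / len(scores),
}


def link_confidence(scores, operator="AVG"):
    """Combine the scores of several comparisons into one confidence value."""
    return AGGREGATORS[operator](scores)


def generate_links(candidates, operator="AVG", threshold=0.8, predicate="owl:sameAs"):
    """Emit (source, predicate, target) triples for confident matches.

    `candidates` is a list of (source, target, scores) triples, where
    `scores` holds one similarity value per comparison technique.
    """
    links = []
    for source, target, scores in candidates:
        if link_confidence(scores, operator) >= threshold:
            links.append((source, predicate, target))
    return links


# Hypothetical scores, e.g. [label similarity, date similarity].
candidates = [
    ("s1", "t1", [1.0, 0.9]),
    ("s2", "t2", [1.0, 0.2]),
]

avg_links = generate_links(candidates, operator="AVG")
max_links = generate_links(candidates, operator="MAX")
```

With AVG, only the first pair is linked, since the second pair averages 0.6. With MAX, both pairs pass because one strong score is enough. The choice of operator therefore controls how strict the link specification is.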

Knofuss

The Knofuss architecture [nikolov:2008] aims at providing support for data set fusion. A specificity of Knofuss is the possibility to use existing ontology alignments. The resource comparison process is driven by a dedicated ontology for each data set, specifying the resources to compare as well as the comparison techniques to use. Each ontology gives, for each type of resource to be matched, an application context defined as a SPARQL query for this type of resource. An object context model is also defined to specify the properties that will be used to match these resource types. Corresponding application contexts are given the same ID in the two ontologies, and one application context indicates which similarity metric should be used for comparing them. When the two data sets are described using different ontologies, an ontology alignment can be specified. This alignment is given in the ontology alignment format [euzenat:2004]. Knofuss allows for exporting links between data sets, but was originally designed to merge equivalent resources. It includes a consistency resolution module which ensures that the data set resulting from the fusion of the two data sets is consistent with respect to the ontologies. The parameters of the fusion operation are also given in the input ontologies. Knofuss works with local copies of the data sets and is implemented in Java.

Analysis

For each analysed tool, we tried to answer the questions reproduced below. We then describe and categorise each tool according to these questions.

Degree of Automation
Used matching techniques
Ontologies
Data Set access
How are the data sets accessed (through a SPARQL endpoint, a URL, or a local copy of the data set)?
Output
Domain
Is the tool specific to a given domain?
Post-processing

The following table summarises the analysis of these tools according to these criteria.

| Criterion | RKB-CRS | LD-Mapper | ODD-Linker | RDF-AI | Silk | Knofuss |
|---|---|---|---|---|---|---|
| Ontologies | multi | multi | single | single | single | multi |
| Automation | semi | automatic | semi | semi | semi | semi |
| User input | ad-hoc program | none | link spec. query | data set structure, alignment method | links spec., alignment method | merged ontology |
| Input format | Java | Prolog | LinQL | XML | Silk-LSL (XML) | OWL |
| Matching techniques | string | string, similarity propagation | string | string, Wordnet | string | string, adaptive learning |
| Onto. alignment | no | no | no | no | no | yes, as input |
| Output | owl:sameAs | owl:sameAs linkset | linkset | alignment format, merged data set | linkset | alignment format, merged data set |
| Data access | API | local copy | ODBC | local copy | SPARQL | local copy |
| Domain | publications | Music Ontology | independent | independent | independent | independent |
| Post-processing | no | no | no | no | no | inconsistency resolution |

These tools vary considerably in spite of their common goal. Before trying to find what they have in common with ontology matchers, it is therefore necessary to consider them in a common framework. This is what we do in the next section: from this analysis, we attempt to provide a synthetic, unified view of the data interlinking activity.

François Scharffe and Jérôme Euzenat
http://melinda.inrialpes.fr
2009-2011 (24/02/2011)