Alignments for data interlinking: Analysed systems

Six systems took part in the experiment. We are aware of two other systems not analysed in this study [sais:2008,hogan:2007].

Interlinking tools rely mainly on string matching techniques. Some also use more advanced techniques that draw on linguistic tools or take into account the graph structure of web data sets.

Tools

We give here only a succinct presentation of the various tools we analysed. A more detailed description of the tool formats is on a separate page.

RKB-CRS

The co-reference resolution system (CRS) of the RKB knowledge base [jaffri:2008b] is built around URI equivalence lists. These lists are built using ad-hoc Java programs working on the specific domain of universities and conferences. A new program needs to be written for each pair of data sets to integrate. Each program selects the resources to compare and compares them using string similarity on the values of their properties.
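To make the idea of equivalence lists concrete, here is a minimal sketch in Python. It is not the RKB-CRS code (which is written in Java and specific to each pair of data sets). The function names, the example URIs, the use of difflib as the string similarity measure and the 0.9 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def equivalence_lists(source, target, prop, threshold=0.9):
    """Group URIs whose `prop` values are similar enough.

    `source` and `target` map a URI to a dict of property values.
    Returns one sorted list of URIs per group of equivalent resources.
    """
    parent = {}  # union-find forest over URIs

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    # Compare every selected source resource with every target resource.
    for s_uri, s_props in source.items():
        for t_uri, t_props in target.items():
            if similarity(s_props[prop], t_props[prop]) >= threshold:
                parent[find(s_uri)] = find(t_uri)

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), []).append(uri)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Using a union-find structure means that equivalences found pairwise are merged transitively into a single list, which is one plausible reading of what an equivalence list stores.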

LD-Mapper

LD-Mapper [raimond:2008] is a data set integration tool working on the music domain. This tool is based on a similarity aggregation algorithm that takes into account the similarity of a resource's neighbours in the graph describing it. It requires little user configuration but only works with data sets described with the Music Ontology [raimond:2007]. LD-Mapper is implemented in Prolog.
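The following Python sketch illustrates why looking at neighbours helps. It is not LD-Mapper's algorithm (which is implemented in Prolog and is not detailed here): the function names, the equal weighting of direct and neighbourhood similarity, and the single propagation step are assumptions chosen to keep the example short.

```python
from difflib import SequenceMatcher


def label_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def neighbourhood_sim(x, y, labels, neighbours, weight=0.5):
    """Combine the label similarity of x and y with that of their neighbours.

    `labels` maps a resource to its label and `neighbours` maps a resource
    to the resources it is connected to in the graph. Each neighbour of x
    is paired with its best match among the neighbours of y.
    """
    direct = label_sim(labels[x], labels[y])
    nx, ny = neighbours.get(x, []), neighbours.get(y, [])
    if not nx or not ny:
        return direct  # no context available, fall back to the labels
    best = [max(label_sim(labels[n], labels[m]) for m in ny) for n in nx]
    context = sum(best) / len(best)
    return (1 - weight) * direct + weight * context
```

With two artists carrying the same name, the direct similarity is identical, and only the similarity of their records (their neighbours in the graph) separates the correct match from the wrong one.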

ODD-Linker

ODD-Linker [hassanzadeh:2009] is an interlinking tool implemented on top of a tool that mines for equivalent records in relational databases. ODD-Linker uses SQL queries for identifying and comparing resources. The tool translates link specifications expressed in LinQL, a dedicated language originally developed for detecting duplicate records in relational databases. Its usage in the context of linked data is thus limited to relational databases exposed as linked data, that is, exported in RDF. LinQL is nonetheless an expressive formalism for link specifications: the language supports many string matching algorithms, hyponyms and synonyms, and conditions on attribute values. The tool was used to interlink the Linked Internet Movie Database to DBpedia.
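As a rough illustration of comparing resources with SQL, the sketch below uses Python's sqlite3 module with a user-defined similarity function inside a join. It does not show LinQL syntax or ODD-Linker's actual queries. The table layout, the `string_sim` function, the example URIs and the threshold are assumptions made for this example.

```python
import sqlite3
from difflib import SequenceMatcher


def string_sim(a, b):
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_links(rows_a, rows_b, threshold=0.9):
    """Return (uri_a, uri_b) pairs whose titles are similar enough.

    `rows_a` and `rows_b` are lists of (uri, title) tuples, standing in
    for two relational tables that are also exposed as linked data.
    """
    conn = sqlite3.connect(":memory:")
    # Make the Python similarity function callable from SQL.
    conn.create_function("string_sim", 2, string_sim)
    conn.execute("CREATE TABLE a (uri TEXT, title TEXT)")
    conn.execute("CREATE TABLE b (uri TEXT, title TEXT)")
    conn.executemany("INSERT INTO a VALUES (?, ?)", rows_a)
    conn.executemany("INSERT INTO b VALUES (?, ?)", rows_b)
    query = """
        SELECT a.uri, b.uri
        FROM a JOIN b ON string_sim(a.title, b.title) >= ?
        ORDER BY a.uri, b.uri
    """
    links = conn.execute(query, (threshold,)).fetchall()
    conn.close()
    return links
```

The point of the design is that resource selection and comparison both happen inside the database engine, so the link specification reduces to a query.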

RDF-AI

RDF-AI [scharffe:2009] is an architecture for data set matching and fusion. It generates an alignment that can be later used either to generate a linkset, or to merge two data sets. The interlinking parameters of RDF-AI are given in a set of XML files corresponding to the different steps of the process. The data set structure and the resources to match are described in two files. This description corresponds to a small ontology containing only resources of interest and the properties to use in the matching process. A pre-processing file describes operations to perform on resources before matching. Translation of properties and name reordering are performed before looking for links. A matching configuration file describes which techniques should be used for which resources. A threshold for generating the linkset from the alignment can be specified. Additionally, when data sets need to be merged, a configuration file describes the fusion method to use. The prototype works with a local copy of the data sets and is implemented in Java.
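The steps described above (pre-processing, matching, then thresholding the alignment into a linkset) can be sketched in a few lines of Python. This is not the RDF-AI implementation, which is written in Java and configured through XML files. The function names, the "Last, First" reordering rule, the difflib similarity measure and the default threshold are assumptions for illustration.

```python
from difflib import SequenceMatcher


def reorder_name(name: str) -> str:
    """Pre-processing: turn 'Last, First' into 'First Last'."""
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {last}"
    return name.strip()


def align(source, target):
    """Matching: return an alignment as (source_uri, target_uri, confidence).

    `source` and `target` map a URI to the name used for matching.
    """
    alignment = []
    for s_uri, s_name in source.items():
        for t_uri, t_name in target.items():
            score = SequenceMatcher(
                None, reorder_name(s_name).lower(), reorder_name(t_name).lower()
            ).ratio()
            alignment.append((s_uri, t_uri, score))
    return alignment


def linkset(alignment, threshold=0.9):
    """Keep only the correspondences whose confidence reaches the threshold."""
    return [(s, t) for s, t, score in alignment if score >= threshold]
```

Keeping the alignment (with confidences) separate from the linkset reflects the point made above: the same alignment can later be thresholded into links or used to drive a merge.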

Silk

Silk [bizer:2009] is an interlinking tool parametrised by a link specification language: the Silk Link Specification Language (Silk LSL). The user specifies the type of resources to link and the comparison techniques to use. Data sets are referenced by giving the URI of the SPARQL endpoint from which they are accessible. A named graph can be specified in order to link only resources belonging to this graph. Resources to be linked are specified using their type, or the RDF path to access them. Silk uses many string comparison techniques, numerical and date similarity measures, concept distances in a taxonomy, and set similarities. A condition specifies the matching algorithm used to match resources. Matching algorithms can be combined using a set of operators (MAX, MIN, AVG), and literals can be transformed before the comparison by specifying a transformation function, for example concatenating or splitting values. Regular expressions can be used to preprocess resources. Silk takes as input two web data sets, each accessible behind a SPARQL endpoint. When resources are matched with a confidence over a given threshold, the tool outputs sameAs links or any other RDF predicate specified by the user. Silk is implemented in Python.

This description corresponds to Silk 2009, but Silk has been reimplemented since then.
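The combination of similarity measures with the MAX, MIN and AVG operators, followed by a threshold, can be illustrated with the Python sketch below. It does not reproduce Silk LSL syntax or Silk's code. The property names (`title`, `year`), the linear year similarity, the function names and the threshold are assumptions; only the operator names and the owl:sameAs output come from the description above.

```python
from difflib import SequenceMatcher

# Aggregation operators named in the text, applied to a list of scores.
AGGREGATORS = {
    "MAX": max,
    "MIN": min,
    "AVG": lambda scores: sum(scores) / len(scores),
}


def string_sim(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def year_sim(a: int, b: int, max_gap: int = 5) -> float:
    """Numeric similarity: 1.0 for equal years, 0.0 from max_gap apart."""
    return max(0.0, 1.0 - abs(a - b) / max_gap)


def confidence(res_a, res_b, operator="AVG"):
    """Aggregate the individual similarity scores with the given operator."""
    scores = [
        string_sim(res_a["title"], res_b["title"]),
        year_sim(res_a["year"], res_b["year"]),
    ]
    return AGGREGATORS[operator](scores)


def same_as_links(source, target, operator="AVG", threshold=0.9):
    """Yield N-Triples owl:sameAs statements for pairs over the threshold."""
    same_as = "http://www.w3.org/2002/07/owl#sameAs"
    for s_uri, s in source.items():
        for t_uri, t in target.items():
            if confidence(s, t, operator) >= threshold:
                yield f"<{s_uri}> <{same_as}> <{t_uri}> ."
```

The choice of operator changes how strict the link condition is: MIN requires every measure to agree, MAX accepts a pair as soon as one measure is high, and AVG sits in between.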

Knofuss

The Knofuss architecture [nikolov:2008] aims at providing support for data set fusion. A specificity of Knofuss is the possibility to use existing ontology alignments. The resource comparison process is driven by a dedicated ontology for each data set specifying resources to compare, as well as the comparison techniques to use. Each ontology gives, for each type of resource to be matched, an application context defined as a SPARQL query for this type of resource. An object context model is also defined to specify properties that will be used to match these resource types. Corresponding application contexts are given the same ID in the two ontologies and one application context indicates which similarity metric should be used for comparing them. When the two data sets are described using different ontologies, an ontology alignment can be specified. This alignment is given in the ontology alignment format [euzenat:2004]. Knofuss allows for exporting links between data sets, but was originally designed to merge equivalent resources. It includes a consistency resolution module which ensures that the data set resulting from the fusion of the two data sets is consistent with respect to the ontologies. The parameters of the fusion operation are also given in the input ontologies. Knofuss works with local copies of the data sets and is implemented in Java.
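The role of an ontology alignment given as input can be sketched as follows in Python. This is not Knofuss code and does not use the ontology alignment format. Here the alignment is reduced to a hypothetical dictionary from property URIs of one ontology to property URIs of the other, and the function names, example URIs and threshold are assumptions.

```python
from difflib import SequenceMatcher


def translate(resource, property_alignment):
    """Rename the properties of `resource` according to the alignment.

    Properties without a correspondence keep their original name.
    """
    return {property_alignment.get(p, p): v for p, v in resource.items()}


def match(source, target, property_alignment, prop, threshold=0.9):
    """Link resources by comparing `prop` after translating the source.

    `source` and `target` map a URI to a dict of property values, each
    expressed in its own ontology. `prop` is a property of the target
    ontology.
    """
    links = []
    for s_uri, s in source.items():
        s = translate(s, property_alignment)
        if prop not in s:
            continue  # nothing comparable on the source side
        for t_uri, t in target.items():
            if prop not in t:
                continue
            score = SequenceMatcher(None, s[prop].lower(), t[prop].lower()).ratio()
            if score >= threshold:
                links.append((s_uri, t_uri))
    return links
```

Without the alignment, the two descriptions share no property and no comparison can take place, which is why the alignment is needed when the data sets use different ontologies.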

Analysis

For each analysed tool, we tried to answer the questions reproduced below. We then describe and categorise each tool according to these questions.

Degree of Automation

Is the tool completely automatic (a black box)?
Does the tool need to be parametrised by the user? With what kind of parameters (data matching techniques, ontology alignment)?

Used matching techniques

String matching?
External functions (values conversion, data transformations)?
Similarity propagation?
Other techniques?

Ontologies

Does the tool take into account the ontologies associated with the data sets?
Does the tool allow interlinking data sets described according to different ontologies?
If the ontologies differ, does the tool perform ontology matching?

Data Set access

How are the data sets accessed (through a SPARQL endpoint, a URL, a local copy of the data set)?

Output

What does the tool produce as output (owl:sameAs links, VoID linkset, other type of links)?
Does the tool propose to merge the two input data sets?

Domain

Is the tool specific to a given domain?

Post-processing

Does the tool perform any post-processing operation (consistency checking and inconsistency resolution)?

The analysis of these tools according to these criteria is summarised in the following table.

	RKB-CRS	LD-Mapper	ODD-Linker	RDF-AI	Silk	Knofuss
Ontologies	multi	multi	single	single	single	multi
Automation	semi	automatic	semi	semi	semi	semi
User input	ad-hoc program	none	link spec. query	data set structure, alignment method	link spec., alignment method	merged ontology
Input format	Java	Prolog	LinQL	XML	Silk-LSL (XML)	OWL
Matching techniques	string	string, similarity propagation	string	string, WordNet	string	string, adaptive learning
Onto. alignment	no	no	no	no	no	yes, as input
Output	owl:sameAs	owl:sameAs linkset	linkset	alignment format, merged data set	linkset	alignment format, merged data set
Data access	API	local copy	ODBC	local copy	SPARQL	local copy
Domain	publications	Music Ontology	independent	independent	independent	independent
Post-processing	no	no	no	no	no	inconsistency resolution

These tools vary considerably in spite of their common goal. So, before trying to find what they have in common with ontology matchers, it is necessary to consider them in a common framework. This is what we do in the next section: from this analysis, we attempt to provide a synthetic unified view of the data interlinking activity.

François Scharffe and Jérôme Euzenat
http://melinda.inrialpes.fr
2009-2011 (24/02/2011)