Alignments for data interlinking: Framework

After analysing some of the tools available for interlinking data, we realised how different these tools are. We therefore propose a framework in which they can all be situated.

We present here a general framework encompassing the various approaches used to interlink resources on the web of data. This framework adapts to the different cases that can be encountered when two web data sets are interlinked. We will see how each of the studied interlinking tools finds its place in the framework.

Manual alignment

Resources may be manually interlinked.

Manually linking resources can be performed using collaborative tools in the case of large data sets.

URI correspondence

Resources can be trivially linked using a simple transformation of their URIs.

A set of rules can be defined to identify equivalent resources from their identifiers. For example, in the LastFM data set, the URI representing an artist is built on the pattern ``First_name+Last_name''. Person URIs in DBpedia are built on the pattern ``FirstName_LastName''. A trivial algorithm can be developed to find equivalent artists based on their URIs. This example is illustrated below for the composer J.S. Bach.
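
The URI rewriting rule described above can be sketched in a few lines of Python. This is only an illustration: the LastFM base URI is a placeholder we made up, the function name is hypothetical, and we assume that name parts are separated by ``+'' on one side and ``_'' on the other, as the two patterns suggest.

```python
from urllib.parse import quote, unquote

# Placeholder base URI for the LastFM data set (an assumption for illustration).
LASTFM_BASE = "http://example.org/lastfm/artist/"
DBPEDIA_BASE = "http://dbpedia.org/resource/"


def lastfm_to_dbpedia(lastfm_uri: str) -> str:
    """Rewrite a LastFM-style artist URI into a DBpedia-style resource URI."""
    if not lastfm_uri.startswith(LASTFM_BASE):
        raise ValueError(f"not a LastFM artist URI: {lastfm_uri}")
    local_name = lastfm_uri[len(LASTFM_BASE):]
    # LastFM pattern: name parts joined with '+'.
    parts = [unquote(part) for part in local_name.split("+")]
    # DBpedia pattern: name parts joined with '_'.
    return DBPEDIA_BASE + "_".join(quote(part) for part in parts)


candidate = lastfm_to_dbpedia(LASTFM_BASE + "Johann+Sebastian+Bach")
print(candidate)
```

The rewritten URI is only a candidate: a real system would still have to check that the resource exists in the target data set before asserting a link.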

Data sets sharing same ontologies

Beyond URIs, it may be necessary to consider the ontologies in order to identify entities. In the first case, the two data sets to interlink are described by the same ontology. The role of the interlinking system is to analyse resources of the same type in order to detect the equivalent ones. To do this, the system compares resource properties with a similarity measure. Systems in this category take as input the properties to compare, the comparison algorithm to use for each property, and the method used to aggregate the similarity measures of the various properties into a single measure between two resources.
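
The three inputs listed above (properties, per-property comparison, aggregation) can be sketched as follows. This is not the implementation of any of the studied tools: the property names, the weights, and the choice of a weighted average are assumptions, and the string comparison comes from Python's standard difflib module.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical specification: property -> (comparison function, weight).
SPEC = {
    "first_name": (string_similarity, 0.4),
    "last_name": (string_similarity, 0.6),
}


def resource_similarity(r1: dict, r2: dict, spec: dict) -> float:
    """Aggregate per-property similarities with a weighted average."""
    total_weight = sum(weight for _, weight in spec.values())
    score = 0.0
    for prop, (compare, weight) in spec.items():
        # A property missing on either side contributes nothing to the score.
        if prop in r1 and prop in r2:
            score += weight * compare(r1[prop], r2[prop])
    return score / total_weight


jamendo_artist = {"first_name": "Johann Sebastian", "last_name": "Bach"}
musicbrainz_artist = {"first_name": "Johann Sebastian", "last_name": "Bach"}
other_artist = {"first_name": "Ludwig", "last_name": "van Beethoven"}

print(resource_similarity(jamendo_artist, musicbrainz_artist, SPEC))
print(resource_similarity(jamendo_artist, other_artist, SPEC))
```

Two resources would then be declared equivalent when the aggregated score exceeds a threshold chosen by the user.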

For example, Jamendo and MusicBrainz, two data sets containing musicological data, are both described according to a common music ontology [raimond:2007]. The artist J.S. Bach can be identified in both data sets by observing the first name and last name properties of the class MusicArtist. It is not possible in this case to identify the equivalence of resources based on their URIs. This example is illustrated in the following figure:

Data sets described with heterogeneous ontologies

Data sets may be described by different ontologies. In order to know which types of entities have to be linked together, the system needs to know the correspondences between these types of entities. The system then works as if there were a single ontology.

Two approaches can be used to interlink the data sets. In the first approach, the alignment between the two ontologies is implicitly specified in the input of the interlinking system. We represent this case in the following figure by introducing the correspondences between ontology classes as an alignment. This alignment is presented as implicit because it does not exist as such: it is embedded in the linking specification or in the data interlinking system.

Consider two data sets, one described using FOAF, the other using VCard. The linking specification will tell the tool to compare entities of type foaf:Person with entities of type vcard:VC and, when comparing resources of these types, to compare the property foaf:givenname with vcard:fn and the property foaf:familyname with vcard:ln. This is an implicit alignment containing two correspondences.
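
To make the notion of implicit alignment concrete, here is a minimal sketch of a linking specification for this FOAF/VCard case. The class and property names are those used in the text; the LinkSpec structure, the use of an rdf:type key to hold the type, and the exact-match comparison after normalisation are assumptions made for illustration, not the input format of an existing tool.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class LinkSpec:
    """A linking specification; the alignment is embedded in it, not stored apart."""
    source_class: str
    target_class: str
    property_pairs: Tuple[Tuple[str, str], ...]


SPEC = LinkSpec(
    source_class="foaf:Person",
    target_class="vcard:VC",
    property_pairs=(
        ("foaf:givenname", "vcard:fn"),
        ("foaf:familyname", "vcard:ln"),
    ),
)


def normalise(value: str) -> str:
    """Lower-case the value and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def is_match(source: Dict[str, str], target: Dict[str, str], spec: LinkSpec) -> bool:
    """Return True when both resources have the expected types and all paired values agree."""
    if source.get("rdf:type") != spec.source_class:
        return False
    if target.get("rdf:type") != spec.target_class:
        return False
    for source_prop, target_prop in spec.property_pairs:
        if source_prop not in source or target_prop not in target:
            return False
        if normalise(source[source_prop]) != normalise(target[target_prop]):
            return False
    return True


person = {"rdf:type": "foaf:Person", "foaf:givenname": "Johann Sebastian", "foaf:familyname": "Bach"}
card = {"rdf:type": "vcard:VC", "vcard:fn": "johann  sebastian", "vcard:ln": "BACH"}
print(is_match(person, card, SPEC))
```

The correspondences only exist as fields of the specification, which is what makes the alignment implicit: it cannot be reused by another tool without being extracted first.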

For example, OpenCyc represents the artist J.S. Bach using a different ontology than the one used to describe MusicBrainz. The properties ``firstname'' and ``lastname'' correspond to a property ``EnglishID'' in which both names are concatenated. The class MusicArtist in the Music Ontology corresponds to a class Classical Music Composer in OpenCyc. An alignment between classes and properties needs to be specified in order to find an equivalence between those two resources. This example is illustrated in the next figure:
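
The property correspondence in this example is not one-to-one: two properties on one side correspond to a single concatenated property on the other. A possible way to handle it is sketched below. The sample value of ``EnglishID'' is invented for illustration, and the normalisation that drops spaces and punctuation is our assumption about how the concatenation could be compared.

```python
def normalise(text: str) -> str:
    """Keep only letters and digits, in lower case."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def concatenated_name(artist: dict) -> str:
    """Build the value expected on the OpenCyc side from the two name properties."""
    return artist["firstname"] + artist["lastname"]


def same_artist(mo_artist: dict, cyc_entity: dict) -> bool:
    """Apply the class correspondence first, then the property correspondence."""
    if mo_artist["type"] != "MusicArtist":
        return False
    if cyc_entity["type"] != "Classical Music Composer":
        return False
    return normalise(concatenated_name(mo_artist)) == normalise(cyc_entity["EnglishID"])


musicbrainz_bach = {"type": "MusicArtist", "firstname": "Johann Sebastian", "lastname": "Bach"}
# Invented sample value; the real OpenCyc identifier may differ.
opencyc_bach = {"type": "Classical Music Composer", "EnglishID": "JohannSebastianBach"}

print(same_artist(musicbrainz_bach, opencyc_bach))
```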

Another approach takes advantage of an already existing explicit alignment between the two ontologies used by the data sets.

Conclusion: the general case

Each of the analysed tools fits into one of the categories of this framework, as shown in the following table:

Category                                            | Tools
Manual link specification                           |
URI correspondence                                  | RKB-CRS
Common ontology                                     | LD-Mapper, ODD-Linker
Different ontologies, implicit ontology alignment   | RDF-AI, Silk
Different ontologies, explicit ontology alignment   | Knofuss

An additional possibility, not found in existing systems, would be for the data linking system to first match the two ontologies before using the resulting alignment for supporting data interlinking. In such a system, ontology matching and data interlinking would be merged.
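
Such a merged system could be organised as a two-step pipeline, sketched below under strong simplifying assumptions. The matcher is deliberately naive (it aligns properties whose local names are equal once case and underscores are ignored), the interlinker requires exact equality of all aligned property values, and all names and URIs are hypothetical.

```python
def match_ontologies(source_terms, target_terms):
    """Naive ontology matcher: align terms with the same normalised local name."""
    def key(term):
        return term.split(":")[-1].replace("_", "").lower()

    index = {key(term): term for term in target_terms}
    return {term: index[key(term)] for term in source_terms if key(term) in index}


def interlink(source, target, alignment):
    """Link resources whose aligned properties all carry equal values."""
    links = []
    for s_uri, s_props in source.items():
        # Translate the source description into the target vocabulary.
        translated = {alignment[p]: v for p, v in s_props.items() if p in alignment}
        if not translated:
            continue
        for t_uri, t_props in target.items():
            if all(t_props.get(p) == v for p, v in translated.items()):
                links.append((s_uri, t_uri))
    return links


source = {
    "http://example.org/a/bach": {"a:first_name": "Johann Sebastian", "a:last_name": "Bach"},
}
target = {
    "http://example.org/b/42": {"b:firstName": "Johann Sebastian", "b:lastName": "Bach"},
    "http://example.org/b/43": {"b:firstName": "Ludwig", "b:lastName": "van Beethoven"},
}

# Step 1: ontology matching produces an explicit alignment.
alignment = match_ontologies(["a:first_name", "a:last_name"], ["b:firstName", "b:lastName"])
# Step 2: data interlinking uses that alignment.
links = interlink(source, target, alignment)
print(alignment)
print(links)
```

The point of the sketch is the data flow: the alignment is an explicit intermediate result, so it could equally be supplied by hand or by another matcher.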

This is the most general case, illustrated below, in which two web data sets are related using a method for comparing their resources.

Each of the cases above can be reduced to this general case. The reduction can be obtained by either suppressing some elements (the ontology matcher, the ontologies, the alignment) or merging the ontologies and instantiating the processing components (matcher and interlinker).

We do not specify at this stage if the method should be automatic or manual. Neither do we specify if the two data sets are described using a common ontology or if the ontologies describing their resources differ. The result of the interlinking process is a set of owl:sameAs predicates between these resources.
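
Whatever method is used, the output can be serialised in the same way. The following sketch writes a set of links as owl:sameAs statements in N-Triples syntax; the function name and the source URI are hypothetical, and we assume the URIs contain no characters that would need escaping.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def to_ntriples(links):
    """Serialise (source URI, target URI) pairs as owl:sameAs triples, one per line."""
    return "\n".join(
        f"<{source}> <{OWL_SAME_AS}> <{target}> ."
        for source, target in sorted(links)
    )


links = {
    ("http://example.org/a/bach", "http://dbpedia.org/resource/Johann_Sebastian_Bach"),
}
document = to_ntriples(links)
print(document)
```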

The diagram above already shows how ontology matching and data interlinking could cooperate. Starting from these remarks, we can propose a way to make them collaborate.

François Scharffe and Jérôme Euzenat
2009-2011 (24/02/2011)