Alignments for data interlinking: Specification formats

Most of the tools that we considered are built around a specific format for specifying the linking process. We consider these languages below.

We describe in this section the link specification formats for each of the studied tools. When describing the formats we focus on the interlinking task, leaving aside technical details related to the implementation or the way to run the tool. For each format we concentrate on the following aspects:

Silk-LSL

Silk-LSL is an XML format for specifying linkages as part of the Silk framework. Data sets are referenced by giving the URI of the SPARQL endpoint through which they are accessible. Optionally, a named graph can be specified in order to link only resources belonging to this graph. Resources to be linked are specified using their type, or the RDF path to access them. The linkage condition specifies the matching algorithm used to match resources. Matching algorithms can be combined using a set of operators (MAX, MIN, AVG). Literals can also be transformed before the comparison by specifying a transformation function, or by concatenating or splitting resources. Regular expressions can optionally be used to preprocess resources. Finally, the resources matched above a given threshold are considered for output. The output type can be owl:sameAs triples or any other given RDF predicate.

Below is a link specification for matching cities between DBpedia and Geonames.

 <Interlink id="cities">
  <LinkType>owl:sameAs</LinkType>
  <SourceDataset dataSource="dbpedia" var="a">
   <RestrictTo>
    ?a rdf:type dbpedia:City
   </RestrictTo>
  </SourceDataset>
  <TargetDataset dataSource="geonames" var="b">
   <RestrictTo>
    ?b rdf:type gn:P
   </RestrictTo>
  </TargetDataset>
  <LinkCondition>
   <AVG>
    <Compare metric="jaroSimilarity">
     <Param name="str1" path="?a/rdfs:label" />
     <Param name="str2" path="?b/gn:name" />
    </Compare>
    <Compare metric="numSimilarity">
     <Param name="num1" path="?a/dbpedia:populationTotal" />
     <Param name="num2" path="?b/gn:population" />
    </Compare>
   </AVG>
  </LinkCondition>
  <Thresholds accept="0.9" verify="0.7" />
  <Output acceptedLinks="accepted_links.n3" 
            verifyLinks="verify_links.n3"
                   mode="truncate" />
 </Interlink>
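
To make the roles of the AVG operator and of the two thresholds concrete, the following Python sketch mimics how such a link condition could be evaluated for one pair of resources. It is only an illustration under stated assumptions, not Silk's implementation: difflib's ratio stands in for the jaroSimilarity metric, the numeric similarity formula is hypothetical, and we assume that scores at or above a threshold fall into the corresponding output.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Stand-in for jaroSimilarity (assumption): difflib ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def num_similarity(x: float, y: float) -> float:
    """Hypothetical numeric similarity: 1 minus the relative difference."""
    if x == y:
        return 1.0
    return max(0.0, 1.0 - abs(x - y) / max(abs(x), abs(y)))


def link_score(source: dict, target: dict) -> float:
    """AVG aggregation: arithmetic mean of the individual comparisons."""
    scores = [
        string_similarity(source["label"], target["name"]),
        num_similarity(source["population"], target["population"]),
    ]
    return sum(scores) / len(scores)


def classify(score: float, accept: float = 0.9, verify: float = 0.7) -> str:
    """Map an aggregated score to the output suggested by the thresholds."""
    if score >= accept:
        return "accept"   # would go to the accepted links file
    if score >= verify:
        return "verify"   # would go to the links to be checked manually
    return "reject"


# Made-up records, for illustration only.
city_a = {"label": "Example City", "population": 1000}
city_b = {"name": "Example City", "population": 1000}
print(classify(link_score(city_a, city_b)))
```

Replacing the mean in link_score by max or min would give the behaviour of the MAX and MIN operators mentioned above.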

Knofuss

In Knofuss, a set of OWL ontologies, one for each data set, specifies the resources to be matched. Each ontology gives the data set details. For each type of resource to be matched, an application context is defined, specifying a SPARQL query for this type of resource. An object context model is also defined to specify the properties that will be used to match these resource types. Corresponding application contexts are given the same ID in the two ontologies, and one application context indicates which similarity metric should be used for the matching. When the two data sets are described using different ontologies, an ontology alignment can be specified. This alignment is given in the ontology alignment format. Knofuss can export links between data sets, but was originally designed to merge equivalent resources. It includes a consistency resolution module which ensures that the data set resulting from the fusion of the two data sets is consistent with regard to the ontologies. The parameters of the fusion operation are also given in the input ontologies.

Below is an excerpt of an application context for matching DBpedia places.

<ApplicationContext rdf:ID="SimMetricsPlaceApplicationContext">
 <hasSelectQuery>
   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
   PREFIX dbpedia: <http://dbpedia.org/ontology/>
   SELECT DISTINCT ?uri WHERE {
    ?uri rdf:type dbpedia:Place . }
 </hasSelectQuery>
 <describes rdf:resource="http://dbpedia.org/ontology/Place"/>
 <simmetrics:threshold>0.9</simmetrics:threshold>
 <simmetrics:similarityMetrics>
  monge-elkan
 </simmetrics:similarityMetrics>
 <hasReliability>0.85</hasReliability>
 <hasConnectedObject>
  <ObjectContextModel rdf:ID="PlaceContextModel">
   <hasSelectQuery>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbpedia: <http://dbpedia.org/ontology/>
    SELECT DISTINCT ?uri ?name
    WHERE {
     ?uri rdf:type dbpedia:Place .
     ?uri rdfs:label ?name . }
   </hasSelectQuery>
   <isPrimaryKey>false</isPrimaryKey>
  </ObjectContextModel>
 </hasConnectedObject>
</ApplicationContext>
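
The application context above selects the monge-elkan metric with a threshold of 0.9. As a rough illustration of what such a metric computes, the Python sketch below averages, over the tokens of the first string, the best similarity found among the tokens of the second string. The inner token similarity (difflib's ratio) and the whitespace tokenisation are our assumptions; the metric actually used by Knofuss may differ in both respects.

```python
from difflib import SequenceMatcher


def token_similarity(a: str, b: str) -> float:
    """Inner similarity between two tokens (stand-in, assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan(left: str, right: str) -> float:
    """Mean, over tokens of `left`, of the best match among tokens of `right`."""
    left_tokens = left.lower().split()
    right_tokens = right.lower().split()
    if not left_tokens or not right_tokens:
        return 0.0
    best = [
        max(token_similarity(l, r) for r in right_tokens)
        for l in left_tokens
    ]
    return sum(best) / len(best)


def is_match(left: str, right: str, threshold: float = 0.9) -> bool:
    """Apply the threshold of the application context to two labels."""
    return monge_elkan(left, right) >= threshold


print(is_match("New York", "York New"))
```

Note that the measure is not symmetric: every token of "New York" has an exact counterpart in "New York City", but not the other way round.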

RKB-CRS

Unfortunately, we did not have access to the RKB data linking format. It seems that there is no specific format and that linking is implemented in ad hoc Java code.

RDF-AI

The interlinking parameters of RDF-AI are given in a set of XML files corresponding to the different phases of the process. The data set structure and the resources to be matched are described in two files. This description in fact corresponds to a small ontology containing only the resources of interest and the properties to be used in the matching process. A pre-processing file describes the operations to perform on certain resources before the matching. Translation of properties and name reordering are available in the evaluated version. A matching configuration file describes which matching technique should be used for which resources. Additionally, when fusion of the data sets needs to be performed, a configuration file describes the fusion method to be used.

The excerpt below gives an example of a matching specification for data sets with resources from the Music Ontology, Dublin Core and the Event Ontology.

<parameters>
    <parameter>
        <title>similarity computation</title>
        <namespace>dc:title</namespace>
        <method>string comparison</method>
    </parameter>
    <parameter>
        <title>similarity computation</title>
        <namespace>mo:instrument</namespace>
        <method>SKOS</method>
    </parameter>
    <parameter>
        <title>similarity computation</title>
        <namespace>mo:genre</namespace>
        <method>WordNet</method>
    </parameter>
    <parameter>
        <title>similarity computation</title>
        <namespace>event:hasProduct</namespace>
        <method>string comparison</method>
    </parameter>
    <parameter>
        <title>similarity computation</title>
        <namespace>event:hasAgent</namespace>
        <method>string comparison</method>
    </parameter>
</parameters>
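
A configuration of this shape is essentially a table associating each property with a matching method. The Python sketch below reads a shortened copy of the excerpt with the standard xml.etree.ElementTree module and builds that table. The function name methods_by_property is ours, and this is not how RDF-AI itself loads its configuration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the excerpt above.
CONFIG = """
<parameters>
    <parameter>
        <title>similarity computation</title>
        <namespace>dc:title</namespace>
        <method>string comparison</method>
    </parameter>
    <parameter>
        <title>similarity computation</title>
        <namespace>mo:genre</namespace>
        <method>WordNet</method>
    </parameter>
</parameters>
"""


def methods_by_property(xml_text: str) -> dict:
    """Return a {property: matching method} mapping from the configuration."""
    root = ET.fromstring(xml_text.strip())
    return {
        parameter.findtext("namespace").strip(): parameter.findtext("method").strip()
        for parameter in root.findall("parameter")
    }


for prop, method in methods_by_property(CONFIG).items():
    print(prop, "->", method)
```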

LD-mapper

This tool, developed in Prolog, has no dedicated linkage format of its own: link specifications have to be written as Prolog code.

ODDlinker LinQL

The linkage query language LinQL is an SQL-based format originally developed for duplicate record detection in relational databases. Its usage in the context of linked data is thus limited to relational databases exposed as linked data. LinQL is nonetheless an expressive formalism for link specifications. The language supports many string matching algorithms, hyponyms and synonyms, and conditions on attribute values. The following example query shows a linkage between diagnoses and condition names, using a restriction (visits.tid < 50) and the weighted Jaccard similarity metric.

create linkindex for visits.diagnosis using weightedJaccard;
create linkindex for condition.name using weightedJaccard;
SELECT visits.*, condition.*
FROM visits, condition
WHERE visits.tid < 50
LINK visits.diagnosis WITH condition.name USING weightedJaccard;
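
For readers unfamiliar with the weighted Jaccard measure named in the query, the Python sketch below gives one common formulation: the summed weight of the tokens shared by two strings, divided by the summed weight of all their tokens. How LinQL tokenises values and derives token weights is not described here, so the whitespace tokenisation, the weights dictionary and the default weight of 1.0 are assumptions made for the illustration.

```python
def weighted_jaccard(left: str, right: str, weights: dict, default: float = 1.0) -> float:
    """Weight of shared tokens divided by weight of all tokens."""
    left_tokens = set(left.lower().split())
    right_tokens = set(right.lower().split())

    def weight(token: str) -> float:
        # Tokens without an explicit weight get the default (assumption).
        return weights.get(token, default)

    shared = sum(weight(t) for t in left_tokens & right_tokens)
    total = sum(weight(t) for t in left_tokens | right_tokens)
    return shared / total if total else 0.0


# Hypothetical weights: the token "heart" counts twice as much as the others.
print(weighted_jaccard("heart disease", "heart failure", {"heart": 2.0}))
```

With all weights equal, the measure reduces to the plain Jaccard coefficient on token sets.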

Synthesis

The following table synthesises the answers to the questions above:

Format | Data sets | Resources | Links | Matching | Threshold | Alignment | Pre-processing | Post-processing | Misc.
Silk-LSL | SPARQL endpoint, graph name | resources to interlink, resource type | link condition (for each resource) | string matching, matcher combination | yes | no | transformation functions | link type | path variables
Knofuss | local copy | (SPARQL query) | fusion method | string matching (for each resource) | yes | yes | - | inconsistency resolution | alignment, threshold
RKB-CRS | programming
RDF-AI | local copy | resource descriptions | link description | fuzzy string, WordNet | yes | no | translation, name reordering | no | parameters merge
LD-Mapper | local copy | resource query | link description | string matching | yes | no | tokenization, case, space removal | |
ODD-linker | local copy | resource description (table.column) | link description | synonym, hyponym, weightedJaccard, token intersect | yes | no | | |

François Scharffe and Jérôme Euzenat
http://melinda.inrialpes.fr
$Id: formats.html,v 1.1 2009/10/28 12:58:39 euzenat Exp euzenat $