Processing RDF with SPARQL

Summary

This chapter introduces the sparql-maven-plugin.

Purpose

The plugin makes it possible to generate, transform, enrich, and repair any type of RDF data from within a pom.xml file.
What makes the combination of SPARQL and Maven especially powerful is the interplay of Maven properties with SPARQL:
- Maven metadata can be interpolated into SPARQL query strings.
- Conversely, SPARQL function extensions make it possible to query the pom.xml model using XPath.

Use Cases

The sparql-maven-plugin supports many use cases. A typical one is generating metadata in the following models:
- Vocabulary of Interlinked Datasets (VoID): Captures statistical information about an RDF dataset, such as the frequencies of classes and properties.
- Provenance Ontology (PROV-O): Can capture the plan that was used to produce a dataset. With the maven4data paradigm, this plan is the pom.xml file, which can be referenced by its Maven coordinates.
- Data Catalog Vocabulary (DCAT): Captures versions, publishers, and distributions (= means of access) of a dataset.
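
As a rough illustration of the first two models, the fragment below sketches two inline CONSTRUCT queries in the same arg form that the example under Basic Approach uses. It is an untested sketch, not the plugin's reference usage. It assumes that ${project.*} references are interpolated inside the CDATA sections and that the queries are evaluated against the loaded input data. The #dataset fragment follows the URN pattern of the Basic Approach example. The VoID and PROV-O terms are taken from the respective vocabularies.

```xml
<args>
  <!-- Assumption: ${project.*} references below are interpolated before the queries are evaluated -->

  <!-- Sketch 1 (VoID): count the distinct instances of each class in the input data -->
  <arg><![CDATA[
PREFIX void: <http://rdfs.org/ns/void#>
CONSTRUCT {
  <urn:mvn:${project.groupId}:${project.artifactId}:${project.version}#dataset>
    a void:Dataset ;
    void:classPartition [ void:class ?class ; void:entities ?entities ] .
}
WHERE {
  { SELECT ?class (COUNT(DISTINCT ?s) AS ?entities)
    WHERE { ?s a ?class }
    GROUP BY ?class }
}
]]></arg>

  <!-- Sketch 2 (PROV-O): name the pom.xml, via its Maven coordinates, as the plan behind the dataset -->
  <arg><![CDATA[
PREFIX prov: <http://www.w3.org/ns/prov#>
CONSTRUCT {
  <urn:mvn:${project.groupId}:${project.artifactId}:${project.version}#dataset>
    prov:wasGeneratedBy [
      a prov:Activity ;
      prov:qualifiedAssociation [
        a prov:Association ;
        prov:hadPlan <urn:mvn:${project.groupId}:${project.artifactId}:${project.version}>
      ]
    ] .
  <urn:mvn:${project.groupId}:${project.artifactId}:${project.version}> a prov:Plan .
}
WHERE { }
]]></arg>
</args>
```

The first query emits one void:classPartition per class found in the data. The second needs no input data, because every value it uses comes from the Maven coordinates.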

Limitations

Currently the SPARQL functions do not support generating the effective pom.xml. This is future work and a contribution would be welcome.

Basic Approach

TODO Adapt and test example.

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.aksw.data.config</groupId>
  <artifactId>dcat-generator</artifactId>
  <version>0.0.1-SNAPSHOT</version>

  <properties>
    <input.url>pom.xml</input.url>

    <output.type>ttl.bz2</output.type>
    <output.classifier>dcat</output.classifier>
    <output.path>${project.build.directory}/${project.artifactId}-${output.classifier}.${output.type}</output.path>
    
    <description>This description is shared between maven and SPARQL.</description>
  </properties>

  <description>${description}</description>

  <build>
    <plugins>
      <plugin>
        <groupId>org.aksw.maven.plugins</groupId>
        <artifactId>sparql-maven-plugin</artifactId>
        <version>0.0.1-SNAPSHOT</version>
        <executions>
          <execution>
            <id>generate-metadata</id>
            <phase>process-resources</phase>
            <goals>
              <goal>run</goal>
            </goals>
            <configuration>
              <!-- TDB2 is a disk-based engine. When engine is omitted, the in-memory engine is used. -->
              <engine>tdb2</engine>
              <outputFile>${output.path}</outputFile>
              <outputFormat>${output.type}</outputFormat>
              <env>
                <DATASET>${input.urn.dataset}</DATASET>
                <BASE>${input.urn.base}#</BASE>
                <POM>${input.pom.path}</POM>
              </env>
              <args>
                <!-- -->
                <arg>${input.data.path}</arg>

                <arg>void/sportal/compact/qb2.rq</arg>

<!-- A construct query string whose property references are interpolated before the query is evaluated -->
<arg><![CDATA[
CONSTRUCT {
  <urn:mvn:${groupId}:${artifactId}:${version}#dataset>
    a dcat:Dataset ;
    rdfs:comment """${description}"""@en ;
    .
}
WHERE {
}
]]></arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>