Implementing GRDDL in XProc

Introduction

One of the questions that the W3C XML Processing Model Working Group was trying to answer was whether XProc should define a standard GRDDL step. GRDDL, a charming acronym of no less charming “Gleaning Resource Descriptions from Dialects of Languages”, is a W3C standard that specifies a mechanism for obtaining (“gleaning”) RDF metadata from XML documents by applying one or more transformations to them. Implementations of GRDDL exist for a variety of programming languages (Java, Python, JavaScript, PHP etc.), and since GRDDL can be viewed as an XML transformation pipeline, it seemed only natural to attempt an XProc implementation – if only for the working group to see whether the language is expressive and feature-rich enough for such a task, or whether a special GRDDL step is indeed necessary.

 

As it turns out, it is possible to implement GRDDL entirely by using the standard XProc functionality. This article presents an example of such an implementation, including the complete source code.

 

Prerequisites

To run the GRDDL XProc pipeline, you will need a compliant XProc processor, such as EMC Documentum XProc Engine. The pipeline should also work with other XProc processors, although this has not been tested.

 

As GRDDL relies heavily on various W3C-hosted resources (XHTML DTDs, GRDDL profiles, etc.), you should consider using local copies of these resources and configuring the XProc processor to use an XML catalog. Without a catalog, the GRDDL pipeline will generate a fair amount of traffic on W3C servers. Some of the resources may even be inaccessible because of W3C's traffic policies.

 

This article assumes good familiarity with XProc. No prior knowledge of GRDDL or RDF is necessary, although it is beneficial. The ability to pronounce “STRČ PRST SKRZ KRK” is not required.

 

Extracting resource metadata with GRDDL

One of the key factors in the success of XML undoubtedly lies behind the letter “X” – it is primarily the easy extensibility that has enabled the proliferation of a wide variety of XML dialects (“domain languages”) and the broad level of adoption of XML by the industry. However, the plethora of XML dialects also makes it rather challenging to extract meaningful information from the existing XML content, and to achieve understanding across different domains.

 

Consider the following two XML documents. They clearly represent the same information (a poem by John Keats), only marked up using different vocabularies. The first document uses a custom “poem” markup (amenable to various forms of XML processing or querying), while the other uses XHTML for presenting the poem in a web browser.

 

Figure 1. “This Living Hand” marked up using custom vocabulary

<poem>
  <title>This Living Hand</title>
  <author>John Keats</author>
  <stanza>
    <line>This living hand, now warm and capable</line>
    <line>Of earnest grasping, would, if it were cold</line>
    <line>And in the icy silence of the tomb,</line>
    <line>So haunt thy days and chill thy dreaming nights</line>
    <line>That thou wouldst wish thine own heart dry of blood</line>
    <line>So in my veins red life might stream again,</line>
    <line>And thou be conscience-calmed---see here it is---</line>
    <line>I hold it towards you.</line>
  </stanza>
</poem>

Figure 2. “This Living Hand” marked up as an XHTML document

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>This Living Hand</title>
  </head>
  <body>
    <h1>This Living Hand</h1>
    <p><i>John Keats</i></p>
    <div>
      <p>This living hand, now warm and capable</p>
      <p>Of earnest grasping, would, if it were cold</p>
      <p>And in the icy silence of the tomb,</p>
      <p>So haunt thy days and chill thy dreaming nights</p>
      <p>That thou wouldst wish thine own heart dry of blood</p>
      <p>So in my veins red life might stream again,</p>
      <p>And thou be conscience-calmed---see here it is---</p>
      <p>I hold it towards you.</p>
    </div>
  </body>
</html>

Can software (for instance, a search engine) tell that the two XML documents actually represent a poem (and in fact, the exact same one)?

 

To enable such reasoning, a framework for making statements about resources is necessary. RDF (Resource Description Framework), a widely adopted W3C standard for modeling and describing resource metadata, provides such a framework.

 

In RDF, statements about resources are expressed in the form of (subject, predicate, object) triples. For example, to represent that John Keats is the author of the poem “This Living Hand”, one might use a triple where the subject is the poem, the predicate is “has author” and the object is John Keats. Similarly, another triple might be used to represent the title of the poem, or to record the fact that “This Living Hand” is a poem. The set of all triples is referred to as the RDF graph.

 

RDF at its core is just a data model; in order to represent and exchange RDF graphs, a serialization format needs to be applied. The example below shows one possible RDF graph for “This Living Hand” serialized using RDF/XML:

 

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.org/this_living_hand.xhtml">
    <dc:creator>John Keats</dc:creator>
    <dc:title>This Living Hand</dc:title>
    <dc:type>Poem</dc:type>
  </rdf:Description>
</rdf:RDF>
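
To make the triple structure concrete, the following Python sketch (an illustration only, not part of the XProc pipeline discussed in this article) reads the RDF/XML listing above and prints its three triples. The function name simple_triples is hypothetical, and the code handles only flat rdf:Description blocks like the one shown; it is not a general RDF/XML parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.org/this_living_hand.xhtml">
    <dc:creator>John Keats</dc:creator>
    <dc:title>This Living Hand</dc:title>
    <dc:type>Poem</dc:type>
  </rdf:Description>
</rdf:RDF>
"""


def simple_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for flat rdf:Description blocks.

    Only literal-valued properties and rdf:resource references are handled,
    which is all the listing above needs.
    """
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree spells qualified names as "{namespace}local".
            namespace, local = prop.tag[1:].split("}")
            obj = prop.get(f"{{{RDF_NS}}}resource", (prop.text or "").strip())
            triples.append((subject, namespace + local, obj))
    return triples


triples = simple_triples(RDF_XML)
for triple in triples:
    print(triple)
```

Each printed tuple corresponds to one statement about the poem: its creator, its title, and its type.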

Now, suppose that we have software that can process the above RDF metadata. How do we get such metadata about an XML document, for example the XHTML version of “This Living Hand”? That is where GRDDL comes into the picture.

 

GRDDL, in its essence, provides a mechanism for bootstrapping RDF metadata from XML documents. This is achieved by introducing simple markup that can be used for specifying algorithms – or transformations – for extracting the metadata from the document content.

 

Most often, GRDDL transformations are expressed in the form of XSLT stylesheets that transform the document to RDF/XML, but in theory, any technology can be used: XQuery, XProc, or even C or Java programs.

 

The combined result (an RDF graph) of applying the GRDDL transformations to the XML document is called the GRDDL result.

 

There are four main ways of associating GRDDL transformations with XML documents; they are discussed in the following text. GRDDL implementations may or may not support all of them, and they are also free to choose which of the discovered transformations to apply.

 

XML glean

The basic mechanism for associating GRDDL transformations with XML documents is by adding the grddl:transformation attribute to the document element. The example below illustrates this technique:

 

<poem xmlns:grddl="http://www.w3.org/2003/g/data-view#"
      grddl:transformation="poem2rdf.xsl">
  <title>This Living Hand</title>
  <author>John Keats</author>
  ...
</poem>

The XML document in the example is associated with a GRDDL transformation defined by the poem2rdf.xsl XSLT stylesheet. (The example uses a single transformation, but it is possible to specify multiple transformations if necessary: the value of the transformation attribute is a space-separated list.)

 

The GRDDL result of the above XML document is an RDF graph represented by the result of applying the poem2rdf.xsl stylesheet to the document.
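
The discovery half of the “XML glean” can be sketched in a few lines of Python (an illustration only; the pipeline in this article does the equivalent in XProc). The function name, the document URI, and the second transformation in the sample are assumptions made for the example. The sketch reads the attribute from the document element, splits it on whitespace, and resolves each reference against the document URI.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

GRDDL_NS = "http://www.w3.org/2003/g/data-view#"


def xml_glean_transformations(xml_text, document_uri):
    """Return absolute URIs of the transformations named on the document element."""
    root = ET.fromstring(xml_text)
    value = root.get(f"{{{GRDDL_NS}}}transformation", "")
    # The attribute holds a whitespace-separated list of URI references.
    return [urljoin(document_uri, ref) for ref in value.split()]


# Hypothetical input: the second transformation is added to show the list form.
POEM = """\
<poem xmlns:grddl="http://www.w3.org/2003/g/data-view#"
      grddl:transformation="poem2rdf.xsl http://www.example.org/other.xsl">
  <title>This Living Hand</title>
  <author>John Keats</author>
</poem>
"""

found = xml_glean_transformations(POEM, "http://www.example.org/poems/hand.xml")
print(found)
```

Applying the transformations themselves requires an XSLT processor and is left out of the sketch.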

 

XML namespace glean

While the previous technique works well for individual XML documents, it is sometimes more practical to associate a GRDDL transformation with an entire class of XML documents. GRDDL supports this by allowing transformations to be associated with all XML documents that share the same XML namespace. This is done by specifying the transformations in the namespace document. If the GRDDL result of the namespace document contains the grddl:namespaceTransformation property, this property defines the transformation to apply to the source XML document.

 

To demonstrate how this works, consider the following input XML document:

 

<poem xmlns="http://www.example.org/poem"
      xmlns:grddl="http://www.w3.org/2003/g/data-view#"
      grddl:transformation="third.xsl">
  <title>This Living Hand</title>
  <author>John Keats</author>
  ...
</poem>

 

The document element (poem) is in the http://www.example.org/poem namespace. Suppose that the GRDDL result of the namespace document located at http://www.example.org/poem exists and looks like this:

 

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:grddl="http://www.w3.org/2003/g/data-view#">
 <rdf:Description rdf:about="#">
   <grddl:namespaceTransformation rdf:resource="poem2rdf.xsl"/>
   <grddl:namespaceTransformation rdf:resource="second.xsl"/>
 </rdf:Description>
</rdf:RDF>

 

Then the GRDDL result of the input document will be an RDF graph represented by the merged results of three transformations:

 

  • poem2rdf.xsl – specified in the grddl:namespaceTransformation property in the GRDDL result of the namespace document

     


  • second.xsl – specified in the grddl:namespaceTransformation property in the GRDDL result of the namespace document

     


  • third.xsl – specified in the grddl:transformation attribute in the input document (the “XML glean”)
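
Reading the namespace transformations out of the namespace document's GRDDL result might look like the following Python sketch (an illustration only, not the pipeline's XProc code). The function name linked_transformations is hypothetical, and for brevity the sketch does not check that the subject of each statement is the namespace URI, which a complete implementation would do.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
GRDDL_NS = "http://www.w3.org/2003/g/data-view#"

# GRDDL result of the namespace document, as in the example above.
NAMESPACE_RESULT = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:grddl="http://www.w3.org/2003/g/data-view#">
 <rdf:Description rdf:about="#">
   <grddl:namespaceTransformation rdf:resource="poem2rdf.xsl"/>
   <grddl:namespaceTransformation rdf:resource="second.xsl"/>
 </rdf:Description>
</rdf:RDF>
"""


def linked_transformations(rdf_xml, property_name, result_base_uri):
    """Return absolute URIs named by a GRDDL property in a flat RDF/XML document."""
    root = ET.fromstring(rdf_xml)
    path = f"{{{RDF_NS}}}Description/{{{GRDDL_NS}}}{property_name}"
    return [
        urljoin(result_base_uri, element.get(f"{{{RDF_NS}}}resource"))
        for element in root.findall(path)
    ]


namespace_transformations = linked_transformations(
    NAMESPACE_RESULT, "namespaceTransformation", "http://www.example.org/poem"
)
print(namespace_transformations)
```

The transformation named in the input document itself (third.xsl) is found separately, by the “XML glean”, and added to this list.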

     


XHTML glean

Although XHTML is an XML vocabulary, the “XML glean” technique is not applicable to XHTML documents because the grddl:transformation attribute is not permitted by the XHTML DTD. GRDDL deals with this by making use of the metadata profile feature of XHTML:

 

<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>This Living Hand</title>
    <link rel="transformation" href="xhtmlpoem2rdf.xsl" />
    <link rel="transformation" href="second.xsl" />
  </head>
  <body>
    <h1>This Living Hand</h1>
    <p><i>John Keats</i></p>
    ...
  </body>
</html>

 

The XHTML document in the above example is associated with two GRDDL transformations (xhtmlpoem2rdf.xsl and second.xsl); the combined results of these transformations constitute the GRDDL result of the document.
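
A minimal Python sketch of the discovery step for the “XHTML glean” follows (an illustration only; the function name and the document URI are assumptions). It looks at link elements in the head only, and only when the head declares the GRDDL profile. Other places where a transformation link could appear are deliberately ignored here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XHTML_NS = "http://www.w3.org/1999/xhtml"
GRDDL_PROFILE = "http://www.w3.org/2003/g/data-view"

XHTML = """\
<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.w3.org/2003/g/data-view">
    <title>This Living Hand</title>
    <link rel="transformation" href="xhtmlpoem2rdf.xsl" />
    <link rel="transformation" href="second.xsl" />
  </head>
  <body>
    <h1>This Living Hand</h1>
  </body>
</html>
"""


def xhtml_glean_transformations(xhtml_text, document_uri):
    """Return absolute URIs of transformations linked from the XHTML head."""
    root = ET.fromstring(xhtml_text)
    head = root.find(f"{{{XHTML_NS}}}head")
    # Without the GRDDL profile, rel="transformation" carries no GRDDL meaning.
    if head is None or GRDDL_PROFILE not in head.get("profile", "").split():
        return []
    found = []
    for link in head.findall(f"{{{XHTML_NS}}}link"):
        # rel is itself a whitespace-separated list of link types.
        if "transformation" in link.get("rel", "").split() and link.get("href"):
            found.append(urljoin(document_uri, link.get("href")))
    return found


links = xhtml_glean_transformations(XHTML, "http://www.example.org/hand.xhtml")
print(links)
```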

 

HTML profile glean

Similar to the “XML namespace glean”, it is possible to associate GRDDL transformations with entire classes of XHTML documents that share the same XHTML profile. If the GRDDL result of the XHTML profile document contains the grddl:profileTransformation property, then this property defines the transformation to apply to the source XHTML document.

 

The XHTML document below refers to the metadata profile at http://www.example.org/poem (it also makes use of the http://www.w3.org/2003/g/data-view profile to specify a local “XHTML glean” transformation):

 

<html xmlns="http://www.w3.org/1999/xhtml">
  <head profile="http://www.example.org/poem http://www.w3.org/2003/g/data-view">
    <title>This Living Hand</title>
    <link rel="transformation" href="second.xsl" />
  </head>
  <body>
    <h1>This Living Hand</h1>
    <p><i>John Keats</i></p>
    ...
  </body>
</html>

 

Suppose that the metadata profile exists and its GRDDL result looks as follows:

 

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:grddl="http://www.w3.org/2003/g/data-view#">
 <rdf:Description rdf:about="#">
   <grddl:profileTransformation rdf:resource="xhtmlpoem2rdf.xsl"/>
 </rdf:Description>
</rdf:RDF>

 

Then the GRDDL result of the input XHTML document will be an RDF graph obtained by merging the results of two transformations:

 

  • xhtmlpoem2rdf.xsl – specified in the grddl:profileTransformation property in the GRDDL result of the profile document

     


  • second.xsl – specified in the input XHTML document (the “XHTML glean”)

     


Gleaning with XProc

At its core, the GRDDL algorithm is actually quite simple. It can be viewed as consisting of two steps:

 

  1. Identifying the GRDDL transformations for the document; and

     


  2. applying any or all of the transformations.

     


(As a special case, if the input document is an RDF/XML document, the result of parsing this document is its GRDDL result.)

 

Depending on the glean type (XML glean, XML namespace glean, XHTML glean, HTML profile glean), the exact details of how the GRDDL transformations are located may vary, but the essence of the algorithm remains the same.

 

One interesting aspect of the GRDDL algorithm is – as some readers may already have guessed – that it is recursive. Take for example the “XML namespace glean” and the phrase: “If the GRDDL result of the namespace document contains the grddl:namespaceTransformation property...”. This basically implies that after locating and retrieving the namespace document, the GRDDL implementation must apply the GRDDL algorithm to the namespace document to get its GRDDL result. If the namespace document is in a namespace, or if it is an XHTML document with a metadata profile, the recursion continues until a non-recursive state is reached or a loop is detected.
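
The recursion and the loop check can be illustrated with a toy Python model (not the XProc pipeline itself). Everything in it is an assumption made for the example: the in-memory WEB dictionary stands in for documents on the network, the functions in TRANSFORMS stand in for XSLT stylesheets, and the predicate names are shortened strings rather than full URIs. Visited resources are tracked by URI only, which is a simplification of the [URI, glean mode] entries used by the pipeline.

```python
# Hypothetical stand-ins for XSLT stylesheets: each maps a document URI
# to the set of triples the stylesheet would produce for that document.
TRANSFORMS = {
    "poem2rdf.xsl": lambda uri: {(uri, "dc:type", "Poem")},
    "ns2rdf.xsl": lambda uri: {(uri, "namespaceTransformation", "poem2rdf.xsl")},
}

# Hypothetical documents: the transformations each one names itself,
# and the URI of its namespace document (or None).
WEB = {
    "http://www.example.org/hand.xml": {
        "transformations": [],
        "namespace": "http://www.example.org/poem",
    },
    "http://www.example.org/poem": {
        "transformations": ["ns2rdf.xsl"],
        "namespace": None,
    },
    # Two documents whose namespaces point at each other (a loop).
    "http://www.example.org/a": {"transformations": [], "namespace": "http://www.example.org/b"},
    "http://www.example.org/b": {"transformations": [], "namespace": "http://www.example.org/a"},
}


def grddl_result(uri, web, visited=frozenset()):
    """Return the set of triples gleaned from `uri`, recursing into namespace documents."""
    if uri in visited:
        return set()  # loop detected: stop instead of recursing forever
    visited = visited | {uri}
    document = web[uri]
    transformations = list(document["transformations"])
    if document["namespace"] is not None:
        # The namespace document must be gleaned first; its result may
        # name further transformations for the current document.
        namespace_result = grddl_result(document["namespace"], web, visited)
        transformations += [
            obj for (_, predicate, obj) in sorted(namespace_result)
            if predicate == "namespaceTransformation"
        ]
    result = set()
    for name in transformations:
        result |= TRANSFORMS[name](uri)
    return result


poem_result = grddl_result("http://www.example.org/hand.xml", WEB)
loop_result = grddl_result("http://www.example.org/a", WEB)
print(poem_result)
print(loop_result)
```

The first call recurses once into the namespace document and then applies the transformation found there; the second call terminates because the loop between the two documents is detected.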

 

At a more detailed level, a robust GRDDL implementation also needs to process base URI information (the xml:base attribute in XML and the base element in XHTML) correctly in order to locate resources specified using relative URIs. Support for the xml:lang XML attribute may also be required in some cases, as may XInclude processing.

 

XProc supports most of the above, and it also provides enough expressive power to make implementing the recursive GRDDL algorithm possible. The translation of GRDDL into XProc eventually resulted in a pipeline with the following signature:

 

<p:declare-step type="xg:grddl" version="1.0"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:xg="http://www.example.org/ns/xproc/grddl">
  <p:input port="source"/>
  <p:output port="result" sequence="true"/>
  <p:option name="xml-glean" select="'true'"/>
  <p:option name="xmlns-glean" select="'true'"/>
  <p:option name="xhtml-glean" select="'true'"/>
  <p:option name="xhtml-profile-glean" select="'true'"/>

  ...
</p:declare-step>

(The complete source code is available in the attachment of this article.)

 

The pipeline expects an XML document on the source input port and produces a sequence of RDF/XML documents on the result output port. The options xml-glean, xmlns-glean, xhtml-glean, and xhtml-profile-glean can be used for specifying which of the GRDDL glean methods to apply. By default, all methods are enabled.

 

The pipeline can be used directly (for example, from the command-line or in an XProc/RDF-enabled software application) as well as in other XProc pipelines:

 

<p:declare-step version="1.0"
                xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:xg="http://www.example.org/ns/xproc/grddl">
  <p:input port="source"/>
  <p:output port="result" sequence="true"/>
  <p:import href="grddl.xpl"/>

  <xg:grddl xhtml-profile-glean="false"/>
</p:declare-step>

 

The individual glean methods are implemented as nested steps. In addition to these there are also nested steps that provide common functionality, such as applying a GRDDL transformation, keeping track of visited resources, or displaying debug output.

 

The pipeline uses plain XProc for the most part, but some of the functionality has been implemented in XSLT (using the p:xslt step). Obviously, the use of XSLT is implied by the nature of GRDDL transformations, which are typically XSLT stylesheets, but the pipeline also uses XSLT (2.0) for other tasks such as tokenizing and preprocessing resource URIs.

 

The pipeline has been tested against the official GRDDL test suite. The test results are summarized below:

 

Table 1. GRDDL test suite results

Test category                                          Passed
Localized Tests                                        100% (12/12)
Namespace Documents and Absolute Locations             100% (12/12)
Library Tests                                          69% (9/13)
Ambiguous Infosets, Representations, and Traversals    80% (24/30)

Note: Some of the tests in the last category are mutually exclusive. Ignoring those, there is only 1 failed test.

Although the pipeline does not pass 100% of the tests (there are a small number of corner case tests that fail because of incomplete support for xml:lang and HTTP content negotiation), it is probably robust enough for common use. If needed, the missing functionality can always be added to the pipeline.

 

The pipeline currently supports only XSLT for GRDDL transformations. However, it should be relatively straightforward to extend the pipeline to support other types of transformations as well – for instance, XQuery (using the p:xquery step), XProc (using an “eval” extension step like EMC XProc Engine's emx:eval) or external programs (using the p:exec step).

 

Also note that the XProc standard does not provide any support for dealing with RDF data. The GRDDL XProc pipeline produces a sequence of RDF/XML documents, but in order to process this data further in XProc, you will need RDF-aware extension XProc steps. (For example, when testing the pipeline, we used a custom RDF step that takes a sequence of RDF/XML documents on its input and performs an RDF merge on the corresponding RDF graphs.)
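
What an RDF merge involves can be sketched in Python on a toy representation (this is not the custom step mentioned above, whose implementation is not shown in this article). The sketch assumes that graphs are sets of (subject, predicate, object) string tuples and that blank nodes are written as strings starting with "_:"; both conventions are assumptions of the example. The merge is a set union in which blank nodes from different graphs are first renamed so that they cannot collide.

```python
def rdf_merge(graphs):
    """Merge toy RDF graphs, keeping blank nodes from different graphs apart."""
    merged = set()
    for index, graph in enumerate(graphs):

        def rename(term, index=index):
            # Blank node labels are local to their graph, so prefix them
            # with the graph's position before taking the union.
            return f"_:g{index}_{term[2:]}" if term.startswith("_:") else term

        for subject, predicate, obj in graph:
            merged.add((rename(subject), predicate, rename(obj)))
    return merged


POEM_URI = "http://www.example.org/this_living_hand.xhtml"
first = {(POEM_URI, "dc:title", "This Living Hand"), ("_:a", "dc:creator", "John Keats")}
second = {(POEM_URI, "dc:title", "This Living Hand"), ("_:a", "dc:type", "Poem")}

merged = rdf_merge([first, second])
for triple in sorted(merged):
    print(triple)
```

The title statement appears in both inputs but only once in the merged graph, while the two unrelated "_:a" nodes end up with distinct labels.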

 

Example: “XML glean” implementation

The listing below shows the XProc implementation of the “XML glean” method:

 

  <p:declare-step type="xg:xml-glean" name="xml-glean"
                  xmlns:grddl="http://www.w3.org/2003/g/data-view#"
                  xmlns:xg="http://www.example.org/ns/xproc/grddl"
                  xmlns:xgv="http://www.example.org/ns/xproc/grddl-vocab">
    <p:input port="source" primary="true"/>
    <p:input port="visited">
      <p:inline>
        <xgv:visited/>
      </p:inline>
    </p:input>
    <p:output port="result" sequence="true" primary="true">
      <p:pipe step="glean" port="result"/>
    </p:output>
    <p:output port="result-visited">
      <p:pipe step="glean" port="result-visited"/>
    </p:output>
    <p:option name="enabled" select="'true'"/>

    <p:variable name="base-uri" select="p:base-uri()"/>

    <p:choose name="glean">
      <p:when test="$enabled != 'true' or //xgv:resource[@uri=$base-uri and @mode='xml']">
        <p:xpath-context>
          <p:pipe step="xml-glean" port="visited"/>
        </p:xpath-context>
        <p:output port="result" sequence="true">
          <p:empty/>
        </p:output>
        <p:output port="result-visited">
          <p:pipe step="xml-glean" port="visited"/>
        </p:output>
        <xg:log message="Glean mode disabled or resource already processed"/>
        <p:sink/>
      </p:when>

      <p:otherwise>
        <p:output port="result" sequence="true">
          <p:pipe step="apply-transformations" port="result"/>
        </p:output>
        <p:output port="result-visited">
          <p:pipe step="update-visited" port="result"/>
        </p:output>

        <p:choose name="apply-transformations">
          <p:when test="/*/@grddl:transformation">
            <p:output port="result" sequence="true"/>
            <xg:apply-transformations-literal>
              <p:input port="source">
                <p:pipe step="xml-glean" port="source"/>
              </p:input>
              <p:with-option name="transformations" select="/*/@grddl:transformation"/>
              <p:with-option name="base-uri" select="$base-uri">
                <p:empty/>
              </p:with-option>
              <p:with-option name="output-base-uri" select="$base-uri">
                <p:empty/>
              </p:with-option>
            </xg:apply-transformations-literal>
          </p:when>
          <p:otherwise>
            <p:output port="result" sequence="true"/>
            <p:identity>
              <p:input port="source">
                <p:empty/>
              </p:input>
            </p:identity>
          </p:otherwise>
        </p:choose>

        <xg:add-visited mode="xml" name="update-visited">
          <p:input port="source">
            <p:pipe step="xml-glean" port="visited"/>
          </p:input>
          <p:with-option name="uri" select="$base-uri">
            <p:empty/>
          </p:with-option>
        </xg:add-visited>
      </p:otherwise>
    </p:choose>
  </p:declare-step>

 

The rather lengthy (mainly due to the inherent verbosity of p:choose) code does the following:

 

  1. If the “XML glean” method is disabled or the document appearing on the source input port has already been processed using the “XML glean” method, execution stops and an empty sequence of documents is produced on the result output port.

     


  2. Locate the GRDDL transformations and apply them to the source XML document.

     


  3. Add the source document to the list of documents processed using the “XML glean” method.

     


  4. Return the sequence of results of the GRDDL transformations.

     


Of note is the use of special visited and result-visited ports in the step. This was necessary to make recording of already processed resources possible. Since XProc does not provide any support for global variables, the information about the visited resources needs to be passed (in the form of an XML document) from step to step. The implementations of the steps then usually follow the same pattern:

 

  1. Check if the document appearing on the visited input port already contains the entry [resource base URI, glean mode]. If so, exit.

     


  2. Process the resource.

     


  3. Produce a new “visited” document on the result-visited output port by adding an entry [resource base URI, glean mode].
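
The same three-step pattern can be written down in Python as a rough analogue (an illustration only; the pipeline does this with XProc ports). The xgv:visited and xgv:resource elements and their uri and mode attributes follow the listing above, while the function names, the placeholder result, and the "xmlns" mode name in the usage lines are assumptions. As in XProc, nothing is modified in place: each step receives a “visited” document and returns a new one alongside its results.

```python
import copy
import xml.etree.ElementTree as ET

XGV_NS = "http://www.example.org/ns/xproc/grddl-vocab"
ET.register_namespace("xgv", XGV_NS)


def is_visited(visited, uri, mode):
    """True if the visited document already records (uri, mode)."""
    return any(
        resource.get("uri") == uri and resource.get("mode") == mode
        for resource in visited.iter(f"{{{XGV_NS}}}resource")
    )


def add_visited(visited, uri, mode):
    """Return a copy of the visited document with one more entry."""
    updated = copy.deepcopy(visited)
    ET.SubElement(updated, f"{{{XGV_NS}}}resource", {"uri": uri, "mode": mode})
    return updated


def glean_step(visited, uri, mode):
    """Return (results, new_visited) following the three-step pattern."""
    if is_visited(visited, uri, mode):
        return [], visited  # step 1: already processed, nothing to do
    results = [f"placeholder result of {mode} glean for {uri}"]  # step 2
    return results, add_visited(visited, uri, mode)  # step 3


empty = ET.Element(f"{{{XGV_NS}}}visited")
first_results, after_first = glean_step(empty, "http://www.example.org/hand.xml", "xml")
repeat_results, after_repeat = glean_step(after_first, "http://www.example.org/hand.xml", "xml")
other_results, after_other = glean_step(after_first, "http://www.example.org/hand.xml", "xmlns")

print(ET.tostring(after_other, encoding="unicode"))
```

The second call returns no results because the same resource and mode are already recorded, whereas the third call proceeds because the mode differs.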

     


Summary

GRDDL provides a simple yet quite powerful mechanism for obtaining RDF metadata from XML documents. This article discussed an attempt at implementing GRDDL in XProc. The motivation for doing this was twofold: first, it was a nice practical test of the expressivity and usefulness of XProc, and second, it provides a nice – hopefully – example of a larger and more complex XProc pipeline.

 

The pipeline presented here is just a starting point. While it implements most of the core GRDDL algorithm, it can certainly be extended and improved in many ways. But it already provides an interesting “XML-native” alternative to the available GRDDL implementations, written mostly in traditional programming languages such as Java, Python, or JavaScript.