Towards s-expression based XPath/XSLT implementation

Abstract

It's supposed that Lisp languages are ideal for implementing XML standards. Then why we don't have an XSLT processor written in Common Lisp or Scheme? I'm ignoring business issues and want to expose a technical problem in this httpresearch paper.

The popular representations of XML become showstoppers in implementing hidden XML/XPath/XSLT features. In this paper, I try to list the issues, both generic and specific to the representations:

Also I record some low-level issues related to implementing XPath and XSLT.

Finally, I introduce GSXML, a dialect of SXML.

Disclamer 1. This text isn't final. It's just a draft for possible papers. Your comments are welcome httpin my blog.

Disclamer 2. I'm not writing an XPath/XSLT implementation.

1. Problems of the main representations in httpresearch papers

1.1. Object oriented representation

It is used, for example, in CL-XML project {CL-XML} to implement XPath and XQuery. Using objects, there should be no showstoppers. I have only few objections:

On coding. With SXML, to apply a function to all children, one uses the core Scheme function <tt>map</tt>. With DOM-like object interface it becomes a loop with calls to <tt>getNextSibling</tt>.

The main issue is portability. Nearly each Scheme implementation has its own object system, and CLOS/MOP implementations sometimes incompatible between Common Lisp systems. In particular, it took me time to make CL-XML working under <tt>cmucl</tt> Common Lisp system.

1.2. DSSSL representation

XML fragment

<pre> <elem attr1="val1" attr2="val2">

  <subelem />

</elem> </pre>

is represented by DSSSL {DSSSL} as:

<pre> (elem

  attr1: "val1"
  attr2: "val2"
  (subelem))

</pre>

I see the following problems here:

What is a reasonable way to represend a standalone attribute? I think it's something like:

<pre> (attr1: "val1") </pre>

Is it easy to get this from the XML fragment? No, <tt>cdr</tt> produces:

<pre> (attr1: "val1" attr2: "val2" (subelem)) </pre>

Principally, it's possible to ignore everything after the second element, but this solution is somehow unpleasant.

To get rid of the tail, one can write a function like:

<pre> (define (drop-after-tail attr)

  `((car attr) (cadr attr)))

</pre>

Unfortunately, it only makes the situation worse:

<pre> (define a1 (drop-after-tail attr)) (define a2 (drop-after-tail attr)) </pre>

It's must to have <tt>a1</tt> and <tt>a2</tt> equal (<tt>eq?</tt>), but they are not equal.

1.3. SXML, symmetry of elements and attributes

The SXML {SXML} representation of the XML fragment from the previous section:

<pre> (elem

  (@
    (attr1 "val1")
    (attr2 "val2"))
  (subelem))

</pre>

One (mis)feature of SXML is the symmetry of elements and attributes. The only difference between them is that attributes are grouped under the special node <tt>@</tt>. The similarity leads to the problem.

For example, we want evaluate an XPath expression and attach the result to a tree. The SXPath expression is, for example, "<tt>(@ attr1)</tt>". It is evaluated to

<pre> (attr1 "val1") </pre>

When inserting the result to the tree, how it should be inserted, as an element or as an attribute? The code can't decide on its own.

The problem can be workarounded by wrapping the attribute by the special node <tt>@</tt>:

<pre> (@ (attr1 "val1")) </pre>

Unfortunately, the workaround leads to the identity problem. When using XPath to get an attribute, we get

<pre> (@ (attr1 "val1")) </pre>

When using <tt>car</tt>/<tt>cdr</tt> to get the same attribute, we get

<pre> (attr1 "val1") </pre>

Again, the results must be equal (<tt>eq?</tt>), but they are not equal.

The same problem exists for namespace declarations.

Use cases

Why to distinguish elements and attributes. XSLT processor, implementing <pre> <xsl:copy-of select="name | @name" /> </pre>

Why identity is important. Nodeset, returned by XPath, should have no duplicates. To delete duplicates, we need to uniquely identify nodes.

2. Hidden XML/XPath/XSLT features

2.1. Order of attributes

In XML, attributes are unordered. In s-expressions, there is order. Practically, I've never had it as an issue. It's possible just ignore the order in s-expressions.

2.2. Parent and document pointers

XML nodes have the parents and left siblings, s-expressions -- don't have. It's an imporant issue because XPath expressions support navigation over ancestors and siblings.

Different approaches for solving the problem are investigated in the paper "On parent pointers in SXML trees" {PARENT} by Oleg Kiselyov. Regardless of an approach, we can consider that nodes have an auxiliary attribute "<tt>@@parent</tt>", either implicit, either explicit.

Another useful auxiliary attribute is "<tt>@@doc</tt>", a pointer to the containing document. The document property is required to implement XPath/XSLT function <tt>key</tt>, and also might be useful for optimizations.

2.3. After an XPath step

After each XPath step, the resulted nodeset should be sorted in the document order, and duplicates should be removed.

This procedure is often ignored in scientific papers and naive pseudo-XPath implementations.

2.4. Adjacent text

According to the XSLT specification, adjacent text nodes should be joined into one united text node before starting transformation.

This procedure also applies to re-processing the output using the extension function <tt>node-set</tt>.

2.5. Sharing nodes between trees

XSLT expression

<pre> <xsl:copy-of select="..." /> </pre>

copies the selected nodes to the output tree. It might be inefficient. It is better just to put nodes to the output instead of making a full copy, if possible.

When a node is put to the output tree, the node gets a second parent and second associated document. It can be ignored as long as the plain XSLT 1.0 is used.

Indeed, the only way to access an already generated subtree is to put the subtree to a variable. More precisely, to a variable of the type "result tree fragment" (RTF). RTFs don't support XPath expressions, so there is no need to support XPath and, consequently, parent axes and keys.

This behaviour is changed in XSLT 2.0. More, XSLT 1.0 processors implement de-facto standard function <tt>node-set</tt> to convert an RTF to a nodeset. Therefore, the problem of multiple parents shoule be addressed.

The following recipe should work. Don't make a full copy, but put nodes as is to the output tree. If the function <tt>node-set</tt> is used, create a new XML document, import the RTF to it by copy, and use the new XML tree as a nodeset. If <tt>node-set</tt> is called again for the same RTF, use the alredy created nodeset. (TODO: check if it is ok for XSLT 2.0.)

I've said that we can ignore the second parent and second associated document. It's not completely true from the point of view of the memory manager. When the user releases a tree, the memory manager should be smart enough. It shouldn't delete a node which is used in another tree and shouldn't delete the same node twice.

3. GSXML format

GSXML format is a dialect of the SXML format, with an alternative representation of attributes and auxiliary nodes.

In GSXML, just like in SXML, an XML tree is represented the following way. The name of the root element is the head of the representing s-expression, the children are recursively represented in the tail. XML document node, processing instructions and comment are represented as elements with the special names <tt>*TOP*</tt>, <tt>*PI*</tt> and <tt>*COMMENT*</tt>, respectively.

Attributes are represented as an element with the name "<tt>@<i>aname</i></tt>", where <i>aname</i> is the name of the attribute. Auxiliary nodes are represented by an element with the name starting with <tt>@@</tt>.

The list of children starts with the auxiliary nodes (if any, unordered), continues with the attribute nodes (if any, unordered) and finishes with the element and special nodes (ordered).

Namespace declarations are represented as the corresponding <tt>xmlns</tt> attribute. Otherwise, namespace support is exactly the same as in SXML.

Example. An XML fragment.

<pre> <?xml version="1.0"?> <!-- $Id$ --> <article id="hw">

  <title>Hello</title>
  <para>Hello, <object>World</object>!</para>

</article> </pre>

The corresponding GSXML representation.

<pre> (*TOP*

  (*COMMENT* " $Id$ ")
  (article  (@id "hw")  "\n  "
    (title "Hello")  "\n  "
    (para "Hello, " (object "World") "!")  "\n"))

</pre>

References

{CL-XML} James Anderson, Common Lisp support for XML, httphttp://pws.prserv.net/James.Anderson/XML/

{DSSSL} International Standards Organization, Document Style Semantics and Specification Language, 1996, International Standard ISO/IEC 10179:1996(E)

{SXML} Oleg Kiselyov, SXML specification, March 12, 2004, httphttp://okmij.org/ftp/Scheme/SXML.html

{PARENT} Oleg Kiselyov, On parent pointers in SXML trees, Version is 1.1 Feb 12, 2003, httphttp://okmij.org/ftp/Scheme/xml.html#parent-ptr. Discussion: http1, http2, http3, http4, http5. httpMy notes (in Russian).

Not referenced:

Lisp Vs Xml, httphttp://c2.com/cgi/wiki?LispVsXml, Xml Isa Poor Copy Of Ess Expressions, httphttp://c2.com/cgi/wiki?XmlIsaPoorCopyOfEssExpressions.

Paul Prescod, XML is not S-Expressions, httphttp://www.prescod.net/xml/sexprs.html.

XSieve book, httphttp://xsieve.sourceforge.net/. XSieve: extending XSLT with the roots of XSLT, httphttp://xtech06.usefulinc.com/schedule/paper/44.

Kirill Lisovsky, STX: XSLT-like XML transformation in Scheme, httphttp://www196.pair.com/lisovsky/transform/stx/

Jim Bender, (X)Querying XML in Scheme, httphttp://celtic.benderweb.net/webit/docs/xquery-pre/

SXQuery project at ECLab Wiki, httphttp://wiki.eclab.byu.edu/index.cgi?SXQuery

SDOM is an implementation of the W3C DOM recommendation (Level 3), including support for event handling, in the Scheme programming language. httphttp://www.nongnu.org/sdom/

Updates

On alternative XML syntaxes

My standard answer to the authors of alternative XML syntaxes:

By the way, there is a big difference in our approaches. You define a superset of XML, and I don't. I use SXML as an in-memory representation of XML, useful for my developments.

2 Oct 2006, after the talk with Kirill Lisovsky

GSXML is faster when getting an attribute.

GSXML is slower when getting the list of attributes or list of children elements. A lot of calls to "<tt>symbol->string</tt> to check the first character for the attribute signature "<tt>@</tt>".

SXML, the attributes identity problem: no workaround, straightforward "<tt>eq?</tt>" isn't possible. Advanced comparator is required.

Namespaced in SXML. Approach is a bit unsatisfactory, but still better than alternatives.


Last edited on June 16, 2009 11:32 am.