A Short Path to XQuery

Introduction

Unlike programming languages such as Java and C++, which are built from statements, XQuery has expressions at its core. There are various expressions available, ranging from sorting to function calls, but one of the simplest is a direct node constructor. For example:

 <myElement/>

This is an XQuery query and it also happens to be a well-formed XML document. Regular XML nodes can be used as expressions in XQuery. Other expressions can be embedded using curly braces:

 (: Copy the value of xml:id attribute from other.html. This is a comment by the way! :)
 <html xmlns="http://www.w3.org/1999/xhtml"
       xml:id="{doc("other.html")/html/@xml:id}"/>

Selecting Nodes with Paths

In C++ and Java, for loops and function recursion are used to perform iterative tasks. In contrast, XQuery supplies declarative path expressions, specifically designed to make iterative tasks concise and precise.

A path expression consists of one or more steps separated by slash or double slash, where the result of each step becomes the focus for the next step. The resulting nodes are always delivered in document order and without duplicates. For instance:

 doc('index.html')//p

selects all p elements in the document index.html regardless of where they appear.

A path expression is always evaluated relative to a particular focus. Sometimes this focus is created by the path expression itself, as seen above by using the doc() function. In other cases the focus is set by a parent expression, or by calling QXmlQuery::setFocus().

The focus is usually a sequence of nodes, but it can consist of atomic values too. This focus acts as a context, which the subsequent step filters. A step is invoked for each item in the focus; for each item that the step accepts, the next step is applied. This way paths provide iteration, similar to a nested set of for loops. This query:

 doc('index.html')//p/span

contains one top-level path expression which has four steps: a call to the doc() function, followed by three axis steps (the // abbreviation stands for a step of its own). Reading through the query in a step-by-step manner gives:

  1. For each node returned by the function doc() (which is one document node),
  2. for each node that appears as a descendant of the document node and is an element with name p,
  3. return the child elements whose name is span.

This query evaluates to zero or more span elements.
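
The nested-loop reading can be spelled out with explicit for expressions. This is only a sketch: the variable names $d and $p are illustrative, and unlike a path expression, a for expression neither sorts its result into document order nor removes duplicates, so the two forms agree only when that makes no difference (for instance when p elements do not nest inside each other):

 (: Outer loop: the single document node returned by doc(). :)
 for $d in doc('index.html')
 (: Inner loop: every p element below that document node. :)
 return for $p in $d/descendant::p
        (: Innermost step: the span children of the current p. :)
        return $p/child::span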

Although a step in a path can be any expression, as long as only the last step evaluates to atomic values, the most common case is that steps are axis steps. An axis step consists of two parts: an axis and a node test. For each node in the focus of the step, the axis is navigated and the node test is applied to each node along that axis. In fact, the name tests above are a short form for the combination of the child axis and the element() node test. This query evaluates to exactly the same output:

 doc('index.html')/descendant-or-self::element(p)/child::element(span)

There is a wide range of different axes and node tests, in order to be able to filter nodes in specific ways.

Axes

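These are the axes available in XPath, and hence also XQuery: child, descendant, descendant-or-self, parent, ancestor, ancestor-or-self, following, following-sibling, preceding, preceding-sibling, attribute and self.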
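
As a small sketch of how some axes navigate, the following query walks around a tiny tree built with a node constructor. The element names are made up for illustration, and the comments give the names each path is expected to yield:

 let $doc := <doc><a/><b><c/></b><d/></doc>
 return (
     $doc/b/preceding-sibling::*/name(),  (: a :)
     $doc/b/following-sibling::*/name(),  (: d :)
     $doc//c/parent::*/name(),            (: b :)
     $doc//c/ancestor::*/name()           (: doc b :)
 )

Note that the ancestors come back in document order, outermost first, even though the ancestor axis itself is walked in the reverse direction.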

Node tests

Node tests generally filter nodes based on their kind (whether the node is a processing instruction or not, for instance), their name, or a combination of the two.

Name Tests

Name tests are namespace-aware, meaning that a plain name test such as svg won't match the document element of an SVG document, since the document element is in the SVG namespace. Namespace declarations in the query can be used to supply namespaces. This query:

 declare namespace s = "http://www.w3.org/2000/svg";
 declare default element namespace "http://www.w3.org/2000/svg";
 let $doc := doc('image.svg')
 return ($doc/svg,
         $doc/s:svg)

selects the document element twice, because the two name tests match the same element, although the namespace is supplied differently each time: once through a prefix binding, and once by picking up the default element namespace.

Names and Wildcards

Names can be combined with wildcards in order to select, for instance, any element or attribute in a particular namespace, or any element or attribute with a particular local name regardless of its namespace. This is achieved by using a wildcard as the prefix or as the local name. For instance this query:

 declare namespace xlink = "http://www.w3.org/1999/xlink";
 doc('image.svg')//@xlink:*

selects all the attributes that are in the XLink namespace, and this query:

 doc('data.xml')/*:body

selects an element whose local name is body, regardless of its namespace.

The following are various kind tests.

Kind Tests

In addition there are node tests related to the schema types of nodes, although they are currently unsupported by QtXmlPatterns.
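
The kind tests that do not depend on schema types can be tried out on a small constructed element. This is a sketch with made-up content; the comments give the count each test is expected to produce:

 let $doc := <doc><!-- note --><?target data?><e/>some text</doc>
 return (
     count($doc/comment()),                 (: 1 :)
     count($doc/processing-instruction()),  (: 1 :)
     count($doc/element()),                 (: 1 :)
     count($doc/text()),                    (: 1 :)
     count($doc/node())                     (: 4, since node() matches any kind :)
 )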

Abbreviated syntax

For many common tasks the full axis step syntax is a bit verbose, and for that reason simplified alternatives exist, which typically combine axes and node tests. Some examples:
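
A few of the standard abbreviations, sketched as comparisons between the short and the full form on a small constructed tree (the element and attribute names are illustrative). Each comparison is expected to evaluate to true:

 let $doc := <doc><p class="a"><span/></p><p/></doc>
 return (
     (: A bare name test uses the child axis. :)
     count($doc/p) = count($doc/child::p),
     (: // is short for /descendant-or-self::node()/ :)
     count($doc//span) = count($doc/descendant-or-self::node()/child::span),
     (: @ is short for the attribute axis. :)
     string($doc/p[1]/@class) = string($doc/p[1]/attribute::class),
     (: .. is short for parent::node() :)
     $doc//span/.. is $doc//span/parent::node()
 )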

More on Focus and Filtering: Predicates

In addition to steps as a way to filter content, XPath and XQuery have the predicate expression: an expression followed by a second expression enclosed in square brackets. For instance this query:

 doc("index.html")/html/body/p[@xml:id = "thatSpecialOne"]

selects the paragraph whose xml:id attribute has the value thatSpecialOne.

Like steps in path expressions, predicates also make use of the focus. For each item in the source sequence, the predicate is applied, and if the item passes the filter, it is part of the result. Inside a predicate (and inside steps too) the current context item can be accessed by using the dot expression. Consider this query:

 doc('index.html')//p[string-length(.) = 0]

For each p element that the node test returns, the predicate is invoked. If the predicate expression evaluates to true, the node is returned, which is the case when the length of the string value of the predicate's context item is zero.

There are two kinds of predicates: numeric predicates and truth predicates.

Select based on Positions and Numeric Ranges

While a predicate is applied to its focus, the current context position can be obtained by using the function position(). For instance, this query:

 doc('index.html')//p[position() > 5]

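selects all the paragraphs except the first five p children of each parent element. The predicate belongs to the p step, so positions are counted among the p children of one parent at a time, not across the whole document.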

In addition to position(), the function last() also returns a number related to the focus: the position of the last item. last() by itself inside a predicate simply selects the last item in the input sequence, but it can also be combined with, for instance, an offset:

 doc('index.html')//p[last() - 1]

which would return the next-to-last p child of each parent element.

Positions inside a focus start from one, not zero.
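
Numeric predicates work on any sequence, not only on nodes, which makes them easy to try out. A minimal sketch on a sequence of integers, where the variable name $items is illustrative and the comments give the expected selection:

 let $items := (10, 20, 30, 40)
 return (
     $items[1],               (: 10, since positions start at one :)
     $items[position() > 2],  (: 30 40 :)
     $items[last()],          (: 40 :)
     $items[last() - 1]       (: 30 :)
 )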

Filtering based on Logic

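If a predicate doesn't evaluate to a number, it is considered a truth predicate. A truth predicate takes the value the predicate expression evaluates to and computes its effective boolean value. The rules for how that is done are as follows:

  1. The empty sequence is false.
  2. A sequence whose first item is a node is true.
  3. A single xs:boolean value is that value.
  4. A single string value (xs:string, xs:anyURI or xs:untypedAtomic) is false if it has zero length, and true otherwise.
  5. A single numeric value is false if it is zero or NaN, and true otherwise.
  6. Anything else is an error.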
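
The effective boolean value of an expression can be inspected directly with the built-in function boolean(). A minimal sketch with operands chosen for illustration; the comments give the expected result and the reason:

 (
     boolean(()),       (: false: the empty sequence :)
     boolean(<e/>),     (: true: the first item is a node :)
     boolean(""),       (: false: a zero-length string :)
     boolean("false"),  (: true: a non-empty string, whatever it says :)
     boolean(0),        (: false: numeric zero :)
     boolean(42)        (: true: a non-zero number :)
 )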

For instance, this query:

 doc('index.html')/html/body/p[table]

selects all paragraphs that have a table as a child, since the predicate evaluates to true if the contained step, table, matches any nodes. This is of course very different from:

 doc('index.html')/html/body/p/table

which returns the tables found inside paragraphs (which should be none, since they cannot appear there).

Creating nodes

While the XQuery language has a lot of functions and expressions for selecting and filtering existing content, it can also create new content using its node constructors. Consider:

 <doc xmlns="http://example.com/Namespace" xml:base="http://example.com/">

     <!-- a comment -->
     <?target data?>
     <anotherElement/>
     some text
 </doc>

While this looks like an XML document, and in fact is one, it is also a valid XQuery query. Node constructors are by and large just like XML, so if one knows XML, one can simply continue to write XML inside queries whenever one needs to have nodes created. There are however two things that set direct node constructors apart from XML: one can embed XQuery expressions inside of them, and they are expressions themselves. Let's first look at the former.

Computing values inside nodes

Creating a value inside a node at runtime is done by embedding expressions inside curly braces. For instance, this expression, simple as it is, constructs an element with the text node "6" inside of it:

 <e>{sum((1, 2, 3))}</e>

Similarly, one can embed expressions inside attributes. For instance:

 declare variable $additionalClass := "example";
 <p class="important {$additionalClass} obsolete"/>

creates an element whose attribute called class has the value "important example obsolete", without quotes.

Node Constructors are Expressions

Because node constructors are expressions just like for instance function calls, paths and literals, they can appear anywhere where expressions can appear. For instance:

 let $docURI := 'maybeNotWellformed.xml'
 return if(doc-available($docURI))
        then doc($docURI)//p/<para>{./node()}</para>
        else <para>Failed to load {$docURI}</para>

If maybeNotWellformed.xml can be read successfully it creates a para element for each p element that appears anywhere in the document and copies p's child nodes into it. But if the document cannot be loaded, a single para element is created that contains a descriptive message.

Hence, in the above query node constructors appear in two places: as the last step of a path expression, and as the else branch of the conditional expression.

Computing Node Names at Runtime

Direct node constructors are fine, but what if one doesn't know the names of the nodes to construct when writing the query? For each direct node constructor there exists a computed node constructor, which takes the name and the content of the node as arbitrary expressions. For instance, the query seen above that produced a small XML document can also be written like this:

 declare default element namespace "http://example.com/Namespace";
 declare variable $documentElementName := "doc";

 element {$documentElementName}
 {
     attribute xml:base {"http://example.com/"},
     comment {" a comment "},
     processing-instruction target {"data"},
     element anotherElement {()},
     text {"some text"}
 }

Copying nodes into other nodes

When an expression embedded inside a node constructor evaluates to strings (or any other type of atomic values), the values become a text node by concatenating them with a space in between. However, when the expression evaluates to nodes, they are copied and become children of the surrounding node. This can occasionally be deceptive. Consider this query:

 <html>
     <body>
         <p>
         {
             doc('feed.rss')/rss/@version
         }
         </p>
     </body>
 </html>

This won't output a p element whose content is the value of the version attribute; instead it copies the attribute itself onto the p element, and the result is not even valid XHTML. When the value of the attribute is wanted rather than the attribute itself, the approach is to extract it using, for instance, the string() function:

 <html>
     <body>
         <p>
         {
             string(doc('feed.rss')/rss/@version)
         }
         </p>
     </body>
 </html>
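
The difference between atomic values and nodes in element content can be seen in a compact sketch. The element names are illustrative, and the comments show the element each constructor is expected to produce:

 (
     <e>{1, 2, 3}</e>,     (: <e>1 2 3</e>, values in one expression are joined by spaces :)
     <e>{1}{2}{3}</e>,     (: <e>123</e>, separate expressions are not :)
     <e>{<a/>, <b/>}</e>   (: <e><a/><b/></e>, nodes are copied in as children :)
 )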

Escaping Characters

In the XQuery syntax, a set of characters is given special meaning. For instance, apostrophes or quotes start and terminate string literals. Inside a string literal, the delimiting character can be escaped by writing it twice. For instance:

 <p>
 {
     """I hate quotations"" -- Ralph Waldo Emerson""",
     "&#xA;",
     '''"I hate quotations"" -- Ralph Waldo Emerson"'', appeared above'
 }
 </p>

Evaluates to:

 <p>"I hate quotations" -- Ralph Waldo Emerson"
  '"I hate quotations"" -- Ralph Waldo Emerson"', appeared above</p>

However, if the string contains quotes, it is sometimes easiest to delimit the string literal with apostrophes instead of quotes. One can also use predefined entity references and character references, such as &amp; or &#xA;, for instance to express characters that cannot be directly represented in the encoding of the file containing the query.

When curly braces should appear literally inside node constructors, one can escape them by doubling them, or use character references. For instance:

 <doc>
     This is one left followed by one right curly brace: {{ }}
     Here they are again, but with character references: &#x7B; &#x7D;
 </doc>

Dates, Times, Numbers and other Atomic Values

Apart from nodes, XQuery has atomic values, and they are just what one would think they are: small, primitive values that play a role similar to C++'s plain old data types such as float or long. In total there are about twenty atomic types, some of the most common being:

Name             Description
xs:integer       A 64 bit integer
xs:boolean       A boolean value, false or true
xs:double        A 64 bit floating point value
xs:string        A string where each codepoint is an XML 1.0 character (essentially Unicode)
xs:date          A date, such as when you're born: 1984-10-15
xs:time          A time, such as when you show up at work: 09:00:00
xs:dateTime      A date followed by a time: 1974-10-15T05:00:00
xs:duration      A time interval such as P5Y2M10DT15H, which represents five years, two months, 10 days, and 15 hours.
xs:base64Binary  Represents data, possibly binary data, in Base 64 encoding.

Atomic values can be seen as types which have:

Creating Atomic Values

Apart from the built-in functions that return atomic values, such as current-dateTime(), constructor functions can be used to create atomic values.

Integers, decimals, doubles and strings can be created by using literal expressions. Booleans are created with the functions true() or false() (just true or false would be name tests), and for the rest constructor functions must be used.

Essentially each atomic type can construct a value from a string. While doing so it validates the input string to ensure it has the proper format, and if not, it raises a dynamic error. These formats tend to be what one would guess. For instance, if one passes "1.five" to xs:decimal's constructor, as opposed to "1.5", the query halts so that the bug can be corrected.
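
A few constructor functions in use, as a minimal sketch with values chosen for illustration. The castable as expression is used here only to show which strings a constructor would accept without actually halting the query:

 (
     xs:date("2008-05-06"),            (: an xs:date :)
     xs:decimal("1.5"),                (: an xs:decimal :)
     xs:boolean(1),                    (: true, converted from an xs:integer :)
     xs:string(true()),                (: the string "true" :)
     "1.5" castable as xs:decimal,     (: true :)
     "1.five" castable as xs:decimal   (: false, so xs:decimal("1.five") would fail :)
 )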

Values don't have to be constructed from strings; they can be created, or converted, from a range of different types. For instance, an xs:boolean can be created from an xs:integer, an xs:double can be created from an xs:decimal, and an xs:boolean can be converted to an xs:string. Which conversions are possible depends on the types, but they tend to be intuitive. One of the specifications has a nifty table outlining those.

Using Atomic Values

Once atomic values have been constructed, via one of the methods mentioned above, or as return values from functions or by evaluating variables, one can pass them along to functions, convert them to strings and attach them as part of text nodes to nodes, or use operators between them. Let's look at the latter.

Apart from the usual arithmetic operators between numbers one would expect, they are also available between more exotic types. Have a look at this query:
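
A minimal sketch of such a query, with the two dates and the threshold chosen purely for illustration:

 (: 126 days lie between the two dates, which is more than 100 days. :)
 xs:date("2008-05-06") - xs:date("2008-01-01") > xs:dayTimeDuration("P100D")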

It subtracts two dates, which returns an xs:dayTimeDuration, and subsequently compares the result against another xs:dayTimeDuration. The query finally evaluates to a single atomic value of type xs:boolean.

The available operators, and the types they can be used between, are summarized in a table in the main XQuery specification.

Further Reading

XQuery is a big language that is hard to cover in an overview. For a good understanding of the subject, it can be worthwhile to get a book on the topic.

Another alternative is to ask a question or two on the mailing lists qt-interest or talk at x-query.com.

FunctX is a collection of XQuery functions that can be both useful and educational.

Of course, the specifications are another alternative, but one has to take a deep breath before diving into those. Here are the links to (some of) them:

FAQ

Path expressions miss

Often this happens because the names that the axis steps look for differ from the names of the nodes in the document. For instance, let's say that index.html in this query:

 (: Select all paragraphs that contain examples. :)
 doc("index.html")/html/body/p[@class="example"]

is an XHTML document, and hence its elements reside in the namespace http://www.w3.org/1999/xhtml. The path won't match since its steps look for {}html and so forth, while the actual name is {http://www.w3.org/1999/xhtml}html. The fix is straightforward:

 declare namespace x = "http://www.w3.org/1999/xhtml";
 (: Select all paragraphs that contain examples. :)
 doc("index.html")/x:html/x:body/x:p[@class="example"]

Path expressions also pick up the default element namespace if one is declared:

 declare default element namespace "http://www.w3.org/1999/xhtml";
 <html>
     <body>
         {
             for $i in doc("testResult.xml")/tests/test[@status = "failure"]
             order by $i/@name
             return <p>{$i/@name}</p>
         }
     </body>
 </html>

In this case the nodes created by the direct element constructors will be in the XHTML namespace, but so will the name tests in the path expressions. Hence they look for {http://www.w3.org/1999/xhtml}tests and so forth, while testResult.xml is perhaps in a different namespace, or no namespace at all.

Another reason could be that the context item is not what one expects it to be. For instance, this expression:

 doc("myPlainHTML.html")/body

won't match because the node that the doc() function returns is not the top element node (html); it is the document node.

Variable in for loop is out of scope

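Due to expression precedence it might be necessary to wrap the return expression of a for clause in parentheses, in particular when it uses the comma operator:

 for $i in reverse(1 to 10),
     $d in xs:integer(doc("numbers.xml")/numbers/number)
 return ($i, $d)

Without the parentheses on the last line, the comma operator, which binds less tightly than the for expression, would have had the whole for expression as its left operand, and since the scope of variable $d ends with the return expression, the second variable reference would be out of scope. An arithmetic expression such as $i + $d, on the other hand, belongs to the return clause as a whole and does not need the parentheses.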

Expressions aren't evaluated

If an expression is inside a node constructor it must be surrounded by curly braces, otherwise it's interpreted as text. This:

 <e>sum({(1, 2, 3)})</e>

evaluates to:

 <?xml version="1.0" encoding="UTF-8"?>
 <e>sum(1 2 3)</e>

while:

 <e>{sum((1, 2, 3))}</e>

evaluates to:

 <e>6</e>

Filters select the wrong things

When having predicates, consider what the predicate applies to. For instance, this query:

 <doc>
     <p>
         <span/>
         <span/>
     </p>
     <p>
         <span/>
         <span/>
     </p>
 </doc>/p/span[1]

evaluates to the first span element in each p element, while this query:

 (<doc>
     <p>
         <span/>
         <span/>
     </p>
     <p>
         <span/>
         <span/>
     </p>
 </doc>/p/span)[1]

evaluates to only one span element, the one that occurred first in the result of the path expression as a whole. In the first case the predicate was applied to the span step only.

FLWOR doesn't behave as expected

Note that a for clause generates a so-called tuple stream, while a let clause is an ordinary variable binding. For instance, if a let binding is placed inside a for binding, it is created for each tuple. The order by clause in turn applies to the tuple stream that the for clause evaluates to. Consider:

 for $a in (8, -4, 2)
 let $b := ($a * -1, 2)
 order by $a
 return $b

evaluates to 4 2 -2 2 -8 2.

This expression:

 let $i := (3, 2, 1)
 order by $i[1]
 return $i

wouldn't be sorted since the items the let clause binds aren't dealt with on an individual basis.
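
To sort the individual items, they need to be bound one at a time with a for clause, so that each item becomes a tuple of its own. A minimal sketch:

 for $i in (3, 2, 1)
 order by $i
 return $i

This evaluates to 1 2 3.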

Nodes are created in the wrong order

If nodes are created in the wrong order, it may be because the document order between nodes created with node constructors is undefined. For that reason node sorting, which is invoked by path expressions for instance, returns such nodes in an undefined order. Hence, one gets nodes in an arbitrary order if node constructors are placed somewhere in a path expression, or indirectly, if nodes are created inside a user-declared function which is called from a path step. Consider:

 doc('feed.rss')//item/<p>{description/node()}</p>

This query evaluates to a sequence of p elements. However, they are not necessarily in the same order as the item elements appear in feed.rss. The order is, counterintuitive as it may seem, undefined.

One approach is to use a for expression instead, which doesn't perform node sorting on its result:

 for $item in doc('feed.rss')//item
 return <p>
         {
             $item/description/node()
         }
        </p>

true or false doesn't work

Boolean values, that is atomic values of type xs:boolean, cannot be created by writing true or false inside the query, since those are steps, name tests to be precise. The safest and easiest way to create boolean values is to use the built-in functions false() or true().

Another way is to invoke its constructor function:

 xs:boolean("true")


Copyright © 2008 Trolltech Trademarks
Qt 4.4.0