Using XPath to extract values from an XML file

XPath is really the main ingredient of this recipe – the Google Spreadsheets element is just there to hold the flavour.

Dave Pattern has kindly put a webservice on top of the Usage data the University of Huddersfield released – for the details see http://www.daveyp.com/blog/archives/854

Create a new Google Spreadsheet (you’ll need a Google account for this)
In the first column put the Header “ISBN” (i.e. in A1), and then in the first cell (A2) put the ISBN – in this example I’m using 159308000X – “The Adventures of Huckleberry Finn”
In the second column (cell B2) we are simply going to form the URL for the recommendations web service by using the formula:
=concatenate(“http://library.hud.ac.uk/api/usagedata/isbn=”,A2)

Now we’ve done the preparation, we are going to use a function called ‘importxml’ and XPath to find some specific information in the response we get back from the web service.

Before we do this, we need to know the structure of the XML response. This is sort of documented at http://library.hud.ac.uk/data/usagedata/_readme.html. I’m not going to replicate the whole structure here, but concentrate on the aspects I’m going to use in this example. I’m interested in the ‘suggestions’ part of the XML – that is, what are the recommendations. For these purposes the part of the XML structure we are interested in looks like:




[can be repeated]

[can be repeated]




The bit I’m really interested in is the list of ISBNs (there will be one element per recommendation) each of which can have a number of ‘attributes’ (an XML element can have a number of ‘attributes’ which describe facets of the element). In this case the element can have the following attributes:

id — an internal id for the bibliographic record (which can be used to query the web service as well as the ISBN)
common — the number of times both items were borrowed by borrowers who borrowed both items
before — the number of times the suggested book was borrowed before the current item
same — the number of times the suggested book was borrowed at the same time as the current item
after — the number of times the suggested book was borrowed after the current item
total — the number of times the suggested book has been borrowed by anyone

OK – now we’ve looked at the structure of the XML, let’s construct an XPath expression to the element.

There is a pretty good basic intro to XPath available at http://www.w3schools.com/Xpath/xpath_intro.asp, which I’d recommend for a more detailed tutorial on XPath. For the purposes of this example, we are going to use two aspects of XPath.

The ‘/’ character is used to select XML elements starting from the ‘root’ node (in our case – all other elements are within the elements). This looks a bit like a file path – so to select the ISBN element given the structure above, we would construct an XPath like:

/usage_data/items/item/suggestions/isbn

This is essentially saying “retrieve all the elements which are within a element, which is within an element, which is within an element, which is within the root element”

Using ‘/’ means we are looking for an exact to the specified path – so you need to get the path completely right, otherwise the expression will fail (or retrieve something different to what you are expecting).

Also note that this is not necessarily a unique path within the XML – since you can have multiple elements. So if there were elements in different parts of the document, as long as they matched the path specified, this XPath expression would retrieve them all and the response would not differentiate these at all.

Since the ‘importxml’ function needs a URL for an XML file, and an XPath expression, we could now construct the following ‘importxml’ function:

=importxml(“http://library.hud.ac.uk/api/usagedata/isbn=159308000X”,”/usage_data/items/item/suggestions/isbn”)

This would retrieve a list of the ISBNs that correspond to recommendations based on our original ISBN.

However, I’m going to go a step further, and rather than simply retrieve the ISBN, I’m going to get the ‘id’ which is an attribute of the element. In XPath the attributes are indiciated by a prefix of ‘@’. So to modify the above XPath statement to get the value of the id attribute for each (rather than simply the value stored in the element) we would use:

/usage_data/items/item/suggestions/isbn/@id

Since we’ve already constructed our URL in cell B2, we can now use this with importXML by putting the following formula in C2:

=importxml(B2,”/usage_data/items/item/suggestions/isbn/@id”)

This will retrieve all the relevant ids (at time of writing this is 11 recommendations, including some duplicates)

As well as simply allowing us to specify a path, XPath also allows us to insert values so that we only retrieve attributes or elements that relate to specific values. So given that the first id retrieved by the above XPath is “60300” we could now retrieve the ‘total’ attribute for elements which have the id value of 60300 by the following expression

/usage_data/items/item/suggestions/isbn[@id=60300]/@total

So, in Column D we can put a formula that reuses the IDs retrieved in Column C. Something like:

=importxml(B2,concatenate(“/usage_data/items/item/suggestions/isbn[@id='”,C2,”‘]/@total”))

Some further tweaking and we can fill in several of the attributes of the element.

My finished spreadsheet is available at

http://spreadsheets.google.com/ccc?key=rSGw_bsVZvFCp8xsDAMPMlQ

There are some issues and limitations to using Google Spreadsheets in this way, but this is a handy way of constructing and testing XPath expressions and seeing the response – and for very simple XML structures is usable.

Tony Hirst added the following comment to this original post:

The Yahoo YQL Developer console is another environment where you can query XML documents according to a path statement.

For example, the following query pulls back the titles of books that are associated with ISBN 159308000X.

select items.item.title from xml where url=’http://library.hud.ac.uk/api/usagedata/isbn=159308000X’

The format of the query is:

select PATH from xml where url=’URL_TO_XML_DOC’

Here’s a direct link to the query (you probably need to sign in with a Yahoo account):

http://developer.yahoo.com/yql/console/?q=select%20items.item.title…

The path starts at the element below the root element. So in Dave’s XML data, the root element is usage_data.

To pull back the description element (/usage_data/description) I would simply:

select description from xml where url=’http://library.hud.ac.uk/api/usagedata/isbn=159308000X’

To get loan histories, the query would be:

select items.item.loan_history from xml where url=’http://library.h…

It is possible to process the results further using Javascript and the YQL Execute engine: With YQL Execute, the Internet becomes your database.

Leave a reply

You must be logged in to post a comment.