| Click here to download the code bundle. To report code errors, email SystemiNetwork.com |
Although many databases provide functionality for querying XML files, the standard DB2 for i installation (which doesn't include XML Extender) offers no straightforward method for querying XML files. However, as I'll show you, you can leverage the DB2 for i user-defined table function (UDTF) to query XML files and determine the source of a problem with the data. We'll first learn a bit more about UDTFs and how they're useful for querying XML files, then examine a utility I wrote that you can use in an SQL query to search XML files for specific data.
The UDTF functionality in DB2 for i lets database users use HLLs to generate their own result sets and query those result sets. Thus, SQL is no longer limited for use in querying only database tables. It can also query the result sets generated by the HLL programs. (For more information about using UDTFs, see "UDTFs: The Unsung DB2 Function," January 2007, article ID 20788 at SystemiNetwork.com.)
A recent situation I faced inspired me to develop my UDTF-based utility. On my desk were purchase orders, placed by buyer BUY01, that had the wrong supplier address. My manager asked me to find the root cause of this problem. I suspected that this data originated from the XML files that another computer (e.g., our warehouse management system) received from our host system. My first task was to determine which purchase order the XML files received from the host system were placed by buyer BUY01. To find this information, either I would have to manually view the contents of all the XML files to determine the culprit XML, or I'd have to write a program to read the IFS and find the Buyer tag.
An easier way to find the problem data, though, would be to write a small SQL query like this:
Select IFSName, TagValue
From TABLE(SRCHXML
('/XML/Orders','CustOrd/Header/Buyer')) Tab
Where TAGVALUE='BUY01'
This query uses a UDTF that returns to the database a result set that I could query further. If I knew the directory containing the purchase order XML files and the XML path containing the buyer information, I could list all the XML files I was looking for.
To solve the problem of locating the XML files containing the incorrect data, I wrote a UDTF named SRCHXML that looks for data in all XML files in a given directory under a given XPath. The previous query uses the SCRHXML UDTF to search XML documents under the /XML/Orders directory and locate the data under the CustOrd/Header/Buyer XPath.
XPath or XML Path Language is a language used to find information in an XML document. You can use XPath to write expressions to navigate around the tree and select nodes (i.e., parts of the XML document) using various criteria. Figure 1 shows some sample HTML next to a diagram representing the XPath nodes in the document's XML hierarchy. In this example, the XPath for Item is CustOrd/Detail/Item (red line in the diagram), and for Buyer the XPath is CustOrd/Header/Buyer (green line). For more information about XPath, see www.w3schools.com/xpath/default.asp.
Every UDTF has two components:
SQL component. The SRCHXML UDTF accepts two input parameters: the XML directory and the XPath name. The UDTF returns three columns for every successful XML query: the XML filename, the XPath name, and the corresponding value from the required element. Figure 2 shows the UDTF with sample parameters.
HLL component. Figure 3 shows an excerpt of the HLL program for this example, which is written in RPGLE. (The complete Srchxml.RPGLE program and other solution components are available online at SystemiNetwork.com/code.) Here's how the program works. First, at A, DB2 calls the HLL program, specifying different parameters for different functionality. The first call signals the HLL to initialize the variables or open the resources; here, it opened my directory in the OpenDirS procedure. Subsequent calls fetch the row set from the HLL.
The next call at B executes the ProcessIFS procedure to read every XML file in the IFS directory and parse them using the ParseXML procedure, at E. The ParseXML procedure parses the XML using the XML-INTO opcode and stores the result of the parsed data in an array named Wrk_Data. The %XML built-in function (BIF) uses 'case=any' to avoid the problem of mismatching of case between the XPath and the actual XML.
If the ParseXML procedure successfully parses the XML (which you can determine by looking at the number of elements parsed), at C the program reads the Wrk_Data array in a loop for all the parsed data and returns the result to the database. Note that total numbers of parsed elements are available in the program data structure, which is used as the boundary for this loop.
If the program doesn't find the required XPath in the XML or an error occurs, the program returns no result set to the database but instead continues the processing by reading other XML files in the IFS directory until it finds the next suitable XML. I use the Monitor statement around the XML-INTO opcode to perform the error trapping and return control to the ProcessIFS procedure.
If a major error occurs for example, the input directory path is incorrect the program terminates after sending an appropriate error message to the database. In this case, the query terminates with the error message "Query cannot be run. See lower level messages." The lower-level message in the job log gives the complete error information.
After the program has read all the XML files in the IFS directory, the program's next task is to signal the database using the SQL_EOF condition at D, which tells the database to call the program for the last time. The final call signals the program to close the resources, where I close my open directory in the CloseDirS procedure.
To use the utility, you'll need to create the SRCHXML UDTF by following these steps:
RUNSQLSTM SRCFILE(MYLIB/QSQLSRC) SRCMBR(SRCHXML)
After you've created both components, you can use this utility as you would any other SQL statement in your interactive SQL and SQLRPGLE programs. There's a limitation with the SRCHXML UDTF that you should be aware of. If the directory contains a huge number of IFS files, the query may take a long time to run. In that case, the database manager times out the UDTF. To avoid the timeout problem, you can change the QUERY_TIME_LIMIT option in the QAQQINI query options file and specify a longer time limit so you can obtain the complete result set.
You can also use the SRCHXML UDTF to write complex queries. For example, in the purchase order situation mentioned earlier, if my manager also wants to know the purchase orders placed by BUY01, I can slightly change the query to retrieve that information, if I know the XPath for the purchase order number. Here's the query I would use:
Select TAGVALUE PO# from
TABLE(SRCHXML
('/xml/Orders','CustOrd/Header/ OrderNum')) a
Where IFSNAME in (
Select b.IFSNAME from
TABLE(SRCHXML
('/xml/Orders', 'CustOrd/Header/Buyer')) b
Where
b.TAGVALUE='BUY01' )
The SRCHXML UDTF tool extends DB2 for i by expanding users' capabilities for working with XML data, as you can do in DB2 9 for Linux, Unix, and Windows. You'll likely find yourself working with XML data, if you're not already and when you do, SRCHXML UDTF will come in handy.
Jagannath Lenka is a project manager at Infosys Technologies. He has been associated with IBM System i development since 2000.