Parsing Large XML Documents
Problem
You want to parse a large XML document. This document is so large that it’s impractical to use SimpleXML or DOM because you cannot hold the entire document in memory.
Instead, you must load the document in one section at a time.
Solution
Use the XMLReader extension:
$reader = new XMLReader();
$reader->open(__DIR__ . '/card-catalog.xml');
/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'author' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'author') {
        /* Move to the text node and print it out */
        $reader->read();
        print $reader->value . "\n";
    }
}
Discussion
There are two major types of XML parsers: ones that hold the entire document in memory at once, and ones that hold only a small portion of the document in memory at any given time.
The first kind are called tree-based parsers, because they store the document in a data structure known as a tree. The SimpleXML and DOM extensions are tree-based parsers. A tree-based parser is easier for you to use, but it requires PHP to use more RAM. With most XML documents, this isn’t a problem. However, when your XML document is quite large, building the whole tree can cause major performance issues.
The other kind of XML parser is a stream-based parser. Stream-based parsers don’t store the entire document in memory; instead, they read in one node at a time and allow you to interact with it in real time. Once you move on to the next node, the old one is thrown away, unless you explicitly store it yourself for later use. This makes stream-based parsers faster and less memory-hungry, but you may have to write more code to process the document.
The easiest way to process XML data with a stream-based parser is the XMLReader extension. This extension is based on the C# XmlTextReader API. If you’re familiar with SAX (Simple API for XML), you’ll find XMLReader more intuitive, more feature-rich, and faster.
Begin by creating a new instance of the XMLReader class and specifying the location of your XML data:
// Create a new XMLReader object
$reader = new XMLReader();
// Load from a file or URL
$reader->open('document.xml');
// Or, load from a PHP variable
$reader->XML($document);
Most of the time, you’ll use the XMLReader::open() method to pull in data from an external source, but you can also load it from an existing PHP variable with XMLReader::XML().
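As a minimal sketch of the loading step, the following code loads a small, made-up XML string with XML() and checks the return value before reading anything. The sample data and the $rootName variable are illustrative assumptions, not part of the recipe.

```php
<?php
// Hypothetical sample data; any well-formed XML string works here.
$document = '<books><book isbn="1565926811"><title>PHP Cookbook</title></book></books>';

$reader = new XMLReader();

// XML() returns false if the reader cannot be set up from the string.
if (!$reader->XML($document)) {
    throw new RuntimeException('Could not load XML from string');
}

// Nothing is parsed until the first read(); this moves to the root element.
$reader->read();
$rootName = $reader->name;

print $rootName . "\n";

// Release the input once you are done with it.
$reader->close();
```

The same return-value check applies to open(), which returns false when the file or URL cannot be opened.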
Once the object is configured, you begin processing the data. At the start, you’re positioned at the top of the document. You can maneuver through the document using a combination of the two navigation methods XMLReader provides: XMLReader::read() and XMLReader::next(). The first method reads in the piece of XML data that immediately follows the current position. The second method skips everything underneath the current node and moves to the node that follows it at the same level.
For example, look at this XML:
<books>
<book isbn="1565926811">
<title>PHP Cookbook</title>
<author>Sklar</author>
<author>Trachtenberg</author>
<subject>PHP</subject>
</book>
<book isbn="0596003137">
<title>Perl Cookbook</title>
<author>Christiansen</author>
<author>Torkington</author>
<subject>Perl</subject>
</book>
</books>
When the object is positioned at the first <book> element, the read() method moves you to the next node underneath <book>. (This is technically the whitespace between <book> and <title>.) In comparison, next() skips the entire PHP Cookbook subtree and moves you on toward the next <book> element. (Technically it lands on the whitespace after </book>; a second next(), or a single next('book'), reaches the element itself.)
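To make the difference concrete, here is a small sketch that positions two readers on the first <book> of a shortened catalog and then moves one with read() and the other with next('book'). The helper readerAtFirstBook() and the variable names are hypothetical, introduced only for this illustration.

```php
<?php
$xml = <<<XML
<books>
<book isbn="1565926811">
<title>PHP Cookbook</title>
</book>
<book isbn="0596003137">
<title>Perl Cookbook</title>
</book>
</books>
XML;

// Hypothetical helper: return a fresh reader positioned on the first <book>.
function readerAtFirstBook(string $xml): XMLReader
{
    $reader = new XMLReader();
    $reader->XML($xml);
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT &&
            $reader->localName === 'book') {
            break;
        }
    }
    return $reader;
}

// read() descends into the subtree: first a whitespace node, then <title>.
$r = readerAtFirstBook($xml);
$r->read();
$afterRead = $r->nodeType;          // a whitespace node type
$r->read();
$afterSecondRead = $r->localName;   // the <title> element

// next('book') skips the subtree and stops at the following <book>.
$n = readerAtFirstBook($xml);
$n->next('book');
$nextIsbn = $n->getAttribute('isbn');

print $afterSecondRead . "\n";
print $nextIsbn . "\n";
```

Passing an element name to next() is optional; without it, next() stops on whatever node follows the skipped subtree, including whitespace.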
These methods return true when they’re able to successfully move to another node, and false when they cannot. So, it’s typical to use them inside a while loop, like this:
/* Loop through document */
while ($reader->read()) {
    /* Process XML */
}
This causes the object to read in the entire XML document one piece at a time. Inside the while(), examine $reader and process it accordingly.
A common thing to check is the node type. This lets you know if you’ve reached an element (and then check the name of that element), a closing element, an attribute, a piece of text, some whitespace, or any other part of an XML document. Do this by referencing the nodeType property:
/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'author' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'author') {
        /* Process author element */
    }
}
This code checks if the node is an element and, if so, that its name is author. For a complete list of possible values stored in nodeType, see the following table.
Table XMLReader node type values
Node type                            Description
XMLReader::NONE                      No node type
XMLReader::ELEMENT                   Start element
XMLReader::ATTRIBUTE                 Attribute node
XMLReader::TEXT                      Text node
XMLReader::CDATA                     CDATA node
XMLReader::ENTITY_REF                Entity Reference node
XMLReader::ENTITY                    Entity Declaration node
XMLReader::PI                        Processing Instruction node
XMLReader::COMMENT                   Comment node
XMLReader::DOC                       Document node
XMLReader::DOC_TYPE                  Document Type node
XMLReader::DOC_FRAGMENT              Document Fragment node
XMLReader::NOTATION                  Notation node
XMLReader::WHITESPACE                Whitespace node
XMLReader::SIGNIFICANT_WHITESPACE    Significant Whitespace node
XMLReader::END_ELEMENT               End Element
XMLReader::END_ENTITY                End Entity
XMLReader::XML_DECLARATION           XML Declaration node
_____________________________________________________
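One way to get a feel for these node types is to tally what read() reports for a tiny document. The sketch below assumes a made-up one-line catalog and a hand-written $labels map covering only four of the constants; everything else is counted as OTHER.

```php
<?php
// Hypothetical one-line sample, so there are no whitespace nodes to count.
$xml = '<books><!-- catalog --><book isbn="1565926811">'
     . '<title>PHP Cookbook</title></book></books>';

// Map a few nodeType integers back to readable labels.
$labels = [
    XMLReader::ELEMENT     => 'ELEMENT',
    XMLReader::TEXT        => 'TEXT',
    XMLReader::COMMENT     => 'COMMENT',
    XMLReader::END_ELEMENT => 'END_ELEMENT',
];

$reader = new XMLReader();
$reader->XML($xml);

// Tally every node that read() stops on.
$counts = [];
while ($reader->read()) {
    $label = $labels[$reader->nodeType] ?? 'OTHER';
    $counts[$label] = ($counts[$label] ?? 0) + 1;
}
$reader->close();

print_r($counts);
```

Note that read() never stops on the isbn attribute; attributes are only visited when you move to them explicitly.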
From there, you can decide how to handle that element and the data it contains. For example, we can print out all the author names in the card catalog:
$reader = new XMLReader();
$reader->open(__DIR__ . '/card-catalog.xml');
/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'author' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'author') {
        /* Move to the text node and print it out */
        $reader->read();
        print $reader->value . "\n";
    }
}
Sklar
Trachtenberg
Christiansen
Torkington
Once you’ve reached the <author> element, call $reader->read() to advance to the text inside it. From there, you can find the author names inside of $reader->value.
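Calling read() to reach the text assumes that every <author> actually contains text. As a hedged alternative, not the recipe's own method, the sketch below uses XMLReader::readString(), which returns the text content of the current element without moving the reader, and guards empty tags with isEmptyElement. The sample data, including the empty <author/>, is made up for illustration.

```php
<?php
// Hypothetical sample with an empty <author/> in the middle.
$xml = '<book><author>Sklar</author><author/>'
     . '<author>Trachtenberg</author></book>';

$reader = new XMLReader();
$reader->XML($xml);

$authors = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'author') {
        // An empty tag has no text node to read, so record an empty string.
        // Otherwise readString() collects the element's text in place.
        $authors[] = $reader->isEmptyElement ? '' : $reader->readString();
    }
}
$reader->close();

print_r($authors);
```

With the read()-based approach, the empty <author/> would cause the extra read() to land on the next <author> element rather than on a text node, so that author's name would be skipped.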
The XMLReader::value property gives you access to a node’s value. This only applies to nodes where this is a meaningful concept, such as text nodes or CDATA nodes. In all other cases, such as element nodes, this property is set to the empty string. The following table lists the properties that XMLReader exposes for the current node.
Table XMLReader properties
Name              Type      Description
attributeCount    int       Number of node attributes
baseURI           string    Base URI of the node
depth             int       Tree depth of the node, starting at 0
hasAttributes     bool      If the node has attributes
hasValue          bool      If the node has a text value
isDefault         bool      If the attribute value is defaulted from DTD
isEmptyElement    bool      If the node is an empty element tag
localName         string    Local name of the node
name              string    Qualified name of the node
namespaceURI      string    URI of the namespace associated with the node
nodeType          int       Node type of the node
prefix            string    Namespace prefix associated with the node
value             string    Text value of the node
xmlLang           string    xml:lang scope of the node
____________________________________________________
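As a brief illustration of a few of these properties, the following sketch builds an indented outline of a document from depth, name, and attributeCount. The sample XML and the $outline variable are assumptions made for the example.

```php
<?php
// Hypothetical sample data.
$xml = '<books><book isbn="1565926811">'
     . '<title>PHP Cookbook</title></book></books>';

$reader = new XMLReader();
$reader->XML($xml);

$outline = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT) {
        // depth starts at 0 for the root element, so it maps directly
        // onto an indentation level.
        $outline[] = str_repeat('  ', $reader->depth)
                   . $reader->name
                   . ' attributes=' . $reader->attributeCount;
    }
}
$reader->close();

print implode("\n", $outline) . "\n";
```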
There’s one remaining major piece of XMLReader functionality: attributes. XMLReader has a special set of methods to access attribute data when it’s positioned on an element node, including the following: moveToAttribute(), moveToFirstAttribute(), and moveToNextAttribute().
The moveToAttribute() method lets you specify an attribute name. For example, here’s code using the card catalog XML to print out all the ISBN numbers:
$reader = new XMLReader();
$reader->XML($catalog);
/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'book' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'book') {
        $reader->moveToAttribute('isbn');
        print $reader->value . "\n";
    }
}
Once you’ve found the <book> element, call moveToAttribute('isbn') to advance to the isbn attribute, so you can read its value and print it out:
1565926811
0596003137
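When you don't know the attribute names in advance, moveToFirstAttribute() and moveToNextAttribute() let you walk all of them. The sketch below assumes a made-up <book> with an extra format attribute that is not in the card catalog; it collects every attribute into an array and then calls moveToElement() to return to the owning element.

```php
<?php
// Hypothetical sample; the "format" attribute is invented for illustration.
$xml = '<book isbn="1565926811" format="paperback">'
     . '<title>PHP Cookbook</title></book>';

$reader = new XMLReader();
$reader->XML($xml);

$attributes = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->hasAttributes) {
        // Step through each attribute of the current element in turn.
        if ($reader->moveToFirstAttribute()) {
            do {
                $attributes[$reader->name] = $reader->value;
            } while ($reader->moveToNextAttribute());
        }
        // Return to the element that owns the attributes.
        $reader->moveToElement();
    }
}
$reader->close();

print_r($attributes);
```

If you only need one attribute by name, $reader->getAttribute('isbn') returns its value without moving the reader at all.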
In the examples in this recipe, we print out information on all books. However, it’s easy to modify them to retrieve data only for one specific book. For example, this code combines pieces of the examples to print out all the data for Perl Cookbook in an efficient fashion:
$reader = new XMLReader();
$reader->XML($catalog);
// Perl Cookbook ISBN is 0596003137
// Use array to make it easy to add additional ISBNs
$isbns = array('0596003137' => true);
/* Loop through document to find first <book> */
while ($reader->read()) {
    /* If you're at an element named 'book' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'book') {
        break;
    }
}
/* Loop through <book>s to find right ISBNs */
do {
    if ($reader->moveToAttribute('isbn') &&
        isset($isbns[$reader->value])) {
        while ($reader->read()) {
            switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                print $reader->localName . ": ";
                break;
            case XMLReader::TEXT:
                print $reader->value . "\n";
                break;
            case XMLReader::END_ELEMENT:
                /* Leave both the switch and the inner while at </book> */
                if ($reader->localName == 'book') {
                    break 2;
                }
            }
        }
    }
} while ($reader->next());
title: Perl Cookbook
author: Christiansen
author: Torkington
subject: Perl
The first while() iterates sequentially until it finds the first <book> element.
Having lined yourself up correctly, you then break out of the loop and start checking ISBN numbers. That’s handled inside a do… while() loop that uses $reader->next() to move down the <book> list. You cannot use a regular while() here or you’ll skip over the first <book>. Also, this is a perfect example of when to use $reader->next() instead of $reader->read().
If the ISBN matches a value in $isbns, then you want to process the data inside the current <book>. This is handled using yet another while() and a switch().
There are three different switch() cases: an opening element, element text, and a closing element. If you’re opening an element, you print out the element’s name and a colon. If you’re visiting text, you print out the textual data. And if you’re closing an element, you check to see whether you’re closing the <book>. If so, then you’ve reached the end of the data for that particular book, and you need to return to the do… while() loop. This is handled using break 2;, which jumps out of two levels (the switch() and the inner while()) instead of the usual one level.
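For comparison, here is a sketch of a different technique rather than the recipe's own: a hybrid that streams from <book> to <book> with next('book') and hands only the matching subtree to SimpleXML through readOuterXml(). It assumes each individual <book> is small enough to hold in memory, and the inline $catalog string and $lines array are stand-ins for illustration.

```php
<?php
// Stand-in for the card catalog used throughout this recipe.
$catalog = <<<XML
<books>
<book isbn="1565926811">
<title>PHP Cookbook</title>
<author>Sklar</author>
<author>Trachtenberg</author>
<subject>PHP</subject>
</book>
<book isbn="0596003137">
<title>Perl Cookbook</title>
<author>Christiansen</author>
<author>Torkington</author>
<subject>Perl</subject>
</book>
</books>
XML;

$isbns = array('0596003137' => true);

$reader = new XMLReader();
$reader->XML($catalog);

/* Advance to the first <book> */
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'book') {
        break;
    }
}

$lines = [];
do {
    /* getAttribute() reads the value without moving the reader */
    $isbn = $reader->getAttribute('isbn');
    if ($isbn !== null && isset($isbns[$isbn])) {
        /* Parse just this one <book> subtree as a small tree */
        $book = new SimpleXMLElement($reader->readOuterXml());
        foreach ($book->children() as $child) {
            $lines[] = $child->getName() . ': ' . $child;
        }
    }
    /* next('book') skips the subtree and any whitespace in between */
} while ($reader->next('book'));
$reader->close();

print implode("\n", $lines) . "\n";
```

The trade-off is that the matching book is briefly held as a tree, which keeps the per-book code short while the rest of the document is still streamed.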