PHP XML Parsing Large XML Documents

Problem

You want to parse a large XML document. This document is so large that it’s impractical to use SimpleXML or DOM because you cannot hold the entire document in memory.

Instead, you must load the document one section at a time.

Solution

Use the XMLReader extension:

$reader = new XMLReader();
$reader->open(__DIR__ . '/card-catalog.xml');

/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'author' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'author') {
        /* Move to the text node and print it out */
        $reader->read();
        print $reader->value . "\n";
    }
}

Discussion

There are two major types of XML parsers: ones that hold the entire document in memory at once, and ones that hold only a small portion of the document in memory at any given time.

The first kind are called tree-based parsers, because they store the document in a data structure known as a tree. The SimpleXML and DOM extensions are tree-based parsers. A tree-based parser is easier to use, but it requires PHP to use more RAM. With most XML documents, this isn’t a problem. However, when your XML document is quite large, the memory use can cause major performance issues.

The other kind of XML parser is a stream-based parser. Stream-based parsers don’t store the entire document in memory; instead, they read in one node at a time and allow you to interact with it in real time. Once you move on to the next node, the old one is thrown away, unless you explicitly store it yourself for later use. This makes stream-based parsers faster and less memory-hungry, but you may have to write more code to process the document.

The easiest way to process XML data with a stream-based parser is the XMLReader extension. This extension is based on the C# XmlTextReader API. If you’re familiar with SAX (Simple API for XML), you’ll find XMLReader more intuitive, feature-rich, and faster.

Begin by creating a new instance of the XMLReader class and specifying the location of your XML data:

// Create a new XMLReader object
$reader = new XMLReader();

// Load from a file or URL
$reader->open('document.xml');

// Or, load from a PHP variable
$reader->XML($document);

Most of the time, you’ll use the XMLReader::open() method to pull in data from an external source, but you can also load it from an existing PHP variable with XMLReader::XML().
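As a minimal sketch of the second approach, the following loads a small XML string with XMLReader::XML(), checks the return value, and reads the first node. The XML content and the variable names are made up for illustration; they are not the recipe’s card catalog.

```php
<?php
// Hypothetical XML held in a PHP variable; the content is illustrative only
$document = '<books><book isbn="0596003137"/></books>';

$reader = new XMLReader();

// XML() returns false if the reader could not be set up
if (!$reader->XML($document)) {
    throw new RuntimeException('Could not load XML from string');
}

// The first read() positions the reader on the root element
$reader->read();
$rootName = $reader->name;

print $rootName . "\n"; // prints "books"

$reader->close();
```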

Once the object is configured, you begin processing the data. At the start, you’re positioned at the top of the document. You can maneuver through the document using a combination of the two navigation methods XMLReader provides: XMLReader::read() and XMLReader::next(). The first method reads in the piece of XML data that immediately follows the current position. The second method moves to the next node while skipping the subtree of the current one, which in practice takes you to the next sibling.

For example, look at this XML:

<books>
    <book isbn="1565926811">
        <title>PHP Cookbook</title>
        <author>Sklar</author>
        <author>Trachtenberg</author>
    </book>
    <book isbn="0596003137">
        <title>Perl Cookbook</title>
        <author>Christiansen</author>
        <author>Torkington</author>
    </book>
</books>

When the object is positioned at the first <book> element, the read() method moves you to the next node underneath <book>. (This is technically the whitespace between <book> and <title>.) In comparison, next() skips the entire PHP Cookbook subtree and moves you on to the second <book> element (again, technically to the whitespace just before it).
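The difference can be sketched with two readers that start from the same position. The helper readerAtFirstBook() and the trimmed-down catalog are assumptions made for this example. Passing a name to next() is used here so the reader stops on the second <book> rather than on the whitespace in front of it.

```php
<?php
// Illustrative catalog, modeled on the XML shown above
$xml = <<<XML
<books>
    <book isbn="1565926811">
        <title>PHP Cookbook</title>
    </book>
    <book isbn="0596003137">
        <title>Perl Cookbook</title>
    </book>
</books>
XML;

// Hypothetical helper: return a reader positioned on the first <book>
function readerAtFirstBook(string $xml): XMLReader
{
    $reader = new XMLReader();
    $reader->XML($xml);
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT &&
            $reader->localName == 'book') {
            break;
        }
    }
    return $reader;
}

// read() steps into the first <book>: first whitespace, then <title>
$a = readerAtFirstBook($xml);
$a->read();
$afterOneRead = $a->nodeType;   // one of the whitespace node types
$a->read();
$afterTwoReads = $a->localName; // "title"

// next('book') skips the first subtree and stops at the next node named book
$b = readerAtFirstBook($xml);
$b->next('book');
$secondIsbn = $b->getAttribute('isbn'); // "0596003137"

print $afterTwoReads . "\n";
print $secondIsbn . "\n";
```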

These methods return true when they’re able to successfully move to another node, and false when they cannot. So, it’s typical to use them inside a while loop, like this:

/* Loop through document */
while ($reader->read()) {
    /* Process XML */
}

This causes the object to read in the entire XML document one piece at a time. Inside the while(), examine $reader and process it accordingly.
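To make the loop concrete, here is a minimal sketch that streams through a file and counts its <book> elements. The file is generated on the fly with made-up content so the example is self-contained; the file size and element names are assumptions, not part of the recipe.

```php
<?php
// Build a throwaway XML file so the sketch is self-contained
$path = tempnam(sys_get_temp_dir(), 'catalog');
$out = fopen($path, 'w');
fwrite($out, "<books>\n");
for ($i = 0; $i < 10000; $i++) {
    fwrite($out, "  <book isbn=\"$i\"><title>Book $i</title></book>\n");
}
fwrite($out, "</books>\n");
fclose($out);

$reader = new XMLReader();
$reader->open($path);

// Count <book> start tags while streaming through the file
$bookCount = 0;
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'book') {
        $bookCount++;
    }
}

$reader->close();
unlink($path);

print $bookCount . "\n"; // prints 10000
```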

A common thing to check is the node type. This lets you know if you’ve reached an element (and then check the name of that element), a closing element, an attribute, a piece of text, some whitespace, or any other part of an XML document. Do this by referencing the nodeType property:

/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'author' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'author') {
        /* Process author element */
    }
}

This code checks whether the node is an element and, if so, whether its name is author. For a complete list of possible values stored in nodeType, see the table that follows.

Table: XMLReader node type values

Node type                            Description
XMLReader::NONE                      No node type
XMLReader::ELEMENT                   Start element
XMLReader::ATTRIBUTE                 Attribute node
XMLReader::TEXT                      Text node
XMLReader::CDATA                     CDATA node
XMLReader::ENTITY_REF                Entity Reference node
XMLReader::ENTITY                    Entity Declaration node
XMLReader::PI                        Processing Instruction node
XMLReader::COMMENT                   Comment node
XMLReader::DOC                       Document node
XMLReader::DOC_TYPE                  Document Type node
XMLReader::DOC_FRAGMENT              Document Fragment node
XMLReader::NOTATION                  Notation node
XMLReader::WHITESPACE                Whitespace node
XMLReader::SIGNIFICANT_WHITESPACE    Significant Whitespace node
XMLReader::END_ELEMENT               End Element
XMLReader::END_ENTITY                End Entity
XMLReader::XML_DECLARATION           XML Declaration node
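One way to get a feel for these values is to tally the node types a reader reports for a tiny document. This is only a sketch: the input string is made up, it deliberately contains no whitespace between tags, and the $names lookup covers just the types this input is expected to produce.

```php
<?php
// Names for the node types this sketch expects to meet
$names = array(
    XMLReader::ELEMENT     => 'ELEMENT',
    XMLReader::TEXT        => 'TEXT',
    XMLReader::COMMENT     => 'COMMENT',
    XMLReader::END_ELEMENT => 'END_ELEMENT',
);

// Illustrative input with no whitespace between tags
$xml = '<book><title>PHP Cookbook</title><!-- a comment --></book>';

$reader = new XMLReader();
$reader->XML($xml);

// Count how often each node type is reported
$counts = array();
while ($reader->read()) {
    $name = $names[$reader->nodeType] ?? 'OTHER';
    $counts[$name] = ($counts[$name] ?? 0) + 1;
}
$reader->close();

foreach ($counts as $name => $count) {
    print "$name: $count\n";
}
```

Each opening tag and each closing tag is reported as its own node, which is why the loop in this recipe checks for XMLReader::ELEMENT before looking at the name.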

From there, you can decide how to handle that element and the data it contains. For example, we can print out all the author names in the card catalog:

$reader = new XMLReader();
$reader->open(__DIR__ . '/card-catalog.xml');

/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'author' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'author') {
        /* Move to the text node and print it out */
        $reader->read();
        print $reader->value . "\n";
    }
}

Sklar

Trachtenberg

Christiansen

Torkington

Once you’ve reached the <author> element, call $reader->read() to advance to the text inside it. From there, you can find the author names inside of $reader->value.

The XMLReader::value property gives you access to a node’s value. This only applies to nodes where a value is a meaningful concept, such as text nodes or CDATA nodes. In all other cases, such as element nodes, this property is set to the empty string.

Table: XMLReader properties

Name              Type      Description
attributeCount    int       Number of node attributes
baseURI           string    Base URI of the node
depth             int       Tree depth of the node, starting at 0
hasAttributes     bool      If the node has attributes
hasValue          bool      If the node has a text value
isDefault         bool      If the attribute value is defaulted from DTD
isEmptyElement    bool      If the node is an empty element tag
localName         string    Local name of the node
name              string    Qualified name of the node
namespaceURI      string    URI of the namespace associated with the node
nodeType          int       Node type of the node
prefix            string    Namespace prefix associated with the node
value             string    Text value of the node
xmlLang           string    xml:lang scope of the node
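The following sketch reads a handful of these properties for every node in a short document. The XML is illustrative, and <note/> is a made-up empty element added only to show isEmptyElement.

```php
<?php
// Illustrative XML; <note/> is a made-up empty element
$xml = '<books><book isbn="0596003137">'
     . '<title>Perl Cookbook</title><note/></book></books>';

$reader = new XMLReader();
$reader->XML($xml);

$rows = array();
while ($reader->read()) {
    // Skip closing tags to keep the report short
    if ($reader->nodeType == XMLReader::END_ELEMENT) {
        continue;
    }
    $rows[] = array(
        'name'           => $reader->name,
        'depth'          => $reader->depth,
        'attributeCount' => $reader->attributeCount,
        'isEmptyElement' => $reader->isEmptyElement,
        'hasValue'       => $reader->hasValue,
        'value'          => $reader->value,
    );
}
$reader->close();

// Indent each node by its depth; show the value only where one exists
foreach ($rows as $row) {
    print str_repeat('  ', $row['depth']) . $row['name'] .
        ($row['hasValue'] ? ' = ' . $row['value'] : '') . "\n";
}
```

Element nodes report an empty value, while the text node inside <title> carries the string itself and sits one level deeper than the element that contains it.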

There’s one remaining major piece of XMLReader functionality: attributes. XMLReader has a special set of methods to access attribute data when it’s positioned on an element node, including the following: moveToAttribute(), moveToFirstAttribute(), and moveToNextAttribute().

The moveToAttribute() method lets you specify an attribute name. For example, here’s code using the card catalog XML to print out all the ISBN numbers:

$reader = new XMLReader();
$reader->XML($catalog);

/* Loop through document */
while ($reader->read()) {
    /* If you're at an element named 'book' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'book') {
        $reader->moveToAttribute('isbn');
        print $reader->value . "\n";
    }
}

Once you’ve found the <book> element, call moveToAttribute('isbn') to advance to the isbn attribute, so you can read its value and print it out:

1565926811

0596003137
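When you don’t know the attribute names in advance, the other two methods let you walk every attribute on an element. This is a minimal sketch; the lang attribute is invented for the example, and moveToElement() is used at the end to return from the last attribute to its element.

```php
<?php
// Illustrative element; the lang attribute is made up for this sketch
$xml = '<book isbn="0596003137" lang="en"><title>Perl Cookbook</title></book>';

$reader = new XMLReader();
$reader->XML($xml);
$reader->read(); // now on <book>

$attributes = array();
if ($reader->moveToFirstAttribute()) {
    do {
        // While on an attribute, name and value describe that attribute
        $attributes[$reader->name] = $reader->value;
    } while ($reader->moveToNextAttribute());

    // Step back from the last attribute to the element that owns it
    $reader->moveToElement();
}

$backOn = $reader->localName; // "book"
$reader->close();

print_r($attributes);
```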

In the examples in this recipe, we print out information on all books. However, it’s easy to modify them to retrieve data only for one specific book. For example, this code combines pieces of the examples to print out all the data for Perl Cookbook in an efficient fashion:

$reader = new XMLReader();
$reader->XML($catalog);

// Perl Cookbook ISBN is 0596003137
// Use array to make it easy to add additional ISBNs
$isbns = array('0596003137' => true);

/* Loop through document to find first <book> */
while ($reader->read()) {
    /* If you're at an element named 'book' */
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'book') {
        break;
    }
}

/* Loop through <book>s to find right ISBNs */
do {
    if ($reader->moveToAttribute('isbn') &&
        isset($isbns[$reader->value])) {
        while ($reader->read()) {
            switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                print $reader->localName . ": ";
                break;
            case XMLReader::TEXT:
                print $reader->value . "\n";
                break;
            case XMLReader::END_ELEMENT:
                if ($reader->localName == 'book') {
                    break 2;
                }
            }
        }
    }
} while ($reader->next());

title: Perl Cookbook

author: Christiansen

author: Torkington

subject: Perl

The first while() iterates sequentially until it finds the first <book> element.

Having lined yourself up correctly, you then break out of the loop and start checking ISBN numbers. That’s handled inside a do… while() loop that uses $reader->next() to move down the <book> list. You cannot use a regular while() here or you’ll skip over the first <book>. Also, this is a perfect example of when to use $reader->next() instead of $reader->read().

If the ISBN matches a value in $isbns, then you want to process the data inside the current <book>. This is handled using yet another while() and a switch().

There are three different switch() cases: an opening element, element text, and a closing element. If you’re opening an element, you print out the element’s name and a colon. If you’re visiting text, you print out the textual data. And if you’re closing an element, you check to see whether you’re closing the <book>. If so, then you’ve reached the end of the data for that particular book, and you need to return to the do… while() loop. This is handled using break 2;, which jumps back two levels (out of both the switch() and the inner while()) instead of the usual one level.
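The same control flow can be wrapped in a function that returns the data instead of printing it. This is a sketch under stated assumptions rather than the recipe’s own code: the function name findBooksByIsbn() is hypothetical, the sample catalog is a guess at the card catalog’s shape based on the output shown above, and getAttribute() replaces moveToAttribute() so the reader stays on the element.

```php
<?php
// Hypothetical helper: collect the child-element text of matching books
function findBooksByIsbn(string $xml, array $isbns): array
{
    $reader = new XMLReader();
    $reader->XML($xml);
    $wanted = array_flip($isbns);
    $found = array();

    // Advance to the first <book>
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT &&
            $reader->localName == 'book') {
            break;
        }
    }

    // Hop from node to node with next(), skipping unwanted subtrees
    do {
        if ($reader->nodeType != XMLReader::ELEMENT ||
            $reader->localName != 'book') {
            continue;
        }
        $isbn = $reader->getAttribute('isbn');
        if ($isbn === null || !isset($wanted[$isbn])) {
            continue;
        }

        // Read inside the matching <book> until its closing tag
        $fields = array();
        $current = null;
        while ($reader->read()) {
            if ($reader->nodeType == XMLReader::ELEMENT) {
                $current = $reader->localName;
            } elseif ($reader->nodeType == XMLReader::TEXT && $current !== null) {
                $fields[$current][] = $reader->value;
            } elseif ($reader->nodeType == XMLReader::END_ELEMENT) {
                if ($reader->localName == 'book') {
                    break;
                }
                $current = null;
            }
        }
        $found[$isbn] = $fields;
    } while ($reader->next());

    $reader->close();
    return $found;
}

// Assumed catalog layout, for illustration only
$catalog = <<<XML
<books>
    <book isbn="1565926811">
        <title>PHP Cookbook</title>
        <author>Sklar</author>
        <author>Trachtenberg</author>
        <subject>PHP</subject>
    </book>
    <book isbn="0596003137">
        <title>Perl Cookbook</title>
        <author>Christiansen</author>
        <author>Torkington</author>
        <subject>Perl</subject>
    </book>
</books>
XML;

$result = findBooksByIsbn($catalog, array('0596003137'));
print_r($result);
```

Returning an array keyed by ISBN keeps repeated elements such as <author> together, at the cost of holding the matched books in memory, so it suits lookups that match only a few entries.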

Friday, June 14, 2019
