PHP XML Parsing Large XML Documents - Supercoders | Web Development and Design | Tutorial for Java, PHP, HTML, Javascript

Friday, June 14, 2019

PHP XML Parsing Large XML Documents

Problem

You want to parse a large XML document. This document is so large that it’s impractical to use SimpleXML or DOM because you cannot hold the entire document in memory.

Instead, you must load the document in one section at a time.

Solution

Use the XMLReader extension:

        $reader = new XMLReader();
        $reader->open(__DIR__ . '/card-catalog.xml');

        /* Loop through document */
        while ($reader->read()) {
               /* If you're at an element named 'author' */
               if ($reader->nodeType == XMLReader::ELEMENT &&
                   $reader->localName == 'author') {
                      /* Move to the text node and print it out */
                      $reader->read();
                      print $reader->value . "\n";
               }
        }

Discussion

There are two major types of XML parsers: ones that hold the entire document in memory at once, and ones that hold only a small portion of the document in memory at any given time.

The first kind are called tree-based parsers, because they store the document in a data structure known as a tree. The SimpleXML and DOM extensions are tree-based parsers. A tree-based parser is easier to work with, but requires PHP to use more RAM. With most XML documents, this isn’t a problem. However, when your XML document is quite large, holding the whole tree in memory can cause major performance issues.

The other kind of XML parser is a stream-based parser. Stream-based parsers don’t store the entire document in memory; instead, they read in one node at a time and allow you to interact with it in real time. Once you move onto the next node, the old one is thrown away—unless you explicitly store it yourself for later use. This makes stream-based parsers faster and less memory-consuming, but you may have to write more code to process the document.
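As a rough illustration of the difference, the sketch below parses the same small document both ways. The inline two-book string and the variable names are assumptions made for the example; a document this small would not need a stream-based parser in practice.

```php
<?php
// Hypothetical sample data, small enough to inline.
$xml = '<books>'
     . '<book><title>PHP Cookbook</title></book>'
     . '<book><title>Perl Cookbook</title></book>'
     . '</books>';

// Tree-based: the whole document is parsed up front, so any node
// can be reached directly, in any order.
$tree = simplexml_load_string($xml);
$secondTitle = (string) $tree->book[1]->title;

// Stream-based: nodes arrive one at a time and are gone once you move on,
// so anything needed later has to be stored explicitly.
$reader = new XMLReader();
$reader->XML($xml);
$titles = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'title') {
        $reader->read();              // advance to the text node
        $titles[] = $reader->value;   // keep a copy for later use
    }
}
$reader->close();

print $secondTitle . "\n";
print implode(', ', $titles) . "\n";
```

The SimpleXML half jumps straight to the second book, while the XMLReader half has to walk past the first one and remember what it saw.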

The easiest way to process XML data with a stream-based parser is the XMLReader extension. This extension is based on the C# XmlTextReader API. If you’re familiar with SAX (Simple API for XML), you’ll find XMLReader more intuitive, feature-rich, and faster.

Begin by creating a new instance of the XMLReader class and specifying the location of your XML data:

        // Create a new XMLReader object
        $reader = new XMLReader();

        // Load from a file or URL
        $reader->open('document.xml');

        // Or, load from a PHP variable
        $reader->XML($document);

Most of the time, you’ll use the XMLReader::open() method to pull in data from an external source, but you can also load it from an existing PHP variable with XMLReader::XML().
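Both loading methods return false on failure, so it can be worth checking the result before reading. The following is a minimal sketch; the temporary file and the tiny documents are stand-ins invented so the example runs on its own.

```php
<?php
// Write a tiny document to a temporary file so open() has something real to read.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, '<books><book isbn="1565926811"/></books>');

// Load from a file and stop early if it cannot be opened.
$reader = new XMLReader();
if (!$reader->open($path)) {
    throw new RuntimeException("Could not open $path");
}
$reader->read();                  // advance to the root element
$rootFromFile = $reader->name;
$reader->close();
unlink($path);

// Load from a PHP variable instead.
$document = '<books/>';
$reader = new XMLReader();
if (!$reader->XML($document)) {
    throw new RuntimeException('Could not load XML string');
}
$reader->read();
$rootFromString = $reader->name;
$reader->close();

print $rootFromFile . ' ' . $rootFromString . "\n";
```

Calling close() when you are done releases the underlying input, which matters more for long-running scripts than for one-off ones.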

Once the object is configured, you begin processing the data. At the start, you’re positioned at the top of the document. You can maneuver through the document using a combination of the two navigation methods XMLReader provides: XMLReader::read() and XMLReader::next(). The first method reads in the piece of XML data that immediately follows the current position, descending into child nodes when there are any. The second method skips over the current node’s children and moves to the next node after its subtree.

For example, look at this XML:

       <books>
              <book isbn="1565926811">
                     <title>PHP Cookbook</title>
                     <author>Sklar</author>
                     <author>Trachtenberg</author>
                     <subject>PHP</subject>
              </book>
              <book isbn="0596003137">
                     <title>Perl Cookbook</title>
                     <author>Christiansen</author>
                     <author>Torkington</author>
                     <subject>Perl</subject>
              </book>
       </books>

When the object is positioned at the first <book> element, the read() method moves you to the next node underneath <book>. (This is technically the whitespace between <book> and <title>.) In comparison, next() skips the entire PHP Cookbook subtree and moves you toward the next <book> element. (Again, you technically land on the whitespace before it first.)
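The difference can be checked with a short experiment. This is only a sketch: readerAtFirstBook() is a hypothetical helper written for the example, and the catalog is a trimmed copy of the XML above.

```php
<?php
// Trimmed copy of the catalog, inlined so the sketch is self-contained.
$catalog = <<<XML
<books>
    <book isbn="1565926811">
        <title>PHP Cookbook</title>
        <author>Sklar</author>
    </book>
    <book isbn="0596003137">
        <title>Perl Cookbook</title>
        <author>Christiansen</author>
    </book>
</books>
XML;

// Hypothetical helper: return a fresh reader positioned on the first <book>.
function readerAtFirstBook(string $xml): XMLReader
{
    $reader = new XMLReader();
    $reader->XML($xml);
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT &&
            $reader->localName == 'book') {
            break;
        }
    }
    return $reader;
}

// read() descends into <book>: the next node is the whitespace before <title>.
$a = readerAtFirstBook($catalog);
$a->read();
$afterRead = $a->nodeType;

// next() skips the whole first <book> subtree instead.
$b = readerAtFirstBook($catalog);
$b->next();
while ($b->nodeType != XMLReader::ELEMENT && $b->next()) {
    // step over the whitespace node between the two <book> elements
}
$isbn = $b->getAttribute('isbn');

$isWhitespace = in_array(
    $afterRead,
    [XMLReader::WHITESPACE, XMLReader::SIGNIFICANT_WHITESPACE],
    true
);
print ($isWhitespace ? 'whitespace' : 'other') . "\n";
print $isbn . "\n";
```

The first reader ends up on a whitespace node inside the first book, while the second one reaches the Perl Cookbook entry without visiting any of the first book's children.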

These methods return true when they’re able to successfully move to another node, and false when they cannot. So, it’s typical to use them inside a while loop, like this:

       /* Loop through document */
       while ($reader->read()) {
               /* Process XML */
       }

This causes the object to read in the entire XML document one piece at a time. Inside the while(), examine $reader and process it accordingly.

A common thing to check is the node type. This lets you know if you’ve reached an element (and then check the name of that element), a closing tag, an attribute, a piece of text, some whitespace, or any other part of an XML document. Do this by referencing the nodeType property:

       /* Loop through document */
       while ($reader->read()) {
              /* If you're at an element named 'author' */
              if ($reader->nodeType == XMLReader::ELEMENT &&
                  $reader->localName == 'author') {
                      /* Process author element */
              }
       }

This code checks if the node is an element and, if so, that its name is author. For a complete list of possible values stored in nodeType, see the table of node type values below.

Table  XMLReader node type values

Node type                            Description
XMLReader::NONE                      No node type
XMLReader::ELEMENT                   Start element
XMLReader::ATTRIBUTE                 Attribute node
XMLReader::TEXT                      Text node
XMLReader::CDATA                     CDATA node
XMLReader::ENTITY_REF                Entity Reference node
XMLReader::ENTITY                    Entity Declaration node
XMLReader::PI                        Processing Instruction node
XMLReader::COMMENT                   Comment node
XMLReader::DOC                       Document node
XMLReader::DOC_TYPE                  Document Type node
XMLReader::DOC_FRAGMENT              Document Fragment node
XMLReader::NOTATION                  Notation node
XMLReader::WHITESPACE                Whitespace node
XMLReader::SIGNIFICANT_WHITESPACE    Significant Whitespace node
XMLReader::END_ELEMENT               End Element
XMLReader::END_ENTITY                End Entity
XMLReader::XML_DECLARATION           XML Declaration node
_____________________________________________________
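To see which node types read() actually visits, it can help to trace a small document. The sketch below maps a handful of the constants above to labels; the $names lookup array and the one-book document are assumptions made for the example.

```php
<?php
// Labels for the node types this small document can produce.
$names = [
    XMLReader::ELEMENT                => 'ELEMENT',
    XMLReader::TEXT                   => 'TEXT',
    XMLReader::WHITESPACE             => 'WHITESPACE',
    XMLReader::SIGNIFICANT_WHITESPACE => 'SIGNIFICANT_WHITESPACE',
    XMLReader::END_ELEMENT            => 'END_ELEMENT',
];

$reader = new XMLReader();
$reader->XML('<book isbn="1565926811"><title>PHP Cookbook</title></book>');

$trace = [];
while ($reader->read()) {
    // Fall back to a generic label for any type not listed above.
    $label = $names[$reader->nodeType] ?? 'OTHER';
    $trace[] = $label . ' ' . $reader->name;
}
$reader->close();

print implode("\n", $trace) . "\n";
```

Note that the isbn attribute never shows up in the trace: read() visits elements, text, and closing tags, but attributes have to be reached through the attribute methods.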

From there, you can decide how to handle that element and the data it contains. For example, we can print out all the author names in the card catalog:

       $reader = new XMLReader();
       $reader->open(__DIR__ . '/card-catalog.xml');

       /* Loop through document */
       while ($reader->read()) {
              /* If you're at an element named 'author' */
              if ($reader->nodeType == XMLReader::ELEMENT &&
                  $reader->localName == 'author') {
                      /* Move to the text node and print it out */
                      $reader->read();
                      print $reader->value . "\n";
              }
       }

       Sklar
       Trachtenberg
       Christiansen
       Torkington

Once you’ve reached the <author> element, call $reader->read() to advance to the text inside it. From there, you can find the author names inside of $reader->value.

The XMLReader::value property gives you access to a node’s value. This only applies to nodes where this is a meaningful concept, such as text nodes or CDATA nodes. In all other cases, such as element nodes, this property is set to the empty string.

Table  XMLReader node properties

Name              Type      Description

attributeCount    int       Number of node attributes
baseURI           string    Base URI of the node
depth             int       Tree depth of the node, starting at 0
hasAttributes     bool      If the node has attributes
hasValue          bool      If the node has a text value
isDefault         bool      If the attribute value is defaulted from DTD
isEmptyElement    bool      If the node is an empty element tag
localName         string    Local name of the node
name              string    Qualified name of the node
namespaceURI      string    URI of the namespace associated with the node
nodeType          int       Node type of the node
prefix            string    Namespace prefix associated with the node
value             string    Text value of the node
xmlLang           string    xml:lang scope of the node
____________________________________________________
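Two of these properties, isEmptyElement and hasValue, are useful guards when reading element text. Calling read() right after an element and trusting value assumes the element has text inside it; with an empty tag such as <author/>, that extra read() would land on whatever node comes next. The sketch below is one way to guard against this, using a made-up document with an empty author.

```php
<?php
// Hypothetical input: the middle <author/> is an empty element tag.
$xml = '<book><author>Sklar</author><author/><author>Trachtenberg</author></book>';

$reader = new XMLReader();
$reader->XML($xml);

$authors = [];
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT &&
        $reader->localName == 'author') {
        // An empty tag has no text node to move to, so don't advance.
        if ($reader->isEmptyElement) {
            $authors[] = '';
            continue;
        }
        $reader->read();
        // Only trust value when the node can actually carry one.
        if ($reader->hasValue) {
            $authors[] = $reader->value;
        }
    }
}
$reader->close();

print count($authors) . " authors\n";
```

Recording an empty string for the empty tag is a choice made for this example; skipping it entirely would be just as reasonable.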

There’s one remaining major piece of XMLReader functionality: attributes. XMLReader has a special set of methods to access attribute data when it’s on top of an element node, including the following: moveToAttribute(), moveToFirstAttribute(), and moveToNextAttribute().
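When the attribute names are not known in advance, moveToFirstAttribute() and moveToNextAttribute() can walk through all of them. This is a minimal sketch; the lang and format attributes are invented for the example and are not part of the card catalog.

```php
<?php
$reader = new XMLReader();
// Hypothetical element with several attributes.
$reader->XML('<book isbn="1565926811" lang="en" format="paperback"/>');
$reader->read();   // position on the <book> element

$attrs = [];
if ($reader->moveToFirstAttribute()) {
    do {
        // While on an attribute, name and value describe that attribute.
        $attrs[$reader->name] = $reader->value;
    } while ($reader->moveToNextAttribute());

    // Step back to the element that owns the attributes.
    $reader->moveToElement();
}
$backOn = $reader->name;
$reader->close();

foreach ($attrs as $name => $value) {
    print "$name = $value\n";
}
```

Returning to the element with moveToElement() keeps the reader in a predictable place before the next read() or next() call.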

The moveToAttribute() method lets you specify an attribute name. For example, here’s code using the card catalog XML to print out all the ISBN numbers:

         $reader = new XMLReader();
         $reader->XML($catalog);

         /* Loop through document */
         while ($reader->read()) {
                /* If you're at an element named 'book' */
                if ($reader->nodeType == XMLReader::ELEMENT &&
                    $reader->localName == 'book') {
                        $reader->moveToAttribute('isbn');
                        print $reader->value . "\n";
                }
         }

Once you’ve found the <book> element, call moveToAttribute('isbn') to advance to the isbn attribute, so you can read its value and print it out:

         1565926811
         0596003137

In the examples in this recipe, we print out information on all books. However, it’s easy to modify them to retrieve data only for one specific book. For example, this code combines pieces of the examples to print out all the data for Perl Cookbook in an efficient fashion:

          $reader = new XMLReader();
          $reader->XML($catalog);

          // Perl Cookbook ISBN is 0596003137
          // Use array to make it easy to add additional ISBNs
          $isbns = array('0596003137' => true);

          /* Loop through document to find first <book> */
          while ($reader->read()) {
               /* If you're at an element named 'book' */
               if ($reader->nodeType == XMLReader::ELEMENT &&
                   $reader->localName == 'book') {
                    break;
               }
          }

          /* Loop through <book>s to find right ISBNs */
          do {
               if ($reader->moveToAttribute('isbn') &&
                   isset($isbns[$reader->value])) {
                    while ($reader->read()) {
                         switch ($reader->nodeType) {
                         case XMLReader::ELEMENT:
                              print $reader->localName . ": ";
                              break;
                         case XMLReader::TEXT:
                              print $reader->value . "\n";
                              break;
                         case XMLReader::END_ELEMENT:
                              if ($reader->localName == 'book') {
                                   break 2;
                              }
                         }
                    }
               }
          } while ($reader->next());

          title: Perl Cookbook
          author: Christiansen
          author: Torkington
          subject: Perl

The first while() iterates sequentially until it finds the first <book> element.

Having lined yourself up correctly, you then break out of the loop and start checking ISBN numbers. That’s handled inside a do… while() loop that uses $reader->next() to move down the <book> list. You cannot use a regular while() here or you’ll skip over the first <book>. Also, this is a perfect example of when to use $reader->next() instead of $reader->read().

If the ISBN matches a value in $isbns, then you want to process the data inside the current <book>. This is handled using yet another while() and a switch().

There are three different switch() cases: an opening element, element text, and a closing element. If you’re opening an element, you print out the element’s name and a colon. If you’re visiting text, you print out the textual data. And if you’re closing an element, you check to see whether you’re closing the <book>. If so, then you’ve reached the end of the data for that particular book, and you need to return to the do… while() loop. This is handled using break 2;, which jumps out two levels (the switch() and the inner while()) instead of the usual one level.
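The same logic can be wrapped in a function that returns the data instead of printing it. The sketch below is one possible arrangement, not the only one: findBook() is a hypothetical name, and it uses getAttribute() to check the ISBN without moving off the element. Because the inner while() is not nested in a switch() here, break 2 leaves both loops at once.

```php
<?php
// Hypothetical helper: collect [field, text] pairs for the book with the given ISBN.
function findBook(string $xml, string $isbn): array
{
    $reader = new XMLReader();
    $reader->XML($xml);

    // Advance to the first <book> element.
    while ($reader->read()) {
        if ($reader->nodeType == XMLReader::ELEMENT &&
            $reader->localName == 'book') {
            break;
        }
    }

    $fields = [];
    do {
        if ($reader->nodeType == XMLReader::ELEMENT &&
            $reader->localName == 'book' &&
            $reader->getAttribute('isbn') === $isbn) {
            $current = null;
            while ($reader->read()) {
                if ($reader->nodeType == XMLReader::ELEMENT) {
                    $current = $reader->localName;            // remember the field name
                } elseif ($reader->nodeType == XMLReader::TEXT && $current !== null) {
                    $fields[] = [$current, $reader->value];   // pair it with its text
                } elseif ($reader->nodeType == XMLReader::END_ELEMENT &&
                          $reader->localName == 'book') {
                    break 2;                                  // done with this book
                }
            }
        }
    } while ($reader->next());   // skip subtrees of books that don't match

    $reader->close();
    return $fields;
}

$catalog = <<<XML
<books>
    <book isbn="1565926811">
        <title>PHP Cookbook</title>
        <author>Sklar</author>
    </book>
    <book isbn="0596003137">
        <title>Perl Cookbook</title>
        <author>Christiansen</author>
        <author>Torkington</author>
        <subject>Perl</subject>
    </book>
</books>
XML;

$fields = findBook($catalog, '0596003137');
foreach ($fields as [$name, $text]) {
    print "$name: $text\n";
}
```

Returning pairs rather than an associative array keeps repeated fields such as author from overwriting each other.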


