PHP Web Automation
Cleaning Up Broken or Nonstandard HTML
Problem
You’ve got some HTML with malformed syntax that you’d like to clean up. Cleaning it up makes the HTML easier to parse and helps ensure that the pages you produce are standards compliant.
Solution
Use PHP’s Tidy extension. It relies on the popular, powerful HTML Tidy library to turn frightening piles of tag soup into well-formed, standards-compliant HTML or XHTML.
Example Repairing an HTML file with Tidy
$fixed = tidy_repair_file('bad.html');
file_put_contents('good.html', $fixed);
Discussion
The HTML Tidy library has a large number of rules and features built up over time that creatively handle a wide variety of HTML abominations. Fortunately, you don’t have to care about what all those rules are to reap the benefits of Tidy. Just pass a filename to tidy_repair_file() and you get back a cleaned-up version. For example, if bad.html contains:
<img src="monkey.jpg">
<b>I <em>love</b> monkeys</em>.
then the example above writes the following out to good.html:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<img src="monkey.jpg"> <b>I <em>love</em> monkeys</b>.
</body>
</html>
Tidy has a large number of configuration options that affect the output it produces. Pass configuration to tidy_repair_file() by providing a second argument that is an array of configuration options and values.
Example Production of XHTML with Tidy
$config = array('output-xhtml' => true);
$fixed = tidy_repair_file('bad.html', $config);
file_put_contents('good.xhtml', $fixed);
The example above writes the following to good.xhtml:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<img src="monkey.jpg" /> <b>I <em>love</em> monkeys</b>.
</body>
</html>
If your source HTML is in a string instead of a file, use tidy_repair_string(). It expects a first argument that contains HTML, not a filename.
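As a rough sketch of the string variant, the snippet below repairs a fragment held in a variable. It assumes the Tidy extension is loaded. The show-body-only option and the 'utf8' encoding argument are not used elsewhere in this recipe; they are included here as an assumption about what is convenient for fragments, so that only the repaired fragment comes back rather than a whole document.

```php
<?php
// Hypothetical fragment, e.g. from a form submission or a database row.
$html = '<b>I <em>love</b> monkeys</em>.';

$fixed = null;
if (function_exists('tidy_repair_string')) {
    $config = array(
        'output-xhtml'   => true,  // same option as in the XHTML example
        'show-body-only' => true,  // return only the contents of <body>
    );
    // The third argument names the character encoding Tidy should assume.
    $fixed = tidy_repair_string($html, $config, 'utf8');
    print $fixed;
} else {
    print "The Tidy extension is not loaded.\n";
}
```

The function_exists() guard is only there so the sketch fails gracefully on installations without Tidy; drop it if the extension is a hard requirement.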
The cleaned-up XHTML produced by Tidy also provides a way to mark up HTML without relying on regular expressions to tell tags from text. After the HTML has been converted to a well-formed XHTML document, it can be systematically processed and converted by PHP’s DOM functions.
Example Marking up a web page with Tidy and DOM
$body = '
<p>I like pickles and herring.</p>
<a href="pickle.php"><img src="pickle.jpg"/>A pickle picture</a>
I have a herringbone-patterned toaster cozy.
<herring>Herring is not a real HTML element!</herring>
';
$words = array('pickle','herring');
$patterns = array();
$replacements = array();
foreach ($words as $i => $word) {
    /* Pass the delimiter so that any "/" in a word is escaped too */
    $patterns[] = '/' . preg_quote($word, '/') . '/i';
    $replacements[] = "<span class='word-$i'>$word</span>";
}
/* Tell Tidy to produce XHTML */
$xhtml = tidy_repair_string($body, array('output-xhtml' => true));
/* Load the XHTML as an XML document */
$doc = new DOMDocument;
$doc->loadXml($xhtml);
/* When turning our input HTML into a proper XHTML document,
* Tidy puts the input HTML inside the <body/> element of the
* XHTML document */
$body = $doc->getElementsByTagName('body')->item(0);
/* Visit all text nodes and mark up words if necessary */
$xpath = new DOMXpath($doc);
foreach ($xpath->query("descendant-or-self::text()", $body) as $textNode) {
    $replaced = preg_replace($patterns, $replacements, $textNode->wholeText);
    if ($replaced !== $textNode->wholeText) {
        $fragment = $textNode->ownerDocument->createDocumentFragment();
        /* This makes sure that the <span/> sub-nodes are created properly */
        $fragment->appendXml($replaced);
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
}
/* Build the XHTML consisting of the content of everything under <body/> */
$markedup = '';
foreach ($body->childNodes as $node) {
    $markedup .= $doc->saveXml($node);
}
print $markedup;
In this example, the preg_replace() call that adds the markup is run on every text node of the DOM tree that results from loading a Tidy-repaired version of the input HTML into a DOMDocument object. The great thing about this is that we can be certain that the replacements are only being run on text. Any broken HTML that would have confused a regular expression used for finding HTML tags is repaired by Tidy before the DOMDocument is created.
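To see why restricting the replacement to text nodes matters, here is a minimal sketch that compares it with running the same pattern over the raw markup. The input is a made-up, already well-formed snippet, so Tidy is not needed here. The variable names, the `$0` back-reference in the replacement, and the htmlspecialchars() step are choices made for this illustration and are not part of the recipe’s code.

```php
<?php
// Hypothetical, already well-formed input, so no Tidy pass is required.
$xml = '<div><a href="pickle.php">A pickle picture</a></div>';
$pattern = '/pickle/i';
$replacement = '<span class="word-0">$0</span>';

// Naive approach: the pattern also matches inside the href attribute.
$naive = preg_replace($pattern, $replacement, $xml);

// Text-node approach: only character data is examined.
$doc = new DOMDocument;
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
// Copy the node list into an array first, because the loop modifies the tree.
$textNodes = iterator_to_array($xpath->query('//text()'));
foreach ($textNodes as $textNode) {
    // Re-escape &, < and > so the text is safe to parse as an XML fragment.
    $escaped = htmlspecialchars($textNode->nodeValue, ENT_NOQUOTES);
    $replaced = preg_replace($pattern, $replacement, $escaped);
    if ($replaced !== $escaped) {
        $fragment = $doc->createDocumentFragment();
        $fragment->appendXML($replaced);
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
}
$safe = $doc->saveXML($doc->documentElement);

print $naive . "\n";
print $safe . "\n";
```

The first line printed has a span wedged into the href value, which breaks the link; the second leaves the attribute alone and wraps only the word in the link text.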
The downside of this approach is that, depending on how broken your input HTML is, the results of Tidy’s conversion may not be what you expect. For the input above, the Tidy and DOM example prints:
<p>I like <span class="word-0">pickle</span>s and <span class="word-1">herring</span>.</p>
<a href="pickle.php"><img src="pickle.jpg" />A <span class="word-0">pickle</span> picture</a> I
have a <span class="word-1">herring</span>bone-patterned toaster cozy. <span class="word-1">herring</span> is not a real HTML element!
Note that the final part of the output is <span class="word-1">herring</span> is not a real HTML element!. Because <herring/> is not a valid XHTML element, Tidy has stripped the <herring> and </herring> out, leaving the enclosed text. This is a reasonable thing to do in order to produce a valid XHTML document, but could be confusing if you’re not expecting it. (The capital H is lost as well, because the replacement inserts the lowercase search word rather than the text that was matched.)
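If you would rather keep a nonstandard element than have it removed, HTML Tidy has a new-inline-tags configuration option for declaring extra tag names. The sketch below is untested against any particular Tidy version and assumes the Tidy extension is loaded; the new-inline-tags and show-body-only options do not appear elsewhere in this recipe, and the exact output may differ between Tidy releases.

```php
<?php
$body = '<p>I like herring.</p>'
      . '<herring>Herring is not a real HTML element!</herring>';

$default = null;
$declared = null;
if (function_exists('tidy_repair_string')) {
    $base = array('output-xhtml' => true, 'show-body-only' => true);

    // Default behavior described above: the unknown tags are dropped
    // and the enclosed text is kept.
    $default = tidy_repair_string($body, $base, 'utf8');

    // Assumption: declaring the tag name asks Tidy to treat <herring>
    // as a known inline element and leave it in place.
    $declared = tidy_repair_string(
        $body,
        $base + array('new-inline-tags' => 'herring'),
        'utf8'
    );

    print "Default:\n$default\n";
    print "With new-inline-tags:\n$declared\n";
} else {
    print "The Tidy extension is not loaded.\n";
}
```

Keep in mind that a document containing such declared tags is no longer valid XHTML, so this trades validity for fidelity to the input.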