PHP Web Automation Cleaning Up Broken or Nonstandard HTML

Problem

You have some HTML with malformed syntax that you’d like to clean up. Cleaning it up makes the HTML easier to parse and helps ensure that the pages you produce are standards compliant.

Solution

Use PHP’s Tidy extension. It relies on the popular and powerful HTML Tidy library to turn frightening piles of tag soup into well-formed, standards-compliant HTML or XHTML.

Example Repairing an HTML file with Tidy

$fixed = tidy_repair_file('bad.html');
file_put_contents('good.html', $fixed);

Discussion

The HTML Tidy library has a large number of rules and features built up over time that creatively handle a wide variety of HTML abominations. Fortunately, you don’t have to care about what all those rules are to reap the benefits of Tidy. Just pass a filename to tidy_repair_file() and you get back a cleaned-up version. For example, if bad.html contains:

<img src="monkey.jpg"> I love monkeys.

then the example above writes the following out to good.html:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>
<title></title>
</head>
<body>
<img src="monkey.jpg"> I love monkeys.
</body>
</html>

Tidy has a large number of configuration options that affect the output it produces. Pass configuration to tidy_repair_file() by providing a second argument that is an array of configuration options and values.

Example Production of XHTML with Tidy

$config = array('output-xhtml' => true);
$fixed = tidy_repair_file('bad.html', $config);
file_put_contents('good.xhtml', $fixed);

This example writes the following to good.xhtml:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title></title>
</head>
<body>
<img src="monkey.jpg" /> I love monkeys.
</body>
</html>

If your source HTML is in a string instead of a file, use tidy_repair_string(). It expects a first argument that contains HTML, not a filename.
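As a rough sketch of the string variant, the snippet below repairs a fragment held in a variable and combines two options in one configuration array. The helper name repair_fragment() and the sample markup are made up for illustration; 'output-xhtml' and 'show-body-only' are standard Tidy options, and the optional third argument names the input encoding.

```php
<?php
/* Hypothetical helper: repair an HTML fragment held in a string.
 * Returns the repaired markup, or null if Tidy is unavailable or fails. */
function repair_fragment($html)
{
    if (!function_exists('tidy_repair_string')) {
        return null; // the tidy extension is not loaded
    }
    $config = array(
        'output-xhtml'   => true, // emit XHTML rather than HTML
        'show-body-only' => true, // skip the doctype, <html>, and <head> wrapper
    );
    $fixed = tidy_repair_string($html, $config, 'utf8');
    return ($fixed === false) ? null : $fixed;
}

/* Sample input with unclosed <li> and <b> tags (made up for this sketch) */
$fixed = repair_fragment('<ul><li>One<li><b>Two</ul>');
if ($fixed !== null) {
    print $fixed;
}
```

Returning only the body content is convenient when the repaired fragment will be embedded in a larger page that already supplies its own doctype and head.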

The cleaned-up XHTML produced by Tidy also provides a way to mark up HTML without using regular expressions. After the HTML has been converted to a well-formed XHTML document, it can be systematically processed and converted by PHP’s DOM functions.

Example Marking up a web page with Tidy and DOM

$body = '
I like pickles and herring.

<a href="pickle.php"><img src="pickle.jpg"/>A pickle picture</a>

I have a herringbone-patterned toaster cozy.

<herring>Herring is not a real HTML element!</herring>
';

$words = array('pickle', 'herring');
$patterns = array();
$replacements = array();
foreach ($words as $i => $word) {
    $patterns[] = '/' . preg_quote($word, '/') . '/i';
    /* Wrap each match in a <span>; the class name is illustrative */
    $replacements[] = "<span class='word-$i'>$word</span>";
}

/* Tell Tidy to produce XHTML */
$xhtml = tidy_repair_string($body, array('output-xhtml' => true));

/* Load the XHTML as an XML document */
$doc = new DOMDocument;
$doc->loadXML($xhtml);

/* When turning our input HTML into a proper XHTML document,
 * Tidy puts the input HTML inside the <body/> element of the
 * XHTML document */
$body = $doc->getElementsByTagName('body')->item(0);

/* Visit all text nodes and mark up words if necessary */
$xpath = new DOMXPath($doc);
foreach ($xpath->query("descendant-or-self::text()", $body) as $textNode) {
    $replaced = preg_replace($patterns, $replacements, $textNode->wholeText);
    if ($replaced !== $textNode->wholeText) {
        $fragment = $textNode->ownerDocument->createDocumentFragment();
        /* This makes sure that the sub-nodes are created properly */
        $fragment->appendXML($replaced);
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
}

/* Build the XHTML consisting of the content of everything under <body/> */
$markedup = '';
foreach ($body->childNodes as $node) {
    $markedup .= $doc->saveXML($node);
}
print $markedup;

In this example, the preg_replace() call that adds the markup is run on every text node of the DOM tree that results from loading a Tidy-repaired version of the input HTML into a DOMDocument object. The great thing about this is that we can be certain that the replacements are only being run on text, never on tag names or attribute values. Any broken HTML that would have confused a regular expression written to find HTML tags is repaired by Tidy before the DOMDocument is created.

The downside of this approach is that, depending on how broken your input HTML is, the results of Tidy’s conversion may not be what you expect. For instance, the example above prints output along these lines (exact line breaks depend on Tidy’s wrapping):

I like <span class="word-0">pickle</span>s and <span class="word-1">herring</span>.
<a href="pickle.php"><img src="pickle.jpg" />A <span class="word-0">pickle</span>
picture</a> I
have a <span class="word-1">herring</span>bone-patterned toaster cozy.
<span class="word-1">herring</span> is not a real HTML element!

Note that the final part of the output is <span class="word-1">herring</span> is not a real HTML element!, with no surrounding element. Because <herring/> is not a valid XHTML element, Tidy has stripped the <herring> and </herring> tags out, leaving only the enclosed text. This is a reasonable thing to do in order to produce a valid XHTML document, but it can be confusing if you’re not expecting it.
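One way to find out that Tidy is going to discard something is to read the diagnostics it records while parsing. The following sketch uses the extension’s object-oriented interface (the tidy class, its parseString() and cleanRepair() methods, and its errorBuffer property). The sample input and variable names are illustrative, and the exact wording of the messages depends on the version of the Tidy library that is installed.

```php
<?php
/* Illustrative input containing a made-up element */
$html = '<herring>Herring is not a real HTML element!</herring>';
$messages = '';

if (class_exists('tidy')) {
    $tidy = new tidy();
    $tidy->parseString($html, array('output-xhtml' => true), 'utf8');
    $tidy->cleanRepair();

    /* errorBuffer holds one diagnostic per line, such as warnings about
     * unrecognized elements; the message text varies by Tidy version */
    $messages = (string) $tidy->errorBuffer;
    foreach (explode("\n", $messages) as $line) {
        if (trim($line) !== '') {
            print "Tidy says: " . trim($line) . "\n";
        }
    }
}
```

Logging these messages alongside the repaired output makes it easier to spot input that Tidy had to alter heavily.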


Monday, June 17, 2019
