PHP Web Automation
Removing HTML and PHP Tags
Problem
You want to remove HTML and PHP tags from a string or file. For example, you want to make sure there is no HTML in a string before printing it or PHP in a string before passing it to eval().
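Solution
Use strip_tags() or filter_var() to remove HTML and PHP tags from a string. To strip tags from a file or other stream as it is read, use the string.strip_tags stream filter.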
Example Removing HTML and PHP tags
$html = '<a href="http://www.oreilly.com">I <b>love computer books.</b></a>';
$html .= '<?php echo "Hello!" ?>';
print strip_tags($html);
print "\n";
/* FILTER_SANITIZE_STRING is deprecated as of PHP 8.1 */
print filter_var($html, FILTER_SANITIZE_STRING);
Example prints:
I love computer books.
I love computer books.
Example Removing HTML and PHP tags from a stream
$stream = fopen(__DIR__ . '/elephant.html','r');
/* The string.strip_tags filter is deprecated as of PHP 7.3 */
stream_filter_append($stream, 'string.strip_tags');
print stream_get_contents($stream);
Discussion
Both strip_tags() and the string.strip_tags filter can be told not to remove certain tags. Provide a string containing allowable tags to strip_tags() as a second argument. The tag specification is case insensitive, and for pairs of tags, you only have to specify the opening tag. For example, to remove all but <b></b> and <i></i> tags from $html, call strip_tags($html,'<b><i>').
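As a minimal sketch of the allowed-tags argument (the sample string below is made up for illustration), only the <b> and <i> tags survive the call, while <p> and <u> are stripped and their text is kept:

```php
<?php
// Hypothetical sample input containing four kinds of tags.
$html = '<p>I <b>love</b> <i>computer</i> <u>books</u>.</p>';

// Keep <b> and <i>; listing the opening tags is enough.
$kept = strip_tags($html, '<b><i>');

print $kept . "\n";
// I <b>love</b> <i>computer</i> books.
```

Note that only the tags are removed: the text inside a stripped element stays in the result.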
With the string.strip_tags filter, pass a similar string as a fourth argument to stream_filter_append(). The third argument to stream_filter_append() controls whether the filter is applied on reading (STREAM_FILTER_READ), writing (STREAM_FILTER_WRITE), or both (STREAM_FILTER_ALL).
Example Removing some HTML and PHP tags from a stream
$stream = fopen(__DIR__ . '/elephant.html','r');
stream_filter_append($stream, 'string.strip_tags', STREAM_FILTER_READ, '<b><i>');
print stream_get_contents($stream);
stream_filter_append() also accepts an array of tag names instead of a string: array('b','i') instead of '<b><i>'.
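Because the string.strip_tags filter is deprecated as of PHP 7.3, it may not be available on newer PHP versions. One possible fallback, sketched here under the assumption that the stream is small enough to read into memory at once, is to read the whole stream and pass it through strip_tags(). The php://memory stream and its contents are stand-ins for a real file such as elephant.html.

```php
<?php
// Stand-in for a real file: an in-memory stream with made-up contents.
$stream = fopen('php://memory', 'w+');
fwrite($stream, '<p>An <b>elephant</b> <i>never</i> forgets.</p>');
rewind($stream);

// Read everything, then strip all tags except <b> and <i>.
$text = strip_tags(stream_get_contents($stream), '<b><i>');
fclose($stream);

print $text . "\n";
// An <b>elephant</b> <i>never</i> forgets.
```

Unlike a stream filter, this does not process the data chunk by chunk, so it is a poor fit for very large inputs.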
A more robust approach is to keep only a whitelist of known-good tags and attributes in your stripped HTML. This avoids the problems that can arise when strip_tags() reacts poorly to a broken tag or leaves a dangerous attribute in place. With this approach, you don’t remove bad things (which leaves you exposed if your list of bad things is incomplete) but instead keep only good things.
Example “Stripping” tags with a whitelist
class TagStripper {

    protected $allowed =
        array(
            /* Allow <a/> and only an "href" attribute */
            'a' => array('href' => true),
            /* Allow <p/> with no attributes */
            'p' => array());

    public function strip($html) {
        /* Tell Tidy to produce XHTML */
        $xhtml = tidy_repair_string($html, array('output-xhtml' => true));

        /* Load the dirty HTML into a DOMDocument */
        $dirty = new DOMDocument;
        $dirty->loadXML($xhtml);
        $dirtyBody = $dirty->getElementsByTagName('body')->item(0);

        /* Make a blank DOMDocument for the clean HTML */
        $clean = new DOMDocument();
        $cleanBody = $clean->appendChild($clean->createElement('body'));

        /* Copy the allowed nodes from dirty to clean */
        $this->copyNodes($dirtyBody, $cleanBody);

        /* Return the contents of the clean body */
        $stripped = '';
        foreach ($cleanBody->childNodes as $node) {
            $stripped .= $clean->saveXML($node);
        }
        return trim($stripped);
    }

    protected function copyNodes(DOMNode $dirty, DOMNode $clean) {
        foreach ($dirty->attributes as $name => $valueNode) {
            /* Copy over allowed attributes */
            if (isset($this->allowed[$dirty->nodeName][$name])) {
                $attr = $clean->ownerDocument->createAttribute($name);
                $attr->value = $valueNode->value;
                $clean->appendChild($attr);
            }
        }
        foreach ($dirty->childNodes as $child) {
            /* Copy allowed elements */
            if (($child->nodeType == XML_ELEMENT_NODE) &&
                (isset($this->allowed[$child->nodeName]))) {
                $node = $clean->ownerDocument->createElement(
                    $child->nodeName);
                $clean->appendChild($node);
                /* Examine children of this allowed element */
                $this->copyNodes($child, $node);
            }
            /* Copy text */
            else if ($child->nodeType == XML_TEXT_NODE) {
                $text = $clean->ownerDocument->createTextNode(
                    $child->textContent);
                $clean->appendChild($text);
            }
        }
    }
}
Given some input HTML, the class’s strip() method regularizes it into XHTML with Tidy, then walks down the resulting DOM tree, copying only allowed elements and attributes into a new DOM structure. Finally, it returns the contents of that new DOM structure.
Here’s TagStripper in action:
$html=<<<_HTML_
<a href=foo onmouseover="bad()" >this is some</b>
stuff
<p>This should be OK, as <a href="beep">well</a> as this. </p>
<script>alert('whoops')<p>This gets removed.</p></script>
<p>But this <script>bad</script> stuff has the script removed.</p>
_HTML_;
$ts = new TagStripper();
print $ts->strip($html);
This prints:
<a href="foo">this is some stuff</a>
<p>This should be OK, as <a href="beep">well</a> as this.</p>
<p>But this stuff has the script removed.</p>
The initial set of allowed elements and attributes, as defined by the $allowed property of the TagStripper class, is intentionally sparse. Add new elements and attributes carefully as you need them.
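To illustrate how the whitelist lookup behaves when it grows, here is a small standalone sketch. The img entry and the isAllowedAttribute() helper are hypothetical additions for illustration only; the helper mirrors the isset() test that copyNodes() applies to each attribute.

```php
<?php
// Hypothetical extended whitelist: the 'img' entry is an assumption,
// not part of the original $allowed property.
$allowed = array(
    'a'   => array('href' => true),
    'p'   => array(),
    'img' => array('src' => true, 'alt' => true),
);

// Hypothetical helper mirroring the isset() check in copyNodes().
function isAllowedAttribute(array $allowed, $tag, $attribute) {
    return isset($allowed[$tag][$attribute]);
}

var_dump(isAllowedAttribute($allowed, 'img', 'src'));     // bool(true)
var_dump(isAllowedAttribute($allowed, 'img', 'onerror')); // bool(false)
var_dump(isAllowedAttribute($allowed, 'p', 'style'));     // bool(false)
```

An attribute is kept only when both its element and its own name appear in the array, so an element listed with an empty array, such as p, is copied without any attributes.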
Note:
Whether with strip_tags() or the stream filter, attributes are not removed from allowed tags. This means that an attribute that changes display (such as style) or executes JavaScript (any event handler) is preserved. If you are displaying “stripped” text of arbitrary origin in a web browser to a user without escaping it first, this could result in cross-site scripting attacks.
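The following sketch shows the problem with a made-up input string: the event handler survives strip_tags() when its tag is allowed. Escaping with htmlspecialchars() is shown as one common way to make the text safe to display; whether it suits a given page depends on whether any markup needs to be preserved.

```php
<?php
// Hypothetical untrusted input with an event-handler attribute.
$html = '<b onmouseover="alert(1)">Hover me</b>';

// <b> is allowed, so the tag and all of its attributes are kept.
$stripped = strip_tags($html, '<b>');
print $stripped . "\n";
// <b onmouseover="alert(1)">Hover me</b>

// Escaping turns the markup into inert text instead.
$escaped = htmlspecialchars($html, ENT_QUOTES, 'UTF-8');
print $escaped . "\n";
// &lt;b onmouseover=&quot;alert(1)&quot;&gt;Hover me&lt;/b&gt;
```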