PHP Web Automation Removing HTML and PHP Tags

Removing HTML and PHP Tags

Problem

You want to remove HTML and PHP tags from a string or file. For example, you want to make sure there is no HTML in a string before printing it or PHP in a string before passing it to eval().

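Solution

Use strip_tags() or filter_var() with the FILTER_SANITIZE_STRING filter to remove HTML and PHP tags from a string. FILTER_SANITIZE_STRING is deprecated as of PHP 8.1, so prefer strip_tags() in new code.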

Example Removing HTML and PHP tags

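/* A string containing both an HTML tag and a PHP tag */
$html = '<a href="http://www.oreilly.com">I love computer books.</a>';
$html .= '<?php echo "Hello!" ?>';
/* strip_tags() removes HTML and PHP tags */
print strip_tags($html);
print "\n";
/* FILTER_SANITIZE_STRING also strips tags (deprecated as of PHP 8.1) */
print filter_var($html, FILTER_SANITIZE_STRING);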

The example prints the same stripped text twice, once for each call:

I love computer books.
I love computer books.

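To strip tags from a stream as it is read, attach the string.strip_tags stream filter (deprecated in PHP 7.3 and removed in PHP 8.0).

Example Removing HTML and PHP tags from a stream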

$stream = fopen(__DIR__ . '/elephant.html','r');
stream_filter_append($stream, 'string.strip_tags');
print stream_get_contents($stream);
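On PHP versions where string.strip_tags is unavailable, one possible substitute is a small user-space stream filter. The sketch below is only an illustration under stated assumptions: the class name StripTagsFilter and the filter name example.strip_tags are hypothetical, and the filter buffers the whole stream before stripping (because a tag can straddle two chunks), so it is not suited to very large inputs.

```php
<?php
// Hypothetical user-space stand-in for the string.strip_tags filter.
// The class name and the registered filter name are assumptions.
class StripTagsFilter extends php_user_filter
{
    /** @var string Data collected from the stream so far */
    private $buffer = '';

    // Parameter types are omitted so the signature stays compatible
    // with both PHP 7 and PHP 8.
    public function filter($in, $out, &$consumed, $closing): int
    {
        // Collect every incoming chunk; nothing is stripped yet because
        // a tag may be split across two chunks.
        while ($bucket = stream_bucket_make_writeable($in)) {
            $this->buffer .= $bucket->data;
            $consumed += $bucket->datalen;
        }
        if (!$closing || $this->buffer === '') {
            return PSFS_FEED_ME;
        }
        // End of stream: strip tags once and pass the result on.
        // An optional string of allowable tags may be given as the
        // filter parameter, as with strip_tags().
        $allowed = is_string($this->params) ? $this->params : '';
        $clean = strip_tags($this->buffer, $allowed);
        $this->buffer = '';
        stream_bucket_append($out, stream_bucket_new($this->stream, $clean));
        return PSFS_PASS_ON;
    }
}

stream_filter_register('example.strip_tags', 'StripTagsFilter');

// php://temp stands in for elephant.html so the sketch is self-contained.
$stream = fopen('php://temp', 'w+');
fwrite($stream, '<p>An <b>elephant</b> never forgets.</p>');
rewind($stream);

stream_filter_append($stream, 'example.strip_tags', STREAM_FILTER_READ);
$text = stream_get_contents($stream);
fclose($stream);

print $text . "\n";
```

For small inputs, calling strip_tags(file_get_contents($path)) directly is simpler and avoids the filter machinery altogether.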

Discussion

Both strip_tags() and the string.strip_tags filter can be told not to remove certain tags. Pass strip_tags() a string of allowable tags as its second argument. The tag specification is case-insensitive, and for pairs of tags you only have to specify the opening tag. For example, to remove all but <b> and <i> tags from $html, call strip_tags($html, '<b><i>').
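As a quick illustration of the allowable-tags argument, the sketch below strips a sample string twice, once keeping <b> and <i> and once keeping nothing. The sample markup and variable names are made up for the example.

```php
<?php
// Sample input (invented for this illustration).
$html = '<p>Some <b>bold</b>, <I>italic</I>, and <a href="/x">linked</a> text.</p>';

// Keep only <b> and <i>. Matching ignores case, so <I> survives too,
// and listing the opening tag also keeps its closing tag.
$kept = strip_tags($html, '<b><i>');
print $kept . "\n";

// With no second argument, every tag is removed.
$plain = strip_tags($html);
print $plain . "\n";
```

The first call leaves the bold and italic markup in place and drops the paragraph and link tags; the second leaves only the text.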

With the string.strip_tags filter, pass a similar string as a fourth argument to stream_filter_append(). The third argument to stream_filter_append() controls whether the filter is applied on reading (STREAM_FILTER_READ), writing (STREAM_FILTER_WRITE), or both (STREAM_FILTER_ALL).
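To make the effect of the third argument visible, the sketch below attaches a filter once for writing and once for reading. It uses the built-in string.toupper filter as a stand-in for string.strip_tags, purely because its effect is easy to see; the php://temp streams and variable names are assumptions for the illustration.

```php
<?php
// string.toupper stands in for string.strip_tags here only to make
// the direction of filtering easy to observe.

// 1. Filter on write: data is transformed before it is stored.
$out = fopen('php://temp', 'w+');
$filter = stream_filter_append($out, 'string.toupper', STREAM_FILTER_WRITE);
fwrite($out, 'filtered on the way in');
stream_filter_remove($filter);
rewind($out);
$stored = stream_get_contents($out);   // read back with no filter attached
fclose($out);

// 2. Filter on read: stored data is untouched, the reader sees the result.
$in = fopen('php://temp', 'w+');
fwrite($in, 'filtered on the way out');
rewind($in);
stream_filter_append($in, 'string.toupper', STREAM_FILTER_READ);
$read = stream_get_contents($in);
fclose($in);

print $stored . "\n";
print $read . "\n";
```

STREAM_FILTER_ALL would attach the same filter to both directions at once.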

Example Removing some HTML and PHP tags from a stream

$stream = fopen(__DIR__ . '/elephant.html','r');
/* Strip everything except <b> and <i> tags while reading */
stream_filter_append($stream, 'string.strip_tags',STREAM_FILTER_READ,'<b><i>');
print stream_get_contents($stream);

stream_filter_append() also accepts an array of tag names instead of a string: array('b','i') instead of '<b><i>'.

A more robust approach that avoids the problems that could result from strip_tags() reacting poorly to a broken tag or not removing a dangerous attribute is to allow only a whitelist of known-good tags and attributes in your stripped HTML. With this approach, you don’t remove bad things (which leaves you open to the possibility that your list of bad things is incomplete) but instead only keep good things.

Example “Stripping” tags with a whitelist

/* Requires the tidy and dom extensions */
class TagStripper {

    protected $allowed = array(
        /* Allow <a/> and only an "href" attribute */
        'a' => array('href' => true),
        /* Allow <p/> with no attributes */
        'p' => array(),
    );

    public function strip($html) {
        /* Tell Tidy to produce XHTML */
        $xhtml = tidy_repair_string($html, array('output-xhtml' => true));

        /* Load the dirty HTML into a DOMDocument */
        $dirty = new DOMDocument;
        $dirty->loadXml($xhtml);
        $dirtyBody = $dirty->getElementsByTagName('body')->item(0);

        /* Make a blank DOMDocument for the clean HTML */
        $clean = new DOMDocument();
        $cleanBody = $clean->appendChild($clean->createElement('body'));

        /* Copy the allowed nodes from dirty to clean */
        $this->copyNodes($dirtyBody, $cleanBody);

        /* Return the contents of the clean body */
        $stripped = '';
        foreach ($cleanBody->childNodes as $node) {
            $stripped .= $clean->saveXml($node);
        }
        return trim($stripped);
    }

    protected function copyNodes(DOMNode $dirty, DOMNode $clean) {
        foreach ($dirty->attributes as $name => $valueNode) {
            /* Copy over allowed attributes */
            if (isset($this->allowed[$dirty->nodeName][$name])) {
                $attr = $clean->ownerDocument->createAttribute($name);
                $attr->value = $valueNode->value;
                $clean->appendChild($attr);
            }
        }
        foreach ($dirty->childNodes as $child) {
            /* Copy allowed elements */
            if (($child->nodeType == XML_ELEMENT_NODE) &&
                (isset($this->allowed[$child->nodeName]))) {
                $node = $clean->ownerDocument->createElement(
                    $child->nodeName);
                $clean->appendChild($node);
                /* Examine children of this allowed element */
                $this->copyNodes($child, $node);
            }
            /* Copy text */
            else if ($child->nodeType == XML_TEXT_NODE) {
                $text = $clean->ownerDocument->createTextNode(
                    $child->textContent);
                $clean->appendChild($text);
            }
        }
    }
}

Given some input HTML, the class's strip() method regularizes it into XHTML with Tidy, then walks the resulting DOM tree, copying only allowed elements and attributes into a new DOM structure. Finally, it returns the contents of that new structure.

Here’s TagStripper in action:

$html=<<<_HTML_
<a href=foo onmouseover="bad()" >this is some
stuff
This should be OK, as <a href="beep">well</a> as this.
But this <script>bad</script> stuff has the script removed.
_HTML_;

$ts = new TagStripper();
print $ts->strip($html);

This prints:

<a href="foo">this is some stuff</a>

This should be OK, as <a href="beep">well</a> as this.

But this stuff has the script removed.

The initial set of allowed elements and attributes, as defined by the $allowed property of the TagStripper class, is intentionally sparse. Add new elements and attributes carefully as you need them.

Note:

Whether with strip_tags() or the stream filter, attributes are not removed from allowed tags. This means that an attribute that changes display (such as style) or executes JavaScript (any event handler) is preserved. If you are displaying “stripped” text of arbitrary origin in a web browser to a user without escaping it first, this could result in cross-site scripting attacks.
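The following sketch illustrates the point of the note: an event-handler attribute on an allowed tag survives strip_tags(), and escaping the result with htmlspecialchars() before output renders it harmless. The input string and variable names are invented for the illustration.

```php
<?php
// Untrusted input (invented for this illustration).
$input = '<b onmouseover="steal()">Hover me</b><script>alert(1)</script>';

// The <script> tags are stripped (their text content remains), but the
// event handler on the allowed <b> tag is left untouched.
$stripped = strip_tags($input, '<b>');
print $stripped . "\n";

// Escaping turns whatever markup remains into inert text.
$safe = htmlspecialchars($stripped, ENT_QUOTES, 'UTF-8');
print $safe . "\n";
```

Escaping discards the formatting that the allowed tags were meant to keep, so when some markup must survive, a whitelist of both tags and attributes, as in TagStripper, is the more appropriate tool.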

Monday, June 17, 2019
