Monday, July 8, 2019


PHP Regular Expressions Capturing Text Inside HTML Tags


Problem

You want to capture text inside HTML tags. For example, you want to find all the heading tags in an HTML document.

Solution

Example Capturing HTML headings

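$html = file_get_contents(__DIR__ . '/example.html');

// Group 1 is the heading level; \1 requires the closing tag to use the same level.
preg_match_all('@<h([1-6])>(.+?)</h\1>@is', $html, $matches);

// $matches[2] holds the text captured between each pair of heading tags.
foreach ($matches[2] as $text) {
    print "Heading: $text\n";
}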

Discussion

Robust parsing of HTML is difficult using a simple regular expression. This is one advantage of using XHTML; it’s significantly easier to validate and parse.

For instance, the pattern can’t deal with attributes inside the heading tags and is only smart enough to find matching headings, so <h1>Dr. Strangelove</h1> is OK, because it’s wrapped inside <h1></h1> tags, but not <h2>How I Learned to Stop Worrying and Love the Bomb</h3>, because the opening tag is <h2>, whereas the closing tag is not.
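One way to relax the attribute limitation is to let the opening tag carry arbitrary attributes. The following is a minimal sketch, not part of the original recipe: the sample markup is made up for illustration, and the `[^>]*` shortcut assumes no attribute value contains a literal `>`.

```php
<?php
// Hypothetical sample input; the recipe itself reads example.html.
$html = '<h1 class="title">Dr. Strangelove</h1>' . "\n"
      . '<h2 id="sub">How I Learned to Stop Worrying</h2>' . "\n"
      . '<h2>Mismatched</h3>';

// \b ends the tag name right after the digit; [^>]* skips any attributes.
// \1 still forces the closing tag to use the same heading level.
preg_match_all('@<h([1-6])\b[^>]*>(.+?)</h\1>@is', $html, $matches);

// Text of each well-formed heading, in document order.
$headings = $matches[2];

foreach ($headings as $text) {
    print "Heading: $text\n";
}
```

The mismatched <h2>...</h3> pair is still skipped, because the backreference finds no closing </h2> after it.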

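This technique also works for finding text inside reasonably constrained <strong> and <em> tags.

Example Extracting text from HTML tags

$html = file_get_contents(__DIR__ . '/example.html');

// Group 1 is the tag name; \1 requires the closing tag to match it.
preg_match_all('@<(strong|em)>(.+?)</\1>@is', $html, $matches);

foreach ($matches[2] as $text) {
    print "Text: $text\n";
}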

However, this second example breaks on nested tags. If example.html contains <strong>Dr. Strangelove or: <em>How I Learned to Stop Worrying and Love the Bomb</em></strong>, the example doesn’t capture the text inside the <em></em> tags as a separate item.

This isn’t a problem for the headings example: because headings are block-level elements, it’s illegal to nest them. However, as inline elements, nested <strong> and <em> tags are valid.
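The nesting problem can be reproduced with a short string, and one possible workaround is to wrap the whole pattern in a lookahead so that matches may overlap. This is only a sketch under assumptions: the sample string stands in for example.html, and the trick still fails when the same tag is nested inside itself (for example <em> inside <em>).

```php
<?php
// Hypothetical sample input mirroring the nested case described above.
$html = '<strong>Dr. Strangelove or: '
      . '<em>How I Learned to Stop Worrying and Love the Bomb</em></strong>';

// The original pattern consumes the outer <strong> and never revisits the inner <em>.
preg_match_all('@<(strong|em)>(.+?)</\1>@is', $html, $plain);

// Inside a lookahead the match consumes nothing, so the scan continues
// character by character and can also match the nested <em>.
preg_match_all('@(?=<(strong|em)>(.+?)</\1>)@is', $html, $overlap);

// Drop any inner markup from the captured text.
$found = array_map('strip_tags', $overlap[2]);

print count($plain[2]) . " match without lookahead\n";
foreach ($found as $text) {
    print "Text: $text\n";
}
```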

Regular expressions can be moderately useful for parsing small amounts of HTML, especially if the structure of that HTML is reasonably constrained (or you’re generating it yourself). For more generalized and robust HTML parsing, use the Tidy extension.

Tidy provides an interface to the popular libtidy HTML cleanup library. After Tidy has cleaned up your HTML, you can use its methods for getting at parts of the document. Or if you’ve told Tidy to convert your HTML to XHTML, you can use all of the XML manipulation power of SimpleXML or the DOM extension to slice and dice your HTML document.
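As a rough illustration of the parser-based route, the sketch below uses the DOM extension directly. It skips the Tidy cleanup step and relies on DOMDocument::loadHTML(), which tolerates non-XHTML markup on its own. The sample markup and variable names are assumptions made for the example.

```php
<?php
// Hypothetical sample input; real code would load example.html instead.
$html = '<h1 class="title">Dr. Strangelove</h1>'
      . '<p><strong>or: <em>How I Learned to Stop Worrying</em></strong></p>';

$doc = new DOMDocument();

// Collect libxml parse warnings internally instead of emitting them.
$previous = libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($doc);

// Headings are found regardless of attributes on the tag.
$headings = [];
foreach ($xpath->query('//h1|//h2|//h3|//h4|//h5|//h6') as $node) {
    $headings[] = $node->textContent;
}

// Nested inline elements come back as separate nodes, in document order.
$inline = [];
foreach ($xpath->query('//strong|//em') as $node) {
    $inline[] = $node->nodeName . ': ' . $node->textContent;
}

print_r($headings);
print_r($inline);
```

Because the parser builds a tree, both the attribute and the nesting limitations of the regular expression approach go away, at the cost of loading the whole document.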
