PHP Regular Expressions
Capturing Text Inside HTML Tags
Problem
You want to capture text inside HTML tags. For example, you want to find all the heading tags in an HTML document.
Solution
Read the HTML file into a string and use nongreedy matching in your pattern.
Example: Capturing HTML headings
$html = file_get_contents(__DIR__ . '/example.html');
preg_match_all('@<h([1-6])>(.+?)</h\1>@is', $html, $matches);
foreach ($matches[2] as $text) {
    print "Heading: $text\n";
}
Discussion
Robust parsing of HTML is difficult using a simple regular expression. This is one advantage of using XHTML; it’s significantly easier to validate and parse.
For instance, the pattern in the solution can’t deal with attributes inside the heading tags, and it matches only headings whose opening and closing tags agree. So <h1>Dr. Strangelove</h1> is OK, because it’s wrapped inside <h1></h1> tags, but <h2>How I Learned to Stop Worrying and Love the Bomb</h3> is not, because the opening tag is <h2>, whereas the closing tag is not.
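The attribute limitation can be illustrated with a short sketch. The inline $html string is a hypothetical stand-in for example.html, and the looser pattern is one possible variant rather than part of the original recipe. It still fails if an attribute value contains a > character.

```php
<?php
// Hypothetical sample input; the recipe itself reads example.html.
$html = '<h1 class="title">Dr. Strangelove</h1>' . "\n"
      . '<h2 id="sub">How I Learned to Stop Worrying</h2>';

// Original pattern: requires a bare <hN> tag, so attributes prevent a match.
$strict = preg_match_all('@<h([1-6])>(.+?)</h\1>@is', $html, $strictMatches);

// Looser variant (an assumption, not from the recipe):
// \b ends the tag name and [^>]* skips everything up to the closing >.
$loose = preg_match_all('@<h([1-6])\b[^>]*>(.+?)</h\1\s*>@is', $html, $looseMatches);

print "Strict pattern matched $strict heading(s)\n";
print "Loose pattern matched $loose heading(s)\n";
foreach ($looseMatches[2] as $text) {
    print "Heading: $text\n";
}
```

The backreference \1 still forces the closing tag to carry the same level as the opening tag, so mismatched pairs such as <h2>...</h3> remain unmatched.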
The same technique also works for finding all text inside bold and italic tags.
Example: Extracting text from HTML tags
$html = file_get_contents(__DIR__.'/example.html');
preg_match_all('@<(strong|em)>(.+?)</\1>@is', $html, $matches);
foreach ($matches[2] as $text) {
    print "Text: $text\n";
}
However, the <strong>/<em> example breaks on nested tags. If example.html contains <strong>Dr. Strangelove or: <em>How I Learned to Stop Worrying and Love the Bomb</em></strong>, that example doesn’t capture the text inside the <em></em> tags as a separate item.
This isn’t a problem for the headings example: because headings are block-level elements, it’s illegal to nest them. However, as inline elements, nested <strong> and <em> tags are valid.
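The nesting behavior is easy to reproduce in a small sketch. Here the nested markup is placed in an inline string (an assumption, standing in for example.html) and run through the same <strong>/<em> pattern.

```php
<?php
// Nested inline tags, as described in the discussion.
$html = '<strong>Dr. Strangelove or: <em>How I Learned to Stop Worrying '
      . 'and Love the Bomb</em></strong>';

preg_match_all('@<(strong|em)>(.+?)</\1>@is', $html, $matches);

// The engine consumes the whole <strong>...</strong> span as one match,
// so the inner <em>...</em> is never reported on its own.
$count = count($matches[2]);
print "Matches found: $count\n";
foreach ($matches[2] as $text) {
    print "Text: $text\n";
}
```

The single captured item still contains the raw <em> markup, because matching resumes only after the closing </strong> tag.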
Regular expressions can be moderately useful for parsing small amounts of HTML, especially if the structure of that HTML is reasonably constrained (or you’re generating it yourself). For more generalized and robust HTML parsing, use the Tidy extension.
It provides an interface to the popular libtidy HTML cleanup library. After Tidy has cleaned up your HTML, you can use its methods for getting at parts of the document. Or if you’ve told Tidy to convert your HTML to XHTML, you can use all of the XML manipulation power of SimpleXML or the DOM extension to slice and dice your HTML document.
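As a rough sketch of the DOM-based route, the following uses PHP's DOM extension directly on a small HTML string, without the Tidy cleanup step. The sample markup and the XPath expression are illustrative assumptions. Real-world documents may need Tidy or other repair first.

```php
<?php
// Hypothetical sample input with an attribute and nested inline tags.
$html = '<h1 class="title">Dr. Strangelove</h1>'
      . '<p><strong>or: <em>How I Learned to Stop Worrying</em></strong></p>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select all heading levels plus strong and em elements.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//h1|//h2|//h3|//h4|//h5|//h6|//strong|//em');

$found = [];
foreach ($nodes as $node) {
    // textContent returns the element's text with child markup removed.
    $found[] = $node->nodeName . ': ' . $node->textContent;
    print $node->nodeName . ': ' . $node->textContent . "\n";
}
```

Because the parser builds a tree, the heading with an attribute and the nested <em> element are both reported as separate nodes, which the regular expression approach could not do.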