PHP Web Automation
Extracting Links from an HTML File
Problem
You need to extract the URLs that are specified inside an HTML document.
Solution
Use Tidy to convert the document to XHTML, then use an XPath query to find all the links, as in the first example. If Tidy isn’t available, use the pc_link_extractor() function from the second example.
Example Extracting links with Tidy and XPath
$html=<<<_HTML_
<p>Some things I enjoy eating are:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Pickle">Pickles</a></li>
<li><a href="http://www.eatingintranslation.com/2011/03/great_ny_noodle.html">
Salt-Baked Scallops</a></li>
<li><a href="http://www.thestoryofchocolate.com/">Chocolate</a></li>
</ul>
_HTML_;
$doc = new DOMDocument();
$opts = array('output-xhtml' => true,
              // Prevent DOMDocument from being confused about entities
              'numeric-entities' => true);
$doc->loadXML(tidy_repair_string($html,$opts));
$xpath = new DOMXPath($doc);
// Tell $xpath about the XHTML namespace
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
foreach ($xpath->query('//xhtml:a/@href') as $node) {
    $link = $node->nodeValue;
    print $link . "\n";
}
Example Extracting links without Tidy
$html=<<<_HTML_
<p>Some things I enjoy eating are:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Pickle">Pickles</a></li>
<li><a href="http://www.eatingintranslation.com/2011/03/great_ny_noodle.html">
Salt-Baked Scallops</a></li>
<li><a href="http://www.thestoryofchocolate.com/">Chocolate</a></li>
</ul>
_HTML_;
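$links = pc_link_extractor($html);
foreach ($links as $link) {
    // $link[0] is the link target; $link[1] is the anchor text
    print $link[0] . "\n";
}
function pc_link_extractor($html) {
    $links = array();
    // [^>]*? keeps the search for href inside the opening <a> tag; the s
    // modifier lets (.*?) match anchor text that spans a line break
    preg_match_all('/<a\s+[^>]*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/is',
                   $html,$matches,PREG_SET_ORDER);
    foreach($matches as $match) {
        $links[] = array($match[1],$match[2]);
    }
    return $links;
}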
Discussion
The XHTML document that Tidy generates when the output-xhtml option is turned on may contain entities other than the five that are predefined by the XML specification (&lt;, &gt;, &amp;, &quot;, and &apos;). Turning on the numeric-entities option makes Tidy write those other entities as numeric character references (for example, &#160; instead of &nbsp;), so they never appear in the generated XHTML document.
Their presence would cause DOMDocument to complain about undefined entities. An alternative is to leave out the numeric-entities option but set $doc->resolveExternals to true. This tells DOMDocument to fetch any Document Type Definition (DTD) referenced in the file it’s loading and use that to resolve the entities.
Tidy generates XML with an appropriate DTD in it. The downside of this approach is that the DTD URL points to a resource on an external web server, so your program would have to download that resource each time it runs.
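The entity problem described above can be reproduced without Tidy. The following is a minimal sketch, assuming only the DOM extension; the two XHTML fragments are hypothetical stand-ins for Tidy output with and without numeric-entities.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings
libxml_use_internal_errors(true);

// Hypothetical fragments: one with a named entity, one with the
// equivalent numeric character reference
$named   = '<p xmlns="http://www.w3.org/1999/xhtml">caf&eacute;</p>';
$numeric = '<p xmlns="http://www.w3.org/1999/xhtml">caf&#233;</p>';

// &eacute; is not declared anywhere, so the XML parser rejects the document
$doc = new DOMDocument();
$okNamed = $doc->loadXML($named);
$errors = libxml_get_errors();
libxml_clear_errors();

// A numeric character reference needs no declaration
$doc2 = new DOMDocument();
$okNumeric = $doc2->loadXML($numeric);

print 'Named entity loaded: ' . ($okNamed ? 'yes' : 'no') . "\n";
print 'Numeric reference loaded: ' . ($okNumeric ? 'yes' : 'no') . "\n";
if ($okNumeric) {
    print $doc2->documentElement->textContent . "\n";
}
```

Only the numeric form loads, which is why the recipe asks Tidy for numeric-entities rather than relying on a DTD download.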
XHTML is an XML application—a defined XML vocabulary for expressing HTML. As such, all of its elements (the familiar <a/>, <h1/>, and so on) live in a namespace. For XPath queries to work properly, the namespace has to be attached to a prefix (that’s what the registerNamespace() method does) and then used in the XPath query.
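To see why the prefix matters, the sketch below runs the same query with and without it. The small XHTML string and the example.com URL are assumptions made for illustration; in the recipe the XHTML comes from Tidy.

```php
<?php
// Hypothetical XHTML document with the XHTML default namespace
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
       . '<a href="http://www.example.com/">Example</a>'
       . '</body></html>';

$doc = new DOMDocument();
$doc->loadXML($xhtml);
$xpath = new DOMXPath($doc);

// An unprefixed name in XPath 1.0 only matches elements in no namespace,
// so this finds nothing in an XHTML document
$plain = $xpath->query('//a');

// Bind a prefix to the XHTML namespace and use it in the query
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//xhtml:a');

print "Without prefix: " . $plain->length . "\n";
print "With prefix: " . $prefixed->length . "\n";
```

The prefix name itself is arbitrary; what must match is the namespace URI passed to registerNamespace().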
The pc_link_extractor() function is a useful alternative if Tidy isn’t available. Its regular expression won’t work on all links, such as those that are constructed with some hexadecimal escapes, but it should function on the majority of reasonably well-formed HTML. The function returns an array in which each element is itself a two-element array: the first element is the target of the link, and the second is the link anchor, the text that is linked.
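As a rough illustration of that return structure, the sketch below unpacks each two-element array into a target and an anchor. The function is repeated here so the snippet runs on its own, using the s modifier so that anchor text spanning a line break still matches; the sample HTML and example.com URLs are hypothetical.

```php
<?php
function pc_link_extractor($html) {
    $links = array();
    preg_match_all('/<a\s+[^>]*?href=["\']?([^"\' >]*)["\']?[^>]*>(.*?)<\/a>/is',
                   $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $match) {
        // $match[1] is the href value, $match[2] is the text between the tags
        $links[] = array($match[1], $match[2]);
    }
    return $links;
}

// Hypothetical input: one plain link, one with an extra attribute,
// single quotes, and a line break inside the anchor
$html = '<a href="http://www.example.com/a">First</a> '
      . "<a class=\"x\" href='http://www.example.com/b'>\nSecond</a>";

foreach (pc_link_extractor($html) as list($target, $anchor)) {
    // The anchor is returned as-is, so trim surrounding whitespace for display
    print trim($anchor) . ' -> ' . $target . "\n";
}
```

Any markup nested inside the link, such as an img or em element, is returned as part of the anchor string, so callers that want plain text may need strip_tags() as well.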
Example Extracting links and anchors with Tidy and XPath
$html=<<<_HTML_
<p>Some things I enjoy eating are:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Pickle">Pickles</a></li>
<li><a href="http://www.eatingintranslation.com/2011/03/great_ny_noodle.html">
Salt-Baked Scallops</a></li>
<li><a href="http://www.thestoryofchocolate.com/">Chocolate</a></li>
</ul>
_HTML_;
$doc = new DOMDocument();
$opts = array('output-xhtml' => true,
              'wrap' => 0,
              // Prevent DOMDocument from being confused about entities
              'numeric-entities' => true);
$doc->loadXML(tidy_repair_string($html,$opts));
$xpath = new DOMXPath($doc);
// Tell $xpath about the XHTML namespace
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
foreach ($xpath->query('//xhtml:a') as $node) {
    $anchor = trim($node->textContent);
    $link = $node->getAttribute('href');
    print "$anchor -> $link\n";
}
In this example, the XPath query finds all the <a/> element nodes. The textContent property of each node holds the anchor text, and the link is in the href attribute. The additional 'wrap' => 0 Tidy option tells Tidy not to do any line-wrapping on the generated XHTML. This keeps each link anchor on one line when extracting it.
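The textContent property also gathers text from elements nested inside the link. The sketch below shows this on a hand-written XHTML fragment (a hypothetical stand-in for Tidy output) and collapses any remaining whitespace with preg_replace(), which is an extra step not used in the recipe.

```php
<?php
// Hypothetical fragment: the anchor contains nested markup and a line break
$xhtml = '<p xmlns="http://www.w3.org/1999/xhtml">'
       . '<a href="http://www.example.com/scallops"><em>Salt-Baked</em>' . "\n"
       . 'Scallops</a></p>';

$doc = new DOMDocument();
$doc->loadXML($xhtml);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

foreach ($xpath->query('//xhtml:a') as $node) {
    // textContent concatenates the text of the <em/> child and the text node;
    // collapse runs of whitespace (including the line break) to one space
    $anchor = trim(preg_replace('/\s+/', ' ', $node->textContent));
    $link = $node->getAttribute('href');
    print "$anchor -> $link\n";
}
```

Normalizing whitespace this way is useful when the input was not produced with 'wrap' => 0, or when the original HTML already had line breaks inside the anchor.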