PHP Web Automation Extracting Links from an HTML File

Problem

You need to extract the URLs that are specified inside an HTML document.

Solution

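Use Tidy to convert the document to XHTML, then use an XPath query to find all the links. If Tidy isn’t available, the regular-expression-based pc_link_extractor() function shown further down is a workable fallback.

Example: Extracting links with Tidy and XPath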

$html=<<<_HTML_
Some things I enjoy eating are:
<ul>
<li><a href="http://en.wikipedia.org/wiki/Pickle">Pickles</a></li>
Salt-Baked Scallops</a></li>
<li><a href="http://www.thestoryofchocolate.com/">Chocolate</a></li>
</ul>
_HTML_;

$doc = new DOMDocument();
$opts = array('output-xhtml' => true,
              // Prevent DOMDocument from being confused about entities
              'numeric-entities' => true);
$doc->loadXML(tidy_repair_string($html,$opts));
$xpath = new DOMXPath($doc);
// Tell $xpath about the XHTML namespace
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
foreach ($xpath->query('//xhtml:a/@href') as $node) {
    $link = $node->nodeValue;
    print $link . "\n";
}

Example: Extracting links without Tidy

$html=<<<_HTML_
Some things I enjoy eating are:
<ul>
<li><a href="http://en.wikipedia.org/wiki/Pickle">Pickles</a></li>
Salt-Baked Scallops</a></li>
<li><a href="http://www.thestoryofchocolate.com/">Chocolate</a></li>
</ul>
_HTML_;

$links = pc_link_extractor($html);
foreach ($links as $link) {
    print $link[0] . "\n";
}

function pc_link_extractor($html) {
    $links = array();
    preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i',
                   $html,$matches,PREG_SET_ORDER);
    foreach($matches as $match) {
        $links[] = array($match[1],$match[2]);
    }
    return $links;
}

Discussion

The XHTML document that Tidy generates when the output-xhtml option is turned on may contain named entities other than the five that are predefined by the XML specification (&lt;, &gt;, &amp;, &quot;, and &apos;). Turning on the numeric-entities option makes Tidy write those other entities as numeric character references, so they never appear by name in the generated XHTML document. Their presence would cause DOMDocument to complain about undefined entities.

An alternative is to leave out the numeric-entities option but set $doc->resolveExternals to true. This tells DOMDocument to fetch any Document Type Definition (DTD) referenced in the file it’s loading and use that to resolve the entities. Tidy generates XHTML with an appropriate DOCTYPE declaration in it. The downside of this approach is that the DTD URL points to a resource on an external web server, so your program would have to download that resource each time it runs.
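The entity problem can be reproduced without Tidy at all. The following is a minimal sketch, assuming only the DOM extension; parses_as_xml() is a hypothetical helper introduced here for illustration and is not part of the original recipe. It feeds DOMDocument the same text once with a named HTML entity and once with the equivalent numeric character reference.

```php
<?php
// Hypothetical helper (not from the original recipe): reports whether
// DOMDocument accepts a string as well-formed XML.
function parses_as_xml($xml) {
    // Collect libxml errors internally instead of emitting PHP warnings
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok;
}

// &nbsp; is an HTML entity that XML does not predefine
var_dump(parses_as_xml('<p>salt&nbsp;baked</p>'));   // bool(false)

// &#160; is the numeric reference for the same character
var_dump(parses_as_xml('<p>salt&#160;baked</p>'));   // bool(true)
```

The numeric form parses because a character reference needs no entity declaration, which is the effect the numeric-entities option relies on.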

XHTML is an XML application—a defined XML vocabulary for expressing HTML. As such, all of its elements (the familiar <a/>, <h1/>, and so on) live in a namespace. For XPath queries to work properly, the namespace has to be attached to a prefix (that’s what the registerNamespace() method does) and then used in the XPath query.
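To see why the prefix matters, here is a small sketch that runs the same query with and without a registered prefix. The XHTML document is hand-written for illustration (it stands in for Tidy output, which is assumed to carry the same xmlns declaration), so only the DOM extension is needed.

```php
<?php
// A small hand-written XHTML document. The xmlns attribute places every
// element in the XHTML namespace.
$xhtml = <<<_XHTML_
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Foods</title></head>
<body><p><a href="http://en.wikipedia.org/wiki/Pickle">Pickles</a></p></body>
</html>
_XHTML_;

$doc = new DOMDocument();
$doc->loadXML($xhtml);
$xpath = new DOMXPath($doc);

// Unprefixed names in XPath 1.0 match only elements in no namespace,
// so this query finds nothing.
$unprefixed = $xpath->query('//a')->length;

// After binding a prefix to the XHTML namespace, the element is found.
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');
$prefixed = $xpath->query('//xhtml:a')->length;

print "without prefix: $unprefixed, with prefix: $prefixed\n";
// without prefix: 0, with prefix: 1
```

The prefix itself is arbitrary; what has to match is the namespace URI passed to registerNamespace().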

The pc_link_extractor() function is a useful alternative if Tidy isn’t available. Its regular expression won’t work on all links, such as those that are constructed with some hexadecimal escapes, but it should function on the majority of reasonably well-formed HTML. The function returns an array. Each element of that array is itself a two-element array: the first element is the target of the link, and the second element is the link anchor, that is, the text that is linked.
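When Tidy is unavailable but the DOM extension is, another option is to let DOMDocument’s own HTML parser handle the markup instead of a regular expression. This is a different technique from the one in the recipe, sketched here under that assumption; dom_link_extractor() is a hypothetical name, and it returns the same target-and-anchor pairs as pc_link_extractor().

```php
<?php
// Hypothetical helper (not from the original recipe): extracts links with
// DOMDocument::loadHTML() rather than Tidy or a regular expression.
function dom_link_extractor($html) {
    $links = array();
    // Real-world HTML often triggers parser warnings; collect them quietly
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    if ($doc->loadHTML($html)) {
        foreach ($doc->getElementsByTagName('a') as $node) {
            // Skip named anchors that have no link target
            if ($node->hasAttribute('href')) {
                $links[] = array($node->getAttribute('href'),
                                 trim($node->textContent));
            }
        }
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $links;
}

$html = '<ul><li><a href="http://en.wikipedia.org/wiki/Pickle">Pickles</a></li>'
      . '<li><a href="http://www.thestoryofchocolate.com/">Chocolate</a></li></ul>';

foreach (dom_link_extractor($html) as $link) {
    print $link[1] . ' -> ' . $link[0] . "\n";
}
```

Because loadHTML() parses HTML rather than XML, no namespace registration is needed, and getElementsByTagName() can be used directly.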

Example: Extracting links and anchors with Tidy and XPath

$html=<<<_HTML_
Some things I enjoy eating are:
<ul>
<li><a href="http://en.wikipedia.org/wiki/Pickle">Pickles</a></li>
Salt-Baked Scallops</a></li>
<li><a href="http://www.thestoryofchocolate.com/">Chocolate</a></li>
</ul>
_HTML_;

$doc = new DOMDocument();
$opts = array('output-xhtml'=>true,
              'wrap' => 0,
              // Prevent DOMDocument from being confused about entities
              'numeric-entities' => true);
$doc->loadXML(tidy_repair_string($html,$opts));
$xpath = new DOMXPath($doc);
// Tell $xpath about the XHTML namespace
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
foreach ($xpath->query('//xhtml:a') as $node) {
    $anchor = trim($node->textContent);
    $link = $node->getAttribute('href');
    print "$anchor -> $link\n";
}

In this example, the XPath query finds all the <a/> element nodes. The textContent property of each node holds the anchor text, and the link target is in the href attribute. The additional 'wrap' => 0 Tidy option tells Tidy not to do any line-wrapping on the generated XHTML. This keeps each link anchor on one line when extracting it.
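If anchor text can still contain line breaks, for example because the source HTML itself splits an anchor across lines, the whitespace can also be collapsed after extraction. The following is a minimal sketch; normalize_anchor_text() is a hypothetical helper, not part of the original recipe.

```php
<?php
// Hypothetical helper: collapse every run of whitespace, including
// newlines, into a single space and trim the ends.
function normalize_anchor_text($text) {
    return trim(preg_replace('/\s+/', ' ', $text));
}

print normalize_anchor_text("Salt-Baked\n      Scallops") . "\n";
// Salt-Baked Scallops
```

It could be applied to $node->textContent in place of the plain trim() call.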

Monday, June 17, 2019
