PHP Web Automation Program: Finding Stale Links

PHP Web Automation

Program: Finding Stale Links

The stale-links.php program produces a list of the links in a page and the status of each one. It tells you whether a link is okay, has been moved somewhere else, or is bad. Run the program from the command line, passing the URL of the page to scan as its first argument. It prints one line per link, like this:

http://oreilly.com: OK
https://members.oreilly.com: MOVED: https://members.oreilly.com/account/login
http://shop.oreilly.com/basket.do: OK
http://shop.oreilly.com: OK
http://radar.oreilly.com: OK
http://animals.oreilly.com: OK
http://programming.oreilly.com: OK
...

The stale-links.php program uses the cURL extension to retrieve web pages. First, it retrieves the URL specified on the command line. Once the page has been retrieved, the program converts it to XHTML with Tidy and runs an XPath query to get a list of the links in the page. Then it prepends a base URL to each relative link and requests the link.
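The link-extraction step can be tried in isolation. The following is a minimal sketch, not the program's own code: it assumes the DOM extension is available, and it parses the page with DOMDocument::loadHTML() rather than converting it to XHTML first, so the XPath query needs no namespace prefix. The function name extract_links() and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: return the href value of every <a> element in $html.
function extract_links(string $html): array {
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them,
    // since real-world HTML is rarely well formed
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = array();
    // Select the href attribute of each anchor that has one
    foreach ($xpath->query('//a/@href') as $node) {
        $links[] = $node->nodeValue;
    }
    return $links;
}

// Sample markup (an assumption for this sketch)
$html = '<html><body><a href="/about">About</a> ' .
        '<a href="http://example.com/">Home</a> <a name="top">Top</a></body></html>';
print_r(extract_links($html));
```

Anchors without an href attribute, such as the named anchor in the sample, are skipped because the query selects the attribute rather than the element.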

Because we need just the headers of these responses, we use the HEAD method instead of GET by setting the CURLOPT_NOBODY option. Setting CURLOPT_HEADER tells curl_exec() to include the response headers in the string it returns. Based on the response code, the status of the link is printed, along with its new location if it’s been moved.
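The mapping from response code to printed status can be sketched as a small pure function, which makes it easy to check without touching the network. The function name describe_status() and the sample header strings are assumptions for illustration. The sketch also matches the Location header case-insensitively, since HTTP header names are not case-sensitive.

```php
<?php
// Hypothetical helper: turn an HTTP status code and the raw header text
// of a HEAD response into the status string the program prints.
function describe_status(int $code, string $headers): string {
    // 2xx response codes mean the page is OK
    if ($code >= 200 && $code < 300) {
        return 'OK';
    }
    // 3xx response codes mean redirection; report the target if present
    if ($code >= 300 && $code < 400) {
        if (preg_match('/^Location:[ \t]*(.*)$/mi', $headers, $match)) {
            // trim() also removes the carriage return before the newline
            return 'MOVED: ' . trim($match[1]);
        }
        return 'MOVED';
    }
    // Anything else is treated as an error
    return "ERROR: $code";
}

// Sample header blocks (assumptions for this sketch)
$moved = "HTTP/1.1 301 Moved Permanently\r\nLocation: https://example.com/new\r\n\r\n";
print describe_status(301, $moved) . "\n";
print describe_status(200, "HTTP/1.1 200 OK\r\n\r\n") . "\n";
print describe_status(404, "HTTP/1.1 404 Not Found\r\n\r\n") . "\n";
```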

Example stale-links.php

<?php
// The URL to scan is the first command-line argument
if (! isset($_SERVER['argv'][1])) {
    die("No URL provided.\n");
}
$url = $_SERVER['argv'][1];

// Load the page
list($page, $pageInfo) = load_with_curl($url);
if (! strlen($page)) {
    die("No page retrieved from $url\n");
}

// Convert to XML for easy parsing
$opts = array('output-xhtml' => true,
              'numeric-entities' => true);
$xml = tidy_repair_string($page, $opts);
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml', 'http://www.w3.org/1999/xhtml');

// Compute the Base URL for relative links
$baseURL = '';
// Check if there is a <base href=""/> in the page
$nodeList = $xpath->query('//xhtml:base/@href');
if ($nodeList->length == 1) {
    $baseURL = $nodeList->item(0)->nodeValue;
}
// No <base href=""/>, so build the Base URL from $url
else {
    $URLParts = parse_url($pageInfo['url']);
    if (! (isset($URLParts['path']) && strlen($URLParts['path']))) {
        $basePath = '';
    } else {
        // Strip the last path segment to get the page's directory
        $basePath = preg_replace('#/[^/]*$#', '', $URLParts['path']);
    }
    // parse_url() reports credentials under the 'user' and 'pass' keys
    if (isset($URLParts['user']) || isset($URLParts['pass'])) {
        $auth = isset($URLParts['user']) ? $URLParts['user'] : '';
        $auth .= ':';
        $auth .= isset($URLParts['pass']) ? $URLParts['pass'] : '';
        $auth .= '@';
    } else {
        $auth = '';
    }
    // Keep an explicit port, if the URL has one
    $port = isset($URLParts['port']) ? ':' . $URLParts['port'] : '';
    $baseURL = $URLParts['scheme'] . '://' .
               $auth . $URLParts['host'] . $port .
               $basePath;
}

// Keep track of the links we visit so we don't visit each more than once
$seenLinks = array();

// Grab all links
$links = $xpath->query('//xhtml:a/@href');

foreach ($links as $node) {
    $link = $node->nodeValue;
    // Resolve relative links
    if (! preg_match('#^(http|https|mailto):#', $link)) {
        if ((strlen($link) == 0) || ($link[0] != '/')) {
            $link = '/' . $link;
        }
        $link = $baseURL . $link;
    }
    // Skip this link if we've seen it already
    if (isset($seenLinks[$link])) {
        continue;
    }
    // Mark this link as seen
    $seenLinks[$link] = true;
    // Print the link we're visiting
    print $link . ': ';
    flush();

    list($linkHeaders, $linkInfo) = load_with_curl($link, 'HEAD');
    // Decide what to do based on the response code
    // 2xx response codes mean the page is OK
    if (($linkInfo['http_code'] >= 200) && ($linkInfo['http_code'] < 300)) {
        $status = 'OK';
    }
    // 3xx response codes mean redirection
    else if (($linkInfo['http_code'] >= 300) && ($linkInfo['http_code'] < 400)) {
        $status = 'MOVED';
        if (preg_match('/^Location: (.*)$/m', $linkHeaders, $match)) {
            $status .= ': ' . trim($match[1]);
        }
    }
    // Other response codes mean errors
    else {
        $status = "ERROR: {$linkInfo['http_code']}";
    }
    // Print what we know about the link
    print "$status\n";
}

// Fetch $url with cURL and return array(response string, curl_getinfo() data)
function load_with_curl($url, $method = 'GET') {
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    if ($method == 'GET') {
        // Follow redirects when loading the page itself
        curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    }
    else if ($method == 'HEAD') {
        // Send a HEAD request and return the response headers
        curl_setopt($c, CURLOPT_NOBODY, true);
        curl_setopt($c, CURLOPT_HEADER, true);
    }
    $response = curl_exec($c);
    return array($response, curl_getinfo($c));
}
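The listing resolves a relative link by prepending the base URL, directory included, so a root-relative link such as /about found on a page in a subdirectory ends up under that subdirectory. The following sketch shows one way to treat the two cases separately. The function name resolve_link() is hypothetical, and the sketch deliberately ignores "../" segments, protocol-relative links that start with "//", and query-only or fragment-only references.

```php
<?php
// Hypothetical helper: resolve $link against the URL of the page it was
// found on. Handles absolute URLs, root-relative paths, and
// document-relative paths only.
function resolve_link(string $pageURL, string $link): string {
    // Already absolute: starts with a scheme such as http: or mailto:
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }
    $parts = parse_url($pageURL);
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    // Root-relative: replace the page's whole path
    if ($link !== '' && $link[0] === '/') {
        return $root . $link;
    }
    // Document-relative: keep the directory part of the page's path
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = preg_replace('#/[^/]*$#', '', $path);
    return $root . $dir . '/' . $link;
}

print resolve_link('http://example.com/docs/index.html', '/about') . "\n";
print resolve_link('http://example.com/docs/index.html', 'faq.html') . "\n";
```

With the sample page URL, the first call yields a link at the site root and the second a link inside the docs directory.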


Tuesday, June 18, 2019
