PHP Web Automation
Program: Finding Stale Links
The stale-links.php program produces a list of links in a page and their status. It tells you if the links are okay, if they’ve been moved somewhere else, or if they’re bad. Run the program by passing it a URL to scan for links:
http://oreilly.com: OK
https://members.oreilly.com: MOVED: https://members.oreilly.com/account/login
http://shop.oreilly.com/basket.do: OK
http://shop.oreilly.com: OK
http://radar.oreilly.com: OK
http://animals.oreilly.com: OK
http://programming.oreilly.com: OK
...
The stale-links.php program uses the cURL extension to retrieve web pages. First, it retrieves the URL specified on the command line. Once a page has been retrieved, the program uses the XPath technique to get a list of links in the page. Then, after prepending a base URL to each link if necessary, the link is retrieved.
Because we need just the headers of these responses, we use the HEAD method instead of GET by setting the CURLOPT_NOBODY option. Setting CURLOPT_HEADER tells curl_exec() to include the response headers in the string it returns. Based on the response code, the status of the link is printed, along with its new location if it’s been moved.
Example stale-links.php
if (! isset($_SERVER['argv'][1])) {
die("No URL provided.\n");
}
$url = $_SERVER['argv'][1];
// Load the page
list($page,$pageInfo) = load_with_curl($url);
if (! strlen($page)) {
die("No page retrieved from $url");
}
// Convert to XML for easy parsing
$opts = array('output-xhtml' => true,
'numeric-entities' => true);
$xml = tidy_repair_string($page, $opts);
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
// Compute the Base URL for relative links
$baseURL = '';
// Check if there is a <base href=""/> in the page
$nodeList = $xpath->query('//xhtml:base/@href');
if ($nodeList->length == 1) {
$baseURL = $nodeList->item(0)->nodeValue;
}
// No <base href=""/>, so build the Base URL from $url
else {
$URLParts = parse_url($pageInfo['url']);
if (! (isset($URLParts['path']) && strlen($URLParts['path']))) {
$basePath = '';
} else {
$basePath = preg_replace('#/[^/]*$#','',$URLParts['path']);
}
if (isset($URLParts['username']) || isset($URLParts['password'])) {
$auth = isset($URLParts['username']) ? $URLParts['username'] : '';
$auth .= ':';
$auth .= isset($URLParts['password']) ? $URLParts['password'] : '';
$auth .= '@';
} else {
$auth = '';
}
$baseURL = $URLParts['scheme'] . '://' .
$auth . $URLParts['host'] .
$basePath;
}
// Keep track of the links we visit so we don't visit each more than once
$seenLinks = array();
// Grab all links
$links = $xpath->query('//xhtml:a/@href');
foreach ($links as $node) {
$link = $node->nodeValue;
// Resolve relative links
if (! preg_match('#^(http|https|mailto):#', $link)) {
if (((strlen($link) == 0)) || ($link[0] != '/')) {
$link = '/' . $link;
}
$link = $baseURL . $link;
}
// Skip this link if we've seen it already
if (isset($seenLinks[$link])) {
continue;
}
// Mark this link as seen
$seenLinks[$link] = true;
// Print the link we're visiting
print $link.': ';
flush();
list($linkHeaders, $linkInfo) = load_with_curl($link, 'HEAD');
// Decide what to do based on the response code
// 2xx response codes mean the page is OK
if (($linkInfo['http_code'] >= 200) && ($linkInfo['http_code'] < 300)) {
$status = 'OK';
}
// 3xx response codes mean redirection
else if (($linkInfo['http_code'] >= 300) && ($linkInfo['http_code'] < 400)) {
$status = 'MOVED';
if (preg_match('/^Location: (.*)$/m',$linkHeaders,$match)) {
$status .= ': ' . trim($match[1]);
}
}
// Other response codes mean errors
else {
$status = "ERROR: {$linkInfo['http_code']}";
}
// Print what we know about the link
print "$status\n";
}
function load_with_curl($url, $method = 'GET') {
$c = curl_init($url);
curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
if ($method == 'GET') {
curl_setopt($c,CURLOPT_FOLLOWLOCATION, true);
}
else if ($method == 'HEAD') {
curl_setopt($c, CURLOPT_NOBODY, true);
curl_setopt($c, CURLOPT_HEADER, true);
}
$response = curl_exec($c);
return array($response, curl_getinfo($c));
}
No comments:
Post a Comment