PHP Web Automation
Program: Finding Fresh Links
This program is a modification of the stale-links program: it produces a list of links and their last-modified times. If the server on which a URL lives doesn’t provide a last-modified time, the program prints only the link’s status. If the program can’t retrieve the URL successfully, it prints out the status code it got when it tried to retrieve the URL. Run the program by passing it a URL to scan for links; its output looks like this:
http://oreilly.com: OK; Last Modified: Fri, 24 May 2013 18:09:11 GMT
https://members.oreilly.com: MOVED: https://members.oreilly.com/account/login
http://shop.oreilly.com/basket.do: OK
http://shop.oreilly.com: OK
http://radar.oreilly.com: OK; Last Modified: Fri, 24 May 2013 20:40:56 GMT
http://animals.oreilly.com: OK; Last Modified: Fri, 24 May 2013 20:40:18 GMT
http://programming.oreilly.com: OK; Last Modified: Fri, 24 May 2013 20:42:44 GMT
...
This output is from a run of the program at about 8:43 P.M. GMT on May 24, 2013. Links that aren’t accompanied by a last-modified time come from servers that didn’t provide one, so those pages are probably dynamic.
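To make "fresh" concrete, a Last-Modified value can be converted into an age. The following is a minimal sketch, not part of fresh-links.php: the helper name last_modified_timestamp() is hypothetical, the header block is a hand-written sample, and "now" is fixed to the approximate run time quoted above so the arithmetic is reproducible.

```php
<?php
// Hypothetical helper (not in fresh-links.php): pull the Last-Modified value
// out of a raw header block and convert it to a Unix timestamp.
// Returns null when the header is missing or cannot be parsed.
function last_modified_timestamp(string $headers): ?int {
    if (preg_match('/^Last-Modified:[ \t]*(.*)$/mi', $headers, $match)) {
        $ts = strtotime(trim($match[1]));
        return ($ts === false) ? null : $ts;
    }
    return null;
}

// Sample header block, written by hand for illustration
$headers = "HTTP/1.1 200 OK\r\n" .
           "Content-Type: text/html\r\n" .
           "Last-Modified: Fri, 24 May 2013 18:09:11 GMT\r\n\r\n";

$modified = last_modified_timestamp($headers);
$now = gmmktime(20, 43, 0, 5, 24, 2013);   // about 8:43 P.M. GMT, May 24, 2013

print gmdate('Y-m-d H:i:s', $modified) . "\n";           // 2013-05-24 18:09:11
print intdiv($now - $modified, 60) . " minutes old\n";   // 153 minutes old
```

A header block without a Last-Modified line makes the helper return null, which a caller could treat as "probably dynamic".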
The program to find fresh links is conceptually almost identical to the program to find stale links. It uses the same techniques to pull links out of a page and the same code to retrieve URLs.
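The link-extraction step can be sketched in isolation. This version is an assumption-laden simplification: it swaps the Tidy-to-XHTML conversion for DOMDocument::loadHTML(), so no XHTML namespace is needed, and the function name extract_links() is hypothetical.

```php
<?php
// Hypothetical helper: return the unique href values of all <a> elements.
// Uses DOMDocument::loadHTML() instead of Tidy, so the XPath query needs
// no namespace prefix.
function extract_links(string $html): array {
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings quietly
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $seen = array();
    foreach ($xpath->query('//a/@href') as $node) {
        $seen[$node->nodeValue] = true;   // array keys de-duplicate the links
    }
    return array_keys($seen);
}

$html = '<html><body>' .
        '<a href="/a">A</a>' .
        '<a href="http://example.com/">B</a>' .
        '<a href="/a">A again</a>' .
        '</body></html>';
print_r(extract_links($html));   // two entries: /a and http://example.com/
```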
Once a page has been retrieved, each linked URL is retrieved with the HEAD method. Instead of just printing out a new location for moved links, however, it prints out a formatted version of the Last-Modified header if it’s available.
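The decision made for each HEAD response can be sketched as a pure function, which makes it easy to try without a network connection. The name describe_response() and the sample header blocks are hypothetical; the status strings follow the output format shown earlier.

```php
<?php
// Hypothetical helper: build a status line from an HTTP status code and
// the raw header block returned by a HEAD request.
function describe_response(int $code, string $headers): string {
    if ($code >= 200 && $code < 300) {
        // 2xx: the page is OK
        $status = 'OK';
    } elseif ($code >= 300 && $code < 400) {
        // 3xx: redirection; report the new location when one is given
        $status = 'MOVED';
        if (preg_match('/^Location:[ \t]*(.*)$/mi', $headers, $match)) {
            $status .= ': ' . trim($match[1]);
        }
    } else {
        // Anything else is reported as an error
        $status = "ERROR: $code";
    }
    // Append the Last-Modified value only when the server sent one
    if (preg_match('/^Last-Modified:[ \t]*(.*)$/mi', $headers, $match)) {
        $status .= '; Last Modified: ' . trim($match[1]);
    }
    return $status;
}

// Hand-written sample responses
print describe_response(200,
    "HTTP/1.1 200 OK\r\nLast-Modified: Fri, 24 May 2013 18:09:11 GMT\r\n\r\n") . "\n";
print describe_response(302,
    "HTTP/1.1 302 Found\r\nLocation: https://members.oreilly.com/account/login\r\n\r\n") . "\n";
print describe_response(404, "HTTP/1.1 404 Not Found\r\n\r\n") . "\n";
```

Header names are matched case-insensitively because servers and protocol versions differ in how they capitalize them.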
Example: fresh-links.php
<?php
error_reporting(E_ALL);
if (! isset($_SERVER['argv'][1])) {
    die("No URL provided.\n");
}
$url = $_SERVER['argv'][1];
// Load the page
list($page, $pageInfo) = load_with_curl($url);
if (! strlen($page)) {
    die("No page retrieved from $url\n");
}
// Convert to XML for easy parsing
$opts = array('output-xhtml' => true,
              'numeric-entities' => true);
$xml = tidy_repair_string($page, $opts);
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');
// Compute the Base URL for relative links.
$baseURL = '';
// Check if there is a <base href=""/> in the page
$nodeList = $xpath->query('//xhtml:base/@href');
if ($nodeList->length == 1) {
    $baseURL = $nodeList->item(0)->nodeValue;
}
// No <base href=""/>, so build the Base URL from $url
else {
    $URLParts = parse_url($pageInfo['url']);
    if (! (isset($URLParts['path']) && strlen($URLParts['path']))) {
        $basePath = '';
    } else {
        $basePath = preg_replace('#/[^/]*$#','',$URLParts['path']);
    }
    // parse_url() reports credentials under the 'user' and 'pass' keys
    if (isset($URLParts['user']) || isset($URLParts['pass'])) {
        $auth = isset($URLParts['user']) ? $URLParts['user'] : '';
        $auth .= ':';
        $auth .= isset($URLParts['pass']) ? $URLParts['pass'] : '';
        $auth .= '@';
    } else {
        $auth = '';
    }
    $baseURL = $URLParts['scheme'] . '://' .
               $auth . $URLParts['host'] .
               $basePath;
}
// Keep track of the links we visit so we don't visit each more than once
$seenLinks = array();
// Grab all links
$links = $xpath->query('//xhtml:a/@href');
foreach ($links as $node) {
    $link = $node->nodeValue;
    // Resolve relative links
    if (! preg_match('#^(http|https|mailto):#', $link)) {
        if (((strlen($link) == 0)) || ($link[0] != '/')) {
            $link = '/' . $link;
        }
        $link = $baseURL . $link;
    }
    // Skip this link if we've seen it already
    if (isset($seenLinks[$link])) {
        continue;
    }
    // Mark this link as seen
    $seenLinks[$link] = true;
    // Print the link we're visiting
    print $link.': ';
    flush();
    list ($linkHeaders, $linkInfo) = load_with_curl($link, 'HEAD');
    // Decide what to do based on the response code
    // 2xx response codes mean the page is OK
    if (($linkInfo['http_code'] >= 200) && ($linkInfo['http_code'] < 300)) {
        $status = 'OK';
    }
    // 3xx response codes mean redirection
    else if (($linkInfo['http_code'] >= 300) && ($linkInfo['http_code'] < 400)) {
        $status = 'MOVED';
        // Header names are case-insensitive, so match with the i flag
        if (preg_match('/^Location: (.*)$/mi',$linkHeaders,$match)) {
            $status .= ': ' . trim($match[1]);
        }
    }
    // Other response codes mean errors
    else {
        $status = "ERROR: {$linkInfo['http_code']}";
    }
    // Add the Last-Modified time if the server provided one
    if (preg_match('/^Last-Modified: (.*)$/mi', $linkHeaders, $match)) {
        $status .= "; Last Modified: " . trim($match[1]);
    }
    // Print what we know about the link
    print "$status\n";
}

// Retrieve a URL with cURL; returns array(response, transfer info)
function load_with_curl($url, $method = 'GET') {
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    if ($method == 'GET') {
        // Follow redirects when loading the page to scan
        curl_setopt($c, CURLOPT_FOLLOWLOCATION, true);
    }
    else if ($method == 'HEAD') {
        // Ask for headers only, and include them in the response
        curl_setopt($c, CURLOPT_NOBODY, true);
        curl_setopt($c, CURLOPT_HEADER, true);
    }
    $response = curl_exec($c);
    return array($response, curl_getinfo($c));
}
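The listing resolves every non-absolute link by prepending the base URL, directory part included. A slightly more careful variant, sketched here under assumptions, resolves root-relative links (those starting with a slash) against the scheme and host only. The function name resolve_link() is hypothetical, and the sketch deliberately ignores `..` segments, protocol-relative `//host` links, and query-only links.

```php
<?php
// Hypothetical helper: resolve $link against the URL of the page it was
// found on. Not a full RFC 3986 resolver; see the limitations noted above.
function resolve_link(string $pageURL, string $link): string {
    // Anything that already has a scheme (http:, https:, mailto:, ...) is absolute
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }
    $parts = parse_url($pageURL);
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    // Root-relative link: attach directly to scheme and host
    if ($link !== '' && $link[0] === '/') {
        return $root . $link;
    }
    // Document-relative link: attach to the directory of the current page
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = preg_replace('#/[^/]*$#', '', $path);
    return $root . $dir . '/' . $link;
}

print resolve_link('http://example.com/docs/index.html', 'guide.html') . "\n";
// http://example.com/docs/guide.html
print resolve_link('http://example.com/docs/index.html', '/about') . "\n";
// http://example.com/about
```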