PHP Web Automation Program: Finding Fresh Links


This program is a modification of the stale-links program: it produces a list of links and their last-modified times. If the server on which a URL lives doesn’t provide a last-modified time, the program prints only the link’s status. If the program can’t retrieve the URL successfully, it prints the status code it got when it tried to retrieve the URL. Run the program by passing it a URL to scan for links. Its output looks like this:

http://oreilly.com: OK; Last Modified: Fri, 24 May 2013 18:09:11 GMT

https://members.oreilly.com: MOVED: https://members.oreilly.com/account/login

http://shop.oreilly.com/basket.do: OK

http://shop.oreilly.com: OK

http://radar.oreilly.com: OK; Last Modified: Fri, 24 May 2013 20:40:56 GMT

http://animals.oreilly.com: OK; Last Modified: Fri, 24 May 2013 20:40:18 GMT

http://programming.oreilly.com: OK; Last Modified: Fri, 24 May 2013 20:42:44 GMT

...

This output is from a run of the program at about 8:43 P.M. GMT on May 24, 2013. Links that aren’t accompanied by a last-modified time are ones for which the server didn’t provide one, so those pages are probably dynamic.
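To make "fresh" concrete, the gap between a link's Last-Modified value and the time of the run can be computed with strtotime(). This is only a sketch: link_age_seconds() is a hypothetical helper, not part of the original program, and the run time of exactly 20:43:00 GMT is an assumption based on the "about 8:43 P.M." figure above.

```php
<?php
// Hypothetical helper: seconds elapsed between a Last-Modified value
// and a reference time. Returns null if either string cannot be parsed.
function link_age_seconds(string $lastModified, string $now): ?int
{
    $modified  = strtotime($lastModified);
    $reference = strtotime($now);
    if ($modified === false || $reference === false) {
        return null;
    }
    return $reference - $modified;
}

// Last-Modified value reported for http://radar.oreilly.com in the sample
// output; the reference time is an assumed run time of 20:43:00 GMT.
$age = link_age_seconds('Fri, 24 May 2013 20:40:56 GMT',
                        'Fri, 24 May 2013 20:43:00 GMT');
print "Changed $age seconds before the run\n";   // prints 124 seconds
```

Under that assumption the page changed roughly two minutes before the scan, which is what makes it a fresh link.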

The program to find fresh links is conceptually almost identical to the program to find stale links. It uses the same techniques to pull links out of a page and the same code to retrieve URLs.
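The link-extraction step can be tried on its own without fetching anything. The sketch below swaps in DOMDocument::loadHTML() where the full program uses Tidy and loadXML(), so it needs no XHTML namespace; extract_links() is a hypothetical helper name and the HTML string is made-up sample input.

```php
<?php
// Hypothetical helper: return the unique href values of all <a> elements.
// Assumes $html is a non-empty string.
function extract_links(string $html): array
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely well formed, so collect parser
    // warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = array();
    foreach ($xpath->query('//a/@href') as $node) {
        $links[] = $node->nodeValue;
    }
    // Drop duplicates, keeping the first occurrence of each link
    return array_values(array_unique($links));
}

$html = '<html><body>'
      . '<a href="/a">A</a>'
      . '<a href="http://example.com/b">B</a>'
      . '<a href="/a">A again</a>'
      . '<a name="x">no href here</a>'
      . '</body></html>';
print implode("\n", extract_links($html)) . "\n";
```

Anchors without an href attribute are ignored because the XPath expression selects the attribute nodes themselves.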

Once a page has been retrieved, each linked URL is retrieved with the HEAD method. Instead of just printing out a new location for moved links, however, the program also prints the value of the Last-Modified header if it’s available.
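A HEAD response is just a block of header lines, so picking out Last-Modified is a matter of text processing. As a sketch, the hypothetical parse_headers() helper below turns such a block into an array keyed by lowercased header name, which avoids depending on how a server capitalizes the name. The header block is made-up sample data, and when a header is repeated this helper keeps only the last value.

```php
<?php
// Hypothetical helper: turn a raw header block into an array keyed by
// lowercased header name.
function parse_headers(string $raw): array
{
    $headers = array();
    foreach (preg_split('/\r?\n/', $raw) as $line) {
        // Skip the status line and blank lines; neither is a "Name: value" pair
        if (strpos($line, ':') === false || strncmp($line, 'HTTP/', 5) === 0) {
            continue;
        }
        list($name, $value) = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }
    return $headers;
}

// Sample response headers (illustrative, not captured from a real server)
$raw = "HTTP/1.1 200 OK\r\n"
     . "Content-Type: text/html\r\n"
     . "last-modified: Fri, 24 May 2013 20:40:56 GMT\r\n\r\n";
$headers = parse_headers($raw);
if (isset($headers['last-modified'])) {
    print "Last Modified: {$headers['last-modified']}\n";
}
```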

Example fresh-links.php

<?php
error_reporting(E_ALL);

if (! isset($_SERVER['argv'][1])) {
    die("No URL provided.\n");
}
$url = $_SERVER['argv'][1];

// Load the page
list($page, $pageInfo) = load_with_curl($url);
if (! strlen($page)) {
    die("No page retrieved from $url\n");
}

// Convert to XML for easy parsing
$opts = array('output-xhtml' => true,
              'numeric-entities' => true);
$xml = tidy_repair_string($page, $opts);
$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);
$xpath->registerNamespace('xhtml','http://www.w3.org/1999/xhtml');

// Compute the Base URL for relative links.
$baseURL = '';
// Check if there is a <base href=""/> in the page
$nodeList = $xpath->query('//xhtml:base/@href');
if ($nodeList->length == 1) {
    $baseURL = $nodeList->item(0)->nodeValue;
}
// No <base href=""/>, so build the Base URL from $url
else {
    $URLParts = parse_url($pageInfo['url']);
    if (! (isset($URLParts['path']) && strlen($URLParts['path']))) {
        $basePath = '';
    } else {
        $basePath = preg_replace('#/[^/]*$#','',$URLParts['path']);
    }
    // parse_url() reports credentials under the 'user' and 'pass' keys
    if (isset($URLParts['user']) || isset($URLParts['pass'])) {
        $auth = isset($URLParts['user']) ? $URLParts['user'] : '';
        $auth .= ':';
        $auth .= isset($URLParts['pass']) ? $URLParts['pass'] : '';
        $auth .= '@';
    } else {
        $auth = '';
    }
    $baseURL = $URLParts['scheme'] . '://' .
               $auth . $URLParts['host'] .
               $basePath;
}

// Keep track of the links we visit so we don't visit each more than once
$seenLinks = array();

// Grab all links
$links = $xpath->query('//xhtml:a/@href');

foreach ($links as $node) {
    $link = $node->nodeValue;
    // Resolve relative links
    if (! preg_match('#^(http|https|mailto):#', $link)) {
        if ((strlen($link) == 0) || ($link[0] != '/')) {
            $link = '/' . $link;
        }
        $link = $baseURL . $link;
    }
    // Skip this link if we've seen it already
    if (isset($seenLinks[$link])) {
        continue;
    }
    // Mark this link as seen
    $seenLinks[$link] = true;
    // Print the link we're visiting
    print $link.': ';
    flush();

    list ($linkHeaders, $linkInfo) = load_with_curl($link, 'HEAD');
    // Decide what to do based on the response code
    // 2xx response codes mean the page is OK
    if (($linkInfo['http_code'] >= 200) && ($linkInfo['http_code'] < 300)) {
        $status = 'OK';
    }
    // 3xx response codes mean redirection
    else if (($linkInfo['http_code'] >= 300) && ($linkInfo['http_code'] < 400)) {
        $status = 'MOVED';
        // Header names are case-insensitive, so match them that way
        if (preg_match('/^Location: (.*)$/mi',$linkHeaders,$match)) {
            $status .= ': ' . trim($match[1]);
        }
    }
    // Other response codes mean errors
    else {
        $status = "ERROR: {$linkInfo['http_code']}";
    }
    // Add the last-modified time if the server sent one
    if (preg_match('/^Last-Modified: (.*)$/mi', $linkHeaders, $match)) {
        $status .= "; Last Modified: " . trim($match[1]);
    }
    // Print what we know about the link
    print "$status\n";
}

// Retrieve $url with cURL. Returns the response (the body for GET,
// the headers for HEAD) and the array of transfer information.
function load_with_curl($url, $method = 'GET') {
    $c = curl_init($url);
    curl_setopt($c, CURLOPT_RETURNTRANSFER, true);
    if ($method == 'GET') {
        curl_setopt($c,CURLOPT_FOLLOWLOCATION, true);
    }
    else if ($method == 'HEAD') {
        curl_setopt($c, CURLOPT_NOBODY, true);
        curl_setopt($c, CURLOPT_HEADER, true);
    }
    $response = curl_exec($c);
    return array($response, curl_getinfo($c));
}
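The example resolves a relative link by prefixing it with the base URL, which leaves ../ segments and scheme-relative links (//host/path) unresolved. The following sketch shows one way a fuller resolver could look. It is not part of the original program: resolve_link() is a hypothetical helper, the example.com URLs are placeholders, and the function simplifies the standard resolution rules (for instance, it drops fragments and ignores user credentials in the page URL).

```php
<?php
// Hypothetical helper: resolve $link against the URL of the page it came from.
// Returns null when $pageUrl has no scheme or host to resolve against.
function resolve_link(string $pageUrl, string $link): ?string
{
    // Links that already carry a scheme (http:, https:, mailto:, ...) need no work
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $link)) {
        return $link;
    }
    $base = parse_url($pageUrl);
    if (!is_array($base) || !isset($base['scheme'], $base['host'])) {
        return null;
    }
    // Scheme-relative link such as //cdn.example.com/a.js
    if (strncmp($link, '//', 2) === 0) {
        return $base['scheme'] . ':' . $link;
    }
    $origin = $base['scheme'] . '://' . $base['host']
            . (isset($base['port']) ? ':' . $base['port'] : '');
    $basePath = $base['path'] ?? '/';

    // Split the link into its path and an optional ?query; the #fragment is dropped
    preg_match('/^([^?#]*)(\?[^#]*)?/', $link, $m);
    $linkPath = $m[1] ?? '';
    $query    = $m[2] ?? '';

    if ($linkPath === '') {
        $path = $basePath;                  // link was only a query or fragment
    } elseif ($linkPath[0] === '/') {
        $path = $linkPath;                  // absolute path on the same host
    } else {
        // Relative path: replace the last segment of the page's path
        $path = preg_replace('#[^/]*$#', '', $basePath) . $linkPath;
    }

    // Collapse "." and ".." segments
    $out = array();
    foreach (explode('/', $path) as $segment) {
        if ($segment === '..') {
            if (count($out) > 1) {
                array_pop($out);
            }
        } elseif ($segment !== '.') {
            $out[] = $segment;
        }
    }
    return $origin . implode('/', $out) . $query;
}

print resolve_link('http://example.com/docs/guide/index.html', '../img/logo.png') . "\n";
// prints http://example.com/docs/img/logo.png
```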


Wednesday, June 19, 2019
