PHP Files
Counting Lines, Paragraphs, or Records in a File
Problem
You want to count the number of lines, paragraphs, or records in a file.
Solution
To count lines, use fgets():
$lines = 0;
if ($fh = fopen('orders.txt','r')) {
    while (! feof($fh)) {
        // compare against false so a final line of "0" with no newline still counts
        if (false !== fgets($fh)) {
            $lines++;
        }
    }
}
print $lines;
Because fgets() reads a line at a time, you can count the number of times it’s called before reaching the end of a file.
To count paragraphs, increment the counter only when you read a blank line:
$paragraphs = 0;
if ($fh = fopen('great-american-novel.txt','r')) {
    while (! feof($fh)) {
        $s = fgets($fh);
        if (("\n" == $s) || ("\r\n" == $s)) {
            $paragraphs++;
        }
    }
}
print $paragraphs;
To count records, increment the counter only when the line read contains just the record separator and whitespace. Here the record separator is stored in $record_separator:
$records = 0;
$record_separator = '--end--';
if ($fh = fopen('great-american-textfile-database.txt','r')) {
    while (! feof($fh)) {
        $s = rtrim(fgets($fh));
        if ($s == $record_separator) {
            $records++;
        }
    }
}
print $records;
Discussion
When counting lines, $lines is incremented only if fgets() returns a line. As fgets() moves through the file, it returns each line it retrieves. Once there are no more lines to read, it returns false, so $lines isn’t incremented incorrectly. Because EOF has been reached on the file, feof() returns true, and the while loop ends.
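The extra pass through the loop described above can be observed with a small sketch. The helper name trace_fgets() is hypothetical, and it reads from a php://memory stream rather than a real file, on the assumption that fgets() and feof() behave the same way there as on a plain file.

```php
<?php
// Hypothetical helper: records what fgets() returns on each pass of the loop.
function trace_fgets($text) {
    // Assumption: an in-memory stream stands in for a file on disk.
    $fh = fopen('php://memory', 'w+');
    fwrite($fh, $text);
    rewind($fh);

    $trace = array();
    while (! feof($fh)) {
        $line = fgets($fh);
        // Log the line without its line break, or the word "false" for a failed read.
        $trace[] = (false === $line) ? 'false' : rtrim($line);
    }
    fclose($fh);
    return $trace;
}

print implode(', ', trace_fgets("alpha\nbeta\n"));
```

When the text ends with a line break, the last entry in the trace is the false return, which is why the counter is guarded by a check on the return value of fgets().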
When counting paragraphs, the solution works properly on simple text but may produce unexpected results when presented with a long string of blank lines or a file without two consecutive line breaks. These problems can be remedied with functions based on preg_split(). If the file is small and can be read into memory, use the split_paragraphs() function. This function returns an array containing each paragraph in the file. For example:
function split_paragraphs($file,$rs="\r?\n") {
    $text = file_get_contents($file);
    $matches = preg_split("/(.*?$rs)(?:$rs)+/s",$text,-1,
                          PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
    return $matches;
}
The contents of the file are broken on two or more consecutive newlines and returned in the $matches array. The default record-separation regular expression, \r?\n, matches both Windows and Unix line breaks.
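To see what the regular expression returns, here is a minimal sketch that applies the same preg_split() call to a string instead of a file. The function name split_paragraph_text() and the sample text are made up for illustration.

```php
<?php
// Hypothetical variant of split_paragraphs() that takes the text directly.
function split_paragraph_text($text, $rs = "\r?\n") {
    return preg_split("/(.*?$rs)(?:$rs)+/s", $text, -1,
                      PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
}

// Mixed Unix and Windows line breaks, with a run of three breaks after "three".
$text = "one\ntwo\n\nthree\r\n\r\n\r\nfour\n";
$paragraphs = split_paragraph_text($text);

print count($paragraphs); // 3
```

Each captured paragraph keeps one trailing line break ("one\ntwo\n" and "three\r\n" here), and any text after the last run of separators ("four\n") comes back as the final element.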
If the file is too big to read into memory at once, use the split_paragraphs_largefile() function, which reads the file in 16 KB chunks:
function split_paragraphs_largefile($file,$rs="\r?\n") {
    $unmatched_text = '';
    $paragraphs = array();
    // $php_errormsg was removed in PHP 8.0, so report failures with error_get_last()
    $fh = fopen($file,'r') or die(error_get_last()['message'] ?? "can't open $file");
    while(! feof($fh)) {
        $s = fread($fh,16384);
        if (false === $s) {
            die(error_get_last()['message'] ?? "can't read $file");
        }
        if ('' === $s) {
            // an empty read is not an error; feof() ends the loop
            continue;
        }
        $text_to_split = $unmatched_text . $s;
        $matches = preg_split("/(.*?$rs)(?:$rs)+/s",$text_to_split,-1,
                              PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);
        // if the text read so far doesn't end with two record separators, its
        // last piece may be incomplete, so save it to prepend to the next chunk
        $last_match = $matches[count($matches)-1];
        if (! preg_match("/$rs$rs\$/",$text_to_split)) {
            $unmatched_text = $last_match;
            array_pop($matches);
        } else {
            $unmatched_text = '';
        }
        $paragraphs = array_merge($paragraphs,$matches);
    }
    // after reading all sections, if there is a final chunk that doesn't
    // end with the record separator, count it as a paragraph
    if ($unmatched_text) {
        $paragraphs[] = $unmatched_text;
    }
    return $paragraphs;
}
This function uses the same regular expression as split_paragraphs() to split the file into paragraphs. When it finds a paragraph end in a chunk read from the file, it saves the rest of the text in the chunk in $unmatched_text and prepends it to the next chunk read. This includes the unmatched text as the beginning of the next paragraph in the file.
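If only the number of paragraphs is needed, not the paragraphs themselves, a line-by-line pass avoids both the regular expression and the chunk bookkeeping. The following is a minimal sketch, not part of the recipe: the name count_paragraphs_stream() is hypothetical, it takes an open file handle instead of a filename, and it assumes a paragraph is any run of non-blank lines, treating whitespace-only lines as blank.

```php
<?php
// Hypothetical helper: counts paragraphs without holding them in memory.
// Assumption: a paragraph is a run of one or more non-blank lines.
function count_paragraphs_stream($fh) {
    $paragraphs = 0;
    $in_paragraph = false;
    while (false !== ($line = fgets($fh))) {
        if ('' === trim($line)) {
            // blank or whitespace-only line: the current paragraph, if any, has ended
            $in_paragraph = false;
        } elseif (! $in_paragraph) {
            // first non-blank line after a gap starts a new paragraph
            $in_paragraph = true;
            $paragraphs++;
        }
    }
    return $paragraphs;
}

// An in-memory stream stands in for a file opened with fopen($file,'r').
$fh = fopen('php://memory', 'w+');
fwrite($fh, "one\ntwo\n\n\n\nthree\n\nfour");
rewind($fh);
print count_paragraphs_stream($fh); // 3
```

Because the counter only advances on the first non-blank line after a gap, a long string of blank lines counts as a single break, and a file with no blank lines at all still counts as one paragraph.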
The record-counting code at the end of the Solution lets fgets() figure out how long each line is. If you can supply a reasonable upper bound on line length, stream_get_line() provides a more concise way to count records. This function reads a line until it reaches a certain number of bytes or it sees a particular delimiter. Supply it with the record separator as the delimiter, as shown:
$records = 0;
$record_separator = '--end--';
if ($fh = fopen('great-american-textfile-database.txt','r')) {
    $done = false;
    while (! $done) {
        $s = stream_get_line($fh, 65536, $record_separator);
        if (feof($fh)) {
            $done = true;
        } else {
            $records++;
        }
    }
}
print $records;
This example assumes that each record is no more than 64 KB (65,536 bytes) long. Each call to stream_get_line() returns one record, not including the record separator. When stream_get_line() has advanced past the last record separator, it reaches the end of the file, so $done is set to true to stop counting records.
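To inspect the records themselves and confirm that the separator is stripped, a sketch along these lines may help. The helper read_records() and the sample data are hypothetical. Unlike the counting loop above, it stops when stream_get_line() returns false and keeps any non-blank text after the last separator as a record, which is a different design choice.

```php
<?php
// Hypothetical helper: collects the records that stream_get_line() returns.
function read_records($fh, $record_separator, $max_length = 65536) {
    $records = array();
    while (false !== ($s = stream_get_line($fh, $max_length, $record_separator))) {
        // Drop the line breaks around each record and skip empty leftovers,
        // such as the newline that follows the final separator.
        $s = trim($s);
        if ('' !== $s) {
            $records[] = $s;
        }
    }
    return $records;
}

// An in-memory stream stands in for the text-file database.
$fh = fopen('php://memory', 'w+');
fwrite($fh, "first record\n--end--\nsecond record\n--end--\n");
rewind($fh);
print_r(read_records($fh, '--end--'));
```

Calling count() on the returned array gives the same kind of total as $records in the loop above, as long as the file ends with a record separator.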