PHP Regular Expressions
Matching Words
Problem
You want to pull out all words from a string.
Solution
The simplest way to do this is to pass a pattern built on the PCRE “word character” escape sequence, \w, to preg_match_all():
$text = "Knock, knock. Who's there? r2d2!";
$count = preg_match_all('/\w+/', $text, $matches); // returns the number of matches; the words are in $matches[0]
var_dump($matches[0]);
Discussion
The \w escape sequence matches letters, digits, and underscores. It does not include other punctuation. So the output from the preceding code is:
array(6) {
  [0]=>
  string(5) "Knock"
  [1]=>
  string(5) "knock"
  [2]=>
  string(3) "Who"
  [3]=>
  string(1) "s"
  [4]=>
  string(5) "there"
  [5]=>
  string(4) "r2d2"
}
This is mostly correct, except that Who’s is split into Who and s. To handle English contractions properly, extend the pattern to match either a single word character or an apostrophe sandwiched between two word characters:
$text = "Knock, knock. Who's there? r2d2!";
$pattern = "/(?:\w'\w|\w)+/";
$count = preg_match_all($pattern, $text, $matches); // number of matches found
var_dump($matches[0]);
(The ?: syntax in this pattern prevents the text that matches the parenthesized subpattern from being “captured.”)
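To see what the non-capturing group changes, the following sketch runs the contraction pattern both ways. The function name extract_words() is a hypothetical helper introduced only for illustration; preg_match_all() is the only call taken from PHP itself.

```php
<?php
// Hypothetical helper (not a PHP built-in): returns every word in $text,
// keeping an apostrophe only when it sits between two word characters.
function extract_words(string $text): array
{
    $count = preg_match_all("/(?:\w'\w|\w)+/", $text, $matches);
    // preg_match_all() returns false on error; treat that as "no words".
    return $count === false ? [] : $matches[0];
}

$text = "Knock, knock. Who's there? r2d2!";
$words = extract_words($text);
print implode(', ', $words) . "\n";   // Knock, knock, Who's, there, r2d2

// The same pattern with a capturing group adds an extra slot,
// $captured[1], that the non-capturing version never creates.
preg_match_all("/(\w'\w|\w)+/", $text, $captured);
preg_match_all("/(?:\w'\w|\w)+/", $text, $plain);
print count($captured) . ' vs ' . count($plain) . "\n";   // 2 vs 1
```

Because only the full matches in index 0 are needed here, the non-capturing form keeps the result array free of entries that would otherwise be ignored.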
With the addition of the u modifier, the pattern and the subject string are treated as UTF-8, so \w also matches letters and digits outside the ASCII range. For example:
$fr = 'Toc, toc. Qui est là? R2D2!';
$fr_count = preg_match_all('/\w+/u', $fr, $matches);
print "The French words are:\n\t";
print implode(', ', $matches[0]) . "\n";
$kr = '노크, 노크. 거기 누구입니까? R2D2!';
$kr_count = preg_match_all('/\w+/u', $kr, $matches);
print "The Korean words are:\n\t";
print implode(', ', $matches[0]) . "\n";
This prints:
The French words are:
	Toc, toc, Qui, est, là, R2D2
The Korean words are:
	노크, 노크, 거기, 누구입니까, R2D2
Without the u at the end of each pattern, \w typically matches only ASCII letters, digits, and underscores, so the non-ASCII characters drop out of the matches: là is cut down to l, and the Korean words disappear entirely.
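The difference is easy to check side by side. This sketch assumes the script file is saved as UTF-8, and it pins LC_CTYPE to the C locale so that the run without u behaves the same on every system. Both are assumptions added for the demonstration, as are the variable names.

```php
<?php
// Assumption: this file is saved as UTF-8.
// Assumption: pin the C locale so \w without /u covers ASCII only.
setlocale(LC_CTYPE, 'C');

$fr = 'Toc, toc. Qui est là? R2D2!';
$kr = '노크, 노크. 거기 누구입니까? R2D2!';

// With /u, \w matches Unicode letters and digits.
preg_match_all('/\w+/u', $fr, $fr_unicode);
preg_match_all('/\w+/u', $kr, $kr_unicode);

// Without /u, the subject is scanned one byte at a time, and the
// bytes of a multibyte UTF-8 character are not word characters.
preg_match_all('/\w+/', $fr, $fr_bytes);
preg_match_all('/\w+/', $kr, $kr_bytes);

print implode(', ', $fr_unicode[0]) . "\n";  // Toc, toc, Qui, est, là, R2D2
print implode(', ', $fr_bytes[0]) . "\n";    // Toc, toc, Qui, est, l, R2D2
print implode(', ', $kr_unicode[0]) . "\n";  // 노크, 노크, 거기, 누구입니까, R2D2
print implode(', ', $kr_bytes[0]) . "\n";    // R2D2
```

Under a locale with a single-byte encoding the run without u may give different results, which is one more reason to prefer the u modifier for UTF-8 text.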