PHP Internationalization and Localization Manipulating UTF-8 Text - Supercoders | Web Development and Design | Tutorial for Java, PHP, HTML, Javascript

Saturday, June 29, 2019

PHP Internationalization and Localization Manipulating UTF-8 Text

PHP Internationalization and Localization


Manipulating UTF-8 Text


Problem

You want to work with UTF-8–encoded text in your programs. For example, you want to properly calculate the length of multibyte strings and make sure that all text is output as proper UTF-8–encoded characters.

Solution

Use a combination of PHP functions for the variety of tasks that UTF-8 compliance demands. If the mbstring extension is available, use its string functions for UTF-8–aware string manipulation.

Example  Using mb_strlen()

       // Set the encoding properly
       mb_internal_encoding('UTF-8');
       // ö is two bytes
       $name = 'Kurt Gödel';
       // Each of these Hangul characters is three bytes
       $dinner = '불고기';

       $name_len_bytes = strlen($name);
       $name_len_chars = mb_strlen($name);

       $dinner_len_bytes = strlen($dinner);
       $dinner_len_chars = mb_strlen($dinner);

       print "$name is $name_len_bytes bytes and $name_len_chars chars\n";
       print "$dinner is $dinner_len_bytes bytes and $dinner_len_chars chars\n";

Example prints:

       Kurt Gödel is 11 bytes and 10 chars
       불고기 is 9 bytes and 3 chars

Example  Using iconv

       // Set the encoding properly
       iconv_set_encoding('internal_encoding','UTF-8');
       // ö is two bytes
       $name = 'Kurt Gödel';
       // Each of these Hangul characters is three bytes
       $dinner = '불고기';

       $name_len_bytes = strlen($name);
       $name_len_chars = iconv_strlen($name);

       $dinner_len_bytes = strlen($dinner);
       $dinner_len_chars = iconv_strlen($dinner);

       print "$name is $name_len_bytes bytes and $name_len_chars chars\n";
       print "$dinner is $dinner_len_bytes bytes and $dinner_len_chars chars\n";

       print "The seventh character of $name is " . iconv_substr($name,6,1) . "\n";
       print "The last two characters of $dinner are " . iconv_substr($dinner,-2);

Use the optional third argument to functions such as htmlentities() and htmlspecialchars() that instructs them to treat input as UTF-8 encoded.

Example  UTF-8 HTML encoding

       $encoded_name = htmlspecialchars($_POST['name'], ENT_QUOTES, 'UTF-8');
       $encoded_dinner = htmlentities($_POST['dinner'], ENT_QUOTES, 'UTF-8');

Discussion

Eternal vigilance is the price of proper character encoding. If you have set things up so that data coming into your program is UTF-8 encoded and browsers handle data coming out of your program as UTF-8, two responsibilities remain: to operate on strings in a UTF-8–aware manner and to generate text that is UTF-8 encoded.

Fulfilling the first responsibility is made easier once you have adopted the fundamental credo of internationalization awareness: a character is not a byte. The PHP-specific corollary to this axiom is that PHP’s string functions only know about bytes, not characters. For example, the strlen() function counts the number of bytes in a string, not the number of characters. In the prelapsarian days of ISO-8859-1 encoding, this wasn’t a problem: each of the 256 characters in the character set took up one byte. A UTF-8–encoded character, on the other hand, uses between one and four bytes. The mbstring and iconv extensions provide alternatives for some string functions that operate on a character-by-character basis, not a byte-by-byte basis. These functions are listed in the table below.

Table Character-based functions

Regular function    mbstring function     iconv function
strlen()            mb_strlen()           iconv_strlen()
strpos()            mb_strpos()           iconv_strpos()
strrpos()           mb_strrpos()          iconv_strrpos()
substr()            mb_substr()           iconv_substr()
strtolower()        mb_strtolower()       -
strtoupper()        mb_strtoupper()       -
substr_count()      mb_substr_count()     -
ereg()              mb_ereg()             -
eregi()             mb_eregi()            -
ereg_replace()      mb_ereg_replace()     -
eregi_replace()     mb_eregi_replace()    -
split()             mb_split()            -
mail()              mb_send_mail()        -
____________________________________________________________________
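To make the difference between the first two columns of the table concrete, here is a minimal sketch comparing a few byte-based functions with their mbstring counterparts. It assumes the mbstring extension is loaded, and it passes 'UTF-8' explicitly to each call so the result does not depend on configuration. The variable names are only illustrative.

```php
<?php
$name = 'Kurt Gödel';

// strpos() reports a byte offset; mb_strpos() reports a character offset.
$byte_pos = strpos($name, 'd');                 // ö occupies two bytes before d
$char_pos = mb_strpos($name, 'd', 0, 'UTF-8');

// substr() can cut a multibyte character in half; mb_substr() cannot.
$byte_cut = substr($name, 0, 7);                // ends with half of ö
$char_cut = mb_substr($name, 0, 7, 'UTF-8');    // 'Kurt Gö'

// Case conversion that knows about non-ASCII letters.
$upper = mb_strtoupper($name, 'UTF-8');

print "Byte offset of d: $byte_pos, character offset of d: $char_pos\n";
print "mb_substr() gives: $char_cut\n";
print "mb_strtoupper() gives: $upper\n";
print 'substr() result is valid UTF-8: ';
print mb_check_encoding($byte_cut, 'UTF-8') ? "yes\n" : "no\n";
```

The truncated string from substr() is not printed directly because it ends in an incomplete byte sequence, which is exactly the kind of output the character-based functions help you avoid.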

For mbstring to work properly, it needs to be told to use the UTF-8 encoding scheme. As in the mb_strlen() example above, you can do this in a script with the mb_internal_encoding() function. To set the value system-wide instead, set the mbstring.internal_encoding configuration directive to UTF-8. (As of PHP 5.6 that directive is deprecated in favor of default_charset, which defaults to UTF-8.)

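iconv has similar needs. Use the iconv_set_encoding() function or set the iconv.internal_encoding configuration directive. As of PHP 5.6 both are deprecated; on current versions, rely on default_charset or pass the encoding to each iconv function explicitly.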
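Both extensions also accept the encoding as an optional argument on each call, which avoids depending on a global setting. The following is a sketch of that approach, assuming the mbstring and iconv extensions are available. The deliberately wrong internal encoding is a hypothetical setup used only to show what the explicit argument protects against.

```php
<?php
$dinner = '불고기';

// Hypothetical misconfiguration: the internal encoding is a single-byte one.
mb_internal_encoding('ISO-8859-1');

// Relying on the internal encoding now counts every byte as a character.
$implicit = mb_strlen($dinner);

// Naming the encoding on each call gives the right answer regardless.
$explicit  = mb_strlen($dinner, 'UTF-8');
$iconv_len = iconv_strlen($dinner, 'UTF-8');
$last_two  = iconv_substr($dinner, -2, 2, 'UTF-8');

print "Implicit mb_strlen(): $implicit\n";
print "Explicit mb_strlen(): $explicit\n";
print "Explicit iconv_strlen(): $iconv_len\n";
print "Last two characters: $last_two\n";
```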

mbstring provides alternatives for the ereg family of regular expression functions (the ereg functions themselves were removed in PHP 7.0). However, you can always use UTF-8 strings with the PCRE (preg_*()) regular expression functions. The u modifier tells a preg function that the pattern and subject strings are UTF-8 encoded and enables the use of various Unicode properties in patterns. The following example uses the “lowercase letter” Unicode property to count the number of lowercase letters in each of two strings.

Example  UTF-8 regular expression matching

       $name = 'Kurt Gödel';
       $dinner = '불고기';

       $name_lower = preg_match_all('/\p{Ll}/u',$name,$match);
       $dinner_lower = preg_match_all('/\p{Ll}/u',$dinner,$match);

       print "There are $name_lower lowercase letters in $name.\n";
       print "There are $dinner_lower lowercase letters in $dinner.\n";

Example prints:

       There are 7 lowercase letters in Kurt Gödel.
       There are 0 lowercase letters in 불고기.
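The effect of the u modifier is easiest to see by running the same pattern with and without it. The sketch below is an illustration, not part of the original recipe: it counts matches of a single dot, splits a string into characters with an empty pattern, and shows that a preg function reports an error for a subject that is not valid UTF-8.

```php
<?php
$dinner = '불고기';

// Without u, the pattern works on bytes, so . matches each byte in turn.
$without_u = preg_match_all('/./', $dinner, $match);

// With u, . matches one whole UTF-8 character.
$with_u = preg_match_all('/./u', $dinner, $match);

// An empty pattern with u splits the string between characters.
$chars = preg_split('//u', $dinner, -1, PREG_SPLIT_NO_EMPTY);

// A lone lead byte is not valid UTF-8, so matching with u fails.
$invalid = preg_match('/./u', "\xC3");
$bad_utf8 = (preg_last_error() === PREG_BAD_UTF8_ERROR);

print "Matches without u: $without_u\n";
print "Matches with u: $with_u\n";
print 'Characters: ' . implode(', ', $chars) . "\n";
print 'Invalid subject returns false: ' . var_export($invalid === false, true) . "\n";
```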

Other functions help you translate between other character encodings and UTF-8. The utf8_encode() and utf8_decode() functions move strings between the ISO-8859-1 encoding and UTF-8. Because ISO-8859-1 is the default encoding in many situations, these functions are a handy way to bring non-UTF-8–aware data into compliance. For example, the dictionaries that the pspell extension uses often have their entries encoded in ISO-8859-1. In the example below, the utf8_encode() function is necessary to turn the output of pspell_suggest() into a proper UTF-8–encoded string.

Example  Applying UTF-8 encoding to ISO-8859-1 strings

       $lang = isset($_GET['lang']) ? $_GET['lang'] : 'en';
       $word = isset($_GET['word']) ? $_GET['word'] : 'asparagus';

       $ps = pspell_new($lang);
       $check = pspell_check($ps, $word);

       print htmlspecialchars($word,ENT_QUOTES,'UTF-8');
       print $check ? ' is ' : ' is not ';
       print ' found in the dictionary.';
       print '<hr/>';
   
       if (! $check) {
            $suggestions = pspell_suggest($ps, $word);
            if (count($suggestions)) {
                 print 'Suggestions: <ul>';
                 foreach ($suggestions as $suggestion) {
                      $utf8suggestion = utf8_encode($suggestion);
                      $safesuggestion = htmlspecialchars($utf8suggestion,
                                                         ENT_QUOTES, 'UTF-8');
                      print "<li>$safesuggestion</li>";
                 }
                 print '</ul>';
            }
       }
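Note that utf8_encode() and utf8_decode() are deprecated as of PHP 8.2. A possible replacement, sketched here under the assumption that the source text really is ISO-8859-1, is mb_convert_encoding() or iconv() with both encodings named explicitly. The sample string is made up for illustration and does not come from pspell.

```php
<?php
// "Gödel" as ISO-8859-1 bytes: ö is the single byte 0xF6.
$latin1 = "G\xF6del";

// Convert with mbstring: target encoding first, then source encoding.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// The same conversion with iconv: source encoding first, then target.
$via_iconv = iconv('ISO-8859-1', 'UTF-8', $latin1);

print strlen($latin1) . " bytes as ISO-8859-1\n";
print strlen($utf8) . " bytes as UTF-8\n";
print htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8') . "\n";
```

Unlike utf8_encode(), both calls state the source encoding, so the same code can be adapted if the data turns out to be in some other single-byte encoding.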

It may ease the cognitive burden of proper character encoding to think of it as a task similar to HTML entity encoding. In each case, text must be processed so that it is appropriately formatted for a particular context. With entity encoding, that usually means running data retrieved from an external source through htmlentities() or htmlspecialchars(). With character encoding, it means turning everything into UTF-8 before you process it, using a character-aware function for string operations, and ensuring strings are UTF-8 encoded before outputting them.
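The three steps in that last sentence can be sketched as one small pipeline. The to_utf8() helper below is hypothetical (it is not part of PHP or of this recipe), and it makes a loud assumption: any input that is not already valid UTF-8 is treated as ISO-8859-1. Real input may need a different fallback.

```php
<?php
// Hypothetical helper: assumes non-UTF-8 input is ISO-8859-1.
function to_utf8(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text;
    }
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}

// 1. Turn everything into UTF-8 before processing it.
$raw = "Kurt G\xF6del";   // ISO-8859-1 bytes from some external source
$name = to_utf8($raw);

// 2. Use character-aware functions for string operations.
$initials = mb_substr($name, 0, 1, 'UTF-8') . mb_substr($name, 5, 1, 'UTF-8');
$length = mb_strlen($name, 'UTF-8');

// 3. Output UTF-8, entity-encoded for the HTML context.
$output = htmlspecialchars("$name ($initials, $length chars)", ENT_QUOTES, 'UTF-8');
print $output . "\n";
```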
