PHP Web Automation Marking Up a Web Page - Supercoders | Web Development and Design | Tutorial for Java, PHP, HTML, Javascript

Sunday, June 16, 2019

PHP Web Automation Marking Up a Web Page

PHP Web Automation


Marking Up a Web Page

Problem

You want to display a web page—for example, a search result—with certain words highlighted.

Solution

Build a replacement string for each word you want to highlight. Then chop the page into “HTML elements” and “text between HTML elements” and apply the replacements only to the text between HTML elements.

Example: Marking up a web page

         $body = '
         <p>I like pickles and herring.</p>

         <a href="pickle.php"><img src="pickle.jpg"/>A pickle picture</a>

         I have a herringbone-patterned toaster cozy.

         <herring>Herring is not a real HTML element!</herring>
         ';

         $words = array('pickle','herring');
         $replacements = array();
         foreach ($words as $i => $word) {
                  $replacements[] = "<span class='word-$i'>$word</span>";
         }

         // Split up the page into chunks delimited by a
         // reasonable approximation of what an HTML element
         // looks like.
         $parts = preg_split("{(<(?:\"[^\"]*\"|'[^']*'|[^'\">])*>)}",
                             $body,
                             -1, // Unlimited number of chunks
                             PREG_SPLIT_DELIM_CAPTURE);
         foreach ($parts as $i => $part) {
                  // Skip if this part is an HTML element
                  if (isset($part[0]) && ($part[0] == '<')) { continue; }
                  // Wrap the words with <span/>s
                  $parts[$i] = str_replace($words, $replacements, $part);
         }

         // Reconstruct the body
         $body = implode('',$parts);

         print $body;

Discussion

The first example prints:

         <p>I like <span class='word-0'>pickle</span>s and <span class='word-1'>herring</span>.</p>

         <a href="pickle.php"><img src="pickle.jpg"/>A <span class='word-0'>pickle</span> picture</a>

         I have a <span class='word-1'>herring</span>bone-patterned toaster cozy.

         <herring>Herring is not a real HTML element!</herring>

Each of the words in $words (pickle and herring) has been wrapped with a <span/> that has a specific class attribute. Use a CSS stylesheet to attach particular display attributes to these classes, such as a bright yellow background or a border.

The regular expression chops up $body into a series of chunks delimited by HTML elements. This lets us replace only the text between HTML elements and leave the elements themselves alone, since their tag names or attribute values might contain a search term. The regular expression does a pretty good job of matching HTML elements, but if you have some particularly crazy, malformed markup with mismatched or unescaped quotes, it might get confused.
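To see what the split produces, here is a minimal sketch that runs the same pattern on one short string. The sample markup in $html is made up for illustration and was chosen so that an attribute value contains a search term.

```php
<?php
// Hypothetical sample input: "pickle" appears in an attribute and in the text.
$html = '<a href="pickle.php">A pickle</a>';

// Same pattern as in the example. PREG_SPLIT_DELIM_CAPTURE keeps the
// matched tags in the result instead of discarding them.
$parts = preg_split("{(<(?:\"[^\"]*\"|'[^']*'|[^'\">])*>)}",
                    $html,
                    -1,
                    PREG_SPLIT_DELIM_CAPTURE);

// $parts now holds five elements:
//   0 => ''                        (empty text before the first tag)
//   1 => '<a href="pickle.php">'   (tag, skipped by the loop)
//   2 => 'A pickle'                (text, the only place words get wrapped)
//   3 => '</a>'                    (tag, skipped by the loop)
//   4 => ''                        (empty text after the last tag)
print_r($parts);
```

The empty strings at either end are why the loop checks isset($part[0]) before looking at the first character.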

Because str_replace() is case sensitive, only strings that exactly match words in $words are replaced. The last Herring doesn’t get highlighted because it begins with a capital letter. To do case-insensitive matching, we need to switch from str_replace() to regular expressions. (We can’t use str_ireplace() because the replacement has to preserve the case of what matched.) 
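The problem with str_ireplace() can be seen in a short sketch. The sample sentence comes from the example page; the variable names are only for illustration.

```php
<?php
$text = 'Herring is not a real HTML element!';
$replacement = "<span class='word-1'>herring</span>";

// str_ireplace() finds "Herring" regardless of case, but it inserts the
// replacement string verbatim, so the capital H from the page is lost.
$result = str_ireplace('herring', $replacement, $text);

print $result;
// prints: <span class='word-1'>herring</span> is not a real HTML element!
```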

Example: Marking up a web page with regular expressions

         $body = '
         <p>I like pickles and herring.</p>

         <a href="pickle.php"><img src="pickle.jpg"/>A pickle picture</a>

         I have a herringbone-patterned toaster cozy.

         <herring>Herring is not a real HTML element!</herring>
         ';

         $words = array('pickle','herring');
         $patterns = array();
         $replacements = array();
         foreach ($words as $i => $word) {
                  $patterns[] = '/' . preg_quote($word, '/') . '/i';
                  $replacements[] = "<span class='word-$i'>\\0</span>";
         }

         // Split up the page into chunks delimited by a
         // reasonable approximation of what an HTML element
         // looks like.
         $parts = preg_split("{(<(?:\"[^\"]*\"|'[^']*'|[^'\">])*>)}",
                             $body,
                             -1, // Unlimited number of chunks
                             PREG_SPLIT_DELIM_CAPTURE);
         foreach ($parts as $i => $part) {
                  // Skip if this part is an HTML element
                  if (isset($part[0]) && ($part[0] == '<')) { continue; }
                  // Wrap the words with <span/>s
                  $parts[$i] = preg_replace($patterns, $replacements, $part);
         }

         // Reconstruct the body
         $body = implode('',$parts);

         print $body;

The two differences from the first example are that this one builds a $patterns array in the loop at the top and uses preg_replace() (with the $patterns array) instead of str_replace(). The i at the end of each element in $patterns makes the match case insensitive. The \\0 in the replacement stands for the entire text that the pattern matched, so the highlighted word keeps whatever case it had in the page.
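As a small illustration of the whole-match reference, the sketch below applies one case-insensitive pattern to a made-up string containing three spellings of the same word.

```php
<?php
// Hypothetical sample text with three different capitalizations.
$text = 'Herring and herring and HERRING';

// In a double-quoted PHP string, "\\0" produces the two characters \0,
// which preg_replace() expands to the full text of each match.
$result = preg_replace('/' . preg_quote('herring', '/') . '/i',
                       "<span class='word-1'>\\0</span>",
                       $text);

print $result;
// prints: <span class='word-1'>Herring</span> and <span class='word-1'>herring</span> and <span class='word-1'>HERRING</span>
```

preg_replace() also accepts $0 as an equivalent way to write the same reference.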

Switching to regular expressions also makes it easy to prevent substring matching. In the example above, the herring in herringbone gets highlighted. To prevent this, add \b to both ends of each pattern when building $patterns: $patterns[] = '/\b' . preg_quote($word, '/') . '\b/i';. The \b assertions tell preg_replace() to match a word only when it stands on its own, that is, when it is not immediately preceded or followed by a letter, digit, or underscore. Note that this also stops pickle from matching inside pickles.
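The effect of \b is easiest to see side by side. This is a minimal sketch on a made-up sentence, with square brackets standing in for the <span/> markup to keep the output short.

```php
<?php
// Hypothetical sample text; [brackets] stand in for the <span/> wrapper.
$text = 'pickles, herring, and a herringbone pattern';

// Without \b, the pattern also matches inside "herringbone".
$loose = preg_replace('/herring/i', '[\\0]', $text);

// With \b, only the standalone word matches.
$strict = preg_replace('/\bherring\b/i', '[\\0]', $text);

// Side effect: "pickle" no longer matches inside "pickles".
$strictPickle = preg_replace('/\bpickle\b/i', '[\\0]', $text);

print $loose . "\n";        // pickles, [herring], and a [herring]bone pattern
print $strict . "\n";       // pickles, [herring], and a herringbone pattern
print $strictPickle . "\n"; // pickles, herring, and a herringbone pattern
```

Whether the stricter matching is what you want depends on the search: it avoids false hits like herringbone, but it also skips plurals and other inflected forms.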

