PHP Regex Remove Whitespace from HTML


There are many ways to optimize an HTML page, and one of them is to remove whitespace. Whitespace between tags in HTML pages is mostly there for readability, so if you have a site that gets a lot of visits, it’s a good idea to consider stripping out things such as extra whitespace in the HTML. For a small to medium-sized website, you can save anywhere from 500 megabytes (MB) to a few gigabytes (GB) of transfer a month just by cleaning whitespace and newline characters out of its HTML.

<html>
<head><title>Removing whitespace from HTML</title></head>
<body>
<form action="<?= htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post">
<input type="text" name="html"
value="<?= htmlspecialchars($_POST['html'] ?? ''); ?>" /><br />
<input type="submit" value="Remove whitespace" /><br /><br />
<?php
if ( $_SERVER['REQUEST_METHOD'] == "POST" )
{
$html = $_POST['html'];
// Strip whitespace between the end of one tag (> or />) and the start of the next.
$newhtml = preg_replace( "/(?:(?<=\>)|(?<=\/\>))(\s+)(?=\<\/?)/","", $html );
print "<b>Original text was: &nbsp;'". htmlspecialchars($html) .
"'</b><br/>";
print "<b>New text is: &nbsp;'". htmlspecialchars($newhtml) . "'</b><br />";
}
?>
</form>
</body>
</html>
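
Outside the form, the same replacement can be wrapped in a small function. The following is only a minimal sketch: the function name strip_tag_whitespace and the sample markup are made up for illustration and are not part of the demo above.

```php
<?php
// Hypothetical helper (the name is an assumption, not from the article):
// removes runs of whitespace that sit between the end of one tag and the
// start of the next.
function strip_tag_whitespace(string $html): string
{
    $result = preg_replace('/(?:(?<=\>)|(?<=\/\>))(\s+)(?=\<\/?)/', '', $html);

    // preg_replace() returns null when the pattern fails to compile or the
    // match errors out; fall back to the untouched input in that case.
    return $result === null ? $html : $result;
}

// Sample input, made up for this example.
$html = "<ul>\n    <li>One</li>\n    <li>Two</li>\n</ul>";
$min  = strip_tag_whitespace($html);

echo $min, "\n";   // <ul><li>One</li><li>Two</li></ul>
echo strlen($html) - strlen($min), " bytes saved\n";
```

One thing to watch for: the pattern also removes whitespace between inline elements, so <b>one</b> <i>two</i> becomes <b>one</b><i>two</i> and the two words run together when rendered. The same applies to markup inside a <pre> block.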

Regular Expression Explanation:

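The look-behind group (?:(?<=\>)|(?<=\/\>)) matches the end of an HTML tag. The shorter form (?<=\>|\/\>) has alternatives of different lengths, which Perl has traditionally rejected because it requires every branch of a look-behind to match the same length. PHP’s PCRE engine does accept top-level alternatives of different fixed lengths, but breaking each look-behind out on its own and putting both inside a group, as in (?:(?<=\>)|(?<=\/\>)), works in either engine. Strictly speaking, the second look-behind is redundant, because any tag that ends in /> also ends in >.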
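
To compare the different ways of writing the look-behind, a short script along these lines can be run. It is a sketch under assumptions: the array keys and the sample markup are invented for the example, and the behaviour described is that of PHP's PCRE engine.

```php
<?php
// Three spellings of the look-behind; the labels are made up for this example.
$patterns = [
    'split'    => '/(?:(?<=\>)|(?<=\/\>))\s+(?=\<\/?)/', // one look-behind per alternative
    'combined' => '/(?<=\>|\/\>)\s+(?=\<\/?)/',          // alternatives of different fixed length
    'single'   => '/(?<=\>)\s+(?=\<)/',                  // only checks for a preceding >
];

// Sample input, made up for this example.
$html = "<p>\n  <br />\n  <b>text</b>\n</p>";

foreach ($patterns as $name => $pattern) {
    // var_export() would print NULL for a pattern that failed to compile.
    echo str_pad($name, 10), var_export(preg_replace($pattern, '', $html), true), "\n";
}
```

Under PCRE all three variants should print the same stripped string, which suggests the simplest form is enough when the code only has to run in PHP.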

(?:      a noncapturing group that contains
(?<=     a positive look-behind with
\>       a >
)        the end of the positive look-behind
|        or
(?<=     a positive look-behind with
\/       a slash, followed by
\>       a >
)        the end of the positive look-behind
)        the end of the noncapturing group
(        a capturing group that contains
\s       whitespace
+        one time or more
)        the end of the capturing group
(?=      a positive look-ahead
\<       a <, followed by
\/       a slash
?        that can occur at most once
)        the end of the positive look-ahead.
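
The breakdown above can also be kept next to the pattern itself by using PCRE's extended mode (the x modifier), in which unescaped whitespace in the pattern is ignored and # starts a comment. This is a sketch rather than the article's original code: the ~ delimiter is chosen here only so that the slash needs no escaping, and the sample markup is made up.

```php
<?php
// The same pattern written in extended mode so each part carries its own note.
$pattern = '~
    (?:             # a noncapturing group that contains
        (?<= > )    #   a positive look-behind for a closing angle bracket
      |             #   or
        (?<= /> )   #   a positive look-behind for a slash and a closing angle bracket
    )
    ( \s+ )         # a capturing group: whitespace, one time or more
    (?= </? )       # a positive look-ahead for an opening angle bracket and an optional slash
~x';

// Sample input, made up for this example.
echo preg_replace($pattern, '', "<div>\n\t<span>hi</span>\n</div>"), "\n";
// prints <div><span>hi</span></div>
```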

  1. #1 by Toby on September 4, 2008 - 3:24 am

    Very useful, but I noticed a typo in the rendering of the code, perhaps because of the wrap around:

    /(?:(?<=\>)|(?<=\/
    \))(\s+)(?=\<\/?)/

    Based on your breakdown of the regex I think it should be:

    /(?:(?<=\>)|(?<=\/\>))(\s+)(?=\<\/?)/

    The > is missing from the colour render.

    Thanks again.

  2. #2 by admin on September 4, 2008 - 5:34 am

    Hi, Toby

    thanks for your comment, yea, i guess my wrapper has that problem, i will change to a better one once i find one. :)

  3. #3 by Andre on October 14, 2008 - 8:00 am

    Hi, i like to steal code very often, but I have Problems now to understand where and how to add the correction of Toby.

    I also get an Error with the Regex used in your Demo:

    Warning: preg_replace() [function.preg-replace]: Compilation failed: missing ) at offset 34 in /home/chris/html/xeloop/root/TEST/form.php on line 12

    Is it possible to add the full working regex again?

    Thanks for your work.

  5. #5 by sr on February 3, 2009 - 2:57 am

    Hi, pls post the full regex again

    Thanks

  6. #6 by Ryan on February 22, 2009 - 3:41 pm

    Full pattern:
    $pattern = ‘/(?:(?)|(?))(\s+)(?=\<\/?)/’;

  7. #7 by Ryan on February 22, 2009 - 3:49 pm

    Whoops. That stripped a lot of text. Try this instead (**note that the comment form would not let me post a bunch of needed characters):

    1. Click the link: http://tinyurl.com/cm2zcq
    2. Check out the pattern in the google search bar

    -Ryan

  8. #8 by Matt Williams on May 28, 2009 - 9:09 am

    I couldn’t get this regex to work for me (even from the tinyurl link Ryan provided). I did however find another alternative on an asp site.
    regex pattern: “\s+<”
    replace with: “<”
    This did the trick for me, so I thought I’d share here too.
    In case posting this does not work, there should be a less than sign after the plus sign and a less than sign in the replace with.

  9. #9 by jim on May 29, 2009 - 7:25 am

    the posted regex in the article left me with nothing…

  10. #10 by R on July 13, 2009 - 5:02 am

    Thanks for the article, and thank you Ryan.

    I can confirm the regular expression Ryan posted works well in removing whitespace from HTML. Very useful correction from the regular expression in the article, as that one removes everything.

  11. #11 by Asem Alhaji on February 11, 2010 - 5:29 am

    Thank you for this nice and beneficial article. Even though the regex didn’t work, “Ryan” saved the situation and your regex worked fine .. thank you guys.
