PHP Regex Remove Whitespace from HTML

There are many ways to optimize an HTML page, and one of them is to remove whitespace. Whitespace between tags in HTML pages is mostly there for readability, so if your site gets a lot of visits, it’s a good idea to consider stripping out things such as extra whitespace in the HTML. A small to medium-sized website can easily save 500 megabytes (MB) to a few gigabytes (GB) of transfer a month just by cleaning whitespace and newline characters out of its HTML.

<html>
<head><title>Removing whitespace from HTML</title></head>
<body>
<form action="<?= htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post">
<input type="text" name="html"
value="<?= htmlspecialchars($_POST['html'] ?? ''); ?>" /><br />
<input type="submit" value="Remove whitespace" /><br /><br />
<?php
if ( $_SERVER['REQUEST_METHOD'] == "POST" )
{
$html = $_POST['html'] ?? '';
// Both look-behinds must be closed: (?<=\>) for > and (?<=\/\>) for />.
$newhtml = preg_replace( "/(?:(?<=\>)|(?<=\/\>))(\s+)(?=\<\/?)/","", $html );
print "<b>Original text was: &nbsp;'". htmlspecialchars($html) .
"'</b><br/>";
print "<b>New text is: &nbsp;'". htmlspecialchars($newhtml) . "'</b><br />";
}
?>
</form>
</body>
</html>
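
The same replacement can also be pulled out of the form and into a reusable function. What follows is a minimal sketch rather than the author's code: the function name strip_tag_whitespace and the sample markup are assumptions made for illustration. It uses the pattern described in the explanation, with both look-behinds closed.

```php
<?php
// Hypothetical helper (not from the original post): removes whitespace
// that sits between the end of one tag and the start of the next.
function strip_tag_whitespace(string $html): string
{
    $pattern = '/(?:(?<=\>)|(?<=\/\>))(\s+)(?=\<\/?)/';
    $result = preg_replace($pattern, '', $html);
    // preg_replace() returns null on failure; fall back to the input.
    return $result ?? $html;
}

// Sample markup, invented for this example.
$before = "<ul>\n  <li>One</li>\n  <li>Two</li>\n</ul>";
$after  = strip_tag_whitespace($before);

echo $after, "\n";
printf("%d bytes before, %d bytes after\n", strlen($before), strlen($after));
```

Comparing strlen() before and after on a few real pages is a simple way to estimate how much transfer the technique would save on a given site.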

Regular Expression Explanation:

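The look-behind group (?:(?<=\>)|(?<=\/\>)) matches the end of an HTML tag, either > or />. Each look-behind is written on its own and the two are wrapped in a noncapturing group because Perl has traditionally rejected variable-length look-behinds, so (?<=\>|\/\>) fails there. PHP’s PCRE engine is more lenient and accepts top-level alternatives of different fixed lengths inside a single look-behind, but the split form, (?:(?<=\>)|(?<=\/\>)), works in both. Note that a tag ending in /> also ends in >, so the first look-behind already covers both cases.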
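
Whether a given look-behind compiles is easy to check on your own PHP install. The sketch below runs three spellings of the look-behind against the same input; the sample string and the array labels are made up for illustration. On PHP builds that use the bundled PCRE library, all three should compile and give the same result, because PCRE accepts top-level alternatives of different fixed lengths inside one look-behind.

```php
<?php
// Sample input, invented for illustration: one normal tag, one self-closing tag.
$html = "<div> <br /> <span>text</span> </div>";

// Three ways to say "preceded by the end of a tag".
$patterns = [
    'split look-behinds' => '/(?:(?<=\>)|(?<=\/\>))\s+(?=\<\/?)/',
    'single alternation' => '/(?<=\>|\/\>)\s+(?=\<\/?)/',
    'only the > case'    => '/(?<=\>)\s+(?=\<\/?)/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    // preg_replace() returns null if the pattern fails to compile.
    $results[$label] = preg_replace($pattern, '', $html);
    echo $label, ': ', var_export($results[$label], true), "\n";
}
```

If one of the lines prints NULL, that spelling was rejected by the regex engine in use.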

(?:     a noncapturing group that contains
(?<=    a positive look-behind with
\>      a >
)       the end of the positive look-behind
|       or
(?<=    a positive look-behind with
\/      a slash, followed by
\>      a >
)       the end of the positive look-behind
)       the end of the noncapturing group
(       a capturing group that contains
\s      whitespace
+       one time or more
)       the end of the capturing group
(?=     a positive look-ahead with
\<      a <, followed by
\/      a slash
?       that can occur at most once
)       the end of the positive look-ahead
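
The breakdown can also be kept next to the pattern itself by using PCRE's extended mode (the x modifier), which ignores whitespace inside the pattern and treats # as the start of a comment. This is a sketch only: the ~ delimiter, the layout, and the sample markup are choices made for this example, not part of the original post.

```php
<?php
// The same pattern in extended (x) mode. The delimiter chosen here must
// not appear anywhere inside the pattern, including its comments.
$pattern = '~
    (?:            # a noncapturing group that contains
        (?<= > )   #   a look-behind for >
      |            #   or
        (?<= /> )  #   a look-behind for a slash followed by >
    )              # end of the noncapturing group
    ( \s+ )        # capture one or more whitespace characters
    (?= </? )      # look-ahead for < with an optional slash
~x';

// Sample markup, invented for this example.
$html = "<p>\n\t<img src=\"a.png\" />\n\t<em>hi</em>\n</p>";
$compact = preg_replace($pattern, '', $html);

echo $compact, "\n";
```

Because / is no longer the delimiter, the slashes and angle brackets need no backslashes, which makes the pattern easier to compare against the table.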

About Shi Chuan

I am a web developer.
This entry was posted in PHP Regex.

16 Responses to PHP Regex Remove Whitespace from HTML

  1. Toby says:

    Very useful, but I noticed a typo in the rendering of the code, perhaps because of the wrap-around:

    /(?:(?)|(?<=\/
    \))(\s+)(?=\<\/?)/

    Based on your breakdown of the regex I think it should be:

    /(?:(?)|(?))(\s+)(?=\ is missing from the colour render.

    Thanks again.

  2. admin says:

    Hi, Toby

    thanks for your comment, yea, i guess my wrapper has that problem, i will change to a better one once i find one. :)

  3. Andre says:

    Hi, i like to steal code very often, but I have Problems now to understand where and how to add the correction of Toby.

    I also get an Error with the Regex used in your Demo:

    Warning: preg_replace() [function.preg-replace]: Compilation failed: missing ) at offset 34 in /home/chris/html/xeloop/root/TEST/form.php on line 12

    Is it possible to add the full working regex again?

    Thanks for your work.

  4. hi
    fde3eu999ecc591t
    good luck

  5. sr says:

    Hi, pls post the full regex again

    Thanks

  6. Ryan says:

    Full pattern:
    $pattern = ‘/(?:(?)|(?))(\s+)(?=\<\/?)/’;

  7. Ryan says:

    Whoops. That stripped a lot of text. Try this instead (**note that the comment form would not let me post a bunch of needed characters):

    1. Click the link: http://tinyurl.com/cm2zcq
    2. Check out the pattern in the google search bar

    -Ryan

  8. I couldn’t get this regex to work for me (even from the tinyurl link Ryan provided). I did, however, find an alternative on an ASP site.
    regex pattern: “\s+<”
    replace with: “<”
    This did the trick for me, so I thought I’d share it here too.
    In case posting this does not work, there should be a less-than sign after the plus sign and a less-than sign in the replace with.

  9. jim says:

    the posted regex in the article left me with nothing…

  10. R says:

    Thanks for the article, and thank you Ryan.

    I can confirm the regular expression Ryan posted works well in removing whitespace from HTML. Very useful correction from the regular expression in the article, as that one removes everything.

  11. Pingback: Scott Bush » Examples of design as a fail-safe measure

  12. Asem Alhaji says:

    Thank you for this nice and beneficial article. Even though the regex didn’t work at first, “Ryan” saved the situation and your regex worked fine. Thank you, guys.

  13. Pingback: A Popurls Clone with PHP, jQuery, Awesomeness | profeshunl newbie

  14. Pingback: Examples of design as a fail-safe measure | Scott Bush

  15. Dyscrete says:

    <?php
    function html_compact($content){
    return trim(
    preg_replace("/\s+</", "\s+/”, “>”,
    preg_replace(“|(>(\s+)

  16. NEX-5 says:

    Worth reading. I found it very instructive, as I have been doing a lot of research lately on practical matters, as people talk about …
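
The simpler pattern suggested in comment 8 can be tried as follows. This is only a sketch, and the sample string is made up. Unlike the article's pattern, it does not check what comes before the whitespace, so whitespace between text and the next tag is removed as well.

```php
<?php
// Sample markup, invented for illustration.
$html = "<p>Hello   </p>\n<p>World</p>";

// Replace any run of whitespace that is followed by < with just <.
$compact = preg_replace('/\s+</', '<', $html);

echo $compact, "\n";
```

The trade-off is simplicity against precision: this version is easier to read and type, but it will also trim a space that separates a word from a following inline tag.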
