There are many ways to optimize an HTML page, and one of them is to remove whitespace. Whitespace between tags in HTML pages is mostly there for readability, so if your site gets a lot of visits, it's a good idea to consider stripping out extras such as the whitespace between tags. A small to medium-sized website can easily save 500 megabytes (MB) to a few gigabytes (GB) of transfer a month just by cleaning whitespace and newline characters out of its HTML.
<html>
<head><title>Removing whitespace from HTML</title></head>
<body>
<form action="<?= htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post">
<input type="text" name="html"
value="<?php print htmlspecialchars(isset($_POST['html']) ? $_POST['html'] : ''); ?>" /><br />
<input type="submit" value="Remove whitespace" /><br /><br />
<?php
if ( $_SERVER['REQUEST_METHOD'] == "POST" )
{
$html = $_POST['html'];
// Remove whitespace that sits between the end of one tag and the start of the next.
$newhtml = preg_replace( "/(?:(?<=\>)|(?<=\/\>))(\s+)(?=\<\/?)/","", $html );
print "<b>Original text was: &nbsp;'". htmlspecialchars($html) .
"'</b><br/>";
print "<b>New text is: &nbsp;'". htmlspecialchars($newhtml) . "'</b><br />";
}
?>
</form>
</body>
</html>
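Regular Expression Explanation:
The look-behind group (?:(?<=\>)|(?<=\/\>)) matches the end of an HTML tag. A single combined look-behind such as (?<=\>|\/\>) is avoided because its alternatives have different lengths: Perl has traditionally rejected variable-length look-behinds, and PCRE (the engine behind PHP's preg functions) requires every alternative inside a look-behind to have a fixed length. Writing each look-behind separately and wrapping the pair in a noncapturing group, as in (?:(?<=\>)|(?<=\/\>)), works in both.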
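To try the pattern outside the form, here is a minimal sketch that wraps it in a function. The name strip_tag_whitespace and the sample markup are made up for illustration and are not part of the original demo; the pattern itself is the one described in this breakdown.

```php
<?php
// Hypothetical helper (not from the original demo): removes whitespace that
// sits between the end of one tag and the start of the next.
function strip_tag_whitespace(string $html): string
{
    // Look-behind: the previous character closes a tag (">" or "/>").
    // Look-ahead: the next character opens a tag ("<" or "</").
    $pattern = '/(?:(?<=\>)|(?<=\/\>))(\s+)(?=\<\/?)/';

    // preg_replace() returns null on error; fall back to the input then.
    return preg_replace($pattern, '', $html) ?? $html;
}

$input  = "<ul>\n  <li>one</li>\n  <li>two words</li>\n</ul>";
$output = strip_tag_whitespace($input);

echo $output, "\n"; // <ul><li>one</li><li>two words</li></ul>
```

Only whitespace that has a tag on both sides is removed, so the space inside "two words" survives.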
(?:     a noncapturing group that contains …
(?<=    a positive look-behind with …
\>      a > …
)       the end of the positive look-behind …
|       or …
(?<=    a positive look-behind with …
\/      a slash, followed by …
\>      a > …
)       the end of the positive look-behind …
)       the end of the noncapturing group …
(       a capturing group that contains …
\s      whitespace …
+       one time or more …
)       the end of the capturing group …
(?=     a positive look-ahead …
\<      a <, followed by …
\/      a slash …
?       that can occur at most once …
)       the end of the positive look-ahead.
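The same breakdown can be kept next to the pattern itself by using PCRE's extended (x) modifier, which ignores whitespace in the pattern and treats # as the start of a comment. This is only a sketch: the variable names and sample markup are invented for illustration, and ~ is used as the delimiter so the slashes need no escaping.

```php
<?php
// Annotated form of the pattern, written with the x (extended) modifier.
$annotated = '~
    (?:             # a noncapturing group that contains ...
        (?<= > )    #   a positive look-behind for ">"
      |             #   or
        (?<= /> )   #   a positive look-behind for "/>"
    )               # the end of the noncapturing group
    ( \s+ )         # a capturing group: whitespace, one time or more
    (?= < /? )      # a positive look-ahead for "<", optionally followed by "/"
~x';

// Compact form, as used in the demo page.
$compact = '/(?:(?<=\>)|(?<=\/\>))(\s+)(?=\<\/?)/';

$sample = "<div>\n\t<br />\n\t<span>kept text</span>\n</div>";

$fromCompact   = preg_replace($compact, '', $sample);
$fromAnnotated = preg_replace($annotated, '', $sample);

echo $fromAnnotated, "\n"; // <div><br /><span>kept text</span></div>
echo $fromCompact === $fromAnnotated ? "same result\n" : "different result\n";
```

Both forms give the same result on this sample; the annotated one is simply easier to maintain.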

#1 by Toby on September 4, 2008 - 3:24 am
Very useful, but I noticed a typo in the rendering of the code, perhaps because of the wrap-around:
/(?:(?)|(?<=\/
\))(\s+)(?=\<\/?)/
Based on your breakdown of the regex I think it should be:
/(?:(?)|(?))(\s+)(?=\ is missing from the colour render.
Thanks again.
#2 by admin on September 4, 2008 - 5:34 am
Hi, Toby
Thanks for your comment. Yeah, I guess my wrapper has that problem; I will change to a better one once I find one. :)
#3 by Andre on October 14, 2008 - 8:00 am
Hi, I like to steal code very often, but I am having trouble understanding where and how to add Toby's correction.
I also get an error with the regex used in your demo:
Warning: preg_replace() [function.preg-replace]: Compilation failed: missing ) at offset 34 in /home/chris/html/xeloop/root/TEST/form.php on line 12
Is it possible to add the full working regex again?
Thanks for your work.
#5 by sr on February 3, 2009 - 2:57 am
Hi, please post the full regex again.
Thanks
#6 by Ryan on February 22, 2009 - 3:41 pm
Full pattern:
$pattern = ‘/(?:(?)|(?))(\s+)(?=\<\/?)/’;
#7 by Ryan on February 22, 2009 - 3:49 pm
Whoops. That stripped a lot of text. Try this instead (**note that the comment form would not let me post a bunch of needed characters):
1. Click the link: http://tinyurl.com/cm2zcq
2. Check out the pattern in the google search bar
-Ryan
#8 by Matt Williams on May 28, 2009 - 9:09 am
I couldn’t get this regex to work for me (even from the tinyurl link Ryan provided). I did, however, find an alternative on an ASP site.
regex pattern: “\s+<”
replace with: “<”
This did the trick for me, so I thought I’d share it here too.
In case posting this does not work: there should be a less-than sign after the plus sign in the pattern, and a less-than sign in the replacement.
#9 by jim on May 29, 2009 - 7:25 am
the posted regex in the article left me with nothing…
#10 by R on July 13, 2009 - 5:02 am
Thanks for the article, and thank you Ryan.
I can confirm the regular expression Ryan posted works well in removing whitespace from HTML. Very useful correction from the regular expression in the article, as that one removes everything.
#11 by Asem Alhaji on February 11, 2010 - 5:29 am
Thank you for this nice and useful article. Even though the regex didn’t work at first, “Ryan” saved the situation and your regex worked fine. Thank you, guys.