On ASP.NET MVC 3, I created an action filter that removes whitespace from the entire HTML output. It works as I expect most of the time, but now I need to change the regex so that it does not touch anything inside a pre element.
I got the regex logic from Mads Kristensen's awesome blog, and I am not sure how to modify it for this purpose.
Here is the logic:
public override void Write(byte[] buffer, int offset, int count) {
    // Decode the current chunk of the response as UTF-8 text.
    string HTML = Encoding.UTF8.GetString(buffer, offset, count);
    Regex reg = new Regex(@"(?<=[^])\t{2,}|(?<=[>])\s{2,}(?=[<])|(?<=[>])\s{2,11}(?=[<])|(?=[\n])\s{2,}");
    // Remove every whitespace run the pattern matches.
    HTML = reg.Replace(HTML, string.Empty);
    // Re-encode the result and write it through Base.
    buffer = System.Text.Encoding.UTF8.GetBytes(HTML);
    this.Base.Write(buffer, 0, buffer.Length);
}
Whole code of the filter:
https://github.com/tugberkugurlu/MvcBloggy/blob/master/src/MvcBloggy.Web/Application/ActionFilters/RemoveWhitespacesAttribute.cs
Any idea?
EDIT:
BIG NOTE:
My intention is not at all to speed up the response time. In fact,
this may slow things down. I GZipped the pages, and this minification
saves only about 4-5 KB per page, which is nothing.
Parsing HTML with regex is very complicated, and any simple solution can break easily. (Use the right tool for the job.) That being said, I'll show a simple solution.
First I simplified the regex you had to:
(?<=\s)\s+
Replace those matches with an empty string to get rid of double spaces everywhere.
Assuming there are no < or > inside the pre tag, you can add (?![^<>]*</pre>) at the end of the expression to make it fail inside of pre tags. This makes sure that </pre> doesn't follow the current match without any tags in between.
Resulting in:
(?<=\s)\s+(?![^<>]*</pre>)
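A minimal sketch of applying that expression in C#, assuming the whole HTML document is already available as one string. The class and method names (WhitespaceOutsidePre, Collapse) are made up for illustration and are not part of the original filter.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class WhitespaceOutsidePre
{
    // Matches whitespace that directly follows another whitespace character,
    // unless the next angle bracket after the match starts a closing </pre> tag.
    private static readonly Regex ExtraWhitespace =
        new Regex(@"(?<=\s)\s+(?![^<>]*</pre>)");

    // Removes the extra whitespace characters; the first character of each run is kept.
    public static string Collapse(string html)
    {
        return ExtraWhitespace.Replace(html, string.Empty);
    }
}
```

For example, Collapse("<p>a  b</p>\n\n<pre>x   y</pre>") should leave the spaces between x and y alone while reducing the other runs to a single character. As the assumption above implies, whitespace that sits in front of a nested tag inside the pre element (such as a code or b tag) is not protected.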
Please see the very epic "RegEx match open tags except XHTML self-contained tags" question for all the reasons why regular expressions and HTML don't get along.
If you're using the approach above to make the page size smaller, you should definitely look into IIS compression, since most browsers can take advantage of it and it is easier than what you're doing now. Here's how to set it up in IIS 6 and IIS 7:
http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/502ef631-3695-4616-b268-cbe7cf1351ce.mspx?mfr=true
http://technet.microsoft.com/en-us/library/cc771003(WS.10).aspx
Maybe break it up into four steps:
- Extract any matching pre elements using a regex, something simple like "start with <pre>, (anything not </pre>)*, end with </pre>".
- Replace each of those matches with a separate GUID and save a dictionary of GUID -> pre element HTML.
- Take out the whitespace (this won't affect the GUIDs or their placement).
- Iterate through the dictionary you saved in the second step and put the pre elements back in the correct spots.
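The four steps could be sketched in C# roughly as follows. This assumes the complete HTML is in a single string. If the filter's Write method receives the response in several chunks, a pre element could be split across calls, and this sketch would not handle that. The class name, method name and both patterns are illustrative choices, not code from the original filter.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names and patterns are illustrative only.
public static class PrePreservingMinifier
{
    // Step 1: lazy match from an opening <pre ...> tag to the nearest </pre>.
    private static readonly Regex PreElement = new Regex(
        @"<pre\b[^>]*>.*?</pre>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Whitespace that directly follows another whitespace character.
    private static readonly Regex ExtraWhitespace = new Regex(@"(?<=\s)\s+");

    public static string Minify(string html)
    {
        var saved = new Dictionary<string, string>();

        // Step 2: swap each pre element for a GUID and remember the original.
        string withPlaceholders = PreElement.Replace(html, match =>
        {
            // Format "N" gives 32 hex digits with no whitespace or hyphens.
            string key = Guid.NewGuid().ToString("N");
            saved[key] = match.Value;
            return key;
        });

        // Step 3: strip extra whitespace; the GUID placeholders contain none.
        string minified = ExtraWhitespace.Replace(withPlaceholders, string.Empty);

        // Step 4: put every pre element back where its placeholder sits.
        foreach (KeyValuePair<string, string> entry in saved)
        {
            minified = minified.Replace(entry.Key, entry.Value);
        }

        return minified;
    }
}
```

Because the pre elements are removed before the whitespace pass, tags nested inside them do not matter here. The sketch does not try to handle nested pre elements or pre-like text inside comments or scripts.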