I'm building a stress-testing client that hammers servers and analyzes responses using as many threads as the client can muster. I'm constantly finding myself throttled by garbage collection (and/or lack thereof), and in most cases, it comes down to strings that I'm instantiating only to pass them off to a Regex or an Xml parsing routine.
If you decompile the Regex class, you'll see that internally, it uses StringBuilders to do nearly everything, but you can't pass it a string builder; it helpfully dives down into private methods before starting to use them, so extension methods aren't going to solve it either. You're in a similar situation if you want to get an object graph out of the parser in System.Xml.Linq.
This is not a case of pedantic over-optimization-in-advance. I've looked at the Regex replacements inside a StringBuilder question and others. I've also profiled my app to see where the ceilings are coming from, and using Regex.Replace()
now is indeed introducing significant overhead in a method chain where I'm trying to hit a server with millions of requests per hour and examine XML responses for errors and embedded diagnostic codes. I've already gotten rid of just about every other inefficiency that's throttling the throughput, and I've even cut a lot of the Regex overhead out by extending StringBuilder to do wildcard find/replace when I don't need capture groups or backreferences, but it seems to me that someone would have wrapped up a custom StringBuilder (or better yet, Stream) based Regex and Xml parsing utility by now.
Ok, so rant over, but am I going to have to do this myself?
Update: I found a workaround which lowered peak memory consumption from multiple gigabytes to a few hundred megs, so I'm posting it below. I'm not adding it as an answer because a) I generally hate to do that, and b) I still want to find out if someone takes the time to customize StringBuilder to do Regexes (or vice-versa) before I do.
In my case, I could not use XmlReader because the stream I am ingesting contains some invalid binary content in certain elements. In order to parse the XML, I have to empty out those elements. I was previously using a single static compiled Regex instance to do the replace, and this consumed memory like mad (I'm trying to process ~300 10KB docs/sec). The change that drastically reduced consumption was:
- I added the code from this StringBuilder Extensions article on
CodeProject for the handy
IndexOf
method. - I added a (very) crude
WildcardReplace
method that allows one wildcard character (* or ?) per invocation - I replaced the Regex usage with a
WildcardReplace()
call to empty the contents of the offending elements
This is very unpretty and tested only as far as my own purposes required; I would have made it more elegant and powerful, but YAGNI and all that, and I'm in a hurry. Here's the code:
/// <summary>
/// Performs basic wildcard find and replace on a string builder, observing one of two
/// wildcard characters: * matches any number of characters, or ? matches a single character.
/// Operates on only one wildcard per invocation; 2 or more wildcards in <paramref name="find"/>
/// will cause an exception.
/// All characters in <paramref name="replaceWith"/> are treated as literal parts of
/// the replacement text.
/// </summary>
/// <param name="find"></param>
/// <param name="replaceWith"></param>
/// <returns></returns>
public static StringBuilder WildcardReplace(this StringBuilder sb, string find, string replaceWith) {
if (find.Split(new char[] { '*' }).Length > 2 || find.Split(new char[] { '?' }).Length > 2 || (find.Contains("*") && find.Contains("?"))) {
throw new ArgumentException("Only one wildcard is supported, but more than one was supplied.", "find");
}
// are we matching one character, or any number?
bool matchOneCharacter = find.Contains("?");
string[] parts = matchOneCharacter ?
find.Split(new char[] { '?' }, StringSplitOptions.RemoveEmptyEntries)
: find.Split(new char[] { '*' }, StringSplitOptions.RemoveEmptyEntries);
int startItemIdx;
int endItemIdx;
int newStartIdx = 0;
int length;
while ((startItemIdx = sb.IndexOf(parts[0], newStartIdx)) > 0
&& (endItemIdx = sb.IndexOf(parts[1], startItemIdx + parts[0].Length)) > 0) {
length = (endItemIdx + parts[1].Length) - startItemIdx;
newStartIdx = startItemIdx + replaceWith.Length;
// With "?" wildcard, find parameter length should equal the length of its match:
if (matchOneCharacter && length > find.Length)
break;
sb.Remove(startItemIdx, length);
sb.Insert(startItemIdx, replaceWith);
}
return sb;
}
XmlReader is a stream-based XML parser. See http://msdn.microsoft.com/en-us/library/756wd7zs.aspx
Here try this. Everything's char based and relatively low level for efficiency. Any number of your
*
s or?
s can be used. However, your*
is now✪
and your?
is now★
. Around three days of work went into this to make it as clean as possible. You can even enter multiple queries on one sweep!Example usage:
wildcard(new StringBuilder("Hello and welcome"), "hello✪w★l", "be")
results in "become".The Mono project has switched the license for their core libraries to an MIT X11 license. If you need to create a regex library customized for performance in your particular application, you should be able to start with the latest code from Mono's implementation of the System library.