I want to add regular expression search capability to my public web page. Other than HTML encoding the output, do I need to do anything to guard against malicious user input?
Google searches are swamped by people solving the converse problem (using regular expressions to detect malicious input), which I'm not interested in. In my scenario, the user input is a regular expression.
I'll be using the Regex library in .NET (C#).
You'll want to read this paper:
Insecure Context Switching: Inoculating regular expressions for survivability

The paper is more about what can go wrong with regular expression engines (e.g. PCRE), but it may help you understand what you're up against.
Yes.
Regexes can be used to perform DoS attacks.
There is no simple solution.
Adding to tchrist's excellent answer: the same Russ Cox who wrote the "Regular Expression" page has also released code! re2 is a C++ library that guarantees matching in time linear in the size of the input (no backtracking blow-ups) and offers a configurable memory-use limit. It's used within Google to power Google Code Search, so it has been battle-tested.
A good way to test your regexes for security issues (at least on Windows) is the SDL Regex Fuzzer tool that Microsoft released recently. It can help you avoid pathologically bad regex constructions.
Denial‐of‐Service Concerns
The most common concern with regexes is a denial-of-service attack through pathological patterns that go exponential (or even super-exponential!) and so appear to take forever to solve. These may show up only on particular input data, but one can generally craft a pattern for which this doesn't matter.
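To make this concrete, here is the textbook "evil regex" case expressed in C#, the asker's platform (the pattern and input are the standard illustration, not something from the answer above):

    using System;
    using System.Text.RegularExpressions;

    class EvilRegexDemo
    {
        static void Main()
        {
            // (a+)+$ is a classic pathological pattern: the nested quantifiers
            // give the backtracking engine roughly 2^n ways to partition the
            // 'a's once the trailing '!' forces the overall match to fail.
            string pattern = @"^(a+)+$";
            string input = new string('a', 30) + "!";

            // On a backtracking engine this takes on the order of seconds or
            // worse; each additional 'a' roughly doubles the running time.
            Console.WriteLine(Regex.IsMatch(input, pattern));
        }
    }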
Which patterns these are will depend somewhat on how smart the regex compiler you're using happens to be, because some of them can be detected at compile time. Regex compilers that implement recursion usually have a built-in recursion-depth counter for checking non-progression.
Russ Cox's excellent 2007 paper, Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...), shows how most modern NFA implementations, which all seem to derive from Henry Spencer's code, suffer severe performance degradation, while a Thompson-style NFA has no such problems.
If you only admit patterns that can be solved by DFAs, you can compile them up as such, and they will run faster, possibly much faster. However, it takes time to do this. The Cox paper mentions this approach and its attendant issues. It all comes down to a classic time–space trade‐off.
With a DFA, you spend more time building it (and allocating more states), whereas with an NFA you spend more time executing it, since it can be in multiple states at the same time, and backtracking can eat your lunch (and your CPU).
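As a .NET-specific aside: recent runtimes expose this very trade-off. If you can require .NET 7 or later, RegexOptions.NonBacktracking selects a linear-time engine at the cost of features such as backreferences and lookarounds. A minimal sketch, assuming a modern runtime (the null-on-failure policy is just one possible choice):

    using System;
    using System.Text.RegularExpressions;

    static class LinearEngine
    {
        // Try to compile the user's pattern with the linear-time engine;
        // return null if the pattern is invalid or uses unsupported features.
        public static Regex? TryCompile(string userPattern)
        {
            try
            {
                // NonBacktracking guarantees matching time linear in the
                // input, so pathological patterns cannot blow up at match time.
                return new Regex(userPattern, RegexOptions.NonBacktracking);
            }
            catch (NotSupportedException)
            {
                // Construct not supported by the linear engine
                // (e.g. a backreference or lookaround).
                return null;
            }
            catch (ArgumentException)
            {
                // Not a valid regular expression at all.
                return null;
            }
        }
    }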
Denial‐of‐Service Solutions
Probably the most reasonable way to address these patterns that are on the losing end of a race with the heat‐death of the universe is to wrap them with a timer that effectively places a maximum amount of time allowed for their execution. Usually this will be much, much less than the default timeout that most HTTP servers provide.
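On the asker's platform this is built in: starting with .NET 4.5, the Regex class accepts a per-match timeout and throws RegexMatchTimeoutException when it is exceeded. A minimal sketch (the 100 ms budget is an arbitrary choice):

    using System;
    using System.Text.RegularExpressions;

    static class SafeMatcher
    {
        public static bool SafeIsMatch(string input, string userPattern)
        {
            try
            {
                // The third constructor argument is the match timeout; the
                // engine checks it periodically and aborts runaway backtracking.
                var re = new Regex(userPattern, RegexOptions.None,
                                   TimeSpan.FromMilliseconds(100));
                return re.IsMatch(input);
            }
            catch (RegexMatchTimeoutException)
            {
                // Treat a timeout as "no match", or surface an error instead.
                return false;
            }
            catch (ArgumentException)
            {
                // The user-supplied pattern didn't even compile.
                return false;
            }
        }
    }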
There are various ways to implement these, ranging from a simple alarm(N) at the C level, to some sort of try {} block that catches alarm-type exceptions, all the way to spawning off a new thread that's specially created with a timing constraint built right into it.

Code Callouts
In regex languages that admit code callouts, some mechanism for allowing or disallowing these from the string you’re going to compile should be provided. Even if code callouts are only to code in the language you are using, you should restrict them; they don’t have to be able to call external code, although if they can, you’ve got much bigger problems.
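(.NET's Regex has no code callouts at all, so this mostly matters if you embed another engine, e.g. PCRE with its (?C...) callouts or Perl with (?{...}).) A purely illustrative pre-compilation screen in C# might look like the following; the denylist is a hypothetical example, not an exhaustive or authoritative filter:

    static class PatternScreen
    {
        // Hypothetical denylist of code-callout syntax from other engines
        // (Perl's (?{...}) / (??{...}) and PCRE's (?C...) callouts).
        static readonly string[] ForbiddenTokens = { "(?{", "(??{", "(?C" };

        // Reject a user-supplied pattern before it ever reaches the engine.
        public static bool LooksSafeToCompile(string userPattern)
        {
            foreach (var token in ForbiddenTokens)
            {
                if (userPattern.Contains(token))
                    return false;
            }
            return true;
        }
    }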
For example, in Perl one cannot have code callouts in regexes created from string interpolation (as these would be, since they're compiled at run-time) unless the special lexically-scoped pragma use re "eval"; is active in the current scope. That way nobody can sneak in a code callout to run system programs like rm -rf *, for example. Because code callouts are so security-sensitive, Perl disables them by default on all interpolated strings, and you have to go out of your way to re-enable them.

User-Defined \P{roperties}
There remains one more security-sensitive issue related to Unicode-style properties, like \pM, \p{Pd}, \p{Pattern_Syntax}, or \p{Script=Greek}, that may exist in some regex compilers that support that notation.

The issue is that in some of these, the set of possible properties is user-extensible. That means you can have custom properties that are actual code callouts to named functions in some particular namespace, like \p{GoodChars} or \p{Class::Good_Characters}. How your language handles those might be worth looking at.

Sandboxing
In Perl, a sandboxed compartment via the Safe module would give control over namespace visibility. Other languages offer similar sandboxing technologies. If such devices are available, you might want to look into them, because they are specifically designed for limited execution of untrusted code.

You have to worry not only about the matching itself but also about how you do the matching. For example, if your input goes through some sort of eval phase or command substitution on its way to the regular expression engine, there could be code that gets executed inside the pattern. Or, if your regular expression syntax allows for embedded commands, you have to be wary of that, too. Since you didn't specify the language in your question, it's hard to say for sure what all the security implications are.