I tried to split a line based on spaces not enclosed between double quotes.
My regex is
(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+
My code:
Pattern regex = Pattern.compile("(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+");
Matcher regexMatcher = regex.matcher(line);
List<String> rule = new ArrayList<String>();
while (regexMatcher.find()) {
    rule.add(regexMatcher.group());
}
Input for which it fails:
SecRule REQUEST_COOKIES|!REQUEST_COOKIES:/__utm/|REQUEST_COOKIES_NAMES|ARGS_NAMES|ARGS|XML:/* "(?i:\b(?:(?:s(?:t(?:d(?:dev(_pop|_samp)?)?|r(?:_to_date|cmp))|u(?:b(?:str(?:ing(_index)?)?|(?:dat|tim)e)|m)|e(?:c(?:_to_time|ond)|ssion_user)|ys(?:tem_user|date)|ha(1|2)?|oundex|chema|ig?n|pace|qrt)|i(?:s(null|_(free_lock|ipv4_compat|ipv4_mapped|ipv4|ipv6|not_null|not|null|used_lock))?|n(?:et6?_(aton|ntoa)|s(?:ert|tr)|terval)?|f(null)?)|u(?:n(?:compress(?:ed_length)?|ix_timestamp|hex)|tc_(date|time|timestamp)|p(?:datexml|per)|uid(_short)?|case|ser)|l(?:o(?:ca(?:l(timestamp)?|te)|g(2|10)?|ad_file|wer)|ast(_day|_insert_id)?|e(?:(?:as|f)t|ngth)|case|trim|pad|n)|t(?:ime(stamp|stampadd|stampdiff|diff|_format|_to_sec)?|o_(base64|days|seconds|n?char)|r(?:uncate|im)|an)|m(?:a(?:ke(?:_set|date)|ster_pos_wait|x)|i(?:(?:crosecon)?d|n(?:ute)?)|o(?:nth(name)?|d)|d5)|r(?:e(?:p(?:lace|eat)|lease_lock|verse)|o(?:w_count|und)|a(?:dians|nd)|ight|trim|pad)|f(?:i(?:eld(_in_set)?|nd_in_set)|rom_(base64|days|unixtime)|o(?:und_rows|rmat)|loor)|a(?:es_(?:de|en)crypt|s(?:cii(str)?|in)|dd(?:dat|tim)e|(?:co|b)s|tan2?|vg)|p(?:o(?:sition|w(er)?)|eriod_(add|diff)|rocedure_analyse|assword|i)|b(?:i(?:t_(?:length|count|x?or|and)|n(_to_num)?)|enchmark)|e(?:x(?:p(?:ort_set)?|tract(value)?)|nc(?:rypt|ode)|lt)|v(?:a(?:r(?:_(?:sam|po)p|iance)|lues)|ersion)|g(?:r(?:oup_conca|eates)t|et_(format|lock))|o(?:(?:ld_passwo)?rd|ct(et_length)?)|we(?:ek(day|ofyear)?|ight_string)|n(?:o(?:t_in|w)|ame_const|ullif)|(rawton?)?hex(toraw)?|qu(?:arter|ote)|(pg_)?sleep|year(week)?|d?count|xmltype|hour)\W*\(|\b(?:(?:s(?:elect\b(?:.{1,100}?\b(?:(?:length|count|top)\b.{1,100}?\bfrom|from\b.{1,100}?\bwhere)|.*?\b(?:d(?:ump\b.*\bfrom|ata_type)|(?:to_(?:numbe|cha)|inst)r))|p_(?:sqlexec|sp_replwritetovarbin|sp_help|addextendedproc|is_srvrolemember|prepare|sp_password|execute(?:sql)?|makewebtask|oacreate)|ql_(?:longvarchar|variant))|xp_(?:reg(?:re(?:movemultistring|ad)|delete(?:value|key)|enum(?:value|key)s|addmultistring|write)|termina
te|xp_servicecontrol|xp_ntsec_enumdomains|xp_terminate_process|e(?:xecresultset|numdsn)|availablemedia|loginconfig|cmdshell|filelist|dirtree|makecab|ntsec)|u(?:nion\b.{1,100}?\bselect|tl_(?:file|http))|d(?:b(?:a_users|ms_java)|elete\b\W*?\bfrom)|group\b.*\bby\b.{1,100}?\bhaving|open(?:rowset|owa_util|query)|load\b\W*?\bdata\b.*\binfile|(?:n?varcha|tbcreato)r|autonomous_transaction)\b|i(?:n(?:to\b\W*?\b(?:dump|out)file|sert\b\W*?\binto|ner\b\W*?\bjoin)\b|(?:f(?:\b\W*?\(\W*?\bbenchmark|null\b)|snull\b)\W*?\()|print\b\W*?\@\@|cast\b\W*?\()|c(?:(?:ur(?:rent_(?:time(?:stamp)?|date|user)|(?:dat|tim)e)|h(?:ar(?:(?:acter)?_length|set)?|r)|iel(?:ing)?|ast|r32)\W*\(|o(?:(?:n(?:v(?:ert(?:_tz)?)?|cat(?:_ws)?|nection_id)|(?:mpres)?s|ercibility|alesce|t)\W*\(|llation\W*\(a))|d(?:(?:a(?:t(?:e(?:(_(add|format|sub))?|diff)|abase)|y(name|ofmonth|ofweek|ofyear)?)|e(?:(?:s_(de|en)cryp|faul)t|grees|code)|ump)\W*\(|bms_\w+\.\b)|(?:;\W*?\b(?:shutdown|drop)|\@\@version)\b|\butl_inaddr\b|\bsys_context\b|'(?:s(?:qloledb|a)|msdasql|dbo)'))" "phase:2,rev:'2',ver:'OWASP_CRS/2.2.9',maturity:'9',accuracy:'8',capture,t:none,t:urlDecodeUni,ctl:auditLogParts=+E,block,msg:'SQL Injection Attack',id:'950001',tag:'OWASP_CRS/WEB_ATTACK/SQL_INJECTION',tag:'WASCTC/WASC-19',tag:'OWASP_TOP_10/A1',tag:'OWASP_AppSensor/CIE1',tag:'PCI/6.5.2',logdata:'Matched Data: %{TX.0} found within %{MATCHED_VAR_NAME}: %{MATCHED_VAR}',severity:'2',setvar:'tx.msg=%{rule.msg}',setvar:tx.sql_injection_score=+%{tx.critical_anomaly_score},setvar:tx.anomaly_score=+%{tx.critical_anomaly_score},setvar:tx.%{rule.id}-OWASP_CRS/WEB_ATTACK/SQL_INJECTION-%{matched_var_name}=%{tx.0}
When I used this in Java, some lines were split successfully, but other lines caused this error:
Exception in thread "main" java.lang.StackOverflowError
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4235)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
at java.util.regex.Pattern$Loop.match(Pattern.java:4312)
at java.util.regex.Pattern$GroupTail.match(Pattern.java:4244)
at java.util.regex.Pattern$BranchConn.match(Pattern.java:4095)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3362)
at java.util.regex.Pattern$Branch.match(Pattern.java:4131)
at java.util.regex.Pattern$GroupHead.match(Pattern.java:4185)
Sample Input:
The "world \"is beautiful" but i "cannot see" it
Expected Output:
The
"world \"is beautiful"
but
i
"cannot see"
it
Why does StackOverflowError occur?
In the reference implementation of the Pattern class (which comes with Oracle's JRE, OpenJDK, and a number of other JVMs), greedy and lazy quantifiers are implemented with recursion [1] when the repeated pattern is non-trivial. Therefore, you will run into StackOverflowError when the input string is long enough.
[1] Recursion is a quick but not scalable way to allow backtracking in the pattern. A better implementation uses a data structure to store backtracking points (which basically converts the recursive solution into an iterative solution with a stack).
Solution
The following regex should work:
The regex is quite confusing because of the two layers of escaping: escaping in the Java string literal and escaping in the regex syntax.
Printing the string out shows the raw regex. My explanation is based on the raw regex.
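Putting together the two alternatives and the outermost possessive quantifier described in the explanation gives the regex sketched below. This is assembled from that description, so treat it as an illustration rather than the exact original code; the class name QuotedSplit, the method tokenize, and the constant name TOKEN are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedSplit {
    // Raw regex: (?:"(?:[^"\\]++|\\.)*+"|[^ "]++)++
    // Every \ and " of the raw regex is escaped once more for the Java string literal.
    static final Pattern TOKEN =
            Pattern.compile("(?:\"(?:[^\"\\\\]++|\\\\.)*+\"|[^ \"]++)++");

    // Collects every match of TOKEN: the space-separated pieces of the line,
    // with quoted strings kept whole. (Hypothetical helper name.)
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Printing the pattern shows the raw regex without Java escaping.
        System.out.println(TOKEN.pattern());
        // The sample input from the question, written as a Java literal.
        String line = "The \"world \\\"is beautiful\" but i \"cannot see\" it";
        for (String token : tokenize(line)) {
            System.out.println(token);
        }
    }
}
```

For the sample input from the question, tokenize returns six tokens: The, "world \"is beautiful", but, i, "cannot see", and it.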
Explanation
Since you only care about what the whole regex matches, all the capturing groups (pattern) have been turned into non-capturing groups (?:pattern) for efficiency.
The first alternative "(?:[^"\\]++|\\.)*+" matches a quoted string.
The second alternative [^ "]++ matches a sequence of characters that contains neither a space nor a double quote ".
Since the regex is written so that there is no need for backtracking, all quantifiers are made possessive. Since the Pattern class implements possessive quantifiers with a loop, instead of recursion as is the case with greedy/lazy quantifiers, StackOverflowError will not occur.
I remove the need for backtracking by writing the regex so that it matches the correct string on the first try:
Since [^"\\] excludes both \ and ", there is no way we can "steal" a \ from an escape sequence, or "steal" a " and mess up the closing quote, so we can safely advance without backtracking. This explains the possessive quantifier in [^"\\]++. There is no need for a quantifier here, but I add one to reduce the work on the branching.
Since both [^"\\]++ and \\. can't "steal" a " and mess up the closing quote, we can advance without backtracking. This explains the possessive quantifier in (?:[^"\\]++|\\.)*+
[^ "] can't start a quoted string, and it also can't match a space (the delimiter). This is why we can use a possessive quantifier on it.
Since "(?:[^"\\]++|\\.)*+" and [^ "]++ can't mess up each other's match, we can make the outermost quantifier possessive.
An example of a regex that doesn't match things correctly on the first try and only gets the correct result after backtracking would be ^([bcd]+:[ab]+)+$ against inputs such as b:ab:a. The first iteration will match b:ab, which causes the 2nd iteration to fail; the engine then backtracks, retries with the first iteration being b:a, and successfully matches the whole string.
Your regex is broken:
At this point I stopped looking further, because I am sure this is not what you want.
By the way, I recommend writing the regex first and only then doing the quoting (you could write yourself a tool that does this, it is purely mechanical: add one \ before every " and every \, then enclose in ""). Also, don't use character classes for single characters.
In fact, it appears what you're looking for are words, or strings. So why don't you say just that?
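The mechanical quoting step mentioned above (one \ before every " and every \, then wrap in "") could be sketched as a small helper like this. The class name RegexQuoter and the method name toJavaLiteral are made up for illustration, and the sketch only handles the two characters named in the rule, not newlines or other characters that would also need escaping in Java source.

```java
class RegexQuoter {
    // Turns a raw regex into the text of a Java string literal:
    // add one \ before every " and every \, then enclose the result in "".
    static String toJavaLiteral(String rawRegex) {
        StringBuilder out = new StringBuilder("\"");
        for (int i = 0; i < rawRegex.length(); i++) {
            char c = rawRegex.charAt(i);
            if (c == '"' || c == '\\') {
                out.append('\\'); // extra escape needed inside Java source
            }
            out.append(c);
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        // Pass the raw regex as the first argument; the printed text can be
        // pasted into Java source as a string literal.
        if (args.length > 0) {
            System.out.println(toJavaLiteral(args[0]));
        }
    }
}
```

With such a helper the regex can be written and discussed in its raw form, and the Java-escaped form is generated only at the end.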
You can use a top down approach:
Now:
You can test important sub-regexes separately, for example the STRING. It turned out I had several errors in my first version, and this even when writing unquoted! Writing or discussing such a regex in the form Java demands from the start is virtually impossible.
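A minimal sketch of that top-down idea, under assumptions: each sub-regex gets a name, the whole is composed from the parts, and a part such as STRING is tested on its own. Only the name STRING comes from the text above; WORD, TOKEN, the class name, and the exact sub-regexes are hypothetical, and possessive quantifiers are used as suggested in the other answer.

```java
import java.util.regex.Pattern;

class TopDownRegex {
    // A quoted string: ", then runs of non-quote/non-backslash characters or
    // backslash escapes, then ". Raw form: "(?:[^"\\]++|\\.)*+"
    static final String STRING = "\"(?:[^\"\\\\]++|\\\\.)*+\"";
    // A word: a run of characters that are neither whitespace nor a double quote.
    static final String WORD = "[^\\s\"]++";
    // A token is either a STRING or a WORD.
    static final Pattern TOKEN = Pattern.compile(STRING + "|" + WORD);

    public static void main(String[] args) {
        // Sub-regexes can be checked in isolation before they are composed.
        System.out.println(Pattern.matches(STRING, "\"cannot see\""));   // true
        System.out.println(Pattern.matches(STRING, "\"unterminated"));   // false
        System.out.println(Pattern.matches(WORD, "beautiful"));          // true
    }
}
```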
Thanks for all your responses. At last I found my mistake. The actual reason for the stack overflow was not my regex; my regex was correct. I used Eclipse for coding, and the actual reason for the stack overflow was my stack size. Initially my stack size was 1 MB. I increased the stack size for the program in Eclipse and there was no error.
Java stack overflow error - how to increase the stack size in Eclipse?
UPDATE:
There is no need to change the stack size. As nhahtdh suggested, I changed the regex to use possessive quantifiers, and the StackOverflowError no longer occurs.
My regex is now:
("([^\\"]|\\.)++"|[^\s]++)
To learn more about possessive quantifiers, follow this link.
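As a rough check of the update above, the sketch below runs the updated regex against one very long quoted token and, for comparison, tries the greedy regex from the question on the same input. The class, method, and constant names are hypothetical, and the updated regex is assumed to be the raw form shown above, escaped here for a Java literal. Whether the greedy version actually throws StackOverflowError depends on the JVM and its stack size, so that branch only reports what happened.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongInputCheck {
    // Greedy regex, copied from the Java code in the question.
    static final Pattern GREEDY =
            Pattern.compile("(([\"]([^\\\"]|\\.)+[\"]|[^ ]+))+");
    // Updated regex with possessive quantifiers. Raw: ("([^\\"]|\\.)++"|[^\s]++)
    static final Pattern POSSESSIVE =
            Pattern.compile("(\"([^\\\\\"]|\\\\.)++\"|[^\\s]++)");

    // Builds a quoted string with n letters inside, e.g. "aaaa" for n = 4.
    static String longQuoted(int n) {
        StringBuilder sb = new StringBuilder(n + 2).append('"');
        for (int i = 0; i < n; i++) {
            sb.append('a');
        }
        return sb.append('"').toString();
    }

    // True if the first match found by the pattern is the whole input.
    static boolean matchesWhole(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() && m.group().equals(input);
    }

    public static void main(String[] args) {
        String input = longQuoted(200_000);
        System.out.println("possessive: " + matchesWhole(POSSESSIVE, input));
        try {
            System.out.println("greedy: " + matchesWhole(GREEDY, input));
        } catch (StackOverflowError e) {
            // Expected on typical default stack sizes, but not guaranteed.
            System.out.println("greedy: StackOverflowError");
        }
    }
}
```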
The first thing to try is to increase the stack size.
If that does not work, you might have hit a bug. You could try a different JVM and set up the JVM to use something other than OpenJDK for the class library, and tinker with the regex to see exactly what is triggering it.