RegEx to split a string on a delimiter (semicolon)

Posted 2019-05-26 14:21

I have a Java String which is actually an SQL script.

CREATE OR REPLACE PROCEDURE Proc
   AS
        b NUMBER:=3;
        c VARCHAR2(2000);
    begin
        c := 'BEGIN ' || ' :1 := :1 + :2; ' || 'END;';
   end Proc;

I want to split the script on semicolons, except those that appear inside a string literal. The desired output is the four strings below:

1- CREATE OR REPLACE PROCEDURE Proc AS b NUMBER:=3
2- c VARCHAR2(2000)
3- begin c := 'BEGIN ' || ' :1 := :1 + :2; ' || 'END;';
4- end Proc

Java's String.split() method also splits the string below into tokens. I want to keep it intact, because its semicolons are inside quotes.

c := 'BEGIN ' || ' :1 := :1 + :2; ' || 'END;';

Output of Java's String.split() method:

1- c := 'BEGIN ' || ' :1 := :1 + :2
2- ' || 'END
3- '

Please suggest a regex that splits the string on semicolons, except those that appear inside a string literal.

===================== CASE-2 ========================

The section above has been answered, and the solution works.

Here is another, more complex case.

======================================================

I have an SQL script and I want to tokenize each SQL query. Queries are separated by either a semicolon (;) or a forward slash (/).

1- The split must ignore a semicolon or a slash that appears inside a string literal, like

...WHERE col1 = 'some ; name/' ..

2- The expression must also ignore the multi-line comment syntax, which starts with /*

Here is the input

/*Query 1*/
SELECT
*
FROM  tab t
WHERE (t.col1 in (1, 3)
            and t.col2 IN (1,5,8,9,10,11,20,21,
                                     22,23,24,/*Reaffirmed*/
                                     25,26,27,28,29,30,
                                     35,/*carnival*/
                                     75,76,77,78,79,
                                     80,81,82, /*Damark accounts*/
                                     84,85,87,88,90))
;
/*Query 2*/    
select * from table
/
/*Query 3*/
select col form tab2
;
/*Query 4*/
select col2 from tab3 /*this is a multi line comment*/
/

Desired Result

[1]: /*Query 1*/
    SELECT
    *
    FROM  tab t
    WHERE (t.col1 in (1, 3)
                and t.col2 IN (1,5,8,9,10,11,20,21,
                                         22,23,24,/*Reaffirmed*/
                                         25,26,27,28,29,30,
                                         35,/*carnival*/
                                         75,76,77,78,79,
                                         80,81,82, /*Damark accounts*/
                                         84,85,87,88,90))

[2]:/*Query 2*/    
    select * from table

[3]: /*Query 3*/
    select col form tab2

[4]:/*Query 4*/
    select col2 from tab3 /*this is a multi line comment*/

Half of this can already be achieved with what was suggested to me in the previous post. However, once comment syntax (/*) is introduced into the queries, and queries can also be separated by a forward slash (/), that expression no longer works.

3 answers
女痞
#2 · 2019-05-26 14:41

One thing you might try is to split on ";" first. Then, for each resulting piece, if it contains an odd number of single quotes ('), concatenate it with the following pieces, adding the ";" back between them, until the combined string contains an even number of single quotes.
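
The split-and-rejoin idea above could be sketched in Java roughly as follows. The class and method names (QuoteAwareJoiner, splitStatements) are made up for illustration and are not part of the answer. The sketch assumes single quotes appear only as string-literal delimiters (a doubled '' keeps the count even, so it is harmless) and that comments contain no quotes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating the approach; names are assumptions.
class QuoteAwareJoiner {

    // Counts the single-quote characters in s.
    static int countQuotes(CharSequence s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '\'') {
                n++;
            }
        }
        return n;
    }

    static List<String> splitStatements(String sql) {
        List<String> statements = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideLiteral = false;
        // limit -1 keeps trailing empty pieces, so no ';' is lost when re-joining
        for (String piece : sql.split(";", -1)) {
            if (insideLiteral) {
                current.append(';'); // this ';' was inside a string literal
            }
            current.append(piece);
            // an odd number of quotes means a literal is still open
            insideLiteral = countQuotes(current) % 2 == 1;
            if (!insideLiteral) {
                String statement = current.toString().trim();
                if (!statement.isEmpty()) {
                    statements.add(statement);
                }
                current.setLength(0);
            }
        }
        if (insideLiteral) {
            // unterminated literal at end of input: keep the remainder as-is
            statements.add(current.toString().trim());
        }
        return statements;
    }

    public static void main(String[] args) {
        String sql = "b NUMBER:=3; c := 'BEGIN ' || ' :1 := :1 + :2; ' || 'END;'; end Proc;";
        for (String statement : splitStatements(sql)) {
            System.out.println("[" + statement + "]");
        }
    }
}
```

Re-counting the quotes of the accumulated text on every piece is quadratic in the worst case; for long scripts, toggling a flag per piece would do the same job.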

爷的心禁止访问
#3 · 2019-05-26 14:55

The regular expression pattern ((?:(?:'[^']*')|[^;])*); should give you what you need. Use a while loop and Matcher.find() to extract all the SQL statements. Something like:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// s holds the SQL script to split
Pattern p = Pattern.compile("((?:(?:'[^']*')|[^;])*);");
Matcher m = p.matcher(s);
int cnt = 0;
while (m.find()) {
    System.out.println(++cnt + ": " + m.group(1));
}

Using the sample SQL you provided, this will output:

1: CREATE OR REPLACE PROCEDURE Proc
   AS
        b NUMBER:=3
2: 
        c VARCHAR2(2000)
3: 
    begin
        c := 'BEGIN ' || ' :1 := :1 + :2; ' || 'END;'
4: 
   end Proc

If you want to get the terminating ;, use m.group(0) instead of m.group(1).

For more information on regular expressions, see the Pattern JavaDoc and this great reference. Here's a synopsis of the pattern:

(              Start capturing group
  (?:          Start non-capturing group
    (?:        Start non-capturing group
      '        Match the literal character '
      [^']     Match a single character that is not '
      *        Greedily match the previous atom zero or more times
      '        Match the literal character '
    )          End non-capturing group
    |          Match either the previous or the next atom
    [^;]       Match a single character that is not ;
  )            End non-capturing group
  *            Greedily match the previous atom zero or more times
)              End capturing group
;              Match the literal character ;
兄弟一词,经得起流年.
#4 · 2019-05-26 15:02

I was having the same issue. I saw previous recommendations and decided to improve handling for:

  • Comments
  • Escaped single quotes
  • Single queries not terminated by a semicolon

My solution is written for Java. Some things, such as backslash escaping and DOTALL mode, may differ from one language to another.

This worked for me (written as a Java string literal): "(?s)\\s*((?:'(?:\\\\.|[^\\\\']|'')*'|/\\*.*?\\*/|(?:--|#)[^\r\n]*|[^\\\\'])*?)(?:;|$)"

"
(?s)                 DOTALL mode. Means the dot includes \r\n
\\s*                 Initial whitespace
(
    (?:              Grouping content of a valid query
        '            Open string literal
        (?:          Grouping content of a string literal expression
            \\\\.    Any escaped character. Doesn't matter if it's a single quote
        |
            [^\\\\'] Any character which isn't escaped. Escaping is covered above.
        |
            ''       Escaped single quote
        )            Any of these regexps are valid in a string literal.
        *            The string can be empty 
        '            Close string literal
    |
        /\\*         C-style comment start
        .*?          Any characters, but as few as possible (doesn't include */)
        \\*/         C-style comment end
    |
        (?:--|#)     SQL comment start
        [^\r\n]*     One line comment which ends with a newline
    |
        [^\\\\']     Anything which doesn't have to do with a string literal
    )                These four tokens basically define the contents of a query
    *?               Avoid greediness of above tokens to match the end of a query
)
(?:;|$)              After a series of query tokens, find ; or EOT
"
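
To make the breakdown above concrete, here is a minimal sketch that feeds the pattern to Matcher.find() and collects the statements. The class name SqlStatementScanner and the method scan are hypothetical. Trimming the results and skipping empty matches are additions of this sketch, needed because the (?:;|$) terminator can also produce an empty match at the end of the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern described in this answer.
class SqlStatementScanner {

    // Same pattern as in the breakdown, written as one Java string literal.
    static final Pattern STATEMENT = Pattern.compile(
        "(?s)\\s*((?:'(?:\\\\.|[^\\\\']|'')*'|/\\*.*?\\*/|(?:--|#)[^\r\n]*|[^\\\\'])*?)(?:;|$)");

    static List<String> scan(String sql) {
        List<String> statements = new ArrayList<>();
        Matcher m = STATEMENT.matcher(sql);
        while (m.find()) {
            // group(1) is the statement body without the terminating ';'
            String statement = m.group(1).trim();
            if (!statement.isEmpty()) {
                statements.add(statement);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        // ';' appears inside a string, a block comment and a line comment
        String sql = "a := 'x;y'; /* c;d */ b; -- e;f\n c";
        for (String statement : scan(sql)) {
            System.out.println("[" + statement + "]");
        }
    }
}
```

With this input the sketch yields three statements; the semicolons inside the literal and the two comments do not split anything.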

As for your second case, note that the last part of the regexp defines how a statement is terminated. Right now it only accepts a semicolon or end of text, but you can add whatever you want to the terminator. For example, (?:;|@|/|$) accepts an at sign and a slash as terminating characters. I haven't tested this for your case, but it shouldn't be hard to adapt.
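
A sketch of that adaptation for the second case, with ; and / as terminators, might look like the following. One detail here is an assumption of this sketch and not part of the answer: the slash terminator is written as /(?!\*). The statement body is matched lazily, so the terminator is tried before the comment alternative, and a bare / would, as far as I can tell, end the statement at the / that opens a /* comment. The class name SqlScriptTokenizer is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer for scripts terminated by ';' or '/'.
class SqlScriptTokenizer {

    // Body as in the answer; terminator extended with a slash that does
    // not start a /* comment (the lookahead is this sketch's assumption).
    static final Pattern STATEMENT = Pattern.compile(
        "(?s)\\s*((?:'(?:\\\\.|[^\\\\']|'')*'|/\\*.*?\\*/|(?:--|#)[^\r\n]*|[^\\\\'])*?)"
        + "(?:;|/(?!\\*)|$)");

    static List<String> tokenize(String script) {
        List<String> queries = new ArrayList<>();
        Matcher m = STATEMENT.matcher(script);
        while (m.find()) {
            String query = m.group(1).trim();
            if (!query.isEmpty()) { // skip empty matches at the end of the text
                queries.add(query);
            }
        }
        return queries;
    }

    public static void main(String[] args) {
        String script =
            "/*Q1*/\nselect a /*x;y*/ from t where c = 'some ; name/'\n;\n"
            + "/*Q2*/\nselect * from t2\n/\n";
        List<String> queries = tokenize(script);
        for (int i = 0; i < queries.size(); i++) {
            System.out.println("[" + (i + 1) + "]: " + queries.get(i));
        }
    }
}
```

Note that a / used as a division operator outside strings and comments would still be treated as a terminator; handling that would need a stricter rule, such as requiring the slash to stand alone on its line.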
