python multiline regex to get optional group in pa

2019-07-10 11:10发布

问题:

I am struggling to write a regex to basically get 4 groups for both tables

  1. Table Name e.g. table_v1
  2. Table columns in first () after table name
  3. Primary key values in ()
  4. Optionally Value in () if CLUSTERING ORDER is there

I tried this, mostly works except cannot get cluster order value .

EDIT: HERE IS A FAILING DEMO

re.compile("CREATE\s+TABLE\s+(?:[a-z][a-z0-9_]*).*?((?:[a-z][a-z0-9_"]*)).*?(\(.*?\)) WITH.*?(\(.*?\)).*?;").findall(string_below)

Here is the String trying to run above regex on.

CREATE TABLE abcdeg.table_v1 (
    "id" text,
    "obj" text,
    "version" bigint,
    output text,
    server text,
    PRIMARY KEY ("id", "obj", "version")
) WITH CLUSTERING ORDER BY ("id" ASC, "version" DESC)
    AND bloom_filter_fp_chance = 0.1
    AND comment = ''
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

CREATE TABLE abcdeg.result_v1 (
    "id" text,
    "obj" text,
    time int,
    PRIMARY KEY (("id", "obj"))
) WITH bloom_filter_fp_chance = 0.1
    AND comment = ''
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND speculative_retry = '99.0PERCENTILE';

CREATE TABLE abcdeg.result_v2 (
    "id" text PRIMARY KEY,
    "obj" text,
    time int
) WITH bloom_filter_fp_chance = 0.1
    AND comment = ''
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND speculative_retry = '99.0PERCENTILE';

回答1:

I can't test this in python, but it works in the demo. Captures table name into group 1, fields into group 2, primary keys (even with double parantheses) into group3, and if Clustering Order is present, into group 5

CREATE\sTABLE\s.+?\.(\w+)\s\(\n?(.+?)\n\s*PRIMARY\sKEY\s\(+([^)]*)\)+.*?(CLUSTERING\sORDER\sBY\s\(([^)]+)\)|;)

Demo



回答2:

I could write a regex that handles these two SQL-s properly. However, this solution is brittle and hard to understand. It's brittle because there can be other SQL constructs that we haven't considered yet. And it's hard to understand, because, you know, just look at this:

CREATE\s+TABLE[^.]*\.(\S*)\s*\((.*?)PRIMARY KEY \(?\(([^)]*)\)\)?\s*\)\s*(?:WITH CLUSTERING ORDER BY \(([^)]*)\))?

The trick was to say what you mean. Instead of lazy quantifiers I used negated character classes: \([^)]\). This way we can retrieve the contents of a parentheses effectively and without unexpected matches.

Demo here.

I suggest to take a different approach. You can use an SQL parser. SQLParse looks promising. I haven't worked with it though.