How to parse big string U-SQL Regex

2019-07-14 09:53发布

问题:

I have got a big CSVs that contain big strings. I wanna parse them in U-SQL.

@t1 = 
SELECT
    Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)") AS p
FROM
    (VALUES(1)) AS fe(n);

@t2 = 
SELECT
    p.Groups["ID"].Value AS gads_id,
    p.Groups["T"].Value AS gads_t,
    p.Groups["S"].Value AS gads_s
FROM
    @t1;

OUTPUT @t
TO "/inhabit/test.csv"
USING Outputters.Csv();

Severity Code Description Project File Line Suppression State Error E_CSC_USER_INVALIDCOLUMNTYPE: 'System.Text.RegularExpressions.Match' cannot be used as column type.

I know how to do it in a SQL way with EXPLODE/CROSS APPLY/GROUP BY. But may be it is possible to do without these dances?

One more update

@t1 = 
SELECT
    Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["ID"].Value AS id,
    Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["T"].Value AS t,
    Regex.Match("ID=881cf2f5f474579a:T=1489536183:S=ALNI_MZsMMpA4voGE4kQMYxooceW2AOr0Q", "ID=(?<ID>\\w+):T=(?<T>\\w+):S=(?<S>[\\w\\d_]*)").Groups["S"].Value AS s
FROM
    (VALUES(1)) AS fe(n);

OUTPUT @t1
TO "/inhabit/test.csv"
USING Outputters.Csv();

This wariant works fine. But there is a question. Will the regex evauated 3 times per row? Does exists any chance to hint U-SQL engine - the function Regex.Match is deterministic.

回答1:

You should probably be using something more efficient than Regex.Match. But to answer your original question:

System.Text.RegularExpressions.Match is not part of the built-in U-SQL types.

Thus you would need to convert it into a built-in type, such as string or SqlArray<string> or wrap it into a udt that provides an IFormatter to make it a user-defined type.



回答2:

Looks like it is better to use something like this to parse the simple strings. Regexes are slow for the task and if i will use simple string expressions (instead of CLR calls) they probably will be translated into c++ code at codegen phase... and .net interop will be eliminated (i'm not sure).

@t1 = 
SELECT
    pv.cust_gads != null ? new SQL.ARRAY<string>(pv.cust_gads.Split(':')) : null AS p
FROM
    dwh.raw_page_view_data AS pv
WHERE
    pv.year == "2017" AND
    pv.month == "04";

@t3 = 
SELECT
    p != null && p.Count == 3 ? p[0].Split('=')[1] : null AS id,
    p != null && p.Count == 3 ? p[1].Split('=')[1] : null AS t,
    p != null && p.Count == 3 ? p[2].Split('=')[1] : null AS s
FROM
    @t1 AS t1;

OUTPUT @t3
TO "/tmp/test.csv"
USING Outputters.Csv();