Transform data in Google bigquery - extract text,

2019-07-15 09:14发布

问题:

I have some weblog data in big query which I need to transform to make it easier to use and query. The data looks like:

I want to extract and transform the data within the curled brackets after Results{…..} (colored blue). The data is of the form ‘(\d+((PQ)|(KL))+\d+)’ and there can be 1-20+ entries in the result array. I am only interested in the first 16 entries.

I have been able to extract the data within curled brackets into a new column, using Substr and regext_extract. But I'm unable to SPLIT it into columns (sometimes there is only 1 result and so the delimiter "," is missing. I'm new with regex, may be I can use something like ‘(\d+((PQ)|(KL))+\d+){1}’ etc. to split the data into multiple columns and then pivot it.

Ideal output in my case would be to transform it into something like:

In the above solution, each row in original table is repeated from 1-16 times depending on the number of items in the Results array.

I’m not completely sure if it’s possible to do this in big query. I’ll be grateful if anyone can help me out a little here.

If this is not possible, then I can have 16 rows for every event with NULL values in Event_details for cases where there are less than 16 entries in result array.

In case both of these are not possible, the last solution would be to have it transformed into something like:

The reason I want to transform the data is that in most of the cases I would need to find which result array items are appearing and in what order.

回答1:

Check this out: Split string into multiple columns with bigquery. In their case its delimited by spaces. replace the \s with ','

something like:

SELECT  
Regexp_extract(StringToParse,r'^*{(?:[^,]*,){0}(\d+(?:(?:PQ)|(?:KL))+\d+)\s?') as Word0,
Regexp_extract(StringToParse,r'^*{(?:[^,]*,){1}(\d+(?:(?:PQ)|(?:KL))+\d+)\s?') as Word1,
Regexp_extract(StringToParse,r'^*{(?:[^,]*,){2}(\d+(?:(?:PQ)|(?:KL))+\d+)\s?') as Word2,
Regexp_extract(StringToParse,r'^*{(?:[^,]*,){3}(\d+(?:(?:PQ)|(?:KL))+\d+)\s?') as Word3,
FROM
(SELECT 'bla{1234PQ5,6789KL0,1234PQ5,6789KL0,123' as StringToParse)


回答2:

Use SPLIT()

SELECT Event_ID, Event_UserID, Event_SessionID, Keyword,
SPLIT(REGEXP_EXTRACT(Event_details,"Results\{(.*)\}"),",") as Event_details_item
FROM mydata.mytable