I got a table which contains URLs and some other columns, for example dates. The URLs contain IDs, separated by different values. What the IDs have in common is that they contain only numeric values [0-9] and are separated by non-numeric characters:
date | url
01-01-1999 | https://www.example.com/category1/subcategory1/71347983~7275798_fui~85092374238590235.......
01-02-1999 | https://www.example.com/category1/subcategory2/71347983_7275798/85092374238590235~773429834.......
01-02-1999 | https://www.example.com/category1/subcategory2/71347983_23235~773429834.......
01-01-1999 | https://www.example.com/category2/subcategory1/71347983_7275798_85092374238590235~773429834-1378162468.......
01-01-1999 | https://www.example.com/category2/subcategory2/71347983........
01-01-1999 | https://www.example.com/category2/subcategory2/45256........
Note: the amount of IDs per URl differs Note: IDs differ in length Note: IDs are separated by different characters Note: IDs are not always at the third level in the URLs
I'd like to construct a query that counts the amount of ID's in the URL. The query should return something like:
date | #IDs
01-01-1999 | 4
01-02-1999 | 4
01-02-1999 | 3
01-01-1999 | 5
01-01-1999 | 1
01-01-1999 | 1
Secondly, I'd like to group the "amounts" by date:
date | #IDs | amount date/ID combination
01-01-1999 | 4 | 1
01-02-1999 | 4 | 1
01-02-1999 | 3 | 1
01-01-1999 | 5 | 1
01-01-1999 | 1 | 2
The part of identify an ID is something that I am not manage to get working. Should I work with regex? Or something else? Hope someone could help me out!
Below is for BigQuery Standard SQL
If to apply to sample data from your question - the output is
If to apply to sample data from your question - the output is
try this one
I see that in one of your examples IDs can be separated by
/
too. If you know the maximum number of "levels" (number of/
in your path). You can use this query to extract all the IDs.You can build on top of this to do more aggregation.