-->

BigQuery to find the Subsequence

2019-06-05 21:06发布

问题:

Assuming my table is

WITH `sample_project.sample_dataset.table` AS (
  SELECT 'user1' user, 2 sequence, 'T1' ts UNION ALL
  SELECT 'user1', 2, 'T2' UNION ALL
  SELECT 'user1', 1, 'T3' UNION ALL
  SELECT 'user1', 1, 'T4' UNION ALL
  SELECT 'user1', 3, 'T5' UNION ALL
  SELECT 'user1', 2, 'T6' UNION ALL
  SELECT 'user1', 3, 'T7' UNION ALL
  SELECT 'user1', 3, 'T8' 
)

Can I find Subsequence of Integers available in sequence column without using STRING_AGG and REGEX OR JOIN operations ? This is to make query more efficient.

A subsequence is a part of String. For example consider String "banana", A sample subsequence is "anna" as each index character of "anna" from banana is strictly increasing. Characters in a subsequence need not be contiguous.

Say for the above table when order by timestamp (INCREASING), I would get STRING_AGG for sequence column as 22113233. In the String 22113233 subsequence 1 2 3 is available whereas subsequence 3 2 1 is not available. Given a subsequence 213, How can I say if this subsequence is available or not (in 22113233 which sorted by timestamp) ?

回答1:

Given a subsequence 213, How can I say if this subsequence is available or not (in 22113233 ...

Below example is for BigQuery SQL

#standardSQL
WITH `sequences` AS (
  SELECT '22113233' sequence_list 
), `subsequenses` AS (
  SELECT '123' subsequence UNION ALL
  SELECT '321' UNION ALL
  SELECT '213'
)
SELECT sequence_list, subsequence, 
  REGEXP_CONTAINS(sequence_list, REGEXP_REPLACE(subsequence, '', '.*')) available
FROM `sequences` l
CROSS JOIN `subsequenses` s   

with result as below

sequence_list   subsequence     available    
22113233        321             false    
22113233        123             true     
22113233        213             true     

if you are looking for specific subsequence - this can be further simplified as

#standardSQL
WITH `sequences` AS (
  SELECT '22113233' sequence_list UNION ALL
  SELECT '11223322'
)
SELECT sequence_list,  
  REGEXP_CONTAINS(sequence_list, REGEXP_REPLACE('213', '', '.*')) available
FROM `sequences`

with result as

sequence_list   available    
22113233        true     
11223322        false