Split values over multiple rows in RedShift

2019-04-29 15:01发布

问题:

The question of how to split a field (e.g. a CSV string) into multiple rows has already been answered: Split values over multiple rows.

However, this question refers to MSSQL, and the answers use various features for which there are no RedShift equivalents.

For the sake of completeness, here's an example of what I'd like to do:

Current data:

| Key | Data     |
+-----+----------+
| 1   | 18,20,22 |
| 2   | 17,19    |

Required data:

| Key | Data     |
+-----+----------+
| 1   | 18       |
| 1   | 20       |
| 1   | 22       |
| 2   | 17       |
| 2   | 19       |

Now, I can suggest a walkaround for the case of small, bounded number of elements in the CSV field: use split_part and union over all possible array locations, like so:

SELECT Key, split_part(Data, ',', 1) 
FROM mytable
WHERE split_part(Data, ',', 1) != ""
    UNION
SELECT Key, split_part(Data, ',', 2) 
FROM mytable
WHERE split_part(Data, ',', 2) != ""
-- etc. etc.

However, this is obviously very inefficient, and would not work for longer lists.

Any better ideas on how to do this?

EDIT:

There's also a somewhat similar question regarding multiplying rows: splitting rows in Redshift. However I don't see how this approach can be applied here.

EDIT 2:

A possible duplicate: Redshift. Convert comma delimited values into rows. But nothing new - the answer by @Masashi Miyazaki is similar to my suggestion above, and suffers from the same issues.

回答1:

Here is the Redshift answer, it will work with up to 10 thousand values per row.

Set up test data

create table test_data (key varchar(50),data varchar(max));
insert into test_data
    values
      (1,'18,20,22'),
      (2,'17,19')
;

code

with ten_numbers as (select 1 as num union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 0)
  , generted_numbers AS
(
    SELECT (1000 * t1.num) + (100 * t2.num) + (10 * t3.num) + t4.num AS gen_num
    FROM ten_numbers AS t1
      JOIN ten_numbers AS t2 ON 1 = 1
      JOIN ten_numbers AS t3 ON 1 = 1
      JOIN ten_numbers AS t4 ON 1 = 1
)
  , splitter AS
(
    SELECT *
    FROM generted_numbers
    WHERE gen_num BETWEEN 1 AND (SELECT max(REGEXP_COUNT(data, '\\,') + 1)
                                 FROM test_data)
)
  , expanded_input AS
(
    SELECT
      key,
      split_part(data, ',', s.gen_num) AS data
    FROM test_data AS td
      JOIN splitter AS s ON 1 = 1
    WHERE split_part(data, ',', s.gen_num) <> ''
)
SELECT * FROM expanded_input
order by key,data;


回答2:

You welcome to take RDS PostgreSql instance and create dblink to RedShift. Then you can manipulate on result set like on normal PostgreSQL DB and even put result back into RedShift through same dblink.