Given a query, select * from ...
(that might be part of CTAS statement)
The goal is to add an additional column, ID
, where ID
is a unique integer.
select ... as ID,* from ...
P.s.
ID
does not have to be sequential (there could be gaps)- The ID could be arbitrary (doesn't have to represent a specific order within the result set)
row_number logically solves the problem -
select row_number() over () as ID,* from ...
The problem is, that at least for now, global row_number (no partition by) is being implemented using a single reducer (hive) / task (spark).
Check this solution from Manoj Kumar: https://github.com/manojkumarvohra/hive-hilo
Usage:
hive
hivevar:buckets
should be high enough relatively to the number of reducers (mapred.reduce.tasks
), so the rows will be evenly distributed between the reduces.spark-sql
For both hive and spark-sql
The
rand()
is used to generate a good distribution.If You already have in your query a column / combination of columns with good distribution (might be unique, not a must) you might use it instead, e.g. -
Check this out for a globally unique id service https://github.com/spinaki/distributed-unique-id It has a docker image too which you can test quickly.
If you are using Spark-sql your best bet would be to use the inbuilt function
monotonically_increasing_id
which generates unique random id in a separate column. And as you said you don't need it to be sequential so this should ideally suffice your requirement.