I use spark-shell to do the operations below.
I recently loaded a table with an array column in spark-sql. Here is the DDL:
create table test_emp_arr(
    dept_id string,
    dept_nm string,
    emp_details Array<string>
)
The data looks something like this:
+-------+-------+-------------------------------+
|dept_id|dept_nm| emp_details|
+-------+-------+-------------------------------+
| 10|Finance|[Jon, Snow, Castle, Black, Ned]|
| 20| IT| [Ned, is, no, more]|
+-------+-------+-------------------------------+
I can query the emp_details column like this:
sqlContext.sql("select emp_details[0] from test_emp_arr").show
Problem
I want to query a range of elements in the collection:
Expected query to work
sqlContext.sql("select emp_details[0-2] from test_emp_arr").show
or
sqlContext.sql("select emp_details[0:2] from test_emp_arr").show
Expected output
+-------------------+
| emp_details|
+-------------------+
|[Jon, Snow, Castle]|
| [Ned, is, no]|
+-------------------+
In pure Scala, if I have an array such as:
val emp_details = Array("Jon","Snow","Castle","Black")
I can get the elements in the range 0 to 2 using
emp_details.slice(0,3)
which returns
Array(Jon, Snow, Castle)
I am not able to apply the above operation to the array in spark-sql.
Thanks
Since Spark 2.4 you can use the slice function. It exists in Python (pyspark.sql.functions.slice) as well as in Scala.
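A minimal sketch in Scala (the DataFrame name df and the use of spark.table are my assumptions; slice takes a 1-based start position and a length):

import org.apache.spark.sql.functions.{col, slice}

// df is assumed to hold the test_emp_arr data
val df = spark.table("test_emp_arr")

// keep 3 elements starting at position 1 (slice is 1-based)
df.select(slice(col("emp_details"), 1, 3).alias("emp_details")).show(false)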
The same thing can of course be done in SQL:
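For example (a sketch; spark is assumed to be a Spark 2.4+ SparkSession and the table name is taken from the question):

spark.sql("select slice(emp_details, 1, 3) as emp_details from test_emp_arr").show(false)
// -> [Jon, Snow, Castle] and [Ned, is, no]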
Important:
Please note that, unlike Seq.slice, values are indexed from one and the second argument is a length, not an end position.

You can use the function array to build a new array out of the three values:
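A sketch of that approach (the val names input and result are mine; column and table names follow the question):

import org.apache.spark.sql.functions.{array, col}

val input = sqlContext.sql("select emp_details from test_emp_arr")

// pick the first three elements and reassemble them into a new array column
val result = input.select(
  array(col("emp_details")(0), col("emp_details")(1), col("emp_details")(2)) as "emp_details"
)
result.show(false)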
Edit2: For those who want to avoid a udf at the expense of readability ;-)

If you really want to do it in one step, you will have to use Scala to create a lambda function returning a sequence of Column and wrap it in an array. This is a bit involved, but it's one step:
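A sketch of that one-step version (the range 0 until 3 and the DataFrame name df are assumptions):

import org.apache.spark.sql.functions.{array, col}

// one step: map the index range to element columns and wrap them in the variadic array()
df.select(array((0 until 3).map(i => col("emp_details")(i)): _*) as "emp_details").show(false)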
The : _* works a bit of magic to pass a list to a so-called variadic function (array in this case, which constructs the SQL array). But I would advise against using this solution as is; put the lambda function in a named function for code readability. Note that in general, sticking to Column expressions (without using udf) gives better performance.

Edit: In order to do it in a sql statement (as you ask in your question...), following the same logic you would generate the sql query using Scala logic (not saying it's the most readable):
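A sketch of that idea, building the projection as a string (the bounds and table name are assumptions):

// builds "select array(emp_details[0],emp_details[1],emp_details[2]) as emp_details from test_emp_arr"
val query = "select array(" +
  (0 until 3).map(i => s"emp_details[$i]").mkString(",") +
  ") as emp_details from test_emp_arr"

sqlContext.sql(query).show(false)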
Note that you can replace until by to in order to provide the last element taken rather than the element at which the iteration stops.

Here is my generic slice UDF, supporting arrays of any type. It is a little bit ugly because you need to know the element type in advance:
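A sketch of such a UDF (the helper names arraySlice and sliceUDF are mine; the element type is passed in because the untyped udf needs an explicit return DataType):

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, lit, udf}
import org.apache.spark.sql.types.{ArrayType, DataType, StringType}

// plain Scala slice, null-safe
def arraySlice(arr: Seq[AnyRef], from: Int, until: Int): Seq[AnyRef] =
  if (arr == null) null else arr.slice(from, until)

// wrap it as a UDF whose return type is an array of the supplied element type
def sliceUDF(elemType: DataType): UserDefinedFunction =
  udf(arraySlice _, ArrayType(elemType))

df.select(sliceUDF(StringType)(col("emp_details"), lit(0), lit(3)) as "emp_details").show(false)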
Use nested split:
split(split(concat_ws(',',emp_details),concat(',',emp_details[3]))[0],',')
For example:
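A sketch of it as a full statement against the question's table:

// concat_ws joins the array with commas, the inner split cuts that string at ",<element 3>",
// and the outer split turns the kept prefix back into an array
sqlContext.sql(
  "select split(split(concat_ws(',', emp_details), concat(',', emp_details[3]))[0], ',') as emp_details " +
  "from test_emp_arr"
).show(false)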