I have a parquet table with one of the columns being an `array<struct<col1,col2,..colN>>`.
I can run queries against this table in Hive using the LATERAL VIEW syntax.

How do I read this table into an RDD, and more importantly, how do I filter, map, etc. over this nested collection in Spark?

I could not find any references to this in the Spark documentation. Thanks in advance for any information!
P.S. I felt it might be helpful to give some stats on the table. Number of columns in the main table: ~600. Number of rows: ~200M. Number of "columns" in the nested collection: ~10. Average number of records in the nested collection: ~35.
I'll give a Python-based answer since that's what I'm using. I think Scala has something similar.
The `explode` function was added in Spark 1.4.0 to handle nested arrays in DataFrames, according to the Python API docs.

Create a test dataframe:
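A minimal sketch of what that might look like in the pyspark shell (the column names `key` and `values` are made up for illustration):

```python
from pyspark.sql import Row

# Hypothetical test data: a key column plus a list column to flatten.
rows = [Row(key="k1", values=[1, 2, 3]),
        Row(key="k2", values=[4, 5])]
df = sqlContext.createDataFrame(rows)
df.printSchema()
```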
Use `explode` to flatten the list column:
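A hedged sketch of that step, producing one output row per list element:

```python
from pyspark.sql.functions import explode

# Each element of the `values` array becomes its own output row.
flat = df.select(df["key"], explode(df["values"]).alias("value"))
flat.show()
```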
There is no magic in the case of nested collections. Spark will handle an `RDD[(String, String)]` and an `RDD[(String, Seq[String])]` the same way.

Reading such a nested collection from Parquet files can be tricky, though.
Let's take an example from the `spark-shell` (1.3.1).

Write the parquet file:
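A sketch of what the write could look like; the case classes `Outer` and `Inner` are assumptions for the example:

```scala
case class Inner(a: String, b: String)
case class Outer(key: String, inners: Seq[Inner])

// In the 1.3.x spark-shell, `sc` and `sqlContext` are predefined.
import sqlContext.implicits._

val outers = sc.parallelize(List(Outer("k1", List(Inner("a", "b")))))
outers.toDF().saveAsParquetFile("outers.parquet")
```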
Read the parquet file:
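A matching read, rebuilding an `RDD[Outer]` from the rows (under the same assumed case classes):

```scala
import org.apache.spark.sql.Row

val dataFrame = sqlContext.parquetFile("outers.parquet")
val outers = dataFrame.map { row =>
  val key = row.getString(0)
  // The nested array<struct<...>> comes back as a Seq[Row].
  val inners = row.getAs[Seq[Row]](1).map(r => Inner(r.getString(0), r.getString(1)))
  Outer(key, inners)
}
```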
The important part is `row.getAs[Seq[Row]](1)`. The internal representation of a nested sequence of `struct` is `ArrayBuffer[Row]`; you could use any super-type of it instead of `Seq[Row]`. The `1` is the column index in the outer row. I used the method `getAs` here, but there are alternatives in the latest versions of Spark. See the source code of the Row trait.

Now that you have an `RDD[Outer]`, you can apply any wanted transformation or action.

Note that we used the Spark SQL library only to read the parquet file. You could, for example, select only the wanted columns directly on the DataFrame before mapping it to an RDD.
Another approach would be using pattern matching like this:
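A sketch of that approach, reusing the assumed `Outer`/`Inner` case classes (the `Seq[Row]` type pattern compiles with an erasure warning):

```scala
val rdd = dataFrame.map {
  case Row(key: String, inners: Seq[Row]) =>
    Outer(key, inners.map { case Row(a: String, b: String) => Inner(a, b) })
}
```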
You can pattern match directly on Row but it is likely to fail for a few reasons.
The above answers are all great and tackle this question from different sides; Spark SQL is also quite a useful way to access nested data.
Here's an example of how to use explode() in SQL directly to query a nested collection:
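A sketch of the shape of such a query; the table name `my_table` is an assumption, while `tsp_ids` and `person_seq_no` are from the description just below:

```sql
-- explode() turns each struct of the tsp_ids array into its own row,
-- which the outer query can then reach into with dot notation.
SELECT x.tsp.person_seq_no
FROM (
    SELECT explode(tsp_ids) AS tsp
    FROM my_table
) x
```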
tsp_ids is a nested array of structs with many attributes, including person_seq_no, which I'm selecting in the outer query above.
The above was tested in Spark 2.0. I did a small test, and it doesn't work in Spark 1.6. This question was asked when Spark 2 wasn't around, so this answer adds nicely to the list of available options for dealing with nested structures.
Notable unresolved JIRAs on explode() for SQL access: