I have a PySpark DataFrame with 2 ArrayType fields:
>>> df
DataFrame[id: string, tokens: array<string>, bigrams: array<string>]
>>> df.take(1)
[Row(id='ID1', tokens=['one', 'two', 'two'], bigrams=['one two', 'two two'])]
I would like to combine them into a single ArrayType field:
>>> df2
DataFrame[id: string, tokens_bigrams: array<string>]
>>> df2.take(1)
[Row(id='ID1', tokens_bigrams=['one', 'two', 'two', 'one two', 'two two'])]
The syntax that works with strings does not seem to work here:
df2 = df.withColumn('tokens_bigrams', df.tokens + df.bigrams)
Thanks!
Spark >= 2.4

You can use the concat function (SPARK-23736), which supports array columns as of that version.
To keep data when one of the values is NULL, you can coalesce each column with an empty array (concat returns NULL if any of its inputs is NULL).
Spark < 2.4
Unfortunately, to concatenate array columns in the general case you'll need a UDF.
In Spark 2.4.0 (and in 2.3 on the Databricks platform) you can do this natively in the DataFrame API using the concat function.
The related JIRA ticket is SPARK-23736.