I have two array columns in a DataFrame.
I need to compare these two arrays and store the difference as an array in a new column of the same DataFrame.
Expected output is:
Column B is a subset of column A. Also, the words appear in the same order in both arrays.
Can anyone please help me find a solution for this?
Since Spark 2.4.0, this can be solved easily using array_except. Taking the example
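A minimal PySpark sketch of this approach, assuming the two array columns are named A and B as in the question. The helper name with_difference and the output column name difference are made up for illustration. Note that array_except returns its result without duplicates.

```python
def with_difference(df, left="A", right="B", out="difference"):
    """Return df with an extra column holding the elements of `left`
    that do not occur in `right` (Spark 2.4.0+).

    `with_difference` and the default column names are illustrative
    assumptions, not part of any library.
    """
    # Imported inside the function so the helper can be defined
    # before a Spark session exists.
    from pyspark.sql import functions as F

    # array_except(col1, col2): elements in col1 but not in col2,
    # returned without duplicates.
    return df.withColumn(out, F.array_except(F.col(left), F.col(right)))
```

With a DataFrame df that has array columns A and B, with_difference(df).show() would display the new difference column next to them.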
For more operations of this kind on arrays, I suggest this blog post: https://www.waitingforcode.com/apache-spark-sql/apache-spark-2.4.0-features-array-higher-order-functions/read
You can use a user-defined function. My example dataframe differs a bit from yours, but the code should work fine:
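As a rough sketch of the UDF idea, assuming string arrays in columns A and B: the names set_difference and to_array_udf are hypothetical, and the set-based logic below does not preserve element order and collapses duplicates.

```python
def set_difference(a, b):
    """Elements of `a` that are not in `b`, computed with sets.

    Order is not preserved and duplicates in `a` collapse to one.
    """
    return list(set(a) - set(b))


def to_array_udf(fn):
    """Wrap a plain Python function as a Spark UDF returning array<string>.

    Hypothetical helper; assumes the arrays hold strings.
    """
    # Imported lazily so the pure-Python logic can be tried without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    return udf(fn, ArrayType(StringType()))
```

It could then be applied as df.withColumn("difference", to_array_udf(set_difference)("A", "B")).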
EDIT:
This does not work if there are duplicates, because a set retains only unique values. You can amend the UDF as follows:
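One way the amended UDF might look, again assuming string arrays in columns A and B: the names ordered_difference and to_array_udf are hypothetical. This version keeps every element of A that does not appear anywhere in B, so it preserves both order and repeated values. It is not a multiset difference: a value present in B is dropped from A entirely, however many times it occurs.

```python
def ordered_difference(a, b):
    """Elements of `a` not present in `b`, keeping order and duplicates."""
    if a is None:
        # Pass a null array through unchanged.
        return None
    excluded = set(b or [])  # treat a null `b` as an empty array
    return [x for x in a if x not in excluded]


def to_array_udf(fn):
    """Wrap a plain Python function as a Spark UDF returning array<string>.

    Hypothetical helper; assumes the arrays hold strings.
    """
    # Imported lazily so the pure-Python logic can be tried without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    return udf(fn, ArrayType(StringType()))
```

Usage would mirror the set-based version: df.withColumn("difference", to_array_udf(ordered_difference)("A", "B")).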