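I have two nested arrays: one holds strings, the other holds floats. I would like to essentially zip them together and get one (value, var) pair per row. I was trying to do this with just a DataFrame, without resorting to RDDs or UDFs, thinking that it would be cleaner and faster.
I can turn the per-row arrays of values and variables into structs of (variable, value), one per row, but because my array sizes vary I would have to run my list comprehension over a different range for each row. So I thought I could just store the length in a column and use that. But because I would be using a column there, it fails with an error (a Column is not something a Python comprehension can iterate over). Any suggestions on how to use a column to dynamically build a struct like this (without RDDs/UDFs if possible)?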
from pyspark.sql.functions import col, array, struct, explode

# Assumes `spark` is an existing SparkSession (e.g. the one the pyspark shell provides).
DF1 = spark.createDataFrame([(["a", "b", "c", "d", "e", "f"], [1,2,3,4,5,6], 6),
                             (["g"], [7], 1),
                             (["a", "b", "g", "c"], [4,5,3,6], 4),
                             (["c", "d"], [2,3], 2),
                             (["a", "b", "c"], [5,7,2], 3)],
                            ["vars", "vals", "num_elements"])
DF1.show()

# Build one struct per array position; positions past the end of an array yield nulls.
arrayofstructs = array(*[struct(
    DF1.vars[c].alias("variables"),
    DF1.vals[c].alias("values")
#) for c in DF1.num_elements]) # <- DOES NOT WORK
) for c in range(10)]) # <- FIXED SIZE DOES WORK

# One output row per struct, including the all-null padding structs.
DF2 = DF1.withColumn("new", explode(arrayofstructs))
DF2.show()

# Drop the padding rows produced by the fixed range.
DF3 = DF2.filter(DF2.new.variables.isNotNull())
DF3.show()
+------------------+------------------+------------+
|              vars|              vals|num_elements|
+------------------+------------------+------------+
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|
|               [g]|               [7]|           1|
|      [a, b, g, c]|      [4, 5, 3, 6]|           4|
|            [c, d]|            [2, 3]|           2|
|         [a, b, c]|         [5, 7, 2]|           3|
+------------------+------------------+------------+
+------------------+------------------+------------+------+
|              vars|              vals|num_elements|   new|
+------------------+------------------+------------+------+
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|[a, 1]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|[b, 2]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|[c, 3]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|[d, 4]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|[e, 5]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|[f, 6]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|   [,]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|   [,]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|   [,]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|   [,]|
|               [g]|               [7]|           1|[g, 7]|
|               [g]|               [7]|           1|   [,]|
|               [g]|               [7]|           1|   [,]|
|               [g]|               [7]|           1|   [,]|
|               [g]|               [7]|           1|   [,]|
|               [g]|               [7]|           1|   [,]|
|               [g]|               [7]|           1|   [,]|
|               [g]|               [7]|           1|   [,]|
|               [g]|               [7]|           1|   [,]|
|               [g]|               [7]|           1|   [,]|
+------------------+------------------+------------+------+
only showing top 20 rows
+------------------+------------------+------------+------+
|              vars|              vals|num_elements|   new|
+------------------+------------------+------------+------+
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|[a, 1]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|[b, 2]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|[c, 3]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|[d, 4]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|[e, 5]|
|[a, b, c, d, e, f]|[1, 2, 3, 4, 5, 6]|           6|[f, 6]|
|               [g]|               [7]|           1|[g, 7]|
|      [a, b, g, c]|      [4, 5, 3, 6]|           4|[a, 4]|
|      [a, b, g, c]|      [4, 5, 3, 6]|           4|[b, 5]|
|      [a, b, g, c]|      [4, 5, 3, 6]|           4|[g, 3]|
|      [a, b, g, c]|      [4, 5, 3, 6]|           4|[c, 6]|
|            [c, d]|            [2, 3]|           2|[c, 2]|
|            [c, d]|            [2, 3]|           2|[d, 3]|
|         [a, b, c]|         [5, 7, 2]|           3|[a, 5]|
|         [a, b, c]|         [5, 7, 2]|           3|[b, 7]|
|         [a, b, c]|         [5, 7, 2]|           3|[c, 2]|
+------------------+------------------+------------+------+