PySpark: select a specific column by its position

Posted 2020-04-06 02:44

Question:

I would like to know how to select a specific column in a DataFrame by its position rather than by its name.

Like this in Pandas:

df = df.iloc[:,2]

Is this possible?

Answer 1:

You can always get the name of the column with df.columns[n] and then select it:

# spark is an existing SparkSession (created automatically in the pyspark shell)
df = spark.createDataFrame([[1, 2], [3, 4]], ['a', 'b'])

To select the column at position n:

n = 1
df.select(df.columns[n]).show()
+---+
|  b|
+---+
|  2|
|  4|
+---+
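
The same idea extends to several positions at once, since select also accepts a list of names. A minimal sketch, assuming the df created above (the positions list is just an illustration):

# Select the columns at positions 0 and 1 by mapping positions to names first
positions = [0, 1]
df.select([df.columns[i] for i in positions]).show()
+---+---+
|  a|  b|
+---+---+
|  1|  2|
|  3|  4|
+---+---+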

To select all but column n:

n = 1

You can either use drop:

df.drop(df.columns[n]).show()
+---+
|  a|
+---+
|  1|
|  3|
+---+
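
Because drop accepts several column names, the same pattern can remove more than one position in a single call. A sketch, assuming the same df (to_drop is just an illustrative name):

# Drop every column whose position appears in to_drop (here only position 1)
to_drop = [1]
df.drop(*[df.columns[i] for i in to_drop]).show()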

Or use select with a list of column names that skips position n:

df.select(df.columns[:n] + df.columns[n+1:]).show()
+---+
|  a|
+---+
|  1|
|  3|
+---+
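
To answer the original pandas comparison directly: df.iloc[:, 2] corresponds to selecting the name at position 2. A minimal sketch, assuming a hypothetical DataFrame df3 with at least three columns:

# Hypothetical three-column DataFrame, to mirror pandas df.iloc[:, 2]
df3 = spark.createDataFrame([[1, 2, 3], [4, 5, 6]], ['a', 'b', 'c'])

# Equivalent of df.iloc[:, 2]: select the column at position 2 by name
df3.select(df3.columns[2]).show()
+---+
|  c|
+---+
|  3|
|  6|
+---+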