Is there a way to reference Spark DataFrame columns by position using an integer?
Analogous Pandas DataFrame operation:
df.iloc[:, 0] # Give me all the rows at column position 0
Not really, but you can try something like this:
Python:
df = sc.parallelize([(1, "foo", 2.0)]).toDF()
df.select(*df.columns[:1]) # the slice [:1] keeps only the first column
## DataFrame[_1: bigint]
or
df.select(df.columns[1:3])
## DataFrame[_2: string, _3: double]
Scala:
import org.apache.spark.sql.functions.col

val df = sc.parallelize(Seq((1, "foo", 2.0))).toDF()
df.select(df.columns.slice(0, 1).map(col(_)): _*)
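If you need positional selection often in PySpark, the same idea can be wrapped in a small helper. This is just a sketch; the name select_by_position is my own, not part of the Spark API:

from pyspark.sql import DataFrame

def select_by_position(df: DataFrame, *positions: int) -> DataFrame:
    # Translate integer positions into column names, then delegate to select()
    return df.select(*[df.columns[i] for i in positions])

select_by_position(df, 0, 2)
## DataFrame[_1: bigint, _3: double]

It works because df.columns is a plain Python list, so any list indexing or slicing you would use in Pandas applies to the column names.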
Note:
Spark SQL doesn't support row indexing, and it is unlikely to ever support it, so it is not possible to index across the row dimension.
The closest equivalent of Pandas df.iloc for row access is collect, which brings the rows back to the driver.
PySpark examples:
X = df.collect()[0]['age']
or
X = df.collect()[0][1]  # row 0, col 1
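Note that collect() pulls the entire DataFrame to the driver, which can be expensive on large data. If you only need one row, a lighter sketch (assuming a column named 'age' at position 0, as in the examples here) uses take:

first_row = df.take(1)[0]  # brings back a single Row instead of every row
X = first_row['age']       # access by field name
X = first_row[0]           # same value by position, assuming 'age' is column 0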
You can do it like this in spark-shell:
scala> df.columns
Array[String] = Array(age, name)
scala> df.select(df.columns(0)).show()
+----+
| age|
+----+
|null|
| 30|
| 19|
+----+