Spark DataFrame equivalent to Pandas DataFrame `.iloc`

Posted 2020-02-14 02:38

Question:

Is there a way to reference Spark DataFrame columns by position using an integer?

Analogous Pandas DataFrame operation:

df.iloc[:, 0]  # give me all the rows at column position 0
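For reference, a minimal Pandas sketch of the behaviour being asked about (the toy frame below is made up for illustration):

import pandas as pd

pdf = pd.DataFrame({"age": [30, 19], "name": ["Andy", "Justin"]})  # hypothetical data
pdf.iloc[:, 0]   # positional column access: the whole "age" column
## 0    30
## 1    19
## Name: age, dtype: int64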

Answer 1:

Not really, but you can try something like this:

Python:

df = sc.parallelize([(1, "foo", 2.0)]).toDF()
df.select(*df.columns[:1])  # I assume [:1] is what you really want
## DataFrame[_1: bigint]

or

df.select(df.columns[1:3])
## DataFrame[_2: string, _3: double]
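If you need this pattern repeatedly, it can be wrapped in a tiny helper (the name select_by_pos is just an illustration, not a Spark API):

def select_by_pos(df, *positions):
    """Select DataFrame columns by integer position, iloc-style."""
    return df.select(*[df.columns[i] for i in positions])

select_by_pos(df, 0, 2)
## DataFrame[_1: bigint, _3: double]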

Scala:

import org.apache.spark.sql.functions.col  // needed for col(_)

val df = sc.parallelize(Seq((1, "foo", 2.0))).toDF()
df.select(df.columns.slice(0, 1).map(col(_)): _*)

Note:

Spark SQL doesn't support row indexing, and it is unlikely to ever support it, so it is not possible to index along the row dimension.
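If positional row access is really needed, one common workaround is to materialise an explicit index column and filter on it. A PySpark sketch, assuming ordering by all columns is an acceptable tie-breaker (row_number without a partition key funnels the data through a single partition, so this is only sensible for small or already-ordered data):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy(*df.columns)                        # ordering assumption
indexed = df.withColumn("row_idx", F.row_number().over(w) - 1)
indexed.filter(F.col("row_idx") == 0).show()           # "row 0" under that ordering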



Answer 2:

The closest equivalent of Pandas df.iloc in PySpark is collect:

PySpark examples:

X = df.collect()[0]['age']  # row 0, column "age"

or

X = df.collect()[0][1]  # row 0, column position 1
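Note that collect() materialises the entire DataFrame on the driver, which can be expensive. If only the first rows are needed, take() (or head()) fetches just those rows and supports the same positional access on the returned Row:

first_row = df.take(1)[0]   # same Row as df.collect()[0], without fetching everything
X = first_row[1]            # positional access within the Row (row 0, col 1)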


Answer 3:

You can do it like this in spark-shell:

scala> df.columns
Array[String] = Array(age, name)

scala> df.select(df.columns(0)).show()
+----+
| age|
+----+
|null|
|  30|
|  19|
+----+