How do I select the columns of a DataFrame at certain indexes in Scala?
For example, if a DataFrame has 100 columns and I want to extract only columns (10, 12, 13, 14, 15), how do I do that?
The following selects all columns of the DataFrame df whose names are listed in the array colNames:
df = df.select(colNames.head, colNames.tail: _*)
Suppose there is a similar array, colNos, that holds column indexes:
colNos = Array(10,20,25,45)
How do I change the df.select above to fetch only the columns at those indexes?
Example: grab the first 14 columns of a Spark DataFrame by index using Scala.
You cannot simply do this (as I tried and failed):
The reason is that select has no overload that takes an Array[String] directly, so you have to convert the Array[String] of column names to an Array[org.apache.spark.sql.Column] for the sliced selection to work.
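A minimal sketch of that conversion, written for spark-shell or a Scala script. The local session, the 20-column toy DataFrame, and the names `sliceNames`, `sliceCols` and `first14` are assumptions made for illustration, not taken from the original answer.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Assumption: a throwaway local session; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("select-by-index").getOrCreate()

// Hypothetical 20-column DataFrame with columns c0 .. c19.
val df: DataFrame = spark.range(1).select((0 until 20).map(i => lit(i).as(s"c$i")): _*)

// df.columns is an Array[String]; slice(0, 14) keeps the names at positions 0 to 13.
val sliceNames: Array[String] = df.columns.slice(0, 14)

// df.select(sliceNames) would not compile, since select has no Array[String] overload.
// Converting each name to a Column and expanding the array as varargs does work.
val sliceCols: Array[Column] = sliceNames.map(name => col(name))
val first14: DataFrame = df.select(sliceCols: _*)
```

Positions are 0-based, so `slice(0, 14)` returns the first 14 columns.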
Or wrap it in a function using currying (high five to my colleague for this):
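One way such a curried wrapper could look. This is a sketch, not the colleague's original function: `selectByRange`, `firstFourteen` and the toy DataFrame `wide` are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Hypothetical curried helper: the first parameter list fixes the index range,
// the second takes the DataFrame to select from.
def selectByRange(start: Int, stop: Int)(df: DataFrame): DataFrame =
  df.select(df.columns.slice(start, stop).map(name => col(name)): _*)

// Supplying only the first parameter list yields a reusable DataFrame => DataFrame.
val firstFourteen: DataFrame => DataFrame = selectByRange(0, 14)

// Assumption: local session and a 20-column toy DataFrame (c0 .. c19).
val spark = SparkSession.builder().master("local[1]").appName("curried-select").getOrCreate()
val wide: DataFrame = spark.range(1).select((0 until 20).map(i => lit(i).as(s"c$i")): _*)

// Dataset.transform accepts the partially applied function directly.
val result: DataFrame = wide.transform(firstFourteen)
```

Splitting the parameters this way lets the same range be applied to several DataFrames without repeating the indexes.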
@user6910411's answer above works like a charm, and its number of tasks and logical plan are similar to those of my approach below. But my approach is a bit faster.
So, I would suggest you go with column names rather than column numbers. Column names are much safer and much lighter than using numbers. You can use the following solution:
If you are hesitant to write out all 100 column names, there is a shortcut method too:
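A sketch of name-based selection, followed by one possible shortcut that derives the names from `df.columns` instead of typing them out. The toy DataFrame, the column names and the `startsWith` filter are assumptions; the shortcut in the original answer may have been different.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Assumption: local session and a 20-column toy DataFrame (c0 .. c19).
val spark = SparkSession.builder().master("local[1]").appName("select-by-name").getOrCreate()
val df: DataFrame = spark.range(1).select((0 until 20).map(i => lit(i).as(s"c$i")): _*)

// Explicit names: select(String, String*) takes the head plus the tail as varargs.
val wanted: Seq[String] = Seq("c10", "c12", "c13", "c14", "c15")
val byName: DataFrame = df.select(wanted.head, wanted.tail: _*)

// Possible shortcut: filter the existing names by a rule rather than listing them.
// Here the (hypothetical) rule keeps every column whose name starts with "c1".
val derived: Array[String] = df.columns.filter(_.startsWith("c1"))
val byRule: DataFrame = df.select(derived.map(name => col(name)): _*)
```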
You can map over columns:
or:
or:
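Three ways such a mapping over `df.columns` can be written, sketched under assumptions: the 50-column toy DataFrame and the result names are hypothetical, and the variants in the original answer may have been phrased differently.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Assumption: local session and a 50-column toy DataFrame (c0 .. c49).
val spark = SparkSession.builder().master("local[1]").appName("select-by-colnos").getOrCreate()
val df: DataFrame = spark.range(1).select((0 until 50).map(i => lit(i).as(s"c$i")): _*)

// 0-based positions of the columns to keep, as in the question.
val colNos = Array(10, 20, 25, 45)

// Variant 1: look up the names, then use the String-based select.
val names: Array[String] = colNos.map(i => df.columns(i))
val viaNames: DataFrame = df.select(names.head, names.tail: _*)

// Variant 2: turn each looked-up name into a Column with functions.col.
val viaCol: DataFrame = df.select(colNos.map(i => col(df.columns(i))): _*)

// Variant 3: resolve each Column against df itself via df(name).
val viaApply: DataFrame = df.select(colNos.map(i => df(df.columns(i))): _*)
```

Because the positions are 0-based, index 10 refers to the eleventh column.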
All the methods shown above are equivalent and don't impose a performance penalty. The following mapping:
is just a local Array access (constant-time access for each index), and choosing between the String-based or Column-based variant of select doesn't affect the execution plan:
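To check this on your own data, the plans of both variants can be printed and compared by eye. This is a sketch assuming a local session and a 50-column toy DataFrame; the plan text itself depends on the Spark version, so no output is shown.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

// Assumption: local session and a 50-column toy DataFrame (c0 .. c49).
val spark = SparkSession.builder().master("local[1]").appName("compare-plans").getOrCreate()
val df: DataFrame = spark.range(1).select((0 until 50).map(i => lit(i).as(s"c$i")): _*)

val colNos = Array(10, 20, 25, 45)

// The index lookup runs on the driver against a local Array[String].
val names: Array[String] = colNos.map(i => df.columns(i))

// String-based and Column-based variants of select.
val byString: DataFrame = df.select(names.head, names.tail: _*)
val byColumn: DataFrame = df.select(names.map(name => col(name)): _*)

// Print the extended plans of both variants for comparison.
byString.explain(true)
byColumn.explain(true)
```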