I have an RDD with about 500 columns and 200 million rows, and RDD.columns.indexOf("target", 0)
returns Int = 77,
which tells me that my target (dependent) variable is at column index 77. However, I don't know how to select only some of the columns as features (say, columns 23 to 59, 111 to 357, and 399 to 489). I am wondering whether I can do something like this:
val data = rdd.map(col => new LabeledPoint(
  col(77).toDouble, Vectors.dense(??.map(x => x.toDouble).toArray)))
Any suggestions or guidance will be much appreciated.
Maybe I have confused RDD with DataFrame. I can convert the RDD to a DataFrame with .toDF(),
or perhaps it is easier to accomplish this with a DataFrame than with an RDD.
I assume your data looks more or less like this:
So we have data as below:
and we want to ignore foo and x2 and extract LabeledPoint(target, Array(x1, x3)):
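The index bookkeeping behind this can be sketched in plain Scala, without Spark, so that it is easy to try out. This is only a minimal sketch: the column order in `columns`, the object name `FeatureSelection`, and the helpers `indicesOf`, `indicesExcluding` and `extract` are all assumptions made for illustration, and the ranges are copied from the question as if they were zero-based indices.

```scala
object FeatureSelection {
  // Hypothetical column layout; in Spark this would come from df.columns.
  val columns: Array[String] = Array("foo", "target", "x1", "x2", "x3")

  // Map the names of the wanted feature columns to their indices.
  def indicesOf(all: Array[String], wanted: Seq[String]): Array[Int] =
    wanted.map(all.indexOf(_)).toArray

  // Alternative: keep every column except the ignored ones.
  def indicesExcluding(all: Array[String], ignored: Seq[String]): Array[Int] =
    all.diff(ignored).map(all.indexOf(_))

  // Split one row into (label, features) using the precomputed indices.
  def extract(row: IndexedSeq[Double],
              targetInd: Int,
              featInd: Array[Int]): (Double, Array[Double]) =
    (row(targetInd), featInd.map(row(_)))

  // Index ranges from the question, concatenated into one array
  // (assumed here to be zero-based and inclusive).
  val rangeInd: Array[Int] =
    ((23 to 59) ++ (111 to 357) ++ (399 to 489)).toArray
}
```

With a DataFrame whose selected columns are all of type Double, the same indices could be used along the lines of `df.rdd.map(r => LabeledPoint(r.getDouble(targetInd), Vectors.dense(featInd.map(r.getDouble(_)))))`, with `LabeledPoint` and `Vectors` imported from `org.apache.spark.mllib`. Computing the indices once on the driver, outside the `map`, avoids repeating the name lookup for every row.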