Get a range of columns of Spark RDD

Posted 2019-06-19 05:59

I have 300+ columns in my RDD, and I need to dynamically select a range of columns and put them into the LabeledPoint data type. As a newbie to Spark, I am wondering whether there is an index-based way to select a range of columns in an RDD, something like temp_data = data[, 101:211] in R. Is there something like val temp_data = data.filter(_.column_index in range(101:211)...?

Any thoughts are welcome and appreciated.

Tags: scala apache-spark rdd

3 answers

一纸荒年 Trace。

#2 · 2019-06-19 06:19

If it is a DataFrame, then something like this should work:

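import org.apache.spark.sql.functions.col
val df = rdd.toDF  // needs the SQL implicits in scope, e.g. import sqlContext.implicits._
// slice is zero-based and end-exclusive; map the names to Columns so the varargs select compiles
df.select(df.columns.slice(101, 211).map(col): _*)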


趁早两清

#3 · 2019-06-19 06:35

This is kind of an old thread, but I recently had to do something similar and searched around. I needed to select all but the last column of a table with 200+ columns.

Spark 1.4.1
Scala 2.10.4

val df = hiveContext.sql("SELECT * FROM foobar")
val cols = df.columns.slice(0, df.columns.length - 1)
val new_df = df.select(cols.head, cols.tail:_*)


对你真心纯属浪费

#4 · 2019-06-19 06:38

Assuming you have an RDD of Array or any other Scala collection (e.g., List), you can do something like this:

import org.apache.spark.rdd.RDD

val data: RDD[Array[Int]] = sc.parallelize(Array(Array(1,2,3), Array(4,5,6)))
// slice(from, until) runs on each row: zero-based, end-exclusive
val sliced: RDD[Array[Int]] = data.map(_.slice(0, 2))

sliced.collect()
> Array[Array[Int]] = Array(Array(1, 2), Array(4, 5))
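
To connect this back to the R-style range and the LabeledPoint goal in the question, here is a rough sketch of the index arithmetic using plain Scala collections. ColumnRange, rRange and labelAndFeatures are made-up names for illustration, and the assumption that the label sits at a zero-based labelIndex is mine, not the asker's. The Spark part is only shown as a comment, since it depends on how the rows are actually typed.

```scala
// Hypothetical helper, not from the thread: pick columns by an R-style
// 1-based, inclusive range and split off a label column.
object ColumnRange {
  // R's data[, first:last] (1-based, inclusive) maps to slice(first - 1, last)
  def rRange[T](row: Seq[T], first: Int, last: Int): Seq[T] =
    row.slice(first - 1, last)

  // Returns (label, features); labelIndex is zero-based (an assumption here)
  def labelAndFeatures(row: Seq[Double], labelIndex: Int,
                       first: Int, last: Int): (Double, Seq[Double]) =
    (row(labelIndex), rRange(row, first, last))
}

// Small local check with a plain Seq instead of an RDD
val row = Seq(9.0, 1.0, 2.0, 3.0, 4.0)
val (label, features) = ColumnRange.labelAndFeatures(row, 0, 2, 4)

// On an RDD of Double rows the same function would go inside map, e.g.
//   data.map { r =>
//     val (y, x) = ColumnRange.labelAndFeatures(r, 0, 101, 211)
//     LabeledPoint(y, Vectors.dense(x.toArray))
//   }
// with LabeledPoint from org.apache.spark.mllib.regression and
// Vectors from org.apache.spark.mllib.linalg.
```

Keeping the 1-based to 0-based conversion in one small function avoids off-by-one mistakes when the range is computed dynamically.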
