We can call JavaSparkContext.wholeTextFiles and get a JavaPairRDD<String, String>, where the first String is the file name and the second String is the whole file content. Is there a similar method in the Dataset API, or is my only option to load the files into a JavaPairRDD and then convert that to a Dataset? (That works, but I'm looking for a non-RDD solution.)
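For reference, a minimal sketch of the RDD detour described above, assuming a local SparkSession; the class name, path, and column names are illustrative:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class WholeTextFilesViaRdd {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("whole-text-files-rdd")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // (fileName, wholeFileContent) pairs, one per file.
        JavaPairRDD<String, String> files = jsc.wholeTextFiles("path/to/files/");

        // Convert the pair RDD into a two-column Dataset.
        StructType schema = new StructType()
                .add("fileName", DataTypes.StringType)
                .add("content", DataTypes.StringType);
        Dataset<Row> ds = spark.createDataFrame(
                files.map(pair -> RowFactory.create(pair._1(), pair._2())),
                schema);

        ds.show(false);
        spark.stop();
    }
}
```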
If you want to use the Dataset API, you can use spark.read.text("path/to/files/"). Please check here for API details. Note that the text() method returns a DataFrame in which "each line in the text files is a new row in the resulting DataFrame", so text() gives you the file contents line by line. To get the file name, you have to use the input_file_name() function. If you then want to concatenate the rows coming from the same file, so that you end up with the whole file content, you need to groupBy the file-name column and use the concat_ws and collect_list functions, as sketched below.
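A minimal sketch of this approach, assuming a local SparkSession; the path, app name, and the output column name "content" are illustrative, and note that collect_list does not guarantee that lines keep their original order:

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WholeTextFilesViaDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("whole-text-files-dataset")
                .master("local[*]")
                .getOrCreate();

        // text() puts each line of every file into a row of the "value" column;
        // input_file_name() adds the fully qualified name of the source file.
        Dataset<Row> lines = spark.read()
                .text("path/to/files/")
                .withColumn("fileName", input_file_name());

        // Group the lines by file and glue them back together with a newline.
        // Caveat: collect_list does not preserve the original line order in general.
        Dataset<Row> wholeFiles = lines
                .groupBy("fileName")
                .agg(concat_ws("\n", collect_list(col("value"))).as("content"));

        wholeFiles.show(false);
        spark.stop();
    }
}
```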