We can call JavaSparkContext.wholeTextFiles and get a JavaPairRDD<String, String>, where the first String is the file name and the second String is the whole file contents. Is there a similar method in the Dataset API, or is my only option to load the files into a JavaPairRDD and then convert it to a Dataset? (That works, but I'm looking for a non-RDD solution.)
Answer 1:
If you want to use the Dataset API, you can use spark.read.text("path/to/files/"). Please check here for API details. Note that the text() method returns a DataFrame in which "Each line in the text files is a new row in the resulting DataFrame", so text() gives you the file content line by line. To get the file name you will have to use the input_file_name() function.
import static org.apache.spark.sql.functions.input_file_name;

Dataset<Row> ds = spark.read().text("c:\\temp")
        .withColumnRenamed("value", "content")
        .withColumn("fileName", input_file_name());
ds.show(false);
If you want to concatenate the rows from the same file, so that each row holds the whole file content, you need to group by the fileName column with the groupBy function and aggregate with the concat_ws and collect_list functions.
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat_ws;
import static org.apache.spark.sql.functions.collect_list;

ds = ds.groupBy(col("fileName"))
        .agg(concat_ws("", collect_list(ds.col("content"))).as("content"));
ds.show(false);
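As a side note, and an assumption on my part rather than something from the answer above: if I recall correctly, since Spark 2.2 the text data source also supports a wholetext option that reads each file as a single row, which would avoid the groupBy/concat step entirely. A minimal sketch, reusing the same spark session and path from the snippets above:

```java
import static org.apache.spark.sql.functions.input_file_name;

// Assumption: the "wholetext" option (Spark 2.2+) reads each file as one row,
// so no groupBy/concat_ws/collect_list aggregation is needed afterwards.
Dataset<Row> whole = spark.read()
        .option("wholetext", "true")
        .text("c:\\temp")
        .withColumnRenamed("value", "content")
        .withColumn("fileName", input_file_name());
whole.show(false);
```

If that option is available in your Spark version, it is the closest Dataset-API equivalent of wholeTextFiles.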