We can call JavaSparkContext.wholeTextFiles and get a JavaPairRDD<String, String>, where the first String is the file name and the second String is the whole file content. Is there a similar method in the Dataset API, or is my only option to load the files into a JavaPairRDD and then convert that to a Dataset? (That works, but I'm looking for a non-RDD solution.)
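For reference, a minimal sketch of the RDD detour described above, assuming a local SparkSession; the class name, path, and column names are illustrative:

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.*;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class WholeTextFilesViaRdd {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("whole-text-files-rdd")
                .master("local[*]")
                .getOrCreate();
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(spark.sparkContext());

        // (fileName, wholeFileContent) pairs, one per file.
        JavaPairRDD<String, String> files = jsc.wholeTextFiles("path/to/files/");

        // Convert the pair RDD into a two-column Dataset.
        StructType schema = new StructType()
                .add("fileName", DataTypes.StringType)
                .add("content", DataTypes.StringType);
        Dataset<Row> ds = spark.createDataFrame(
                files.map(pair -> RowFactory.create(pair._1(), pair._2())),
                schema);

        ds.show(false);
        spark.stop();
    }
}
```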
If you want to use the Dataset API, you can use spark.read.text("path/to/files/"). Please check here for API details. Note that the text() method returns a DataFrame in which "each line in the text files is a new row in the resulting DataFrame", so text() gives you the file contents line by line. To get the file name, you have to use the input_file_name() function. If you then want to concatenate the rows coming from the same file, so that you end up with the whole file content, you need to groupBy the file-name column and use the concat_ws and collect_list functions, as sketched below.
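A minimal sketch of this approach, assuming a local SparkSession; the path, app name, and the output column name "content" are illustrative, and note that collect_list does not guarantee that lines keep their original order:

```java
import static org.apache.spark.sql.functions.*;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class WholeTextFilesViaDataset {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("whole-text-files-dataset")
                .master("local[*]")
                .getOrCreate();

        // text() puts each line of every file into a row of the "value" column;
        // input_file_name() adds the fully qualified name of the source file.
        Dataset<Row> lines = spark.read()
                .text("path/to/files/")
                .withColumn("fileName", input_file_name());

        // Group the lines by file and glue them back together with a newline.
        // Caveat: collect_list does not preserve the original line order in general.
        Dataset<Row> wholeFiles = lines
                .groupBy("fileName")
                .agg(concat_ws("\n", collect_list(col("value"))).as("content"));

        wholeFiles.show(false);
        spark.stop();
    }
}
```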