I have an RDD whose elements are of type (Long, String). For some reason, I want to save the whole RDD to HDFS, and later also read that RDD back in a Spark program. Is it possible to do that? And if so, how?
It is possible.

An RDD has saveAsObjectFile and saveAsTextFile functions. Tuples are stored as (value1, value2), so you can parse them back later. Reading can be done with the textFile function from SparkContext, followed by a .map to strip the parentheses.

So: Version 1:
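A minimal sketch of the text-file round trip (the HDFS path and the parsing logic are illustrative; sc is assumed to be an existing SparkContext):

```scala
// Save as plain text: each (Long, String) tuple is written as a line like "(1,foo)"
rdd.saveAsTextFile("hdfs:///test1/")

// Later, in another program: read the lines back and parse each tuple
val newRdd = sc.textFile("hdfs:///test1/part-*").map { line =>
  val body = line.stripPrefix("(").stripSuffix(")")
  val Array(id, value) = body.split(",", 2) // limit 2 keeps commas inside the string intact
  (id.toLong, value)
}
```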
Version 2:
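And a sketch of the object-file round trip, under the same assumptions; note that here you get the tuples back out of the box, with no parsing step:

```scala
// Save with Java serialization: the tuple structure is preserved as-is
rdd.saveAsObjectFile("hdfs:///test1/")

// Later, in another program: read the serialized tuples back directly
val newRdd = sc.objectFile[(Long, String)]("hdfs:///test1/part-*")
```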
I would recommend using a DataFrame if your RDD is in tabular format. A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable and each row contains one case. A DataFrame carries additional metadata thanks to its tabular format, which allows Spark to run certain optimizations on the finalized query, whereas an RDD (Resilient Distributed Dataset) is more of a black box, a core abstraction over data that Spark cannot optimize in the same way. You can go from a DataFrame to an RDD and vice versa: an RDD in tabular format can be converted to a DataFrame via the toDF method.
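A small sketch of the two conversions (the SparkSession setup, sample data, and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("rdd-df").getOrCreate()
import spark.implicits._ // enables .toDF on RDDs of tuples

val rdd = spark.sparkContext.parallelize(Seq((1L, "foo"), (2L, "bar")))

// RDD -> DataFrame: works because the RDD is tabular (two typed columns)
val df = rdd.toDF("id", "value")

// DataFrame -> RDD (of Row objects)
val rows = df.rdd
```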
The following is an example of storing a DataFrame in CSV and Parquet format in HDFS:
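One possible version, continuing with spark and df from the sketch above (the HDFS paths are assumptions):

```scala
// Write the DataFrame to HDFS in both formats (paths are illustrative)
df.write.mode("overwrite").csv("hdfs:///test/df_csv/")
df.write.mode("overwrite").parquet("hdfs:///test/df_parquet/")

// Read them back later; Parquet preserves the schema, CSV needs it re-supplied
val fromParquet = spark.read.parquet("hdfs:///test/df_parquet/")
val fromCsv = spark.read.schema(df.schema).csv("hdfs:///test/df_csv/")
```

Parquet is usually the better fit here: it is columnar, keeps the schema with the data, and lets Spark push down filters and prune columns, whereas CSV is plain text that must be re-parsed and re-typed on every read.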