I am trying to load SEG-Y files into Spark and transform them into an RDD for MapReduce-style operations, but I have failed to convert them into an RDD. Can anyone offer help?
You've not really given much detail, but you can start with the SparkContext.binaryFiles() API:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
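As a minimal PySpark sketch (the app name and path below are placeholders for your own setup):

    from pyspark import SparkContext

    sc = SparkContext(appName="segy-ingest")  # hypothetical app name

    # binaryFiles() yields an RDD of (path, content) pairs, one per file,
    # where content is the whole file as a byte string. This suits files
    # small enough to fit in a single executor's memory.
    files = sc.binaryFiles("hdfs:///data/segy/*.sgy")  # hypothetical path

    path, content = files.first()
    print(path, len(content))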
You could use the binaryRecords() PySpark call to convert a binary file's contents into an RDD:
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords
Then you could map() that RDD into a structured form using, for example, struct.unpack():
https://docs.python.org/2/library/struct.html
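Here is a sketch of the two steps together, assuming a hypothetical fixed trace layout: a 240-byte trace header followed by 1000 4-byte samples per record. Real SEG-Y files also carry a 3600-byte file header, and often store samples as IBM floats, which struct cannot decode directly, so treat this only as an outline:

    import struct
    from pyspark import SparkContext

    sc = SparkContext(appName="segy-traces")  # hypothetical app name

    # binaryRecords() requires every record to have the same length.
    HEADER_LEN = 240   # SEG-Y trace header size
    N_SAMPLES = 1000   # assumed fixed sample count per trace
    RECORD_LEN = HEADER_LEN + N_SAMPLES * 4

    # Hypothetical path; the file is assumed to contain back-to-back
    # fixed-length trace records.
    records = sc.binaryRecords("hdfs:///data/segy/traces.dat", RECORD_LEN)

    def parse(record):
        # ">1000f" assumes big-endian IEEE floats; adjust to your format.
        return struct.unpack(">%df" % N_SAMPLES, record[HEADER_LEN:])

    traces = records.map(parse)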
We use this approach to ingest proprietary fixed-width-record binary files. There is a bit of Python code on our side that generates the format string (the first argument to struct.unpack), but if your files' layout is static, it's fairly simple to write once by hand.
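A sketch of that generation step, with a made-up field layout:

    import struct

    # Hypothetical (name, struct code) layout for a fixed-width header,
    # assumed big-endian.
    FIELDS = [
        ("trace_seq",    "i"),  # 4-byte int
        ("sample_count", "h"),  # 2-byte short
        ("sample_rate",  "h"),  # 2-byte short
    ]

    FMT = ">" + "".join(code for _, code in FIELDS)  # e.g. ">ihh"
    FMT_LEN = struct.calcsize(FMT)

    def parse_fields(buf):
        # Unpack the leading bytes of a record into a name -> value dict.
        return dict(zip([name for name, _ in FIELDS],
                        struct.unpack(FMT, buf[:FMT_LEN])))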
The same is possible in pure Scala:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@binaryRecords(path:String,recordLength:Int,conf:org.apache.hadoop.conf.Configuration):org.apache.spark.rdd.RDD[Array[Byte]]