How to load a binary file into an RDD in Spark?

Posted 2019-07-09 02:56

I am trying to load SEG-Y files into Spark and convert them into an RDD for MapReduce-style operations, but I have not managed to do the conversion. Can anyone offer help?

2 Answers
成全新的幸福 · 2019-07-09 03:55

You've not really given much detail, but you can start with the SparkContext.binaryFiles() API:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
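For illustration, a minimal PySpark sketch of binaryFiles() (the path is a placeholder, and slicing off the first 3200 bytes assumes the standard SEG-Y textual header; actual SEG-Y parsing is out of scope here):

    from pyspark import SparkContext

    sc = SparkContext(appName="segy-ingest")

    # binaryFiles() yields (path, content) pairs, one per file,
    # with the whole file as a single bytes object.
    files_rdd = sc.binaryFiles("hdfs:///data/segy/*.sgy")  # placeholder path

    # Example: grab the 3200-byte SEG-Y textual header of each file.
    headers = files_rdd.mapValues(lambda content: content[:3200])

Note that binaryFiles() loads each file whole, so it suits many small-to-medium files; for one large file of fixed-length records, binaryRecords() (below) is the better fit.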

\"骚年 ilove
3楼-- · 2019-07-09 03:56

You could use the PySpark binaryRecords() call to convert a binary file's contents into an RDD:

http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords

binaryRecords(path, recordLength)

Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.

Parameters:

- path – Directory to the input data files
- recordLength – The length at which to split the records

Then you could map() that RDD into a structured form using, for example, struct.unpack():

https://docs.python.org/2/library/struct.html
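Putting the two steps together, a hedged sketch (the path, record length, and format string are made-up values for a hypothetical fixed-width layout, not a real SEG-Y trace format):

    import struct
    from pyspark import SparkContext

    sc = SparkContext(appName="binary-records")

    RECORD_LEN = 16  # hypothetical fixed record length in bytes
    FMT = ">iiff"    # hypothetical layout: two big-endian ints, two floats
    assert struct.calcsize(FMT) == RECORD_LEN

    # Each element of the RDD is one raw record of exactly RECORD_LEN bytes.
    records = sc.binaryRecords("hdfs:///data/records.bin", RECORD_LEN)

    # Decode each record into a tuple of Python values.
    parsed = records.map(lambda rec: struct.unpack(FMT, rec))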

We use this approach to ingest proprietary fixed-width-record binary files. There is a bit of Python code that generates the format string (the first argument to struct.unpack), but if your file layout is static, it's fairly simple to write one by hand once. A rough illustration of such a generator follows.
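This sketch builds the format string from a field spec; the field names and type codes here are invented, not a real record layout:

    import struct

    # Hypothetical field spec: (name, struct type code) in file order.
    FIELDS = [
        ("trace_id", "i"),      # 4-byte int
        ("sample_count", "h"),  # 2-byte short
        ("amplitude", "f"),     # 4-byte float
    ]

    # Build the big-endian format string once, then reuse it per record.
    fmt = ">" + "".join(code for _, code in FIELDS)
    record_len = struct.calcsize(fmt)  # 10 bytes for this layout

    def parse(rec):
        """Decode one raw record into a {field name: value} dict."""
        values = struct.unpack(fmt, rec)
        return dict(zip((name for name, _ in FIELDS), values))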

The same is possible in pure Scala:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext@binaryRecords(path:String,recordLength:Int,conf:org.apache.hadoop.conf.Configuration):org.apache.spark.rdd.RDD[Array[Byte]]
