Reading binary data into a (Py)Spark DataFrame

Posted 2019-08-14 15:12

Question:

I'm ingesting a binary file into Spark. The file structure is simple: it consists of a series of records, and each record holds a number of floats. At the moment I read the data in chunks in Python and then iterate through the individual records to turn them into Row objects that Spark can use to construct a DataFrame. This is very inefficient: instead of processing the data a chunk at a time, it forces me to loop over the individual elements.
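In outline, the current procedure looks something like this (a minimal sketch; the record layout, file name, and chunk size are made up for illustration, the real code is in the notebook linked below):

```python
import struct
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

N_FLOATS = 16                    # floats per record (illustrative value)
RECORD_SIZE = 4 * N_FLOATS       # bytes per record, assuming float32
BATCH_RECORDS = 10240            # records read per chunk
FMT = "<%df" % N_FLOATS          # little-endian assumed

# Row factory with named fields, one float column per value in a record
Record = Row(*["f%d" % i for i in range(N_FLOATS)])

def read_batches(path):
    """Yield raw byte buffers, each holding up to BATCH_RECORDS records."""
    with open(path, "rb") as f:
        while True:
            buf = f.read(RECORD_SIZE * BATCH_RECORDS)
            if not buf:
                break
            yield buf

rows = []
for buf in read_batches("data.bin"):
    # The slow part: one Row per record, built in a pure-Python loop.
    for off in range(0, len(buf), RECORD_SIZE):
        rows.append(Record(*struct.unpack(FMT, buf[off:off + RECORD_SIZE])))

df = spark.createDataFrame(rows)
```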

Is there an obvious (preferred) way to ingest data like this? Ideally, I would read a chunk of the file (say 10,240 records or so) into a buffer, specify the schema, and turn it into a DataFrame directly. I don't see a way of doing this with the current API, but maybe I'm missing something?
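For comparison, SparkContext.binaryRecords reads fixed-length binary records, but as far as I can tell it still yields one record at a time, so the per-record unpacking just moves from the driver to the executors (rough sketch, with the same made-up record layout and column names as above):

```python
import struct
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

N_FLOATS = 16                    # floats per record (assumed)
RECORD_SIZE = 4 * N_FLOATS       # fixed record length in bytes (float32)
FMT = "<%df" % N_FLOATS          # little-endian assumed

# Each element of the RDD is the raw bytes of exactly one record.
records = sc.binaryRecords("data.bin", RECORD_SIZE)

# The unpacking is distributed across executors, but it is still one
# Python-level struct.unpack call per record rather than per chunk.
df = records.map(lambda rec: struct.unpack(FMT, rec)) \
            .toDF(["f%d" % i for i in range(N_FLOATS)])
```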

Here's an example notebook that demonstrates the current procedure: https://gist.github.com/rokroskar/bc0b4713214bb9b1e5ed

Ideally, I would be able to get rid of the for loop over buf in read_batches and just convert the entire batch into an array of Row objects directly.
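Something along these lines is what I have in mind: decode each chunk in one vectorized call and hand it to Spark without touching individual records in Python (a sketch that goes through pandas and Arrow, which may or may not be the preferred route; record layout assumed as above, and read_batches as sketched earlier):

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# With Arrow enabled, the pandas -> Spark conversion is columnar rather than
# row-by-row (this is the Spark 2.x config key; newer versions rename it).
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

N_FLOATS = 16                                    # floats per record (assumed)
columns = ["f%d" % i for i in range(N_FLOATS)]

chunks = []
for buf in read_batches("data.bin"):
    # Decode the whole chunk in one vectorized call -- no per-record loop.
    arr = np.frombuffer(buf, dtype="<f4").reshape(-1, N_FLOATS)
    chunks.append(spark.createDataFrame(pd.DataFrame(arr, columns=columns)))

# Stitch the per-chunk DataFrames together.
df = chunks[0]
for part in chunks[1:]:
    df = df.union(part)
```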