I'm ingesting a binary file into Spark. The file structure is simple: it consists of a series of records, and each record holds a number of floats. At the moment, I'm reading the data in chunks in Python and then iterating through the individual records to turn them into `Row` objects that Spark can use to construct a `DataFrame`. This is very inefficient because, instead of processing the data in chunks, it requires me to loop through the individual elements.
Is there an obvious (preferred) way of ingesting data like this? Ideally, I would be able to read a chunk of the file (say 10240 records or so) into a buffer, specify the schema, and turn this into a `DataFrame` directly. I don't see a way of doing this with the current API, but maybe I'm missing something?
Here's an example notebook that demonstrates the current procedure: https://gist.github.com/rokroskar/bc0b4713214bb9b1e5ed
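The core of it follows this pattern (a simplified sketch rather than the exact notebook code; the record layout, `nfloats`, and the field names are stand-ins, and `sc` is an existing SparkContext):

```python
import struct
from pyspark.sql import Row, SQLContext

# Sketch assumptions: each record is `nfloats` 32-bit floats and `sc` already
# exists (as in the notebook); the real code differs in details.
nfloats = 10
record = struct.Struct('%df' % nfloats)
batch_size = 10240                                      # records read per chunk
FloatRecord = Row(*['f%d' % i for i in range(nfloats)])

def read_batches(filename):
    """Read the file chunk by chunk, but build Row objects one record at a time."""
    with open(filename, 'rb') as f:
        while True:
            buf = f.read(record.size * batch_size)
            if not buf:
                break
            # This per-record loop is the expensive part.
            for offset in range(0, len(buf), record.size):
                yield FloatRecord(*record.unpack_from(buf, offset))

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(list(read_batches('data.bin')))
```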
Ideally, I would be able to get rid of the for loop over `buf` in `read_batches` and just convert the entire batch into an array of `Row` objects directly.
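To make the idea concrete, here is roughly what I'm hoping to be able to write; the numpy parsing of a whole chunk works today, while the final `createDataFrame` call on the raw parsed buffer is the hypothetical piece I can't find in the API:

```python
import numpy as np

# Hypothetical ideal (names match the sketch above): parse an entire chunk at
# once with numpy instead of unpacking it record by record in Python.
nfloats = 10
batch_size = 10240
dtype = np.dtype([('f%d' % i, 'f4') for i in range(nfloats)])

with open('data.bin', 'rb') as f:
    buf = f.read(dtype.itemsize * batch_size)

records = np.frombuffer(buf, dtype=dtype)            # whole chunk in one call
# df = sqlContext.createDataFrame(records, schema)   # <-- the call I wish existed:
#                                                    #     no per-record Python loop
```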