I'm working with pandas and with Spark dataframes. The dataframes are always very big (> 20 GB) and the standard Spark functions are not sufficient for those sizes. Currently I'm converting my pandas dataframe to a Spark dataframe like this:
dataframe = spark.createDataFrame(pandas_dataframe)
I do that transformation because with spark writing dataframes to hdfs is very easy:
dataframe.write.parquet(output_uri, mode="overwrite", compression="snappy")
But the transformation is failing for dataframes which are bigger than 2 GB. If I transform a spark dataframe to pandas I can use pyarrow:
import pyarrow as pa
import pyarrow.parquet as pq

# temporarily write the spark dataframe to hdfs
dataframe.write.parquet(path, mode="overwrite", compression="snappy")
# open hdfs connection using pyarrow (pa)
hdfs = pa.hdfs.connect("default", 0)
# read parquet (pyarrow.parquet (pq))
parquet = pq.ParquetDataset(path_hdfs, filesystem=hdfs)
table = parquet.read(nthreads=4)
# transform table to pandas (named so it does not shadow the pandas module)
pandas_dataframe = table.to_pandas(nthreads=4)
# delete temp files
hdfs.delete(path, recursive=True)
This is a fast conversion from Spark to pandas and it also works for dataframes bigger than 2 GB. I have not yet found a way to do it the other way around, meaning taking a pandas dataframe and transforming it to Spark with the help of pyarrow. The problem is that I really can't find how to write a pandas dataframe to HDFS.
My pandas version: 0.19.0
According to https://issues.apache.org/jira/browse/SPARK-6235, the issue is resolved.
According to https://pandas.pydata.org/pandas-docs/stable/r_interface.html, you can convert a pandas dataframe to an R data.frame.
So perhaps the transformation pandas -> R -> Spark -> hdfs?
A hack could be to split the big pandas dataframe into N smaller ones (each less than 2 GB, i.e. horizontal partitioning), create N different Spark dataframes from them, and then merge (union) those into a final one to write to HDFS. I am assuming that your master machine is powerful but that you also have a cluster available on which you are running Spark.
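A minimal sketch of that split-and-union idea, under stated assumptions: `chunk_bounds` and `pandas_to_spark_in_chunks` are hypothetical helper names, `spark` is an existing SparkSession, and `rows_per_chunk` has to be picked by hand so that each slice stays under the 2 GB limit. Passing an explicit `schema` is probably safer than relying on inference, since inference runs separately for every slice.

```python
from functools import reduce


def chunk_bounds(n_rows, rows_per_chunk):
    """Return (start, stop) row ranges that cover n_rows in order."""
    if rows_per_chunk <= 0:
        raise ValueError("rows_per_chunk must be positive")
    return [(start, min(start + rows_per_chunk, n_rows))
            for start in range(0, n_rows, rows_per_chunk)]


def pandas_to_spark_in_chunks(spark, pandas_dataframe, rows_per_chunk, schema=None):
    """Convert each row slice separately, then union the Spark dataframes."""
    parts = [
        # iloc[start:stop] selects rows by position, so no row is dropped
        # or duplicated between neighbouring slices.
        spark.createDataFrame(pandas_dataframe.iloc[start:stop], schema=schema)
        for start, stop in chunk_bounds(len(pandas_dataframe), rows_per_chunk)
    ]
    if not parts:
        raise ValueError("cannot convert an empty dataframe")
    # DataFrame.union appends rows by column position (Spark 2.0+).
    return reduce(lambda left, right: left.union(right), parts)
```

The result could then be written as in the question, e.g. `pandas_to_spark_in_chunks(spark, pandas_dataframe, 1000000).write.parquet(output_uri, mode="overwrite", compression="snappy")`, where the row count is only a placeholder.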
pyarrow.Table.from_pandas is the function you're looking for. The result can be written directly to Parquet / HDFS without passing data via Spark.
See also the pyarrow documentation.
Spark notes:
Furthermore, since Spark 2.3 (current master), Arrow is supported directly in createDataFrame (SPARK-20791 - Use Apache Arrow to Improve Spark createDataFrame from Pandas.DataFrame). It uses SparkContext.defaultParallelism to compute the number of chunks, so you can easily control the size of individual batches.
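As a rough illustration of the Arrow path, here is a sketch under some assumptions: `spark` is an existing SparkSession, the helper names are hypothetical, and `rows_per_slice` only approximates the batch size by assuming the dataframe is cut into equal row slices, one per unit of defaultParallelism.

```python
import math

# Config key used by Spark 2.3; Spark 3.x renames it to
# "spark.sql.execution.arrow.pyspark.enabled".
ARROW_ENABLED_KEY = "spark.sql.execution.arrow.enabled"


def rows_per_slice(n_rows, default_parallelism):
    """Approximate rows per Arrow batch, assuming equal-sized slices."""
    return math.ceil(n_rows / default_parallelism)


def create_dataframe_with_arrow(spark, pandas_dataframe, key=ARROW_ENABLED_KEY):
    """Enable the Arrow conversion path on this session, then convert."""
    spark.conf.set(key, "true")
    return spark.createDataFrame(pandas_dataframe)
```

For example, `rows_per_slice(len(pandas_dataframe), spark.sparkContext.defaultParallelism)` gives an estimate of how many rows end up in each batch, so raising the parallelism shrinks the batches.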
defaultParallelism can also be used to control the number of partitions generated using the standard _convert_from_pandas, effectively reducing the size of the slices to something more manageable.
Unfortunately, these are unlikely to resolve your current memory problems. Both depend on parallelize and therefore store all data in the memory of the driver node. Switching to Arrow or adjusting the configuration can only speed up the process or address block size limitations.
In practice I don't see any reason to switch to Spark here, as long as you use a local pandas DataFrame as the input. The most severe bottleneck in this scenario is the driver's network I/O, and distributing the data won't address that.