Spark converting Pandas df to S3

Posted 2019-09-16 11:55

Currently I am using Spark along with the Pandas framework. How can I conveniently convert a Pandas DataFrame so that it can be written to S3?

I have tried the option below, but I get an error because df is a Pandas DataFrame and it has no write attribute.

df.write \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .save("123.csv")

1 answer
叼着烟拽天下
#2 · 2019-09-16 12:18

As you are running this in Spark, one approach would be to convert the Pandas DataFrame into a Spark DataFrame and then save this to S3.

The code snippet below creates the pdf Pandas DataFrame and converts it into the df Spark DataFrame.

import numpy as np
import pandas as pd

# Create Pandas DataFrame
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
pdf = pd.DataFrame(d)
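
# Note: an existing SparkSession bound to the name `spark` is assumed below
# (it is predefined in the pyspark shell and in Databricks notebooks)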

# Convert Pandas DataFrame to Spark DataFrame
df = spark.createDataFrame(pdf)
df.printSchema()

To validate the conversion, the printSchema() call above prints the schema of the Spark DataFrame, producing the output below.

root
 |-- one: double (nullable = true)
 |-- two: double (nullable = true)

Now that it is a Spark DataFrame, you can use the spark-csv package to save it as a CSV file, as in the example below.

# Save Spark DataFrame as CSV (note: '123.csv' is a local/relative path)
df.write.format('com.databricks.spark.csv').options(header='true').save('123.csv')
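
Note that '123.csv' above is a local/relative path. To write directly to S3, point the save path at an S3 URI instead; the bucket and prefix below are placeholders, and this assumes the cluster has the Hadoop S3A connector (or an equivalent S3 filesystem) configured. On Spark 2.0+ the built-in csv format can be used in place of the spark-csv package:

# Write the Spark DataFrame to S3 as CSV
# 's3a://my-bucket/output/pdf-export/' is a hypothetical bucket/prefix
df.write \
    .format('csv') \
    .option('header', 'true') \
    .save('s3a://my-bucket/output/pdf-export/')

Keep in mind that Spark writes a directory of part files at that path rather than a single CSV file.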