How to write the resulting RDD to a csv file in Sp

I have a resulting RDD labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions). This has output in this format:

[(0.0, 0.08482142857142858), (0.0, 0.11442786069651742),.....]

What I want is to create a CSV file with one column for labels (the first part of the tuple in above output) and one for predictions(second part of tuple output). But I don't know how to write to a CSV file in Spark using Python.

How can I create a CSV file with the above output?

标签： python csv apache-spark pyspark file-writing

3条回答

孤傲高冷的网名

2楼-- · 2019-01-11 04:04

It's not good to just join by commas because if fields contain commas, they won't be properly quoted, e.g. ','.join(['a', 'b', '1,2,3', 'c']) gives you a,b,1,2,3,c when you'd want a,b,"1,2,3",c. Instead, you should use Python's csv module to convert each list in the RDD to a properly-formatted csv string:

# python 3
import csv, io

def list_to_csv_str(x):
    """Given a list of strings, returns a properly-csv-formatted string."""
    output = io.StringIO("")
    csv.writer(output).writerow(x)
    return output.getvalue().strip() # remove extra newline

# ... do stuff with your rdd ...
rdd = rdd.map(list_to_csv_str)
rdd.saveAsTextFile("output_directory")

Since the csv module only writes to file objects, we have to create an empty "file" with io.StringIO("") and tell the csv.writer to write the csv-formatted string into it. Then, we use output.getvalue() to get the string we just wrote to the "file". To make this code work with Python 2, just replace io with the StringIO module.

If you're using the Spark DataFrames API, you can also look into the DataBricks save function, which has a csv format.

0人赞添加讨论(0) 举报

放我归山

3楼-- · 2019-01-11 04:14

Just map the lines of the RDD (labelsAndPredictions) into strings (the lines of the CSV) then use rdd.saveAsTextFile().

def toCSVLine(data):
  return ','.join(str(d) for d in data)

lines = labelsAndPredictions.map(toCSVLine)
lines.saveAsTextFile('hdfs://my-node:9000/tmp/labels-and-predictions.csv')

0人赞添加讨论(0) 举报

等我变得足够好

4楼-- · 2019-01-11 04:24

I know this is an old post. But to help someone searching for the same, here's how I write a two column RDD to a single CSV file in PySpark 1.6.2

The RDD:

>>> rdd.take(5)
[(73342, u'cells'), (62861, u'cell'), (61714, u'studies'), (61377, u'aim'), (60168, u'clinical')]

Now the code:

# First I convert the RDD to dataframe
from pyspark import SparkContext
df = sqlContext.createDataFrame(rdd, ['count', 'word'])

The DF:

>>> df.show()
+-----+-----------+
|count|       word|
+-----+-----------+
|73342|      cells|
|62861|       cell|
|61714|    studies|
|61377|        aim|
|60168|   clinical|
|59275|          2|
|59221|          1|
|58274|       data|
|58087|development|
|56579|     cancer|
|50243|    disease|
|49817|   provided|
|49216|   specific|
|48857|     health|
|48536|      study|
|47827|    project|
|45573|description|
|45455|  applicant|
|44739|    program|
|44522|   patients|
+-----+-----------+
only showing top 20 rows

Now write to CSV

# Write CSV (I have HDFS storage)
df.coalesce(1).write.format('com.databricks.spark.csv').options(header='true').save('file:///home/username/csv_out')

P.S: I am just a beginner learning from posts here in Stackoverflow. So I don't know whether this is the best way. But it worked for me and I hope it will help someone!

0人赞添加讨论(0) 举报

How to write the resulting RDD to a csv file in Sp

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间