I'm new to Spark and I'm trying to read CSV data from a file with Spark. Here's what I am doing:
sc.textFile('file.csv')
.map(lambda line: (line.split(',')[0], line.split(',')[1]))
.collect()
I would expect this call to give me a list of the first two columns of my file, but I'm getting this error:
File "<ipython-input-60-73ea98550983>", line 1, in <lambda>
IndexError: list index out of range
although my CSV file has more than one column.
If your csv data happens to not contain newlines in any of the fields, you can load your data with textFile() and parse it.

Are you sure that all the lines have at least 2 columns? Can you try something like this, just to check?
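A sketch of such a check: filter out rows with fewer than two fields before indexing into them. The Spark pipeline is shown commented (it assumes an existing SparkContext named sc); the same logic is mirrored on plain lists so the idea is self-contained.

```python
# With Spark (sc assumed to be an existing SparkContext):
# sc.textFile('file.csv') \
#   .map(lambda line: line.split(',')) \
#   .filter(lambda line: len(line) > 1) \
#   .map(lambda line: (line[0], line[1])) \
#   .collect()

# The same logic on plain lists, for illustration:
lines = ['a,b,c', 'lonely', 'd,e']
rows = [line.split(',') for line in lines]
pairs = [(r[0], r[1]) for r in rows if len(r) > 1]
# pairs == [('a', 'b'), ('d', 'e')] -- the one-column row is skipped
```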
Alternatively, you could print the culprit (if any):
Now, there's also another option for any general csv file: https://github.com/seahboonsiew/pyspark-csv, as follows:
Assume we have the following context. First, distribute pyspark-csv.py to the executors using SparkContext. Then read the csv data via SparkContext and convert it to a DataFrame:
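A sketch of those steps, based on the pyspark-csv project's README (the module name pyspark_csv and the csvToDataFrame function are that project's; the file path is a placeholder, and sc / sqlCtx are assumed to be an existing SparkContext and SQLContext):

```python
# Assumed context, created elsewhere:
# sc = SparkContext(...)
# sqlCtx = SQLContext(sc)

# First, distribute pyspark-csv.py to executors using SparkContext
sc.addPyFile('pyspark_csv.py')
import pyspark_csv as pycsv

# Read csv data via SparkContext and convert it to a DataFrame
plaintext_rdd = sc.textFile('file.csv')
dataframe = pycsv.csvToDataFrame(sqlCtx, plaintext_rdd)
```

This needs a running Spark context, so it is shown unexecuted.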
Simply splitting by comma will also split commas that are within fields (e.g. a,b,"1,2,3",c), so it's not recommended. zero323's answer is good if you want to use the DataFrames API, but if you want to stick to base Spark, you can parse csvs in base Python with the csv module.

EDIT: As @muon mentioned in the comments, this will treat the header like any other row, so you'll need to extract it manually. For example,
header = rdd.first(); rdd = rdd.filter(lambda x: x != header)
(make sure not to modify header before the filter evaluates). But at this point, you're probably better off using a built-in csv parser.

This is in line with what JP Mercier initially suggested about using Pandas, but with a major modification: if you read the data into Pandas in chunks, it is more malleable. This means that you can parse a much larger file than Pandas could actually handle as a single piece, and pass it to Spark in smaller pieces. (This also answers the comment about why one would want to use Spark if they can load everything into Pandas anyway.)
If you want to load the csv as a dataframe, then you can do the following:
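One way this is commonly done (a sketch; sqlContext / spark are assumed to exist, and the first variant additionally requires the com.databricks:spark-csv package on Spark 1.x):

```python
# Spark 1.x, with the spark-csv package:
df = (sqlContext.read
      .format('com.databricks.spark.csv')
      .options(header='true', inferSchema='true')
      .load('file.csv'))

# Spark 2.x and later have a built-in csv reader:
# df = spark.read.csv('file.csv', header=True, inferSchema=True)
```

Both readers need a running Spark session, so they are shown unexecuted.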
It worked fine for me.