I am working with Python on Spark and reading my dataset from a .csv file whose first few rows are:
17 0.2 7
17 0.2 7
39 1.3 7
19 1 7
19 0 7
When I read from the file line by line with the code below:
from pyspark.mllib.regression import LabeledPoint

# Load and parse the data
def parsePoint(line):
    # first value is the label, the rest are the features
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])
I get this error:
Traceback (most recent call last):
File "<stdin>", line 3, in parsePoint
ValueError: could not convert string to float: "17"
Any help is greatly appreciated.
Following the comments below this answer, you should use:
[float(x.strip(' "')) for x in line.split(',')]
You do not need to replace ',' with ' '; simply split on ',' and then remove leading and trailing whitespace and quotes (x.strip(' "')) before converting to float.
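For example, assuming the file actually contains quoted, comma-separated fields (which the quotes in your ValueError suggest), this gives:

>>> line = '"17","0.2","7"'
>>> [float(x.strip(' "')) for x in line.split(',')]
[17.0, 0.2, 7.0]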
Also, have a look at the csv module, which may simplify your work.
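As a minimal sketch (assuming comma-separated, possibly quoted fields), the csv module takes care of the splitting and quote handling for you:

import csv
from pyspark.mllib.regression import LabeledPoint

def parsePoint(line):
    # csv.reader expects an iterable of lines; wrap the single line in a list.
    # skipinitialspace=True also drops spaces that follow the commas.
    row = next(csv.reader([line], skipinitialspace=True))
    return LabeledPoint(float(row[0]), [float(x) for x in row[1:]])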
Below is the answer to the original question (before comments).
You need to use .split() instead of .split(' '). You have multiple consecutive space characters in your line, so splitting on ' ' results in empty strings, e.g. your first line is split into:
['17', '', '0.2', '', '7']
The problem is those empty strings, which (obviously) cannot be converted to float.
Using split() will solve the problem, thanks to the behaviour of split when its sep argument is None (or not present):
If the optional second argument sep is absent or None, the words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed).
See the documentation of str.split, and a small example to understand the difference:
>>> sp5 = ' ' * 5
>>> sp5.split()
[]
>>> sp5.split(' ')
['', '', '', '', '', '']
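Applied to your first line (which, judging by the split result above, has two spaces between fields), plain split() yields exactly the tokens you want:

>>> '17  0.2  7'.split()
['17', '0.2', '7']
>>> [float(x) for x in '17  0.2  7'.split()]
[17.0, 0.2, 7.0]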