Python (pyspark) Error = ValueError: could not convert string to float

Published 2019-07-30 08:28

Question:

I am working with Python on Spark and reading my dataset from a .csv file whose first few rows are:

17  0.2  7
17  0.2  7
39  1.3  7
19   1   7
19   0   7

When I read from the file line by line with the code below:

# Load and parse the data
from pyspark.mllib.regression import LabeledPoint

def parsePoint(line):
    values = [float(x) for x in line.replace(',', ' ').split(' ')]
    return LabeledPoint(values[0], values[1:])

I get this error:

Traceback (most recent call last):
  File "<stdin>", line 3, in parsePoint
ValueError: could not convert string to float: "17"

Any help is greatly appreciated.

Answer 1:

Following the comments below this answer, you should use:

[float(x.strip(' "')) for x in line.split(',')]

You do not need to replace ',' with ' '; simply split on ',' and then remove leading and trailing whitespace and quotes (x.strip(' "')) before converting to float.
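As a small sketch of that approach (the sample lines here are assumptions, chosen to resemble the error message with its quoted "17"):

```python
# Split on commas, then strip surrounding spaces and quotes
# from each field before converting to float.
def parse_line(line):
    return [float(x.strip(' "')) for x in line.split(',')]

print(parse_line('17, 0.2, 7'))        # plain fields with spaces
print(parse_line('"17", "0.2", "7"'))  # quoted fields
```

Both calls return [17.0, 0.2, 7.0], since strip(' "') removes any mix of spaces and double quotes from both ends of each field.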

Also, have a look at the csv module, which may simplify your work.
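For instance, a minimal sketch with the standard csv module (the sample data is assumed; skipinitialspace drops the space that follows each comma):

```python
import csv
import io

# Parse comma-separated text with the standard library.
data = '17, 0.2, 7\n39, 1.3, 7\n'
reader = csv.reader(io.StringIO(data), skipinitialspace=True)
rows = [[float(x) for x in row] for row in reader]
print(rows)  # [[17.0, 0.2, 7.0], [39.0, 1.3, 7.0]]
```

csv.reader also handles quoted fields for you, which a hand-rolled split does not.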


Below is the answer to the original question (before comments).

You need to use .split() instead of .split(' '). You have multiple consecutive space characters in your line, so splitting on ' ' results in empty strings, e.g. your first line is split into:

['17', '', '0.2', '', '7']

The problem is those empty strings, which (obviously) cannot be converted to float.

Using split() will solve the problem thanks to the behaviour of split when its sep argument is None (or not present):

If the optional second argument sep is absent or None, the words are separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed).
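Concretely, on the first sample line from the question (whitespace-separated), a short sketch of the difference:

```python
line = '17  0.2  7'

# Splitting on a single space leaves empty strings between
# consecutive spaces; float('') raises ValueError.
print(line.split(' '))   # ['17', '', '0.2', '', '7']

# split() with no argument collapses runs of whitespace.
print(line.split())      # ['17', '0.2', '7']
print([float(x) for x in line.split()])  # [17.0, 0.2, 7.0]
```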

See the documentation of str.split, and a small example to understand the difference:

>>> sp5 = ' ' * 5
>>> sp5.split()
[]
>>> sp5.split(' ')
['', '', '', '', '', '']