pySpark (v2.4) DataFrameReader adds leading whites

2019-08-21 04:01发布

问题:

Here is a snippet of a CSV file that I have:

"Index", "Living Space (sq ft)", "Beds", "Baths", "Zip", "Year", "List Price ($)"
 1,       2222,                   3,      3.5,    32312, 1981,    250000
 2,       1628,                   3,      2,      32308, 2009,    185000
 3,       3824,                   5,      4,      32312, 1954,    399000
 4,       1137,                   3,      2,      32309, 1993,    150000
 5,       3560,                   6,      4,      32309, 1973,    315000

Oddly, when I perform the following pySpark (v2.4) statements, the header column names (minus the first column) have leading whitespaces. I've tried different quote and escape options, but to no avail.

Does anyone know why this is happening and how to strip the extra whitespaces on load? Thank you in advance!

>>> csv_file = '/tmp/file.csv'

>>> spark_reader.format('csv')

>>> spark_reader.option("inferSchema", "true")
>>> spark_reader.option("header", "true")
>>> spark_reader.option("quote", '"')

>>> df = spark_reader.load(csv_file)

>>> df.columns
['Index', ' "Living Space (sq ft)"', ' "Beds"', ' "Baths"', ' "Zip"', ' "Year"', ' "List Price ($)"']

回答1:

From the docs for pyspark.sql.DataFrameReader, you can use the ignoreLeadingWhiteSpace parameter.

ignoreLeadingWhiteSpace – A flag indicating whether or not leading whitespaces from values being read should be skipped. If None is set, it uses the default value, false.

In your case, you just need to add:

spark_reader.option("ignoreLeadingWhiteSpace", "true")