Here is a snippet of a CSV file that I have:
"Index", "Living Space (sq ft)", "Beds", "Baths", "Zip", "Year", "List Price ($)"
1, 2222, 3, 3.5, 32312, 1981, 250000
2, 1628, 3, 2, 32308, 2009, 185000
3, 3824, 5, 4, 32312, 1954, 399000
4, 1137, 3, 2, 32309, 1993, 150000
5, 3560, 6, 4, 32309, 1973, 315000
Oddly, when I run the following PySpark (v2.4) statements, every header column name except the first comes back with a leading space and its double quotes still attached. I've tried different quote and escape options, but to no avail.
Does anyone know why this is happening and how to strip the extra whitespace on load? Thank you in advance!
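>>> spark_reader = spark.read  # assumes an active SparkSession named spark, as in the pyspark shell
>>> csv_file = '/tmp/file.csv'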
>>> spark_reader.format('csv')
>>> spark_reader.option("inferSchema", "true")
>>> spark_reader.option("header", "true")
>>> spark_reader.option("quote", '"')
>>> df = spark_reader.load(csv_file)
>>> df.columns
['Index', ' "Living Space (sq ft)"', ' "Beds"', ' "Baths"', ' "Zip"', ' "Year"', ' "List Price ($)"']
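A likely cause, sketched here with Python's standard csv module rather than Spark: a quote character only opens a quoted field when it is the first character of that field. Every comma in the file is followed by a space, so each field after the first starts with a space and its quotes are kept as literal text. The csv module's skipinitialspace flag drops that space. Spark's CSV reader has an analogous ignoreLeadingWhiteSpace option, set with .option("ignoreLeadingWhiteSpace", "true"), which should behave similarly, though that is an assumption not verified against this file. The sample string and the read_rows helper below are illustrative names only; sample reproduces the first two lines from the question.

```python
import csv
import io

# First two lines of the file from the question (each comma is followed by a space).
sample = (
    '"Index", "Living Space (sq ft)", "Beds", "Baths", "Zip", "Year", "List Price ($)"\n'
    '1, 2222, 3, 3.5, 32312, 1981, 250000\n'
)


def read_rows(text, skip_space):
    """Parse CSV text, optionally skipping the space that follows each delimiter."""
    return list(csv.reader(io.StringIO(text), skipinitialspace=skip_space))


# Default parsing: the space after each comma becomes part of the field,
# so the quote is no longer the first character and is kept literally.
raw_header = read_rows(sample, skip_space=False)[0]

# Skipping the initial space lets the parser recognise and remove the quotes.
clean_header = read_rows(sample, skip_space=True)[0]

print(raw_header)
print(clean_header)
```

The first list should show the same leading space and literal quotes as df.columns above, and the second should show the bare column names.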