pySpark (v2.4) DataFrameReader adds leading whites

2019-08-21 04:01发布

问题:

Here is a snippet of a CSV file that I have:

"Index", "Living Space (sq ft)", "Beds", "Baths", "Zip", "Year", "List Price ($)"
 1,       2222,                   3,      3.5,    32312, 1981,    250000
 2,       1628,                   3,      2,      32308, 2009,    185000
 3,       3824,                   5,      4,      32312, 1954,    399000
 4,       1137,                   3,      2,      32309, 1993,    150000
 5,       3560,                   6,      4,      32309, 1973,    315000

Oddly, when I perform the following pySpark (v2.4) statements, the header column names (minus the first column) have leading whitespaces. I've tried different quote and escape options, but to no avail.

Does anyone know why this is happening and how to strip the extra whitespaces on load? Thank you in advance!

>>> csv_file = '/tmp/file.csv'

>>> spark_reader.format('csv')

>>> spark_reader.option("inferSchema", "true")
>>> spark_reader.option("header", "true")
>>> spark_reader.option("quote", '"')

>>> df = spark_reader.load(csv_file)

>>> df.columns
['Index', ' "Living Space (sq ft)"', ' "Beds"', ' "Baths"', ' "Zip"', ' "Year"', ' "List Price ($)"']

回答1:

From the docs for pyspark.sql.DataFrameReader, you can use the ignoreLeadingWhiteSpace parameter.

ignoreLeadingWhiteSpace – A flag indicating whether or not leading whitespaces from values being read should be skipped. If None is set, it uses the default value, false.

In your case, you just need to add:

spark_reader.option("ignoreLeadingWhiteSpace", "true")

pySpark (v2.4) DataFrameReader adds leading whites

问题:

回答1:

收藏的人(0)

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮