I have the following file named 'data.csv':
1997,Ford,E350
1997, Ford , E350
1997,Ford,E350,"Super, luxurious truck"
1997,Ford,E350,"Super ""luxurious"" truck"
1997,Ford,E350," Super luxurious truck "
"1997",Ford,E350
1997,Ford,E350
2000,Mercury,Cougar
And I would like to parse it into a pandas DataFrame so that the DataFrame looks as follows:
Year Make Model Description
0 1997 Ford E350 None
1 1997 Ford E350 None
2 1997 Ford E350 Super, luxurious truck
3 1997 Ford E350 Super "luxurious" truck
4 1997 Ford E350 Super luxurious truck
5 1997 Ford E350 None
6 1997 Ford E350 None
7 2000 Mercury Cougar None
The best I could do was:
pd.read_table("data.csv", sep=r',', names=["Year", "Make", "Model", "Description"])
Which gets me:
Year Make Model Description
0 1997 Ford E350 None
1 1997 Ford E350 None
2 1997 Ford E350 Super, luxurious truck
3 1997 Ford E350 Super "luxurious" truck
4 1997 Ford E350 Super luxurious truck
5 1997 Ford E350 None
6 1997 Ford E350 None
7 2000 Mercury Cougar None
How can I get the DataFrame without those whitespaces?
I don't have enough reputation to leave a comment, but the answer above suggesting using the
map
function along withstrip
won't work if you have NaN values, since strip only works on chars and NaN are floats.There is a built-in pandas function to do this, which I used:
pd.core.strings.str_strip(df['Description'])
where
df
is your dataframe. In my case I used it on a dataframe with ~1.2 million rows and it was very fast.Well, the whitespace is in your data, so you can't read in the data without reading in the whitespace. However, after you've read it in, you could strip out the whitespace by doing, e.g.,
df["Make"] = df["Make"].map(str.strip)
(wheredf
is your dataframe).I don't believe Pandas supported this at the time this question was posted but the the most straight forward way to do this is by using regex in the
sep
parameter ofread_csv
. So something like the following should work for this issue.The str.strip() function works really well on Series. Thus, I convert the dataframe column that contains the whitespaces into a Series, strip the whitespace using the str.strip() function and then replace the converted column back into the dataframe. Below is the example code.
You could use converters:
yields
Here's a function to iterate through each column and apply
pd.core.strings.str_strip
: