How can I remove extra whitespace from strings whe

I have the following file named 'data.csv':

    1997,Ford,E350
    1997, Ford , E350
    1997,Ford,E350,"Super, luxurious truck"
    1997,Ford,E350,"Super ""luxurious"" truck"
    1997,Ford,E350," Super luxurious truck "
    "1997",Ford,E350
    1997,Ford,E350
    2000,Mercury,Cougar

And I would like to parse it into a pandas DataFrame so that the DataFrame looks as follows:

       Year     Make   Model              Description
    0  1997     Ford    E350                     None
    1  1997     Ford    E350                     None
    2  1997     Ford    E350   Super, luxurious truck
    3  1997     Ford    E350  Super "luxurious" truck
    4  1997     Ford    E350    Super luxurious truck
    5  1997     Ford    E350                     None
    6  1997     Ford    E350                     None
    7  2000  Mercury  Cougar                     None

The best I could do was:

    pd.read_table("data.csv", sep=r',', names=["Year", "Make", "Model", "Description"])

Which gets me:

    Year     Make   Model              Description
 0  1997     Ford    E350                     None
 1  1997    Ford     E350                     None
 2  1997     Ford    E350   Super, luxurious truck
 3  1997     Ford    E350  Super "luxurious" truck
 4  1997     Ford    E350   Super luxurious truck 
 5  1997     Ford    E350                     None
 6  1997     Ford    E350                     None
 7  2000  Mercury  Cougar                     None

How can I get the DataFrame without those whitespaces?

标签： python parsing pandas

7条回答

别忘想泡老子

2楼-- · 2019-01-16 08:35

I don't have enough reputation to leave a comment, but the answer above suggesting using the map function along with strip won't work if you have NaN values, since strip only works on chars and NaN are floats.

There is a built-in pandas function to do this, which I used: pd.core.strings.str_strip(df['Description'])
where df is your dataframe. In my case I used it on a dataframe with ~1.2 million rows and it was very fast.

0人赞添加讨论(0) 举报

Luminary・发光体

3楼-- · 2019-01-16 08:48

Well, the whitespace is in your data, so you can't read in the data without reading in the whitespace. However, after you've read it in, you could strip out the whitespace by doing, e.g., df["Make"] = df["Make"].map(str.strip) (where df is your dataframe).

0人赞添加讨论(0) 举报

干净又极端

4楼-- · 2019-01-16 08:48

I don't believe Pandas supported this at the time this question was posted but the the most straight forward way to do this is by using regex in the sep parameter of read_csv. So something like the following should work for this issue.

table = pd.read_table("data.csv", sep=' *, *')

0人赞添加讨论(0) 举报

爷、活的狠高调

5楼-- · 2019-01-16 08:48

The str.strip() function works really well on Series. Thus, I convert the dataframe column that contains the whitespaces into a Series, strip the whitespace using the str.strip() function and then replace the converted column back into the dataframe. Below is the example code.

import pandas as pd
data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
new = pd.Series([])
new = data['values'].str.strip()
data['values'] = new

0人赞添加讨论(0) 举报

SAY GOODBYE

6楼-- · 2019-01-16 08:56

You could use converters:

import pandas as pd

def strip(text):
    try:
        return text.strip()
    except AttributeError:
        return text

def make_int(text):
    return int(text.strip('" '))

table = pd.read_table("data.csv", sep=r',',
                      names=["Year", "Make", "Model", "Description"],
                      converters = {'Description' : strip,
                                    'Model' : strip,
                                    'Make' : strip,
                                    'Year' : make_int})
print(table)

yields

   Year     Make   Model              Description
0  1997     Ford    E350                     None
1  1997     Ford    E350                     None
2  1997     Ford    E350   Super, luxurious truck
3  1997     Ford    E350  Super "luxurious" truck
4  1997     Ford    E350    Super luxurious truck
5  1997     Ford    E350                     None
6  1997     Ford    E350                     None
7  2000  Mercury  Cougar                     None

0人赞添加讨论(0) 举报

放我归山

7楼-- · 2019-01-16 09:01

Here's a function to iterate through each column and apply pd.core.strings.str_strip:

def df_strip(df):
  df = df.copy()
  for c in df.columns:
    if df[c].dtype == np.object:
      df[c] = pd.core.strings.str_strip(df[c])
    df = df.rename(columns={c:c.strip()})
  return df

0人赞添加讨论(0) 举报

1 2 下一页

How can I remove extra whitespace from strings whe

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间