pandas split list into columns with regex

Posted 2020-04-04 05:27

Question:

I have a string list:

content
01/09/15, 10:07 - message1
01/09/15, 10:32 - message2
01/09/15, 10:44 - message3

I want a data frame, like:

     date                message
01/09/15, 10:07          message1
01/09/15, 10:32          message2
01/09/15, 10:44          message3

Since every string in the list starts in that format, I could simply split on " - ", but I would rather find a smarter way to do it.

history = pd.DataFrame([line.split(" - ", 1) for line in content], columns=['date', 'message'])

(I'll convert the date column to datetime afterwards.)

Any help would be appreciated.

Answer 1:

You can use str.extract; named groups in the pattern become the column names.

In [5827]: df['content'].str.extract(r'(?P<date>[\s\S]+) - (?P<message>[\s\S]+)',
                                     expand=True)
Out[5827]:
              date   message
0  01/09/15, 10:07  message1
1  01/09/15, 10:32  message2
2  01/09/15, 10:44  message3

Details

In [5828]: df
Out[5828]:
                      content
0  01/09/15, 10:07 - message1
1  01/09/15, 10:32 - message2
2  01/09/15, 10:44 - message3
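
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It is not the exact session above: the third sample line is a hypothetical message that contains " - " itself, and the date group is made lazy (`+?`) on the assumption that the split should happen at the first " - " rather than the last.

```python
import pandas as pd

content = [
    "01/09/15, 10:07 - message1",
    "01/09/15, 10:32 - message2",
    "01/09/15, 10:44 - a message - with a dash",  # hypothetical edge case
]
df = pd.DataFrame({"content": content})

# Raw string so that \s reaches the regex engine unchanged.
# The lazy date group stops at the first " - "; the message group takes the rest.
pattern = r"(?P<date>[\s\S]+?) - (?P<message>[\s\S]+)"
history = df["content"].str.extract(pattern, expand=True)
print(history)
```

With the greedy pattern the third row would instead be split at the last " - ", so part of the message would end up in the date column.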


Answer 2:

Use str.split with the regex \s+-\s+, where \s+ matches one or more whitespace characters:

df[['date','message']] = df['content'].str.split(r'\s+-\s+', expand=True)
print(df)
                      content             date   message
0  01/09/15, 10:07 - message1  01/09/15, 10:07  message1
1  01/09/15, 10:32 - message2  01/09/15, 10:32  message2
2  01/09/15, 10:44 - message3  01/09/15, 10:44  message3
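
One caveat worth sketching: if a message can itself contain " - ", an unlimited split yields more than two columns and the two-column assignment fails. A possible variant, assuming only the first separator should count (as in the question's own `split(" - ", 1)`), passes `n=1`. The second sample row below is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"content": [
    "01/09/15, 10:07 - message1",
    "01/09/15, 10:32 - a message - with a dash",  # hypothetical edge case
]})

# n=1 limits the split to the first match, so there are always two parts.
parts = df["content"].str.split(r"\s+-\s+", n=1, expand=True)
df[["date", "message"]] = parts
print(df[["date", "message"]])
```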

If you need to remove the content column, use DataFrame.pop:

df[['date','message']] = df.pop('content').str.split(r'\s+-\s+', expand=True)

print(df)
              date   message
0  01/09/15, 10:07  message1
1  01/09/15, 10:32  message2
2  01/09/15, 10:44  message3
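
The question mentions converting the date column to datetime afterwards. A rough sketch with pd.to_datetime follows; the format string is an assumption, since the sample does not say whether 01/09/15 is day/month/year or month/day/year. Day first is assumed here, so swap %d and %m if the data is month first.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["01/09/15, 10:07", "01/09/15, 10:32", "01/09/15, 10:44"],
    "message": ["message1", "message2", "message3"],
})

# Assumption: day/month/two-digit year, then 24-hour time.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%y, %H:%M")
print(df.dtypes)
```

Passing an explicit format avoids letting pandas guess the day/month order.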