Tokenise text and create more rows for each row in

Posted 2019-08-23 10:14

I want to do this with Python and pandas.

Let's suppose that I have the following:

file_id   text
1         I am the first document. I am a nice document.
2         I am the second document. I am an even nicer document.

and I finally want to have the following:

file_id   text
1         I am the first document
1         I am a nice document
2         I am the second document
2         I am an even nicer document

So I want the text of each file to be split at every full stop, with a new row created for each resulting sentence.

What is the most efficient way to do this?

Tags: python pandas tokenize

2 answers

啃猪蹄的小仙女

#2 · 2019-08-23 10:35

import pandas as pd

df = pd.DataFrame({'file_id': [1, 2],
                   'text': ["I am the first document. I am a nice document.",
                            "I am the second document. I am an even nicer document."]})

# Split each text on full stops, trimming whitespace and dropping empty pieces
df['sents'] = df.text.apply(lambda txt: [x.strip() for x in txt.split(".") if x.strip()])
# Expand each list into columns, then stack them into one row per sentence
df = df.set_index(['file_id']).apply(lambda x:
                                     pd.Series(x['sents']), axis=1).stack().reset_index(level=1, drop=True)
df = df.reset_index()
df.columns = ['file_id', 'text']


贪生不怕死

#3 · 2019-08-23 10:46

Use:

s = (df.pop('text')
      .str.strip('.')                      # drop the trailing full stop
      .str.split(r'\.\s+', expand=True)    # raw string: split on ". " as a regex
      .stack()
      .rename('text')
      .reset_index(level=1, drop=True))

df = df.join(s).reset_index(drop=True)
print(df)
   file_id                         text
0        1      I am the first document
1        1         I am a nice document
2        2     I am the second document
3        2  I am an even nicer document

Explanation:

First extract the column with DataFrame.pop and remove the trailing . with Series.str.strip. Then split with Series.str.split, escaping the . because it is a special regex character. Reshape the result into a Series with DataFrame.stack, drop the inner index level with Series.reset_index, and name it with Series.rename so that DataFrame.join can attach it to the original.
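To make the explanation easier to follow, here is a rough sketch of the same chain broken into separate steps, using the sample data from the question. The intermediate names (text, wide, long, result) are only illustrative, and the dropna() call is an extra assumption added so that rows with fewer sentences than others do not leave empty entries.

```python
import pandas as pd

df = pd.DataFrame({
    'file_id': [1, 2],
    'text': ["I am the first document. I am a nice document.",
             "I am the second document. I am an even nicer document."],
})

# Step 1: take the text column out of the frame; df keeps only file_id.
text = df.pop('text')

# Step 2: drop the trailing full stop, then split on "full stop + whitespace".
# expand=True gives one column per sentence.
wide = text.str.strip('.').str.split(r'\.\s+', expand=True)

# Step 3: stack the sentence columns into one long Series, drop missing
# entries, and remove the inner index level so that only the original
# row label remains.
long = (wide.stack()
            .dropna()
            .rename('text')
            .reset_index(level=1, drop=True))

# Step 4: join back on the original row label; each file_id is repeated
# once per sentence.
result = df.join(long).reset_index(drop=True)
print(result)
```

On pandas 0.25 or later, DataFrame.explode applied to a column of lists may be a shorter route to the same result, but that is a different technique from the one described in this answer.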
