I want to do this with Python and pandas.
Let's suppose that I have the following:
file_id text
1 I am the first document. I am a nice document.
2 I am the second document. I am an even nicer document.
and I finally want to have the following:
file_id text
1 I am the first document
1 I am a nice document
2 I am the second document
2 I am an even nicer document
So I want the text of each file to be split at every full stop, with a new row created for each of the resulting pieces of text.
What is the most efficient way to do this?
Use:
s = (df.pop('text')
       .str.strip('.')
       .str.split(r'\.\s+', expand=True)
       .stack()
       .rename('text')
       .reset_index(level=1, drop=True))
df = df.join(s).reset_index(drop=True)
print(df)
   file_id                         text
0        1      I am the first document
1        1         I am a nice document
2        2     I am the second document
3        2  I am an even nicer document
Explanation:
First use DataFrame.pop to extract the text column, remove the trailing . with Series.str.strip, and split with Series.str.split, escaping the . because it is a special regex character. Then reshape the result into a Series with DataFrame.stack, name it with Series.rename, drop the inner index level with Series.reset_index, and attach it back to the original with DataFrame.join.
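The chain described above can be unrolled one step at a time, which may make the intermediate shapes easier to inspect. This is only a sketch: the sample DataFrame is rebuilt from the question, and the intermediate names (trimmed, wide, long_form, result) are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "file_id": [1, 2],
    "text": [
        "I am the first document. I am a nice document.",
        "I am the second document. I am an even nicer document.",
    ],
})

text = df.pop("text")                            # df now holds only file_id
trimmed = text.str.strip(".")                    # drop the final full stop
wide = trimmed.str.split(r"\.\s+", expand=True)  # one column per sentence
long_form = wide.stack()                         # index: (original row, sentence position)
s = long_form.rename("text").reset_index(level=1, drop=True)
result = df.join(s).reset_index(drop=True)       # repeat file_id for every sentence
print(result)
```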
import pandas as pd

df = pd.DataFrame({'file_id': [1, 2],
                   'text': ["I am the first document. I am a nice document.",
                            "I am the second document. I am an even nicer document."]})
# Split on full stops; strip whitespace and drop empty pieces
df['sents'] = df.text.apply(lambda txt: [x.strip() for x in txt.split(".") if x.strip()])
# Expand each list into columns, then stack into one row per sentence
df = (df.set_index(['file_id'])
        .apply(lambda x: pd.Series(x['sents']), axis=1)
        .stack()
        .reset_index(level=1, drop=True))
df = df.reset_index()
df.columns = ['file_id', 'text']
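On pandas 0.25 or later, the same result can probably be reached more compactly with DataFrame.explode, which turns each element of a list-valued column into its own row. The sketch below is not taken from either answer above; it assumes the sample data from the question, and the name out is illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question
df = pd.DataFrame({
    "file_id": [1, 2],
    "text": [
        "I am the first document. I am a nice document.",
        "I am the second document. I am an even nicer document.",
    ],
})

# Split each text into a list of sentences, then give every list element its own row
sentences = df["text"].str.strip(".").str.split(r"\.\s+")
out = (df.assign(text=sentences)
         .explode("text")
         .reset_index(drop=True))
print(out)
```

Unlike the stack-based versions, this keeps any other columns in df alongside file_id without an extra join.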