Removing multiple recurring text from pandas rows

I have a pandas dataframe whose rows are articles scraped from websites. There are 100 thousand articles of a similar nature.

Here is a glimpse of my dataset.

text
0   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
1   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
2   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
3   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
4   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
5   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
6   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
7   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
8   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
for those who werent as productive as they would have liked during the first half of 2018
28  for those who werent as productive as they would have liked during the first half of 2018
29  for those who werent as productive as they would have liked during the first half of 2018
30  for those who werent as productive as they would have liked during the first half of 2018
31  for those who werent as productive as they would have liked during the first half of 2018
32  for those who werent as productive as they would have liked during the first half of 2018

These are the opening words of each text, and they repeat across many rows. The main text comes after these openings.

Is there a way, or an existing function, to identify these repeated openings and strip them out in a few lines of code?

Tags: python pandas nlp data-science text-processing

2 answers

霸刀☆藐视天下

Reply #2 · 2019-07-17 07:36

If you want to remove strings that are exactly the same, sort your dataframe and then go through it in order. (This is similar to what Nerdrigo mentioned in a comment.)

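sents = ...  # sorted list of strings, e.g. sorted(df["text"])
out = []  # stuff here will be unique
for ii in range(len(sents) - 1):
    # keep a sentence only if it differs from the one after it
    if sents[ii] != sents[ii + 1]:
        out.append(sents[ii])
# the last sentence has no successor to compare with, so always keep it
if len(sents) > 0:
    out.append(sents[-1])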
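
For exact duplicates, pandas can also do this directly without sorting or looping. A minimal sketch, assuming the articles live in a column named `text` (as in the question's sample); the three example rows are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data; in practice df is the scraped dataframe.
df = pd.DataFrame({"text": ["first article", "second article", "first article"]})

# Keep the first occurrence of every distinct text and renumber the rows.
unique = df.drop_duplicates(subset="text").reset_index(drop=True)
```

Unlike the loop above, this keeps the rows in their original order.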

If you want to remove sentences that are very similar but not exactly the same, the problem is much harder and there's no easy solution. You need to look into locality-sensitive hashing or near-duplicate detection. The datasketch library may be helpful.
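
To give a feel for what "very similar" could mean, here is a rough sketch that scores two texts by the Jaccard similarity of their word shingles. This is a simplified stand-in, not locality-sensitive hashing and not the datasketch API; the function names and the shingle size `k=3` are my own choices for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word n-grams) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the shingle sets of a and b (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


score = jaccard("the quick brown fox jumps", "the quick brown fox leaps")
```

Comparing every pair this way is quadratic in the number of rows, which gets slow at 100 thousand articles. That is the reason to look at locality-sensitive hashing instead of brute force.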

Based on your comment I think I finally get it - you want to remove a common prefix. In that case modify the above code to be like this:

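sents = ...  # sorted list of strings, e.g. sorted(df["text"])
out = []  # cleaned sentences go here
lml = 0  # last match length
for ii in range(len(sents) - 1):
    a, b = sents[ii], sents[ii + 1]
    # first check if the match from the last iteration still works:
    # the first lml characters agree and the next character differs
    if lml > 0 and a[:lml] == b[:lml] and a[:lml + 1] != b[:lml + 1]:
        # old prefix still worked, chop and move on
        out.append(a[lml:])
        continue

    # if we're here, it means the prefix changed
    ml = 0  # match length
    # find the longest matching prefix, stopping at the end of the shorter string
    while ml < min(len(a), len(b)) and a[ml] == b[ml]:
        ml += 1

    # chop off the longer of the old and new shared prefix, so the last
    # sentence of a group still loses the prefix it shares with its predecessor
    out.append(a[max(lml, ml):])
    # save the prefix length
    lml = ml

# the last sentence shares the final prefix with its predecessor
if len(sents) > 0:
    out.append(sents[-1][lml:])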
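
A variant of the same idea that avoids sorting: group the texts by their first few words, then strip the longest prefix each group shares. This is only a sketch under assumptions. The function name, the `key_words=3` grouping heuristic, and the sample strings are mine, and texts that merely start with the same three words will be grouped together even if they are unrelated.

```python
import os
from collections import defaultdict


def strip_shared_prefixes(texts, key_words=3):
    """Strip from each text the prefix shared by all texts in its group.

    Texts are grouped by their first key_words words (a heuristic).
    Groups with a single member are left untouched.
    """
    groups = defaultdict(list)
    for pos, text in enumerate(texts):
        key = " ".join(text.split()[:key_words])
        groups[key].append(pos)

    cleaned = list(texts)
    for positions in groups.values():
        if len(positions) < 2:
            continue
        # os.path.commonprefix compares character by character.
        prefix = os.path.commonprefix([texts[p] for p in positions])
        # Trim back to the last space so a word is never cut in half.
        prefix = prefix[: prefix.rfind(" ") + 1]
        for p in positions:
            cleaned[p] = texts[p][len(prefix):]
    return cleaned


# Hypothetical sample rows for illustration.
sample = [
    "which brings warmer weather. So the story begins",
    "for those who werent productive, here is a tip",
    "which brings warmer weather. So another article",
]
result = strip_shared_prefixes(sample)
```

With a dataframe this could be applied as `df["text"] = strip_shared_prefixes(df["text"].tolist())`, again assuming the column is called `text`.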


The star"

Reply #3 · 2019-07-17 07:54

I think you could use difflib somehow, for example:

>>> import difflib
>>> a = "my mother always told me to mind my business" 
>>> b = "my mother always told me to be polite"
>>> s = difflib.SequenceMatcher(None,a,b)
>>> s.find_longest_match(0,len(a),0,len(b))

Output:

Match(a=0, b=0, size=28)

Where a=0 means that the matching sequence starts at character 0 in string a, and b=0 means that the matching sequence starts at character 0 for string b.

Now if you do:

>>> b.replace(a[:28],"")

The output will be:

'be polite'

And if you choose to do c = s.find_longest_match(0,len(a),0,len(b)) then c[0] = 0, c[1] = 0 and c[2] = 28.
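
Putting those pieces together, a small helper could strip the shared part only when it really sits at the start of both strings. This is a sketch, and the function name is made up. I pass `autojunk=False` because `SequenceMatcher` otherwise applies a heuristic to sequences of 200 or more items that can change the match on long texts such as articles.

```python
import difflib


def strip_shared_start(a, b):
    """Remove the longest common block from a and b if it is a prefix of both."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    # Only strip when the longest common block starts at position 0 in both.
    if match.a == 0 and match.b == 0:
        return a[match.size:], b[match.size:]
    return a, b


a = "my mother always told me to mind my business"
b = "my mother always told me to be polite"
rest_a, rest_b = strip_shared_start(a, b)
```

Slicing with `match.size` is a bit safer than `b.replace(a[:28], "")`, since `replace` would also delete the same text if it happened to occur again later in the string.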

You can read more about it here: https://docs.python.org/2/library/difflib.html
