Removing multiple recurring text from pandas rows

Posted 2019-07-17 07:14

I have a pandas dataframe whose rows are articles scraped from websites. There are 100 thousand articles of a similar nature.

Here is a glimpse of my dataset.

text
0   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
1   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
2   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
3   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
4   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
5   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
6   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
7   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
8   which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
for those who werent as productive as they would have liked during the first half of 2018
28  for those who werent as productive as they would have liked during the first half of 2018
29  for those who werent as productive as they would have liked during the first half of 2018
30  for those who werent as productive as they would have liked during the first half of 2018
31  for those who werent as productive as they would have liked during the first half of 2018
32  for those who werent as productive as they would have liked during the first half of 2018

Now, these are the opening parts of each text, and they are repetitive. The main text comes after them.

Is there a way, or a function, that identifies these repeated openings and strips them out in a few lines of code?

2 answers
霸刀☆藐视天下
#2 · 2019-07-17 07:36

If you want to remove strings that are exactly the same, sort your dataframe and then go through it in order. (This is similar to what Nerdrigo mentioned in a comment.)

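sents = ... # sorted list of strings, e.g. sorted(df["text"])
out = [] # stuff here will be unique
for ii in range(len(sents)):
    # keep a sentence only if it differs from the next one; the last one is always kept
    if ii == len(sents) - 1 or sents[ii] != sents[ii + 1]:
        out.append(sents[ii])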
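
For exact duplicates, pandas can also do this directly without sorting by hand. The following is a minimal sketch, assuming the articles live in a column named text (as in the question's sample) and using made-up sample rows:

```python
import pandas as pd

# Hypothetical sample data; the real frame would hold the scraped articles.
df = pd.DataFrame({"text": ["alpha", "beta", "alpha", "gamma", "beta"]})

# Keep the first occurrence of each distinct text and drop later exact copies.
unique_df = df.drop_duplicates(subset="text", keep="first").reset_index(drop=True)

print(unique_df["text"].tolist())  # ['alpha', 'beta', 'gamma']
```

Unlike the sorted loop, this keeps the rows in their original order.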

If you want to remove sentences that are very similar but not exactly the same, the problem is much harder and there's no easy solution. You need to look into locality-sensitive hashing or near-duplicate detection. The datasketch library may be helpful.
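
To give a feel for what "very similar" means, here is a rough sketch that uses the standard library's difflib.SequenceMatcher ratio rather than locality-sensitive hashing. The function name drop_near_duplicates and the 0.9 threshold are my own assumptions for illustration. It compares every text with every text already kept, so it is only practical for small samples, not for 100 thousand articles:

```python
from difflib import SequenceMatcher


def drop_near_duplicates(texts, threshold=0.9):
    """Keep a text only if it is not too similar to any text already kept."""
    kept = []
    for text in texts:
        # ratio() is 1.0 for identical strings and approaches 0.0 for unrelated ones.
        if all(SequenceMatcher(None, text, k).ratio() < threshold for k in kept):
            kept.append(text)
    return kept


samples = [
    "the year is more than halfway over",
    "the year is more than halfway over!",
    "a completely different sentence",
]
result = drop_near_duplicates(samples)
print(result)
```

Here the second sample differs from the first by a single character, so it is dropped, while the third is kept.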


Based on your comment I think I finally get it - you want to remove a common prefix. In that case modify the above code to be like this:

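sents = ... # sorted list of strings, e.g. sorted(df["text"])
out = [] # cleaned sentences go here
lml = 0 # last match length
# note: each sentence is compared with the next one only, so the last sentence
# of a group (whose successor has a different prefix) keeps its prefix
for ii in range(len(sents) - 1):
    a, b = sents[ii], sents[ii + 1]
    # first check if the match from the last iteration still works
    if lml > 0 and a[:lml] == b[:lml] and a[:lml + 1] != b[:lml + 1]:
        # old prefix still worked, chop and move on
        out.append(a[lml:])
        continue

    # if we're here, it means the prefix changed
    ml = 0 # match length
    # find the longest matching prefix, stopping at the end of the shorter string
    while ml < min(len(a), len(b)) and a[ml] == b[ml]:
        ml += 1

    # save the prefix length
    lml = ml
    # chop off the shared prefix
    out.append(a[ml:])

# the last sentence has no successor; lml is the prefix it shares with its predecessor
if sents:
    out.append(sents[-1][lml:])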
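
A variation on the same idea, as a sketch rather than a definitive solution: compare each sentence with both of its neighbours in sorted order and strip the longer of the two shared prefixes, so that the sentences at the edges of a group are handled as well. The helper name strip_shared_prefixes and the min_len cutoff are assumptions of mine; os.path.commonprefix is a standard-library function that compares strings character by character.

```python
import os.path


def strip_shared_prefixes(texts, min_len=20):
    """Strip from each text the longest prefix it shares with a sorted neighbour."""
    # Sort positions rather than the texts so the result keeps the input order.
    order = sorted(range(len(texts)), key=lambda i: texts[i])
    cleaned = list(texts)
    for pos, idx in enumerate(order):
        text = texts[idx]
        # Previous and next entries in sorted order, where they exist.
        neighbours = [texts[order[p]] for p in (pos - 1, pos + 1) if 0 <= p < len(order)]
        shared = max(
            (len(os.path.commonprefix([text, other])) for other in neighbours),
            default=0,
        )
        # Ignore short accidental overlaps such as a shared first word.
        if shared >= min_len:
            cleaned[idx] = text[shared:]
    return cleaned


prefix_a = "which brings not only warmer weather but also "
prefix_b = "for those who werent as productive "
articles = [
    prefix_a + "first body",
    prefix_b + "second body",
    prefix_a + "third body",
    prefix_b + "fourth body",
]
print(strip_shared_prefixes(articles))
# ['first body', 'second body', 'third body', 'fourth body']
```

Two caveats: exact duplicates end up as empty strings, and if two article bodies happen to begin with the same characters, those characters are stripped along with the boilerplate. With a dataframe, the input could be something like df["text"].tolist().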
The star
#3 · 2019-07-17 07:54

I think you could use difflib somehow, for example:

>>> import difflib
>>> a = "my mother always told me to mind my business" 
>>> b = "my mother always told me to be polite"
>>> s = difflib.SequenceMatcher(None,a,b)
>>> s.find_longest_match(0,len(a),0,len(b))

Output:

Match(a=0, b=0, size=28)

Where a=0 means that the matching sequence starts at character 0 in string a, and b=0 means that the matching sequence starts at character 0 for string b.

Now if you do:

>>> b.replace(a[:28],"")

The output will be:

'be polite'

And if you choose to do c = s.find_longest_match(0,len(a),0,len(b)) then c[0] = 0, c[1] = 0 and c[2] = 28.
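
Wrapped into a small helper, the same calls might look like the sketch below. The name shared_prefix_length is hypothetical. Note that find_longest_match returns the longest common block wherever it occurs, so the helper only accepts it when it starts at index 0 in both strings, and it returns 0 when the longest block is elsewhere, even if a shorter common prefix exists.

```python
from difflib import SequenceMatcher


def shared_prefix_length(a, b):
    """Size of the longest common block if it starts both strings, else 0."""
    # autojunk=False turns off the popularity heuristic used for long sequences.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    # Only treat the block as a prefix when it begins at position 0 in both strings.
    return match.size if match.a == 0 and match.b == 0 else 0


a = "my mother always told me to mind my business"
b = "my mother always told me to be polite"
n = shared_prefix_length(a, b)
print(n)      # 28
print(b[n:])  # be polite
```

Slicing with b[n:] removes only the leading copy, whereas b.replace(a[:n], "") would also remove the same text if it reappeared later in the string.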

You can read more about it here: https://docs.python.org/2/library/difflib.html
