I have a pandas DataFrame whose rows are articles scraped from websites. I have 100 thousand articles of a similar nature.
Here is a glimpse of my dataset.
text
0 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
1 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
2 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
3 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
4 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
5 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
6 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
7 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
8 which brings not only warmer weather but also the unsettling realization that the year is more than halfway over. So
for those who werent as productive as they would have liked during the first half of 2018
28 for those who werent as productive as they would have liked during the first half of 2018
29 for those who werent as productive as they would have liked during the first half of 2018
30 for those who werent as productive as they would have liked during the first half of 2018
31 for those who werent as productive as they would have liked during the first half of 2018
32 for those who werent as productive as they would have liked during the first half of 2018
These are the opening words of each text, and they are repetitive. The main text comes after them.
Is there a way, or a function, to identify these repeated openings and strip them out in a few lines of code?
If you want to remove strings that are exactly the same, sort your dataframe and then go through it in order. (This is similar to what Nerdrigo mentioned in a comment.)
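A minimal sketch of the sort-then-scan idea, assuming the texts have been pulled out of the DataFrame into a plain list (for example with df["text"].tolist()). The function name and the sample strings are made up for illustration.

```python
def drop_exact_duplicates(texts):
    """Return the distinct strings in sorted order."""
    unique = []
    for text in sorted(texts):
        # After sorting, identical strings sit next to each other, so one
        # comparison with the last kept string is enough.
        if not unique or text != unique[-1]:
            unique.append(text)
    return unique


# Hypothetical sample data.
sample = ["b article", "a article", "b article", "a article", "c article"]
deduped = drop_exact_duplicates(sample)
print(deduped)  # ['a article', 'b article', 'c article']
```

If the original row order matters, pandas can do the same job directly with df.drop_duplicates(subset="text"), assuming the column is called text as in the sample above.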
If you want to remove sentences that are very similar but not exactly the same, the problem is much harder and there's no easy solution. You need to look into locality-sensitive hashing or near-duplicate detection. The datasketch library may be helpful.
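To give a rough idea of what near-duplicate detection measures, here is a small sketch that scores two texts by the Jaccard similarity of their word shingles. This is a brute-force stand-in, not locality-sensitive hashing and not the datasketch API. The function names, the shingle size k=3 and the sample sentences are all assumptions for illustration.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Too short for a full window: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


# Hypothetical sample sentences.
s1 = "the year is more than halfway over so plan ahead"
s2 = "the year is more than halfway over so plan carefully"
s3 = "a completely different sentence about something else entirely"

print(round(jaccard(s1, s2), 2))  # 0.78, near-duplicates
print(jaccard(s1, s3))            # 0.0, unrelated
```

Comparing every pair like this is quadratic in the number of texts, so it would not scale to 100 thousand articles. That is the gap locality-sensitive hashing is meant to fill.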
Based on your comment, I think I finally get it: you want to remove a common prefix. In that case, modify the sort-and-scan code above to be like this:
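A minimal sketch of that modification, assuming the texts are in a plain list and that a shared opening is only worth stripping when it is at least min_len characters long. The function name, the min_len threshold and the sample strings are hypothetical. It relies on os.path.commonprefix, which compares strings character by character.

```python
import os


def strip_shared_prefixes(texts, min_len=20):
    """Strip from each text the longest prefix it shares with a sorted neighbour."""
    # Sort indices rather than the texts so the result keeps the input order.
    order = sorted(range(len(texts)), key=lambda i: texts[i])
    result = list(texts)
    for pos, i in enumerate(order):
        best = 0
        # Texts sharing a prefix end up adjacent after sorting, so only the
        # previous and next neighbours need to be checked.
        for npos in (pos - 1, pos + 1):
            if 0 <= npos < len(order):
                shared = os.path.commonprefix([texts[i], texts[order[npos]]])
                best = max(best, len(shared))
        # Ignore short overlaps, which are probably accidental.
        if best >= min_len:
            result[i] = texts[i][best:]
    return result


# Hypothetical sample data modelled on the question.
sample = [
    "which brings not only warmer weather. So here is article one",
    "for those who werent as productive in 2018, try article three",
    "which brings not only warmer weather. So this is article two",
    "for those who werent as productive in 2018, read article four",
]
stripped = strip_shared_prefixes(sample)
print(stripped)
# ['here is article one', 'try article three',
#  'this is article two', 'read article four']
```

Two caveats: the match is per character, so if two article bodies happen to start with the same letters those letters are stripped too, and exact duplicates are reduced to empty strings, so it is probably best to remove those first.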
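I think you could use difflib somehow, for example by building a SequenceMatcher s from two of the strings, a and b, and calling s.find_longest_match(0, len(a), 0, len(b)). Output: Match(a=0, b=0, size=28)
Where a=0 means that the matching sequence starts at character 0 in string a, and b=0 means that the matching sequence starts at character 0 in string b.
Now if you do:
The output will be: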
And if you choose to do c = s.find_longest_match(0, len(a), 0, len(b)), then c[0] = 0, c[1] = 0 and c[2] = 28.
You can read more about it here: https://docs.python.org/2/library/difflib.html
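The difflib approach described above can be sketched as follows. The two strings are hypothetical, chosen so that the shared prefix is 28 characters long, in line with the numbers quoted above. Slicing the prefix off at the end is one possible way to use the match, shown here as an assumption.

```python
from difflib import SequenceMatcher

# Hypothetical strings that share a 28-character prefix.
a = "not only warmer weather but first article body"
b = "not only warmer weather but second article body"

s = SequenceMatcher(None, a, b)
c = s.find_longest_match(0, len(a), 0, len(b))
print(c)  # Match(a=0, b=0, size=28)

# Only treat the match as a prefix if it starts at character 0 in both strings.
if c.a == 0 and c.b == 0:
    a_body = a[c.size:]
    b_body = b[c.size:]
    print(a_body)  # first article body
    print(b_body)  # second article body
```

Note that find_longest_match returns the longest common block wherever it occurs, so the check that both start positions are 0 matters: without it, a long shared passage in the middle of two articles could be mistaken for a prefix.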