How to filter overlap of two rows in a big file in

2019-08-05 07:27发布

I am trying to filter overlap rows in a big file in python.

The overlap degrees is set to 25% of two rows and the other two rows. In other words, the overlap degrees is a*b/(c+d-a*b)>0.25, a is the number of intersection between the 1st row and 3rd row, b is the number of intersection between the 2nd row and 4th row, c is the number of elements of the 1st row multiplied by the number of elements of the 2nd row, d is the number of elements of the 3rd row multiplied by the number of elements of the 4th row . If the overlap degrees is more than 0.25,the 3rd and 4th rows are deleted. So if I have a big file with 1000 000 rows in total, the first 6 rows are as follows:

c6 c24 c32 c54 c67
k6 k12 k33 k63 k62
c6 c24 c32 c51 c68 c78
k6 k12 k24 k63
c6 c32 c24 c63 c67 c54 c75
k6 k12 k33 k63

Because the overlap degrees of the 1st two rows and 2nd rows is a=3, (such as c6,c24,c32),b=3,(such as k6,k12,k63), c=25,d=24,a*b/(c+d-a*b)=9/40<0.25,the 3rd and 4th rows are not deleted. Next the overlap degrees of the 1st two rows and 3rd two rows is 5*4/(25+28-5*4)=0.61>0.25,the 3rd two rows are deleted.
The final answer is the 1st and 2nd two rows.

c6 c24 c32 c54 c67
k6 k12 k33 k63 k62
c6 c24 c32 c51 c68 c78
k6 k12 k24 k63

The pseudo code are as follows:

for i=1:(n-1)    # n is a half of the number of rows of the big file
    for j=(i+1):n  
        if  overlap degrees of the ith two rows and jth two rows is more than 0.25
          delete the jth two rows from the big file
        end
    end
end

The python code are as follows.But it is wrong. How to fix it?

with open("iuputfile.txt") as fileobj: 
    sets = [set(line.split()) for line in fileobj]
    for first_index in range(len(sets) - 4, -2, -2):
        c=len(sets[first_index])*len(sets[first_index+1])
        for second_index in range(len(sets)-2 , first_index, -2):
            d=len(sets[second_index])*len(sets[second_index+1])
            ab = len(sets[first_index] | sets[second_index])*len(sets[first_index+1] | sets[second_index+1])
            if (ab/(c+d-ab))>0.25:
                del sets[second_index]
                del sets[second_index+1]
with open("outputfile.txt", "w") as fileobj:
    for set_ in sets:
        # order of the set is undefined, so we need to sort each set
        output = " ".join(set_)
        fileobj.write("{0}\n".format(output))

The similar problem can be found in https://stackoverflow.com/questions/17321275/

How to modify that code to solve this problem in Python? Thank you!

2条回答
在下西门庆
2楼-- · 2019-08-05 07:54

I've been thinking about how to solve this problem in a better way, without all the reversing and indexing and stuff, and I've come up with a solution that's longer and more involved, but easier to read, prettier, more maintainable and extendable, IMHO.

First we need a special kind of list that we can iterate over "correctly", even if an item in it is deleted. Here is a blog post going into more detail about how lists and iterators work, and reading it will help you understand what's going on here:

class SmartList(list):
    def __init__(self, *args, **kwargs):
        super(SmartList, self).__init__(*args, **kwargs)
        self.iterators = []

    def __iter__(self):
        return SmartListIter(self)

    def __delitem__(self, index):
        super(SmartList, self).__delitem__(index)
        for iterator in self.iterators:
            iterator.item_deleted(index)

We extend the built-in list and make it return a custom iterator instead of the default. Whenever an item in the list is deleted, we call the item_deleted method of every item in the self.iterators list. Here's the code for SmartListIter:

class SmartListIter(object):
    def __init__(self, smartlist, index=0):
        self.smartlist = smartlist
        smartlist.iterators.append(self)
        self.index = index

    def __iter__(self):
        return self

    def next(self):
        try:
            item = self.smartlist[self.index]
        except IndexError:
            self.smartlist.iterators.remove(self)
            raise StopIteration
        index = self.index
        self.index += 1
        return (index, item)

    def item_deleted(self, index):
        if index >= self.index:
            return
        self.index -= 1

So the iterator adds itself to the list of iterators, and removes itself from the same list when it is done. If an item with an index less than the current index is deleted, we decrement the current index by one so that we won't skip an item like a normal list iterator would do.

The next method returns a tuple (index, item) instead of just the item, because that makes things easier when it's time to use these classes -- we won't have to mess around with enumerate.

So that should take care of having to go backwards, but we're still having to use a lot of indexes to juggle between four different lines in every loop. Since two and two lines go together, let's make a class for that:

class LinePair(object):
    def __init__(self, pair):
        self.pair = pair
        self.sets = [set(line.split()) for line in pair]
        self.c = len(self.sets[0]) * len(self.sets[1])

    def overlap(self, other):
        ab = float(len(self.sets[0] & other.sets[0]) * \
            len(self.sets[1] & other.sets[1]))
        overlap = ab / (self.c + other.c - ab)
        return overlap

    def __str__(self):
        return "".join(self.pair)

The pair attribute is a tuple of two lines read directly from the input file, complete with newlines. We use it later to write that pair back to a file. We also convert both lines to a set and calculate the c attribute, which is a property of every pair of lines. Finally we make a method that will compute the overlap between one LinePair and another. Notice that d is gone, since that is just the c attribute of the other pair.

Now for the grand finale:

from itertools import izip

with open("iuputfile.txt") as fileobj:
    pairs = SmartList([LinePair(pair) for pair in izip(fileobj, fileobj)])

for first_index, first_pair in pairs:
    for second_index, second_pair in SmartListIter(pairs, first_index + 1):
        if first_pair.overlap(second_pair) > 0.25:
            del pairs[second_index]

with open("outputfile.txt", "w") as fileobj:
    for index, pair in pairs:
        fileobj.write(str(pair))

Notice how easy it is to read the central loop here, and how short it is. If you need to change this algorithm in the future, it's likely much more easily done with this code than with my other code. The izip is used to group two and two lines of the input file, as explained here.

查看更多
劳资没心,怎么记你
3楼-- · 2019-08-05 08:01

Stack Overflow is not here to program for you or to solve general debugging tasks. It's for specific problems that you've tried to solve yourself but can't. You're asking questions you should, as a programmer, be able to figure out yourself. Start your program like this:

python -m pdb my_script.py

Now you can step through your script line by line using the n command. If you want to see what's inside a variable, simply type the name of that variable. By using this method you will find out why things don't work yourself. There are lots of other smart things you can do using pdb (the python debugger) but for this case the n command is sufficient.

Please put in some more effort towards solving your problem yourself before asking another question here.

That being said, here is what was wrong with your modified script:

with open("iuputfile.txt") as fileobj:
    sets = [set(line.split()) for line in fileobj]
    for first_index in range(len(sets) - 4, -2, -2):
        c = len(sets[first_index]) * len(sets[first_index + 1])
        for second_index in range(len(sets) - 2, first_index, -2):
            d = len(sets[second_index]) * len(sets[second_index + 1])
            # Error 1:
            ab = len(sets[first_index] & sets[second_index]) * \
                len(sets[first_index + 1] & sets[second_index + 1])
            # Error 2:
            overlap = (float(ab) / (c + d - ab))
            if overlap > 0.25:
                # Error 3:
                del sets[second_index + 1]
                del sets[second_index]
    with open("outputfile.txt", "w") as fileobj:
        for set_ in sets:
            # You've removed the sorting, I assume it's because the order
            # is unimportant
            output = " ".join(set_)
            fileobj.write("{0}\n".format(output))

The mistakes were:

  • Error 1: Intersection is &. Union is |.
  • Error 2: Since all of the variables are integers, the result will be an integer too, unless you're using python 3. If you are, this is not an error. If you're not, you need to make sure one of the variables are a float to force the result to be a float as well. Hence the float(ab).
  • Error 3: Remember to always work from the back and forwards. When you delete sets[second_index], what used to be at sets[second_index + 1] takes it place, so deleting sets[second_index + 1] afterwards will delete what used to be at sets[second_index + 2], which is not what you want. So we delete the largest index first.
查看更多
登录 后发表回答