I am trying to filter overlap rows in a big file in python.
The overlap degrees is set to 25% of two rows and the other two rows. In other words, the overlap degrees is a*b/(c+d-a*b)>0.25
, a
is the number of intersection between the 1st row and 3rd row, b
is the number of intersection between the 2nd row and 4th row, c
is the number of elements of the 1st row multiplied by the number of elements of the 2nd row, d
is the number of elements of the 3rd row multiplied by the number of elements of the 4th row . If the overlap degrees is more than 0.25,the 3rd and 4th rows are deleted. So if I have a big file with 1000 000 rows in total, the first 6 rows are as follows:
c6 c24 c32 c54 c67
k6 k12 k33 k63 k62
c6 c24 c32 c51 c68 c78
k6 k12 k24 k63
c6 c32 c24 c63 c67 c54 c75
k6 k12 k33 k63
Because the overlap degrees of the 1st two rows and 2nd rows is a=3
, (such as c6,c24,c32
),b=3
,(such as k6,k12,k63
), c=25,d=24
,a*b/(c+d-a*b)=9/40<0.25
,the 3rd and 4th rows are not deleted. Next the overlap degrees of the 1st two rows and 3rd two rows is 5*4/(25+28-5*4)=0.61>0.25
,the 3rd two rows are deleted.
The final answer is the 1st and 2nd two rows.
c6 c24 c32 c54 c67
k6 k12 k33 k63 k62
c6 c24 c32 c51 c68 c78
k6 k12 k24 k63
The pseudo code are as follows:
for i=1:(n-1) # n is a half of the number of rows of the big file for j=(i+1):n if overlap degrees of the ith two rows and jth two rows is more than 0.25 delete the jth two rows from the big file end end end
The python code are as follows.But it is wrong. How to fix it?
with open("iuputfile.txt") as fileobj: sets = [set(line.split()) for line in fileobj] for first_index in range(len(sets) - 4, -2, -2): c=len(sets[first_index])*len(sets[first_index+1]) for second_index in range(len(sets)-2 , first_index, -2): d=len(sets[second_index])*len(sets[second_index+1]) ab = len(sets[first_index] | sets[second_index])*len(sets[first_index+1] | sets[second_index+1]) if (ab/(c+d-ab))>0.25: del sets[second_index] del sets[second_index+1] with open("outputfile.txt", "w") as fileobj: for set_ in sets: # order of the set is undefined, so we need to sort each set output = " ".join(set_) fileobj.write("{0}\n".format(output))
The similar problem can be found in https://stackoverflow.com/questions/17321275/
How to modify that code to solve this problem in Python? Thank you!
I've been thinking about how to solve this problem in a better way, without all the reversing and indexing and stuff, and I've come up with a solution that's longer and more involved, but easier to read, prettier, more maintainable and extendable, IMHO.
First we need a special kind of list that we can iterate over "correctly", even if an item in it is deleted. Here is a blog post going into more detail about how lists and iterators work, and reading it will help you understand what's going on here:
We extend the built-in
list
and make it return a custom iterator instead of the default. Whenever an item in the list is deleted, we call theitem_deleted
method of every item in theself.iterators
list. Here's the code forSmartListIter
:So the iterator adds itself to the list of iterators, and removes itself from the same list when it is done. If an item with an index less than the current index is deleted, we decrement the current index by one so that we won't skip an item like a normal list iterator would do.
The
next
method returns a tuple(index, item)
instead of just the item, because that makes things easier when it's time to use these classes -- we won't have to mess around withenumerate
.So that should take care of having to go backwards, but we're still having to use a lot of indexes to juggle between four different lines in every loop. Since two and two lines go together, let's make a class for that:
The
pair
attribute is a tuple of two lines read directly from the input file, complete with newlines. We use it later to write that pair back to a file. We also convert both lines to a set and calculate thec
attribute, which is a property of every pair of lines. Finally we make a method that will compute the overlap between one LinePair and another. Notice thatd
is gone, since that is just thec
attribute of the other pair.Now for the grand finale:
Notice how easy it is to read the central loop here, and how short it is. If you need to change this algorithm in the future, it's likely much more easily done with this code than with my other code. The
izip
is used to group two and two lines of the input file, as explained here.Stack Overflow is not here to program for you or to solve general debugging tasks. It's for specific problems that you've tried to solve yourself but can't. You're asking questions you should, as a programmer, be able to figure out yourself. Start your program like this:
Now you can step through your script line by line using the
n
command. If you want to see what's inside a variable, simply type the name of that variable. By using this method you will find out why things don't work yourself. There are lots of other smart things you can do using pdb (the python debugger) but for this case then
command is sufficient.Please put in some more effort towards solving your problem yourself before asking another question here.
That being said, here is what was wrong with your modified script:
The mistakes were:
&
. Union is|
.float(ab)
.sets[second_index]
, what used to be atsets[second_index + 1]
takes it place, so deletingsets[second_index + 1]
afterwards will delete what used to be atsets[second_index + 2]
, which is not what you want. So we delete the largest index first.