I have this code that works great and does what I want, however it does it in linear form which is way to slow for the size of my data files so I want to convert it to Log. I tried this code and many others posted here but still no luck at getting it to work. I will post both sets of code and give examples of what I expect.
import pandas
import fileinput
'''This code runs fine and does what I expect removing duplicates from big
file that are in small file, however it is a linear function.'''
with open('small.txt') as fin:
exclude = set(line.rstrip() for line in fin)
for line in fileinput.input('big.txt', inplace=True):
if line.rstrip() not in exclude:
print(line, end='')
else:
print('')
'''This code is my attempt at conversion to a log function.'''
def log_search(small, big):
first = 0
last = len(big.txt) - 1
while first <= last:
mid = (first + last) / 2
if str(mid) == small.txt:
return True
elif small.txt < str(mid):
last = mid - 1
else:
first = mid + 1
with open('small.txt') as fin:
exclude = set(line.rstrip() for line in fin)
for line in fileinput.input('big.txt', inplace=True):
if line.rstrip() not in exclude:
print(line, end='')
else:
print('')
return log_search(small, big)
- big file has millions of lines of int data.
- small file has hundreds of lines of int data.
- compare data and remove duplicated data in big file but leave line number blank.
running the first block of code works but it takes too long to search through the big file. Maybe I am approaching the problem in a wrong way. My attempt at converting it to log runs without error but does nothing.
I don't think there is a better or faster way to do this that what you are currently doing in your first approach. (Update: There is, see below.) Storing the lines from
small.txt
in aset
and iterating the lines inbig.txt
, checking whether they are in that set, will have complexity ofO(b)
, withb
being the number of lines inbig.txt
.What you seem to be trying is to reduce this to
O(s*logb)
, withs
being the number of lines insmall.txt
, by using binary search to check for each line insmall.txt
whether it is inbig.txt
and removing/overwriting it then.This would work well if all the lines were in a
list
with random access to any array, but you have just the file, which does not allow random access to any line. It does, however, allow random access to any character withfile.seek
, which (at least in some cases?) seems to be O(1). But then you will still have to find the previous line break to that position before you can actually read that line. Also, you can not just replace lines with empty lines, but you have to overwrite the number with the same number of characters, e.g. spaces.So, yes, theoretically it can be done in
O(s*logb)
, if you do the following:On my system, reading and writing a file with 10 million lines of numbers only took 3 seconds each, or about 8 seconds with
fileinput.input
andprint
. Thus, IMHO, this is not really worth the effort, but of course this may depend on how often you have to do this operation.Okay, so I got curious myself --and who needs a lunch break anyway?-- so I tried to implement this... and it works surprisingly well. This will find the given number in the file and replace it with an accordant number of
-
(not just a blank line, that's impossible without rewriting the entire file). Note that I did not thoroughly test the binary-search algorithm for edge cases, off-by-one erros etc.This will first collect all the positions to be overwritten, and then do the actual overwriting in a second stes, otherwise the binary search will have problems when it hits one of the previously "removed" lines. I tested this with a file holding all the numbers from 0 to 1,000 in sorted order and a list of random numbers (both in- and out-of-bounds) to be removed and it worked just fine.
Update: Also tested it with a file with random numbers from 0 to 100,000,000 in sorted order (944 MB) and overwriting 100 random numbers, and it finished immediately, so this should indeed be O(s*logb), at least on my system (the complexity of
file.seek
may depend on file system, file type, etc.).The
bsearch
function could also be generalized to accept another parametervalue_function
instead of hardcodingval = int(line)
. Then it could be used for binary-searching in arbitrary files, e.g. huge dictionaries, gene databases, csv files, etc., as long as the lines are sorted by that same value function.