What is the fastest method for finding lines in a file that contain a given string? I have a file containing the strings to search for. This small file (smallF) contains about 50,000 lines and looks like:
stringToSearch1
stringToSearch2
stringToSearch3
I have to search for all of these strings in a larger file (about 100 million lines). If any line in this larger file contains a search string, the line is printed.
The best method I have come up with so far is
grep -F -f smallF largeF
But this is not very fast. With just 100 search strings in smallF it takes about 4 minutes. For over 50,000 search strings it will take a lot of time.
Is there a more efficient method?
I once noticed that using -E or multiple -e parameters is faster than using -f. Note that this might not be applicable to your problem, as you are searching for 50,000 strings in a larger file. However, I wanted to show you what can be done and what might be worth testing.
Here is what I noticed in detail:
I have a 1.2 GB file filled with random strings:
>ls -has | grep string
1,2G strings.txt
>head strings.txt
Mfzd0sf7RA664UVrBHK44cSQpLRKT6J0
Uk218A8GKRdAVOZLIykVc0b2RH1ayfAy
BmuCCPJaQGhFTIutGpVG86tlanW8c9Pa
etrulbGONKT3pact1SHg2ipcCr7TZ9jc
.....
Now I want to search for strings "ab", "cd" and "ef" using different grep approaches:
Using grep without flags, searching for one string at a time:
grep "ab" strings.txt > m1.out
2,76s user 0,42s system 96% cpu 3,313 total
grep "cd" strings.txt >> m1.out
2,82s user 0,36s system 95% cpu 3,322 total
grep "ef" strings.txt >> m1.out
2,78s user 0,36s system 94% cpu 3,360 total
So in total the search takes nearly 10 seconds.
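(The timing lines above look like output from zsh's time builtin; that is, each command was presumably run along these lines:)

time grep "ab" strings.txt > m1.out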
Using grep with the -f flag, with the search strings in search.txt:
>cat search.txt
ab
cd
ef
>grep -F -f search.txt strings.txt > m2.out
31,55s user 0,60s system 99% cpu 32,343 total
For some reason, this takes nearly 32 seconds.
Now using multiple search patterns in a single grep call:
grep -E "ab|cd|ef" strings.txt > m3.out
3,80s user 0,36s system 98% cpu 4,220 total
or
grep --color=auto -e "ab" -e "cd" -e "ef" strings.txt > /dev/null
3,86s user 0,38s system 98% cpu 4,323 total
The third method, using -E, only took 4.22 seconds to search through the file.
Now let's check whether the results are the same:
cat m1.out | sort | uniq > m1.sort
cat m3.out | sort | uniq > m3.sort
diff m1.sort m3.sort
The diff produces no output, which means the found results are the same.
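To apply the -E approach to your 50,000 search strings without typing the pattern by hand, you could build the alternation from smallF on the fly. A sketch, assuming the strings contain no regex metacharacters (grep -E treats each alternative as a regex, and a pattern built from 50,000 strings may also hit pattern-length limits, so test on a subset first):

grep -E "$(paste -sd '|' smallF)" largeF

Here paste -sd '|' joins all lines of smallF into a single |-separated pattern.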
You may want to give it a try; otherwise I would advise you to look at the thread "Fastest possible grep", in particular the comment from Cyrus.
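One tweak commonly suggested in that context is forcing the C locale, which avoids locale-aware character handling and can speed up grep noticeably on some systems; it is worth a quick test against your original command:

LC_ALL=C grep -F -f smallF largeF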
You may want to try sift or ag. Sift in particular lists some pretty impressive benchmarks versus grep.
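For example, their simplest invocations mirror grep (shown for a single pattern; whether either tool offers an equivalent of grep's -f pattern-file option is something to check in its documentation):

sift stringToSearch1 largeF
ag stringToSearch1 largeF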
Note: I realise the following is not a bash-based solution, but given your large search space, a parallel solution is warranted.
If your machine has more than one core/processor, you could compile and call the following function with Pythran to parallelize the search:
#!/usr/bin/env python
#pythran export search_in_file(str, str)
def search_in_file(long_file_path, short_file_path):
    # Read both files up front: the large file as a list of lines,
    # the search strings stripped of their trailing newlines.
    long_lines = open(long_file_path, "r").readlines()
    patterns = [p.strip() for p in open(short_file_path, "r").readlines()]
    #omp parallel for schedule(guided)
    for i in range(len(patterns)):
        for line in long_lines:
            if patterns[i] in line:  # substring match, as with grep -F
                print(line.strip())  # order is nondeterministic; a line matching several patterns prints once per match

if __name__ == "__main__":
    search_in_file("long_file_path", "short_file_path")
Note: Behind the scenes, Pythran takes Python code and attempts to aggressively compile it into very fast C++.
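A sketch of how you might compile and invoke it, assuming the code above is saved as search_in_file.py (the -fopenmp flag is needed so the OpenMP annotation takes effect; pythran produces a native module importable under the same name):

pythran -fopenmp search_in_file.py
python -c 'import search_in_file; search_in_file.search_in_file("largeF", "smallF")'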