I am trying to improve on a script which scans files for malicious code. We have a list of regex patterns in a file, one pattern on each line. These regexes are written for grep, as our current implementation is basically a bash find/grep combo. The bash script takes 358 seconds on my benchmark directory. I was able to write a Python script that did this in 72 seconds, but I want to improve on that. First I will post the base code and then the tweaks I have tried:
import os, sys, Queue, threading, re

fileList = []
rootDir = sys.argv[1]

class Recurser(threading.Thread):
    def __init__(self, queue, dir):
        self.queue = queue
        self.dir = dir
        threading.Thread.__init__(self)

    def run(self):
        self.addToQueue(self.dir)

    ## HELPER FUNCTION FOR INTERNAL USE ONLY
    def addToQueue(self, rootDir):
        for root, subFolders, files in os.walk(rootDir):
            for file in files:
                self.queue.put(os.path.join(root, file))
        # one -1 sentinel per possible scanner thread
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)
        self.queue.put(-1)

class Scanner(threading.Thread):
    def __init__(self, queue, patterns):
        self.queue = queue
        self.patterns = patterns
        threading.Thread.__init__(self)

    def run(self):
        nextFile = self.queue.get()
        while nextFile != -1:  # -1 is the end-of-queue sentinel
            #print "Trying " + nextFile
            self.scanFile(nextFile)
            nextFile = self.queue.get()

    ## HELPER FUNCTION FOR INTERNAL USE ONLY
    def scanFile(self, file):
        fp = open(file)
        contents = fp.read()
        fp.close()
        i = 0
        #for patt in self.patterns:
        if self.patterns.search(contents):
            print "Match " + str(i) + " found in " + file

############MAIN MAIN MAIN MAIN##################

fileQueue = Queue.Queue()

#Get the shell scanner patterns
patterns = []
fPatt = open('/root/patterns')
giantRE = '('
for line in fPatt:
    #patterns.append(re.compile(line.rstrip(), re.IGNORECASE))
    giantRE = giantRE + line.rstrip() + '|'
giantRE = giantRE[:-1] + ')'
giantRE = re.compile(giantRE, re.IGNORECASE)

#start recursing the directories
recurser = Recurser(fileQueue, rootDir)
recurser.start()

print "starting scanner"

#start checking the files
for scanner in xrange(0, 8):
    scanner = Scanner(fileQueue, giantRE)
    scanner.start()
This is obviously debugging/ugly code; do not mind the million queue.put(-1) calls, I will clean this up later.
Anyway, some things I've noticed. Using 1, 4, and even 8 threads (for scanner in xrange(0,???):) does not make a difference. I still get ~72 seconds regardless. I assume this is due to Python's GIL.
As opposed to making a giant regex, I tried placing each line (pattern) as a compiled RE in a list and iterating through this list in my scanFile function. This resulted in longer execution time.
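Roughly, that variant looked like this (a sketch of the idea, not the exact code I ran):

# Per-pattern variant: compile each line separately and test them in turn.
patterns = [re.compile(line.rstrip(), re.IGNORECASE)
            for line in open('/root/patterns') if line.rstrip()]

def scanFile(self, file):
    fp = open(file)
    contents = fp.read()
    fp.close()
    for i, patt in enumerate(self.patterns):
        if patt.search(contents):
            print "Match " + str(i) + " found in " + file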
In an effort to avoid Python's GIL, I tried having each thread fork off a grep subprocess, as in:
#HELPER FUNCTION FOR INTERNAL USE ONLY
def scanFile(self, file):
    # (requires "import subprocess" at the top of the script)
    s = subprocess.Popen(("grep", "-El", "--file=/root/patterns", file),
                         stdout=subprocess.PIPE)
    output = s.communicate()[0]
    if output != '':
        print 'Match found in ' + file
This resulted in longer execution time.
Any suggestions on improving performance?
:::::::::::::EDIT::::::::
I cannot post answers to my own question yet; however, here are the answers to several points raised:
@David Nehme - Just to let people know, I am aware that I have a million queue.put(-1)'s.
@Blender - To mark the bottom of the queue. My scanner threads keep dequeuing until they hit the -1 sentinel at the bottom (while nextFile != -1:). The machine has 8 processor cores; however, due to the GIL, using 1 thread, 4 threads, or 8 threads does NOT make a difference. Spawning 8 subprocesses resulted in significantly slower code (142 sec vs. 72).
@ed - Yes, that, and it's just as slow as the find/grep combo; actually slower, because it indiscriminately greps files that aren't needed.
@Ron - Can't upgrade, this must be universal. Do you think this will get below 72 seconds? The bash grepper does 358 seconds. My giant-RE Python method does 72 seconds with 1-8 threads. The popen method with 8 threads (8 subprocesses) ran at 142 seconds. So far the giant-RE, Python-only method is a clear winner by far.
@intuted
Here's the meat of our current find/grep combo (not my script). It's pretty simple. There are some additional things in there like ls, but nothing that should result in a 5x slowdown. Even if grep -r is slightly more efficient, 5x is a HUGE slowdown.
find "${TARGET}" -type f -size "${SZLIMIT}" -exec grep -Eaq --file="${HOME}/patterns" "{}" \; -and -ls | tee -a "${HOME}/found.txt"
The Python code is more efficient; I don't know why, but I tested it experimentally. I prefer to do this in Python. I already achieved a 5x speedup with Python, and I would like to speed it up more.
:::::::::::::WINNER WINNER WINNER:::::::::::::::::
Looks like we have a winner.
intuted's shell script comes in 2nd place with 34 seconds; however, @steveha's came in first with 24 seconds. Because a lot of our boxes do not have Python 2.6, I had to cx_freeze it. I can write a shell script wrapper to wget a tar and unpack it. I do like intuted's for its simplicity, however.
Thank you for all your help, guys; I now have an efficient tool for sysadmining.
I think that, rather than using the threading module, you should be using the multiprocessing module for your Python solution. Python threads can run afoul of the GIL; the GIL is not a problem if you simply have multiple Python processes going.

I think that for what you are doing, a pool of worker processes is just what you want. By default, the pool will default to one process for each core in your system processor. Just call the .map() method with a list of filenames to check and the function that does the checking.

http://docs.python.org/library/multiprocessing.html

If this is not faster than your threading implementation, then I don't think the GIL is your problem.

EDIT: Okay, I'm adding a working Python program. This uses a pool of worker processes to open each file and search for the pattern in each. When a worker finds a filename that matches, it simply prints it (to standard output), so you can redirect the output of this script into a file and you have your list of files.
EDIT: I think this is a slightly easier-to-read version, easier to understand.
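A minimal sketch of that approach, assuming the same /root/patterns file as in the question (the helper names here are illustrative, not from any library):

import multiprocessing, os, re, sys

def load_pattern(path='/root/patterns'):
    # Combine all patterns into one alternation, as in the question's giant RE.
    with open(path) as f:
        lines = [line.rstrip() for line in f if line.rstrip()]
    return re.compile('(' + '|'.join(lines) + ')', re.IGNORECASE)

pattern = load_pattern()  # forked workers inherit this compiled RE

def scan_file(path):
    # Return the path on a match, None otherwise.
    try:
        with open(path, 'rb') as fp:
            if pattern.search(fp.read()):
                return path
    except (IOError, OSError):
        pass
    return None

def all_files(root):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

if __name__ == '__main__':
    pool = multiprocessing.Pool()  # defaults to one worker per core
    for hit in pool.imap_unordered(scan_file, all_files(sys.argv[1])):
        if hit is not None:
            print hit
    pool.close()
    pool.join()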
I timed this, searching through the files in /usr/include on my computer. It completes the search in about half a second. Using find piped through xargs to run as few grep processes as possible, it takes about 0.05 seconds, about a 10x speedup. But I hate the baroque weird language you must use to get find to work properly, and I like the Python version. And perhaps on really big directories the disparity would be smaller, as part of the half second for Python must have been startup time. And maybe half a second is fast enough for most purposes!

I'm a bit confused as to how your Python script ended up being faster than your find/grep combo. If you want to use grep in a way somewhat similar to what's suggested by Ron Smith in his answer, you can do something like the command sketched below to launch grep processes which will process 100 files before exiting, keeping up to 8 such processes active at any one time. Having them process 100 files each should make the per-process startup overhead negligible.
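A sketch only, reusing ${TARGET} and the pattern file from your find command:

# -n 100: each grep handles up to 100 files; -P 8: up to 8 greps at once.
find "${TARGET}" -type f \
    | xargs -d \\n -n 100 -P 8 grep -El --file="${HOME}/patterns"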
note: The -d \\n option to xargs is a GNU extension which won't work on all POSIX-ish systems. It specifies that the *d*elimiter between filenames is a newline. Although technically filenames can contain newlines, in practice nobody does this and keeps their jobs. For compatibility with non-GNU xargs you need to add the -print0 option to find and use -0 instead of -d \\n with xargs. This will arrange for the null byte \0 (hex 0x00) to be used as the delimiter both by find and xargs.

You could also take the approach of first counting the number of files to be grepped and then using that number to get an even split among the 8 processes (assuming bash as your shell), with something like the sketch below.
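A sketch of that idea (same paths as above; the arithmetic rounds each process's share up so nothing is left over):

# Count the files first, then give each of the 8 greps an equal share.
nfiles=$(find "${TARGET}" -type f | wc -l)
find "${TARGET}" -type f \
    | xargs -d \\n -n $(( (nfiles + 7) / 8 )) -P 8 grep -El --file="${HOME}/patterns"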
I think this might work better because the disk I/O of find won't be interfering with the disk I/O of the various greps. I suppose it depends in part on how large the files are, and whether they are stored contiguously; with small files, the disk will be seeking a lot anyway, so it won't matter as much. Note also that, especially if you have a decent amount of RAM, subsequent runs of such a command will be faster because some of the files will be saved in your memory cache.

Of course, you can parameterize the 8 to make it easier to experiment with different numbers of concurrent processes.

As ed. mentions in the comments, it's quite possible that the performance of this approach will still be less impressive than that of a single-process grep -r. I guess it depends on the relative speed of your disk [array], the number of processors in your system, etc.

If you are willing to upgrade to version 3.2 or better, you can take advantage of concurrent.futures.ProcessPoolExecutor. I think it will improve performance over the popen method you attempted, because it will pre-create a pool of processes where your popen method creates a new process every time. You could write your own code to do the same thing for an earlier version if you can't move to 3.2 for some reason.
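A minimal sketch of that idea (Python 3.2+; the pattern loading mirrors the question's code, and the helper names are illustrative):

import concurrent.futures
import os, re, sys

def load_pattern(path='/root/patterns'):
    # Combine all patterns into one alternation, as in the question.
    with open(path) as f:
        parts = [line.rstrip() for line in f if line.rstrip()]
    return re.compile('(' + '|'.join(parts) + ')', re.IGNORECASE)

PATTERN = load_pattern()

def scan(path):
    # Return the path on a match; latin-1 decodes any byte sequence.
    try:
        with open(path, encoding='latin-1') as f:
            return path if PATTERN.search(f.read()) else None
    except (IOError, OSError):
        return None

def all_files(root):
    for dirpath, _, names in os.walk(root):
        for name in names:
            yield os.path.join(dirpath, name)

if __name__ == '__main__':
    # The worker pool is created once, up front, instead of one
    # new process per file as with the popen approach.
    with concurrent.futures.ProcessPoolExecutor() as pool:
        for hit in pool.map(scan, all_files(sys.argv[1])):
            if hit:
                print(hit)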