I am using Python as a scripting language to do some data processing and to call command-line tools for number crunching. I want to run the command-line tools in parallel, since they are independent of each other. When a tool finishes, I can collect its results from its output file. So I also need some synchronization mechanism to notify my main Python program that a task has finished, so that the result can be parsed in my main program.
Currently, I use os.system(), which works fine single-threaded but cannot be parallelized.
Thanks!
Use the Pool object from the multiprocessing module. You can then use e.g. Pool.map() to do parallel processing. An example is my markphotos script (see below), where a function is called multiple times in parallel, each call processing one picture.
#! /usr/bin/env python
# -*- coding: utf-8 -*-
# Adds my copyright notice to photos.
#
# Author: R.F. Smith <rsmith@xs4all.nl>
# $Date: 2012-10-28 17:00:24 +0100 $
#
# To the extent possible under law, Roland Smith has waived all copyright and
# related or neighboring rights to markphotos.py. This work is published from
# the Netherlands. See http://creativecommons.org/publicdomain/zero/1.0/
import sys
import subprocess
from multiprocessing import Pool, Lock
from os import utime, devnull
import os.path
from time import mktime
globallock = Lock()
def processfile(name):
    """Adds copyright notice to the file.

    Arguments:
    name -- file to modify
    """
    args = ['exiftool', '-CreateDate', name]
    # universal_newlines=True returns text rather than bytes on Python 3.
    createdate = subprocess.check_output(args, universal_newlines=True)
    fields = createdate.split(":") #pylint: disable=E1103
    year = int(fields[1])
    cr = "R.F. Smith <rsmith@xs4all.nl> http://rsmith.home.xs4all.nl/"
    cmt = "Copyright © {} {}".format(year, cr)
    args = ['exiftool', '-Copyright="Copyright (C) {} {}"'.format(year, cr),
            '-Comment="{}"'.format(cmt), '-overwrite_original', '-q', name]
    rv = subprocess.call(args)
    modtime = int(mktime((year, int(fields[2]), int(fields[3][:2]),
                          int(fields[3][3:]), int(fields[4]), int(fields[5]),
                          0,0,-1)))
    utime(name, (modtime, modtime))
    # Serialize printing so messages from different workers don't interleave.
    with globallock:
        if rv == 0:
            print("File '{}' processed.".format(name))
        else:
            print("Error when processing file '{}'".format(name))
def checkfor(args):
    """Make sure that a program necessary for using this script is
    available.

    Arguments:
    args -- list of commands to pass to subprocess.call.
    """
    if isinstance(args, str):
        args = args.split()
    try:
        with open(devnull, 'w') as f:
            subprocess.call(args, stderr=subprocess.STDOUT, stdout=f)
    except OSError:
        # subprocess.call raises OSError when the program cannot be started.
        print("Required program '{}' not found! exiting.".format(args[0]))
        sys.exit(1)
def main(argv):
    """Main program.

    Arguments:
    argv -- command line arguments
    """
    if len(argv) == 1:
        binary = os.path.basename(argv[0])
        print("Usage: {} [file ...]".format(binary))
        sys.exit(0)
    checkfor(['exiftool', '-ver'])
    p = Pool()
    p.map(processfile, argv[1:])
    p.close()


if __name__ == '__main__':
    main(sys.argv)
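For the situation in the question (independent command-line tools whose results are collected as each one finishes), the same idea can be sketched more compactly. This is only an illustration under some assumptions: run_tool and run_all are hypothetical helper names, the Python interpreter stands in for the real number-crunching tool, and it uses multiprocessing.pool.ThreadPool instead of Pool, since each worker merely waits for an external process. imap_unordered hands back each result as soon as its command has finished.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool


def run_tool(args):
    """Run one command and return (args, exit status, captured stdout)."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            universal_newlines=True)
    out, _ = proc.communicate()  # reads all output, then waits for exit
    return args, proc.returncode, out


def run_all(commands, workers=4):
    """Run the commands in parallel; return results in completion order."""
    pool = ThreadPool(workers)
    try:
        # imap_unordered yields each result as soon as its command finishes.
        return list(pool.imap_unordered(run_tool, commands))
    finally:
        pool.close()
        pool.join()


if __name__ == '__main__':
    # Placeholder commands: the interpreter stands in for a real tool.
    cmds = [[sys.executable, '-c', 'print({})'.format(i * i)]
            for i in range(4)]
    for args, status, out in run_all(cmds):
        print("exit {}: {}".format(status, out.strip()))
```

If the tools write their results to files rather than to stdout, run_tool could parse the output file after communicate() returns and return the parsed data instead.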
If you want to run command-line tools as separate processes, just use os.system (or better: the subprocess module) to start them asynchronously. On Unix/Linux/macOS:
subprocess.call("command -flags arguments &", shell=True)
On Windows:
subprocess.call("start command -flags arguments", shell=True)
As for knowing when a command has finished: under Unix you could set something up with wait etc., but if you're writing the command-line scripts yourself, I'd just have them write a message into a file, and monitor that file from the calling Python script.
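The file-monitoring approach could look roughly like the sketch below. It assumes a convention that is not part of the answer: each tool creates an empty "marker" file as its very last step, after its real output is complete. wait_for_markers is a hypothetical helper name, and the polling interval and timeout are arbitrary.

```python
import os
import time


def wait_for_markers(paths, poll_interval=1.0, timeout=None):
    """Yield each marker path once, as soon as the file exists.

    Raises RuntimeError if `timeout` seconds pass with markers missing.
    """
    pending = set(paths)
    start = time.time()
    while pending:
        # sorted() makes a copy, so removing from the set here is safe.
        for path in sorted(pending):
            if os.path.exists(path):
                pending.discard(path)
                yield path
        if pending:
            if timeout is not None and time.time() - start > timeout:
                raise RuntimeError(
                    "timed out waiting for: {}".format(sorted(pending)))
            time.sleep(poll_interval)
```

The caller would loop over wait_for_markers([...]) and parse the matching output file each time a marker is reported.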
@James Youngman proposed a solution to your second question: synchronization. If you want to control your processes from Python, you could start them asynchronously with Popen. Pass the command as a list of arguments; a single string only works on Unix together with shell=True.
p1 = subprocess.Popen(["command1", "-flags", "arguments"])
p2 = subprocess.Popen(["command2", "-flags", "arguments"])
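Beware that if you connect the processes' stdout to a pipe (stdout=subprocess.PIPE) and they write a lot of data that you never read, the pipe buffer fills up, the processes block, and your program can deadlock. Redirecting all output to a log file avoids this.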
p1 and p2 are objects that you can use to keep tabs on your processes. p1.poll() will not block, but will return None if the process is still running. It will return the exit status when it is done, so you can check if it is zero.
import time

while True:
    time.sleep(60)
    for proc in [p1, p2]:
        status = proc.poll()
        if status is None:
            continue
        elif status == 0:
            pass  # harvest the answers
        else:
            print("command failed with status {}".format(status))
The above is just a model: As written, it will never exit, and it will keep "harvesting" the results of completed processes. But I trust you get the idea.
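For completeness, here is one way the model could be turned into a loop that terminates, handles each process exactly once, and sends output to log files. It is a sketch, not a definitive implementation: run_and_harvest, the label names and the log paths are all hypothetical, and the "harvest" step is reduced to recording the exit status.

```python
import subprocess
import sys
import time


def run_and_harvest(commands, poll_interval=1.0):
    """Start every command, then poll until all have exited.

    `commands` maps a label to (argument list, log file path).
    Returns a dict mapping each label to its exit status.
    """
    running = {}
    for label, (args, logpath) in commands.items():
        log = open(logpath, 'w')
        # Send both output streams to the log so no pipe can fill up.
        proc = subprocess.Popen(args, stdout=log, stderr=subprocess.STDOUT)
        running[label] = (proc, log)

    finished = {}
    while running:
        for label in list(running):
            proc, log = running[label]
            status = proc.poll()
            if status is None:
                continue  # still running
            log.close()
            del running[label]  # never look at this process again
            finished[label] = status  # harvest the results here
        if running:
            time.sleep(poll_interval)
    return finished


if __name__ == '__main__':
    # Placeholder command: the interpreter stands in for a real tool.
    jobs = {'demo': ([sys.executable, '-c', 'print("done")'], 'demo.log')}
    print(run_and_harvest(jobs, poll_interval=0.1))
```

A status of zero means the command succeeded and its output file can be parsed; any other value should be treated as a failure.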