I have the following Java command line working fine on Mac OS.
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file.txt > output.txt
Multiple files can be passed as input, separated by spaces, as follows.
java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer file1.txt file2.txt > output.txt
Now I have 100 files in a folder, and all of these files have to be passed as input to this command. I used Python's os.system in a for loop over the directory as follows:
for i, f in enumerate(os.listdir(filedir)):
    os.system('java -cp "stanford-ner.jar" edu.stanford.nlp.process.PTBTokenizer "%s" > "annotate_%s.txt"' % (f, i))
This works fine only for the first file. For all the other outputs, like annotate_1 and annotate_2, it creates the file but with nothing inside it. I thought of looping over the files and passing each one to subprocess.Popen(), but that did not seem to help either.
Now I am thinking of passing the files one by one in a loop and executing the command sequentially from a bash script. I am also wondering whether I can execute at least 10 files in parallel in different terminals at a time. Any solution is fine, but I think this question will help me gain some insights into the different approaches.
To pass all .txt files in the current directory at once to the java subprocess (it is similar to running the shell command, but without running the shell):
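A minimal sketch, assuming the jar and the .txt files are in the current directory; the output filename is my choice:

#!/usr/bin/env python
from glob import glob
from subprocess import check_call

cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()
with open('output.txt', 'wb') as outfile:
    # roughly: java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer *.txt > output.txt
    check_call(cmd + glob('*.txt'), stdout=outfile)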
To run no more than 10 subprocesses at a time, passing no more than 100 files at a time, you could use multiprocessing.pool.ThreadPool:
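A sketch along those lines; the chunk size, pool size, and output file naming are my assumptions:

#!/usr/bin/env python
from glob import glob
from multiprocessing.pool import ThreadPool
from subprocess import check_call

cmd = 'java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer'.split()

def run(args):
    i, chunk = args
    # each thread writes to its own output file
    with open('output_%03d.txt' % i, 'wb') as outfile:
        return check_call(cmd + chunk, stdout=outfile)

files = glob('*.txt')
chunks = [files[i:i + 100] for i in range(0, len(files), 100)]  # at most 100 files per java process
pool = ThreadPool(10)  # at most 10 java processes at a time
for _ in pool.imap_unordered(run, enumerate(chunks)):
    pass
pool.close()
pool.join()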
It is similar to the xargs command shown below (suggested by @abarnert), except that each thread in the Python script writes to its own output file to avoid corrupting the output due to parallel writes.
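Roughly this, assuming GNU xargs (BSD/macOS xargs only accepts the short forms -n and -P):

ls *.txt | xargs --max-args=100 --max-procs=10 java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer > output.txt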
If you have 100 files, and you want to kick off 10 processes, each handling 10 files, all in parallel, that's easy.
First, you want to group them into chunks of 10. You can do this with slicing or with zipping iterators; in this case, since we definitely have a list, let's just use slicing:
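For example (the files and groups names are mine; filedir is the input directory from the question):

import os

files = [os.path.join(filedir, name) for name in os.listdir(filedir)]
groups = [files[i:i + 10] for i in range(0, len(files), 10)]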
Now, you want to kick off a process for each group, and then wait for all of the processes, instead of waiting for each one to finish before kicking off the next. This is impossible with os.system, which is one of the many reasons the os.system documentation says "The subprocess module provides more powerful facilities for spawning new processes…"

So, what do you pass on the command line to give it 10 filenames instead of 1? If none of the names have spaces or other special characters, you can just ' '.join them. But otherwise, it's a nightmare. Another reason subprocess is better: you can just pass a list of arguments:
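A sketch of that, reusing the groups from above and the java command from the question:

import subprocess

procs = []
for group in groups:
    cmd = ['java', '-cp', 'stanford-ner.jar',
           'edu.stanford.nlp.process.PTBTokenizer'] + group
    procs.append(subprocess.Popen(cmd))
for proc in procs:
    proc.wait()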
But now how do you get all of the results?

One way is to go back to using a shell command line with the > redirection. But a better way is to do it in Python, as in the sketch below. (You might want to use a with statement with ExitStack, but I wanted to make sure this didn't require Python 2.7/3.3+, so I used explicit close.)
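A sketch with one output file per group; the output file naming is an assumption:

import subprocess

outfiles = [open('output_%d.txt' % i, 'wb') for i in range(len(groups))]
procs = [subprocess.Popen(['java', '-cp', 'stanford-ner.jar',
                           'edu.stanford.nlp.process.PTBTokenizer'] + group,
                          stdout=outfile)
         for group, outfile in zip(groups, outfiles)]
for proc in procs:
    proc.wait()
for outfile in outfiles:
    outfile.close()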
If you want to do this from the shell instead of Python, the xargs tool can almost do everything you want. You give it a command with a fixed list of arguments, and feed it input with a bunch of filenames, and it'll run the command multiple times, using the same fixed list plus a different batch of filenames from its input. The --max-args option sets the size of the biggest group. If you want to run things in parallel, the --max-procs option lets you do that.

But that's not quite there, because it doesn't do the output redirection. But… do you really need 10 separate files instead of 1 big one? Because if 1 big one is OK, you can just redirect all of them to it:
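For example, assuming GNU xargs:

ls *.txt | xargs --max-args=10 --max-procs=10 java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer > output.txt
# note: with parallel processes, writes to the single output file may interleave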
Inside your input file directory you can do the following in bash:
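For example (a sketch; the annotate_ prefix for the output names follows the question):

for f in *.txt; do
    java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "$f" > "annotate_$f"
done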
If you want to run it as a script, save it with some name, e.g. my_exec.bash:
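A possible my_exec.bash, wrapping the same loop in a script:

#!/bin/bash
# Tokenize every .txt file in the current directory with the Stanford PTBTokenizer
for f in *.txt; do
    java -cp stanford-ner.jar edu.stanford.nlp.process.PTBTokenizer "$f" > "annotate_$f"
done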
Make it an executable file:
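chmod +x my_exec.bash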
USAGE:
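./my_exec.bash

(run it from inside the directory that contains the .txt files)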