Escaping both types of quotes in subprocess.Popen

Posted 2019-08-17 14:14

Question:

My subprocess call should run tabix 1kg.phase1.snp.bed.gz -B test.bed | awk '{FS="\t";OFS="\t"} $4 >= 10', but it fails, apparently because the command contains both " and '. I have tried the r prefix for a raw string, but I can't find a combination that avoids the errors. My current call looks like this:

snp_tabix = subprocess.Popen(["tabix", tgp_snp, "-B", infile, "|", "awk", """'{FS="\t";OFS="\t"}""", "$4", ">=", maf_cut_off, r"'"], stdout=subprocess.PIPE)

This gives the error: TypeError: execv() arg 2 must contain only strings

Answer 1:

The r"'" is not the issue. Most likely maf_cut_off is an integer, and every element of the argument list must be a string. Pass str(maf_cut_off) instead.
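To see the difference without tabix installed, here is a minimal sketch. The child process is a stand-in (an assumption for illustration): a Python one-liner that just echoes its first argument.

```python
import subprocess
import sys

maf_cut_off = 10  # an int, as it presumably is in the question

# Stand-in child process (assumption): echoes its first argument.
child = [sys.executable, "-c", "import sys; print(sys.argv[1])"]

# Passing the int directly: subprocess rejects it with a TypeError.
try:
    subprocess.check_output(child + [maf_cut_off])
    int_error = None
except TypeError as exc:
    int_error = exc

# Converting to str first: the child receives "10" as an ordinary argument.
fixed_output = subprocess.check_output(child + [str(maf_cut_off)])
```

The exact TypeError message differs between Python versions, but the cause is the same: a non-string element in the argument list.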



Answer 2:

There are several issues. The command is a shell pipeline (it contains a pipe, |), and without a shell the | and everything after it are passed to tabix as literal arguments. So it won't work even if you convert all variables to strings.
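A small sketch of what happens to "|" in an argument list when no shell is involved. The child here is a stand-in (an assumption, not tabix) that prints the arguments it received:

```python
import subprocess
import sys

# Stand-in child process (assumption): prints the list of arguments it was given.
child = [sys.executable, "-c", "import sys; print(sys.argv[1:])"]

# Without shell=True, "|" is not a pipe; it is just another argument string.
out = subprocess.check_output(child + ["test.bed", "|", "awk"])
received = out.decode().strip()
print(received)
```

The "|" and "awk" entries show up in the child's own argument list, which is what tabix would see in the original call.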

You could execute it using shell:

from shlex import quote  # Python 3; on Python 2 use: from pipes import quote
from subprocess import check_output

cmd = r"""tabix %s -B %s | awk '{FS="\t";OFS="\t"} $4 >= %d'""" % (
    quote(tgp_snp), quote(infile), maf_cut_off)
output = check_output(cmd, shell=True)
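To check what the shell will actually receive, you can build the command string and print it before running it. A sketch, assuming Python 3's shlex.quote and a hypothetical file name containing a space:

```python
from shlex import quote

tgp_snp = "1kg phase1.snp.bed.gz"  # hypothetical name with a space, to show quoting
infile = "test.bed"
maf_cut_off = 10

# The r prefix keeps \t as backslash + t, so awk (not Python) interprets it.
cmd = r"""tabix %s -B %s | awk '{FS="\t";OFS="\t"} $4 >= %d'""" % (
    quote(tgp_snp), quote(infile), maf_cut_off)
print(cmd)
```

quote() wraps only the values that need it in single quotes, so the name with a space stays one shell word while test.bed is left unchanged.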

Or you could use the pipe recipe from the subprocess docs:

from subprocess import Popen, PIPE

tabix = Popen(["tabix", tgp_snp, "-B", infile], stdout=PIPE)
awk = Popen(["awk", r'{FS="\t";OFS="\t"} $4 >= %d' % maf_cut_off],
            stdin=tabix.stdout, stdout=PIPE)
tabix.stdout.close() # allow tabix to receive a SIGPIPE if awk exits
output = awk.communicate()[0]
tabix.wait()
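Note that with an argument list there is no shell, so the awk program needs no surrounding single quotes: it is simply one list element. A quick sketch of the string awk would receive (the value 10 is taken from the question):

```python
maf_cut_off = 10

# Raw string: "\t" stays as two characters (backslash, t) for awk to interpret.
awk_program = r'{FS="\t";OFS="\t"} $4 >= %d' % maf_cut_off
awk_args = ["awk", awk_program]

print(awk_program)
```

The double quotes are ordinary characters inside the Python string, and the only quoting left to think about is Python's own.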

Or you could use plumbum, which provides some syntactic sugar for shell commands:

from plumbum.cmd import tabix, awk

cmd = tabix[tgp_snp, '-B', infile]
cmd |= awk[r'{FS="\t";OFS="\t"} $4 >= %d' % maf_cut_off]
output = cmd() # run it and get output

Another option is to reproduce the awk command in pure Python. To get all lines whose 4th field is numerically greater than or equal to maf_cut_off (comparing as integers):

from subprocess import Popen, PIPE

tabix = Popen(["tabix", tgp_snp, "-B", infile], stdout=PIPE)
lines = []
for line in tabix.stdout:
    columns = line.split(b'\t', 4)
    if len(columns) > 3 and int(columns[3]) >= maf_cut_off:
        lines.append(line)
output = b''.join(lines)
tabix.communicate() # close streams, wait for the subprocess to exit

Here, tgp_snp and infile should be strings, and maf_cut_off should be an integer.
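The filtering logic can be tried out separately from tabix. A sketch, where filter_by_fourth_field is a hypothetical helper name and the sample lines are made up for illustration:

```python
def filter_by_fourth_field(lines, maf_cut_off):
    """Keep tab-separated byte lines whose 4th field is an int >= maf_cut_off."""
    kept = []
    for line in lines:
        columns = line.split(b"\t", 4)
        # Lines with fewer than 4 fields are skipped, as in the loop above.
        if len(columns) > 3 and int(columns[3]) >= maf_cut_off:
            kept.append(line)
    return b"".join(kept)


# Made-up sample input (assumption), standing in for tabix.stdout.
sample = [
    b"chr1\t100\t101\t12\trs1\n",
    b"chr1\t200\t201\t3\trs2\n",
    b"chr1\t300\t301\n",
]
output = filter_by_fourth_field(sample, 10)
```

Unlike awk, int() raises ValueError on a non-integer 4th field, so real data with floats or headers would need extra handling.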

On Python 2, where Popen() defaults to bufsize=0 (unbuffered), passing bufsize=-1 could improve performance. Python 3.3.1 and later already default to bufsize=-1.