My subprocess call should be calling tabix 1kg.phase1.snp.bed.gz -B test.bed | awk '{FS="\t";OFS="\t"} $4 >= 10' but is giving me errors because it has both " and ' in it. I have tried using r for a raw string but I can't figure out the right combination to prevent errors. My current call looks like:
snp_tabix = subprocess.Popen(["tabix", tgp_snp, "-B", infile, "|", "awk", """'{FS="\t";OFS="\t"}""", "$4", ">=", maf_cut_off, r"'"], stdout=subprocess.PIPE)
This gives the error TypeError: execv() arg 2 must contain only strings
The r"'" is not the issue. Most likely you're passing maf_cut_off as an integer, but every element of the argument list must be a string. Use str(maf_cut_off) instead.
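As a minimal sketch of this point, the snippet below passes an integer and then a string in the argument list. It uses the current Python interpreter as a stand-in child process instead of tabix, and the value of maf_cut_off is a made-up example. The exact wording of the TypeError differs between Python versions and platforms.

```python
import subprocess
import sys

maf_cut_off = 10  # hypothetical example value, stored as an integer

# Stand-in child process: prints its first command-line argument.
echo_arg = [sys.executable, "-c", "import sys; print(sys.argv[1])"]

# An integer in the argument list is rejected before the child starts.
try:
    subprocess.check_output(echo_arg + [maf_cut_off])
except TypeError as exc:
    error = exc
else:
    error = None

# Converting to a string first works.
out = subprocess.check_output(echo_arg + [str(maf_cut_off)])
print(repr(error))
print(out)
```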
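There are several issues. You are trying to execute a shell command (there is a pipe | in the command), but without shell=True no shell interprets the pipe: | and everything after it are passed to tabix as literal arguments. So it won't work even if you convert all variables to strings.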
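To see why a list of arguments cannot express a pipeline, here is a small sketch that runs a stand-in child (the current Python interpreter, not tabix) and has it print the arguments it received. The "|" arrives as an ordinary string.

```python
import subprocess
import sys

# Stand-in child process: prints the list of arguments it was given.
show_args = "import sys; print(sys.argv[1:])"

out = subprocess.check_output(
    [sys.executable, "-c", show_args, "input.bed", "|", "awk"])
received = out.decode().strip()
print(received)  # the pipe character is just another argument
```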
You could execute it using shell:
try:
    from shlex import quote  # Python 3.3+
except ImportError:
    from pipes import quote  # Python 2
from subprocess import check_output
cmd = r"""tabix %s -B %s | awk '{FS="\t";OFS="\t"} $4 >= %d'""" % (
    quote(tgp_snp), quote(infile), maf_cut_off)
output = check_output(cmd, shell=True)
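The quoting step matters as soon as a file name contains spaces or shell metacharacters. A minimal sketch of what the command string looks like, assuming Python 3 (where quote lives in shlex); build_command and the file name "my regions.bed" are hypothetical, and nothing is executed here:

```python
from shlex import quote  # Python 3.3+


def build_command(tgp_snp, infile, maf_cut_off):
    """Return the shell command string with both file names quoted."""
    return r"""tabix %s -B %s | awk '{FS="\t";OFS="\t"} $4 >= %d'""" % (
        quote(tgp_snp), quote(infile), maf_cut_off)


# A name without special characters is left alone; one with a space
# is wrapped in single quotes so the shell sees a single argument.
cmd = build_command("1kg.phase1.snp.bed.gz", "my regions.bed", 10)
print(cmd)
```

Because the template is a raw string, the \t sequences reach awk unchanged, and awk itself turns them into tabs.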
Or you could use the pipe recipe from the subprocess docs:
from subprocess import Popen, PIPE
tabix = Popen(["tabix", tgp_snp, "-B", infile], stdout=PIPE)
awk = Popen(["awk", r'{FS="\t";OFS="\t"} $4 >= %d' % maf_cut_off],
            stdin=tabix.stdout, stdout=PIPE)
tabix.stdout.close() # allow tabix to receive a SIGPIPE if awk exits
output = awk.communicate()[0]
tabix.wait()
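The same recipe can be tried without tabix or awk installed. The sketch below wraps it in a hypothetical helper, run_pipeline, and uses two small Python child processes as stand-ins: one prints two tab-separated records, the other keeps records whose 4th field is at least 10.

```python
import sys
from subprocess import Popen, PIPE


def run_pipeline(first_cmd, second_cmd):
    """Run first_cmd | second_cmd without a shell; return the output bytes."""
    first = Popen(first_cmd, stdout=PIPE)
    second = Popen(second_cmd, stdin=first.stdout, stdout=PIPE)
    first.stdout.close()  # let the first process get SIGPIPE if the second exits
    output = second.communicate()[0]
    first.wait()
    return output


# Stand-in for tabix: emits two tab-separated records.
producer = [sys.executable, "-c",
            r"print('a\t1\t2\t5'); print('b\t1\t2\t12')"]

# Stand-in for awk: keeps records whose 4th field is >= 10.
FILTER_SOURCE = r"""
import sys
for line in sys.stdin:
    if int(line.split('\t')[3]) >= 10:
        sys.stdout.write(line)
"""
consumer = [sys.executable, "-c", FILTER_SOURCE]

output = run_pipeline(producer, consumer)
print(output)
```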
Or you could use plumbum, which provides some syntactic sugar for shell commands:
from plumbum.cmd import tabix, awk
cmd = tabix[tgp_snp, '-B', infile]
cmd |= awk[r'{FS="\t";OFS="\t"} $4 >= %d' % maf_cut_off]
output = cmd() # run it and get output
Another option is to reproduce the awk command in pure Python. To get all lines whose 4th field is greater than or equal to maf_cut_off numerically (as an integer):
from subprocess import Popen, PIPE
tabix = Popen(["tabix", tgp_snp, "-B", infile], stdout=PIPE)
lines = []
for line in tabix.stdout:
    columns = line.split(b'\t', 4)
    if len(columns) > 3 and int(columns[3]) >= maf_cut_off:
        lines.append(line)
output = b''.join(lines)
tabix.communicate() # close streams, wait for the subprocess to exit
tgp_snp and infile should be strings and maf_cut_off should be an integer.
You could pass bufsize=-1 (a Popen() parameter) to improve time performance. On Python 3.3.1 and later this is already the default.
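The pure-Python filter above can also be factored into a function, which makes it easy to check on in-memory data before pointing it at tabix. This is only a sketch: filter_by_fourth_field and the sample records are hypothetical, and like the loop above it raises ValueError if a 4th field is not an integer.

```python
def filter_by_fourth_field(lines, cutoff):
    """Keep byte lines whose 4th tab-separated field is an int >= cutoff."""
    kept = []
    for line in lines:
        columns = line.split(b"\t", 4)
        # Lines with fewer than four fields are skipped.
        if len(columns) > 3 and int(columns[3]) >= cutoff:
            kept.append(line)
    return b"".join(kept)


# Made-up records: one passes, one is below the cutoff, one is too short.
sample = [b"1\t100\t101\t12\n", b"1\t200\t201\t3\n", b"short\tline\n"]
result = filter_by_fourth_field(sample, 10)
print(result)
```

With a real process, tabix.stdout can be passed as the lines argument, since iterating over it yields one bytes line at a time.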