I'm trying to call a process on a file after part of it has been read. For example:
    with open('in.txt', 'r') as a, open('out.txt', 'w') as b:
        header = a.readline()
        subprocess.call(['sort'], stdin=a, stdout=b)
This works fine if I don't read anything from a before calling subprocess.call, but if I read anything from it first, the subprocess doesn't see any input. This is on Python 2.7.3. I can't find anything in the documentation that explains this behaviour, and a (very) brief glance at the subprocess source didn't reveal a cause.
If you open the file unbuffered then it works:
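A minimal sketch of the unbuffered variant, assuming binary mode (Python 3 only allows buffering=0 for binary files). A small Python one-liner stands in for sort so the example does not depend on a POSIX tool. SORT_CMD, sort_after_header and the temporary paths are illustrative names, not part of the original code:

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for the `sort` command (assumption: any child that reads stdin will do).
SORT_CMD = [sys.executable, "-c",
            "import sys; sys.stdout.writelines(sorted(sys.stdin))"]


def sort_after_header(in_path, out_path):
    """Read the header line, then let the child process sort the remaining lines."""
    # buffering=0 means readline() consumes exactly one line from the descriptor,
    # so the child starts reading right after the header.
    with open(in_path, "rb", 0) as a, open(out_path, "wb") as b:
        header = a.readline()
        subprocess.call(SORT_CMD, stdin=a, stdout=b)
    return header


# Build a small throwaway input file for the demonstration.
workdir = tempfile.mkdtemp()
in_path = os.path.join(workdir, "in.txt")
out_path = os.path.join(workdir, "out.txt")
with open(in_path, "wb") as f:
    f.write(b"header\nb\nc\na\n")

header = sort_after_header(in_path, out_path)
with open(out_path, "rb") as f:
    result = f.read()
```

Here result should contain only the lines after the header, sorted, because the file object never read ahead of the line it returned.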
The subprocess module works at the file descriptor level (the low-level, unbuffered I/O of the operating system). It may work with os.pipe(), socket.socket(), pty.openpty(), or anything else with a valid .fileno() method, if the OS supports it.

It is not recommended to mix buffered and unbuffered I/O on the same file.
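To illustrate the file descriptor point, here is a rough sketch that hands the read end of an os.pipe() straight to a child process as a bare integer descriptor, with no file object involved. The SORT_CMD one-liner is an assumed stand-in for sort, and the variable names are illustrative:

```python
import os
import subprocess
import sys

# Stand-in for the `sort` command (assumption: any child that reads stdin will do).
SORT_CMD = [sys.executable, "-c",
            "import sys; sys.stdout.writelines(sorted(sys.stdin))"]

read_fd, write_fd = os.pipe()
os.write(write_fd, b"b\nc\na\n")
os.close(write_fd)  # close the write end so the child sees end-of-file

# subprocess accepts an existing file descriptor (an integer) for stdin.
sorted_output = subprocess.check_output(SORT_CMD, stdin=read_fd)
os.close(read_fd)
```

Since nothing buffered ever touched the pipe, the child sees every byte that was written to it.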
On Python 2, calling file.flush() causes the output to appear.

The issue can be reproduced without the subprocess module by reading from the file descriptor with os.read().

If the buffer size is small, the rest of the file is printed.
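The os.read() reproduction might look roughly like this on Python 3. It is a sketch with a hypothetical helper, rest_after_header, and a throwaway temp file: one line is read through the file object, then the rest is read directly from the descriptor.

```python
import os
import tempfile


def rest_after_header(path, buffering):
    """Read one line via the file object, then read the rest via the raw descriptor."""
    with open(path, "rb", buffering) as f:
        f.readline()  # skip the header through the (possibly buffered) file object
        return os.read(f.fileno(), 1 << 20)


fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"header\nb\nc\na\n")

# Default buffering: the read-ahead already pulled this small file into the buffer,
# so nothing is left at the descriptor level.
buffered_rest = rest_after_header(path, -1)

# Unbuffered: the descriptor sits right after the header line.
unbuffered_rest = rest_after_header(path, 0)

# Tiny buffer: most of the file survives, but a few bytes may still be lost,
# depending on how the header length divides by the buffer size.
small_buffer_rest = rest_after_header(path, 2)

os.remove(path)
```

With a file this small, the default-buffered case leaves nothing for os.read(), while the unbuffered case returns everything after the header.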
It eats more input if the first line size is not evenly divisible by bufsize.

The default bufsize and bufsize=1 (line-buffered) behave similarly on my machine: the beginning of the file vanishes, around 4KB. file.tell() reports the position at the beginning of the 2nd line for all buffer sizes. Using next(file) instead of file.readline() leads to file.tell() around 5K on my machine on Python 2 due to the read-ahead buffer bug (io.open() gives the expected 2nd line position).

Trying file.seek(file.tell()) before the subprocess call doesn't help on Python 2 with the default stdio-based file objects. It works with the open() functions from the io and _pyio modules on Python 2, and with the default open (also io-based) on Python 3.

Trying the io and _pyio modules on Python 2 and Python 3, with and without file.flush(), produces various results. It confirms that mixing buffered and unbuffered I/O on the same file descriptor is not a good idea.

I solved it in Python 2.7 by aligning the file descriptor position.
    os.lseek(_file.fileno(), _file.tell(), os.SEEK_SET)
    truncate_null_cmd = ['tr', '-d', '\\000']
    subprocess.Popen(truncate_null_cmd, stdin=_file, stdout=subprocess.PIPE)
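A self-contained sketch of the same alignment trick, assuming a file opened in binary mode so that tell() returns a plain byte offset. The helper sort_rest, the SORT_CMD stand-in for an external command, and the temp file are illustrative and not taken from the snippet above:

```python
import os
import subprocess
import sys
import tempfile

# Stand-in for an external filter such as `sort` or `tr` (assumption).
SORT_CMD = [sys.executable, "-c",
            "import sys; sys.stdout.writelines(sorted(sys.stdin))"]


def sort_rest(path):
    """Read the header with a buffered file, realign the descriptor, sort the rest."""
    with open(path, "rb") as f:  # default buffering: the file object reads ahead
        header = f.readline()
        # Move the OS-level descriptor back to the file object's logical position.
        os.lseek(f.fileno(), f.tell(), os.SEEK_SET)
        return header, subprocess.check_output(SORT_CMD, stdin=f)


fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"header\nb\nc\na\n")

header, sorted_rest = sort_rest(path)
os.remove(path)
```

After the lseek the file object's own buffer no longer matches the descriptor position, so it is probably safest not to read from the file object again once the child has been started.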
It happens because the subprocess module extracts the file handle from the file object.
http://hg.python.org/releasing/2.7.6/file/ba31940588b6/Lib/subprocess.py
See line 1126, which is reached from line 701.
The file object uses buffers and has already read a lot from the file handle by the time subprocess extracts it.
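The gap between the two positions can be made visible with a short sketch (Python 3, binary mode). The variable names and the temp file are illustrative only:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"header\n" + b"x\n" * 10)  # 7 + 20 = 27 bytes in total

with open(path, "rb") as f:
    f.readline()
    # Where the buffered file object thinks it is: just past the header.
    logical_pos = f.tell()
    # Where the OS-level descriptor actually is: SEEK_CUR with offset 0 only queries it.
    descriptor_pos = os.lseek(f.fileno(), 0, os.SEEK_CUR)

os.remove(path)
```

For a file this small the descriptor is already at end-of-file after a single readline(), which is the position a child process would inherit.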