How to read records terminated by custom separator

I would like a way to do for line in file in python, where the end of line is redefined to be any string that I want. Another way of saying that is I want to read records from file rather than lines; I want it to be equally fast and convenient to do as reading lines.

This is the python equivalent to setting perl's $/ input record separator, or using Scanner in java. This doesn't necessarily have to use for line in file (in particular, the iterator may not be a file object). Just something equivalent which avoids reading too much data into memory.

标签： python file io record separator

1条回答

ら.Afraid

2楼-- · 2019-01-15 15:59

There is nothing in the Python 2.x file object, or the Python 3.3 io classes, that lets you specify a custom delimiter for readline. (The for line in file is ultimately using the same code as readline.)

But it's pretty easy to build it yourself. For example:

def delimited(file, delimiter='\n', bufsize=4096):
    buf = ''
    while True:
        newbuf = file.read(bufsize)
        if not newbuf:
            yield buf
            return
        buf += newbuf
        lines = buf.split(delimiter)
        for line in lines[:-1]:
            yield line
        buf = lines[-1]

Here's a stupid example of it in action:

>>> s = io.StringIO('abcZZZdefZZZghiZZZjklZZZmnoZZZpqr')
>>> d = delimited(s, 'ZZZ', bufsize=2)
>>> list(d)
['abc', 'def', 'ghi', 'jkl', 'mno', 'pqr']

If you want to get it right for both binary and text files, especially in 3.x, it's a bit trickier. But if it only has to work for one or the other (and one language or the other), you can ignore that.

Likewise, if you're using Python 3.x (or using io objects in Python 2.x), and want to make use of the buffers that are already being maintained in a BufferedIOBase instead of just putting a buffer on top of the buffer, that's trickier. The io docs do explain how to do everything… but I don't know of any simple examples, so you're really going to have to read at least half of that page and skim the rest. (Of course, you could just use the raw files directly… but not if you want to find unicode delimiters…)

0人赞添加讨论(0) 举报

How to read records terminated by custom separator

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间