Python Stream Extraction

Posted 2019-06-23 16:46

Question:

The standard library of many programming languages includes a "scanner API" to extract strings, numbers, or other objects from text input streams. (For example, Java includes the Scanner class, C++ includes istream, and C includes scanf).

What is the equivalent of this in Python?

Python has a stream interface, i.e. classes that inherit from io.IOBase. However, the Python TextIOBase stream interface only provides facilities for line-oriented input. After reading the documentation and searching on Google, I can't find something in the standard Python modules that would let me, for example, extract an integer from a text stream, or extract the next space-delimited word as a string. Are there any standard facilities to do this?

Answer 1:

There is no equivalent of fscanf or Java's Scanner. The simplest solution is to require newline-separated input instead of space-separated input; you can then read line by line and convert each line to the correct type.
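
As a rough sketch of the line-per-value approach (the read_values helper and its converters argument are hypothetical names chosen for illustration, not part of the standard library), each line is read with readline and passed to the matching conversion function:

```python
import io


def read_values(stream, converters):
    """Read one value per line and convert each with the matching callable."""
    values = []
    for convert in converters:
        line = stream.readline()
        if not line:
            # readline returns '' only at end of input
            raise ValueError('Unexpected end of input.')
        values.append(convert(line.strip()))
    return values


stream = io.StringIO('1234\nHello\n-13.48\n')
print(read_values(stream, [int, str, float]))
```

Any text stream works here, including sys.stdin; a conversion failure surfaces as the ValueError raised by int or float.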

If you want the user to provide more structured input, you should probably write a parser for it. There are some nice parsing libraries for Python, for example pyparsing. There is also a scanf module, although its last update dates from 2008.

If you don't want external dependencies, you can use regexes to match the input sequences. Regexes only work on strings, but you can overcome this limitation by reading the stream in chunks. For example, something like this should work well most of the time:

import re


FORMATS_TYPES = {
    'd': int,
    'f': float,
    's': str,
}


FORMATS_REGEXES = {    
    'd': re.compile(r'(?:\s|\b)*([+-]?\d+)(?:\s|\b)*'),
    'f': re.compile(r'(?:\s|\b)*([+-]?\d+\.?\d*)(?:\s|\b)*'),
    's': re.compile(r'\b(\w+)\b'),
}


FORMAT_FIELD_REGEX = re.compile(r'%(s|d|f)')


def scan_input(format_string, stream, max_size=float('+inf'), chunk_size=1024):
    """Scan an input stream and retrieve formatted input."""

    chunk = ''
    format_fields = format_string.split()[::-1]
    while format_fields:
        fields = FORMAT_FIELD_REGEX.findall(format_fields.pop())
        if not chunk:
            chunk = _get_chunk(stream, chunk_size)

        for field in fields:
            field_regex = FORMATS_REGEXES[field]
            match = field_regex.search(chunk)
            # A match that touches the end of the chunk may be truncated, so
            # keep reading until it ends inside the chunk or input runs out.
            while match is None or match.end() >= len(chunk):
                more = _get_chunk(stream, chunk_size)
                if not more:
                    break
                chunk += more
                match = field_regex.search(chunk)
            if match is None:
                raise ValueError('Missing fields.')
            text = match.group(1)
            yield FORMATS_TYPES[field](text)
            chunk = chunk[match.end():]



def _get_chunk(stream, chunk_size):
    try:
        return stream.read(chunk_size)
    except EOFError:
        return ''

Example usage:

>>> from io import StringIO
>>> s = StringIO('1234 Hello World -13.48 -678 12.45')
>>> for data in scan_input('%d %s %s %f %d %f', s): print(repr(data))
...
1234
'Hello'
'World'
-13.48
-678
12.45

You'll probably have to extend this and test it properly, but it should give you some ideas.



Answer 2:

There is no direct equivalent (as far as I know). However, you can do pretty much the same thing with regular expressions (see the re module).

For instance:

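import re

# `string` is the text to scan. re.search finds the first match anywhere in
# it, whereas re.match only matches at the very start of the string.

# matching first integer (space delimited)
re.search(r'\b(\d+)\b', string)

# matching first space delimited word
re.search(r'\b(\w+)\b', string)

# matching a word followed by an integer (space delimited)
re.search(r'\b(\w+)\s+(\d+)\b', string)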

It requires a little more work than the usual C-style scanner interface, but it is also very flexible and powerful. You will have to process stream I/O yourself, though.
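
One possible way to handle the stream I/O part is a small generator that reads the stream line by line and yields whitespace-delimited tokens, leaving type conversion to the caller. This is only a sketch: the tokens function and the TOKEN pattern are hypothetical names, and it assumes tokens never span lines (which holds because a newline is itself whitespace).

```python
import io
import re

# One or more non-whitespace characters, i.e. a space-delimited token.
TOKEN = re.compile(r'\S+')


def tokens(stream):
    """Yield whitespace-delimited tokens from a text stream, line by line."""
    for line in stream:
        for match in TOKEN.finditer(line):
            yield match.group()


words = tokens(io.StringIO('1234 Hello\nWorld -13.48\n'))
number = int(next(words))    # extract an integer
first = next(words)          # extract the next word as a string
second = next(words)
value = float(next(words))   # extract a float
print(number, first, second, value)
```

Because the generator is lazy, it reads only as many lines as the caller consumes, which is close to how Java's Scanner is typically used with next() and nextInt().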