Is there any way to use regex matching on a stream in Python? Something like:
reg = re.compile(r'\w+')
reg.match(StringIO.StringIO('aa aaa aa'))
I don't want to do this by getting the value of the whole string. I want to know if there's any way to match a regex on a stream (on the fly).
I had the same problem. My first thought was to implement a LazyString class, which acts like a string but only reads as much data from the stream as is currently needed (I did this by reimplementing __getitem__ and __iter__ to fetch and buffer characters up to the highest position accessed...).
This didn't work out (I got a "TypeError: expected string or buffer" from re.match), so I looked a bit into the implementation of the re module in the standard library.
Unfortunately, using regexes on a stream does not seem to be possible. The core of the module is implemented in C, and this implementation expects the whole input to be in memory at once (I guess mainly for performance reasons). There seems to be no easy way to fix this.
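To illustrate the failure described above, here is a minimal sketch that hands a stream object straight to re.match. It uses io.StringIO (the Python 3 counterpart of the StringIO module in the question); the exact wording of the error message varies by Python version.

```python
import io
import re

stream = io.StringIO("aa aaa aa")
error = None

try:
    # re only accepts str or bytes-like objects, not file-like streams.
    re.match(r"\w+", stream)
except TypeError as exc:
    error = exc
    print("re.match rejected the stream:", exc)
```

The same TypeError is raised for any object that merely imitates a string through __getitem__ and __iter__, because the C implementation asks for the underlying character buffer directly.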
I also had a look at PLY (Python Lex-Yacc), but its lexer uses re internally, so this wouldn't solve the issue.
A possibility could be to use ANTLR, which supports a Python backend. It constructs the lexer using pure Python code and seems to be able to operate on input streams. Since the problem is not that important for me (I do not expect my input to be extremely large...), I will probably not investigate that further, but it might be worth a look.
Yes - using the getvalue method:
This seems to be an old problem. As I have posted in reply to a similar question, you may want to subclass the Matcher class of my solution streamsearch-py and perform regex matching in the buffer. Check out kmp_example.py for a template. If it turns out classic Knuth-Morris-Pratt matching is all you need, then your problem would be solved right now with this little open source library :-)
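For readers unfamiliar with Knuth-Morris-Pratt matching on a stream, here is a minimal self-contained sketch. It is not the streamsearch-py API; the names build_failure_table and find_in_stream are hypothetical and only illustrate why KMP suits streams: it never needs to look back at data it has already consumed.

```python
import io


def build_failure_table(pattern):
    """table[i] = length of the longest proper prefix of pattern[:i + 1]
    that is also a suffix of it."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def find_in_stream(pattern, stream, chunk_size=4096):
    """Yield the start offset of every (possibly overlapping) occurrence
    of a fixed `pattern` in `stream`, reading one chunk at a time."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    table = build_failure_table(pattern)
    matched = 0  # how many pattern characters are currently matched
    offset = 0   # absolute position of the current character in the stream
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        for ch in chunk:
            # On a mismatch, fall back using the table instead of rereading input.
            while matched and ch != pattern[matched]:
                matched = table[matched - 1]
            if ch == pattern[matched]:
                matched += 1
            if matched == len(pattern):
                yield offset - len(pattern) + 1
                matched = table[matched - 1]
            offset += 1


positions = list(find_in_stream("aa", io.StringIO("aa aaa aa"), chunk_size=2))
print(positions)
```

Only the match state (a single integer) is carried across chunk boundaries, so memory use does not depend on the size of the input. This handles fixed strings only, not regular expressions.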
In the specific case of a file, if you can memory-map the file with mmap and if you're working with bytestrings instead of Unicode, you can feed a memory-mapped file to re as if it were a bytestring and it'll just work. This is limited by your address space, not your RAM, so a 64-bit machine with 8 GB of RAM can memory-map a 32 GB file just fine.
If you can do this, it's a really nice option. If you can't, you have to turn to messier options.
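A minimal sketch of the memory-mapping approach, assuming the data lives in a regular, non-empty file. A temporary file stands in for the real input here, and the pattern must be a bytes pattern because the map exposes bytes:

```python
import mmap
import re
import tempfile

with tempfile.TemporaryFile() as f:
    # Stand-in for a large file on disk.
    f.write(b"aa aaa aa\n" * 1000)
    f.flush()

    # Map the whole file read-only; pages are loaded by the OS on demand.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # re accepts the map directly because it is a bytes-like object.
        words = [m.group() for m in re.finditer(rb"\w+", mm)]
        first_span = re.search(rb"aaa", mm).span()

print(len(words), first_span)
```

Results are copied out of the map (as bytes objects and a span tuple) before it is closed, so nothing refers to the mapping afterwards.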
The 3rd-party regex module (not re) offers partial match support, which can be used to build streaming support... but it's messy and has plenty of caveats. Things like lookbehinds and ^ won't work, zero-width matches would be tricky to get right, and I don't know if it'd interact correctly with other advanced features regex offers and re doesn't. Still, it seems to be the closest thing to a complete solution available.
If you pass partial=True to regex.match, regex.fullmatch, regex.search, or regex.finditer, then in addition to reporting complete matches, regex will also report things that could be a match if the data was extended:
It'll report a partial match instead of a complete match if more data could change the match result, so for example,
regex.search(r'[\s\S]*', anything, partial=True)
will always be a partial match.
With this, you can keep a sliding window of data to match, extending it when you hit the end of the window and discarding consumed data from the beginning. Unfortunately, anything that would get confused by data disappearing from the start of the string won't work, so lookbehinds, ^, \b, and \B are out. Zero-width matches would also need careful handling. Here's a proof of concept that uses a sliding window over a file or file-like object: