I have a small problem with a simple tokenizer regex:
def test_tokenizer_regex_limit
  string = '<p>a</p>' * 400
  tokens = string.scan(/(<\s*tag:.*?\/?>)|((?:[^<]|\<(?!\s*tag:.*?\/?>))+)/)
end
Basically it runs through the text and gets pairs of [ matched_tag , other_text ]. Here's an example: http://rubular.com/r/f88JBjfzFh
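To make the shape of the result concrete, here is a small sketch. The sample string is made up, and the constant name `TOKENIZER` is only for illustration; the pattern itself is the one from the test.

```ruby
# Same pattern as in the test above; TOKENIZER is just an illustrative name.
TOKENIZER = /(<\s*tag:.*?\/?>)|((?:[^<]|\<(?!\s*tag:.*?\/?>))+)/

# Made-up sample input: ordinary markup with one tag: element inside it.
pairs = '<p>hello <tag:foo/> world</p>'.scan(TOKENIZER)

# Each element is [matched_tag, other_text]; exactly one of the two is nil.
# pairs == [[nil, "<p>hello "], ["<tag:foo/>", nil], [nil, " world</p>"]]
```

Note that ordinary markup such as `<p>` ends up in the text half, because only `tag:` elements match the first group.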
It works fine for smaller inputs. If you run it under Ruby 1.8.7 it will blow up; 1.9.2 works fine.
Any ideas on how to simplify or improve this? My regex-fu is weak.
This is a bit simpler, but not by much:
(<.*?>|[^<>]+)
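A rough sketch of how the simplified pattern above could be used with `scan`. The names `TOKEN`, `TAG` and `tokenize` are hypothetical, and the second pattern for spotting `tag:` elements after the fact is an assumption about what you want, not part of the pattern above. Because there is no alternation inside a repeated group, this should be gentler on the matcher, but it is untested on 1.8.7.

```ruby
# Simplified tokenizer: a token is either one <...> run or a run of text
# containing no angle brackets. The capture group is dropped so that scan
# returns plain strings instead of one-element arrays.
TOKEN = /<.*?>|[^<>]+/

# Hypothetical follow-up check: does a token look like a tag: element?
TAG = /\A<\s*tag:/

# Returns [kind, token] pairs, where kind is :tag or :text.
def tokenize(string)
  string.scan(TOKEN).map do |token|
    [token =~ TAG ? :tag : :text, token]
  end
end

tokens = tokenize('<p>a <tag:foo/> b</p>' * 400)
```

One difference from the original: ordinary markup like `<p>` comes back as its own token instead of being merged into the surrounding text, so adjacent `:text` tokens would need to be joined if you want the same pairs as before.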