Building a Regex Based Parser [closed]

Posted 2020-02-04 21:57

Question:

Closed 7 years ago.

Is it stupid to build a regex based parser?

For work? Yes. For learning? No.

Answer 3:

The allure of parsing your own little languages with regular expressions cannot be overstated: most sysadmins could write a simple language parser entirely in Perl very quickly, but parsing the same language with lex/yacc would take most programmers a few hours.

And the Perl version would probably just about do the job. But as gpvos points out, using a regex backend for your parsing drastically reduces future enhancement options, and attempts to work around the limitations sometimes lead to some pretty awful code, when it would be easy to handle those enhancements with table-driven tools or handwritten recursive descent parsers.

If you know the language is always going to remain easily parseable with regex, you might do the right thing by spending an hour to get the job done, rather than four or five hours re-learning lex and yacc well enough to write a similar parser with stronger tools. But if the language is liable to grow or change much, using real parser generators will probably help in the long run.
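As a rough illustration of a language that stays "easily parseable with regex", here is a minimal Python sketch. The `key = value` line format, the `LINE_RE` pattern, and the `parse_config` function are all assumptions made up for this example, not something from the question.

```python
import re

# Hypothetical line format (an assumption for this sketch):
#   NAME = VALUE   # optional trailing comment
LINE_RE = re.compile(
    r"^\s*(?P<key>[A-Za-z_]\w*)\s*=\s*(?P<value>[^#]*?)\s*(?:#.*)?$"
)


def parse_config(text):
    """Parse 'key = value' lines into a dict; raise on anything else."""
    settings = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Skip blank lines and full-line comments.
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            raise ValueError(f"line {lineno}: cannot parse {line!r}")
        settings[match.group("key")] = match.group("value")
    return settings
```

This works only as long as every statement fits on one line and nothing nests. Once values can contain nested blocks or expressions, a single pattern per line stops being enough.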

Answer 4:

It depends on what you want to parse, but IMO for most practical cases the answer is "No". Regexes are quite limited in the grammars they can recognize (the exact limits depend on the regex implementation, since every engine adds its own extensions).

Since you stated in your comments that you're building a parser for VBScript, forget about regexes: you need to recognize a context-free grammar. Check out GOLD Parser or ANTLR.

Answer 5:

Often, regexes are used for the lexer (the recognizing of tokens), and something more powerful such as a recursive descent parser is used for recognizing the sequences of tokens, i.e., the actual parsing.

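For very simple languages, a regex could be enough, but you would be limiting yourself very much. For example, you cannot parse an expression like (1 + 2) * 3 - 4 using a regex alone, because parentheses can nest to arbitrary depth and a classical regular expression cannot keep track of that nesting.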
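The split described above (regex for the lexer, recursive descent for the parser) can be sketched in Python roughly as follows. This is only an illustration under assumed names (`TOKEN_RE`, `tokenize`, `Parser`, `evaluate`) and an assumed grammar of integers, `+ - * /`, and parentheses. It evaluates the expression directly instead of building a syntax tree.

```python
import re

# Lexer rule (assumed for this sketch): optional whitespace, then an
# integer or a single-character operator/parenthesis.
TOKEN_RE = re.compile(r"\s*(?:(?P<num>\d+)|(?P<op>[-+*/()]))")


def tokenize(text):
    """Lexer: turn the input into ('num', int) and ('op', str) tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        if match.group("num") is not None:
            tokens.append(("num", int(match.group("num"))))
        else:
            tokens.append(("op", match.group("op")))
        pos = match.end()
    return tokens


class Parser:
    """Recursive descent parser: one method per grammar rule."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("end", None)

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in (("op", "+"), ("op", "-")):
            _, op = self.advance()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        # term := factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in (("op", "*"), ("op", "/")):
            _, op = self.advance()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        # factor := NUMBER | '(' expr ')'
        # The recursive call to expr() is what handles nesting.
        kind, text = self.advance()
        if kind == "num":
            return text
        if (kind, text) == ("op", "("):
            value = self.expr()
            if self.advance() != ("op", ")"):
                raise SyntaxError("expected ')'")
            return value
        raise SyntaxError(f"unexpected token {text!r}")


def evaluate(text):
    """Tokenize, parse, and evaluate an arithmetic expression."""
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() != ("end", None):
        raise SyntaxError("trailing input")
    return value
```

With these definitions, `evaluate("(1 + 2) * 3 - 4")` returns `5`. The regex only ever sees one token at a time. The nesting is tracked by the call stack of `expr`, `term`, and `factor`.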

Answer 6:

Have a look at the GoldParser. It allows the use of regular expressions for finding the tokens.