Question:
This is just a question out of curiosity, since I have been needing to get more and more into parsing and using regex lately. For the parsing questions I come across in my searches, when something relating to regex is asked, someone always ends up saying "regex isn't good for that, use such-and-such parser instead". As I have come to better understand regex, I think most stuff is possible, it's just rather complex and time-consuming, since you have to account for many different possibilities, and of course it has to be combined with conditional statements and loops to build any sort of parser. So I'm wondering whether regex is what is used to build most parsers, or whether there is some other method being used. I ask because I may need to build some fairly complex custom parsers where there isn't necessarily an existing one to use.
Thanks for any info, as I can't seem to find a direct answer to this.
Answer 1:
Typically, you'll use two (at least) types of tools in building your parser.
The first part is lexical analysis -- separating characters into tokens and filtering out comments and whitespace. That part is typically done with regular expressions. Well, it's even more typically done using a scanner generator that converts a collection of pairs of regular expressions and code into a program that executes the corresponding code when it recognizes the regular expressions. This turns out to be more efficient than testing each regular expression each time, and it also works better for various other reasons. FLEX is a common tool for this in C.
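To give a feel for that first part, here is a rough sketch in Python using the standard re module; the token names and the toy input are made up for illustration, and a real scanner generator like FLEX compiles all the patterns into a single efficient automaton rather than trying them one at a time:

import re

# Illustrative token definitions for a toy language (the names are arbitrary).
TOKEN_SPEC = [
    ("NUMBER",   r"\d+(\.\d+)?"),     # integer or decimal literal
    ("IDENT",    r"[A-Za-z_]\w*"),    # identifier
    ("OP",       r"[+\-*/=()]"),      # single-character operators
    ("COMMENT",  r"#[^\n]*"),         # comments get filtered out
    ("SKIP",     r"\s+"),             # whitespace gets filtered out
    ("MISMATCH", r"."),               # anything else is an error
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    for match in MASTER_RE.finditer(text):
        kind = match.lastgroup
        if kind in ("COMMENT", "SKIP"):
            continue                                  # filter out comments and whitespace
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {match.group()!r}")
        yield (kind, match.group())

print(list(tokenize("x = 3 + 42 * y  # a comment")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'), ...]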
The second part of your parser is the grammar. The most typical tool for this is another program-generator that accepts a context-free grammar (CFG) annotated with rules for interpreting the component "parts of speech", as it were. A CFG is able to express things like balanced parentheses, which a regular expression cannot (unless it's been extended with CF features, making it not strictly "regular" in the mathematical sense). But a CFG with rules is very nice because you can attach a semantic meaning to the phrase structure of your language. BISON is a common tool for this part in C.
But I actually told you a little lie. You see, every real programming language has parts that cannot be expressed within a context-free framework. For example, you need to connect the definition of a variable with the use of it so that you know what instructions to generate, and also if an operation on it is legal. That's typically considered outside the scope of parsing, but there are such things as "attribute grammars" which are like CFGs extended with features that can make even these context-dependencies much easier to code up and work with.
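As a tiny, purely hypothetical illustration of such a context-dependent rule (say, "every variable must be defined before it is used"), which no plain CFG can enforce, a check like this usually runs over the parse result rather than inside the grammar:

# Hypothetical sketch: a definition-before-use check, i.e. the kind of
# context-dependent rule that lives outside the context-free grammar.
def check_defined_before_use(statements):
    """Each statement is (defined_name, [names_used_on_the_right_hand_side])."""
    defined = set()
    for target, used in statements:
        for name in used:
            if name not in defined:
                raise NameError(f"{name!r} used before it is defined")
        defined.add(target)

check_defined_before_use([("a", []), ("b", ["a"]), ("c", ["a", "b"])])  # fine
# check_defined_before_use([("b", ["a"])])  # would raise NameError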
Now, there's no rule that says you HAVE to use such tools. Many simple grammars are easy enough to process with hand-written tools. For example, LISP's S-expressions can be simply scanned as:
If it starts with a digit, read a number.
If it starts with a letter, read a symbol.
If it's a space, skip it.
If it's an open-paren, then skip it, recurse this routine for a value, and expect a close paren.
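In Python that routine might look roughly like this (a hand-written sketch, assuming only integers, symbols, and parenthesized lists):

# A hand-written sketch of that routine (integers, symbols, and lists only).
def read_sexpr(text, i=0):
    while i < len(text) and text[i].isspace():          # if it's a space, skip it
        i += 1
    if i == len(text):
        raise SyntaxError("unexpected end of input")
    if text[i] == '(':                                   # open paren: recurse for values...
        i += 1
        items = []
        while True:
            while i < len(text) and text[i].isspace():
                i += 1
            if i < len(text) and text[i] == ')':         # ...and expect a close paren
                return items, i + 1
            value, i = read_sexpr(text, i)
            items.append(value)
    elif text[i].isdigit():                              # starts with a digit: read a number
        j = i
        while j < len(text) and text[j].isdigit():
            j += 1
        return int(text[i:j]), j
    else:                                                # starts with a letter: read a symbol
        j = i
        while j < len(text) and not text[j].isspace() and text[j] not in "()":
            j += 1
        return text[i:j], j

expr, _ = read_sexpr("(+ 1 (* 2 3))")
print(expr)   # ['+', 1, ['*', 2, 3]]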
Well, there are a few more complications for strings and what-have-you, but that's the basic idea. Parsing FORTH is even simpler, because it doesn't build a recursive data structure.
Anyway, that should get you going on whatever your project is.
Answer 2:
No, parsers are built from grammars.
But most compilers (interpreters) will use a separate scanner (lexer) to recognize the input-tokens. A scanner can be specified with regular expressions, but afaik they are not built using the usual RegEx library classes.
A separate scanner is a practical approach. It is possible to define a full grammar all the way down to the character level, but that is impractical. Regular expressions handle the token-level end of the grammar more easily.
For reference, see Yacc and Lex. They both have modern successors.
Answer 3:
Well, building a parser is pretty complex, and you can use regex, but that's not the only thing you use. I suggest reading the Dragon Book.
These days, in my opinion, you should use a parser generator, because while you can do it from scratch, it's neither simple nor quick. Generally speaking, you have to consider regex and finite state automata for lexical analysis; context-free grammars, LL parsers, bottom-up parsers, and LR parsers for syntax analysis; and so on.
Answer 4:
Regexes can be used to parse a certain class of languages (finite-state, i.e. regular, languages), but their power is limited compared to other formalisms, and as you mention, they quickly become unwieldy and hard to maintain.
For example, it is not possible to write a regex that ensures each open parenthesis has a matching close parenthesis - something that most languages have in their expression syntax.
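The usual workaround outside of regexes is a counter or a stack, which is exactly the extra state a parser keeps; a rough sketch:

# Balanced parentheses need a counter (or stack), i.e. state that a single
# classical regular expression cannot keep across the whole input.
def parens_balanced(text):
    depth = 0
    for ch in text:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:          # a ')' with no matching '(' before it
                return False
    return depth == 0

print(parens_balanced("(a + (b * c))"))   # True
print(parens_balanced("(a + (b * c)"))    # False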
Regexes are usually used to do tokenization, and the tokens are then combined to create the desired syntax.
Answer 5:
(Most) parsers are created for recursive languages, i.e. languages that have recursive features. RegExps can't handle recursion, so they aren't used for parser construction (without extra hacks a la Perl Markdown). However, RegExps are used for developing lexers, as they make life much easier that way.
Answer 6:
A 'regex' as you know it is a particular notation for creating deterministic finite automata. A DFA is a parsing device, and thus regexps do parse. When you use regexps to match something, you are parsing a string to align it with the pattern. When you use regexps to chop something up into bits with parentheses, you are parsing.
DFAs are formally defined as parsers for a particular category of languages called 'regular languages' (thanks to Gumbo for reminding me). Many important tasks do not involve regular languages.
Thus, DFAs are not a good approach to many parsing problems. The most famous examples around here are XML and HTML. There are many reasons, but I'll fill in one. These things are fundamentally tree structures. To parse them, a program has to maintain state as it descends the tree. Regexps don't do that.
Parsers defined by a grammar (such as LR(k) and LL(k)) do that, and so do top-down hand-coded parsers.
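As a rough sketch of what "maintaining state" means in practice (a hypothetical tag syntax, not a real XML parser): the regex only finds the individual tags, while a stack tracks where we are in the tree:

import re

# Hypothetical sketch: the regex does the tokenization (finding tags),
# but a stack is needed to check that they nest properly (the tree state).
TAG = re.compile(r"<(/?)([A-Za-z]+)>")

def tags_well_nested(text):
    stack = []
    for closing, name in TAG.findall(text):
        if not closing:
            stack.append(name)                      # descend into the tree
        elif not stack or stack.pop() != name:      # a close tag must match the innermost open tag
            return False
    return not stack                                # everything opened was closed

print(tags_well_nested("<a><b></b></a>"))   # True
print(tags_well_nested("<a><b></a></b>"))   # False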
There are books and books on the various alternative parsing technologies that are commonly applied to parsing things like C++, or XML.
Answer 7:
Regular expressions are defined over arbitrary tokens, but most programmers encounter them only in the context of strings of characters, and so it is easy to believe they are only useful for strings.
As a pure capability, regular expressions (actually, a single regular expression) cannot parse any language that requires a context-free grammar.
What makes context-free grammars different from regular expressions is that you can define a large set of named "recognizers" for subgrammars of a language, which can refer to one another recursively. These rules can all be limited to just the simple form of:
LHS = RHS1 RHS2 ... RHSn ;
(so call "Backus Naur form" or BNF) where each LHS and RHSi are names primitive language elements or nonterminals in the langauge. (I build a very complex language processing tool that uses just this form; you need more rules but it is very usable).
But most people writing grammars want a more expressive form, and so use an "extended BNF" (EBNF). If you examine these EBNFs closely, what they generally do is add the ideas from regular expressions (alternation, Kleene star/plus) to the BNF formalism. Thus you can find EBNFs with "*" and "+".
So, what follows is an EBNF for itself, using regexp ideas:
EBNF = RULE+ ;
RULE = IDENTIFIER '=' ALTERNATIVES ';' ;
ALTERNATIVES = RHS ( '|' RHS )* ;
RHS = ITEM* ;
ITEM = IDENTIFIER | QUOTEDTOKEN | '(' ALTERNATIVES ')' | ITEM ( '*' | '+' ) ;
So, regular expression ideas can be used to express grammars. A parser generator that accepts such notation (including you doing it by hand) is needed to generate a parser from a grammar instance.
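To make that connection concrete, here is a rough hand-written sketch of how such rules map onto code, using a made-up toy grammar (not the EBNF above): each nonterminal becomes a function, alternation becomes if/elif, and the Kleene star becomes a loop:

# Toy grammar (hypothetical, for illustration only):
#   EXPR = TERM ( '+' TERM )* ;
#   TERM = NUMBER | '(' EXPR ')' ;
def parse_expr(tokens, i):
    value, i = parse_term(tokens, i)
    while i < len(tokens) and tokens[i] == '+':     # the "( '+' TERM )*" becomes a loop
        rhs, i = parse_term(tokens, i + 1)
        value += rhs
    return value, i

def parse_term(tokens, i):
    if tokens[i] == '(':                            # alternation becomes if/elif
        value, i = parse_expr(tokens, i + 1)
        assert tokens[i] == ')', "expected ')'"
        return value, i + 1
    return int(tokens[i]), i + 1                    # NUMBER

tokens = ['(', '1', '+', '2', ')', '+', '3']
print(parse_expr(tokens, 0))    # (6, 7): the value and the position after the last token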
Answer 8:
Typically, you use some sort of pattern-matching (not necessarily regular expressions) in a lexer, to turn your stream of characters into a stream of tokens, and have your parser look at those tokens instead of the raw character input.
If you're looking to produce your own parsers, you're probably better off looking at something like Bison to assist with that.