FParsec: backtracking `sepBy`

2019-06-21 16:22发布

问题:

Consider the following toy grammar and parser:

(* in EBNF:
  ap = "a", { "ba" }
  bp = ap, "bc"
*)
let ap = sepBy1 (pstring "a") (pstring "b")
let bp = ap .>> (pstring "bc")
let test = run bp "abababc"

I get the following output:

Error in Ln: 1 Col: 7
abababc
      ^
Expecting: 'a'

Clearly sepBy1 sees the last b and expects it to lead into another a, failing when it doesn't find one. Is there a variant of sepBy1 which would backtrack over b and make this parse succeed? Is there any reason why I shouldn't use that instead?

回答1:

This is one way to implement such a variant of sepBy1:

let backtrackingSepBy1 p sep = pipe2 p (many (sep >>? p)) (fun hd tl -> hd::tl)

Avoiding backtracking in your parser grammar usually makes the parser faster, more portable and easier to debug. Hence, if you have the chance to avoid backtracking by restructuring the grammar (without complicating it too much), I'd recommend doing it.



回答2:

I've been looking through the FParsec source for the sepBy family of functions, and it looks like there's no way (currently) to do precisely what you're asking for with any of the sepBy functions. However, you might be able to approximate it well enough if your separator parser is simple.

To approximate what you're looking for, you could try using sepEndBy, or more likely sepEndBy1 in your case. Its intended use is to consume things like Python syntax for lists, where the final item is allowed to have a trailing comma: [ 1, 2, 3, ]. Using sepEndBy1 will consume the final separator, because that's what you want in its intended use. In your use case, though, you would have to work around the fact that sepEndBy1 consumes the final separator. E.g.,

// This parser will be either a separator *or* a prefix of the next item
let sharedParser = pstring "b"

let ap = sepEndBy1 (pstring "a") sharedParser
let restOfBp = pstring "c"
let bp = restOfBp <|> (pipe2 sharedParser restOfBp (fun b c -> b + c))

Here, the combining function in pipe2 is simple string concatenation, but in your actual usage scenario it might be different. The key idea is that if you are able to write a simple function to combine your b parser with your c parser to produce the bc result, and if the c parser is not too complicated, then this will work for you. If c is extremely complicated but b is simple, then just reverse the order of the two sides of <|>, so that b is tested first and c is only tested after b has either succeeded or failed; that way you won't have to apply the c parser twice, as you would in the sample code I just posted. I wrote it the way I did because that's slightly easier to understand.

If sepEndBy1 does not work for you, because both the b and c parsers in your real-life case are too expensive to parse twice, then the only solution I can think of would be to ask for this new sepBy variant to be added to FParsec as a new feature. My look through the FParsec code suggests that it would be possible to rewrite the implementation to optionally backtrack to before the final separator, but that it would need some significant testing and consideration of edge cases so it wouldn't be a five-minute fix. But if the sepEndBy1 workaround doesn't work for you, then go ahead and submit that feature request.



回答3:

I'm not sure if the returned value matters, but if not, then the easiest way to go is to transcribe the grammar more closely:

let ap = pstring "a" >>. many (pstring "ba")
let bp = ap .>> pstring "bc"

Note that the pstring "ba" will not cause the same problem you had, because pstring cannot consume only part of its argument; either it consumes a full "ba", or it fails without consuming anything and therefore bp can successfully see a b if there is one.

Another possibility , if this is indeed the full grammar (ie ap is not used in another production elsewhere), is to change it to the following EBNF, which is equivalent:

ap = "ab", { "ab" }
cp = ap, "c"

This makes it easier to implement with FParsec:

let ap = many1 (pstring "ab")
let cp = ap .>> pstring "c"