Improving the readability of a FParsec parser

2019-07-06 20:54发布

问题:

I have a hand-written CSS parser done in C# which is getting unmanageable and was trying to do it i FParsec to make it more mantainable. Here's a snippet that parses a css selector element made with regexes:

var tagRegex = @"(?<Tag>(?:[a-zA-Z][_\-0-9a-zA-Z]*|\*))";
var idRegex = @"(?:#(?<Id>[a-zA-Z][_\-0-9a-zA-Z]*))";
var classesRegex = @"(?<Classes>(?:\.[a-zA-Z][_\-0-9a-zA-Z]*)+)";
var pseudoClassRegex = @"(?::(?<PseudoClass>link|visited|hover|active|before|after|first-line|first-letter))";
var selectorRegex = new Regex("(?:(?:" + tagRegex + "?" + idRegex + ")|" +
                                 "(?:" + tagRegex + "?" + classesRegex + ")|" +
                                  tagRegex + ")" +
                               pseudoClassRegex + "?");

var m = selectorRegex.Match(str);

if (m.Length != str.Length) {
    cssParserTraceSwitch.WriteLine("Unrecognized selector: " + str);
    return null;
}

string tagName = m.Groups["Tag"].Value;

string pseudoClassString = m.Groups["PseudoClass"].Value;
CssPseudoClass pseudoClass;
if (pseudoClassString.IsEmpty()) {
    pseudoClass = CssPseudoClass.None;
} else {
    switch (pseudoClassString.ToLower()) {
        case "link":
            pseudoClass = CssPseudoClass.Link;
            break;
        case "visited":
            pseudoClass = CssPseudoClass.Visited;
            break;
        case "hover":
            pseudoClass = CssPseudoClass.Hover;
            break;
        case "active":
            pseudoClass = CssPseudoClass.Active;
            break;
        case "before":
            pseudoClass = CssPseudoClass.Before;
            break;
        case "after":
            pseudoClass = CssPseudoClass.After;
            break;
        case "first-line":
            pseudoClass = CssPseudoClass.FirstLine;
            break;
        case "first-letter":
            pseudoClass = CssPseudoClass.FirstLetter;
            break;
        default:
            cssParserTraceSwitch.WriteLine("Unrecognized selector: " + str);
            return null;
    }
}

string cssClassesString = m.Groups["Classes"].Value;
string[] cssClasses = cssClassesString.IsEmpty() ? EmptyArray<string>.Instance : cssClassesString.Substring(1).Split('.');
allCssClasses.AddRange(cssClasses);

return new CssSelectorElement(
    tagName.ToLower(),
    cssClasses,
    m.Groups["Id"].Value,
    pseudoClass);

My first attempt yielded this:

type CssPseudoClass =
    | None = 0
    | Link = 1
    | Visited = 2
    | Hover = 3
    | Active = 4
    | Before = 5
    | After = 6
    | FirstLine = 7
    | FirstLetter = 8

type CssSelectorElement = 
    { Tag : string
      Id : string
      Classes : string list    
      PseudoClass : CssPseudoClass } 
with
    static member Default = 
        { Tag = "";
          Id = "";
          Classes = [];
          PseudoClass = CssPseudoClass.None; }

open FParsec

let ws = spaces
let str = skipString
let strWithResult str result = skipString str >>. preturn result

let identifier =
    let isIdentifierFirstChar c = isLetter c || c = '-'
    let isIdentifierChar c = isLetter c || isDigit c || c = '_' || c = '-'    
    optional (str "-") >>. many1Satisfy2L isIdentifierFirstChar isIdentifierChar "identifier"

let stringFromOptional strOption =
    match strOption with
    | Some(str) -> str
    | None -> ""

let pseudoClassFromOptional pseudoClassOption =
    match pseudoClassOption with
    | Some(pseudoClassOption) -> pseudoClassOption
    | None -> CssPseudoClass.None

let parseCssSelectorElement =
    let tag = identifier <?> "tagName"
    let id = str "#" >>. identifier <?> "#id"
    let classes = many1 (str "." >>. identifier) <?> ".className"
    let parseCssPseudoClass =
        choiceL [ strWithResult "link" CssPseudoClass.Link;
                  strWithResult "visited" CssPseudoClass.Visited;
                  strWithResult "hover" CssPseudoClass.Hover;
                  strWithResult "active" CssPseudoClass.Active;
                  strWithResult "before" CssPseudoClass.Before;
                  strWithResult "after" CssPseudoClass.After;
                  strWithResult "first-line" CssPseudoClass.FirstLine;
                  strWithResult "first-letter" CssPseudoClass.FirstLetter]
                 "pseudo-class"    
    // (tag?id|tag?classes|tag)pseudoClass?
    pipe2 ((pipe2 (opt tag) 
                  id
                  (fun tag id -> 
                      { CssSelectorElement.Default with 
                          Tag = stringFromOptional tag;
                          Id = id })) |> attempt
           <|>
           (pipe2 (opt tag) 
                  classes
                  (fun tag classes -> 
                      { CssSelectorElement.Default with 
                          Tag = stringFromOptional tag;
                          Classes = classes })) |> attempt
           <|>
           (tag |>> (fun tag -> { CssSelectorElement.Default with Tag = tag })))
           (opt (str ":" >>. parseCssPseudoClass) |> attempt)
           (fun selectorElem pseudoClass -> { selectorElem with PseudoClass = pseudoClassFromOptional pseudoClass })

But I'm not really liking how it's shaping up. I was expecting to come up with something easier to understand, but the part parsing (tag?id|tag?classes|tag)pseudoClass? with a few pipe2's and attempt's is really bad.

Came someone with more experience in FParsec educate me on better ways to accomplish this? I'm thinking on trying FSLex/Yacc or Boost.Spirit instead of FParsec is see if I can come up with nicer code with them

回答1:

As Mauricio said, if you find yourself repeating code in an FParsec parser, you can always factor out the common parts into a variable or custom combinator. This is one of the great advantages of combinator libraries.

However, in this case you could also simplify and optimize the parser by reorganizing the grammer a bit. You could, for example, replace the lower half of the parseCssSelectorElement parser with

let defSel = CssSelectorElement.Default

let pIdSelector = id |>> (fun str -> {defSel with Id = str})
let pClassesSelector = classes |>> (fun strs -> {defSel with Classes = strs})

let pSelectorMain = 
     choice [pIdSelector 
             pClassesSelector 
             pipe2 tag (pIdSelector <|> pClassesSelector <|>% defSel)
                   (fun tagStr sel -> {sel with Tag = tagStr})]

pipe2 pSelectorMain (opt (str ":" >>. parseCssPseudoClass))
      (fun sel optPseudo ->
           match optPseudo with
           | None -> sel
           | Some pseudo -> {sel with PseudoClass = pseudo})

By the way, if you want to parse a large number of string constants, it's more efficient to use a dictionary based parsers, like

let pCssPseudoClass : Parser<CssPseudoClass,unit> =
    let pseudoDict = dict ["link", CssPseudoClass.Link
                           "visited", CssPseudoClass.Visited
                           "hover", CssPseudoClass.Hover
                           "active", CssPseudoClass.Active
                           "before", CssPseudoClass.Before
                           "after", CssPseudoClass.After
                           "first-line", CssPseudoClass.FirstLine
                           "first-letter", CssPseudoClass.FirstLetter]        
    fun stream ->
        let reply = identifier stream            
        if reply.Status <> Ok then Reply(reply.Status, reply.Error)
        else 
            let mutable pseudo = CssPseudoClass.None
            if pseudoDict.TryGetValue(reply.Result, &pseudo) then Reply(pseudo)
            else // skip to beginning of invalid pseudo class                   
                stream.Skip(-reply.Result.Length)
                Reply(Error, messageError "unknown pseudo class")


回答2:

You could extract some parts of that complex parser to variables, e.g.:

let tagid = 
    pipe2 (opt tag) 
      id
      (fun tag id -> 
          { CssSelectorElement.Default with 
              Tag = stringFromOptional tag
              Id = id })

You could also try using an applicative interface, personally I find it easier to use and think than pipe2:

let tagid =
    (fun tag id -> 
          { CssSelectorElement.Default with 
              Tag = stringFromOptional tag
              Id = id })
    <!> opt tag
    <*> id