How do I make an Attoparsec parser succeed without consuming input?

I wrote a quick attoparsec parser to walk an aspx file and drop all the style attributes, and it's working fine except for one piece of it where I can't figure out how to make it succeed on matching > without consuming it.

Here's what I have:

anyTill = manyTill anyChar
anyBetween start end = start *> anyTill end

styleWithQuotes = anyBetween (stringCI "style=\"") (stringCI "\"")
styleWithoutQuotes = anyBetween (stringCI "style=") (stringCI " " <|> ">")
everythingButStyles = manyTill anyChar (styleWithQuotes <|> styleWithoutQuotes) <|> many1 anyChar

I understand this is partly down to how I'm using manyTill in everythingButStyles; that is how I actively drop all the style stuff on the floor. But in styleWithoutQuotes I need it to match ">" as an end marker without consuming it. In parsec I would just have used lookAhead ">", but I can't do that in attoparsec.

Tags: parsing haskell attoparsec

2 Answers

Explosion°爆炸

#2 · 2019-03-31 01:40

anyBetween start end = start *> anyTill end

Your anyBetween parser eats its last character because anyTill does: manyTill is designed to parse up to an end marker and consume it, on the assumption that you don't want the end marker left in the input to be parsed again.

Notice that your end parsers are all single character parsers, so we can change the functionality to make use of this:

anyBetween'' start ends = start *> many (satisfy (not.flip elem ends))

but many isn't as efficient as Attoparsec's takeWhile, which you should use as much as possible, so if you've done

import qualified Data.Attoparsec.Text as A

then

anyBetween' start ends = start *> A.takeWhile (not.flip elem ends)

should do the trick, and we can rewrite

styleWithoutQuotes = anyBetween' (stringCI "style=") [' ','>']

If you want it to eat the ' ' but not the '>' you can explicitly eat spaces afterwards:

styleWithoutQuotes = anyBetween' (stringCI "style=") [' ','>'] 
                     <* A.takeWhile isSpace
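To try this out on its own, here is a minimal self-contained sketch. It assumes Text input and uses asciiCI in place of stringCI (as far as I know, newer attoparsec releases prefer asciiCI for Text); splitUnquoted is a hypothetical helper added only so you can see what is left unconsumed.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, asciiCI, parseOnly, takeText)
import qualified Data.Attoparsec.Text as A
import Data.Char (isSpace)
import Data.Text (Text)

-- Run `start`, then take characters up to (not including) the first one in `ends`.
anyBetween' :: Parser a -> [Char] -> Parser Text
anyBetween' start ends = start *> A.takeWhile (`notElem` ends)

-- Unquoted style attribute: swallow trailing whitespace, leave '>' in the input.
styleWithoutQuotes :: Parser Text
styleWithoutQuotes =
  anyBetween' (asciiCI "style=") [' ', '>'] <* A.takeWhile isSpace

-- Hypothetical helper: the attribute value paired with the unconsumed rest.
splitUnquoted :: Text -> Either String (Text, Text)
splitUnquoted = parseOnly ((,) <$> styleWithoutQuotes <*> takeText)
```

Loaded into GHCi (with :set -XOverloadedStrings), splitUnquoted "style=red>x" should give back the value "red" with ">x" still waiting in the input.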

Going for more `takeWhile`

Perhaps styleWithQuotes could do with a rewrite to use takeWhile as well, so let's make two helpers along the lines of anyBetween. They take from a starting parser up to an ending character, and there are inclusive and exclusive versions:

fromUptoExcl startP endChars = startP *> takeTill (flip elem endChars)
fromUptoIncl startP endChars = startP *> takeTill (flip elem endChars) <* anyChar

But I think from what you said, you want styleWithoutQuotes to be a hybrid; it eats ' ' but not >:

fromUptoEat startP endChars eatChars = 
            startP 
            *> takeTill (flip elem endChars) 
            <* skipWhile (flip elem eatChars)  -- succeeds on zero matches, so stopping at '>' doesn't fail

(All of these assume a small number of characters in your end character lists, otherwise elem isn't efficient - there are some Set variants if you're checking against a big list like an alphabet.)
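One option for a larger character list is attoparsec's inClass, which builds a lookup structure from a string once instead of walking a list with elem on every character. This is only a sketch of that idea, and it is my assumption that this is the kind of variant meant here; isValueEnd, bareValue and splitBare are hypothetical names.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, inClass, parseOnly, takeText, takeTill)
import Data.Text (Text)

-- Partially applied so the character class is built once and shared.
-- Note that inClass treats '-' between two characters as a range.
isValueEnd :: Char -> Bool
isValueEnd = inClass " \t\r\n>\"'"

-- Take everything up to (not including) the first end character.
bareValue :: Parser Text
bareValue = takeTill isValueEnd

-- Hypothetical helper: the value paired with the unconsumed rest.
splitBare :: Text -> Either String (Text, Text)
splitBare = parseOnly ((,) <$> bareValue <*> takeText)
```

For a two-character list like " >" the plain elem version is perfectly fine; this only starts to matter when the list gets long.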

Now for the rewrite:

styleWithQuotes' = fromUptoIncl (stringCI "style=\"") "\""
styleWithoutQuotes' = fromUptoEat (stringCI "style=") " >" " "

The overall parser

everythingButStyles uses <|> in a way that means that if it doesn't find "style" it will backtrack and then take everything. This is the sort of thing that can be slow. The problem is that we fail late, at the end of the input string, which is a bad time to find out that a choice was wrong. Let's go all out and try to:

Fail straight away if we're going to fail.
Maximise use of the faster parsers from Data.Attoparsec.Text.Internal

Idea: take until we get an s, then skip the style if there's one there.

notStyleNotEvenS = takeTill (flip elem "sS") 
skipAnyStyle = (styleWithQuotes' <|> styleWithoutQuotes') *> notStyleNotEvenS 
               <|> cons <$> anyChar <*> notStyleNotEvenS

The anyChar is usually an s or S, but there's no sense checking that again.

noStyles = append <$> notStyleNotEvenS <*> (mconcat <$> many skipAnyStyle)  -- many yields a list of chunks, so join them first

parseNoStyles = parseOnly noStyles
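For anyone who wants to paste the whole thing into a file, here is a self-contained sketch of the pipeline with imports and type signatures. A few things in it are my own choices rather than gospel: it uses asciiCI instead of stringCI, it swallows trailing spaces with skipWhile so that an unquoted attribute ending directly at '>' still parses, and it joins the chunks returned by many with T.concat.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many, (<|>))
import Data.Attoparsec.Text (Parser, anyChar, asciiCI, parseOnly, skipWhile, takeTill)
import Data.Text (Text)
import qualified Data.Text as T

-- Take up to the first character in endChars, then consume that character too.
fromUptoIncl :: Parser a -> [Char] -> Parser Text
fromUptoIncl startP endChars = startP *> takeTill (`elem` endChars) <* anyChar

-- Take up to the first character in endChars, then swallow any eatChars that follow.
fromUptoEat :: Parser a -> [Char] -> [Char] -> Parser Text
fromUptoEat startP endChars eatChars =
  startP *> takeTill (`elem` endChars) <* skipWhile (`elem` eatChars)

styleWithQuotes', styleWithoutQuotes' :: Parser Text
styleWithQuotes' = fromUptoIncl (asciiCI "style=\"") "\""
styleWithoutQuotes' = fromUptoEat (asciiCI "style=") " >" " "

-- Everything up to the next 's' or 'S', the only place a style can start.
notStyleNotEvenS :: Parser Text
notStyleNotEvenS = takeTill (`elem` ("sS" :: String))

-- Either drop a style attribute, or keep one character and carry on.
skipAnyStyle :: Parser Text
skipAnyStyle =
  (styleWithQuotes' <|> styleWithoutQuotes') *> notStyleNotEvenS
    <|> T.cons <$> anyChar <*> notStyleNotEvenS

noStyles :: Parser Text
noStyles = T.append <$> notStyleNotEvenS <*> (T.concat <$> many skipAnyStyle)

parseNoStyles :: Text -> Either String Text
parseNoStyles = parseOnly noStyles
```

Note that the whitespace in front of a removed attribute is kept, so a tag like `<td style=bold>` comes out as `<td >`.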


爷、活的狠高调

#3 · 2019-03-31 01:51

Since this question was asked, the lookAhead combinator has been added to attoparsec, so you can now just use lookAhead (char '>') or lookAhead (string ">") to achieve the goal.
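As a rough sketch of how that slots into the question's styleWithoutQuotes, assuming Text input and asciiCI in place of stringCI (endOfUnquoted and splitUnquoted are hypothetical names used only for illustration):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Combinator (lookAhead)
import Data.Attoparsec.Text (Parser, anyChar, asciiCI, char, manyTill, parseOnly, takeText)
import Data.Text (Text)

-- End marker: consume a space, or succeed on '>' without consuming it.
endOfUnquoted :: Parser Char
endOfUnquoted = char ' ' <|> lookAhead (char '>')

styleWithoutQuotes :: Parser String
styleWithoutQuotes = asciiCI "style=" *> manyTill anyChar endOfUnquoted

-- Hypothetical helper: the attribute value paired with the unconsumed rest.
splitUnquoted :: Text -> Either String (String, Text)
splitUnquoted = parseOnly ((,) <$> styleWithoutQuotes <*> takeText)
```

With this, splitUnquoted "style=red>x" should return the value "red" and leave ">x" in the input, while a terminating space is consumed as before.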

The workaround below dates from before it was introduced.

You can build your non-consuming parser using peekWord8, which just looks at the next byte (if any). Since ByteString has a Monoid instance, Parser ByteString is a MonadPlus, and you can use

lookGreater = do
    mbw <- peekWord8
    case mbw of
      Just 62 -> return ">"
      _ -> mzero

(62 is the code point of '>') to either find a '>' without consuming it or fail.
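If you are parsing Text rather than ByteString, the same trick should work with peekChar from Data.Attoparsec.Text. A minimal sketch, where lookGreaterText and peekThenRest are hypothetical names:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad (mzero)
import Data.Attoparsec.Text (Parser, parseOnly, peekChar, takeText)
import Data.Text (Text)

-- Succeed on '>' without consuming it; fail on anything else or at end of input.
lookGreaterText :: Parser Char
lookGreaterText = do
  mc <- peekChar
  case mc of
    Just '>' -> return '>'
    _        -> mzero

-- Hypothetical helper: shows that the peeked character is still in the input.
peekThenRest :: Text -> Either String (Char, Text)
peekThenRest = parseOnly ((,) <$> lookGreaterText <*> takeText)
```

Here peekThenRest ">abc" should succeed with the '>' still at the front of the remaining input, and peekThenRest "abc" should fail.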
