I wrote a quick attoparsec parser to walk an aspx file and drop all the style attributes, and it's working fine except for one piece of it where I can't figure out how to make it succeed on matching > without consuming it.
Here's what I have:
anyTill = manyTill anyChar
anyBetween start end = start *> anyTill end
styleWithQuotes = anyBetween (stringCI "style=\"") (stringCI "\"")
styleWithoutQuotes = anyBetween (stringCI "style=") (stringCI " " <|> ">")
everythingButStyles = manyTill anyChar (styleWithQuotes <|> styleWithoutQuotes) <|> many1 anyChar
I understand it's partially because of how I'm using manyTill in everythingButStyles; that's how I am actively dropping all the style stuff on the ground. But in styleWithoutQuotes I need it to match ">" as an end without consuming it. In parsec I would have just done lookAhead ">", but I can't do that in attoparsec.
Your anyBetween parser eats its last character because anyTill does: it's designed to parse up to an end marker, on the assumption that you don't want to keep the closing brace in the input to parse again.

Notice that your end parsers are all single-character parsers, so we can change the functionality to make use of this:
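The original snippet for this step did not survive, so here is a minimal sketch of the idea, assuming Data.Attoparsec.Text. The name anyBetweenChars is hypothetical, and asciiCI stands in for stringCI, which is deprecated in the Text module. The end markers become a list of characters, and the character that stops the run is never consumed.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many)
import Data.Attoparsec.Text (Parser, asciiCI, char, parseOnly, satisfy)

-- Hypothetical helper: like anyBetween, but the end markers are given as a
-- list of characters. It collects characters until it meets one of them and
-- leaves that end character in the input.
anyBetweenChars :: Parser a -> [Char] -> Parser String
anyBetweenChars start ends = start *> many (satisfy (`notElem` ends))

-- Stops at ' ' or '>' without consuming either.
styleWithoutQuotes :: Parser String
styleWithoutQuotes = anyBetweenChars (asciiCI "style=") " >"
```

With this, something like parseOnly (styleWithoutQuotes *> char '>') "style=red>" should succeed, because the '>' is still there for the next parser.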
but many isn't as efficient as Attoparsec's takeWhile, which you should use as much as possible. With the relevant attoparsec module imported, a takeWhile-based version should do the trick, and we can rewrite styleWithoutQuotes to use it.
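A sketch of what the takeWhile version could look like, assuming a qualified import of Data.Attoparsec.Text as A. The names anyBetween' and styleWithoutQuotesSp are hypothetical, and asciiCI is again used in place of the deprecated stringCI. The last definition is a variant that also skips whitespace after the value while still leaving '>' in the input.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.Text as A
import Data.Attoparsec.Text (Parser, asciiCI, char, parseOnly)
import Data.Char (isSpace)
import Data.Text (Text)

-- Hypothetical helper: takeWhile grabs the whole run in one go and returns
-- it as Text; the end character that stops it is not consumed.
anyBetween' :: Parser a -> [Char] -> Parser Text
anyBetween' start ends = start *> A.takeWhile (`notElem` ends)

-- Stops at ' ' or '>' and leaves it in the input.
styleWithoutQuotes :: Parser Text
styleWithoutQuotes = anyBetween' (asciiCI "style=") " >"

-- Variant: also eat any whitespace after the value, but never the '>'.
styleWithoutQuotesSp :: Parser Text
styleWithoutQuotesSp = styleWithoutQuotes <* A.takeWhile isSpace
```

Note that A.takeWhile never fails, so both parsers succeed with an empty value on input such as "style=>".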
If you want it to eat the ' ' but not the '>', you can explicitly eat spaces afterwards.

Going for more takeWhile
Perhaps styleWithQuotes could do with a rewrite to use takeWhile as well, so let's make two helpers along the lines of anyBetween. They take from a starting parser up to an ending character, and there are inclusive and exclusive versions. But I think from what you said, you want styleWithoutQuotes to be a hybrid: it eats ' ' but not >.

(All of these assume a small number of characters in your end character lists, otherwise elem isn't efficient; there are some Set variants if you're checking against a big list like an alphabet.)

Now for the rewrite:
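The helper definitions are missing from this copy of the answer, so the following is only a sketch of how they might look, assuming Data.Attoparsec.Text. The names fromUptoExcl, fromUptoIncl and fromUptoHybrid are hypothetical, and asciiCI replaces the deprecated stringCI.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (optional)
import Data.Attoparsec.Text
  (Parser, anyChar, asciiCI, parseOnly, satisfy, takeText, takeTill)
import Data.Text (Text)

-- Exclusive: stop before the first end character and leave it in the input.
fromUptoExcl :: Parser a -> [Char] -> Parser Text
fromUptoExcl start ends = start *> takeTill (`elem` ends)

-- Inclusive: also consume the end character (fails if the input runs out).
fromUptoIncl :: Parser a -> [Char] -> Parser Text
fromUptoIncl start ends = fromUptoExcl start ends <* anyChar

-- Hybrid: stop at any character from either list, then consume it only if
-- it belongs to the first list.
fromUptoHybrid :: Parser a -> [Char] -> [Char] -> Parser Text
fromUptoHybrid start eaten kept =
  fromUptoExcl start (eaten ++ kept) <* optional (satisfy (`elem` eaten))

styleWithQuotes :: Parser Text
styleWithQuotes = fromUptoIncl (asciiCI "style=\"") "\""

-- Eats a terminating ' ' but leaves a terminating '>' alone.
styleWithoutQuotes :: Parser Text
styleWithoutQuotes = fromUptoHybrid (asciiCI "style=") " " ">"
```

The hybrid uses optional so that it still succeeds when the value is followed by '>' or by the end of the input.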
The overall parser

everythingButStyles uses <|> in a way that means that if it doesn't find "style" it will backtrack and then take everything. This is an example of the sort of thing which can be slow. The problem is that we fail late, at the end of the input string, which is a bad time to make a choice about whether we should fail. Let's go all out and try to fail as early as possible instead.

Idea: take until we get an s, then skip the style if there's one there.
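One way this idea could be written down, as a sketch rather than the answer's original code. It assumes Data.Attoparsec.Text, the names notS, skipStyleOrKeep and noStyles are hypothetical, and the two style parsers are compact stand-ins so the block is self-contained.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many, optional, (<|>))
import Data.Attoparsec.Text
  (Parser, anyChar, asciiCI, parseOnly, satisfy, takeTill)
import Data.Text (Text)
import qualified Data.Text as T

-- Compact stand-ins for the two style parsers.
styleWithQuotes :: Parser Text
styleWithQuotes = asciiCI "style=\"" *> takeTill (== '"') <* anyChar

styleWithoutQuotes :: Parser Text
styleWithoutQuotes =
  asciiCI "style=" *> takeTill (\c -> c == ' ' || c == '>')
                   <* optional (satisfy (== ' '))

-- Fast path: everything up to, but not including, the next 's' or 'S'.
notS :: Parser Text
notS = takeTill (\c -> c == 's' || c == 'S')

-- Drop a style attribute if one starts here; otherwise keep one character
-- and the run of non-'s' characters that follows it.
skipStyleOrKeep :: Parser Text
skipStyleOrKeep =
      (T.empty <$ (styleWithQuotes <|> styleWithoutQuotes))
  <|> (T.cons <$> anyChar <*> notS)

-- The whole input with style attributes removed.
noStyles :: Parser Text
noStyles = T.append <$> notS <*> (T.concat <$> many skipStyleOrKeep)
```

Each choice is now made locally at an 's', so a failed style match only backtracks over a few characters instead of the rest of the input.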
The anyChar is usually an s or S, but there's no sense checking that again.

Meanwhile, the lookAhead combinator was added to attoparsec, so now one can just use lookAhead (char '>') or lookAhead (string ">") to achieve the goal.

Below is a workaround from the times before its introduction.
You can build your non-consuming parser using peekWord8, which just looks at the next byte (if any). Since ByteString has a Monoid instance, Parser ByteString is a MonadPlus, and you can use that (62 is the code point of '>') to either find a '>' without consuming it or fail.
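A minimal sketch of both approaches side by side, assuming Data.Attoparsec.ByteString. The names peekGt and lookGt are hypothetical; peekGt follows the peekWord8 workaround described above, and lookGt is the lookAhead equivalent.

```haskell
import Control.Monad (mzero)
import Data.Attoparsec.ByteString
  (Parser, parseOnly, peekWord8, takeByteString, word8)
import Data.Attoparsec.Combinator (lookAhead)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC

-- Workaround: succeed with an empty ByteString if the next byte is '>' (62)
-- without consuming it; fail otherwise, including at end of input.
peekGt :: Parser B.ByteString
peekGt = do
  next <- peekWord8
  case next of
    Just 62 -> return mempty
    _       -> mzero

-- With lookAhead: run the parser, then restore the input position.
lookGt :: Parser ()
lookGt = () <$ lookAhead (word8 62)
```

After either parser succeeds, the '>' is still the next byte of the input, so a following parser such as takeByteString sees it.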