I wrote a quick attoparsec parser to walk an aspx file and drop all the style attributes. It works fine except for one piece, where I can't figure out how to make it succeed on matching > without consuming it.
Here's what I have:
anyTill = manyTill anyChar
anyBetween start end = start *> anyTill end
styleWithQuotes = anyBetween (stringCI "style=\"") (stringCI "\"")
styleWithoutQuotes = anyBetween (stringCI "style=") (stringCI " " <|> ">")
everythingButStyles = manyTill anyChar (styleWithQuotes <|> styleWithoutQuotes) <|> many1 anyChar
I understand this is partly because of how I'm using manyTill in everythingButStyles (that's how I am actively dropping all the style stuff on the ground), but in styleWithoutQuotes I need it to match ">" as an end without consuming it. In parsec I would have just used lookAhead ">", but I can't do that in attoparsec.
Since then, the lookAhead combinator has been added to attoparsec, so now one can just use lookAhead (char '>') or lookAhead (string ">") to achieve the goal.
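A minimal sketch of the lookAhead approach, assuming Data.Attoparsec.Text with OverloadedStrings. It uses asciiCI in place of the question's stringCI, imports lookAhead explicitly from Data.Attoparsec.Combinator, and adds a hypothetical demo helper that returns the leftover input so you can see whether '>' was consumed:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Combinator (lookAhead)
import Data.Attoparsec.Text (Parser, anyChar, asciiCI, char, manyTill, parseOnly, takeText)
import Data.Text (Text)

-- Unquoted style value: a space ends it and is consumed,
-- a '>' ends it too but is only peeked at.
styleWithoutQuotes :: Parser String
styleWithoutQuotes =
  asciiCI "style=" *> manyTill anyChar (char ' ' <|> lookAhead (char '>'))

-- Hypothetical helper: run the style parser, then return the remaining input.
demo :: Text -> Either String Text
demo = parseOnly (styleWithoutQuotes *> takeText)
```

With this, demo "style=red>rest" should give Right ">rest" (the '>' survives), while demo "style=red class" should give Right "class" (the space is eaten).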
Below is a workaround from the time before its introduction.
You can build your non-consuming parser using peekWord8, which just looks at the next byte (if any). Since ByteString has a Monoid instance, Parser ByteString is a MonadPlus, and you can use
lookGreater = do
  mbw <- peekWord8
  case mbw of
    Just 62 -> return ">"
    _ -> mzero
(62 is the code point of '>') to either find a '>' without consuming it or fail.
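If you are parsing Text rather than ByteString, the same idea can be sketched with peekChar from Data.Attoparsec.Text. This is an adaptation, not the code above: it returns the Char instead of a string, and demo is a hypothetical helper that shows the input is left untouched.

```haskell
import Control.Monad (mzero)
import Data.Attoparsec.Text (Parser, parseOnly, peekChar, takeText)
import qualified Data.Text as T

-- Succeed without consuming anything if the next character is '>',
-- otherwise (different character or end of input) fail.
lookGreater :: Parser Char
lookGreater = do
  mc <- peekChar
  case mc of
    Just '>' -> return '>'
    _        -> mzero

-- Hypothetical helper: after lookGreater succeeds, the '>' is still unread.
demo :: String -> Either String T.Text
demo = parseOnly (lookGreater *> takeText) . T.pack
```

Here demo ">abc" should return the whole input, including the '>', and demo "abc" should be a Left.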
anyBetween start end = start *> anyTill end
Your anyBetween parser eats its last character because anyTill does: it's designed to parse up to an end marker, on the assumption that you don't want to keep the closing marker in the input to parse again.
Notice that your end parsers are all single-character parsers, so we can change the functionality to make use of this:
anyBetween'' start ends = start *> many (satisfy (not.flip elem ends))
but many isn't as efficient as attoparsec's takeWhile, which you should use as much as possible, so if you've done
import qualified Data.Attoparsec.Text as A
anyBetween' start ends = start *> A.takeWhile (not.flip elem ends)
should do the trick, and we can rewrite
styleWithoutQuotes = anyBetween' (stringCI "style=") [' ','>']
If you want it to eat the ' ' but not the '>' you can explicitly eat spaces afterwards:
-- isSpace comes from Data.Char
styleWithoutQuotes = anyBetween' (stringCI "style=") [' ','>']
                     <* A.takeWhile isSpace
Going for more takeWhile
Perhaps styleWithQuotes could do with a rewrite to use takeWhile as well, so let's make two helpers along the lines of anyBetween. They take from a starting parser up to an ending character, and there are inclusive and exclusive versions:
fromUptoExcl startP endChars = startP *> takeTill (flip elem endChars)
fromUptoIncl startP endChars = startP *> takeTill (flip elem endChars) <* anyChar
But I think from what you said, you want styleWithoutQuotes to be a hybrid; it eats ' ' but not >:
-- skipWhile rather than satisfy, so that stopping at '>' (nothing to eat) still succeeds
fromUptoEat startP endChars eatChars =
    startP
    *> takeTill (flip elem endChars)
    <* skipWhile (flip elem eatChars)
(All of these assume a small number of characters in your end character lists, otherwise elem isn't efficient - there are some Set variants if you're checking against a big list like an alphabet.)
Now for the rewrite:
styleWithQuotes' = fromUptoIncl (stringCI "style=\"") "\""
styleWithoutQuotes' = fromUptoEat (stringCI "style=") " >" " "
The overall parser uses <|> in a way that means that if it doesn't find "style" it will backtrack and then take everything. This is an example of the sort of thing that can be slow. The problem is that we fail late, at the end of the input string, which is a bad time to decide whether we should fail. Let's go all out and try to
- Fail straight away if we're going to fail.
- Maximise use of the faster parsers from Data.Attoparsec.Text.Internal
Idea: take until we get an s, then skip the style if there's one there.
notStyleNotEvenS = takeTill (flip elem "sS")
skipAnyStyle = (styleWithQuotes' <|> styleWithoutQuotes') *> notStyleNotEvenS
           <|> cons <$> anyChar <*> notStyleNotEvenS
The anyChar is usually an s or S, but there's no sense checking that again.
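-- many yields a list of chunks, so join them first (T is Data.Text, imported qualified)
noStyles = append <$> notStyleNotEvenS <*> (T.concat <$> many skipAnyStyle)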
parseNoStyles = parseOnly noStyles
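For reference, here is one way the pieces above could be assembled into a single module. It is a sketch under a few assumptions that are mine rather than the answer's: OverloadedStrings is on, asciiCI stands in for stringCI, the helpers are inlined, trailing spaces after an unquoted value are eaten with skipWhile, and the chunks returned by many are joined with T.concat.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (many, (<|>))
import Data.Attoparsec.Text (Parser, anyChar, asciiCI, parseOnly, skipWhile, takeTill)
import Data.Text (Text)
import qualified Data.Text as T

-- Quoted value: take up to the closing quote, then consume the quote itself.
styleWithQuotes' :: Parser Text
styleWithQuotes' = asciiCI "style=\"" *> takeTill (== '"') <* anyChar

-- Unquoted value: stop at ' ' or '>', then eat spaces but leave '>' in place.
styleWithoutQuotes' :: Parser Text
styleWithoutQuotes' =
  asciiCI "style=" *> takeTill (\c -> c == ' ' || c == '>') <* skipWhile (== ' ')

-- Everything up to, but not including, the next 's' or 'S'.
notStyleNotEvenS :: Parser Text
notStyleNotEvenS = takeTill (\c -> c == 's' || c == 'S')

-- At an 's': drop a style attribute if one starts here, otherwise keep the
-- character; either way, continue with the text up to the next 's'.
skipAnyStyle :: Parser Text
skipAnyStyle =
      (styleWithQuotes' <|> styleWithoutQuotes') *> notStyleNotEvenS
  <|> T.cons <$> anyChar <*> notStyleNotEvenS

noStyles :: Parser Text
noStyles = T.append <$> notStyleNotEvenS <*> (T.concat <$> many skipAnyStyle)

parseNoStyles :: Text -> Either String Text
parseNoStyles = parseOnly noStyles
```

As a quick check, parseNoStyles "<td style=wide>x</td>" should come back as Right "<td >x</td>". Note that the quoted variant does not eat the space after the closing quote, so removing a quoted style attribute leaves a double space behind.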