I can't seem to find decent documentation on Haskell's POSIX regex implementation, specifically the module Text.Regex.Posix.
Can anyone point me in the right direction of using multiline matching on a string?
A snippet for the curious:
> extractToken body = body =~ "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>" :: String
I'm trying to extract the source of Wikipedia pages, but this method clearly falls over when the textarea contents span more than one line.
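The default compile options include compNewline, which stops . and [^...] from matching a newline, so compile the regex without it. You may need to import Text.Regex.Base.RegexLike for access to makeRegexOpts and friends.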
extractToken body = match regex body where
    regex = makeRegexOpts (defaultCompOpt - compNewline) defaultExecOpt
        "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"
Well, since Text.Regex.Posix's defaultCompOpt = compExtended + compNewline, that works out equivalently as
extractToken body = match regex body where
    regex = makeRegexOpts compExtended defaultExecOpt
        "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"
To pull out just the first group, use one of the other instances of RegexContext (the requested result type selects what match returns). One possibility is
extractToken body = head groups where
    (preMatch, inMatch, postMatch, groups) =
        match regex body :: (String, String, String, [String])
    regex = makeRegexOpts compExtended defaultExecOpt
        "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"
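For anyone who wants to paste this into a file, here is a minimal self-contained sketch of the same idea. It assumes the regex-posix package is installed; the name textareaRegex and the Maybe return type are my own additions (head would crash when the page has no matching textarea), not part of the snippets above.

```haskell
import Text.Regex.Base.RegexLike (defaultExecOpt, makeRegexOpts, match)
import Text.Regex.Posix (Regex, compExtended)

-- Compiled with compExtended only, i.e. without compNewline,
-- so that '.' and [^>] are allowed to match '\n'.
-- (textareaRegex is a name chosen for this sketch.)
textareaRegex :: Regex
textareaRegex =
    makeRegexOpts compExtended defaultExecOpt
        "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*)</textarea>"

-- Return the first capture group, or Nothing when the pattern does not match
-- (on failure the 4-tuple instance yields an empty group list).
extractToken :: String -> Maybe String
extractToken body =
    case match textareaRegex body :: (String, String, String, [String]) of
        (_, _, _, grp : _) -> Just grp
        _                  -> Nothing
```

Keep in mind that POSIX matching is greedy (leftmost-longest), so if the page contains several closing </textarea> tags the group runs up to the last one.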
You may need to use the PCRE backend instead if you want anything more flexible, or faster, than POSIX regexes.
pcre-light and regex-pcre are both fine.
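As a rough sketch of what that could look like with regex-pcre (assuming the package and the system PCRE library are installed): compDotAll makes . match newlines, and PCRE also allows a non-greedy (.*?), which POSIX does not. The names textareaRegexPcre and extractTokenPcre are made up for this example.

```haskell
import Text.Regex.PCRE (Regex, compDotAll, defaultExecOpt, makeRegexOpts, match)

-- compDotAll lets '.' match '\n'; (.*?) stops at the first </textarea>.
-- (textareaRegexPcre is a name chosen for this sketch.)
textareaRegexPcre :: Regex
textareaRegexPcre =
    makeRegexOpts compDotAll defaultExecOpt
        "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(.*?)</textarea>"

-- Return the first capture group, or Nothing when nothing matches.
extractTokenPcre :: String -> Maybe String
extractTokenPcre body =
    case match textareaRegexPcre body :: (String, String, String, [String]) of
        (_, _, _, grp : _) -> Just grp
        _                  -> Nothing
```

The non-greedy group is the main practical gain here: a second textarea later in the page no longer gets swallowed into the match.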
I solved this in my case by matching
((.*)|\n*)*
although this may not always work, depending on your expression.
The above solution is probably the best way to go if you're able to use it.
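For completeness, here is a hedged sketch of how that workaround pattern might be dropped into the original expression while keeping the default options and plain =~. The extra outer parentheses (so the whole body lands in the first group) and the name extractTokenDefault are my assumptions, not something the snippets above spell out, and submatch behaviour can vary between regex libraries.

```haskell
import Text.Regex.Posix ((=~))

-- Default compile options are kept, so '.' stops at '\n';
-- the alternation consumes the newlines instead.
-- (extractTokenDefault is a name chosen for this sketch.)
extractTokenDefault :: String -> Maybe String
extractTokenDefault body =
    case body =~ pat :: (String, String, String, [String]) of
        (_, _, _, grp : _) -> Just grp
        _                  -> Nothing
  where
    -- Outer parentheses added so that group 1 spans the whole body.
    pat :: String
    pat = "<textarea[^>]*id=\"wpTextbox1\"[^>]*>(((.*)|\n*)*)</textarea>"
```

Note that with the default options [^>]* will not cross a newline either, so this only works when the opening tag sits on a single line.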