I'm trying to scrape a webpage using Haskell and compile the results into an object.
If, for whatever reason, I can't get all the items from the page, I want to stop processing the page and return early.
For example:
    scrapePage :: String -> IO ()
    scrapePage url = do
        doc <- fromUrl url
        title <- liftM headMay $ runX $ doc >>> css "head.title" >>> getText
        when (isNothing title) (return ())
        date <- liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
        when (isNothing date) (return ())
        -- etc
        -- make page object and send it to db
        return ()
The problem is that the when call doesn't stop the do block or keep the later statements from being executed.
What is the right way to do this?
return in Haskell does not do the same thing as return in other languages. Instead, return injects a value into a monad (in this case IO). It does not exit the do block, so when (isNothing title) (return ()) is a no-op and the next statement runs regardless. You have a couple of options.
The simplest is to use if:
    scrapePage :: String -> IO ()
    scrapePage url = do
        doc <- fromUrl url
        title <- liftM headMay $ runX $ doc >>> css "head.title" >>> getText
        if (isNothing title) then return () else do
            date <- liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
            if (isNothing date) then return () else do
                -- etc
                -- make page object and send it to db
                return ()
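To see the difference in isolation, here is a minimal sketch that needs only base. The names Field, noEarlyExit and withIf are hypothetical, and an IORef log stands in for the "send it to db" step so the effect can be observed.

```haskell
import Control.Monad (when)
import Data.IORef (modifyIORef, newIORef, readIORef)
import Data.Maybe (isNothing)

-- Hypothetical stand-in for a scraped item; Nothing models a missing item.
type Field = Maybe String

-- Mirrors the question: the when line is a no-op, so the later step
-- is always recorded in the log.
noEarlyExit :: Field -> IO [String]
noEarlyExit title = do
    logRef <- newIORef []
    when (isNothing title) (return ())   -- does NOT leave the do block
    modifyIORef logRef ("saved to db" :)
    readIORef logRef

-- With if/else, the later step lives in the else branch and is skipped
-- when the item is missing.
withIf :: Field -> IO [String]
withIf title = do
    logRef <- newIORef []
    if isNothing title
        then return ()
        else modifyIORef logRef ("saved to db" :)
    readIORef logRef
```

In GHCi, noEarlyExit Nothing still returns a log containing the save step, while withIf Nothing returns an empty log.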
Another option is to use unless:
    scrapePage url = do
        doc <- fromUrl url
        title <- liftM headMay $ runX $ doc >>> css "head.title" >>> getText
        unless (isNothing title) $ do
            date <- liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
            unless (isNothing date) $ do
                -- etc
                -- make page object and send it to db
                return ()
The general problem here is that the IO monad doesn't have control effects (except for exceptions). On the other hand, you could use the Maybe monad transformer:
    scrapePage url = liftM (maybe () id) . runMaybeT $ do
        doc <- liftIO $ fromUrl url
        title <- liftIO $ liftM headMay $ runX $ doc >>> css "head.title" >>> getText
        guard (isJust title)
        date <- liftIO $ liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
        guard (isJust date)
        -- etc
        -- make page object and send it to db
        return ()
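As a self-contained illustration of the MaybeT approach, the sketch below replaces the scraping calls with a hypothetical scrapeField lookup over an association list (Page is also a made-up type). Wrapping each IO (Maybe a) result in the MaybeT constructor binds the value and aborts on Nothing in one step, so no separate guard is needed. It assumes only the transformers package.

```haskell
import Control.Monad.Trans.Maybe (MaybeT (..))

-- Hypothetical stand-in for a fetched page: field name -> value.
type Page = [(String, String)]

-- Hypothetical IO lookup standing in for the runX / headMay pipeline.
scrapeField :: Page -> String -> IO (Maybe String)
scrapeField page name = return (lookup name page)

-- Yields Nothing as soon as any field is missing; later lookups are skipped.
scrapePage :: Page -> IO (Maybe (String, String))
scrapePage page = runMaybeT $ do
    title <- MaybeT (scrapeField page "title")
    date  <- MaybeT (scrapeField page "date")
    return (title, date)
```

A caller can then pattern match on the result and only write to the database in the Just case.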
If you really want full-blown control effects, you need to use ContT:
    scrapePage :: String -> IO ()
    scrapePage url = flip runContT return $ callCC $ \exit -> do
        doc <- liftIO $ fromUrl url
        title <- liftIO $ liftM headMay $ runX $ doc >>> css "head.title" >>> getText
        when (isNothing title) $ exit ()  -- jumps to the end of the callCC block
        date <- liftIO $ liftM headMay $ runX $ doc >>> css "span.dateTime" ! "data-utc"
        when (isNothing date) $ exit ()
        -- etc
        -- make page object and send it to db
        return ()
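The escape-continuation pattern can be tried without any scraping library. In the sketch below, Page and scrapeField are hypothetical stand-ins, and the result is a String only so the exit point is visible. The key point is that callCC wraps the whole block and the bound continuation (here named exit) is what gets called to leave it. It assumes only the transformers package.

```haskell
import Control.Monad (when)
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Cont (ContT (..), callCC)
import Data.Maybe (isNothing)

-- Hypothetical stand-in for a fetched page: field name -> value.
type Page = [(String, String)]

-- Hypothetical IO lookup standing in for the runX / headMay pipeline.
scrapeField :: Page -> String -> IO (Maybe String)
scrapeField page name = return (lookup name page)

-- Calling exit abandons the rest of the block and makes its argument
-- the result of the whole callCC expression.
scrapePage :: Page -> IO String
scrapePage page = flip runContT return $ callCC $ \exit -> do
    title <- lift (scrapeField page "title")
    when (isNothing title) $ exit "missing title"
    date <- lift (scrapeField page "date")
    when (isNothing date) $ exit "missing date"
    return "saved"
```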
WARNING: none of the above code has been tested, or even type checked!
Use a monad transformer!
    import Control.Monad.Trans.Class -- from transformers package
    import Control.Error.Util -- from errors package

    scrapePage :: String -> IO ()
    scrapePage url = maybeT (return ()) return $ do
        doc <- lift $ fromUrl url
        title <- liftM headMay $ lift . runX $ doc >>> css "head.title" >>> getText
        guard . not $ isNothing title
        date <- liftM headMay $ lift . runX $ doc >>> css "span.dateTime" ! "data-utc"
        guard . not $ isNothing date
        -- etc
        -- make page object and send it to db
        return ()
For more flexibility in the return value when you return early, use throwError/eitherT/EitherT instead of mzero/maybeT/MaybeT. (Although then you can't use guard.)
(Probably also use headZ instead of headMay and ditch the explicit guard.)
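A rough sketch of the "early return with a value" idea follows. Note the substitution: it uses ExceptT and throwE from the transformers package rather than the EitherT/throwError names mentioned above, and Page, scrapeField and require are hypothetical helpers invented for the illustration.

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Except (ExceptT, runExceptT, throwE)

-- Hypothetical stand-in for a fetched page: field name -> value.
type Page = [(String, String)]

-- Hypothetical IO lookup standing in for the runX / headMay pipeline.
scrapeField :: Page -> String -> IO (Maybe String)
scrapeField page name = return (lookup name page)

-- Hypothetical helper: fetch a field or abort with a message naming it.
require :: Page -> String -> ExceptT String IO String
require page name = do
    found <- lift (scrapeField page name)
    maybe (throwE ("missing " ++ name)) return found

-- Left carries the reason for the early return; Right carries the result.
scrapePage :: Page -> IO (Either String (String, String))
scrapePage page = runExceptT $ do
    title <- require page "title"
    date  <- require page "date"
    return (title, date)
```

Compared with the MaybeT version, the caller learns which item was missing instead of receiving a bare Nothing.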
I have never worked with Haskell, but it seems quite easy. Try when (isNothing date) $ exit (). If this also isn't working, then make sure your statement is correct. Also see this website for more info: Breaking From loop.