Multi-line *non* match with attoparsec

2019-09-02 01:42发布

问题:

I was playing around with parsing (PostgreSQL) logs which can have entries that are multi-line.

2016-01-01 01:01:01 entry1
2016-01-01 01:01:02 entry2a
    entry2b
2016-01-01 01:01:03 entry3

So - with a Perl or Python script I'd just grab the next line and if it wasn't starting with a timestamp append it to the previous log entry. What is a sensible way to approach this with attoparsec hooked up to io-streams? I clearly want to do something with lookAhead and failing to match a timestamp but my brain is just missing something.


Nope - still can't see it. I've stripped back what I've got. Parsing a single line is easy. I can't figure out how to parse "up to" another parsing pattern - I can see a lookAhead function I can use, but I don't see how that fits in with applying a "not" condition.

I can't see how I can match either. Entirely possible my brain has seized up.

{-# LANGUAGE OverloadedStrings #-}

module DummyParser (
    LogStatement (..), parseLogLine
    -- and, so we can test it...
    , LogTimestamp , parseTimestamp
    , parseSqlStmt
    , newLineAndTimestamp
) where

{-  we want to parse...
TIME001 statement: SELECT true;
TIME002 statement: SELECT 'b',
  'c';
TIME003 statement: SELECT 3;
-}

import           Data.Attoparsec.ByteString.Char8
import qualified Data.ByteString.Char8            as B

type LogTimestamp = Int

data LogStatement = LogStatement {
     l_ts  :: LogTimestamp
    ,l_sql :: String
} deriving (Eq, Show)


restOfLine :: Parser B.ByteString
restOfLine = do
    rest <- takeTill (== '\n')
    isEOF <- atEnd
    if isEOF then
        return rest
    else
        (char '\n') >> return rest


-- e.g. TIME001
parseTimestamp :: Parser LogTimestamp
parseTimestamp  = do
  string "TIME"
  digits  <- count 3 digit
  return (read digits)


-- e.g. statement: SELECT 1
parseSqlStmt :: Parser String
parseSqlStmt = do
    string "statement: "
    -- How can I match until the next timestamp?
    sql <- restOfLine
    return (B.unpack sql)


newLineAndTimestamp :: Parser LogTimestamp
newLineAndTimestamp = (char '\n') *> parseTimestamp


spaces :: Parser ()
spaces = do
    skipWhile (== ' ')


-- e.g. TIME001 statement: SELECT * FROM schema.table;
parseLogLine :: Parser LogStatement
parseLogLine = do
    log_ts <- parseTimestamp
    spaces
    log_sql <- parseSqlStmt
    let ls = LogStatement log_ts log_sql
    return ls

EDIT: So, this was what I finally ended up with thank's to arrowd's help

isTimestampNext = lookAhead parseTimestamp *> pure()

parseLogLine :: Parser LogStatement
parseLogLine = do
    log_ts <- parseTimestamp
    spaces
    log_sql <- parseSqlStmt
    extraLines <- manyTill restOfLine (endOfInput <|> isTimestampNext)
    let ls = LogStatement log_ts (log_sql ++ (B.unpack $ B.concat extraLines))
    return ls

回答1:

The combinator i shared on many attoparsec questions:

notFollowedBy p = p >> fail "not followed by"

Your solution would be something like

parseLogLine :: Parser LogStatement
parseLogLine = do
    log_ts <- parseTimestamp
    spaces
    log_sql <- parseSqlStmt
    newlineLeftover <- ((notFollowedBy parseTimestamp) *> parseSqlStmt) <|> pure ""
    let ls = LogStatement log_ts (log_sql ++ newlineLeftover
    return ls

The right hand of *> for newlineLeftOver expression would need some more work, i guess, but overall idea is like that.