How to extract messages using regex in Scala?

Posted 2019-09-03 17:22

Question:

My regex is being greedy and is not working as it is supposed to. I need to extract each message together with its timestamp and the user who created it. Also, if a user has two or more consecutive messages, they should go inside one match / block / group. How can I solve this?

https://regex101.com/r/zD5bR6/1

val pattern = """((a\.b|c\.d)\n(.+\n)+)+?""".r
for(m <- pattern.findAllIn(str).matchData; e <- m.subgroups) println(e)

UPDATE

ndn's solution throws a StackOverflowError when executed:

Exception in thread "main" java.lang.StackOverflowError
    at java.util.regex.Pattern$GroupTail.match(Pattern.java:4708)
    .......

Code:

    val pattern = "(?:.+(?:\\Z|\\n))+?(?=\\Z|\\w\\.\\w)".r
    (pattern findAllIn str).toArray.reverse.foreach(println)
    for(m <- pattern.findAllIn(str).matchData; e <- m.subgroups) println(e)

Answer 1:

I don't think a regular expression is the right tool for this job. My solution below uses a (tail) recursive function to loop over the lines, keeping track of the current username and creating a Message for every timestamp / message pair.

import java.time.LocalTime

case class Message(user: String, timestamp: LocalTime, message: String)

val Timestamp = """\[(\d{2})\:(\d{2})\:(\d{2})\]""".r

def parseMessages(lines: List[String], usernames: Set[String]) = {
  @scala.annotation.tailrec
  def go(
    lines: List[String], currentUser: Option[String], messages: List[Message]
  ): List[Message] = lines match {
    // no more lines -> return parsed messages
    case Nil => messages.reverse
    // found a user -> keep as currentUser
    case user :: tail if usernames.contains(user) => 
      go(tail, Some(user), messages)
    // timestamp and message on next line -> create a Message
    case Timestamp(h, m, s) :: msg :: tail if currentUser.isDefined =>
      val time = LocalTime.of(h.toInt, m.toInt, s.toInt)
      val newMsg = Message(currentUser.get, time, msg)
      go(tail, currentUser, newMsg :: messages)
    // invalid line -> ignore
    case _ =>
      go(lines.tail, currentUser, messages)
  }
  go(lines, None, Nil)
}

Which we can use as follows:

val input = """
a.b
[10:12:03]
you can also get commands
[10:11:26]
from the console
[10:11:21]
can you check if has been resolved
[10:10:47]
ah, okay
c.d
[10:10:39]
anyways startsLevel is still 4
a.b
[10:09:25]
might be a dead end
[10:08:56]
that need to be started early as well
"""

val lines = input.split('\n').toList
val users = Set("a.b", "c.d")

parseMessages(lines, users).foreach(println)
// Message(a.b,10:12:03,you can also get commands)
// Message(a.b,10:11:26,from the console)
// Message(a.b,10:11:21,can you check if has been resolved)
// Message(a.b,10:10:47,ah, okay)
// Message(c.d,10:10:39,anyways startsLevel is still 4)
// Message(a.b,10:09:25,might be a dead end)
// Message(a.b,10:08:56,that need to be started early as well)
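
The question also asks for consecutive messages from the same user to end up in one block, while the output above is a flat list. A minimal sketch of one way to regroup such a list follows. The helper name groupConsecutive is hypothetical (it is not part of this answer), and the sample data uses plain (user, message) pairs as a stand-in for Message.

```scala
// Hypothetical helper, not part of the original answer: groups adjacent
// elements that share the same key, preserving the original order.
def groupConsecutive[A, K](items: List[A])(key: A => K): List[(K, List[A])] =
  items.foldRight(List.empty[(K, List[A])]) {
    // same key as the block built so far -> prepend to that block
    case (item, (k, group) :: rest) if k == key(item) =>
      (k, item :: group) :: rest
    // different key (or no block yet) -> start a new block
    case (item, acc) =>
      (key(item), List(item)) :: acc
  }

// Stand-in data: (user, message) pairs in parse order.
val parsed = List(
  "a.b" -> "ah, okay",
  "c.d" -> "anyways startsLevel is still 4",
  "a.b" -> "might be a dead end",
  "a.b" -> "that need to be started early as well"
)

val blocks = groupConsecutive(parsed)(_._1)
blocks.foreach { case (user, msgs) =>
  println(s"$user: ${msgs.map(_._2).mkString(" | ")}")
}
```

With the Message class from this answer, the call would presumably look like groupConsecutive(parseMessages(lines, users))(_.user).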


Answer 2:

The idea is to take as few characters as possible that will be followed by a username or the end of the string:

(?:.+(?:\Z|\n))+?(?=\Z|\w\.\w)

See it in action
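
The question's update reports a StackOverflowError with this pattern. A likely cause is that java.util.regex matches a repeated group recursively, so a long input can exhaust the stack. As a rough alternative sketch (a different technique, not the pattern above), the input can be split at the start of every username line with a zero-width lookahead. It assumes the usernames are known up front (a.b and c.d, as in the question) and that each sits alone on its own line. The names userLine, splitBlocks and sample are made up for illustration.

```scala
// Assumption: usernames are known in advance and each occupies a whole line.
// (?m) makes ^ and $ match at line boundaries; the lookahead consumes nothing,
// so the username stays at the top of its block.
val userLine = """(?m)^(?=(?:a\.b|c\.d)$)"""

// Hypothetical helper: one string per user block, without surrounding blanks.
def splitBlocks(text: String): List[String] =
  text.split(userLine).toList.map(_.trim).filter(_.nonEmpty)

val sample =
  """a.b
    |[10:10:47]
    |ah, okay
    |c.d
    |[10:10:39]
    |anyways startsLevel is still 4
    |a.b
    |[10:09:25]
    |might be a dead end
    |[10:08:56]
    |that need to be started early as well
    |""".stripMargin

splitBlocks(sample).foreach { block =>
  println(block)
  println("---")
}
```

Because consecutive messages from one user already sit under a single username line in the input, each resulting string is one user block; the timestamp / message pairs inside it still need to be parsed separately.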



Tags: regex scala