How to convert formatted String to Tuple in Scala?

Posted 2019-09-11 06:56

Question:

I have a text file with the following content.

//((number,(number,date)),number)
((210,(18,2015/06/28)),57.0)
((92,(60,2015/06/16)),102.89777479000209)
((46,(18,2015/06/17)),52.8940162267246)
((204,(27,2015/06/06)),75.2807019793683)

I want to convert each line to a tuple and need a fast way to do it, as the list of such strings is very large.

EDIT: I would also like to preserve the type and structure information.

Any help would be appreciated.

Answer 1:

I find scala-parser-combinators a nice way to do this kind of thing; it's a lot more self-documenting than splits or regexes:

import scala.util.parsing.combinator.JavaTokenParsers
import org.joda.time.LocalDate

object MyParser extends JavaTokenParsers {
  override val skipWhitespace = false
  // Dates in the input are written year/month/day, e.g. 2015/06/28.
  def date = (wholeNumber ~ "/" ~ wholeNumber ~ "/" ~ wholeNumber) ^^ {
    case year ~ _ ~ month ~ _ ~ day =>
      new LocalDate(year.toInt, month.toInt, day.toInt)
  }
  def myNumber = decimalNumber ^^ { _.toDouble }
  def tupleElement: Parser[Any] = date | myNumber | tuple
  def tuple: Parser[List[Any]] = "(" ~> repsep(tupleElement, ",") <~ ")"
  // One tuple per line; "\n" is an actual newline, matched explicitly because skipWhitespace is off.
  def data = repsep(tuple, "\n")
}

Hopefully the way to extend this is obvious. Usage looks something like this:

scala> MyParser.parseAll(MyParser.data, """((210,(18,2015/06/28)),57.0)
 | ((92,(60,2015/06/16)),102.89777479000209)
 | ((46,(18,2015/06/17)),52.8940162267246)
 | ((204,(27,2015/06/06)),75.2807019793683)""")
res1: MyParser.ParseResult[List[List[Any]]] = [4.41] parsed: List(List(List(210, List(18, LocalDate(28,6,2015))), 57.0), List(List(92, List(60, LocalDate(16,6,2015))), 102.89777479000209), List(List(46, List(18, LocalDate(17,6,2015))), 52.8940162267246), List(List(204, List(27, LocalDate(6,6,2015))), 75.2807019793683))

The types can't be fully known at compile time (short of doing the parsing at compile time with a macro or some such) - the above is a List[List[Any]] where the elements are either LocalDate, Double or another List. You could handle it using pattern matching at runtime. A nicer approach could be to use a sealed trait:

sealed trait TupleElement
case class NestedTuple(inner: List[TupleElement]) extends TupleElement
case class NumberElement(value: Double) extends TupleElement
case class DateElement(value: LocalDate) extends TupleElement

def myNumber = decimalNumber ^^ { d => NumberElement(d.toDouble) }
def tupleElement: Parser[TupleElement] = ... //etc.

Then when you have a TupleElement in code and you pattern-match, the compiler will warn if you don't cover all the cases.
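To make the pattern-matching step concrete, here is a minimal sketch of consuming such a tree. It assumes `java.time.LocalDate` instead of Joda's `LocalDate` so that it needs only the standard library, and the `TupleShape` object with its `toTyped` and `describe` helpers are hypothetical names introduced for illustration.

```scala
import java.time.LocalDate

sealed trait TupleElement
case class NestedTuple(inner: List[TupleElement]) extends TupleElement
case class NumberElement(value: Double) extends TupleElement
case class DateElement(value: LocalDate) extends TupleElement

// Hypothetical helpers, not part of the original answer.
object TupleShape {
  // Recover the typed shape ((Int, (Int, LocalDate)), Double) from the generic tree.
  // Returns None when the tree does not have exactly that shape.
  def toTyped(e: TupleElement): Option[((Int, (Int, LocalDate)), Double)] = e match {
    case NestedTuple(List(
           NestedTuple(List(
             NumberElement(a),
             NestedTuple(List(NumberElement(b), DateElement(d))))),
           NumberElement(c))) =>
      Some(((a.toInt, (b.toInt, d)), c))
    case _ => None
  }

  // Exhaustive match over the sealed trait: deleting one of these cases
  // makes the compiler warn that the match may not be exhaustive.
  def describe(e: TupleElement): String = e match {
    case NestedTuple(inner) => inner.map(describe).mkString("(", ",", ")")
    case NumberElement(v)   => v.toString
    case DateElement(d)     => d.toString
  }
}
```

The parser still produces a generic tree; `toTyped` is the single place where the runtime shape is checked and turned into a statically typed tuple.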



Answer 2:

A super easy way:

val splitRegex = "[(),]+".r
def f(s: String) = {
  val split = splitRegex.split(s)
  (split(1).toInt, split(2).toInt, split(3), split(4).toDouble)
}

f("((210,(18,2015/06/28)),57.0)")
// res0: (Int, Int, String, Double) = (210,18,2015/06/28,57.0)

A cleaner way:

val TupleRegex = """\(\((\d+),\((\d+),(\d+/\d+/\d+)\)\),([\d.]+)\)""".r
def f(s: String) = s match {
  case TupleRegex(n1, n2, d, n3) => (n1.toInt, n2.toInt, d, n3.toDouble)
}

f("((210,(18,2015/06/28)),57.0)")
// res1: (Int, Int, String, Double) = (210,18,2015/06/28,57.0)
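The regex version can also keep the nested structure and a real date type, which the question's edit asks for. The following is a sketch under a few assumptions: it uses `java.time.LocalDate` for the date, and the `RegexTuples` object, the `Record` alias and the `parse` method are hypothetical names. It returns `None` for lines that don't match instead of throwing a `MatchError`.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical wrapper object, introduced only for this sketch.
object RegexTuples {
  private val TupleRegex = """\(\((\d+),\((\d+),(\d+/\d+/\d+)\)\),([\d.]+)\)""".r
  private val DateFormat = DateTimeFormatter.ofPattern("yyyy/MM/dd")

  // Same shape as the input lines: ((number,(number,date)),number)
  type Record = ((Int, (Int, LocalDate)), Double)

  def parse(s: String): Option[Record] = s match {
    case TupleRegex(n1, n2, d, n3) =>
      // Try guards against values that match the regex but are still invalid,
      // such as an impossible date or a number like "1.2.3".
      Try(((n1.toInt, (n2.toInt, LocalDate.parse(d, DateFormat))), n3.toDouble)).toOption
    case _ => None
  }
}
```

With this, `lines.flatMap(RegexTuples.parse)` would silently drop malformed lines, including the `//` header line from the sample file; whether dropping or failing is preferable depends on the data.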


Answer 3:

Assuming the strings are all well formed, regular expressions, splitting and parsing will be plenty fast. You didn't mention if you wanted to maintain the structure of the original data (and gain types) or just a bag of tuples, but either is easy enough:

val strings = Array("((210,(18,2015/06/28)),57.0)",
  "((92,(60,2015/06/16)),102.89777479000209)",
  "((46,(18,2015/06/17)),52.8940162267246)",
  "((204,(27,2015/06/06)),75.2807019793683)")

val dateFormat = new java.text.SimpleDateFormat("yyyy/MM/dd")

def toUnstructuredTuple(s:String):(Int, Int, java.util.Date, Double) = {
  val noParens = s.replaceAll("[\\(\\)]", "")
  val split = noParens.split(",")

  (split(0).toInt, split(1).toInt, dateFormat.parse(split(2)), split(3).toDouble)
}

def toStructuredTuple(s:String):((Int,(Int, java.util.Date)), Double) = {
  val noParens = s.replaceAll("[\\(\\)]", "")
  val split = noParens.split(",")

  ((split(0).toInt, (split(1).toInt, dateFormat.parse(split(2)))), split(3).toDouble)
}


strings.foreach { s =>
  println("%s -> %s".format(s, toUnstructuredTuple(s)))
}


strings.foreach { s =>
  println("%s -> %s".format(s, toStructuredTuple(s)))
}

This results in:

benderino 21:54 $ bin/scala tuples.scala
((210,(18,2015/06/28)),57.0) -> (210,18,Sun Jun 28 00:00:00 PDT 2015,57.0)
((92,(60,2015/06/16)),102.89777479000209) -> (92,60,Tue Jun 16 00:00:00 PDT 2015,102.89777479000209)
((46,(18,2015/06/17)),52.8940162267246) -> (46,18,Wed Jun 17 00:00:00 PDT 2015,52.8940162267246)
((204,(27,2015/06/06)),75.2807019793683) -> (204,27,Sat Jun 06 00:00:00 PDT 2015,75.2807019793683)
((210,(18,2015/06/28)),57.0) -> ((210,(18,Sun Jun 28 00:00:00 PDT 2015)),57.0)
((92,(60,2015/06/16)),102.89777479000209) -> ((92,(60,Tue Jun 16 00:00:00 PDT 2015)),102.89777479000209)
((46,(18,2015/06/17)),52.8940162267246) -> ((46,(18,Wed Jun 17 00:00:00 PDT 2015)),52.8940162267246)
((204,(27,2015/06/06)),75.2807019793683) -> ((204,(27,Sat Jun 06 00:00:00 PDT 2015)),75.2807019793683)
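Since the question mentions a very large file, the same split-based idea can be applied lazily, one line at a time. This is only a sketch: it swaps `SimpleDateFormat` for `java.time.format.DateTimeFormatter` (which is immutable and thread-safe, and yields a `LocalDate` with no time zone attached), and the `LineTuples` object and its `parseLines` method are hypothetical names.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical wrapper object, introduced only for this sketch.
object LineTuples {
  // DateTimeFormatter is immutable and thread-safe, unlike SimpleDateFormat.
  private val DateFormat = DateTimeFormatter.ofPattern("yyyy/MM/dd")

  def toStructuredTuple(s: String): ((Int, (Int, LocalDate)), Double) = {
    val split = s.replaceAll("[()]", "").split(",")
    ((split(0).toInt, (split(1).toInt, LocalDate.parse(split(2), DateFormat))), split(3).toDouble)
  }

  // Lazily converts lines, skipping blank lines and "//" comment lines such as the header.
  // Like the functions above, it assumes the remaining lines are well formed.
  def parseLines(lines: Iterator[String]): Iterator[((Int, (Int, LocalDate)), Double)] =
    lines.map(_.trim).filter(l => l.nonEmpty && !l.startsWith("//")).map(toStructuredTuple)
}
```

A possible use would be `LineTuples.parseLines(scala.io.Source.fromFile("tuples.txt").getLines())`, where `tuples.txt` is a placeholder file name; because the result is an `Iterator`, the file is never held in memory as a whole.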


Tags: scala tuples