Text manipulation in Spark and Scala

Posted 2020-06-23 09:50

Question:

This is my data:

review/text: The product picture and part number match, but they together do not math the description.

review/text: A necessity for the Garmin. Used the adapter to power the unit on my motorcycle. Works like a charm.

review/text: This power supply did the job and got my computer back online in a hurry.

review/text: Not only did the supply work. it was easy to install, a lot quieter than the PowMax that fried.

review/text: This is an awesome power supply that was extremely easy to install. 

review/text: I had my doubts since best buy would end up charging me $60. at the time I bought my camera for the card and the cable.

review/text: Amazing... Installed the board, and that's it, no driver needed. Work great, no error messages.

and I've tried:

import org.apache.spark.{SparkContext, SparkConf}

object test12 {
  def filterfunc(s: String): Array[((String))] = {
    s.split( """\.""") 
      .map(_.split(" ")
      .filter(_.nonEmpty)
      .map(_.replaceAll( """\W""", "")
      .toLowerCase)
      .filter(_.nonEmpty)
      .flatMap(x=>x)
  }

  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("pre2").setMaster("local")
    val sc = new SparkContext(conf1)
    val rdd = sc.textFile("data/2012/2012.txt")
    val stopWords = sc.broadcast(List[String]("reviewtext", "a", "about", "above", "according", "accordingly", "across", "actually",...)

    var grouped_doc_words = rdd.flatMap({ (line) =>
      val words = line.map(filterfunc).filter(word_filter.value))
      words.map(w => {
        (line.hashCode(), w)
      })
    }).groupByKey()

  }
}

and I want to generate this output:

doc1: product picture number match together not math description. 
doc2: necessity garmin. adapter power unit my motorcycle. works like charm.
doc3: power supply job computer online hurry.
doc4: not supply work. easy install quieter powmax fried.
...

Two exceptions: 1) the negation words (not, n't, non, none) must not be filtered out; 2) all dot (.) symbols must be kept.

My code above doesn't work very well.

Answer 1:

Why not just something like this:
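
(a minimal sketch, assuming rdd holds the review lines and stopWords is a broadcast stop-word set like the one in the tested version below)

rdd.map { line =>
  line.toLowerCase                          // normalize case
    .replaceAll("""[^a-z\.\s]""", "")       // keep only letters, dots and whitespace
    .replaceAll("""\.""", " .")             // detach dots so they survive the split
    .split("\\s+")                          // tokenize on whitespace
    .filter(w => w.nonEmpty && !stopWords.value.contains(w))
    .mkString(" ")                          // one cleaned string per review/doc
}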

This way you don't need any grouping or flatMapping.

EDIT:

I was writing this by hand and indeed there were some small bugs, but I hoped the idea was clear. Here is the tested code:

import org.apache.spark.{SparkContext, SparkConf}

def processLine(s: String, stopWords: Set[String]): List[String] = {
    s.toLowerCase()
      .replaceAll("""[^a-z\.\s]""", "")  // keep only letters, dots and whitespace
      .replaceAll("""\.""", " .")        // detach dots so each becomes its own token
      .split("\\s+")                     // tokenize on whitespace
      .filter(!stopWords.contains(_))    // drop stop words
      .toList
  }

  def main(args: Array[String]): Unit = {
    val conf1 = new SparkConf().setAppName("pre2").setMaster("local")
    val sc = new SparkContext(conf1)
    val rdd = sc.parallelize(
      List(
        "The product picture and part number match, but they together do not math the description.",
        "A necessity for the Garmin. Used the adapter to power the unit on my motorcycle. Works like a charm.",
        "This power supply did the job and got my computer back online in a hurry."
      )
    )
    val stopWords = sc.broadcast(
      Set("reviewtext", "a", "about", "above",
        "according", "accordingly",
        "across", "actually", "..."))
    val grouped_doc_words = rdd.map(processLine(_, stopWords.value))
    grouped_doc_words.collect().foreach(p => println(p))
  }

This gives you as a result:

List(the, product, picture, and, part, number, match, but, they, together, do, not, math, the, description, .)
List(necessity, for, the, garmin, ., used, the, adapter, to, power, the, unit, on, my, motorcycle, ., works, like, charm, .)
List(this, power, supply, did, the, job, and, got, my, computer, back, online, in, hurry, .)

Now if you want a string, not a list, just do:

grouped_doc_words.map(_.mkString(" "))
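
For the first review, for example, this yields:

the product picture and part number match but they together do not math the description .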


Answer 2:

I think there is a bug at the marked line:

var grouped_doc_words = rdd.flatMap({ (line) =>
  val words = line.map(filterfunc).filter(word_filter.value)) // **
  words.map(w => {
    (line.hashCode(), w)
  })
}).groupByKey()

Here:

line.map(filterfunc)

should be:

filterfunc(line)

Explanation:

line is a String. map runs over a collection of items; when you do line.map(...), it runs the passed function on each Char - not something you want.

scala> val line2 = "This is a long string"
line2: String = This is a long string

scala> line2.map(_.length)
<console>:13: error: value length is not a member of Char
              line2.map(_.length)
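
Putting the fix together (a rough sketch; I'm assuming word_filter in your original was meant to be the stopWords broadcast, and that filterfunc returns the cleaned words of one line):

var grouped_doc_words = rdd.flatMap { line =>
  filterfunc(line)                             // the words of one review
    .filter(w => !stopWords.value.contains(w)) // drop stop words
    .map(w => (line.hashCode(), w))            // key each word by its document
}.groupByKey()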

Additionally, I don't know why you are using this in filterfunc:

.map(_.replaceAll( """\W""", "")

I am not able to run spark-shell properly on my end. Can you please confirm whether these fixes solve your problem?