How to generate n-grams in Scala?

Posted 2019-03-16 14:01

I am trying to code the Dissociated Press algorithm, based on n-grams, in Scala. How can I generate n-grams for a large file? For example, take a file containing "the bee is the bee of the bees".

  1. First, it has to pick a random n-gram, for example "the bee".
  2. Then it has to look for an n-gram starting with the last (n-1) words of the previous one, for example "bee of".
  3. It prints the last word of this n-gram, then repeats.

Can you please give me some hints on how to do it? Sorry for the inconvenience.

Tags: scala n-gram
3 answers
我想做一个坏孩纸
#2 · 2019-03-16 14:29

Here is a stream-based approach. It should not require too much memory while computing the n-grams.

object ngramstream extends App {

  def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
    case x #:: xs => {
      f(x)
      process(xs)(f)
    }
    case _ => Stream[Array[String]]()
  }

  def ngrams(n: Int, words: Array[String]) = {
    // exclude 1-grams
    (2 to n).map { i => words.sliding(i).toStream }
      .foldLeft(Stream[Array[String]]()) {
        (a, b) => a #::: b
      }
  }

  val words = "the bee is the bee of the bees"
  val n = 4
  val ngrams2 = ngrams(n, words.split(" "))

  process(ngrams2) { x =>
    println(x.toList)
  }

}

OUTPUT:

List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
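
Since the question mentions large files, here is a minimal sketch of the same sliding-window idea on top of an Iterator of lines instead of an in-memory Array. The object name LazyNgrams and the whitespace tokenisation are my own assumptions, not something from the question; the point is only that the words are consumed lazily, so roughly one window of n words is buffered at a time.

```scala
object LazyNgrams {
  // Lazily yields every n-gram from an iterator of lines.
  // Assumption: words are separated by whitespace; empty tokens are dropped.
  def ngrams(lines: Iterator[String], n: Int): Iterator[List[String]] =
    lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .sliding(n)
      .withPartial(false) // skip the short group produced when there are fewer than n words
      .map(_.toList)

  def main(args: Array[String]): Unit = {
    // Stand-in for the lines of a file, split over two lines on purpose.
    val lines = Iterator("the bee is the bee", "of the bees")
    ngrams(lines, 3).foreach(println)
  }
}
```

For a real file, the lines could come from scala.io.Source.fromFile(path).getLines(), where path is whatever file you are reading; remember to close the source afterwards.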
看我几分像从前
#3 · 2019-03-16 14:31

You may try this with a parameter n:

val words = "the bee is the bee of the bees"
val w = words.split(" ")

val n = 4
val ngrams = (for( i <- 1 to n) yield w.sliding(i).map(p => p.toList)).flatMap(x => x)
ngrams foreach println

List(the)
List(bee)
List(is)
List(the)
List(bee)
List(of)
List(the)
List(bees)
List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
混吃等死
#4 · 2019-03-16 14:54

Your question could be a little more specific, but here is my try.

val words = "the bee is the bee of the bees"
words.split(' ').sliding(2).foreach(p => println(p.mkString(" ")))
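
The three steps in the question (pick a random n-gram, find an n-gram that starts with its last n-1 words, emit that n-gram's last word, repeat) could be sketched roughly as below. This is only one way to read the description: the names DissociatedPress, generate and maxWords are hypothetical, and stopping when no continuation exists (for example after "bees" in the sample text) is my own assumption.

```scala
import scala.collection.mutable.ListBuffer
import scala.util.Random

object DissociatedPress {
  // Returns at most maxWords words: a random starting n-gram, followed by the last
  // word of each n-gram whose first n-1 words match the last n-1 words emitted.
  def generate(words: Seq[String], n: Int, maxWords: Int, rnd: Random): List[String] = {
    require(n >= 2, "n must be at least 2")
    val grams = words.sliding(n).filter(_.size == n).map(_.toList).toVector
    if (grams.isEmpty) Nil
    else {
      // Index: (n-1)-word prefix -> every word that followed that prefix in the text.
      val followers: Map[List[String], Vector[String]] =
        grams.groupBy(_.init).map { case (prefix, gs) => prefix -> gs.map(_.last) }

      val start = grams(rnd.nextInt(grams.size))
      val out = ListBuffer.empty[String] ++= start
      var prefix = start.tail
      var stuck = false
      while (out.size < maxWords && !stuck) {
        followers.get(prefix) match {
          case Some(candidates) =>
            val next = candidates(rnd.nextInt(candidates.size))
            out += next
            prefix = prefix.tail :+ next // slide the (n-1)-word window forward
          case None =>
            stuck = true // assumption: stop when no n-gram continues the text
        }
      }
      out.toList.take(maxWords)
    }
  }

  def main(args: Array[String]): Unit = {
    val words = "the bee is the bee of the bees".split(" ").toSeq
    println(generate(words, 2, 12, new Random()).mkString(" "))
  }
}
```

The output is random, so it differs from run to run. Building the prefix map once avoids rescanning all n-grams at every step; for a very large file the map itself may become the memory bottleneck.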