I am trying to code dissociated press algorithm based on n-gram in scala.
How to generate an n-gram for a large files:
For example, for the file containing "the bee is the bee of the bees".
- First it has to pick a random n-gram. For example, the bee.
- Then it has to look for n-grams starting with (n-1) words. For example, bee of.
- it prints the last word of this n-gram. Then repeats.
Can you please give me some hints how to do it?
Sorry for the inconvenience.
Your questions could be a little more specific but here is my try.
val words = "the bee is the bee of the bees"
words.split(' ').sliding(2).foreach( p => println(p.mkString))
You may try this with a parameter of n
val words = "the bee is the bee of the bees"
val w = words.split(" ")
val n = 4
val ngrams = (for( i <- 1 to n) yield w.sliding(i).map(p => p.toList)).flatMap(x => x)
ngrams foreach println
List(the)
List(bee)
List(is)
List(the)
List(bee)
List(of)
List(the)
List(bees)
List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)
Here is a stream based approach. This will not required too much memory while computing n-grams.
object ngramstream extends App {
def process(st: Stream[Array[String]])(f: Array[String] => Unit): Stream[Array[String]] = st match {
case x #:: xs => {
f(x)
process(xs)(f)
}
case _ => Stream[Array[String]]()
}
def ngrams(n: Int, words: Array[String]) = {
// exclude 1-grams
(2 to n).map { i => words.sliding(i).toStream }
.foldLeft(Stream[Array[String]]()) {
(a, b) => a #::: b
}
}
val words = "the bee is the bee of the bees"
val n = 4
val ngrams2 = ngrams(n, words.split(" "))
process(ngrams2) { x =>
println(x.toList)
}
}
OUTPUT:
List(the, bee)
List(bee, is)
List(is, the)
List(the, bee)
List(bee, of)
List(of, the)
List(the, bees)
List(the, bee, is)
List(bee, is, the)
List(is, the, bee)
List(the, bee, of)
List(bee, of, the)
List(of, the, bees)
List(the, bee, is, the)
List(bee, is, the, bee)
List(is, the, bee, of)
List(the, bee, of, the)
List(bee, of, the, bees)