Mongo: count the number of word occurrences in a set of documents

I have a set of documents in Mongo. Say:

[
    { summary:"This is good" },
    { summary:"This is bad" },
    { summary:"Something that is neither good nor bad" }
]

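I want to count how many times each word occurs (case-insensitively) and then sort the counts in descending order. The result should look something like this: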

[
    "is": 3,
    "bad": 2,
    "good": 2,
    "this": 2,
    "neither": 1,
    "nor": 1,
    "something": 1,
    "that": 1
]

Any ideas on how to do this? The aggregation framework would be preferred, since I already understand it to some degree :)

Answer 1:

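MapReduce might be a good fit: it can process the documents on the server without doing the manipulation on the client (there is no feature to split a string on the DB server; that is an open issue).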
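As a side note, that limitation applies to older servers. As far as I know, MongoDB 3.4 and later include a `$split` aggregation operator, so on such a version the counting could be sketched as an aggregation pipeline. This rests on an assumption about your server version; `so` is the collection name used further down in this answer, and `word` and `count` are just field names I picked.

```javascript
// Sketch only: assumes a server that supports $split (MongoDB 3.4 or later).
var pipeline = [
    // skip documents whose summary is missing or not a string
    { $match: { summary: { $type: "string" } } },
    // lowercase the summary and split it on single spaces
    { $project: { word: { $split: [ { $toLower: "$summary" }, " " ] } } },
    // one output document per word
    { $unwind: "$word" },
    // drop empty strings produced by consecutive spaces
    { $match: { word: { $ne: "" } } },
    // count the occurrences of each word
    { $group: { _id: "$word", count: { $sum: 1 } } },
    // highest count first, ties broken alphabetically
    { $sort: { count: -1, _id: 1 } }
];

// In the mongo shell you would then run:
// db.so.aggregate(pipeline)
```

If your server predates `$split`, the MapReduce approach below is the way to go.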

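Start with the map function. In the example below (which probably needs to be more robust), each document is passed to the map function (as this). The code looks for the summary field and, if it is there, lowercases it, splits it on spaces, and then emits a 1 for each word found.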

var map = function() {  
    var summary = this.summary;
    if (summary) { 
        // quick lowercase to normalize per your requirements
        summary = summary.toLowerCase().split(" "); 
        for (var i = summary.length - 1; i >= 0; i--) {
            // might want to remove punctuation, etc. here
            if (summary[i])  {      // make sure there's something
               emit(summary[i], 1); // store a 1 for each word
            }
        }
    }
};

Then, in the reduce function, it sums all of the values produced by the map function and returns a discrete total for each word that was emitted above.

var reduce = function( key, values ) {    
    var count = 0;    
    values.forEach(function(v) {            
        count +=v;    
    });
    return count;
}
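
Summing, rather than counting the entries in values, matters here: MongoDB may call reduce more than once for the same key and pass the output of an earlier call back in as one of the values. Here is a small plain-JavaScript sketch (no database involved; `sumReduce` and `lengthReduce` are names I made up for the comparison) that shows the difference:

```javascript
// Same logic as the reduce above: add up the emitted values.
var sumReduce = function(key, values) {
    var count = 0;
    values.forEach(function(v) { count += v; });
    return count;
};

// A tempting shortcut that only works if reduce runs exactly once per key.
var lengthReduce = function(key, values) {
    return values.length;
};

// Pretend "is" was emitted three times and the server reduced in two steps.
var partialSum = sumReduce("is", [1, 1]);          // first pass over two values
var finalSum = sumReduce("is", [partialSum, 1]);   // re-reduce with the third

var partialLen = lengthReduce("is", [1, 1]);
var finalLen = lengthReduce("is", [partialLen, 1]);

console.log(finalSum); // 3
console.log(finalLen); // 2, one occurrence has been lost
```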

Finally, execute the mapReduce:

> db.so.mapReduce(map, reduce, {out: "word_count"})

The results with your sample data:

> db.word_count.find().sort({value:-1})
{ "_id" : "is", "value" : 3 }
{ "_id" : "bad", "value" : 2 }
{ "_id" : "good", "value" : 2 }
{ "_id" : "this", "value" : 2 }
{ "_id" : "neither", "value" : 1 }
{ "_id" : "nor", "value" : 1 }
{ "_id" : "something", "value" : 1 }
{ "_id" : "that", "value" : 1 }
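
If you want to sanity-check the two functions without a server, the following plain-JavaScript sketch mimics what mapReduce does on the sample documents. The `emit` stub, the `groups` object and the in-memory `docs` array are stand-ins I made up for illustration; the real server groups and reduces differently under the hood.

```javascript
var docs = [
    { summary: "This is good" },
    { summary: "This is bad" },
    { summary: "Something that is neither good nor bad" }
];

var groups = Object.create(null);      // word -> list of emitted values
var emit = function(key, value) {      // stand-in for the server's emit()
    (groups[key] = groups[key] || []).push(value);
};

var map = function() {
    var summary = this.summary;
    if (summary) {
        summary = summary.toLowerCase().split(" ");
        for (var i = summary.length - 1; i >= 0; i--) {
            if (summary[i]) {
                emit(summary[i], 1);
            }
        }
    }
};

var reduce = function(key, values) {
    var count = 0;
    values.forEach(function(v) { count += v; });
    return count;
};

// run map with each document bound to `this`, as the server does
docs.forEach(function(doc) { map.call(doc); });

// reduce every group once and collect { _id, value } pairs
var results = Object.keys(groups).map(function(word) {
    return { _id: word, value: reduce(word, groups[word]) };
});

// highest count first, then alphabetical
results.sort(function(a, b) {
    return b.value - a.value || (a._id < b._id ? -1 : a._id > b._id ? 1 : 0);
});

results.forEach(function(r) { console.log(r._id + ": " + r.value); });
```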

Answer 2:

A basic MapReduce example:

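var m = function() {
    // skip documents without a string summary
    if (typeof this.summary !== "string") {
        return;
    }
    var words = this.summary.split(" ");
    for (var i = 0; i < words.length; i++) {
        if (words[i]) {   // ignore empty strings from repeated spaces
            emit(words[i].toLowerCase(), 1);
        }
    }
};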

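var r = function(k, v) {
    // sum rather than v.length: reduce may be re-run on partial results
    return v.reduce(function(a, b) { return a + b; }, 0);
};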

db.collection.mapReduce(
    m, r, { out: { merge: "words_count" } }
)

This will insert the word counts into a collection named words_count, which you can then sort (and index).

Note that it does not use stemming, strip punctuation, handle stop words, and so on.
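
As a rough illustration of two of those gaps, here is a plain-JavaScript sketch of a tokenizer. The `tokenize` name, the regular expression and the tiny stop-word list are my own assumptions, not part of MongoDB, and stemming is left out entirely. To use something like it on the server you would have to define it inside the map function or hand it over through the `scope` option of mapReduce.

```javascript
// Hypothetical stop-word list; extend it for real use.
var STOP_WORDS = { "is": true, "that": true, "this": true };

var tokenize = function(text) {
    if (typeof text !== "string") { return []; }
    return text
        .toLowerCase()
        .replace(/[^a-z0-9\s]/g, " ")   // turn punctuation into spaces (ASCII only)
        .split(/\s+/)                   // split on runs of whitespace
        .filter(function(w) {           // drop empty strings and stop words
            return w.length > 0 && !STOP_WORDS.hasOwnProperty(w);
        });
};

console.log(tokenize("This is good, that is BAD!")); // [ 'good', 'bad' ]
```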

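Also note that you can optimize the map function by accumulating the occurrences of repeated words and emitting the count, rather than emitting 1 each time.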
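
That optimization could look like the sketch below: tally the words of one document first, then emit each distinct word once with its count. The `emitted` array and the `emit` stub exist only so the snippet runs outside MongoDB; on the server you would keep just the function `m`. It only gives correct totals with a reduce that sums its values.

```javascript
var emitted = [];
var emit = function(key, value) { emitted.push([key, value]); }; // stand-in

var m = function() {
    if (typeof this.summary !== "string") { return; }
    var words = this.summary.toLowerCase().split(" ");
    var counts = Object.create(null);   // per-document tally: word -> count
    for (var i = 0; i < words.length; i++) {
        var w = words[i];
        if (w) {
            counts[w] = (counts[w] || 0) + 1;
        }
    }
    // one emit per distinct word instead of one per occurrence
    for (var word in counts) {
        emit(word, counts[word]);
    }
};

m.call({ summary: "good is good" });
console.log(emitted); // [ [ 'good', 2 ], [ 'is', 1 ] ]
```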

Source: Mongo: count the number of word occurrences in a set of documents