We are planning on using MongoDB to store large amounts of analytics data such as views and clicks. I'm unsure of the best way to structure the documents within MongoDB to aid querying and reduce database size.
We need to record actions against a pagename, client and the type of action. Ideally we need stats that go down to the year/month/day/hour level; we don't need or care about views per second or minute. While this document structure looks OK, I'm aware that 100 visitors would generate 100 new documents.
{
"_id" : ObjectId( "4dabdef81a34961506040000" ),
"pagename" : "Hello",
"action" : "view",
"client" : "client-name",
"time" : Date( "Mon Apr 18 07:49:28 2011" )
}
Is there a best-practice way of doing this, either using $inc or capped collections?
Updated answer
Hacked together in the mongo shell:
use pagestats;
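// a little helper function: builds the per-page, per-hour bucket for "now" (UTC)
var pagePerHour = function(pagename) {
  var d = new Date(); // declared with var so it does not leak into the global scope
  return {
    page : pagename,
    year : d.getUTCFullYear(),
    month : d.getUTCMonth(), // 0-11
    day : d.getUTCDate(), // 1-31
    hour : d.getUTCHours() // 0-23
  };
};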
// a pageview happened
db.pagestats.update(
pagePerHour('Hello'),
{ $inc : { views : 1 }},
true ); //we want to upsert
// somebody tweeted our page twice!
db.pagestats.update(
pagePerHour('Hello'),
{ $inc : { tweets : 2 }},
true ); //we want to upsert
db.pagestats.find();
// { "_id" : ObjectId("4dafe88a02662f38b4a20193"),
// "year" : 2011, "day" : 21, "hour" : 8, "month" : 3,
// "page" : "Hello",
// "tweets" : 2, "views" : 1 }
// 24 hour summary 'Hello' on 2011-4-21
for (var i = 0; i < 24; i++) {
  // careful: days (1-31), month (0-11) and hours (0-23)
  var stats = db.pagestats.findOne({ page: 'Hello', year: 2011, month: 3, day: 21, hour: i });
  if (stats) {
    // views is missing when the hour only recorded other actions (e.g. tweets)
    print(i + ': ' + (stats.views || 0) + ' views');
  } else {
    print(i + ': no hits');
  }
}
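The pagePerHour helper always uses the current time, so the summary loop has to spell out year, month, day and hour by hand. A small variation that takes an explicit Date might look like the sketch below. hourBucket and dayBuckets are hypothetical names, not part of the original answer, and the code is plain JavaScript that never touches the database.

```javascript
// Hypothetical helper: same fields as pagePerHour, but for any given Date (UTC).
function hourBucket(pagename, date) {
  return {
    page: pagename,
    year: date.getUTCFullYear(),
    month: date.getUTCMonth(), // 0-11, same convention as pagePerHour
    day: date.getUTCDate(),    // 1-31
    hour: date.getUTCHours()   // 0-23
  };
}

// Hypothetical helper: the 24 hourly query documents for one UTC day.
// month is 0-based here too, so April is 3.
function dayBuckets(pagename, year, month, day) {
  var buckets = [];
  for (var h = 0; h < 24; h++) {
    buckets.push(hourBucket(pagename, new Date(Date.UTC(year, month, day, h))));
  }
  return buckets;
}

var queries = dayBuckets('Hello', 2011, 3, 21);
```

Each element of queries has the same shape as the query document in the summary loop, so it could be handed to db.pagestats.findOne in the same way.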
Depending on which aspects you want to track, you might consider adding more collections (e.g. a collection for user-centric tracking). Hope that helps.
See also
Blogpost about Analytics Data
I wouldn't worry too much about space: MongoDB scales out well in that regard, and adding more storage is reasonably cheap.
One thing to be aware of is that if you keep updating a document, its size can grow, which means MongoDB will eventually need to find a new place for it on disk. If you have a lot of documents being updated and increasing in size, MongoDB will need to move them around a lot, which can slow things down significantly. Of course, this all depends on how much traffic you're expecting.
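One way to limit that kind of growth, offered only as a sketch and not something either answer here prescribes, is to create each hourly document with all of its counters already present and set to zero, so that later $inc updates change existing values instead of adding new fields. zeroedHourDoc and the list of action names are assumptions made for illustration, and whether this helps at all depends on the MongoDB version and storage engine.

```javascript
// Hypothetical helper: build an hourly stats document with every counter
// already present, so later increments do not add new fields.
function zeroedHourDoc(pagename, date, actions) {
  var doc = {
    page: pagename,
    year: date.getUTCFullYear(),
    month: date.getUTCMonth(), // 0-11
    day: date.getUTCDate(),    // 1-31
    hour: date.getUTCHours()   // 0-23
  };
  actions.forEach(function (action) {
    doc[action] = 0; // one zeroed counter per tracked action
  });
  return doc;
}

// Example: the document for 2011-04-21, 08:00 UTC, tracking two assumed actions.
var hourDoc = zeroedHourDoc('Hello', new Date(Date.UTC(2011, 3, 21, 8)), ['views', 'tweets']);
```

A document like hourDoc would be inserted once, ahead of the hour it covers, and then incremented as hits arrive.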
Based on my experience, go with a simple document format where you don't need to update the documents. It might complicate your querying later on, but you can use map/reduce to get whatever information you want regardless of your document structure (map/reduce is very flexible; given enough experience you can do almost anything with it).
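To make the map/reduce idea concrete, here is a minimal sketch that counts actions per page and hour from raw event documents shaped like the one in the question. It runs in plain JavaScript against an in-memory array. mapEvent, reduceCounts, runMapReduce and the sample events are all made up for illustration, and runMapReduce is only a stand-in for what the server does.

```javascript
// Map step: emit one count per event, keyed by page, action and UTC hour.
// emit is passed in as a parameter so the function can run outside the database.
function mapEvent(doc, emit) {
  var t = doc.time;
  var key = [doc.pagename, doc.action,
             t.getUTCFullYear(), t.getUTCMonth() + 1, // 1-12 in the key
             t.getUTCDate(), t.getUTCHours()].join('|');
  emit(key, 1);
}

// Reduce step: sum the counts for one key. Summing is safe to apply
// repeatedly to partial results.
function reduceCounts(key, values) {
  var total = 0;
  for (var i = 0; i < values.length; i++) {
    total += values[i];
  }
  return total;
}

// Hypothetical in-memory driver: groups emitted values by key, then reduces.
function runMapReduce(docs, map, reduce) {
  var groups = {};
  docs.forEach(function (doc) {
    map(doc, function (key, value) {
      (groups[key] = groups[key] || []).push(value);
    });
  });
  var out = {};
  Object.keys(groups).forEach(function (key) {
    out[key] = reduce(key, groups[key]);
  });
  return out;
}

// Made-up sample events in the shape of the question's document.
var events = [
  { pagename: 'Hello', action: 'view', client: 'client-name', time: new Date(Date.UTC(2011, 3, 18, 7, 49, 28)) },
  { pagename: 'Hello', action: 'view', client: 'client-name', time: new Date(Date.UTC(2011, 3, 18, 7, 55, 0)) },
  { pagename: 'Hello', action: 'click', client: 'client-name', time: new Date(Date.UTC(2011, 3, 18, 8, 1, 0)) }
];

var hourly = runMapReduce(events, mapEvent, reduceCounts);
```

Inside MongoDB itself the map function reads the current document from this and calls the built-in emit, so mapEvent would need adapting before being passed to mapReduce on a collection.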