I need an access statistics module for App Engine that tracks a few request handlers and collects statistics to the datastore (Bigtable). I have not found any ready-made solution on GitHub, and Google's examples are either oversimplified (memcached front-page counter with cron) or overkill (accurate sharded counter). Most importantly, no App Engine counter solution discussed elsewhere includes a time component (hourly, daily counts) needed for statistics.
Requirements: The system does not need to be 100% accurate and could just ignore memcache loss (if infrequent). This should simplify things considerably. The idea is to just use memcache and accumulate stats in time intervals.
Use case: Users on your system create content (e.g. pages). You want to track approximately how often a user's pages are viewed per hour or day. Some pages are viewed often, some never. You want to query by user and timeframe. Subpages may have fixed IDs (query for the user with the most hits on their homepage). You may want to delete old entries (query for entries of year=xxxx).
class StatisticsDB(ndb.Model):
    # key.id() = something like YYYY-MM-DD-HH_groupId_countableId ... contains date
    # timeframeId = ndb.StringProperty()  # YYYY-MM-DD-HH, needed for cleanup if counter uses ancestors
    countableId = ndb.StringProperty(required=True)  # name of counter within group
    groupId = ndb.StringProperty()  # counter group (allows single DB query with timeframe prefix inequality)
    count = ndb.IntegerProperty()   # count per specified timeframe

    @classmethod
    def increment(cls, groupId, countableId):
        # increment memcache
        # save hourly to DB (see below)
        pass
Note: indexes on groupId and countableId are necessary to avoid two inequalities in one query (e.g. query all countables of a groupId/userId; or a chart/high-count query: find the countableId with the highest counts and derive its groupId/user from it). Using ancestors in the DB may not support chart queries.
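The "timeframe prefix inequality" can be expressed as a lexicographic range filter on the string key id, combined with a single equality on groupId (the datastore allows one equality plus one inequality per query). A minimal plain-Python illustration of the idea, with hypothetical key ids standing in for datastore keys:

```python
# Hypothetical key ids of the form YYYY-MM-DD-HH_groupId_countableId,
# mapped to their groupId (standing in for the indexed groupId property).
rows = {
    "2016-07-01-09_user1_home": "user1",
    "2016-07-01-10_user1_home": "user1",
    "2016-07-01-10_user2_home": "user2",
    "2016-07-02-00_user1_home": "user1",
}

def query_group_day(rows, group_id, day):
    # One equality (groupId) plus one inequality (key range).  The range
    # [day, day + "~") matches every hourly prefix of that day, because
    # "~" sorts after the digits and the "-" used in the key ids.
    lo, hi = day, day + "~"
    return sorted(k for k, g in rows.items() if g == group_id and lo <= k < hi)

print(query_group_day(rows, "user1", "2016-07-01"))
# ['2016-07-01-09_user1_home', '2016-07-01-10_user1_home']
```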
The problem is how to best save the memcached counter to DB:
- cron: This approach is mentioned in the example docs (the front-page counter example), but uses fixed counter ids that are hardcoded in the cron handler. As there is no prefix query for existing memcache keys, determining which counter ids were created in memcache during the last time interval (and thus need saving) is probably the bottleneck.
- task-queue: if a counter is created schedule a task to collect it and write it to DB. COST: 1 task-queue entry per used counter and one ndb.put per time granularity (e.g. 1 hour) when the queue-handler saves the data. Seems the most promising approach to also capture infrequent events accurately.
- infrequently, when increment(id) executes: if a new timeframe starts, save the previous one. This needs at least 2 memcache accesses per increment: one for tracking the timeframe (get date) and one for the counter (incr). Disadvantage: bursty counters with longer stale periods may lose the cached value.
- infrequently when increment(id) executes: probabilistic: if random % 100 == 0 then save to DB, but the counter should have uniformly distributed counting events
- infrequently when increment(id) executes: if the counter reaches e.g. 100 then save to DB
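The task-queue variant (the most promising one above) can be sketched without any GAE APIs by using dicts as stand-ins for memcache and the datastore, and a list as a stand-in for the queue; all names here are hypothetical. On the first increment of a key, a delayed flush is scheduled exactly once; the flush folds the counter into the stored total and removes the memcache entry:

```python
memcache = {}   # stand-in for App Engine memcache
datastore = {}  # stand-in for StatisticsDB entities (key id -> count)
pending = []    # stand-in for the task queue (tasks to run after the delay)

def increment(counter_key):
    if counter_key not in memcache:
        memcache[counter_key] = 0
        pending.append(counter_key)  # schedule a delayed flush exactly once
    memcache[counter_key] += 1

def flush(counter_key):
    # pop-then-write: simple, but loses the count if the write fails
    # (see the refinement discussed in the answers below)
    count = memcache.pop(counter_key, 0)
    if count:
        datastore[counter_key] = datastore.get(counter_key, 0) + count

for _ in range(3):
    increment("2016-07-01-10_user1_home")
increment("2016-07-01-10_user1_about")

for key in pending:  # in GAE, the queue would run these after the countdown
    flush(key)
pending = []

print(datastore)
# {'2016-07-01-10_user1_home': 3, '2016-07-01-10_user1_about': 1}
```

This matches the stated cost: one queue entry per used counter and one put per time granularity.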
Did anyone solve this problem, what would be a good way to design this? What are the weaknesses and strengths of each approach? Are there alternate approaches that are missing here?
Assumptions: counting can be slightly inaccurate (cache loss), the counterID space is large, and counterIDs are incremented sparsely (some once per day, some often per day).
Update: 1) I think cron can be used similarly to the task queue. One only has to create the DB model of the counter with memcached=True at the first increment and have cron run a query for all counters marked that way. COST: 1 put at the 1st increment, a query at cron time, and 1 put to update the counter. Without thinking it through fully, this appears slightly more costly/complex than the task approach.
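The cron variant described above can be sketched the same way (dicts as hypothetical stand-ins for memcache and the datastore): the first increment also puts a marker entity with memcached=True, and the cron handler queries for marked entities and folds the cached counts in:

```python
memcache = {}
datastore = {}  # key id -> {"count": int, "memcached": bool}, stand-in for StatisticsDB

def increment(key):
    if key not in memcache:
        # 1 put at the first increment: create/mark the counter entity
        entity = datastore.setdefault(key, {"count": 0, "memcached": False})
        entity["memcached"] = True
        memcache[key] = 0
    memcache[key] += 1

def cron_flush():
    # cron: query all entities marked memcached=True, fold in the cache
    for key, entity in datastore.items():
        if entity["memcached"]:
            entity["count"] += memcache.pop(key, 0)
            entity["memcached"] = False  # 1 put to update the counter

for _ in range(5):
    increment("2016-07-01-10_user1_home")
cron_flush()
print(datastore["2016-07-01-10_user1_home"]["count"])  # 5
```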
Discussed elsewhere:
- High concurrency non-sharded counters - no count per timeframe
- Open Source GAE Fast Counters - no count per timeframe, nice performance comparison to sharded solution, expected losses due to memcache loss reported
Yep, your #2 idea seems to best address your requirements.
To implement it you need task execution with a specified delay.

I used the deferred library for that purpose, via the countdown argument of deferred.defer(). I learned in the meantime that the standard task queue library has equivalent support: the countdown argument of the Task constructor (I have yet to use this approach, though).

So whenever you create a memcache counter, also enqueue a delayed execution task (passing the counter's memcache key in its payload) which will:

- read the memcache counter value
- add the value to the corresponding datastore entity
- delete the memcache counter
You'll probably lose the increments from concurrent requests between the moment the memcache counter is read in the task execution and the memcache counter being deleted. You could reduce such loss by deleting the memcache counter immediately after reading it, but you'd risk losing the entire count if the DB update fails for whatever reason - re-trying the task would no longer find the memcache counter. If neither of these is satisfactory you could further refine the solution:
The delayed task:

- reads the memcache counter value
- deletes the memcache counter immediately
- enqueues a second, non-delayed task, passing the value in its payload; that task performs the datastore update

The non-delayed task is now idempotent and can be safely re-tried until successful. The risk of loss of increments from concurrent requests still exists, but I guess it's smaller.
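The refined two-phase pattern can be sketched as follows (dicts and a list as hypothetical stand-ins for memcache, the datastore, and the non-delayed task queue). The key point is that after the delayed task runs, the count lives in the task payload rather than in memcache, so a failed datastore write can be retried:

```python
memcache = {}
datastore = {}
db_tasks = []  # stand-in for non-delayed tasks carrying (key, value) payloads

def delayed_flush(key):
    # read and delete the memcache counter in one step; the value now
    # lives only in the enqueued task's payload
    value = memcache.pop(key, 0)
    if value:
        db_tasks.append((key, value))

def db_update(key, value, fail=False):
    # safe to re-run until the first success: the payload, not memcache,
    # holds the value, so nothing is lost if the write fails
    if fail:
        raise RuntimeError("simulated datastore outage")
    datastore[key] = datastore.get(key, 0) + value

memcache["2016-07-01-10_user1_home"] = 7
delayed_flush("2016-07-01-10_user1_home")

key, value = db_tasks[0]
try:
    db_update(key, value, fail=True)  # first attempt fails...
except RuntimeError:
    pass
db_update(key, value)                 # ...the retry still has the value
print(datastore[key])  # 7
```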
Update:

Task Queues are preferable to the deferred library; the deferred functionality is available via the optional countdown or eta arguments to taskqueue.add().

Counting things in a distributed system is a hard problem. There's some good info on the problem from the early days of App Engine. I'd start with Sharding Counters, which, despite being written in 2008, is still relevant.
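For reference, the core idea of that article is to spread writes across N counter entities ("shards") so that no single entity becomes a contention hot spot, and to sum the shards on read. A minimal sketch of the technique (a plain list standing in for the per-shard datastore entities):

```python
import random

NUM_SHARDS = 20  # hypothetical shard count

shards = [0] * NUM_SHARDS  # stand-in for one datastore entity per shard

def increment():
    # each write picks a random shard, so concurrent writes rarely
    # contend on the same entity
    shards[random.randrange(NUM_SHARDS)] += 1

def get_count():
    # reads sum all shards
    return sum(shards)

for _ in range(1000):
    increment()
print(get_count())  # 1000
```

Note this addresses write contention, not the time-component requirement from the question.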
Here is the code for the implementation of the task-queue approach with an hourly timeframe. Interestingly, it works without transactions or other mutex magic. (For readability, the Python indentation of the methods is wrong.)
Supporting priorities for memcache would increase the accuracy of this solution.
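A minimal self-contained sketch of such an hourly task-queue flush (dicts standing in for memcache and the datastore, a set standing in for "task already enqueued"; all names hypothetical) could look like:

```python
from datetime import datetime, timezone

memcache = {}
datastore = {}
scheduled = set()  # memcache keys with a flush task already enqueued

def timeframe_id(now=None):
    now = now or datetime.now(timezone.utc)
    return now.strftime("%Y-%m-%d-%H")

def counter_key(group_id, countable_id, now=None):
    return "%s_%s_%s" % (timeframe_id(now), group_id, countable_id)

def increment(group_id, countable_id, now=None):
    key = counter_key(group_id, countable_id, now)
    memcache[key] = memcache.get(key, 0) + 1
    if key not in scheduled:
        scheduled.add(key)  # in GAE: enqueue a task with countdown ~1 hour

def flush(key):
    # runs once per counter per hour; no transaction needed because each
    # (timeframe, group, countable) key is flushed by exactly one task
    count = memcache.pop(key, 0)
    if count:
        datastore[key] = datastore.get(key, 0) + count
    scheduled.discard(key)

t = datetime(2016, 7, 1, 10, tzinfo=timezone.utc)
for _ in range(4):
    increment("user1", "home", now=t)
flush(counter_key("user1", "home", now=t))
print(datastore)
# {'2016-07-01-10_user1_home': 4}
```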