Designing an access/web statistics counter module

Posted 2020-03-26 13:48

I need an access-statistics module for App Engine that tracks a few request handlers and collects statistics to Bigtable. I have not found any ready-made solution on GitHub, and Google's examples are either oversimplified (memcached front-page counter with cron) or overkill (accurate sharded counter). Most importantly, no App Engine counter solution discussed elsewhere includes a time component (hourly, daily counts), which is needed for statistics.

Requirements: The system does not need to be 100% accurate and can simply ignore memcache loss (if infrequent). This should simplify things considerably. The idea is to use only memcache and accumulate stats in time intervals.

Use case: Users on your system create content (e.g. pages). You want to track approximately how often a user's pages are viewed per hour or day. Some pages are viewed often, some never. You want to query by user and timeframe. Subpages may have fixed IDs (e.g. query for the user with the most hits on the homepage). You may want to delete old entries (query for entries of year=xxxx).

class StatisticsDB(ndb.Model):
    # key.id() = something like YYYY-MM-DD-HH_groupId_countableID ... contains date
    # timeframeId = ndb.StringProperty() YYYY-MM-DD-HH needed for cleanup if counter uses ancestors
    countableId = ndb.StringProperty(required=True) # name of counter within group
    groupId = ndb.StringProperty() # counter group (allows single DB query with timeframe prefix inequality)
    count = ndb.IntegerProperty() # count per specified timeframe

    @classmethod
    def increment(cls, groupId, countableId):
        # increment the memcache counter
        # save hourly to the DB (see below)
        pass

Note: indexes on groupId and countableId are necessary to avoid two inequalities in queries (querying all countables of a groupId/userId, and the chart/high-count query: the countableId with the highest counts determines the groupId/user). Using ancestors in the DB may not support the chart queries.
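To illustrate how the composite key id enables a single prefix inequality query per timeframe, here is a minimal sketch. The helper names and the `user42`/`homepage` ids are illustrative, not from the original; the ndb equivalent would compare `StatisticsDB.key` against the two bounds.

```python
from datetime import datetime

def make_counter_key_id(ts, group_id, countable_id):
    # Composite key id: YYYY-MM-DD-HH_groupId_countableId,
    # mirroring the key.id() layout sketched in the model above.
    return ts.strftime("%Y-%m-%d-%H") + "_" + group_id + "_" + countable_id

def timeframe_prefix_range(prefix):
    # Bounds for a single key inequality: prefix <= id < prefix + '\xff'.
    # Works for hour ("2016-01-31-17"), day ("2016-01-31"), or year ("2016") prefixes.
    return prefix, prefix + "\xff"

key_id = make_counter_key_id(datetime(2016, 1, 31, 17), "user42", "homepage")
lo, hi = timeframe_prefix_range("2016-01-31")
```

Because the timeframe comes first in the id, cleanup queries (e.g. all entries of a year) need only this one key range, leaving the groupId/countableId equality filters free.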

The problem is how to best save the memcached counter to DB:

  1. cron: This approach is mentioned in the example docs (the front-page counter example), but uses fixed counter ids that are hardcoded in the cron handler. As there is no prefix query for existing memcache keys, determining which counter ids were created in memcache during the last time interval and need to be saved is probably the bottleneck.
  2. task queue: if a counter is created, schedule a task to collect it and write it to the DB. COST: 1 task-queue entry per used counter and one ndb.put per time granularity (e.g. 1 hour) when the queue handler saves the data. This seems the most promising approach, as it also captures infrequent events accurately.
  3. infrequently, when increment(id) executes: if a new timeframe starts, save the previous one. This needs at least 2 memcache accesses (get date, incr counter) per increment: one to track the timeframe and one for the counter. Disadvantage: bursty counters with longer stale periods may lose the cached value.
  4. infrequently, when increment(id) executes, probabilistically: if random.randint(0, 99) == 0 then save to DB; but the counting events should be uniformly distributed for this to work.
  5. infrequently, when increment(id) executes: if the counter reaches e.g. 100, then save to DB.
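Options 4 and 5 can be sketched in a few lines. This is a runnable toy, not the real implementation: the dicts stand in for memcache and the ndb counter rows, and the class/threshold names are made up for illustration.

```python
class ThresholdCounter:
    """Sketch of option 5: flush the cached count to the DB once it
    reaches a fixed threshold. An `or random.randint(0, 99) == 0` in the
    flush condition would turn this into option 4."""
    FLUSH_AT = 100  # illustrative threshold

    def __init__(self):
        self.cache = {}   # stands in for memcache
        self.db = {}      # stands in for the ndb counter rows

    def increment(self, key):
        n = self.cache.get(key, 0) + 1
        self.cache[key] = n
        if n >= self.FLUSH_AT:              # flush condition
            self.db[key] = self.db.get(key, 0) + n
            self.cache[key] = 0             # tail below the threshold stays cached

c = ThresholdCounter()
for _ in range(250):
    c.increment("2016-01-31-17_user42_homepage")
```

The weakness named in the assumptions shows up directly: a counter incremented once per day never reaches the threshold and its tail sits in memcache until evicted, which is exactly why the task-queue option handles infrequent events better.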

Did anyone solve this problem, what would be a good way to design this? What are the weaknesses and strengths of each approach? Are there alternate approaches that are missing here?

Assumptions: counting can be slightly inaccurate (cache loss), the counterID space is large, and counterIDs are incremented sparsely (some once per day, some often per day).

Update: 1) I think cron can be used similarly to the task queue. One only has to create the DB model of the counter with memcached=True at the first increment and run a cron query for all counters marked that way. COST: 1 put at the first increment, a query at cron time, 1 put to update the counter. Without thinking it through fully, this appears slightly more costly/complex than the task approach.
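The cron-marker idea in the update can be sketched as follows. The dicts again stand in for memcache and the datastore, and the method names are illustrative; the `cron()` loop stands in for a `StatisticsDB.query(StatisticsDB.memcached == True)` run by the cron handler.

```python
class CronMarkerSketch:
    """Sketch of the update: put a marker row at the first increment so
    cron can find live counters, then fold the cached value into the row."""
    def __init__(self):
        self.cache = {}
        self.rows = {}   # key -> {"count": int, "memcached": bool}

    def increment(self, key):
        if key not in self.rows:                        # 1 put at first increment
            self.rows[key] = {"count": 0, "memcached": True}
        self.cache[key] = self.cache.get(key, 0) + 1

    def cron(self):
        # stands in for the cron query over rows with memcached == True
        for key, row in self.rows.items():
            if row["memcached"]:
                row["count"] += self.cache.pop(key, 0)  # 1 put to update the counter
                row["memcached"] = False                # unmark until the next increment
```

The extra cost compared to the task approach is visible here: every counter needs a marker put up front plus the cron-time query, where the task approach pays only one task enqueue per counter per timeframe.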

Discussed elsewhere:

3 answers
老娘就宠你
Answer 2 · 2020-03-26 13:56

Yep, your #2 idea seems to best address your requirements.

To implement it you need a task execution with a specified delay.

I used the deferred library for this purpose, via the deferred.defer()'s countdown argument. I learned in the meantime that the standard taskqueue library has similar support, by specifying the countdown argument for a Task constructor (I have yet to use this approach, though).

So whenever you create a memcache counter also enqueue a delayed execution task (passing in its payload the counter's memcache key) which will:

  • get the memcache counter value using the key from the task payload
  • add the value to the corresponding db counter
  • delete the memcache counter when the db update is successful

You'll probably lose the increments from concurrent requests between the moment the memcache counter is read in the task execution and the memcache counter being deleted. You could reduce such loss by deleting the memcache counter immediately after reading it, but you'd risk losing the entire count if the DB update fails for whatever reason - re-trying the task would no longer find the memcache counter. If neither of these is satisfactory you could further refine the solution:

The delayed task:

  • reads the memcache counter value
  • enqueues another (transactional) task (with no delay) for adding the value to the db counter
  • deletes the memcache counter

The non-delayed task is now idempotent and can be safely re-tried until successful.

The risk of loss of increments from concurrent requests still exists, but I guess it's smaller.
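The refined two-task flow above can be sketched with in-memory stand-ins. This is a toy model, not GAE code: the dicts replace memcache and the datastore, and the `pending` list replaces the transactional write task; class and method names are made up.

```python
class TwoPhaseFlush:
    """Sketch of the refinement: the delayed task snapshots the counter and
    enqueues an idempotent write task BEFORE deleting the memcache entry."""
    def __init__(self):
        self.cache = {}     # stands in for memcache
        self.db = {}        # stands in for the db counters
        self.pending = []   # stands in for the (transactional) write tasks

    def delayed_task(self, key):
        value = self.cache.get(key)
        if value is None:
            return                          # counter evicted: count lost, as discussed
        self.pending.append((key, value))   # enqueue the write task first
        del self.cache[key]                 # only then drop the memcache counter

    def write_task(self):
        # Each queued snapshot is applied exactly once; a failed run that
        # left items in `pending` can be safely retried.
        while self.pending:
            key, value = self.pending.pop()
            self.db[key] = self.db.get(key, 0) + value
```

The ordering is the point: once the snapshot is safely enqueued, deleting the memcache entry can no longer lose the whole count, only the increments that race in between.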

Update:

Task Queues are preferable to the deferred library; the deferred functionality is available via the optional countdown or eta arguments to taskqueue.add():

  • countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.

  • eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.

乱世女痞
Answer 3 · 2020-03-26 14:01

Counting things in a distributed system is a hard problem. There's some good info on the problem from the early days of App Engine. I'd start with Sharding Counters, which, despite being written in 2008, is still relevant.

劳资没心，怎么记你
Answer 4 · 2020-03-26 14:15

Here is the code for an implementation of the task-queue approach with an hourly timeframe. Interestingly, it works without transactions or other mutex magic.

Supporting priorities for memcache would further increase the accuracy of this solution.

TASK_URL = '/h/statistics/collect/' # Example: '/h/statistics/collect/{counter-id}"?groupId=" + groupId + "&countableId=" + countableId'
MEMCACHE_PREFIX = "StatisticsDB_"

class StatisticsDB(ndb.Model):
    """
    Memcached counting, saved to the DB each hour.
    """
    # key.id() = 2016-01-31-17_groupId_countableId
    countableId = ndb.StringProperty(required=True) # unique name of counter within group
    groupId = ndb.StringProperty() # counter group (allows single DB query for group of counters)
    count = ndb.IntegerProperty(default=0) # count per timeframe

    @classmethod
    def increment(cls, groupId, countableId):  # throws InvalidTaskNameError
        """
        Increment a counter. countableId is the unique id of the countable.
        Throws InvalidTaskNameError if ids do not match [a-zA-Z0-9-_]{1,500}.
        """
        # Calculate the memcache key and db key at this time.
        # The counting timeframe is 1h, determined by %H; MUST MATCH the ETA calculation in _add_task().
        counter_key = datetime.datetime.utcnow().strftime("%Y-%m-%d-%H") + "_" + groupId + "_" + countableId
        client = memcache.Client()

        n = client.incr(MEMCACHE_PREFIX + counter_key)
        if n is None:
            cls._add_task(counter_key, groupId, countableId)
            client.incr(MEMCACHE_PREFIX + counter_key, initial_value=0)

    @classmethod
    def _add_task(cls, counter_key, groupId, countableId):
        taskurl = TASK_URL + counter_key + "?groupId=" + groupId + "&countableId=" + countableId
        now = datetime.datetime.now()
        # The counting timeframe is 1h, determined by counter_key; MUST MATCH the key calculation above.
        eta = now + datetime.timedelta(minutes=(61 - now.minute)) # at most 61 min later, throttled by queue parameters
        task = taskqueue.Task(url=taskurl, method='GET', name=MEMCACHE_PREFIX + counter_key, eta=eta)
        queue = taskqueue.Queue(name='StatisticsDB')
        try:
            queue.add(task)
        except taskqueue.TaskAlreadyExistsError: # may also occur if 2 increments run simultaneously
            logging.warning("StatisticsDB TaskAlreadyExistsError lost memcache for %s", counter_key)
        except taskqueue.TombstonedTaskError: # task name is still locked
            logging.warning("StatisticsDB TombstonedTaskError: task for %s was already run", counter_key)

    @classmethod
    def save2db_task_handler(cls, counter_key, countableId, groupId):
        """
        Save the counter from memcache to the DB. Idempotent method.
        At the time this executes, no more increments to this counter occur.
        """
        dbkey = ndb.Key(StatisticsDB, counter_key)

        n = memcache.get(MEMCACHE_PREFIX + counter_key)
        if n is None:
            logging.warning("StatisticsDB lost count for %s", counter_key)
            return

        stats = StatisticsDB(key=dbkey, count=n, countableId=countableId, groupId=groupId)
        stats.put()
        memcache.delete(MEMCACHE_PREFIX + counter_key) # delete only if put succeeded
        logging.info("StatisticsDB saved %s n = %i", counter_key, n)
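The "MUST MATCH" comments hinge on the ETA arithmetic: scheduling the flush (61 - minute) minutes ahead always lands it in the hour after the counter's timeframe, so all increments for that key have finished by the time the handler runs. A small, self-contained check of that arithmetic (the helper name is illustrative):

```python
import datetime

def flush_eta(now):
    # Mirrors the eta calculation in _add_task(): (61 - minute) minutes ahead
    # always lands at minute 1 of the NEXT hour, at most 61 minutes later.
    return now + datetime.timedelta(minutes=(61 - now.minute))

# A task created at the very end of the hour still flushes after the hour rolls over,
# and one created at the very start waits the full 61 minutes.
late = flush_eta(datetime.datetime(2016, 1, 31, 17, 59))
early = flush_eta(datetime.datetime(2016, 1, 31, 17, 0))
```

Both cases resolve to the same instant, 18:01, which is why the task name (one per counter per hour) never collides with a still-pending flush of the same timeframe.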