A web application I am developing (in PHP) requires the ability to log each page request.
Just like a normal access_log, it will store details like the URL requested, source IP address and date/time, but I also need it to store the user ID of the logged-in user (which is stored in a PHP session variable).
This data will then be queried to create site-wide or per-user analytics reports as required at a later date - things such as total number of visits/unique visits, page views in a certain time period, geo-locating the IP addresses and looking at locations, most active times of day, most active members, etc.
The obvious thing to do would be to have a MySQL insert statement on each page, but if the application is receiving thousands of req/sec this is going to be a huge bottleneck on the database, so I am looking at alternative, scalable ways of doing this without big infrastructure requirements.
A few of the ideas I've had are:
1) Work on a way for Nginx to be able to log the user_id from the session/application in the normal web server access_log, which can be parsed and loaded into a database periodically (nightly). This feels like a bit of a hack and will need doing on each web server as the system scales out.
2) Log each page request into Redis which has high write speeds - the problem with this is the lack of ability to query the data at a later date.
3) Log each page request into either Memcache/Redis acting as a cache (or a message queue) and from there it would be regularly extracted, inserted into MySQL and removed.
4) Would something like MongoDB which has more query capability be suitable?
I'm interested in how you would approach this and if anyone has any experience of a similar application (or has come across anything online).
I'm also interested in thoughts on how the data could be suitably structured to be stored in Memcache/Redis.
Thanks
It is certainly doable in a variety of ways. I'll address each listed option as well as offer some additional commentary.
1) If Nginx can do it, let it. I do it with Apache as well as JBoss and Tomcat. I then use syslog-ng to collect them centrally and process from there. For this route I'd suggest a delimited log message format such as tab-separated, as it makes it easier to parse and read (there is a small parsing sketch after this list). I don't know about it logging PHP variables, but it can certainly log headers and cookie information. If you are going to use Nginx logging at all I'd recommend this route if possible - why log twice?
2) There is no "lack of ability to query the data at a later date"; more on that below.
3) This is an option but whether or not it is useful depends on how long you want to keep the data and how much cleanup you want to write. More below.
4) MongoDB could certainly work. You will have to write the queries, and they are not simple SQL commands.
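To illustrate the point about delimited formats in option 1, here is a minimal parsing sketch in Python; the field order and the user-ID field are only assumptions about what you might choose to log, not anything the web server produces by default:

    # A tab-separated access log line splits into fields with no regex needed.
    # The layout here (timestamp, IP, user ID, URL, status) is just an example.
    line = "2011-05-18T01:04:00\t203.0.113.7\t42\t/foo.html\t200"
    timestamp, ip, user_id, url, status = line.strip().split("\t")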
Now, on to storing the data in Redis. I currently log things with syslog-ng as noted and use a program destination to parse the data and stuff it into Redis. In my case I've got several grouping criteria such as by vhost and by cluster, so my structures may be a bit different.
The question you need to address first is "what data do I want out of this data"? Some of it will be counters such as traffic rates. Some of it will be aggregates, and still more will be things like "order my pages by popularity".
I'll demonstrate some of the techniques to easily get this into redis (and thus back out).
First, let us consider the traffic-over-time stats. Start by deciding on the granularity: do you want per-minute stats or will per-hour stats suffice? Here is one way to track a given URL's traffic:
Store the data in a sorted set using the key "traffic-by-url:URL:YYYY-MM-DD". In this sorted set you'll use the zincrby command and supply the member "HH:MM". For example, in Python where "r" is your Redis connection:
    r.zincrby("traffic-by-url:/foo.html:2011-05-18", "01:04", 1)
This example increments the counter for the URL "/foo.html" on the 18th of May at 1:04 in the morning.
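To make that concrete, here is a minimal sketch of the same pattern using the redis-py client; the helper name log_hit is only illustrative:

    import datetime

    import redis

    # Note: in redis-py >= 3.0 the signature is zincrby(name, amount, value);
    # older releases (as in the example above) took (name, value, amount).
    r = redis.Redis(host="localhost", port=6379, db=0)

    def log_hit(url, when=None):
        """Bump the per-minute counter for a URL in its daily sorted set."""
        when = when or datetime.datetime.utcnow()
        key = "traffic-by-url:%s:%s" % (url, when.strftime("%Y-%m-%d"))
        r.zincrby(key, 1, when.strftime("%H:%M"))

    log_hit("/foo.html")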
To retrieve data for a specific day, you can call zrange on the key ("traffic-by-url:URL:YYYY-MM-DD") to get the sorted set from least popular to most popular. To get the top 10, for example, you'd use zrevrange and give it the range. zrevrange returns a reverse sort; the most hit will be at the top. Several more sorted set commands are available that allow you to do nice queries such as pagination, getting a range of results by minimum score, etc.
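For example, a few of those reads with redis-py (again assuming the >= 3.0 signatures) might look like this:

    key = "traffic-by-url:/foo.html:2011-05-18"

    # Every minute of the day, least to most popular, with its counter.
    all_minutes = r.zrange(key, 0, -1, withscores=True)

    # The ten busiest minutes, most hit first.
    top_ten = r.zrevrange(key, 0, 9, withscores=True)

    # Only minutes that saw at least 100 hits.
    busy_minutes = r.zrangebyscore(key, 100, "+inf")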
You can simply alter or extend your key name to handle different temporal windows. By combining this with zunionstore you can automatically roll up to less granular time periods. For example, you could do a union of all keys in a week or month and store the result in a new key like "traffic-by-url:monthly:URL:YYYY-MM". By doing the above on all URLs in a given day you can get daily totals. Of course, you could also have a daily total traffic key and increment that. It mostly depends on when you want the data to be input - offline via logfile import or as part of the user experience.
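As a sketch of that roll-up (the key names simply follow the scheme above), a month's worth of daily keys can be unioned into one monthly key:

    # Union the daily sorted sets for May into a single monthly set;
    # keys that don't exist are simply treated as empty.
    daily_keys = ["traffic-by-url:/foo.html:2011-05-%02d" % d for d in range(1, 32)]
    r.zunionstore("traffic-by-url:monthly:/foo.html:2011-05", daily_keys)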
I'd recommend against doing much during the actual user request, as it adds to the time it takes your users to see the page (and to server load). Ultimately that will be a call based on traffic levels and resources.
As you can imagine, the above storage scheme can be applied to any counter-based stat you want or can determine. For example, change URL to userID and you have per-user tracking.
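For instance, the per-user variant is just a different key (the key name here is illustrative):

    # Same per-minute counter, keyed by user ID instead of URL.
    r.zincrby("traffic-by-user:42:2011-05-18", 1, "01:04")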
You could also store logs raw in Redis. I do this for some logs, storing them as JSON strings (I have them as key-value pairs). Then I have a second process that pulls them out and does things with the data.
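One way that could look (the exact structure is up to you; the list name and fields here are only illustrative) is to push each raw hit onto a Redis list as JSON and have a worker drain it:

    import json

    def log_raw(hit):
        # hit is a dict of whatever fields you care about
        r.lpush("raw-hits", json.dumps(hit))

    def drain():
        # Second process: pop entries off and do something with them.
        while True:
            entry = r.rpop("raw-hits")
            if entry is None:
                break
            hit = json.loads(entry)
            # ... insert into MySQL, aggregate, geo-locate, etc.

    log_raw({"user_id": 42, "url": "/foo.html", "ip": "203.0.113.7"})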
For storing raw hits you could also use a sorted set, using the epoch time as the score, and easily grab a temporal window using the zrangebyscore/zrevrangebyscore commands. Or store them in a key that is based on the user ID. Sets would work for this, as would sorted sets.
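Here is a small sketch of the epoch-scored variant, keyed per user (the key and member formats are only an assumption):

    import time

    def log_hit_for_user(user_id, url):
        ts = time.time()
        # Score by epoch time; fold the timestamp into the member so repeated
        # hits on the same URL don't collapse into one entry.
        r.zadd("hits-by-user:%s" % user_id, {"%s|%f" % (url, ts): ts})

    # Everything user 42 did in the last hour:
    now = time.time()
    recent = r.zrangebyscore("hits-by-user:42", now - 3600, now)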
Another option I've not discussed, but which may be useful for some of your data, is storing it as a hash. This could be useful for storing detailed information about a given session, for example.
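For example, a per-session hash might look like this (the key and field names are illustrative):

    # One hash per session, keyed by session ID.
    r.hset("session:abc123", "user_id", "42")
    r.hset("session:abc123", "ip", "203.0.113.7")
    r.hset("session:abc123", "last_url", "/foo.html")

    details = r.hgetall("session:abc123")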
If you really want the data in a database, try using Redis' Pub/Sub feature and have a subscriber that parses it into a delimited format and dumps to a file. Then have an import process that uses the copy command (or equivalent for your DB) to import in bulk. Your DB will thank you.
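A sketch of such a subscriber follows (the channel name and field layout are only illustrative); from there a periodic bulk load (COPY in PostgreSQL, LOAD DATA INFILE in MySQL) picks the file up:

    import json

    pubsub = r.pubsub()
    pubsub.subscribe("page-hits")

    with open("/var/tmp/hits.tsv", "a") as out:
        for message in pubsub.listen():
            if message["type"] != "message":
                continue
            hit = json.loads(message["data"])
            out.write("\t".join(str(hit[k]) for k in ("ts", "user_id", "ip", "url")) + "\n")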
A final bit of advice here (I've probably taken enough mental time already) is to make judicious and liberal use of the expire command. Using Redis 2.2 or newer you can set expirations even on counter keys. The big advantage here is automatic data cleanup. Imagine you follow a scheme like I've outlined above. By using the expiration commands you can automatically purge old data. Perhaps you want hourly stats for up to 3 months, then only the daily stats; daily stats for 6 months, then monthly stats only. Simply expire your hourly keys after three months (86400*90) and your daily keys at six (86400*180), and you won't need to do cleanup.
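With the key scheme from earlier, that is just a matter of setting a TTL when the key is created (the exact retention is whatever suits your reports):

    # Per-minute data for a day drops off after ~3 months ...
    r.expire("traffic-by-url:/foo.html:2011-05-18", 86400 * 90)

    # ... while the monthly roll-up sticks around for ~6 months.
    r.expire("traffic-by-url:monthly:/foo.html:2011-05", 86400 * 180)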
For geotagging I do offline processing of the IP. Imagine a sorted set with the key structure "traffic-by-ip:YYYY-MM-DD", using the IP as the member; with the zincrby command noted above you get per-IP traffic data. Now, in your report, you can get the sorted set and do lookups of the IPs. To save traffic when doing the reports, you could set up a hash in Redis that maps the IP to the location you want. For example, "geo:country" as the key, the IP as the hash member, and the country code as the stored value.
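Here is a sketch of the per-IP counters plus the country lookup (the key names follow the scheme just described; the values are made up):

    # Count per-IP traffic for the day.
    r.zincrby("traffic-by-ip:2011-05-18", 1, "203.0.113.7")

    # The offline process fills in the country lookup once per IP ...
    r.hset("geo:country", "203.0.113.7", "GB")

    # ... and the report joins the two.
    for ip, hits in r.zrevrange("traffic-by-ip:2011-05-18", 0, -1, withscores=True):
        country = r.hget("geo:country", ip)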
A big caveat I would add is that if your traffic level is very high you may want to run two instances of Redis (or more, depending on traffic). The first would be the write instance; it would not have the bgsave option enabled. If your traffic is pretty high you'll always be doing a bgsave. This is what I recommend the second instance for: it is a slave to the first and it does the saves to disk. You can also run your queries against the slave to distribute load.
I hope that gives you some ideas and things to try out. Play around with the different options to see what works best for your specific needs. I am tracking a lot of stats on a high traffic website (and also MTA log stats) in redis and it performs beautifully - combined with Django and Google's Visualization API I get very nice looking graphs.
When you use MongoDB for logging, the concern is lock contention from high write throughput. Although MongoDB's insert is fire-and-forget style by default, calling a lot of insert()s causes heavy write lock contention. This could affect application performance and prevent readers from aggregating / filtering the stored logs.
One solution might be to use a log collector framework such as Fluentd, Logstash, or Flume. These daemons are supposed to be launched on every application node, and they take the logs from the app processes.
They buffer the logs and asynchronously write the data out to other systems like MongoDB / PostgreSQL / etc. The writes are done in batches, so it's a lot more efficient than writing directly from the apps. This link describes how to put logs into Fluentd from a PHP program.
- Fluentd: Data Import from PHP Applications
Here are some tutorials about MongoDB + Fluentd.
- Fluentd + MongoDB: The Easiest Way to Log Your Data Effectively on 10gen blog
- Fluentd: Store Apache Logs into MongoDB
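For a flavour of what sending an event to one of these collectors looks like, here is a minimal sketch using the Python fluent-logger client (the linked article covers the PHP equivalent; the tag and field names here are illustrative):

    from fluent import event, sender

    # Point the logger at the local Fluentd daemon.
    sender.setup("webapp", host="localhost", port=24224)

    # Each page hit becomes one buffered event; Fluentd batches them out
    # to MongoDB (or wherever) asynchronously.
    event.Event("access", {"user_id": 42, "url": "/foo.html", "ip": "203.0.113.7"})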
Send the logging information to syslog-ng :)