I need advice on the optimal approach to storing statistical data. There is a Django project with a MySQL database of 30,000 online games.
Each game has three statistical parameters:
- number of views,
- number of plays,
- number of likes
Now I need to store historical data for these three parameters on a daily basis, so I was thinking of creating a single table with five columns:
gameid, number of views, plays, likes, date (day-month-year).
So every game will get one row per day: after one day this table will have 30,000 rows, after 10 days 300,000, and after a year 10,950,000 rows. I'm not much of a DBA specialist, but this tells me it will quickly become a performance problem, not to mention what will happen in 5 years' time.
The data collected in this table is needed for simple graphs
(daily, weekly, monthly, custom range).
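For context, the graphs boil down to range queries like this sketch (the table and column names are just placeholders for the design described above):

select stat_date, views, plays, likes
from game_stats_daily
where gameid = 123
  and stat_date between '2012-09-01' and '2012-09-30'
order by stat_date;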
Maybe you have better ideas on how to store this data? Maybe NoSQL would be more suitable in this case? I really need your advice on this.
Partitioning in PostgreSQL works great for big logs. First, create the parent table:
create table game_history_log (
gameid integer,
views integer,
plays integer,
likes integer,
log_date date
);
Now create the partitions. In this case, one per month (about 900k rows each) would be good:
create table game_history_log_201210 (
check (log_date between '2012-10-01' and '2012-10-31')
) inherits (game_history_log);
create table game_history_log_201211 (
check (log_date between '2012-11-01' and '2012-11-30')
) inherits (game_history_log);
Notice the check constraints on each partition. If you try to insert into the wrong partition:
insert into game_history_log_201210 (
gameid, views, plays, likes, log_date
) values (1, 2, 3, 4, '2012-09-30');
ERROR: new row for relation "game_history_log_201210" violates check constraint "game_history_log_201210_log_date_check"
DETAIL: Failing row contains (1, 2, 3, 4, 2012-09-30).
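With inheritance-based partitioning you normally also add a trigger on the parent table, so the application can keep inserting into game_history_log and rows get routed to the right partition automatically. A minimal sketch following the pattern from the PostgreSQL manual (the function and trigger names are mine):

create or replace function game_history_log_insert()
returns trigger as $$
begin
    if new.log_date between '2012-10-01' and '2012-10-31' then
        insert into game_history_log_201210 values (new.*);
    elsif new.log_date between '2012-11-01' and '2012-11-30' then
        insert into game_history_log_201211 values (new.*);
    else
        raise exception 'no partition for log_date %', new.log_date;
    end if;
    return null;  -- the row has already been stored in a child table
end;
$$ language plpgsql;

create trigger game_history_log_insert_trigger
    before insert on game_history_log
    for each row execute procedure game_history_log_insert();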
One of the advantages of partitioning is that queries will only search the relevant partition, drastically and consistently reducing the search size regardless of how many years of data there are. Here is the explain plan for a search on a certain date:
explain
select *
from game_history_log
where log_date = date '2012-10-02';
QUERY PLAN
------------------------------------------------------------------------------------------------------
Result (cost=0.00..30.38 rows=9 width=20)
-> Append (cost=0.00..30.38 rows=9 width=20)
-> Seq Scan on game_history_log (cost=0.00..0.00 rows=1 width=20)
Filter: (log_date = '2012-10-02'::date)
-> Seq Scan on game_history_log_201210 game_history_log (cost=0.00..30.38 rows=8 width=20)
Filter: (log_date = '2012-10-02'::date)
Notice that apart from the parent table it only scanned the correct partition. Obviously you can have indexes on the partitions to avoid a sequential scan.
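For example, indexing the partitioning column on each partition (the index names are arbitrary):

create index game_history_log_201210_log_date_idx
    on game_history_log_201210 (log_date);
create index game_history_log_201211_log_date_idx
    on game_history_log_201211 (log_date);

Also check that constraint_exclusion is enabled (the default setting of 'partition' is enough) so the planner can skip partitions whose check constraints rule out the queried dates.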
See the PostgreSQL documentation on Inheritance and Partitioning.
11M rows isn't excessive, but indexing in general, and the clustering of your primary key, will matter more (on InnoDB). I suggest (game_id, date) as the primary key so that all the rows for a given game are stored sequentially. Also, you may want to keep a separate table with just the current values for ranking games and other cases where only the latest figures are needed.
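A minimal sketch of that layout in MySQL/InnoDB, reusing the column names from the example above (game_current_stats is a hypothetical name for the "current values" table):

create table game_history_log (
    gameid   int unsigned not null,
    log_date date not null,
    views    int unsigned not null,
    plays    int unsigned not null,
    likes    int unsigned not null,
    primary key (gameid, log_date)  -- clusters each game's history in adjacent rows
) engine=InnoDB;

-- separate table holding only the latest figures per game
create table game_current_stats (
    gameid int unsigned not null primary key,
    views  int unsigned not null,
    plays  int unsigned not null,
    likes  int unsigned not null
) engine=InnoDB;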
There are no performance problems in MySQL with 10 million rows. You can just apply partitioning by game id (requires at least version 5.5).
I have a MySQL database with data like this, and there are currently no problems with 980 million rows.
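A sketch of what partitioning by game id could look like, assuming the daily table keeps gameid in its primary key (MySQL requires the partitioning column to be part of every unique key); the table name and partition count are placeholders:

alter table game_history_log
    partition by hash (gameid)
    partitions 16;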
Instead of keeping every row, keep recent data at high precision, middle-aged data at medium precision, and long-term data at low precision. This is the approach taken by rrdtool, which might be a better fit for this than MySQL.
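The same idea can be approximated in plain SQL with a periodic roll-up job: keep daily rows for a recent window and collapse older rows into weekly aggregates. A rough sketch (the game_history_log_weekly table, the 90-day cutoff, and the use of SUM are all assumptions):

-- roll daily rows older than 90 days up into a weekly table
insert into game_history_log_weekly (gameid, week_start, views, plays, likes)
select gameid,
       date_sub(log_date, interval weekday(log_date) day) as week_start,
       sum(views), sum(plays), sum(likes)
from game_history_log
where log_date < date_sub(curdate(), interval 90 day)
group by gameid, week_start;

-- then delete the daily rows that were rolled up
delete from game_history_log
where log_date < date_sub(curdate(), interval 90 day);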
I would advise not using a relational database at all.
Statistics are the sort of thing that changes pretty rapidly, because new data is constantly arriving.
I believe something like HBase would fit better, because appending new records is faster there.