Imagine you have a table 'users' with 10 million records and a table 'groups' with 1 million records. On average there are 50 users per group, which in an RDBMS I would store in a table called users2groups. users2groups is in fact a sparse matrix. Only 80% of the full dataset of users and groups fits into available memory. The data for the group membership (users2groups) comes on top, so if memory is needed to cache group memberships, it has to be taken away from the users table, the groups table, or both.
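For concreteness, here is a minimal relational sketch of the three tables (column names and types are my assumptions, not part of the setup). Note the arithmetic: 1 million groups times ~50 users per group means users2groups holds roughly 50 million rows, i.e. each of the 10 million users belongs to about 5 groups on average.

    -- Minimal sketch; names and types are assumed.
    -- 1M groups x ~50 users/group => ~50M rows in users2groups,
    -- i.e. ~5 group memberships per user on average.
    CREATE TABLE users (
      id   INT UNSIGNED NOT NULL PRIMARY KEY,
      name VARCHAR(255) NOT NULL
    );

    CREATE TABLE `groups` (
      id   INT UNSIGNED NOT NULL PRIMARY KEY,
      name VARCHAR(255) NOT NULL
    );

    CREATE TABLE users2groups (
      user_id  INT UNSIGNED NOT NULL,
      group_id INT UNSIGNED NOT NULL,
      PRIMARY KEY (user_id, group_id)
    );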
I would like to be able to:
- find users quickly by name AND
- find groups quickly by name AND
- get all users of a group quickly AND
- get all groups a user belongs to quickly
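Against the sketch above, those four access paths translate into queries like the following; the secondary indexes are my assumption of what is needed to serve them without table scans:

    -- Assumed secondary indexes for the four access paths.
    CREATE INDEX idx_users_name  ON users (name);
    CREATE INDEX idx_groups_name ON `groups` (name);
    -- PRIMARY KEY (user_id, group_id) already covers "groups of a user";
    -- the reversed index covers "users of a group".
    CREATE INDEX idx_u2g_group ON users2groups (group_id, user_id);

    SELECT id       FROM users        WHERE name = ?;      -- user by name
    SELECT id       FROM `groups`     WHERE name = ?;      -- group by name
    SELECT user_id  FROM users2groups WHERE group_id = ?;  -- users of a group
    SELECT group_id FROM users2groups WHERE user_id = ?;   -- groups of a user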
From experience I know that disk latency determines your access speed to a large extent. You can also trade read speed against write speed. However, before one can tune any of this, one has to decide on a database type these days... such as:
- Relational DBMS
- Key-value stores
- Document stores
- RDF stores
- Object oriented DBMS
- Graph DBMS
- Search engines
- Multivalue DBMS
- Native XML DBMS
- Wide column stores
- Content stores
- Navigational DBMS
- (compressed) Bitmaps
- files
- more...?
So the question is: which of all these systems, or which combination, gives the best overall read access performance (with acceptable write access performance) when RAM capacity is lower than the data volume, considering sparse matrices? ...and how does the memory utilization across all three entities/tables have to be balanced in the chosen technology...?
Since this is a conceptual question, disk space and CPU capacity are out of scope, or considered to be available "indefinitely".
Btw. I am aware that searching for names such as user names or group names can efficiently be sped up by using indexes based on hashes (e.g. crc32(lower(STRING))) - an example select would then be: select somethinguseful from users where name=SEARCHSTRING and hash=crc32(lower(SEARCHSTRING)). However, the hashes and their indexes were not yet included when I said the users and groups tables have 80% RAM coverage. That's because I am unsure whether there is a better integrated solution. At the moment I just assume that breaking the whole topic up into three pieces (users, groups and users2groups) is the most sensible approach, but I am lacking proof here.
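As a side note, in MySQL 5.7+ that hash trick could be kept up to date automatically with a generated column; this is only a sketch, and the column and index names are mine:

    -- Index a 4-byte CRC32 hash instead of the full name, then re-check
    -- the name itself to filter out hash collisions.
    ALTER TABLE users
      ADD COLUMN name_hash INT UNSIGNED AS (CRC32(LOWER(name))) STORED,
      ADD INDEX idx_users_name_hash (name_hash);

    SELECT id
      FROM users
     WHERE name_hash = CRC32(LOWER('SEARCHSTRING'))
       AND name = 'SEARCHSTRING';  -- resolves CRC32 collisions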
-------------------- UPDATE -------------------------------------
I understand that there are competing concepts on the table:
- denormalization to an extent where I can live with a strongly reduced number of queries against the disk (e.g. mongodb, etc.)
- squeezing data so that most of it fits into memory anyway (e.g. compressed bitmaps)
Since denormalization means 'blowing up data volumes', these two concepts seem to contradict each other. Are there best practices, or scientific or otherwise sensible arguments, for when to use denormalization and when to use the squeeze-data approach? E.g. a KPI saying: if less than 80% fits into memory, go for denormalization, or the like?
-------------------- UPDATE -------------------------------------
Extra memory costs extra money, while most DB servers usually have lots of empty disk space sitting around idle. So denormalization is cheap. On the other hand, denormalization opportunities are limited: disk latency physically caps the maximum number of queries per second, full stop. Too many queries against disk get queued, which constrains denormalization for applications with lots of traffic. And even for denormalized data, access speed depends on memory to a large extent.
So maybe KPIs are not possible here. Anyway, for the given sparse matrix example, how do denormalization and the squeeze-data approach need to be balanced? I thought of compressing the users and groups tables, leaving them in an RDBMS, and then assigning the freed memory to the cache of a document DB which serves the users2groups relations. However, this introduces a whole set of new issues, such as multiple round trips to deal with two database systems and more complicated query management. So how to tackle this?
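The compression half of that idea could be as simple as switching the entity tables to a compressed row format; a sketch assuming InnoDB with innodb_file_per_table enabled:

    -- Shrink the footprint of the entity tables. Note that the InnoDB
    -- buffer pool may hold both the compressed and an uncompressed copy
    -- of hot pages, so the memory savings are only partial.
    ALTER TABLE users    ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
    ALTER TABLE `groups` ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;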
-------------------- UPDATE -------------------------------------
As per lagivan's suggestion, the sparse matrix with purely flagging relations seems to be solved in a sensible way: have two tables, users and groups, and then give the users table a multi-ID field holding the IDs of the related groups, and vice versa give the groups table a multi-ID field holding the IDs of the related users. I guess this solution is not tightly coupled to a specific technology. It could even be implemented in MySQL via VARBINARY or any blob type.
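A minimal MySQL sketch of that idea (column names and sizes are my guesses): with 4 bytes per ID, the user-side array averages about 5 x 4 = 20 bytes and the group-side array about 50 x 4 = 200 bytes, so each direction becomes a single row fetch:

    -- Pack related IDs into binary arrays; the application encodes and
    -- decodes the fixed-width 4-byte IDs, MySQL itself cannot index them.
    ALTER TABLE users    ADD COLUMN group_ids VARBINARY(8192);  -- ~5 groups avg
    ALTER TABLE `groups` ADD COLUMN user_ids  BLOB;             -- ~50 users avg

The price is that every membership change becomes a read-modify-write of the packed array.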
The outstanding part of the question relates to sparse matrices that contain some 'juice information' like a status or a last-updated date. Using such foreign key arrays would by design lock that information out. So for that scenario the original question is still open: which of all these systems, or which combination, gives the best overall read access performance (with acceptable write access performance) when RAM capacity is lower than the data volume, considering sparse matrices? ...and how does the memory utilization across all three entities/tables have to be balanced in the chosen technology...?
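For reference, with payload attributes the membership row itself has to carry them, something like the following (column names are again my assumptions):

    -- An ID array cannot hold per-relation attributes; they have to
    -- live on the membership row itself, making users2groups wider
    -- and pushing it further out of memory.
    ALTER TABLE users2groups
      ADD COLUMN status      TINYINT UNSIGNED NOT NULL DEFAULT 0,
      ADD COLUMN lastupdated TIMESTAMP NOT NULL
          DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP;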