I am trying to figure out the fastest way to access data stored in a junction object. The example below is analagous to my problem, but with a different context, because the actual dataset I am dealing with is somewhat unintuitive in its relationships.
We have 3 classes: User
, Product
, and Rating
. User has a many-to-many relationship to Product
with Rating
as the junction/'through' class.
The Rating
object stores the answers to several questions which are integer ratings on a scale of 1-5 (Example questions: How is the quality of the Product
, how is the value of the Product
, how user-friendly is the Product
). For simplification assume every User
rates every Product
they buy.
Now here is the calculation I want to perform: For a User
, calculate the average rating of all the Product
s they have bought (that is, the average rating from all other Users
, one of which will be from this User
themself). Then we can tell the user "On average, you buy products rated 3/5 for value by all customers who bought that product".
The simple and slow way is just to iterate over all of a user's review objects. If we assume that each user has bought a small (<100) number of products, and each product has n ratings, this is O(100n) = O(n).
However, I could also do the following: On the Product
class, keep a counter of the number of Rating
s that selected each number (e.g. how many User
s rated this product 3/5 for value). If you increment that counter every time a Product
is rated, then computing the average for a given Product
just requires checking the 5 counters for each Rating
criteria.
Is this a valid technique? Is it commonly employed/is there a name for it? It seems intuitive to me, but I don't know enough about databases to tell whether there's some fundamental flaw or not.
This is normal. It is ultimately caching: encoding of state redundantly to benefit some patterns of usage at the expense of others. Of course it's also a complexification.
Just because the RDBMS data structure is relations doesn't mean you can't rearrange how you are encoding state from some straightforward form. Eg denormalization.
(Sometimes redundant designs (including ones like yours) are called "denormalized" when they are not actually the result of denormalization and the redundancy is not the kind that denormalization causes or normalization removes. Cross Table Dependency/Constraint in SQL Database Indeed one could reasonably describe your case as involving normalization without preserving FDs (functional dependencies). Start with a table with a user's
id
& other columns, theirratings
(a relation) & itscounter
. Thenratings
functionally determinescounter
sincecounter
=select count(*) from ratings
. Decompose touser
etc +counter
, ie tableUser
, anduser
+ratings
, which ungroups to tableRating
. )A frequent comment by me: Google many clear, concise & specific phrasings of your question/problem/goal/desiderata with various subsets of terms & tags as you may discover them with & without your specific names (of variables/databases/tables/columns/constraints/etc). Eg 'when can i store a (sum OR total) redundantly in a database'. Human phrasing, not just keywords, seems to help. Your best bet may be along the lines of optimizing SQL database designs for performance. There are entire books ('amazon isbn'), some online ('pdf'). (But maybe mostly re queries). Investigate techniques relevant to warehousing, since an OLTP database acts as an input buffer to an OLAP database, and using SQL with big data. (Eg snapshot scheduling.)
PS My calling this "caching" (so does tag caching) is (typical of me) rather abstract, to the point where there are serious-jokes that everything in CS is caching. (Googling... "There are only two hard problems in Computer Science: cache invalidation and naming things."--Phil Karlton.) (Welcome to both.)