I'm wondering if anyone might provide some conceptual advice on an efficient way to build a data model for the simple system described below. I'm somewhat new to thinking in a non-relational manner and want to avoid any obvious pitfalls. It's my understanding that a basic principle is "storage is cheap, don't worry about data duplication", unlike in a normalized RDBMS.
What I'd like to model is:
A blog article which can be given 0-n tags. Many blog articles can share the same tag. When retrieving data, I'd like to be able to fetch all articles matching a tag. In many ways this is very similar to the approach taken here at Stack Overflow.
My normal mindset would be to create a many-to-many relationship between tags and blog articles. However, I'm thinking that in the context of GAE this would be expensive, although I have seen examples of it being done.
Perhaps using a ListProperty containing each tag as part of the article entities, and a second data model to track tags as they're added and deleted? This way no need for any relationships and the ListProperty still allows queries where any list element matching will return results.
Any suggestions on the most efficient way to approach this on GAE?
Many-to-many sounds reasonable. Perhaps you should try it first to see if it is actually expensive.
The good thing about GAE is that it will tell you when you are using too many cycles. Profiling for free!
"counts being pre-computed is" not only practical, but also necessary, because the count() function returns a maximum of 1000. If write contention might be an issue, make sure to check out the sharded counter example: http://code.google.com/appengine/articles/sharding_counters.html
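To illustrate the idea behind a sharded counter, here is a minimal in-memory sketch. It makes no datastore calls, and the ShardedCounter class and its method names are made up for this example. It assumes each list slot stands in for a separate shard entity: every write goes to one shard picked at random, and a read sums all the shards.

```python
import random


class ShardedCounter:
    """In-memory illustration only: each list slot stands in for one shard entity."""

    def __init__(self, num_shards=20):
        self.shards = [0] * num_shards

    def increment(self, delta=1):
        # Pick one shard at random so concurrent writers rarely hit the same one.
        index = random.randrange(len(self.shards))
        self.shards[index] += delta

    def value(self):
        # Reading the total means summing every shard.
        return sum(self.shards)
```

The trade-off is that writes get cheaper under contention while reads get slightly more expensive, which is why the total is often cached.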
Thanks to both of you for your suggestions. I've implemented (first iteration) as follows. Not sure if it's the best approach, but it's working.
Class A = Articles. Has a StringListProperty which can be queried on its list elements.
Class B = Tags. One entity per tag, also keeps a running count of the total number of articles using each tag.
Data modifications to A are accompanied by maintenance work on B. Thinking that counts being pre-computed is a good approach in a read-heavy environment.
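The bookkeeping described above might look roughly like the following sketch. It is plain Python with in-memory dictionaries standing in for the two entity kinds, not datastore code, and the Blog class and its method names are hypothetical. It only shows how changes to an article's tag list can keep the per-tag counts in step.

```python
from collections import Counter


class Blog:
    """In-memory stand-in for the two kinds: articles (A) and tag counts (B)."""

    def __init__(self):
        self.articles = {}           # article id -> set of tags (kind A)
        self.tag_counts = Counter()  # tag -> number of articles using it (kind B)

    def save_article(self, article_id, tags):
        """Create or update an article and adjust the counts for changed tags."""
        new_tags = set(tags)
        old_tags = self.articles.get(article_id, set())
        for tag in new_tags - old_tags:
            self.tag_counts[tag] += 1
        for tag in old_tags - new_tags:
            self._decrement(tag)
        self.articles[article_id] = new_tags

    def delete_article(self, article_id):
        for tag in self.articles.pop(article_id, set()):
            self._decrement(tag)

    def _decrement(self, tag):
        self.tag_counts[tag] -= 1
        if self.tag_counts[tag] <= 0:
            del self.tag_counts[tag]  # drop tags that no article uses any more

    def articles_with_tag(self, tag):
        """Mimics a list-property filter: an article matches if any element matches."""
        return sorted(a for a, tags in self.articles.items() if tag in tags)
```

In a real datastore the article write and the count update would not automatically be atomic, so the counts may need a transaction or an occasional repair pass.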
One possible way is with Expando, where you'd add each tag to the entity as a dynamic property named after the tag. You could then query for all the entities with a given tag by filtering on that property.
Of course you have to clean up your tags to be proper Python identifiers. I haven't tried this, so I'm not sure if it's really a good solution.
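As a rough sketch of that clean-up step, the helper below turns free-form tag text into a name that is safe to use as a Python identifier. The function name tag_to_property_name and the "tag_" prefix are assumptions made for this example and are not part of any App Engine API.

```python
import re


def tag_to_property_name(tag, prefix="tag_"):
    """Map free-form tag text to a valid, ASCII-only Python identifier."""
    # Lowercase, then collapse every run of other characters into one underscore.
    cleaned = re.sub(r"[^0-9a-z]+", "_", tag.strip().lower()).strip("_")
    if not cleaned:
        raise ValueError("tag has no usable characters: %r" % (tag,))
    # The prefix guarantees the name never starts with a digit or equals a keyword.
    return prefix + cleaned
```

Note that different tags can collapse to the same name (for example "C++" and "C" both become tag_c), so the original tag text would still need to be stored somewhere if that distinction matters.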