As a continuation of this post, this is a bit of a capstone-style question to solidify my understanding of gae-datastore and get some critiques on my data modeling decisions. I'll be modifying the Jukebox example created by @Jimmy Kane to better reflect my real-world case.
In the original setup,
imagine that you have a jukebox with queues per room let's say. And people are queueing songs to each queue of each jukebox.
J=Jukebox, Q=queue, S=Song
Jukebox
/ | \
Q1 Q2 Q3
/ | \ | \
S1 S2 S3 S4 S5
First, fill out the Song model as such:
from google.appengine.ext import ndb

class Song(ndb.Model):
    user_key = ndb.KeyProperty()
    status = ndb.StringProperty()
    datetime_added = ndb.DateTimeProperty()
My modification is to add a User that can CUD songs to any queue. In the frontend, users will visit a UI to see their songs in each of the queues, and make changes. In the backend, the application needs to know which songs are in each queue, play the right song off each queue and remove songs from queues once played.
In order for a User to be able to see its songs in queue, I'm presuming each User would be a root entity and would need to store a list of Song keys:
class User(ndb.Model):
    song_keys = ndb.KeyProperty(kind='Song', repeated=True)
Then, to retrieve the user's songs, the application would do the following (presuming user_id is known):
user = User.get_by_id(user_id)
songs = ndb.get_multi(user.song_keys)
And, since gets are strongly consistent, the user would always see non-stale data.
Then, when queue 1 is finished playing a song, the application could do something like:
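current_song.status = "inactive"
current_song.put()
# Ancestor query scoped to Queue 1 of Jukebox 1, oldest active song first.
queue_key = ndb.Key('Jukebox', '1', 'Queue', '1')
query = (Song.query(ancestor=queue_key)
         .filter(Song.status == "active")
         .order(Song.datetime_added))
next_song = query.get()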
Am I right in thinking that the ancestor query ensures consistent representation of the preceding deactivation of the current song as well as any CUD from the Users?
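The final step would be to update the User's song_keys list in a transaction:
@ndb.transactional
def remove_song_from_user(song):  # hypothetical helper name
    user = song.user_key.get()
    user.song_keys.remove(song.key)
    user.put()
remove_song_from_user(current_song)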
Summary and some pros/cons
- The consistency seems to be doing the right things in the right places, if my understanding is right?
- Should I be concerned about contention on the Jukebox entity group?
  - I wouldn't expect it to be a high throughput type of use case but my real-life scenario needs to scale with the number of users and there are probably a similar number of queues as there are users, maybe 2x - 5x more users than queues. If the whole group is limited to 1 write / sec and lots of users as well as each queue could be creating and updating songs, this could be a bottleneck.
  - One solution could be to do away with the Jukebox root entity and have each Queue be its own root entity.
- User.song_keys could be long-ish, say 100 song.keys. This article advised "to avoid storing overly large lists of keys in a ListProperty". What's the concern here? Is this a db concept and moot with ndb's way of handling lists with the repeated=True property option?
Opinions on this approach or critiques on things I'm fundamentally misunderstanding?
- Presumably, I could also alternatively, kind of just symmetrically flip the data models and have entity groups that look like User -> Song and store song_keys lists in the Queue model.
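To put the contention worry in rough numbers, here is a back-of-the-envelope sketch. The queue count, song length, and user edit rates are made-up assumptions rather than measurements, and the function names are hypothetical; the only figure taken from the discussion is the limit of roughly 1 write/sec per entity group.

```python
# Back-of-the-envelope estimate of write pressure on one entity group.
# All workload numbers below are hypothetical assumptions for illustration.

ENTITY_GROUP_WRITE_LIMIT = 1.0  # approx. sustained writes/sec per entity group


def group_write_rate(num_queues, avg_song_seconds, user_edits_per_sec):
    """Estimated writes/sec landing on a single shared entity group.

    Each queue deactivates one song roughly every avg_song_seconds, and
    users add further create/update/delete writes on top of that.
    """
    playback_writes = num_queues / float(avg_song_seconds)
    return playback_writes + user_edits_per_sec


def is_over_limit(rate, limit=ENTITY_GROUP_WRITE_LIMIT):
    """True when the estimated rate exceeds the per-group write limit."""
    return rate > limit


# One Jukebox root holding every queue
# (assumed: 1000 queues, 200-second songs, 2 user edits/sec overall).
shared_rate = group_write_rate(1000, 200, 2.0)

# Each Queue as its own root entity: only that queue's own writes count
# (assumed: 0.01 user edits/sec aimed at this one queue).
per_queue_rate = group_write_rate(1, 200, 0.01)
```

Under these assumed numbers the shared Jukebox group sits well above the limit while a per-Queue group stays far below it, which is the intuition behind making each Queue its own root entity.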
I think you should reconsider how important strong consistency is for your use case. From what I can see, it is not critical that all these entities have strong consistency. In my opinion, eventual consistency will work just fine. Most of the time you will see up-to-date data and only sometimes (read: really, really rarely) will you see some stale data. Think about how critical it is that you always get up-to-date data versus how much it penalizes your application. Entities that need strong consistency are not stored in the most efficient way in terms of number of reads per second.
Also, if you look at the document Structuring Data for Strong Consistency, you will see that it mentions that you can't have more than 1 write per second when using that approach.
Having entity groups also affects data locality, as per the AppEngine Model Class docs.
If you also read section 2 of Google's famous doc on Spanner, you will see how they deal with entities that have the same parent key. Essentially, they are put closer together. I assume Google might be using a similar approach with the AppEngine Datastore. According to this source, Google might use Spanner for the AppEngine Datastore in the future.
Another point: there is no cheaper or faster get than a get by key. Having said this, if you can somehow avoid querying, this could reduce the cost of running your application. Assuming that you're developing a web application, you can store your song keys in a JSON/text object and then use the Prospective Search API to get up-to-date results. This approach requires a bit more work and requires you to embrace the eventual consistency model, as the data might be slightly out of date by the time it reaches the client. Depending on your use case (this obviously does not apply to a small application with a small user base), the savings might outweigh the cost. When I say the cost, I mean the fact that data might be slightly out of date.
In my experience, strong consistency is not a requirement for a large number of applications. The number of applications that can live with slightly stale data seems to outnumber the applications that cannot. Take YouTube for example: I don't really mind if I don't see all the videos immediately (there are so many that I can't even know whether I'm seeing all of them or not). When you design something like this, first ask yourself: is it really necessary to provide up-to-date data, or is slightly stale data good enough? Can the user even tell the difference? Up-to-date data is much more expensive than slightly stale data.
I've decided to take another approach, which is to rely on lists of song_keys at the Queues in addition to the Users. This way, I have strong consistency when dealing with Users and with Queues without needing to deal with the performance/consistency tradeoff that comes with entity groups. As a positive byproduct, getting keys leverages ndb autocaching so I anticipate a performance boost with enhanced code simplicity. Still welcome any critiques...
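For illustration, the bookkeeping this approach implies can be sketched in plain Python. This is an in-memory stand-in only: dicts play the role of the datastore, none of it is ndb API, and the function names and sample IDs are hypothetical.

```python
# In-memory stand-in for the "key lists on both sides" design.
# Plain dicts play the role of the datastore; nothing here is ndb API.

songs = {}    # song_id -> {"user_id": ..., "queue_id": ...}
users = {}    # user_id -> {"song_keys": [song_id, ...]}
queues = {}   # queue_id -> {"song_keys": [song_id, ...]} in play order


def add_song(song_id, user_id, queue_id):
    """Create a song and record its key on both the user and the queue."""
    songs[song_id] = {"user_id": user_id, "queue_id": queue_id}
    users.setdefault(user_id, {"song_keys": []})["song_keys"].append(song_id)
    queues.setdefault(queue_id, {"song_keys": []})["song_keys"].append(song_id)


def songs_for_user(user_id):
    """Lookup by key only: the analogue of get_multi(user.song_keys)."""
    return [songs[k] for k in users[user_id]["song_keys"]]


def play_next(queue_id):
    """Pop the oldest song off the queue and drop it from its owner's list."""
    keys = queues[queue_id]["song_keys"]
    if not keys:
        return None
    song_id = keys.pop(0)
    owner = songs[song_id]["user_id"]
    users[owner]["song_keys"].remove(song_id)
    del songs[song_id]
    return song_id


# Hypothetical sample data.
add_song("s1", "alice", "q1")
add_song("s2", "bob", "q1")
add_song("s3", "alice", "q2")
first_played = play_next("q1")
```

Every read here is a lookup by key, with no query involved. In the real datastore the Queue and User lists live on different entities, so the two updates in play_next would presumably need to be wrapped in a transaction to stay in sync.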
UPDATE: A little more detail regarding autocaching. NDB automatically manages caching via memcache and an in-context cache. For my purposes, I'm mostly interested in the automatic memcache. By using predominantly get requests in favor of queries, NDB will check memcache first before reading from the datastore for all of those reads. I anticipate most requests to actually be fulfilled out of memcache rather than the datastore. I understand that I could manage all of that memcache activity myself, most likely in a way that would work decently with a query-focused approach, so perhaps some wouldn't consider that a great rationale for the design decision. But the impact on code simplicity is pretty nice.