We have a mulch-tenancy and I need to search and fetch in a huge appengine datastore based on an indexed attribute range and client id. Does usage of Ancestor Paths make it efficient ? Alternatively Same can be done using an additional filter
e.g. to get the top 100 salaries via objectify
Key<Clients> clientIdKey = Key.create(Clients.class, 500)
ofy().load().type(Salaries.class).ancestor(clientIdKey).order("-salary").limit(100).list()
Alternatively just
ofy().load().type(Salaries.class).filter("clientId = ", 500 ).order("-salary").limit(100).list()
My assumption is that in the first case all entities belonging to any other Client will be ignored but in later case it will be full scan which will more expensive. Is this assumption right ?
Also does the index "salary" stored globally or it is partitioned according to the ancestor so that the index update happens only within same ancestor ? This will reduce time taken update the index and will be a good solution when we are never going to query across different clients.
The first thing I need to point out is that the datastore does not do table scans. With a couple exceptions (most notably zig-zag merges), GAE queries only follow indexes - so usually these kinds of questions boil down to "which index is more efficient to maintain?"
Let's start by talking about the second case (note that I've singularized Salary, which I assume is your intention):
This requires a multi-property index on
Salary { clientId, salary } DESC
. GAE will navigate the index to the start ofSalary/clientId/500
and then read off each index record one at a time. It will do this on the index table in an arbitrary datacenter - and since these index tables are replicated asynchronously, you get an eventually consistent result.In order for an entity to participate in a multi-property index, each of the individual single properties must be indexed themselves. If Salary had no other indexed properties, writing a single Salary would cost:
clientId
index (asc and desc)salary
index (asc and desc){ clientId, salary } DESC
Now let's look at the first case:
This requires a different multi-property index in your datastore-indexes.xml. This time you need an index on
Salary { ancestor, salary } DESC
. In addition, the default behavior of GAE is to read from a quorum of datacenters to make this a strongly consistent operation. This should be somewhat slower (although no more expensive) than the other method, however, you can explicitly specify eventual consistency to get the same "any datacenter" behavior:ofy().consistency(Consistency.EVENTUAL).load()...
The nice thing here is that you have the option of strong consistency.Another bonus of the ancestor approach is that you don't need to maintain a single-property index on
clientId
. Here's what happens when you write this Salary (assuming no other indexed fields):salary
index (asc and desc){ ancestor, salary } DESC
This can make your system considerably cheaper. The biggest cost of multi-property indexes is often the cost of all the (otherwise irrelevant) bidirectional single-property indexes you must maintain simply as a flag to GAE.
Regarding your last question, it might help to explain what GAE index tables look like. There are three BigTable tables for indexes, shared across all applications. The first two are single-property index tables (one for ascending, one for descending). Their contents look something very roughly like this:
By doing range scans (one of the primitive operations of BigTable), you can determine which entities match your query. This is also why keys-only queries are fast/cheap; you can return the key immediately without doing the subsequent lookup.
The multi-property index table looks (again, this is not exact) like this:
It might be easier to visualize with some values for a multi-property index on
Salary { clientId, salary } DESC
:Again, you can see how by performing range scans, GAE can find the entities which match your queries.
I hope this helps clear things up.