I am trying to perform some data processing in a GAE application over data stored in the Datastore. The bottleneck is the rate at which the query returns entities, and I wonder how to improve the query's performance.
What I do in general:
- everything works in a task queue, so we have plenty of time (10 minute deadline).
- I run a query over the ndb entities in order to select which entities need to be processed.
- as the query returns results, I group entities in batches of, say, 1000 and send them to another task queue for further processing.
- the stored data is going to be large (say 500K-1M entities) and there is a chance that the 10-minute deadline will not be enough. Therefore, when the task approaches the task queue deadline, I spawn a new task. This means I need an ndb.Cursor in order to continue the query from where it stopped (a sketch of this chaining pattern follows the list).
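As a minimal sketch of that chaining pattern (MyModel, the needs_processing filter and the send_to_process_queue() helper are hypothetical stand-ins for my actual code), the deferred library can resume from a urlsafe cursor:

import time
from google.appengine.ext import deferred, ndb

DEADLINE_MARGIN = 60  # assumed safety margin (seconds) before the 10-minute limit

def process_entities(cursor_urlsafe=None):
    start = time.time()
    cursor = ndb.Cursor(urlsafe=cursor_urlsafe) if cursor_urlsafe else None
    query = MyModel.query(MyModel.needs_processing == True)
    more = True
    while more and time.time() - start < 600 - DEADLINE_MARGIN:
        entities, cursor, more = query.fetch_page(1000, start_cursor=cursor)
        send_to_process_queue(entities)
    if more:
        # Near the deadline: chain a new task that resumes from the cursor
        deferred.defer(process_entities, cursor.urlsafe())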
The problem is the rate in which the query returns entities. I have tried several approaches and observed the following performance (which is too slow for my app):
Use fetch_page() in a while loop.
The code is straightforward:

cursor = None
has_more = True
while has_more and theres_more_time:
    # Pass the previous page's cursor so each call resumes where the last ended
    entities, cursor, more = query.fetch_page(1000, start_cursor=cursor)
    send_to_process_queue(entities)
    has_more = more and cursor
With this approach, it takes 25-30 seconds to process 10K entities. Roughly speaking, that is 20K entities per minute. I tried changing the page size or the class of the frontend instance; neither made any difference in performance.
Segment the data and fire multiple fetch_page_async() in parallel.
This approach is taken from here (approach C).
The overall performance remains the same as above. I tried various numbers of segments (from 2 to 10) in order to have 2-10 parallel fetch_page_async() calls. In all cases, the overall time remained the same: the more parallel fetch_page_async() calls there are, the longer each one takes to complete. I also tried 20 parallel fetches and it got worse. Changing the page size or the frontend instance class did not have an impact either.
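For reference, a rough sketch of this segmented approach; segment_filters is a hypothetical list of ndb filter nodes that partition the data (the linked answer segments by key ranges):

from google.appengine.ext import ndb

def fetch_segments_parallel(base_query, segment_filters, page_size=1000):
    # Issue one asynchronous page fetch per segment, then wait for all of them
    futures = [base_query.filter(f).fetch_page_async(page_size)
               for f in segment_filters]
    ndb.Future.wait_all(futures)
    for future in futures:
        entities, cursor, more = future.get_result()
        send_to_process_queue(entities)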
Fetch everything with a single fetch() call.
Now this is the least suitable approach (if not entirely unsuitable), as the instance may run out of memory, plus I don't get a cursor in case I need to spawn another task (in fact I won't even have the ability to do so; the task will simply exceed the deadline). I tried this out of curiosity in order to see how it performs, and I observed the best performance! It took 8-10 seconds for 10K entities, which is roughly 60K entities per minute. That is approx. 3 times faster than fetch_page(). I wonder why this happens.
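For completeness, what I measured was essentially this (again with the hypothetical send_to_process_queue() helper):

# Pulls the entire result set into memory at once; no cursor is available
entities = query.fetch()
for i in range(0, len(entities), 1000):
    send_to_process_queue(entities[i:i + 1000])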
Use query.iter() in a single loop.
This is much like the first approach. It makes use of the query iterator's underlying generator, plus I can obtain a cursor from the iterator in case I need to spawn a new task, so it suits me. With the query iterator, it fetched 10K entities in 16-18 seconds, which is approx. 36-40K entities per minute. The iterator is 30% faster than fetch_page(), but much slower than fetch().
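A minimal sketch of this approach (approaching_deadline() and spawn_continuation_task() are hypothetical helpers); produce_cursors=True makes cursor_after() available mid-iteration:

it = query.iter(batch_size=1000, produce_cursors=True)
batch = []
for entity in it:
    batch.append(entity)
    if len(batch) >= 1000:
        send_to_process_queue(batch)
        batch = []
    if approaching_deadline():
        # Hand the rest of the query to a fresh task via the current cursor
        spawn_continuation_task(it.cursor_after())
        break
if batch:
    send_to_process_queue(batch)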
For all the above approaches, I tried F1 and F4 frontend instances without any difference in Datastore performance. I also tried to change the batch_size parameter in the queries, still without any change.
A first question is why fetch(), fetch_page() and iter() behave so differently, and how to make either fetch_page() or iter() perform as well as fetch(). Another critical question is whether these throughputs (20-60K entities per minute, depending on the API call) are the best we can do in GAE.
I'm aware of the MapReduce API, but I think it doesn't suit me. AFAIK, the MapReduce API doesn't support queries, and I don't want to scan all the Datastore entities (that would be too costly and slow; the query may return only a few results). Last but not least, I have to stick to GAE; resorting to another platform is not an option for me. So the question really is how to optimize the ndb query.
Any suggestions?
The new experimental Data Processing feature (an AppEngine API for MapReduce) might be suitable. It uses automatic sharding to execute multiple parallel worker processes, which may or may not help (like approach C in the linked question).
In case anyone is interested, I was able to significantly increase the throughput of the data processing by redesigning the component. (It was suggested that I change the data models, but that was not possible.)
First, I segmented the data and then processed each data segment in a separate taskqueue.Task, instead of calling multiple fetch_page_async() from a single task (as I described in the first post). Initially, these tasks were processed by GAE sequentially, utilizing only a single Fx instance. To parallelize the tasks, I moved the component to a dedicated GAE module and used basic scaling, i.e. addressable Bx instances. When I enqueue the tasks for each data segment, I explicitly instruct which basic instance will handle each task by specifying the 'target' option (see the sketch below).
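A sketch of the enqueueing code; the 'segment-worker' module, version 'v1' and the instance count are placeholders for my actual setup:

from google.appengine.api import taskqueue

NUM_INSTANCES = 5

def enqueue_segment_tasks(segment_cursors):
    for i, cursor in enumerate(segment_cursors):
        taskqueue.add(
            url='/process_segment',
            params={'cursor': cursor.urlsafe()},
            # Route to instance <N> of version 'v1' of module 'segment-worker'
            target='%d.v1.segment-worker' % (i % NUM_INSTANCES))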
With this design, I was able to process 20,000 entities in total within 4-5 seconds (instead of 40-60 seconds!), using 5 B4 instances.
Now, this has additional costs because of the Bx instances. We'll have to fine-tune the type and number of basic instances we need.
Your comment about "no need to scan all entities" triggers the thought that custom indexes could help your queries. That may entail schema changes to store the data in a less normalized form.
Design the solution from the output perspective: first, what is the simplest query that produces the required results; then, what entity structure supports such a query; and finally, what work is needed to create and maintain such an entity structure from the current data.
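As a purely hypothetical illustration of that direction, a denormalized work-item entity that makes the selection query trivial might look like:

from google.appengine.ext import ndb

class WorkItem(ndb.Model):
    # Maintained alongside the source data so the processing query becomes
    # a cheap scan over exactly the entities that need work
    source_key = ndb.KeyProperty()
    shard = ndb.IntegerProperty()  # precomputed segment for parallel workers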