We are trying to index inboxes by building on top of Gmail, using the App Engine Search API, but we are hitting the 10 GB limit. This is because we index the whole organization's email so that we can search across the whole team's inboxes. How can we work around this? One way might be to keep an individual index per person and combine the results manually, but we are worried that merging results could get really complex. What options are available?
Question:
Answer 1:
This is a typical problem in any document retrieval system, and the solution is to slice the entire corpus into multiple buckets. You should choose a slicing strategy based on your requirements and usage patterns.
One possibility is to slice messages by their date. You keep adding messages to an index until you come close to the limit, at which point you start a new index for newer messages. Or you can do it by calendar intervals (per year, per quarter or per month, depending on your volume).
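As a minimal sketch of the calendar-interval variant, the function below maps a message's sent date to the name of the index it would go into. The function name `index_name_for`, the `mail-` prefix, and the naming scheme are hypothetical choices for illustration, not part of the Search API; the resulting string would simply be used as the index name when opening an index.

```python
from datetime import date


def index_name_for(sent: date, granularity: str = "quarter") -> str:
    """Return a (hypothetical) index name for a message sent on `sent`.

    One index per calendar interval keeps each index well below the
    size limit, provided the interval matches the mail volume.
    """
    if granularity == "year":
        return f"mail-{sent.year}"
    if granularity == "quarter":
        quarter = (sent.month - 1) // 3 + 1  # months 1-3 -> 1, 4-6 -> 2, ...
        return f"mail-{sent.year}-q{quarter}"
    if granularity == "month":
        return f"mail-{sent.year}-{sent.month:02d}"
    raise ValueError(f"unknown granularity: {granularity!r}")
```

If the volume grows, switching from yearly to quarterly or monthly names only affects newly indexed messages; older indexes can stay as they are as long as the search side knows which names exist.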
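Merging results from several indexes is simple as long as every index returns its hits in the same sort order (for example, newest first): you interleave the lists by the sort key. You can also let users choose how far back in time they want to search, which limits the number of indexes you have to query. Often people know whether they are looking for something recent or something that happened a long time ago.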
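The merge step could look roughly like the sketch below. It assumes each index has already been queried and has returned its hits sorted newest first; the hit structure (a dict with a `sent` date and a `subject`) and the name `merge_results` are assumptions made for illustration, not the shape of real Search API results.

```python
import heapq
from datetime import date
from itertools import islice


def merge_results(per_index_results, limit=20):
    """Interleave already-sorted result lists into one newest-first list.

    `per_index_results` is a list of lists; each inner list must already
    be sorted by `sent`, newest first. Only the first `limit` merged
    hits are materialized.
    """
    merged = heapq.merge(
        *per_index_results,
        key=lambda hit: hit["sent"],
        reverse=True,  # inputs are sorted descending, so merge descending
    )
    return list(islice(merged, limit))


# Hypothetical hits from two per-quarter indexes.
q1_hits = [
    {"sent": date(2014, 3, 2), "subject": "Budget"},
    {"sent": date(2014, 1, 9), "subject": "Kickoff"},
]
q2_hits = [
    {"sent": date(2014, 5, 17), "subject": "Launch"},
    {"sent": date(2014, 4, 1), "subject": "Review"},
]
top = merge_results([q1_hits, q2_hits], limit=3)
```

Because the merge is lazy, asking for the top few hits does not require reading every result from every index. The same function works for per-person indexes, since it only relies on a shared sort key.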
Answer 2:
File a feature request:
https://code.google.com/p/googleappengine/wiki/FilingIssues?tm=3
A related issue has already been filed, so consider starring it: https://code.google.com/p/googleappengine/issues/detail?id=10667