How can I find the closest document using Google A

2019-04-28 02:35发布

问题:

I have approximately 400,000 documents in a GAE Search index. All documents have a location GeoPoint property and are spread over the entire globe. Some documents might be over 4000km away from any other document, others might be bunched within meters of each other.

I would like to find the closest document to a specific set of coordinates but find the following code gives incorrect results:

from google.appengine.api import search

# coords are in the form of a tuple e.g. (50.123, 1.123)
search.Document(
    doc_id='meaningful-unique-id',
    fields=[search.GeoField(name='location' 
                            value=search.GeoPoint(coords[0], coords[1]))])

# find document function radius is in metres
def find_document(coords, radius=1000000):
    sort_expr = search.SortExpression(
        expression='distance(location, geopoint(%.3f, %.3f))' % coords,
        direction=search.SortExpression.ASCENDING,
        default_value=0)

    search_query = search.Query(
        query_string='distance(location, geopoint(%.3f, %.3f)) < %d' \
                    % (coords[0], coords[1], radius),
        options=search.QueryOptions(
            limit=1,
            ids_only=True,
            sort_options=search.SortOptions(expressions=[sort_expr])))

    index = search.Index(name='document-index')
    return index.search(search_query)

With this code I will get results that are consistent but incorrect. For example, a search for the nearest document to London indicated the closest one was in Scotland. I have verified that there are thousands of closer documents.

I narrowed the problem down to the radius parameter being too large. I get correct results if the radius is down to around 12km (radius=12000). There are generally no more than 1000 documents in a 12 km radius. (Probably associated with search.SortOptions(limit=1000).)

The problem is that if I am in a sparse area of the globe where there aren't any documents for thousands of miles, my search function will not return anything with radius=12000 (12km). I want it to return the closest document to me wherever I am. How can I accomplish this consistently with one call to the Search API?

回答1:

I believe the issue is the following. Your query will select up to 10K documents, then those are sorted according to your distance sort expression and returned. (That is, the sort is in fact not over all 400k documents.) So I suspect that some of the geographically closer points are not included in this 10k selection. That's why things work better when you narrow your search radius, as you have fewer total points in that radius.

Essentially, you want to get your query 'hits' down to 10k, in a manner that makes sense for what you are querying on. You can address this in at least a couple of ways, which you can combine:

  • Add a ranking, so that the most 'important' docs (by some criteria that makes sense in your domain) are returned in rank order, then these will be sorted by distance.
  • Filter on one or more document field(s) (e.g., 'business category', if your docs contain information about businesses) to reduce the number of candidate docs.

(I don't believe this 10k threshold is currently in the Search API documentation; I've filed a ticket to get it added).



回答2:

I have the exact same problem, and I don't think its possible. The problem happens as you yourself has figured out when there are more possible results than returned results. The Google algorithm just quits when it has loaded the limits and then it sorts the results.

I have seen the same clusters as you and its part of the search API.

One Hack would be to subdivide your search into sub-sectors, do multiple simultaneous calls and then merge and order the results.



回答3:

Wild idea, why not keep/record the distance from 3 points then calculate from that.