MongoDB: returning rows sequentially before and after

Published 2019-01-26 15:44

Question:

In MongoDB, given a find() operator that returns a cursor for a set of rows, what is an idiomatic and time-efficient manner in which to return "context" rows, i.e. rows sequentially before and/or after each row in the set?

For me the easiest way to explain this concept is using ack, which supports context searching. Given a file:

line 1
line 2
line 3
line 4
line 5
line 6
line 7
line 8

This is the output from ack:

C:\temp>ack.pl -C 2 "line 4" test.txt
line 2
line 3
line 4
line 5
line 6

I am storing log data in a MongoDB collection, one document per row. Each log line is tokenized into keywords and these keywords are indexed, which gives me cheap-ish full-text searching.
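A minimal sketch of that tokenization step (the `tokenize` helper and the `logs` collection name are assumptions for illustration, not anything stated in the question):

```python
import re

def tokenize(line):
    """Lower-case a log line and split it into unique keyword tokens."""
    return sorted(set(re.findall(r'[a-z0-9]+', line.lower())))

# Hypothetical pymongo usage: a multikey index over the keywords array
# makes the $all query below use the index.
# db.logs.create_index('keywords')
# db.logs.insert_one({'contents': line, 'keywords': tokenize(line)})
```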

I execute a bog-standard:

collection.find({keywords: {'$all': ['key1', 'key2']}}, {}).sort({datetime: -1});

and get a cursor. At this stage, without adding any additional fields, what is the approach for getting context? I think the flow is something like:

  • For each row in the cursor:
    • Get the _id field, store into x.
    • Execute: collection.find({_id: {'$gt': x}}).sort({_id: 1}).limit(N)
      • Get the results from this cursor: the N rows after the match.
    • Execute: collection.find({_id: {'$lt': x}}).sort({_id: -1}).limit(N)
      • Get the results from this cursor: the N rows before the match, in reverse order.
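The window those two range queries select per matched row can be sketched as a pure function over an ascending list of _ids (`context_window` is a hypothetical helper; the equivalent pymongo calls are shown in the docstring):

```python
def context_window(ids, x, n):
    """Return (n ids before x, n ids after x) from an ascending _id list.

    Mirrors the two per-row range queries:
      after:  collection.find({'_id': {'$gt': x}}).sort('_id', 1).limit(n)
      before: collection.find({'_id': {'$lt': x}}).sort('_id', -1).limit(n)
    """
    i = ids.index(x)
    return ids[max(0, i - n):i], ids[i + 1:i + 1 + n]
```

For the ack example above (match on line 4, `-C 2`), this yields lines 2-3 before and 5-6 after.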

For a result set with R rows this requires 2R+1 queries.

However, I think I can trade off space for time. Is a feasible alternative to update each row with its context _ids in the background? For a given row that currently has fields:

_id, contents, keywords

I would add an additional field:

_id, contents, keywords, context_ids

and then in a subsequent search I could, somehow, use these context_ids, I think? I'm not at all familiar with MongoDB MapReduce yet, but can that come into the picture as well?
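One way to sketch that background pass (`compute_context_ids` is a hypothetical helper; the pymongo update is an assumption shown in comments):

```python
def compute_context_ids(sorted_ids, n):
    """For each _id, collect the _ids of up to n rows before and n after it."""
    out = {}
    for i, x in enumerate(sorted_ids):
        out[x] = sorted_ids[max(0, i - n):i] + sorted_ids[i + 1:i + 1 + n]
    return out

# Hypothetical background update with pymongo:
# for x, ctx in compute_context_ids(ids, N).items():
#     db.logs.update_one({'_id': x}, {'$set': {'context_ids': ctx}})
```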

I think the most direct approach is to store the full-text of the actual context rows in each row, but this seems a bit crude to me. The clear advantage is that a single query could return the context I need.

I appreciate any and all answers that accept the scope of the question. I realise I could use Lucene or a real full-text search engine out-of-band but I'm trying to feel out the edges and capabilities of MongoDB so I'd appreciate MongoDB-specific answers. Thanks!

Answer 1:

I think your approach of storing context_ids, or something like it, might be the best option. If you are able to store the context_ids of all the rows of context you will need (this assumes that it's a fixed-size amount of context -- say 5 lines before and after), then you can query for all the lines of context using $in:

# pymongo sketch
for row in matching_rows:
    context = list(db.logs.find({'_id': {'$in': row['context_ids']}}).sort('_id', 1))
    before = [c for c in context if c['_id'] < row['_id']]
    after = [c for c in context if c['_id'] > row['_id']]
    row_with_context = before + [row] + after

I imagine that knowing the set of context rows -- particularly the rows after the row you're considering -- can be difficult, since the rows after any given row won't necessarily exist yet.

An alternative, which will avoid this problem (but still requires a fixed, known-ahead-of-time amount of context) is just to store the _id of the first line of context before the line in question (i.e. when inserting, you can buffer the previous N lines where N is the amount of context) -- call this first_context_id -- and then query like:

# pymongo sketch
for row in matching_rows:
    rows_with_context = list(db.logs.find({'_id': {'$gte': row['first_context_id']}})
                                    .sort('_id', 1)
                                    .limit(N * 2 + 1))

This may also simplify your application logic, as you don't need to reassemble the context with the row in question: this single query returns both the matched row and its rows of context.
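The insert-time buffering this answer mentions -- remembering the previous N lines so each new row can record its first_context_id -- can be sketched like this (`LogWriter` is a hypothetical helper name):

```python
from collections import deque

class LogWriter:
    """Sketch: keep the last n inserted _ids so each new row can store
    the _id of the first line of its preceding context."""

    def __init__(self, n):
        self.buffer = deque(maxlen=n)  # _ids of the previous n lines

    def make_doc(self, _id, contents):
        doc = {'_id': _id, 'contents': contents,
               # fall back to the row's own _id when no context exists yet
               'first_context_id': self.buffer[0] if self.buffer else _id}
        self.buffer.append(_id)
        return doc
        # in real code: db.logs.insert_one(doc)
```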