CouchDB - Filtered Replication - Can the speed be

Posted 2019-01-24 11:21

Question:

I have a single database (300 MB, 42,924 documents) consisting of about 20 different kinds of documents from about 200 users. The documents range in size from a few bytes to many kilobytes (150 KB or so).

When the server is unloaded, the following replication filter function takes about 2.5 minutes to complete. When the server is loaded, it takes >10 minutes.

Can anyone comment on whether these times are expected, and if not, suggest how I might optimize things in order to get better performance?

function(doc, req) {
    // Cheap checks first: skip design documents and other users' documents.
    if (doc._id.indexOf("_design") === 0) {
        return false;
    }
    if (!doc.user_id || doc.user_id != req.query.userid) {
        return false;
    }
    // Documents without a date are always accepted.
    if (!doc.date) {
        return true;
    }
    // doc.date is a [year, month, day] array; query parameters arrive as strings.
    // Date.UTC fixes the time of day at midnight, so only the calendar date is
    // compared (two separate new Date() calls can differ by a few milliseconds).
    var docTime = Date.UTC(doc.date[0], doc.date[1], doc.date[2]);
    var reqTime = Date.UTC(req.query.year, req.query.month, req.query.day);
    return docTime >= reqTime;
}

Answer 1:

Filtered replication is slow because CouchDB has to run extra logic for every fetched document to decide whether to replicate it:

  1. CouchDB fetches the next document.
  2. Because a filter function has to be applied, the document is converted to JSON.
  3. The JSON-encoded document is passed through stdio to the query server.
  4. The query server receives the document and decodes it from JSON.
  5. The query server looks up and runs your filter function, which returns true or false to CouchDB.
  6. If the result is true, the document is replicated.
  7. Go back to step 1 and repeat for every document.

For non-filtered replication, take this list, drop steps 2-5, and let step 6 always have a true result. This per-document overhead slows down the whole replication process.
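
The loop described above can be modelled in a few lines of JavaScript. This is only an illustrative sketch, not CouchDB's actual implementation: `runFilteredPass`, `runUnfilteredPass`, and the sample documents are hypothetical names, and `JSON.stringify`/`JSON.parse` merely stand in for the encode/decode work done on each side of the stdio pipe.

```javascript
// Hypothetical model of the per-document round trip; not CouchDB's real code.
function runFilteredPass(docs, filterFn, req) {
    var replicated = [];
    for (var i = 0; i < docs.length; i++) {
        // Steps 2-3: the document is serialized before it crosses stdio.
        var wire = JSON.stringify(docs[i]);
        // Step 4: the query server decodes it again.
        var decoded = JSON.parse(wire);
        // Steps 5-6: the filter decides whether the document is replicated.
        if (filterFn(decoded, req)) {
            replicated.push(docs[i]);
        }
    }
    return replicated;
}

// Unfiltered replication skips steps 2-5 and accepts every document.
function runUnfilteredPass(docs) {
    return docs.slice();
}

// Made-up sample data for illustration only.
var sampleDocs = [
    { _id: "a", user_id: "u1" },
    { _id: "b", user_id: "u2" }
];
var sampleReq = { query: { userid: "u1" } };
var picked = runFilteredPass(sampleDocs, function (doc, req) {
    return doc.user_id == req.query.userid;
}, sampleReq);
```

The point of the model is that the encode, decode, and function call happen once per document, so the cost grows with the total number of documents in the database rather than with the number that end up being replicated.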

To significantly improve filtered replication speed, you can write the filter in Erlang and run it on the Erlang native query server. Erlang filters run inside CouchDB, do not pass through any stdio interface, and incur no JSON encode/decode overhead.

Note that the Erlang query server, unlike the JavaScript one, does not run inside a sandbox, so you need to really trust the code that you run with it.

Another option is to optimize your filter function, e.g. by reducing object creation and method calls, but you would not gain much from this.
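
As a rough illustration of that kind of micro-optimization, the sketch below rejects documents on the cheapest checks first and compares the date components directly instead of building `Date` objects. It assumes `doc.date` is an already normalized `[year, month, day]` array, as in the question; `isOnOrAfter` and `leanFilter` are hypothetical names, and in a real design document the helper would have to live inside the anonymous `function(doc, req)` body.

```javascript
// Hypothetical helper: compares [year, month, day] against a requested date
// without allocating Date objects. Subtraction coerces string values to numbers.
function isOnOrAfter(docDate, year, month, day) {
    var diff = docDate[0] - year;
    if (diff !== 0) { return diff > 0; }
    diff = docDate[1] - month;
    if (diff !== 0) { return diff > 0; }
    return docDate[2] - day >= 0;
}

// Named here so it can be exercised outside CouchDB.
function leanFilter(doc, req) {
    // Cheapest rejections first, so most documents exit early.
    if (!doc.user_id || doc.user_id != req.query.userid) { return false; }
    if (doc._id.indexOf("_design") === 0) { return false; }
    // Documents without a date are always accepted.
    if (!doc.date) { return true; }
    var q = req.query;
    return isOnOrAfter(doc.date, q.year, q.month, q.day);
}
```

Even with these changes, every document still makes the JSON and stdio round trip, which is why the gain is expected to be small compared with moving the filter into Erlang.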