Handling embargoed content scenario in MarkLogic

Posted 2019-07-21 16:31

Question:

I have a MarkLogic 7 database into which several documents are inserted, and every document has its own created-on and released-on values. For example, if a document is inserted into the database at 1400 hrs and its released-on value is 1700 hrs, then I need to POST this document to an external REST service at 1700 hrs.

I have tried the following options:

  1. Configure a CPF pipeline such that whenever a document is inserted, its released-on value is read and a Scheduled Task is created to trigger at the timestamp read from released-on.

    Questions and observations for this approach:

    1. Since admin configuration manipulation APIs are not transactionally protected operations, I need to force a lock on some URI in order to create Scheduled Tasks from within CPF action modules running in parallel. For details read here

    2. When I insert 1000 documents it takes around 20 minutes for the CPF action modules to trigger and create 1000 scheduled tasks based on the released-on value read from the inserted document.

    3. How can I pass the URI of the document that triggered the CPF action module to the Scheduled Task that was created from within that action module based on the document's released-on value?

  2. Configure a CPF pipeline such that whenever a document is inserted, its released-on value is read and xdmp:sleep() is called with the number of milliseconds remaining between the current dateTime and the document's released-on value.

    Questions and observations for this approach:

    1. The Task Server threads on which the CPF action modules run remain occupied, and are not released, while xdmp:sleep() is in progress. As a result, at any time the CPF action module is running for at most 16 documents and the others remain in the queue.

    2. Is there any way to configure the sleeping thread to become inactive so that other queued action modules can run, and then become active again once the sleep duration has elapsed?

  3. Configure a multi-step CPF pipeline as described here in which the document keeps bouncing between two states until the released-on timestamp has arrived.

    Questions and observations for this approach:

    1. Even when only 30 documents were inserted, CPU utilization was observed to be 100%.

In all of these attempts, a lot of system resources (CPU and RAM) are used even for as few as 1000 documents. I need to find an approach that can handle even 100K documents.

Please let me know whether any improvements can be made to the approaches above, or whether MarkLogic provides some other way to handle such scenarios efficiently.

Answer 1:

Rather than CPF, you could set up a scheduled job that will run, say, every 10 minutes and look for documents that are ready to be published. That job would look for documents with released-on values between fn:current-dateTime() and the last time the job ran, which I would save in the database.
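As a rough illustration of that selection step, the sketch below models the window check in plain Python rather than MarkLogic XQuery. The function `due_documents` and the `released_on` dictionary are hypothetical names; inside MarkLogic the same filter would more likely be a range query over the released-on values. It assumes a half-open window (after the last run, up to and including now), so that each document is picked up by exactly one run.

```python
from datetime import datetime

def due_documents(released_on, last_run, now):
    """Return URIs whose released-on time falls in the window (last_run, now].

    released_on: dict mapping document URI -> release datetime, a hypothetical
    stand-in for the values stored in the database.
    """
    return sorted(
        uri for uri, when in released_on.items()
        if last_run < when <= now
    )

# Example: the job last ran at 16:50 and runs again at 17:00.
docs = {
    "/docs/a.xml": datetime(2019, 7, 21, 16, 45),  # covered by an earlier run
    "/docs/b.xml": datetime(2019, 7, 21, 17, 0),   # due in this run
    "/docs/c.xml": datetime(2019, 7, 21, 17, 5),   # not yet released
}
due = due_documents(
    docs,
    last_run=datetime(2019, 7, 21, 16, 50),
    now=datetime(2019, 7, 21, 17, 0),
)
print(due)
```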

For each of those documents, you would spawn a task to POST the document, so that an error in one doesn't cause problems for the others. After looping through, save the current time in the database for the next time.
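The loop described above could be sketched roughly as follows, again in plain Python rather than XQuery. `run_release_job`, `post_document`, and the `state` dictionary are hypothetical names: `post_document` stands in for whatever performs the HTTP POST, and the thread pool stands in for spawning tasks on the Task Server. One assumption beyond the answer is that the current time is captured once and reused as both the upper bound of the window and the saved checkpoint, so that no release time can fall between two runs.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

def run_release_job(released_on, state, now, post_document):
    """Post every document released in (state['last_run'], now].

    Returns the sorted list of URIs whose POST raised an error.
    """
    due = [uri for uri, when in released_on.items()
           if state["last_run"] < when <= now]
    failed = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        # One task per document, so an error in one does not affect the others.
        futures = {uri: pool.submit(post_document, uri) for uri in due}
        for uri, future in futures.items():
            try:
                future.result()
            except Exception:
                failed.append(uri)
    # Save the checkpoint that the next run will use as its lower bound.
    state["last_run"] = now
    return sorted(failed)

# Example run with a fake POST function that fails for one document.
posted = []

def fake_post(uri):
    if uri == "/docs/bad.xml":
        raise RuntimeError("simulated POST failure")
    posted.append(uri)

docs = {
    "/docs/a.xml": datetime(2019, 7, 21, 17, 0),
    "/docs/bad.xml": datetime(2019, 7, 21, 16, 55),
    "/docs/later.xml": datetime(2019, 7, 21, 18, 0),
}
state = {"last_run": datetime(2019, 7, 21, 16, 50)}
now = datetime(2019, 7, 21, 17, 0)
failed = run_release_job(docs, state, now, fake_post)
print(posted, failed, state["last_run"])
```

In this sketch, failed URIs are only reported back to the caller. Whether and how to retry them is left open.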

The 10-minute window can be as large or small as you like, depending on your tolerance for a little delay.