What is the best document storage strategy in NoSQL?


NoSQL databases like Couchbase hold a lot of documents in memory, hence their enormous speed, but this also puts greater demands on the memory size of the server(s) they run on.

I'm looking for the best of several competing strategies for storing documents in a NoSQL database. These are:

  • Optimise for speed

Putting all the information into one (big) document has the advantage that a single GET retrieves it, from memory or from disk (if it was purged from memory before). With schema-less NoSQL databases this is almost the intended usage. But eventually the document becomes too big and eats up a lot of memory, so fewer documents can be kept in memory in total.

  • Optimise for memory

Splitting everything up into several smaller documents (e.g. using compound keys, as described in this question: Designing record keys for document-oriented database - best practice), especially when each document holds only the information needed in one specific read/update operation, would allow more (transient) documents to be held in memory (see the sketch below).
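To make the trade-off concrete, here is a minimal sketch of the two layouts, with plain Python dicts standing in for JSON documents; the MSISDN, field names and values are all invented:

```python
# Illustrative only: plain Python dicts standing in for JSON documents.
# The MSISDN, field names and values are invented.

msisdn = "6281234567890"

# Strategy 1 -- optimise for speed: one big document per subscriber.
# A single GET fetches everything, but the whole document sits in the
# cache even when only one field is needed.
big_doc = {
    "profile": {"name": "A. Customer", "age": 31, "gender": "f"},
    "revenue": {"total": 125000, "calls": 87},
    "optin":   [{"in": "2014-01-03", "out": "2014-02-11"}],
}

# Strategy 2 -- optimise for memory: split along access patterns using
# compound keys ("<MSISDN>:<facet>"), so each read/update touches only
# the small document it actually needs.
split_docs = {
    msisdn + ":profile": big_doc["profile"],
    msisdn + ":revenue": big_doc["revenue"],
    msisdn + ":optin":   big_doc["optin"],
}
```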

The use case I'm looking at is Call Detail Records (CDRs) from telecommunication providers. These CDRs typically run into the hundreds of millions per day. Yet many of these customers don't produce a single record on any given day (I'm looking at the South-East Asian market with its prepaid dominance and still lower data saturation). That means a large number of documents see a read/update maybe every other day, and only a small percentage go through several read/update cycles per day.

One solution that was suggested to me is to build 2 buckets, with more RAM allocated to the more transient documents and less RAM allocated to the second bucket holding the bigger documents. That would allow faster access to the more transient data and slower access to the bigger documents, which e.g. hold profile/user information that doesn't change at all. I see two downsides to this proposal though: one is that you can't build a view (Map/Reduce) across two buckets (this is specific to Couchbase; other NoSQL solutions might allow it), and the second is the extra overhead of closely managing the memory-allocation balance between the two buckets as the user base grows.

Has anyone else been challenged by this, and what was your solution to the problem? What would be the best strategy from your POV, and why? Clearly it must be something in the middle of both extremes; keeping everything in one document, or splitting that one big document into hundreds of documents, can't be the ideal solution IMO.

EDIT 2014-09-14: OK, this comes close to answering my own question, but in the absence of any offered solution so far, and following a comment, here is a bit more background on how I now plan to organise my data, trying to hit a sweet spot between speed and memory consumption:

Mobile_No:Profile

  • this holds profile information from a table, not directly from a CDR. Less transient data goes in here, like age, gender and name. The key is a compound key consisting of the mobile number (MSISDN) and the word "profile", separated by a ":"

Mobile_No:Revenue

  • this holds transient information like usage counters and variables accumulating the total revenue the customer has spent. The key is again a compound key consisting of the mobile number (MSISDN) and the word "revenue", separated by a ":"

Mobile_No:Optin

  • this holds semi-transient information about when a customer opted into the program and when he/she opted out again. This can happen several times and is handled via an array. The key is again a compound key consisting of the mobile number (MSISDN) and the word "optin", separated by a ":"

Connection_Id

  • this holds information about a specific A/B connection (sender/receiver) made via voice or video call or SMS/MMS. The key consists of both mobile_no's concatenated.

Before these changes in the document structure I was putting all the profile, revenue and optin information into one big document, always keeping the connection_id as a separate document. The new storage strategy hopefully gives me a better compromise between speed and memory consumption, as I split the main document into several documents so that each of them holds only the information that is read/updated in a single step of the app.
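As an illustration of that single-step access, here is a rough sketch of the per-CDR hot path, written in the Couchbase Python SDK 2.x style (Bucket, get, replace with CAS); the connection string, bucket name and the exact field names are my assumptions:

```python
# Rough sketch of the per-CDR hot path against this layout, in the
# Couchbase Python SDK 2.x style (Bucket, get, replace with CAS).
# Connection string, bucket name and field names are assumptions.
from couchbase.bucket import Bucket
from couchbase.exceptions import KeyExistsError

bucket = Bucket("couchbase://localhost/cdr")

def apply_cdr(msisdn, revenue_delta):
    """Update only the small, transient revenue document for one CDR;
    the profile and optin documents stay untouched."""
    key = msisdn + ":revenue"
    while True:
        rv = bucket.get(key)                # current revenue doc + CAS
        doc = rv.value
        doc["total"] += revenue_delta
        doc["calls"] += 1
        try:
            # CAS-guarded replace: raises if a concurrent CDR updated
            # the document in the meantime, in which case we retry.
            bucket.replace(key, doc, cas=rv.cas)
            return
        except KeyExistsError:
            continue
```

The point is that only the small revenue document crosses the wire for every CDR; the profile and optin documents can be evicted from memory without hurting the hot path.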

The split also takes care of the different rates of change over time, with some data being very transient (like the counters and the accumulating revenue field that gets updated with every incoming CDR) and the profile information remaining mostly unchanged. I hope this gives a better understanding of what I'm trying to achieve; comments and feedback are more than welcome.

2 Answers

Answer from Ridiculous、:

I do agree with your technique for the efficient use of resources (if they are limited). But on the flip side, the system might end up being very chatty. If I understand correctly, your "connections" document design is too granular and may introduce too many I/Os across the network. In my experience these network I/Os are very expensive if you are designing a system that makes real-time decisions. You may mathematically estimate the impact of the different choices to balance these opposing forces :)
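As a hedged illustration of such an estimate, here is a toy back-of-envelope comparison; every number is an invented placeholder, not a measurement from the question:

```python
# Toy back-of-envelope estimate of the opposing forces (network round
# trips vs. resident memory). All numbers are invented placeholders.

subscribers     = 50_000_000
active_fraction = 0.4                                   # subscribers touched per day
docs_per_update = {"one_big_doc": 1, "split_docs": 2}   # KV ops per CDR
hot_doc_bytes   = {"one_big_doc": 8_000, "split_docs": 1_200}  # hot data per subscriber

for design in ("one_big_doc", "split_docs"):
    daily_ops  = subscribers * active_fraction * docs_per_update[design]
    hot_ram_gb = subscribers * hot_doc_bytes[design] / 1e9
    print(f"{design:12s}: ~{daily_ops:.1e} KV ops/day, ~{hot_ram_gb:.0f} GB hot data")
```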

I do think that the spirit of scalable big-data systems is that we should worry "less" about resource "constraints". These NoSQL database licenses do not go by CPU cores, commodity hardware is cheap, and RAM is getting cheaper as we speak. Then again, the return on investment of these systems will also influence the architectural decisions.

Answer from 萌系小妹纸:

Thank you for updating your original question. You are correct in talking about finding the right balance between coarse-grained and fine-grained documents.

The final architecture of the documents really falls out of your particular business-domain needs. You have to identify, in your use cases, the "chunks" of data that are needed as a whole, and then base the shape of your stored documents on that. Here are some high-level steps to perform when you design your document structure:

  1. Identify all document-consumption use cases for your app/service (read, read-write, searchable items).
  2. Design your documents (most likely you will end up with several smaller documents rather than one big doc that holds everything).
  3. Design your document keys so that different document types can coexist in one bucket (e.g. use a namespace in the key value).
  4. Do a "dry run" of the resulting model against your use cases to see if you get optimal (read/write) transactions to NoSQL, with all required document data within the transaction (a dry-run sketch follows below).
  5. Run performance testing for your use cases (try to simulate at least 2 times the expected load).
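For step 4, a dry run can be as simple as tabulating which documents each use case touches per transaction; here is a minimal sketch (the key patterns follow the question's scheme, while the use-case list itself is an assumption):

```python
# Sketch of step 4's "dry run": list each consumption use case and the
# documents it touches per transaction, then eyeball the op counts.

use_cases = {
    "apply incoming CDR":      ["<MSISDN>:revenue", "<A_no><B_no>"],
    "show subscriber profile": ["<MSISDN>:profile"],
    "check opt-in status":     ["<MSISDN>:optin"],
}

for name, keys in use_cases.items():
    print(f"{name}: {len(keys)} document(s) per transaction -> {keys}")
```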

Note: When you design different docs it's OK to have some redundancy (remember it's not an RDBMS in normalized form); think of it more as object-oriented design.

Note 2: If you have searchable items that are outside of your keys (e.g. searching customers by last name "starts with" plus some other dynamic criteria), consider using the ElasticSearch integration with CB, or try the N1QL query language that is coming with CB 3.0.
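As an illustration, a hypothetical N1QL query for such a "starts with" search might look like this; the bucket name, the type discriminator and the attribute names are all invented:

```python
# Hypothetical N1QL query (CB 3.0+) for a "last name starts with"
# search on profile documents; the 'cdr' bucket, the 'type' field and
# the 'last_name' attribute are invented for illustration.
query = """
SELECT META(p).id, p.last_name
FROM cdr p
WHERE p.type = 'profile'
  AND p.last_name LIKE 'Smi%'
"""
```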

It seems you are going in the right direction by splitting into several smaller documents all linked by an MSISDN, e.g. MSISDN:profile, MSISDN:revenue, MSISDN:optin. I would pay special attention to your last document type, the "A/B" connection. It sounds like it might be generated in large volume and is transient by nature... so you have to find out how long these documents need to live in the Couchbase bucket. You can specify a TTL (time to live) so that old docs are automatically cleaned up.
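A minimal sketch of what that could look like (again Couchbase Python SDK 2.x style; the bucket name and the 30-day lifetime are arbitrary assumptions):

```python
# Minimal sketch of storing an A/B connection document with a TTL so
# old docs expire automatically (Couchbase Python SDK 2.x style API;
# the bucket name and the 30-day lifetime are assumptions).
from couchbase.bucket import Bucket

bucket = Bucket("couchbase://localhost/cdr")

def store_connection(a_no, b_no, cdr_doc):
    key = a_no + b_no                                # concatenated A/B key, as in the question
    bucket.upsert(key, cdr_doc, ttl=30 * 24 * 3600)  # auto-expire after ~30 days
```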
