WiredTiger and in-place updates

Posted 2019-04-02 06:15

Question:

I have a collection of users. Each user has a field "geoposition" that is updated quite often (every time the user moves significantly). As I want concurrency on the document level instead of the collection level when updating, I am using the WiredTiger storage engine.

I learned that with WiredTiger every update to a document results in the creation of a new document:

http://learnmongodbthehardway.com/schema/wiredtiger/

WiredTiger does not support in place updates

However, this article also says that "Even though [WiredTiger] does not allow for in-place updates, it could still perform better than MMAP for many workloads". What does it mean? What are the exact implications that I must be aware of when I use WiredTiger? For example, without in-place updates will the database size grow quickly? Are there other things to be aware of?

I also learned that WiredTiger in MongoDB 3.6 added the capability to store deltas rather than re-writing the entire document (https://jira.mongodb.org/browse/DOCS-11416). What does this mean, exactly?

NOTE: What I also don't understand is that nowadays most (if not all) hard drives have a sector size of 4096 bytes, so you cannot write only 4 bytes (for example) to the drive; you must write the full 4096-byte block (read it first, update the 4 bytes in it, then write it back). As most documents are smaller than 4096 bytes, does this mean that re-writing the whole document is necessary in any case (even with MMAP)? What did I miss?

Answer 1:

With the MMAPv1 storage engine, in-place updates are frequently highlighted as an optimization strategy because indexes for a document point directly to file locations and offsets. Moving a document to a new storage location (particularly if there are many index entries to update) has more overhead for MMAPv1 than an in-place update which only has to update the changed fields. See: Record Storage Characteristics in MMAPv1.

WiredTiger does not support in-place updates because internally it uses MVCC (multiversion concurrency control), a technique commonly used by database management systems: an update creates a new version of the document rather than overwriting the old one, so readers that started earlier can keep seeing the version they began with. This is a significant technical improvement over the simpler model in MMAPv1, and it allows for building more advanced features like isolation levels and transactions. WiredTiger's indexes also have a level of indirection (they reference an internal RecordID instead of the file location and offset), so document moves at the storage level are not a significant pain point.
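To make the indirection point concrete, here is a toy sketch in Python. It is not how either storage engine is implemented; the class names, fields, and the idea of counting "rewritten index entries" are all assumptions made for illustration. It only shows why a document move is costly when index entries hold the storage location directly, and cheap when they hold a stable record id.

```python
from dataclasses import dataclass, field


@dataclass
class DirectOffsetStore:
    """Hypothetical model: every index entry stores the document's offset."""
    offsets: dict = field(default_factory=dict)   # doc key -> storage offset
    indexes: dict = field(default_factory=dict)   # index name -> {doc key: offset}

    def insert(self, key, offset, index_names):
        self.offsets[key] = offset
        for name in index_names:
            self.indexes.setdefault(name, {})[key] = offset

    def move(self, key, new_offset):
        """Relocate a document and return how many index entries were rewritten."""
        self.offsets[key] = new_offset
        rewritten = 0
        for entries in self.indexes.values():
            if key in entries:
                entries[key] = new_offset
                rewritten += 1
        return rewritten


@dataclass
class RecordIdStore:
    """Hypothetical model: index entries store a stable record id instead."""
    locations: dict = field(default_factory=dict)  # record id -> storage location
    indexes: dict = field(default_factory=dict)    # index name -> {doc key: record id}
    next_id: int = 1

    def insert(self, key, location, index_names):
        record_id = self.next_id
        self.next_id += 1
        self.locations[record_id] = location
        for name in index_names:
            self.indexes.setdefault(name, {})[key] = record_id
        return record_id

    def move(self, record_id, new_location):
        """Relocate a document; index entries are untouched, so none are rewritten."""
        self.locations[record_id] = new_location
        return 0
```

With three indexes on a document, a move in the first model rewrites three index entries, while a move in the second rewrites none because only the record id to location mapping changes.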

However, this article also says that "Even though [WiredTiger] does not allow for in-place updates, it could still perform better than MMAP for many workloads".

It means that although MMAPv1 may have a more efficient path for in-place updates, WiredTiger has other advantages such as compression and improved concurrency control. You could perhaps construct a workload consisting only of in-place updates to a few documents which might perform better in MMAPv1, but actual workloads are typically more varied. The only way to confirm the impact for a given workload would be to test in a representative environment.

However, the general choice of MMAPv1 vs WiredTiger is moot if you want to plan for the future: WiredTiger has been the default storage engine since MongoDB 3.2 and some newer product features are not supported by MMAPv1. For example, MMAPv1 does not support Majority Read Concern which in turn means it cannot be used for Replica Set Config Servers (required for sharding in MongoDB 3.4+) or Change Streams (MongoDB 3.6+). MMAPv1 will be deprecated in the next major release of MongoDB (4.0) and is currently scheduled to be removed in MongoDB 4.2.

What are the exact implications that I must be aware of when I use WiredTiger? For example, without in-place updates will the database size grow quickly?

Storage outcomes depend on several factors including your schema design, workload, configuration, and version of MongoDB server. MMAPv1 and WiredTiger use different record allocation strategies, but both will try to use preallocated space that is marked as free/reusable. In general WiredTiger is more efficient with use of storage space, and it also has the advantage of compression for data and indexes. MMAPv1 allocates additional storage space to try to optimize for in-place updates and avoid document moves, although you can choose a "no padding" strategy for collections where the workload does not change the document size over time.
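As a rough illustration of the padding trade-off, the sketch below models MMAPv1's power-of-two record allocation in Python. It is a simplification under stated assumptions: the 32-byte minimum is an assumed starting size, record headers and the upper bound on power-of-two sizing are ignored, and the function names are hypothetical.

```python
def power_of_two_allocation(doc_size: int, minimum: int = 32) -> int:
    """Smallest power-of-two record size (>= minimum) that holds doc_size bytes."""
    size = minimum
    while size < doc_size:
        size *= 2
    return size


def fits_in_place(old_size: int, new_size: int, padded: bool) -> bool:
    """Can a document that grew from old_size to new_size stay in its record?

    padded=True models power-of-two allocation; padded=False models a
    "no padding" collection where the record is exactly the document size.
    """
    allocated = power_of_two_allocation(old_size) if padded else old_size
    return new_size <= allocated
```

For example, a 300-byte document gets a 512-byte record under this model, so growing to 450 bytes still fits in place; with no padding the same growth forces a move. The cost of padding is the unused space (212 bytes here) on documents that never grow.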

There has been significant investment in improving and tuning WiredTiger for different workloads since it was first introduced in MongoDB 3.0, so I would strongly encourage testing with the latest production release series for the best outcome. If you have a specific question about schema design and storage growth, I'd suggest posting details on DBA StackExchange for discussion.

I also learned that WiredTiger in MongoDB 3.6 added the capability to store deltas rather than re-writing the entire document (https://jira.mongodb.org/browse/DOCS-11416). What does this mean, exactly?

This is an implementation detail that improves WiredTiger's internal data structures for some use cases. In particular, WiredTiger in MongoDB 3.6+ can be more efficient about working with small changes to large documents (as compared to previous releases). The WiredTiger cache needs to be able to return multiple versions of documents as long as they are used by open internal sessions (MVCC, as mentioned earlier), so for large documents with small updates it could be more efficient to store a list of deltas. However, if too many deltas accumulate (or the deltas are changing most of the fields in a document) this approach could be less performant than maintaining multiple copies of the full document.
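The trade-off between deltas and full copies can be sketched with a toy version chain in Python. This is not WiredTiger's internal format; the class, its methods, and the use of JSON length as a stand-in for stored size are assumptions made only to illustrate the idea.

```python
import json


def encoded_size(value) -> int:
    """Stand-in for stored size: length of the value's JSON encoding in bytes."""
    return len(json.dumps(value, sort_keys=True).encode("utf-8"))


class DeltaChain:
    """Hypothetical version chain: one full base document plus a list of deltas."""

    def __init__(self, base: dict):
        self.base = dict(base)
        self.deltas = []  # each delta holds only the fields that changed

    def update(self, changes: dict) -> None:
        self.deltas.append(dict(changes))

    def read(self, version: int) -> dict:
        """Rebuild a version: 0 is the base, n applies the first n deltas."""
        if not 0 <= version <= len(self.deltas):
            raise IndexError("unknown version")
        doc = dict(self.base)
        for delta in self.deltas[:version]:
            doc.update(delta)
        return doc

    def delta_bytes(self) -> int:
        """Size when storing the base once plus every delta."""
        return encoded_size(self.base) + sum(encoded_size(d) for d in self.deltas)

    def full_copy_bytes(self) -> int:
        """Size when storing a complete copy of every version instead."""
        return sum(encoded_size(self.read(v)) for v in range(len(self.deltas) + 1))
```

For a large document where only "geoposition" changes, delta_bytes() stays close to the size of one copy, while full_copy_bytes() grows by a whole document per update. The flip side is visible in read(): the longer the chain, the more deltas must be applied to rebuild a version.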

When data is committed to disk via a checkpoint, a full version of the document still needs to be written. If you want to learn more about some of the internals, there's a MongoDB Path To Transactions series of videos following the development of features to support multi-document transactions in MongoDB 4.0.

What I also don't understand is that nowadays most (if not all) hard drives have a sector size of 4096 bytes, so you cannot write only 4 bytes (for example) to the drive; you must write the full 4096-byte block (read it first, update the 4 bytes in it, then write it back). As most documents are smaller than 4096 bytes, does this mean that re-writing the whole document is necessary in any case (even with MMAP)? What did I miss?

Without going too far into implementation details or trying to explain all the moving parts involved, consider how the different approaches apply to workloads where many documents are being updated, rather than to a single document, and how they affect memory usage before documents are written to disk. Depending on factors like document size and compression, a single block of I/O can hold anything from a fraction of a document (the maximum document size is 16MB) to multiple documents.

In MongoDB the general flow is that documents are updated in an in-memory view (for example, the WiredTiger cache) with changes persisted to disk in a fast append-only journal format before being periodically flushed to the data files. If the O/S only has to write blocks of data that have changed, touching fewer blocks of data requires less overall I/O.
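The block arithmetic behind that last point can be checked with a few lines of Python. This is only arithmetic on block boundaries, assuming a 4096-byte block size as in the question; it does not model MongoDB's actual I/O path, and the function name is hypothetical.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes, taken from the question


def blocks_touched(writes, block_size: int = BLOCK_SIZE) -> set:
    """Return the set of block indexes covered by (offset, length) byte writes."""
    touched = set()
    for offset, length in writes:
        if length <= 0:
            continue  # an empty write touches nothing
        first = offset // block_size
        last = (offset + length - 1) // block_size
        touched.update(range(first, last + 1))
    return touched
```

For a 300-byte document inside one block, changing 4 bytes and rewriting the whole document both touch a single block, which matches the intuition in the question. For a 20000-byte document starting at offset 0, a full rewrite touches five blocks, while a 4-byte change at offset 10000 touches only one.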