Oracle recently released a Berkeley DB back-end to SQLite. I happen to have a hundreds-of-megabytes SQLite database that could very well benefit from "improved performance, concurrency, scalability, and reliability", but Oracle's site appears to lack any measurements of the improvements. Has anyone here done some benchmarking?
Answer 1:
I participated in the beta evaluation of the BDB SQLite code and one of the things I tried to get a handle on was the performance difference. At this point, I cannot publish exactly what I found until I have at least one other person evaluate my code, run the tests, and confirm the numbers I got (which is being done). However, I can generalize here and say that there are cases where BDB offers significant performance improvements over SQLite, specifically in the area of handling heavy loads that involve write-concurrency.
There are, generally, two measures of "fast": (1) efficiency -- how long it takes a single process to do XYZ -- versus (2) concurrency -- how many times many processes can do XYZ per unit time. The main problem BDB addresses is concurrency -- large-scale transaction processing. Think of many concurrent connections writing to and modifying the contents of the database.
SQLite by design uses database-level locking, so at most one writer can be working in the database at a time. As a result, SQLite's transaction rate stays more or less constant with the number of concurrent connections, and its scalability in write-intensive applications is really measured by its efficiency (1).
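A minimal sketch of that behavior, using Python's built-in sqlite3 module (the file name and row counts are made up for illustration): with one transaction per insert, aggregate write throughput stays roughly flat as writer threads are added, because every writer queues on the single database lock.

```python
import sqlite3
import threading
import time

DB_PATH = "demo.db"  # hypothetical scratch database

def writer(n_rows):
    # Each thread gets its own connection; the timeout makes it wait on the
    # database lock instead of failing with "database is locked".
    conn = sqlite3.connect(DB_PATH, timeout=30)
    for i in range(n_rows):
        with conn:  # one transaction per insert
            conn.execute("INSERT INTO kv (v) VALUES (?)", (i,))
    conn.close()

setup = sqlite3.connect(DB_PATH)
setup.execute("CREATE TABLE IF NOT EXISTS kv (id INTEGER PRIMARY KEY, v INTEGER)")
setup.commit()
setup.close()

for n_threads in (1, 2, 4, 8):
    start = time.time()
    threads = [threading.Thread(target=writer, args=(500,)) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    rate = n_threads * 500 / (time.time() - start)
    print(f"{n_threads} writers: {rate:.0f} inserts/sec")  # roughly flat across rows
```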
BDB, on the other hand, uses page-level locking, which allows multiple writers to be working in the database at a given time (provided they are working on separate pages). Thus BDB's rate potentially increases with the number of connections, and its scalability is a matter of both efficiency (1) and concurrency (2), which can add up.
Mainly, what it boils down to is (write) concurrency. BDB can push more TPS than SQLite for multiple writers. By transaction, I mean something that modifies the database (what real help are transactions for read-only operations?). That said, for read concurrency (apps that mainly do SELECTs), SQLite could very well go head to head with BDB, because locking is no longer a critical issue.
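The read side of that claim can be sketched the same way (same made-up table as above): readers take only shared locks, so concurrent SELECT throughput can scale with thread count until CPU or I/O saturates.

```python
import sqlite3
import threading
import time

DB_PATH = "demo.db"  # same hypothetical database as the write sketch

def reader(n_queries):
    conn = sqlite3.connect(DB_PATH)
    for _ in range(n_queries):
        # Shared locks coexist, so readers do not block one another.
        conn.execute("SELECT COUNT(*) FROM kv").fetchone()
    conn.close()

for n_threads in (1, 2, 4, 8):
    start = time.time()
    threads = [threading.Thread(target=reader, args=(200,)) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    rate = n_threads * 200 / (time.time() - start)
    print(f"{n_threads} readers: {rate:.0f} queries/sec")
```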
As for the size of the dataset, I am not sure. I've not looked into that. Ultimately, they both use B-trees for storage. There may be factors in their respective implementations to consider, but I've not investigated that. I know that SQLite can gracefully handle data sets into the hundreds of MBs and double-digit GBs (and perhaps more, now that the dirty page map implementation has been changed).
Therefore, if you have an application that employs many connections modifying a given database, and page contention is relatively low, then BDB can offer significant performance improvements. But page contention is a critical variable. In the limit, if you had a BDB database whose data consisted of a single page, its performance would match SQLite's in all cases, because page-level locking there effectively degenerates into the equivalent of database-level locking -- everybody is fighting over one thing. However, as the number of pages in BDB increases (and page contention decreases), the maximum TPS will start to grow with the number of concurrent connections. From that point on, memory becomes the next limiting factor. But that's another story.
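To make the contention point concrete, here is a hedged sketch of the two extremes. The Berkeley DB SQL API is built as a drop-in replacement for libsqlite3, so the same Python sqlite3 code can be pointed at a BDB-backed build; against such a build, the hot-row workload below should plateau the way SQLite does, while the spread-out inserts should scale with writer count. Table names, the key range, and the counts are all invented for illustration.

```python
import random
import sqlite3
import threading
import time

DB_PATH = "bdb-demo.db"  # hypothetical; assumes a BDB-backed SQLite build

def hot_row_writer(n_txns):
    # Every transaction updates the same row, i.e. the same B-tree page:
    # page-level locking degenerates into database-level locking.
    conn = sqlite3.connect(DB_PATH, timeout=30)
    for _ in range(n_txns):
        with conn:
            conn.execute("UPDATE counter SET v = v + 1 WHERE id = 1")
    conn.close()

def spread_writer(n_txns):
    # Writes land on random keys across a wide range, so concurrent
    # writers mostly touch different pages and rarely block each other.
    conn = sqlite3.connect(DB_PATH, timeout=30)
    for _ in range(n_txns):
        with conn:
            conn.execute("INSERT OR REPLACE INTO kv (id, v) VALUES (?, 1)",
                         (random.randrange(10_000_000),))
    conn.close()

setup = sqlite3.connect(DB_PATH)
setup.execute("CREATE TABLE IF NOT EXISTS counter (id INTEGER PRIMARY KEY, v INTEGER)")
setup.execute("INSERT OR IGNORE INTO counter VALUES (1, 0)")
setup.execute("CREATE TABLE IF NOT EXISTS kv (id INTEGER PRIMARY KEY, v INTEGER)")
setup.commit()
setup.close()

for name, worker in (("hot row", hot_row_writer), ("spread", spread_writer)):
    for n_threads in (1, 4):
        start = time.time()
        threads = [threading.Thread(target=worker, args=(500,)) for _ in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        rate = n_threads * 500 / (time.time() - start)
        print(f"{name}, {n_threads} writers: {rate:.0f} txns/sec")
```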
BTW, I am in the process of writing an article about using BDB for those coming from SQLite.
Article links:
- Oracle Berkeley DB SQL API vs. SQLite API – A Technical Evaluation
- Oracle Berkeley DB SQL API vs. SQLite API – Integration, Benefits and Differences
Answer 2:
That's kind of a loaded question. The results will vary dramatically depending on your disk access speeds, the size of the in-memory cache, the ratio of inserts to reads, page splits, concurrency, and so on.
Overall, BerkeleyDB can be extremely fast. I recently designed and built a data analysis platform for an employer that was capable of doing 40k inserts per second on an 8-core x86 system (while at the same time doing thousands of reads per second) with a dataset in the 30 GB range. This was with full transactional protection.
That was the best case, though -- there were times when inserts could drop to as low as 2k per second, depending on the incoming data and what was already stored in Berkeley. Performance drops significantly if you have slow disk I/O and a poor cache hit rate, or if the DB is constantly expanding and causing page splits. There is also an enormous amount of tuning you can do to increase performance for your particular dataset.
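For reference, the basic shape of the "full transactional protection" setup described above, using the raw Berkeley DB API via the third-party bsddb3 Python bindings, might look roughly like this; the environment path, cache size, and keys are placeholders, and set_cachesize is one of the tuning knobs alluded to.

```python
import os
from bsddb3 import db  # third-party bindings ("berkeleydb" on newer Pythons)

ENV_HOME = "/tmp/bdb-env"  # hypothetical environment directory
os.makedirs(ENV_HOME, exist_ok=True)

# A transactional environment: memory pool, locking, logging, transactions.
env = db.DBEnv()
env.set_cachesize(0, 512 * 1024 * 1024)  # 512 MB cache; tune to the working set
env.open(ENV_HOME,
         db.DB_CREATE | db.DB_INIT_MPOOL | db.DB_INIT_LOCK |
         db.DB_INIT_LOG | db.DB_INIT_TXN)

d = db.DB(env)
d.open("data.db", dbtype=db.DB_BTREE, flags=db.DB_CREATE | db.DB_AUTO_COMMIT)

# Writes grouped under an explicit transaction; abort on failure.
txn = env.txn_begin()
try:
    d.put(b"key-1", b"value-1", txn=txn)
    d.put(b"key-2", b"value-2", txn=txn)
    txn.commit()
except db.DBError:
    txn.abort()
    raise

d.close()
env.close()
```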
Overall it's an excellent system, but documentation and knowledge are fairly slim. I recommend The BerkeleyDB Book as probably the best reference currently available.
Answer 3:
In addition to the Berkeley DB Book that Brian mentions, you may also find the following resources useful:
- The Berkeley DB online forums can provide lots of suggestions from both users and the developers of the product. See the Berkeley DB forum.
- The Berkeley DB documentation set. In particular, several sections of the Reference Guide cover tuning, performance, and throughput.