Can SQLite handle 90 million records?

2019-01-12 23:56发布

问题:

Or should I use a different hammer to fix this problem.

I've got a very simple use-case for storing data, effectively a sparse matrix, which I've attempted to store in a SQLite database. I've created a table:

create TABLE data ( id1 INTEGER KEY, timet INTEGER KEY, value REAL )

into which I insert a lot of data, (800 elements every 10 minutes, 45 times a day), most days of the year. The tuple of (id1,timet) will always be unique.

The timet value is seconds since the epoch, and will always be increasing. The id1 is, for all practical purposes, a random integer. There is probably only 20000 unique ids though.

I'd then like to access all values where id1==someid or access all elements where timet==sometime. On my tests using the latest SQLite via the C interface on Linux, a lookup for one of these (or any variant of this lookup) takes approximately 30 seconds, which is not fast enough for my use case.

I tried defining an index for the database, but this slowed down insertion to completely unworkable speeds (I might have done this incorrectly though...)

The table above leads to very slow access for any data. My question is:

  • Is SQLite completely the wrong tool for this?
  • Can I define indices to speed things up significantly?
  • Should I be using something like HDF5 instead of SQL for this?

Please excuse my very basic understanding of SQL!

Thanks

I include a code sample that shows how the insertion speed slows to a crawl when using indices. With the 'create index' statements in place, the code takes 19 minutes to complete. Without that, it runs in 18 seconds.


#include <iostream>
#include <sqlite3.h>

void checkdbres( int res, int expected, const std::string msg ) 
{
  if (res != expected) { std::cerr << msg << std::endl; exit(1); } 
}

int main(int argc, char **argv)
{
  const size_t nRecords = 800*45*30;

  sqlite3      *dbhandle = NULL;
  sqlite3_stmt *pStmt = NULL;
  char statement[512];

  checkdbres( sqlite3_open("/tmp/junk.db", &dbhandle ), SQLITE_OK, "Failed to open db");

  checkdbres( sqlite3_prepare_v2( dbhandle, "create table if not exists data ( issueid INTEGER KEY, time INTEGER KEY, value REAL);", -1, & pStmt, NULL ), SQLITE_OK, "Failed to build create statement");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
  checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");
  checkdbres( sqlite3_prepare_v2( dbhandle, "create index issueidindex on data (issueid );", -1, & pStmt, NULL ), SQLITE_OK, "Failed to build create statement");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
  checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");
  checkdbres( sqlite3_prepare_v2( dbhandle, "create index timeindex on data (time);", -1, & pStmt, NULL ), SQLITE_OK, "Failed to build create statement");
  checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
  checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");

  for ( size_t idx=0; idx < nRecords; ++idx)
  {
    if (idx%800==0)
    {
      checkdbres( sqlite3_prepare_v2( dbhandle, "BEGIN TRANSACTION", -1, & pStmt, NULL ), SQLITE_OK, "Failed to begin transaction");
      checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute begin transaction" );
      checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize begin transaction");
      std::cout << "idx " << idx << " of " << nRecords << std::endl;
    }

    const size_t time = idx/800;
    const size_t issueid = idx % 800;
    const float value = static_cast<float>(rand()) / RAND_MAX;
    sprintf( statement, "insert into data values (%d,%d,%f);", issueid, (int)time, value );
    checkdbres( sqlite3_prepare_v2( dbhandle, statement, -1, &pStmt, NULL ), SQLITE_OK, "Failed to build statement");
    checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute insert statement" );
    checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize insert");

    if (idx%800==799)
    {
      checkdbres( sqlite3_prepare_v2( dbhandle, "END TRANSACTION", -1, & pStmt, NULL ), SQLITE_OK, "Failed to end transaction");
      checkdbres( sqlite3_step( pStmt ), SQLITE_DONE, "Failed to execute end transaction" );
      checkdbres( sqlite3_finalize( pStmt ), SQLITE_OK, "Failed to finalize end transaction");
    }
  }

  checkdbres( sqlite3_close( dbhandle ), SQLITE_OK, "Failed to close db" ); 
}

回答1:

Are you inserting all of the 800 elements at once? If you are, doing the inserts within a transaction will speed up the process dramatically.

See http://www.sqlite.org/faq.html#q19

SQLite can handle very large databases. See http://www.sqlite.org/limits.html



回答2:

I've looked at your code, and I think you might be overdoing it with the prepare and finalize statements. I am by no means an SQLite expert, but there's got to be significant overhead in preparing a statement each and every time through the loop.

Quoting from the SQLite website:

After a prepared statement has been evaluated by one or more calls to sqlite3_step(), it can be reset in order to be evaluated again by a call to sqlite3_reset(). Using sqlite3_reset() on an existing prepared statement rather creating a new prepared statement avoids unnecessary calls to sqlite3_prepare(). In many SQL statements, the time needed to run sqlite3_prepare() equals or exceeds the time needed by sqlite3_step(). So avoiding calls to sqlite3_prepare() can result in a significant performance improvement.

http://www.sqlite.org/cintro.html

In your case, rather than preparing a new statement each time, you could try binding new values to your existing statement.

All this said, I think the indexes might be the actual culprit, since the time keeps increasing as you add more data. I am curious enough about this where I plan to do some testing over the weekend.



回答3:

Answering my own question just as a place to put some details:

It turns out (as correctly suggested above) that the index creation is the slow step, and every time I do another transaction of inserts, the index is updated which takes some time. My solution is to: (A) create the data table (B) insert all my historical data (several years worth) (C) create the indexes

Now all lookups etc are really fast and sqlite does a great job. Subsequent daily updates now take a few seconds to insert only 800 records, but that is no problem since it only runs every 10 minutes or so.

Thanks to Robert Harvey and maxwellb for the help/suggestions/answers above.



回答4:

Since we know that capturing your data is fast when there is no index on the table, what might actually work is this:

  1. Capture the 800 values in a temporary table with no index.

  2. Copy the records to the master table (containing indexes) using the form of INSERT INTO that takes a SELECT statement.

  3. Delete the records from the temporary table.

This technique is based on the theory that the INSERT INTO that takes a SELECT statement is faster than executing individual INSERTs.

Step 2 can be executed in the background by using the Asynchronous Module, if it still proves to be a bit slow. This takes advantage of the bits of downtime between captures.



回答5:

Consider using a table for new inserts of the given day, without an index. Then, at the end of each day, run a script which will:

  1. Insert new values from new_table into master_table
  2. Clear the new_table for next day of processing

If you can do lookups on historical data in O(log n), and lookups on today's data in O(n), this should provide a nice compromise.



回答6:

I can't tell from your specs, but if the ID field is always increasing, and the time field includes YYYYMMDD for uniqueness and is also always increasing, and you're doing either ID searches or time searches, then the simplest non-database solution would be to simply append all records to a fixed-field text or binary file (since they're being generated in "sorted" order) and use code to do a binary search for the desired records (eg, find the first record with the ID or time of interest, then sequentially step through the desired range).



回答7:

When building large SQLite databases, always insert as much of the data as you can before creating the indices. That will run many times faster than if you create the indices before inserting the data.



回答8:

The theoretical maximum number of rows in a table is 2^64 (18446744073709551616 or about 1.8e+19). This limit is unreachable since the maximum database size of 140 terabytes will be reached first. A 140 terabytes database can hold no more than approximately 1e+13 rows, and then only if there are no indices and if each row contains very little data.