Merging SQLite databases is driving me mad. Help?

Posted 2019-04-15 21:19

I've got 32 SQLite (3.7.9) databases with 3 tables each that I'm trying to merge together using the idiom that I've found elsewhere (each db has the same schema):

attach 'db1.sqlite3' as toMerge;
insert into tbl1 select * from toMerge.tbl1;
insert into tbl2 select * from toMerge.tbl2;
insert into tbl3 select * from toMerge.tbl3;
detach toMerge;

and rinse-repeating for the entire set of databases. I do this using Python and the sqlite3 module:

import sqlite3

for fn in filelist:

    completedb = sqlite3.connect("complete.sqlite3")
    c = completedb.cursor()

    c.execute("pragma synchronous = off;")
    c.execute("pragma journal_mode = off;")

    print("Attempting to merge " + fn + ".")
    c.execute("attach ? as toMerge;", (fn,))  # bind fn rather than concatenating it into the SQL

    try:
        c.execute("insert into tbl1 select * from toMerge.tbl1;")
        c.execute("insert into tbl2 select * from toMerge.tbl2;")
        c.execute("insert into tbl3 select * from toMerge.tbl3;")
        completedb.commit()
        c.execute("detach toMerge;")  # detach after the commit; detaching mid-transaction fails
    except sqlite3.Error as err:
        print("Error!", type(err), "Error msg:", err)
        raise

Two of the tables are fairly small, only 50K rows per db, while the third (tbl3) is larger, about 850-900K rows. Now, what happens is that the inserts progressively slow down until I get to about the fourth database, when they grind to a near halt (on the order of a megabyte or two of file size added to the combined database every 1-3 minutes). In case it was Python, I've even tried dumping out the tables as INSERTs (.mode insert; .output foo; sqlite3 complete.db < foo is the skeleton, found here) and combining them in a bash script using the sqlite3 CLI to do the work directly, but I get exactly the same problem.

The table setup of tbl3 isn't too demanding - a text field containing a UUID, two integers, and four real values. My worry is that it's the number of rows, because I ran into exactly the same trouble at exactly the same spot (about four databases in) when the individual databases were an order of magnitude larger in terms of file size but had the same number of rows (I had trimmed the contents of tbl3 significantly by storing summary stats instead of raw data). Or maybe it's the way I'm performing the operation? Can anyone shed some light on this problem before I throw something out the window?

2 Answers
叼着烟拽天下 · 2019-04-15 21:55

Try dropping the indexes and/or the primary key on the larger table before the bulk inserts, and recreating them afterwards; keeping an index up to date across millions of inserts is a classic cause of this kind of progressive slowdown. A sketch follows.
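A minimal sketch of that approach, assuming a hypothetical index named idx_tbl3_uuid on a tbl3 column called uuid (the real index and column names come from your schema):

import sqlite3

completedb = sqlite3.connect("complete.sqlite3")
c = completedb.cursor()

# Drop the index before the bulk load; each insert then only appends to
# the table instead of also rebalancing the index B-tree.
c.execute("drop index if exists idx_tbl3_uuid;")  # hypothetical index name

for fn in filelist:  # filelist as in the question
    c.execute("attach ? as toMerge;", (fn,))
    c.execute("insert into tbl1 select * from toMerge.tbl1;")
    c.execute("insert into tbl2 select * from toMerge.tbl2;")
    c.execute("insert into tbl3 select * from toMerge.tbl3;")
    completedb.commit()
    c.execute("detach toMerge;")

# Rebuild the index once, after all the data is in place.
c.execute("create index idx_tbl3_uuid on tbl3(uuid);")
completedb.close()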

放荡不羁爱自由 · 2019-04-15 21:58

You didn't mention the OS you were using or the db file sizes. Windows can have issues with files bigger than 2 GB, depending on the version.

In any case, since this is a glorified batch script, why not get rid of the for loop, take the filename from sys.argv, and run the script once per source database (see the sketch below)? That way you will never have to deal with memory issues from doing too much in one process.
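A minimal sketch of that one-database-per-process idea (merge_one.py is a hypothetical name; the table names are the ones from the question), driven from a shell loop over the source files:

# merge_one.py -- merge exactly one source database into complete.sqlite3,
# then exit, so no state survives between merges.
# Usage: python merge_one.py part01.sqlite3
import sys
import sqlite3

fn = sys.argv[1]

completedb = sqlite3.connect("complete.sqlite3")
c = completedb.cursor()

c.execute("attach ? as toMerge;", (fn,))
for tbl in ("tbl1", "tbl2", "tbl3"):
    c.execute("insert into %s select * from toMerge.%s;" % (tbl, tbl))
completedb.commit()
c.execute("detach toMerge;")

c.close()
completedb.close()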

Mind you, ending each loop iteration with the following will likely also fix things:

c.close()
completedb.close()

You say that the same thing occurs when you follow this process using the CLI, quitting after every db. I assume that you mean the Python CLI, and that quitting means you exit and restart Python. If that is true, and it still develops a problem every fourth database, then something is wrong with your SQLite shared library. It shouldn't be keeping state like that.

If I were in your shoes, I would stop using attach and just open multiple connections in Python, then move the data in batches of about 1000 records per commit. It would be slower than your technique, because all the data moves in and out of Python objects, but I think it would also be more reliable.

Open the complete db once, then loop around opening a second db, copying, and closing the second db. For the copying, I would use OFFSET and LIMIT on the SELECT statements to fetch each batch of 1000 records, commit, then repeat. In fact, I would also count the completedb records and the second db's records before copying, and count the completedb records again afterwards, to ensure that the expected amount was copied. Also, since you would be keeping track of the next OFFSET, write it to a text file right after each commit, so that you can interrupt and restart the process at any time and it will carry on where it left off. A sketch of this follows.
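A minimal sketch of that batched copy for the big table (tbl3 here); the checkpoint filename merge_offset.txt and the order by rowid (to keep the paging stable between queries) are assumptions of mine, not something the answer prescribes:

import os
import sqlite3

BATCH = 1000                # rows per commit
CKPT = "merge_offset.txt"   # hypothetical checkpoint file

def copy_table(dest, src_fn, tbl):
    """Copy all rows of tbl from the database file src_fn into dest."""
    src = sqlite3.connect(src_fn)
    dc, sc = dest.cursor(), src.cursor()

    # Count rows on both sides up front so the copy can be verified later.
    expected = sc.execute("select count(*) from " + tbl).fetchone()[0]
    before = dc.execute("select count(*) from " + tbl).fetchone()[0]

    # Resume from the checkpoint only if it belongs to this source file.
    start = 0
    if os.path.exists(CKPT):
        name, _, off = open(CKPT).read().partition("\n")
        if name == src_fn:
            start = int(off)
    offset = start

    while offset < expected:
        rows = sc.execute(
            "select * from " + tbl + " order by rowid limit ? offset ?",
            (BATCH, offset),
        ).fetchall()
        if not rows:
            break
        marks = ",".join("?" * len(rows[0]))
        dc.executemany(
            "insert into " + tbl + " values (" + marks + ")", rows
        )
        dest.commit()
        offset += len(rows)
        with open(CKPT, "w") as f:  # record progress only after the commit
            f.write(src_fn + "\n" + str(offset))

    # The destination must have grown by exactly the rows copied this run.
    after = dc.execute("select count(*) from " + tbl).fetchone()[0]
    assert after - before == offset - start
    src.close()
    if os.path.exists(CKPT):
        os.remove(CKPT)  # this source file is done; clear the checkpoint

dest = sqlite3.connect("complete.sqlite3")
for fn in filelist:  # filelist as in the question
    copy_table(dest, fn, "tbl3")
dest.close()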
