How to know which process is responsible for a “Op

2019-02-19 23:24发布

问题:

I sometimes randomly encounter:

OperationalError: database is locked

in a process that updates a SQLite database, but I find it difficult to reproduce the error:

  • no other process is inserting / deleting rows at the same time
  • just one process might do some read-only queries (SELECT, etc.) here and there, but no committing

I've already read OperationalError: database is locked

Question: Is there a way, when this error happens, to log which other process ID is responsible for the lock?

More generally, how to debug a OperationalError: database is locked?

回答1:

Is there a way, when this error happens, to log which other process ID is responsible for the lock?

No, that information is not recorded when the exception occurs. The OperationalError: database is locked exception is usually raised after a timeout (5 minutes is the default) when trying to obtain a mutex and and a file lock in the SQLite internals, at which point SQLite returns SQLITE_BUSY, but SQLITE_BUSY can also be reported at other points. SQLite error codes do not carry any further context such as the PID of another process that held a lock, and it is conceivable that the lock was passed between two other processes before the current process gave up trying to get hold of it!

At best you can enumerate what processes are currently accessing the file by using lsof <filename of database>, but that won't get you any closer to figuring out which one of these is actually taking too long to commit.

I would, instead, instrument your code with explicit transactions and detailed logging as to when you start and commit transactions. Then, when you do run into OperationalError exceptions, you can check the logs for what was happening where within that time window.

A Python context manager that could be used for this is:

import logging
import sys
import time
import threading
from contextlib import contextmanager
from uuid import uuid4

logger = logging.getLogger(__name__)


@contextmanager
def logged_transaction(con, stack_info=False, level=logging.DEBUG):
    """Manage a transaction and log start and end times.

    Logged messages include a UUID transaction ID for ease of analysis.

    If trace is set to True, also log all statements executed.
    If stack_info is set to True, a stack trace is included to record
    where the transaction was started (the last two lines will point to this
    context manager).

    """
    transaction_id = uuid4()
    thread_id = threading.get_ident()

    def _trace_callback(statement):
        logger.log(level, '(txid %s) executing %s', transaction_id, statement)
    if trace:
        con.set_trace_callback(_trace_callback)

    logger.log(level, '(txid %s) starting transaction', transaction_id, stack_info=stack_info)

    start = time.time()
    try:
        with con:
            yield con
    finally:
        # record exception information, if an exception is active
        exc_info = sys.exc_info()
        if exc_info[0] is None:
            exc_info = None
        if trace:
            con.set_trace_callback(None)
        logger.log(level, '(txid %s) transaction closed after %.6f seconds', transaction_id, time.time() - start, exc_info=exc_info)

The above will create start and end entries, include exception information if there is any, optionally trace all statements that are being executed on the connection and can include a stack trace that will tell you the location where the context manager was used. Do make sure to include the date and time in when formatting log messages so you can track when transactions started.

I'd use it around any code using the connection, so you can time selects as well:

with logged_transaction(connection):
    cursor = connection.cursor()
    # ...

It may be that just using this context manager makes your issue disappear, at which point you'd have to analyse why the code without this context manager leaves open a transaction without committing.

You may also want to use a lower timeout value in sqlite3.connect() calls to speed up the process; you may not have to wait a full 5 minutes to detect the situation.

Note on threading: when enabling tracing it is assumed that you use separate connections for separate threads. If this is not the case, then you'll need to permanently register a trace callback that then sorts out what transaction id to use for the current thread.



回答2:

Solution: Always close the cursor for (even read-only) queries!

First, here is a way to reproduce the problem:

  1. First run this code, once:

    import sqlite3
    conn = sqlite3.connect('anothertest.db')
    conn.execute("CREATE TABLE IF NOT EXISTS mytable (id int, description text)")
    for i in range(100):
        conn.execute("INSERT INTO mytable VALUES(%i, 'hello')" % i)
    conn.commit()
    

    to initialize the test.

  2. Then begin a read-only query:

    import sqlite3, time
    conn = sqlite3.connect('anothertest.db')
    c = conn.cursor()
    c.execute('SELECT * FROM mytable')
    item = c.fetchone()
    print(item)
    print('Sleeping 60 seconds but the cursor is not closed...')
    time.sleep(60)
    

    and keep this script running while executing the next step:

  3. Then try to delete some content and commit:

    import sqlite3
    conn = sqlite3.connect('anothertest.db')
    conn.execute("DELETE FROM mytable WHERE id > 90")
    conn.commit()
    

    It will trigger this error indeed:

    sqlite3.OperationalError: database is locked

Why? Because it's not possible to delete data that is currently accessed by a read-query: if the cursor it's still open, it means the data could still be fetched with fetchone or fetchall.

Here is how to solve the error: in step #2, just add:

item = c.fetchone()
print(item)
c.close()
time.sleep(60)

Then while this is still running, start script #3, you will see there is no more error.