I'd like a single dict-like (key/value) database to be accessible from multiple Python scripts running at the same time. If script1.py updates d[2839], then script2.py should see the modified value when querying d[2839] a few seconds later.
I thought about using SQLite, but it seems that concurrent writes/reads from multiple processes are not SQLite's strength (say script1.py has just modified d[2839]; how would script2.py's SQLite connection know it has to reload this specific part of the database?).
I also thought about locking the file when I want to flush modifications (but that's rather tricky to do), using json.dump to serialize, then trying to detect modifications and using json.load to reload if there were any, etc. ... oh no, I'm reinventing the wheel, and reinventing a particularly inefficient key/value database at that!
redis looked like a solution, but it does not officially support Windows; the same applies to leveldb.
Multiple scripts might want to write at exactly the same time (even if this is a very rare event); is there a way to let the DB system handle this (via a locking parameter)? It seems that by default SQLite can't, because "SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time."
What would be a Pythonic solution for this?
Note: I'm on Windows, and the dict should have at most 1M items (keys and values both integers).
Most embedded datastores other than SQLite don't have optimizations for concurrent access. I was also curious about SQLite's concurrent performance, so I did a benchmark:
import time
import sqlite3
import os
import random
import sys
import multiprocessing


class Store():
    def __init__(self, filename='kv.db'):
        self.conn = sqlite3.connect(filename, timeout=60)
        # WAL mode lets readers proceed concurrently with a single writer
        self.conn.execute('pragma journal_mode=wal')
        self.conn.execute('create table if not exists "kv" (key integer primary key, value integer) without rowid')
        self.conn.commit()

    def get(self, key):
        item = self.conn.execute('select value from "kv" where key=?', (key,)).fetchone()
        if item:
            return item[0]

    def set(self, key, value):
        self.conn.execute('replace into "kv" (key, value) values (?,?)', (key, value))
        self.conn.commit()


def worker(n):
    # each worker writes n random keys, then reads them back in shuffled order
    d = [random.randint(0, 1 << 31) for _ in range(n)]
    s = Store()
    for i in d:
        s.set(i, i)
    random.shuffle(d)
    for i in d:
        s.get(i)


def test(c):
    n = 5000
    start = time.time()
    ps = []
    for _ in range(c):
        p = multiprocessing.Process(target=worker, args=(n,))
        p.start()
        ps.append(p)
    while any(p.is_alive() for p in ps):
        time.sleep(0.01)
    cost = time.time() - start
    print(f'{c:<10d}\t{cost:<7.2f}\t{n/cost:<20.2f}\t{n*c/cost:<14.2f}')


def main():
    print('concurrency\ttime(s)\tper process TPS(r/s)\ttotal TPS(r/s)')
    for c in range(1, 9):
        test(c)


if __name__ == '__main__':
    main()
Result on my 4-core macOS box, SSD volume:
concurrency time(s) per process TPS(r/s) total TPS(r/s)
1 0.65 7638.43 7638.43
2 1.30 3854.69 7709.38
3 1.83 2729.32 8187.97
4 2.43 2055.25 8221.01
5 3.07 1629.35 8146.74
6 3.87 1290.63 7743.78
7 4.80 1041.73 7292.13
8 5.37 931.27 7450.15
Result on an 8-core Windows Server 2012 cloud server, SSD volume:
concurrency time(s) per process TPS(r/s) total TPS(r/s)
1 4.12 1212.14 1212.14
2 7.87 634.93 1269.87
3 14.06 355.56 1066.69
4 15.84 315.59 1262.35
5 20.19 247.68 1238.41
6 24.52 203.96 1223.73
7 29.94 167.02 1169.12
8 34.98 142.92 1143.39
It turns out that overall throughput is fairly consistent regardless of concurrency, and that SQLite is slower on Windows than on macOS; I hope this is helpful.
As the SQLite write lock is database-wide, in order to get more TPS you could partition the data across multiple database files:
class MultiDBStore():
    def __init__(self, buckets=5):
        self.buckets = buckets
        self.conns = []
        for n in range(buckets):
            conn = sqlite3.connect(f'kv_{n}.db', timeout=60)
            conn.execute('pragma journal_mode=wal')
            conn.execute('create table if not exists "kv" (key integer primary key, value integer) without rowid')
            conn.commit()
            self.conns.append(conn)

    def _get_conn(self, key):
        # route each key to a fixed bucket so a given key always lives in the same file
        assert isinstance(key, int)
        return self.conns[key % self.buckets]

    def get(self, key):
        item = self._get_conn(key).execute('select value from "kv" where key=?', (key,)).fetchone()
        if item:
            return item[0]

    def set(self, key, value):
        conn = self._get_conn(key)
        conn.execute('replace into "kv" (key, value) values (?,?)', (key, value))
        conn.commit()
Result on my Mac with 20 partitions:
concurrency time(s) per process TPS(r/s) total TPS(r/s)
1 2.07 4837.17 4837.17
2 2.51 3980.58 7961.17
3 3.28 3047.68 9143.03
4 4.02 2486.76 9947.04
5 4.44 2249.94 11249.71
6 4.76 2101.26 12607.58
7 5.25 1903.69 13325.82
8 5.71 1752.46 14019.70
Total TPS is higher than with a single database file.
Before there was redis, there was Memcached (which works on Windows). Here is a tutorial: https://realpython.com/blog/python/python-memcache-efficient-caching/
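For a rough idea of the client side, here is a minimal sketch using the pymemcache package (my choice of client, not necessarily the one in the tutorial), assuming a memcached server is already running on localhost:11211:

from pymemcache.client.base import Client

# assumes a memcached server is listening on localhost:11211
client = Client(('localhost', 11211))

# memcached keys must be strings/bytes, so convert the integer key
client.set(str(2839), str(12345))

value = client.get(str(2839))   # -> b'12345', or None if the key is missing
print(int(value))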
I'd consider two options, both embedded databases:
SQLite
As answered here and here, it should be fine.
BerkeleyDB
link
Berkeley DB (BDB) is a software library intended to provide a high-performance embedded database for key/value data.
It has been designed exactly for your purpose:
BDB can support thousands of simultaneous threads of control or concurrent processes manipulating databases as large as 256 terabytes, on a wide variety of operating systems including most Unix-like and Windows systems, and real-time operating systems.
It is robust and has been around for years if not decades.
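As a minimal sketch, assuming the bsddb3 bindings are installed, the dict-style hashopen interface looks like this (keys and values are bytes; genuinely concurrent multi-process access would additionally need a DB environment opened with locking enabled):

import bsddb3

# open (or create) a hash-based Berkeley DB file; behaves like a dict of bytes
db = bsddb3.hashopen('kv.bdb', 'c')

db[str(2839).encode()] = str(12345).encode()
db.sync()                                   # flush changes to disk

print(int(db[str(2839).encode()]))          # -> 12345
db.close()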
Bringing up redis/memcached/any other full-fledged socket-based server that requires sysops involvement is, IMO, overhead for the task of exchanging data between two scripts located on the same box.
You can use a Python dictionary for this purpose.
Create a generic class or script named G that initializes a dictionary. G runs script1.py and script2.py and passes the dictionary to both script files; in Python a dictionary is passed by reference by default. This way a single dictionary is used to store the data, both scripts can modify its values, and the changes are visible in both scripts. I assume script1.py and script2.py are class based. This doesn't guarantee persistence of the data; for persistence, you can store the data in a database every x intervals.
Example
script1.py
class SCRIPT1:
    def __init__(self, dictionary):
        self.dictionary = dictionary
        self.dictionary.update({"a": "a"})
        print("SCRIPT1 : ", self.dictionary)

    def update(self):
        self.dictionary.update({"c": "c"})
script2.py
class SCRIPT2:
    def __init__(self, dictionary):
        self.dictionary = dictionary
        self.dictionary.update({"b": "b"})
        print("SCRIPT 2 : ", self.dictionary)
main_script.py
import script1
import script2
x = {}
obj1 = script1.SCRIPT1(x) # output: SCRIPT1 : {'a': 'a'}
obj2 = script2.SCRIPT2(x) # output: SCRIPT 2 : {'a': 'a', 'b': 'b'}
obj1.update()
print("SCRIPT 1 dict: ", obj1.dictionary) # output: SCRIPT 1 dict: {'c': 'c', 'a': 'a', 'b': 'b'}
print("SCRIPT 2 dict: ", obj2.dictionary) # output: SCRIPT 2 dict: {'c': 'c', 'a': 'a', 'b': 'b'}
Also create an empty __init__.py file in the directory where you will run the scripts.
Another option is:
Redis
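For reference, a minimal sketch with the redis-py client, assuming a Redis server is reachable on localhost:6379:

import redis

# assumes a Redis server is running on localhost:6379
r = redis.Redis(host='localhost', port=6379, db=0)

r.set(2839, 12345)         # stored as bytes on the server
value = r.get(2839)        # -> b'12345', or None if the key is missing
print(int(value))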
You could use a document-based database manager. Maybe it is too heavyweight for your system, but concurrent access is typically one of the reasons DB management systems and the APIs to connect to them exist.
I have used MongoDB with Python and it works fine. The Python API documentation is quite good, and each document (element of the database) is a dictionary that can be loaded into Python as such.
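A minimal sketch with PyMongo, assuming a MongoDB server is running locally (the database/collection names are made up for the example; the integer key is used as the document _id):

from pymongo import MongoClient

# assumes a MongoDB server on localhost:27017
client = MongoClient('localhost', 27017)
collection = client['kv_db']['kv']   # hypothetical database/collection names

# upsert: insert the document if it is missing, replace it otherwise
collection.replace_one({'_id': 2839}, {'_id': 2839, 'value': 12345}, upsert=True)

doc = collection.find_one({'_id': 2839})   # -> {'_id': 2839, 'value': 12345}, or None
print(doc['value'])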
I would use a pub/sub websocket framework, like Autobahn/Python, with one script as a "server" that handles all the file communication, but depending on scale this could be overkill.
CodernityDB could be worth exploring, using the server version.
http://labs.codernity.com/codernitydb/
Server version:
http://labs.codernity.com/codernitydb/server.html
It sounds like what you really need is a database of some kind.
If redis won't work on Windows, then I would look at MongoDB.
https://docs.mongodb.com/manual/tutorial/install-mongodb-on-windows/
MongoDB works great with Python and can function similarly to redis. Here are the install docs for PyMongo:
http://api.mongodb.com/python/current/installation.html?_ga=2.78008212.1422709185.1517530606-587126476.1517530605
Also, many people have brought up SQLite. I think you were concerned that it only allows one writer at a time, but this is not really a problem for you to worry about. What it means is that, if there are two writers, the second will be blocked until the first is finished. This is probably fine for your situation.
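To illustrate that point, a minimal sketch: the timeout parameter of sqlite3.connect makes a second writer wait for the lock instead of failing immediately with "database is locked".

import sqlite3

# a generous timeout lets this writer wait while another process holds the write lock
conn = sqlite3.connect('kv.db', timeout=60)
conn.execute('create table if not exists kv (key integer primary key, value integer)')

try:
    conn.execute('replace into kv (key, value) values (?, ?)', (2839, 12345))
    conn.commit()
except sqlite3.OperationalError as exc:
    # only raised if the lock could not be obtained within the timeout
    print('write failed:', exc)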