How to scale psycopg2 insert and select with a single process

Posted 2020-05-08 01:02

It takes an average of about 0.300095081329 seconds for my insert to go through and finish committing to Postgres.

Here is my table layout:

id_table
    latest_update_id (primary index)
    product_id       (index)
    publish_date

product_meta_table
    latest_update_id    (index)
    product_id          (index)
    meta_related_info1
    meta_related_info2
    ...etc

product_table
    latest_update_id    (index)
    product_id          (index)
    note_related_info1
    note_related_info2
    ....etc

Here are some of my inserts:

db_cursor.execute("INSERT INTO id_table (product_id, publish_date)  \
             VALUES (%s, %s) RETURNING latest_update_id",
    (my_dict["product_id"], my_dict["publish_date"])
)

 db_cursor.execute("INSERT INTO product_table ( \
                   latest_update_id, \
                   product_id, \
                   note_related_info1, \
                   note_related_info2, \
                   ...etc)  \
             VALUES (%s, %s, %s, %s) RETURNING *",
    (my_dict["latest_update_id"], 
     my_dict["product_id"],
     my_dict["note_related_info1"],
     my_dict["note_related_info2"])
)       

Using the insert time, my throughput is about 1/0.3 ≈ 3 qps.

I know I can scale this horizontally by adding more instances, but I want to try to see if I can hit at least 3000 qps.

I am thinking of using either async or threading, but I was not sure whether the GIL is going to interfere or not (see the sketch after the notes below).

Is there a general good practice and technique on how to scale insert statements using psycopg2?

Thanks

Note: I am using Python 2.7

Note: the Python process communicates with the SQL server over HTTPS

Note: the inserts to each table are staggered: table2 inserts after table1, and table3 inserts after table2. Technically table2 and table3 only have to wait for table1 to finish inserting, because they both need latest_update_id
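
Regarding the async/threading question: psycopg2 releases the GIL while it waits on the server, so plain threads can overlap the network round trips of an I/O-bound load like this one. Below is a minimal sketch of that idea under stated assumptions: the DSN string and the insert_one/worker/run helpers are hypothetical names, it shares one ThreadedConnectionPool across the threads, and the product_meta_table insert is omitted for brevity.

import threading
from psycopg2 import pool

DSN = "dbname=mydb user=myuser host=dbhost"   # hypothetical connection string

conn_pool = pool.ThreadedConnectionPool(4, 16, DSN)  # shared by all threads

def insert_one(conn, my_dict):
    # Same staggered inserts as above: the dependent tables only wait
    # for the latest_update_id returned by the id_table insert.
    cur = conn.cursor()
    cur.execute("INSERT INTO id_table (product_id, publish_date) "
                "VALUES (%(product_id)s, %(publish_date)s) "
                "RETURNING latest_update_id", my_dict)
    my_dict["latest_update_id"] = cur.fetchone()[0]
    cur.execute("INSERT INTO product_table (latest_update_id, product_id, "
                "note_related_info1, note_related_info2) "
                "VALUES (%(latest_update_id)s, %(product_id)s, "
                "%(note_related_info1)s, %(note_related_info2)s)", my_dict)
    conn.commit()

def worker(batch):
    # Each thread checks out one connection and reuses it for its batch.
    conn = conn_pool.getconn()
    try:
        for my_dict in batch:
            insert_one(conn, my_dict)
    finally:
        conn_pool.putconn(conn)

def run(records, n_threads=8):
    # Split the records across threads.
    batches = [records[i::n_threads] for i in range(n_threads)]
    threads = [threading.Thread(target=worker, args=(b,)) for b in batches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Even so, at roughly 0.3 s per record, 8 threads only get to about 25 qps, so batching many rows per round trip (as the answers below suggest) matters at least as much as concurrency.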

2 Answers
我欲成王,谁敢阻挡 · answered 2020-05-08 01:28

Do a single insert query instead of 3. Notice the triple quotes and dictionary parameter passing:

insert_query = """
    with i as (
        insert into id_table (product_id, publish_date) 
        values (%(product_id)s, %(publish_date)s)
        returning latest_update_id
    )
    insert into product_table (
        latest_update_id,
        product_id,
        note_related_info1,
        note_related_info2
    ) values (
        (select latest_update_id from i),
        %(product_id)s, %(note_related_info1)s, %(note_related_info2)s
    )
    returning *
"""

db_cursor.execute(insert_query, my_dict)
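
If product_meta_table also needs the new id (as the schema in the question suggests), the same pattern extends with a second data-modifying CTE. A sketch along those lines, treating the meta_related_info columns from the question as placeholders:

insert_query = """
    with i as (
        insert into id_table (product_id, publish_date)
        values (%(product_id)s, %(publish_date)s)
        returning latest_update_id
    ), m as (
        insert into product_meta_table (latest_update_id, product_id,
                                        meta_related_info1, meta_related_info2)
        select latest_update_id, %(product_id)s,
               %(meta_related_info1)s, %(meta_related_info2)s
        from i
    )
    insert into product_table (
        latest_update_id,
        product_id,
        note_related_info1,
        note_related_info2
    )
    select latest_update_id, %(product_id)s,
           %(note_related_info1)s, %(note_related_info2)s
    from i
    returning *
"""

db_cursor.execute(insert_query, my_dict)
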
狗以群分 · answered 2020-05-08 01:38

Follow-up on my network comment.

Say you have a 100 ms round trip (like the time for SELECT 1).
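
For reference, that round trip can be measured from the client with something like:

import time

t0 = time.time()
db_cursor.execute("SELECT 1")
db_cursor.fetchone()
round_trip_ms = (time.time() - t0) * 1000.0  # latency of a trivial query, mostly network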

If you want to chain queries, then you will have no choice but to do INSERT ... with tons of values to amortize the round-trip time.

This is cumbersome, as you will then have to sort through the returned ids to insert the dependent rows. Also, if your bandwidth is low, you will saturate it, and it won't be that fast anyway.
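
A sketch of that multi-values approach with psycopg2.extras.execute_values; the fetch=True flag (psycopg2 2.8+) collects the RETURNING rows so the ids can be sorted back onto the dependent rows. Here my_dicts stands for a hypothetical batch of records shaped like my_dict in the question, and product_id is assumed unique within the batch:

from psycopg2.extras import execute_values

# One round trip inserts the whole batch into id_table and returns the new ids.
id_rows = execute_values(
    db_cursor,
    "INSERT INTO id_table (product_id, publish_date) VALUES %s "
    "RETURNING latest_update_id, product_id",
    [(d["product_id"], d["publish_date"]) for d in my_dicts],
    fetch=True)

# Sort the returned ids back onto the batch (product_id assumed unique here).
id_by_product = dict((pid, uid) for uid, pid in id_rows)

# Second round trip for the dependent rows.
execute_values(
    db_cursor,
    "INSERT INTO product_table (latest_update_id, product_id, "
    "note_related_info1, note_related_info2) VALUES %s",
    [(id_by_product[d["product_id"]], d["product_id"],
      d["note_related_info1"], d["note_related_info2"]) for d in my_dicts])

db_cursor.connection.commit()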

If your bandwidth is high enough but your ping is slow, you may be tempted to multithread... but this creates another problem...

Instead of having, say, 1-2 server processes churning through queries very fast, you'll have 50 processes sitting there doing nothing except wasting valuable server RAM while they wait for the queries to come over the slow network.

Also, concurrency and lock issues may arise. You won't do just INSERTs... You're going to do some SELECT FOR UPDATE which grabs a lock...

...and then other processes pile up to acquire that lock while your next query crawls over the network...

This feels like using MyISAM in a concurrent write-intensive scenario. Locks should be held for the shortest time possible... fast pings help, and putting the whole chain of queries, from lock acquisition to lock release, inside a stored proc is even better, so the lock is held for only a very short time.
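
For example, a plpgsql function could wrap the question's chained inserts so each record costs a single round trip and no lock has to be held across a network wait. A sketch, with the function name and column types (bigint, date, text) guessed rather than taken from the question:

create_fn = """
create or replace function insert_product(
    _product_id         bigint,
    _publish_date       date,
    _note_related_info1 text,
    _note_related_info2 text
) returns bigint language plpgsql as $$
declare
    _latest_update_id bigint;
begin
    insert into id_table (product_id, publish_date)
    values (_product_id, _publish_date)
    returning latest_update_id into _latest_update_id;

    insert into product_table (latest_update_id, product_id,
                               note_related_info1, note_related_info2)
    values (_latest_update_id, _product_id,
            _note_related_info1, _note_related_info2);

    return _latest_update_id;
end
$$;
"""
db_cursor.execute(create_fn)   # one-time setup

# Per record: one round trip runs the whole chain server-side, so no lock
# has to be held across a network wait (it still lasts until commit).
db_cursor.execute(
    "select insert_product(%(product_id)s, %(publish_date)s, "
    "%(note_related_info1)s, %(note_related_info2)s)",
    my_dict)
latest_update_id = db_cursor.fetchone()[0]
db_cursor.connection.commit()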

So, consider executing your Python script on the DB server, or on a server on the same LAN.
