Multiprocessing with JDBC connection and pooling

2020-08-04 02:39发布

问题:

I would like to create a parallel process which gets Data from a Database.

I am using a JDBC connector which works quite fine if i run my program not in parallel:

conn = jaydebeapi.connect("com.teradata.jdbc.TeraDriver", "jdbc:teradata://DBNAME"+str(i)+"/LOGMECH=LDAP",
                          ["LIB_NAME", "PWD"],
                          "/home/user/TeraJDBC/terajdbc4.jar:/home/user/TeraJDBC/tdgssconfig.jar", )
curs = conn.cursor()

However I want to fasten that process and so I am using:

from joblib import Parallel, delayed, parallel_backend
with parallel_backend('multiprocessing'):
    test = Parallel(n_jobs=2, verbose=50)(delayed(function_Name)(df, i) for i in range(0, df.shape[0]))

I get no error message, but it gets just to the point where it creates the first Connection and then gets stuck. Is it not possible to have several JDBC Connections?

Update

I found this website where someone wants to use pooling within python. The updated code for that is now:

c = jaydebeapi.connect("com.teradata.jdbc.TeraDriver", "jdbc:teradata://DBNAME"+str(i)+"/LOGMECH=LDAP",
                          ["LIB_NAME", "PWD"],
                          "/home/user/TeraJDBC/terajdbc4.jar:/home/user/TeraJDBC/tdgssconfig.jar", )
mypool = pool.QueuePool(c, max_overflow=10, pool_size=5)
# get a connection
conn = mypool.connect() ###here is runs into the error

# use it
curs = conn.cursor()
#curs = conn.cursor()

However now it runs into in error at conn = mypool.connect():

TypeError: 'Connection' object is not callable

Do I have to change somthing in the java files terajdbc4.jar or tdgssconfig.jar to configure pooling? Or did i miss a configuration?

Update2 - Nothing is impossible

So, i looked for a package where I can multiprocess small Database accesses and came up with the following code with the help of this page-->

import concurrent.futures
import JDBC_Connector
import pandas as pd
import time

def make_image_thumbnail(filename):
    print('start')
    conn, curs = JDBC_Connector.JDBC_Connector()
    curs.execute("select DISTINCT id, date "
                 "from Lib.Data "
                 "where id = 1 "                     
                 "and date = '2018-03-0"+str(filename)+"' "                 
                 "order by date;")

    df = pd.DataFrame(curs.fetchall())
    if df.shape[0] > 0:
        df.columns = [curs.description[i][0] for i in range(len(curs.description))]
    if df.shape[0]> 0:
        print('end')
        curs.close()
        return df.DATE[0]

    else:
        curs.close()
        print('end')
        return 0


start_time = time.time()
# your code

#test single thread
for item in list(range(1,10)):
    make_image_thumbnail(item)
elapsed_time = time.time() - start_time
print('not parallel')
print(elapsed_time)



start_time = time.time()
# Create a pool of processes. By default, one is created for each CPU in your machine.
with concurrent.futures.ProcessPoolExecutor(max_workers = 5) as executor:
    # Get a list of files to process
    image_files = list(range(1,10))

    # Process the list of files, but split the work across the process pool to use all CPUs!
    for image_file, thumbnail_file in zip(image_files, executor.map(make_image_thumbnail, image_files)):
        print("A thumbnail for {} was saved as {}".format(image_file, thumbnail_file))

elapsed_time = time.time() - start_time

Not Parallel Process = 145.4

Parallel Process = 45.3

Not quite 5 times faster, however that depends on the size of each query. However the strange thing is: When I execute the full script, so it executes first the iterative one and then the parallel one then the parallel one gets stuck and wont go any further. Why is that so?