loading huge XLS data into Oracle using python

2019-08-13 08:18发布

问题:

I have a 3+ million record XLS file which i need to dump in Oracle 12C DB (direct dump) using a python 2.7.

I am using Cx_Oracle python package to establish connectivity to Oracle , but reading and dumping the XLS (using openpyxl pckg) is extremely slow and performance degrades for thousands/million records.

From a scripting stand point used two ways-

  1. I've tried bulk load , by reading all the values in array and then dumping it using cursor prepare (with bind variables) and cursor fetchmany.This doesn't work well with huge data.

  2. Iterative loading of the data as it is being fetched.Even this way has performance issues.

What options and techniques/packages can i deploy as a best practise to load this volume of data from XLS to Oracle DB ?Is it advisable to load this volume of data via scripting or should i necessarily use an ETL tool ? As of now i only have option via python scripting so please do answer the former

回答1:

If is possible to export your excel fila as a CSV, then all you need is to use sqlldr to load the file in db



回答2:

Excel also comes with ODBC support so you could pump straight from Excel to Oracle assuming you have the drivers. That said, anything that involves transforming a large amount of data in memory (from whatever Excel is using internally) and then passing it to the DB is likely to be less performant than a specialised bulk operation which can be optimised to use less memory. Going through Python just adds another layer to the task (Excel to Python to Oracle), though it might be possible to set this up to use streams.



回答3:

Basically for high volume data any language going to be stressed on I/O, except C. Best way is to use native tools/utilities provided by the DB vendor. For oracle the correct fit is SQL Loader.

Refer this link for quick tutorial http://www.thegeekstuff.com/2012/06/oracle-sqlldr/

Here you go… Sample code that runs SQL Loader and gets you back with return code, output & error

sql_ld_command = ['sqlldr ',  'uid/passwd', 'CONTROL=', 
'your_ctrl_file_path', 'DATA=', 'your_data_file_path']   

sql_ldr_proc  = subprocess.Popen(sql_ld_command, stdin=subprocess.PIPE,stdout=subprocess.PIPE, stderr=subprocess.PIPE)   

out, err  = sql_ldr_proc.communicate()  
retn_code = sql_ldr_proc.wait()