I have more than 500 PyTables stores that contain about 300 MB of data each, and I would like to merge these files into one big store using pandas append, as in the code below.
from pandas import HDFStore

def merge_hdfs(file_list, merged_store):
    for file in file_list:
        # open each source store read-only and append its 'data' table to the merged store
        store = HDFStore(file, mode='r')
        merged_store.append('data', store.data)
        store.close()
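For context, the function is driven by a small script along these lines (the paths and glob pattern are placeholders, not my real layout):

from glob import glob
from pandas import HDFStore

file_list = sorted(glob('stores/*.h5'))          # placeholder pattern for the 500 input files
merged_store = HDFStore('merged.h5', mode='w')   # placeholder path for the combined store
merge_hdfs(file_list, merged_store)
merged_store.close()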
The append operation is very slow (it takes up to 10 minutes to append a single store to merged_store), and strangely the file size of merged_store seems to grow by 1 GB for each appended store.
I have indicated the total number of expected rows, which according to the documentation should improve performance. Having read Improve pandas (PyTables?) HDF5 table write performance, I was expecting long write times, but almost 10 minutes for every 300 MB seems far too slow, and I cannot understand why the file grows so much. Am I missing something?
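To be concrete, the expected row count is passed on the append call roughly like this (the total is only a rough placeholder, and the file names are made up; expectedrows is the keyword accepted by HDFStore.append):

from pandas import HDFStore

total_rows = 500 * 272734   # rough placeholder for the combined row count across all files

with HDFStore('merged.h5', mode='a') as merged_store, HDFStore('store_000.h5', mode='r') as store:
    # expectedrows hints the final table size so PyTables can choose a suitable chunkshape
    merged_store.append('data', store['data'], expectedrows=total_rows)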
For additional information, here is a description of the table in one of the 500 PyTables stores.
/data/table (Table(272734,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(6,), dflt=0.0, pos=1),
  "id": StringCol(itemsize=11, shape=(), dflt='', pos=2),
  "datetaken": Int64Col(shape=(), dflt=0, pos=3),
  "owner": StringCol(itemsize=15, shape=(), dflt='', pos=4),
  "machine_tags": StringCol(itemsize=100, shape=(), dflt='', pos=5),
  "title": StringCol(itemsize=200, shape=(), dflt='', pos=6),
  "country": StringCol(itemsize=3, shape=(), dflt='', pos=7),
  "place_id": StringCol(itemsize=18, shape=(), dflt='', pos=8),
  "url_s": StringCol(itemsize=80, shape=(), dflt='', pos=9),
  "url_o": StringCol(itemsize=80, shape=(), dflt='', pos=10),
  "ownername": StringCol(itemsize=50, shape=(), dflt='', pos=11),
  "tags": StringCol(itemsize=505, shape=(), dflt='', pos=12)}
  byteorder := 'little'
  chunkshape := (232,)