Short question
When Pandas works on an HDFStore (e.g. .mean() or .apply()), does it load the full data in memory as a DataFrame, or does it process it record by record as a Series?
Long description
I have to work with large data files, and I can specify the format in which they are produced.
I intend to use Pandas to process the data, and I would like to set up the best format so that it maximizes performance.
I have seen that pandas.read_table() has come a long way, but it still takes at least as much memory as the original file (in fact, at least twice as much) to build the DataFrame. This may work for files up to 1 GB, but beyond that? It may be hard, especially on shared online machines.
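For concreteness, this is roughly what I do today (a minimal sketch; the file name 'big_data.txt' and the column 'value' are placeholders):

```python
import pandas as pd

# Reading this way materializes the full DataFrame in RAM,
# which in my tests needs at least twice the file's size in memory.
df = pd.read_table('big_data.txt')
print(df['value'].mean())
```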
However, I have seen that Pandas now seems to support HDF tables via PyTables.
My question is: how does Pandas manage memory when we run an operation over a whole HDF table, for example a .mean() or .apply()? Does it first load the entire table into a DataFrame, or does it apply the function to the data streamed from the HDF file, without holding it all in memory?
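To make the question concrete, here are the two behaviors I can imagine (a sketch, assuming a file 'data.h5' containing a table 'mytable' with a numeric column 'value'; all of these names are placeholders):

```python
import pandas as pd

store = pd.HDFStore('data.h5')

# Behavior A: the whole table is pulled into memory as a DataFrame first,
# then the reduction runs on it.
mean_a = store['mytable']['value'].mean()

# Behavior B: the table is streamed in chunks, so only one chunk is in
# memory at a time (requires the data to be stored in table format).
total, count = 0.0, 0
for chunk in store.select('mytable', chunksize=50000):
    total += chunk['value'].sum()
    count += len(chunk)
mean_b = total / count

store.close()
```

I would like to know which of these Pandas does by default, or whether I have to write the chunked loop myself as in behavior B.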
Side-question: is the HDF5 format compact in terms of disk usage? I mean, is it verbose like XML, or more like JSON? (I know there are indexes and such, but here I am interested in the bare description of the data.)
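If there is no general answer, I suppose I could measure it myself along these lines (a rough sketch with made-up data; the file names and the key are placeholders):

```python
import os
import numpy as np
import pandas as pd

# Write the same frame as CSV and as an HDF5 table,
# then compare the on-disk footprint of the two formats.
df = pd.DataFrame({'value': np.random.randn(1000000)})
df.to_csv('sample.csv')
df.to_hdf('sample.h5', 'mytable', format='table')
print(os.path.getsize('sample.csv'), os.path.getsize('sample.h5'))
```

But I would still be interested in how the raw data is actually laid out in the file, beyond what such a size comparison shows.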