Python pandas persistent cache

Published 2019-07-05 03:04

Question:

Is there an implementation for Python pandas that caches the data on disk, so I can avoid reproducing it every time?

In particular, is there a caching method for get_yahoo_data for financial data?

A big plus would be:

  • very few lines of code to write
  • the possibility to integrate the persisted series when new data is downloaded for the same source

Answer 1:

There are many ways to achieve this, but probably the easiest is to use the built-in methods for writing and reading Python pickles. You can use pandas.DataFrame.to_pickle to store the DataFrame on disk and pandas.read_pickle to read the stored DataFrame back from disk.

An example for a pandas.DataFrame:

import pandas

# Store your DataFrame
df.to_pickle('cached_dataframe.pkl') # will be stored in the current directory

# Read your DataFrame
df = pandas.read_pickle('cached_dataframe.pkl') # read from the current directory

The same methods also work for pandas.Series:

# Store your Series
series.to_pickle('cached_series.pkl') # will be stored in current directory

# Read your Series
series = pandas.read_pickle('cached_series.pkl') # read from current directory
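
If you also want the "very few lines of code" caching behaviour from the question, a pickle file can back a small wrapper around your download function. The sketch below is only an illustration: download_prices is a hypothetical stand-in for whatever fetch call you actually use (for example a Yahoo download helper).

import os
import pandas

def cached_download(cache_path, download_prices):
    # download_prices is a hypothetical zero-argument callable returning a pandas.DataFrame
    if os.path.exists(cache_path):
        # Cache hit: read the previously pickled DataFrame from disk
        return pandas.read_pickle(cache_path)
    # Cache miss: fetch the data and persist it for the next run
    df = download_prices()
    df.to_pickle(cache_path)
    return df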


Answer 2:

Depending on your requirements, there are a dozen methods to do that, back and forth: CSV, Excel, JSON, Python pickle format, HDF5, and even SQL with a database, etc.

In terms of lines of code, writing and reading many of these formats is just one line in each direction (see the sketch after the list below). Python and pandas already keep the code about as clean as it can get, so you can worry less about that.

There is no single solution that fits all requirements; it is really case by case:

  • for human readability of saved data: CSV, Excel
  • for binary Python object serialization: pickle
  • for data-interchange: JSON
  • for long-term storage and incremental updates: SQL
  • etc.
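
For illustration, here is a rough sketch of what those one-liners look like for a few of the formats above; the file names are placeholders, and the HDF5 example assumes the optional tables package is installed.

import pandas as pd

df = pd.DataFrame({'price': [1.0, 2.0]},
                  index=pd.date_range('2019-07-01', periods=2))

# CSV: human readable
df.to_csv('cached.csv')
df_csv = pd.read_csv('cached.csv', index_col=0, parse_dates=True)

# JSON: data interchange
df.to_json('cached.json')
df_json = pd.read_json('cached.json')

# Pickle: binary Python object serialization
df.to_pickle('cached.pkl')
df_pkl = pd.read_pickle('cached.pkl')

# HDF5: requires the `tables` package
df.to_hdf('cached.h5', key='prices')
df_hdf = pd.read_hdf('cached.h5', key='prices')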

And if you want to update the stock prices daily and keep them for later use, I prefer pandas with SQL queries; of course this adds a few lines of code to set up the database connection:

import pandas as pd
from sqlalchemy import create_engine

new_data = getting_daily_price()  # getting_daily_price() stands for your own download function
# You can also use other database backends supported by SQLAlchemy instead of SQLite
engine = create_engine('sqlite:///:memory:')
with engine.connect() as conn:
    new_data.to_sql('table_name', conn)         # write the DataFrame to the table
    df = pd.read_sql_table('table_name', conn)  # read the table back into a DataFrame
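
To address the question's second wish, incrementally integrating newly downloaded data, a hedged sketch along these lines could work: append new rows to a file-backed SQLite database and drop duplicated dates when reading back. The table name, file name, and the assumption that new_data has an unnamed DatetimeIndex (stored in a column called 'index') are illustrative choices, not a fixed recipe.

import pandas as pd
from sqlalchemy import create_engine

# A file-backed database so the cache survives between runs
engine = create_engine('sqlite:///prices.db')

def append_prices(new_data, table='prices'):
    # Append the newly downloaded rows; the table is created on the first call
    new_data.to_sql(table, engine, if_exists='append')

def load_prices(table='prices'):
    # Read everything back; the unnamed DatetimeIndex is stored in a column named 'index'
    df = pd.read_sql_table(table, engine, index_col='index', parse_dates=['index'])
    # Keep the most recent row for any date that was downloaded twice
    return df[~df.index.duplicated(keep='last')].sort_index()

On each run you would call append_prices(getting_daily_price()) and then load_prices() to get the merged history.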