How to create a Dask DataFrame from a list of URLs?

Posted 2020-07-18 04:28

Question:

I have a list of URLs that I'd like to read into a single Dask DataFrame at once, but it looks like read_csv can't expand an asterisk (glob pattern) over HTTP. Is there any way to achieve that?

Here is an example:

link = 'http://web.mta.info/developers/'

data = [
    'data/nyct/turnstile/turnstile_170128.txt',
    'data/nyct/turnstile/turnstile_170121.txt',
    'data/nyct/turnstile/turnstile_170114.txt',
    'data/nyct/turnstile/turnstile_170107.txt',
]

What I want is something like:

df = dd.read_csv('XXXX*X')

Answer 1:

Try using dask.delayed to turn each of your URLs into a lazy pandas DataFrame, then use dask.dataframe.from_delayed to combine those lazy values into a single Dask DataFrame:

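import pandas as pd
import dask
import dask.dataframe as dd

# Full URLs, built from the `link` and `data` variables in the question
urls = [link + path for path in data]

# One lazy pandas DataFrame per URL
dfs = [dask.delayed(pd.read_csv)(url) for url in urls]

# Combine the lazy pieces into one Dask DataFrame, one partition per URL
df = dd.from_delayed(dfs)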

This will read one of your links immediately in order to figure out the metadata (columns, dtypes). If you know the column names and dtypes ahead of time, you can avoid this by passing an empty sample DataFrame to dd.from_delayed(..., meta=sample_df).

See also: http://dask.pydata.org/en/latest/delayed-collections.html