How to scrape .csv files from a url, when they are

Posted 2019-08-13 22:26

I am trying to scrape some .csv files from a website. I currently have a list of links:

master_links = [
    'http://mis.nyiso.com/public/csv/damlbmp/20161201damlbmp_zone_csv.zip', 
    'http://mis.nyiso.com/public/csv/damlbmp/20160301damlbmp_zone_csv.zip', 
    'http://mis.nyiso.com/public/csv/damlbmp/20160201damlbmp_zone_csv.zip']

When I try to use:

pd.read_csv(master_links[0])

it returns an error because each .zip file contains multiple .csv files. I understand why this isn't working, but I haven't figured out how to unzip these files and pass the .csv files to pd.read_csv without saving everything to my computer.

Is this possible?

1 answer
趁早两清
#2 · 2019-08-13 23:03

You can do that with a small wrapper around pandas.read_csv() that downloads the archive into memory and reads each member from there:

Code:

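def fetch_multi_csv_zip_from_url(url, filenames=(), *args, **kwargs):
    """Return {member name: DataFrame} for the CSV files in a remote zip.

    Extra positional and keyword arguments are passed to pd.read_csv().
    """
    # Members are decompressed by zipfile, so read_csv must not try to.
    assert kwargs.get('compression') is None

    # Download the whole archive into memory; nothing is written to disk.
    req = urlopen(url)
    zip_file = zipfile.ZipFile(BytesIO(req.read()))

    if filenames:
        # Check that every requested member exists in the archive.
        names = zip_file.namelist()
        for filename in filenames:
            if filename not in names:
                raise ValueError(
                    'filename {} not in {}'.format(filename, names))
    else:
        # No names given: read every member of the archive.
        filenames = zip_file.namelist()

    return {name: pd.read_csv(zip_file.open(name), *args, **kwargs)
            for name in filenames}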

Relevant docs: zipfile.ZipFile, io.BytesIO, urllib.request.urlopen
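To see the in-memory step in isolation, here is a minimal sketch that uses only the standard library. It builds a tiny archive locally instead of downloading one, so the member names (a.csv, b.csv) and the helper names make_sample_zip and list_csv_rows are made up for illustration and are not part of the answer above.

```python
import csv
import io
import zipfile


def make_sample_zip():
    """Build a small zip archive in memory (stand-in for downloaded bytes)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("a.csv", "x,y\n1,2\n")
        zf.writestr("b.csv", "x,y\n3,4\n5,6\n")
    return buf.getvalue()


def list_csv_rows(zip_bytes):
    """Return {member name: list of rows} without writing anything to disk."""
    rows = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            with zf.open(name) as member:
                # zf.open() yields bytes; wrap it so csv.reader sees text.
                text = io.TextIOWrapper(member, encoding="utf-8", newline="")
                rows[name] = list(csv.reader(text))
    return rows


contents = list_csv_rows(make_sample_zip())
print(sorted(contents))
print(contents["b.csv"])
```

The same pattern applies to bytes returned by urlopen(url).read(): wrapping them in BytesIO gives ZipFile the seekable file-like object it needs.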

Test Code:

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen
from io import BytesIO
import zipfile
import pandas as pd

master_links = [
    'http://mis.nyiso.com/public/csv/damlbmp/20161201damlbmp_zone_csv.zip',
    'http://mis.nyiso.com/public/csv/damlbmp/20160301damlbmp_zone_csv.zip',
    'http://mis.nyiso.com/public/csv/damlbmp/20160201damlbmp_zone_csv.zip']

dfs = fetch_multi_csv_zip_from_url(master_links[0])
print(dfs['20161201damlbmp_zone.csv'].head())

Results:

         Time Stamp    Name   PTID  LBMP ($/MWHr)  \
0  12/01/2016 00:00  CAPITL  61757          21.94   
1  12/01/2016 00:00  CENTRL  61754          16.85   
2  12/01/2016 00:00  DUNWOD  61760          20.85   
3  12/01/2016 00:00  GENESE  61753          16.16   
4  12/01/2016 00:00     H Q  61844          15.73   

   Marginal Cost Losses ($/MWHr)  Marginal Cost Congestion ($/MWHr)  
0                           1.21                              -4.45  
1                           0.11                              -0.45  
2                           1.58                              -2.99  
3                          -0.49                              -0.36  
4                          -0.55                               0.00  
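Since the question starts from a list of links, one possible extension is to loop over all of them and stack the results into a single DataFrame. The sketch below is an assumption layered on the answer, not part of it: the names read_zip_bytes and combine_links, the fetch parameter, and the source_file column are hypothetical, and it presumes every CSV shares the same columns.

```python
from io import BytesIO
from urllib.request import urlopen
import zipfile

import pandas as pd


def read_zip_bytes(data, **kwargs):
    """Parse every member of an in-memory zip archive with pd.read_csv."""
    with zipfile.ZipFile(BytesIO(data)) as zf:
        return {name: pd.read_csv(zf.open(name), **kwargs)
                for name in zf.namelist()}


def combine_links(links, fetch=None, **kwargs):
    """Fetch each zip in `links` and stack all contained CSVs row-wise."""
    if fetch is None:
        # Default: plain HTTP download; pass a custom `fetch` to stub it out.
        def fetch(url):
            with urlopen(url) as resp:
                return resp.read()

    frames = []
    for url in links:
        for name, df in read_zip_bytes(fetch(url), **kwargs).items():
            # Record which archive member each row came from.
            frames.append(df.assign(source_file=name))
    return pd.concat(frames, ignore_index=True)
```

With the list from the question, combine_links(master_links) would download the three archives in turn. Making fetch injectable keeps the parsing logic testable without network access.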