Automatically retrieving large files via public HTTP into Google Cloud Storage

Published 2019-06-26 05:17

Question:

For weather-processing purposes, I am looking to automatically retrieve daily weather forecast data into Google Cloud Storage.

The files are available at a public HTTP URL (http://dcpc-nwp.meteo.fr/openwis-user-portal/srv/en/main.home), but they are very large (between 30 and 300 megabytes). The file size is the main issue.

After looking at previous Stack Overflow topics, I have tried two methods, without success:

1/ First attempt via urlfetch in Google App Engine

    from google.appengine.api import urlfetch

    url = "http://dcpc-nwp.meteo.fr/servic..."
    result = urlfetch.fetch(url)

    [...] # Code to save in a Google Cloud Storage bucket

But I get the following error message on the urlfetch line:

DeadlineExceededError: Deadline exceeded while waiting for HTTP response from URL

2/ Second attempt via the Cloud Storage Transfer Service

According to the documentation, it is possible to retrieve HTTP data directly into Cloud Storage via the Cloud Storage Transfer Service: https://cloud.google.com/storage/transfer/reference/rest/v1/TransferSpec#httpdata

But it requires the size and MD5 of each file before the download. This option cannot work in my case because the website does not provide that information.

3/ Any ideas?

Do you see any solution for automatically retrieving these large files over HTTP into my Cloud Storage bucket?

Answer 1:

3/ Workaround with a Compute Engine instance

Since it was not possible to retrieve large files from an external HTTP source with App Engine or directly with Cloud Storage, I used a workaround based on an always-running Compute Engine instance.

This instance regularly checks if new weather files are available, downloads them and uploads them to a Cloud Storage bucket.
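
For illustration, here is a minimal sketch of such a script, assuming the google-cloud-storage client library is installed and using a hypothetical file URL and bucket name (the real script also needs the logic that builds the list of forecast URLs to check):

import requests
from google.cloud import storage

def mirror_file(url, bucket_name):
    """Download one forecast file over HTTP and upload it to Cloud Storage, skipping files already mirrored."""
    filename = url.split('/')[-1]
    bucket = storage.Client().bucket(bucket_name)
    blob = bucket.blob(filename)

    if blob.exists():
        return  # already copied on a previous run

    # Stream the download so a 300 MB file is not held entirely in memory
    local_path = '/tmp/{}'.format(filename)
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(local_path, 'wb') as f:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)

    blob.upload_from_filename(local_path)

if __name__ == '__main__':
    # Hypothetical URL and bucket name, for illustration only
    mirror_file('http://example.com/arome/forecast_latest.grib2', 'my-weather-bucket')

A cron entry on the instance can then run the script at the desired frequency; alternatively, gsutil cp can replace the upload step.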

For scalability, maintenance, and cost reasons, I would have preferred to use only serverless services, but fortunately:

  • It works well on a fresh f1-micro Compute Engine instance (no extra packages required, and only $4/month if running 24/7)
  • Network traffic from Compute Engine to Google Cloud Storage is free when the instance and the bucket are in the same region ($0/month)


Answer 2:

Currently, the MD5 and size are required for Google's Transfer Service; we understand that in cases like yours, this can be difficult to work with, but unfortunately we don't have a great solution today.

Unless you're able to get the size and MD5 by downloading the files yourself (temporarily), I think that's the best you can do.



Answer 3:

The MD5 and size of the file can be retrieved quickly and easily with the curl -I command (from the Content-Length and Content-MD5 response headers), as mentioned in this link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests.
The Storage Transfer Service can then be configured to use that information.
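
For example, a rough Python equivalent of that HEAD request, assuming the server exposes a Content-Length header (and, optionally, Content-MD5; many servers do not send the latter):

import requests

# Hypothetical URL, for illustration only
url = 'http://example.com/arome/forecast_latest.grib2'

# HEAD request (the Python equivalent of curl -I): fetch only the response headers
resp = requests.head(url, allow_redirects=True)
size = resp.headers.get('Content-Length')   # size in bytes, if the server provides it
md5 = resp.headers.get('Content-MD5')       # base64-encoded MD5, only if the server sends it
print(size, md5)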

Another option would be to use a serverless Cloud Function. In Python, it could look something like the code below.

import requests

def download_url_file(url):
    """Download the file at url into /tmp and return the local filename (or None on failure)."""
    output_filename = url.split('/')[-1]
    output_filepath = '/tmp/{}'.format(output_filename)
    try:
        print('[ INFO ] Downloading {}'.format(url))
        # Stream the response so a 30-300 MB file is not held entirely in memory
        req = requests.get(url, stream=True)
        if req.status_code == 200:
            # Download and save to /tmp
            with open(output_filepath, 'wb') as f:
                for chunk in req.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)
            print('[ INFO ] Successfully downloaded to output_filepath: {} & output_filename: {}'.format(output_filepath, output_filename))
            return output_filename
        else:
            print('[ ERROR ] Status Code: {}'.format(req.status_code))
    except Exception as e:
        print('[ ERROR ] {}'.format(e))
    return None
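
To actually land the file in a bucket, the Cloud Function would still need an upload step after the download; a minimal sketch, assuming the google-cloud-storage client library and a hypothetical bucket name:

from google.cloud import storage

def upload_to_bucket(output_filename, bucket_name='my-weather-bucket'):
    """Upload a file previously saved in /tmp by download_url_file() to a Cloud Storage bucket."""
    blob = storage.Client().bucket(bucket_name).blob(output_filename)
    blob.upload_from_filename('/tmp/{}'.format(output_filename))

Note that on Cloud Functions /tmp is an in-memory filesystem, so the 300 MB files also count against the function's memory allocation.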