For weather processing purposes, I would like to automatically retrieve daily weather forecast data into Google Cloud Storage.
The files are available at a public HTTP URL (http://dcpc-nwp.meteo.fr/openwis-user-portal/srv/en/main.home), but they are very large (between 30 and 300 megabytes). The file size is the main issue.
After looking at previous Stack Overflow topics, I have tried two unsuccessful methods:
1/ First attempt via urlfetch in Google App Engine
from google.appengine.api import urlfetch
url = "http://dcpc-nwp.meteo.fr/servic..."
result = urlfetch.fetch(url)
[...] # Code to save in a Google Cloud Storage bucket
But I get the following error message on the urlfetch line:
DeadlineExceededError: Deadline exceeded while waiting for HTTP response from URL
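For reference, urlfetch does accept a deadline argument, but it is capped at 60 seconds for normal App Engine requests, which is still too short for files of this size; a sketch of that variant:

from google.appengine.api import urlfetch

url = "http://dcpc-nwp.meteo.fr/servic..."
# Raise the fetch deadline from the 5-second default to the 60-second
# maximum; the largest files still exceed it
result = urlfetch.fetch(url, deadline=60)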
2/ Second attempt via the Storage Transfer Service
According to the documentation, it is possible to retrieve HTTP data into Cloud Storage directly via the Storage Transfer Service:
https://cloud.google.com/storage/transfer/reference/rest/v1/TransferSpec#httpdata
But it requires the size and MD5 of each file before the download. This option cannot work in my case because the website does not provide that information.
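For reference, the URL list that httpData points to is a tab-separated file in which every entry must carry exactly those two fields; a sketch of the expected shape (the URL, size and hash below are invented placeholders):

TsvHttpData-1.0
http://dcpc-nwp.meteo.fr/path/to/forecast-file.grib	314572800	OyD5e1lqEHbDVX/CL7zlvw==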
3/ Any ideas?
Do you see any solution for automatically retrieving large files over HTTP into my Cloud Storage bucket?
3/ Workaround with a Compute Engine instance
Since it was not possible to retrieve large files over external HTTP with App Engine or directly with Cloud Storage, I used a workaround with an always-running Compute Engine instance.
This instance regularly checks if new weather files are available, downloads them and uploads them to a Cloud Storage bucket.
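As an illustration, that loop can stay very small; a minimal sketch relying only on curl and the gsutil tool preinstalled on Compute Engine images (the URL, bucket name and schedule are placeholders):

import subprocess
import time

URL = 'http://dcpc-nwp.meteo.fr/...'    # placeholder: URL of one forecast file
BUCKET = 'gs://my-weather-bucket/'      # placeholder: destination bucket
LOCAL = '/tmp/forecast-file.grib'       # placeholder: local download path

while True:
    try:
        # Download the file, then copy it to Cloud Storage with gsutil,
        # which is preinstalled on Compute Engine images
        subprocess.run(['curl', '-fsSL', '-o', LOCAL, URL], check=True)
        subprocess.run(['gsutil', 'cp', LOCAL, BUCKET], check=True)
    except subprocess.CalledProcessError as e:
        print('[ ERROR ] {}'.format(e))
    time.sleep(3600)  # check again in one hour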
For scalability, maintenance and cost reasons, I would have preferred to use only serverless services, but fortunately:
- It works well on a fresh f1-micro Compute Engine instance (no extra package required and only $4/month if running 24/7)
- Network traffic from Compute Engine to Google Cloud Storage is free if the instance and the bucket are in the same region ($0/month)
Currently, the MD5 and size are required for Google's Transfer Service; we understand that in cases like yours, this can be difficult to work with, but unfortunately we don't have a great solution today.
Unless you're able to get the size and MD5 by downloading the files yourself (temporarily), I think that's the best you can do.
The MD5 and size of the file can be retrieved easily and quickly with the curl -I command, as mentioned in this link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests.
The Storage Transfer Service can then be configured to use that information.
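Equivalently in Python, a HEAD request exposes the same headers; a small sketch, with the caveat that not every server sends Content-MD5 (the Transfer Service expects the MD5 in base64):

import requests

url = 'http://dcpc-nwp.meteo.fr/...'  # placeholder: URL of one forecast file
resp = requests.head(url)
# Content-Length is the size in bytes; Content-MD5, when the server
# provides it, is already the base64-encoded hash
print(resp.headers.get('Content-Length'))
print(resp.headers.get('Content-MD5'))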
Another option would be to use a serverless Cloud Function. In Python, it could look something like the code below.
import requests

def download_url_file(url):
    """Download the file at url into /tmp and return its filename, or None on failure."""
    try:
        print('[ INFO ] Downloading {}'.format(url))
        # Stream the response so a 30-300 MB file is not held in memory all at once;
        # note that on Cloud Functions /tmp is an in-memory filesystem, so the
        # function still needs enough memory allocated to hold the file
        req = requests.get(url, stream=True)
        if req.status_code == 200:
            output_filename = url.split('/')[-1]
            output_filepath = '/tmp/{}'.format(output_filename)
            with open(output_filepath, 'wb') as f:
                for chunk in req.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)
            print('[ INFO ] Successfully downloaded to output_filepath: {} & output_filename: {}'.format(output_filepath, output_filename))
            return output_filename
        print('[ ERROR ] Status code: {}'.format(req.status_code))
    except Exception as e:
        print('[ ERROR ] {}'.format(e))
    return None  # reached on non-200 status or exception
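The function above only saves the file to /tmp; to land it in the bucket, a minimal follow-up sketch using the google-cloud-storage client library (the bucket name is a placeholder):

from google.cloud import storage

def upload_to_bucket(output_filename):
    # Push the file previously saved in /tmp to a Cloud Storage bucket
    client = storage.Client()
    bucket = client.bucket('my-weather-bucket')  # placeholder bucket name
    blob = bucket.blob(output_filename)
    blob.upload_from_filename('/tmp/{}'.format(output_filename))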