Download resources from private CKAN datasets

Posted 2019-02-24 22:25

Question:

My aim is to download files which are held as resources within private datasets using (a) the CKAN API, or (b) the CKANAPI CLI, or (c) paster (if (c) is possible).

I have tried (a) without success. For example, when I fetch the resource URL with urllib2 or requests, a file is downloaded, but it is either corrupted (.zip) or contains the CKAN login page instead of the data (.xls).

I have also tried (b) without success, for example with the following command:

ckanapi dump datasets dataset_name --datapackages=~/ckan_out -r http://localhost:5000 -a XXXXX-XXXX-XXXX-XXXX-XXXXXXXXX

URL xxxxxxxxxxxx refused connection. The resource will not be downloaded

I have not yet found a paster command that downloads resources.

Is it possible to automate the process of downloading private resources using CKAN tools?

Should I change datasets from private to public, download the resource, and then make them private again?

Any insights are more than welcome.

CKAN 2.5.2, Ubuntu 14.04

Answer 1:

Unfortunately, the CKAN API doesn't offer a function for downloading resource data (only for metadata: resource_show). Resource download is handled by CKAN's web UI code instead. This means that you cannot use the authentication methods provided by the API (i.e. your API key) and have to log in with your normal credentials (username and password) instead:

import requests

CKAN_URL = 'http://localhost:5000'


def login(username, password):
    '''
    Login to CKAN.

    Returns a ``requests.Session`` instance with the CKAN
    session cookie.
    '''
    s = requests.Session()
    data = {'login': username, 'password': password}
    url = CKAN_URL + '/login_generic'
    r = s.post(url, data=data)
    if 'field-login' in r.text:
        # Response still contains login form
        raise RuntimeError('Login failed.')
    return s


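def download_resource_data(session, pkg_id, res_id):
    '''Return the resource's file content as bytes.'''
    url = '{ckan}/dataset/{pkg}/resource/{res}/download/'.format(
            ckan=CKAN_URL, pkg=pkg_id, res=res_id)
    r = session.get(url)
    r.raise_for_status()  # Fail loudly on HTTP errors such as 403 or 404
    return r.content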


if __name__ == '__main__':
    session = login('my-user', 'my-password')
    data = download_resource_data(session, 'some-package', 'some-resource')
    print(data)
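
The question mentions corrupted .zip files and .xls files that contained the login page. As a rough sketch, the bytes returned by download_resource_data could be checked and written to disk as shown here. The helper names looks_like_login_page and save_resource are hypothetical, and the 'field-login' marker is the same check the login function above relies on, so it is an assumption that it matches your CKAN theme.

```python
def looks_like_login_page(content, marker=b'field-login'):
    """Return True if the downloaded bytes appear to be CKAN's login form.

    The marker is an assumption borrowed from the login check above.
    """
    return marker in content


def save_resource(content, path):
    """Write downloaded bytes to ``path`` without altering them."""
    if looks_like_login_page(content):
        raise RuntimeError('Got the login page instead of the resource.')
    # Binary mode matters: text mode can corrupt archives such as .zip
    with open(path, 'wb') as f:
        f.write(content)
    return path
```

With these helpers, the last line of the script could call save_resource(data, 'some-resource.zip') instead of print(data).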


Answer 2:

Since I have administrative access to the machine on which CKAN is installed (Ubuntu 14.04), I used the following process to take the resources from the CKAN storage folder and copy them securely to another host.

Write a Python script that does the following:

(A) Use the CKAN API to get the resources' metadata. Alternatively, paster can generate a dump of the metadata as *.csv or *.json, but that did not work for me: neither the csv module nor the json module could read the file that paster generated. Is paster closing the file properly after writing it?

(B) From the resources' metadata, create a dictionary (1) in which each key is a resource_id and each value is a list of items [package_name, resource_format].

(C) Using the Python os module, reconstruct the full resource_id of each resource by walking the CKAN storage folder defined by ckan.storage_path in your configuration file, and hold the IDs in a list (2).

(D) Compare the resource_ids in (1) and (2). When a match is found, rename the file from (2) with mv, using the list of items stored in dictionary (1), and copy it securely to another host with scp.
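
Steps (B) to (D) could be sketched roughly as follows. This assumes the metadata from (A) is a list of package dictionaries shaped like the result of the package_show API action, and that the FileStore keeps each file at <storage_path>/resources/<first 3 characters of the id>/<next 3 characters>/<rest of the id>. The function names and the target file name pattern are hypothetical, and the sketch only plans the renames; the actual mv and scp calls are left out.

```python
import os


def build_resource_index(packages):
    """Step (B): map resource_id -> [package_name, resource_format]."""
    index = {}
    for pkg in packages:
        for res in pkg.get('resources', []):
            index[res['id']] = [pkg['name'], res.get('format', '')]
    return index


def find_stored_resources(storage_path):
    """Step (C): map reconstructed resource_id -> file path on disk."""
    root = os.path.join(storage_path, 'resources')
    found = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        parts = os.path.relpath(dirpath, root).split(os.sep)
        if len(parts) != 2:
            continue  # Files only live two directory levels down
        for name in filenames:
            # The two directory names plus the file name form the id
            found[''.join(parts) + name] = os.path.join(dirpath, name)
    return found


def plan_renames(index, stored):
    """Step (D): pair each matched file with a descriptive target name."""
    plan = []
    for res_id, path in sorted(stored.items()):
        if res_id in index:
            pkg_name, fmt = index[res_id]
            ext = '.' + fmt.lower() if fmt else ''
            plan.append((path, '{0}_{1}{2}'.format(pkg_name, res_id, ext)))
    return plan
```

Each (source_path, target_name) pair returned by plan_renames can then be handed to whatever copy mechanism is available. Formats containing spaces or slashes would need extra sanitising before being used as a file extension.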

The above works fine if you can access the CKAN storage path and open ports in the firewall between the machines. However, I would appreciate any insights on achieving the same result using just the CKAN API and authentication.