Automation#

When dealing with repetitive tasks such as:

uploading a large number of files
creating many packages
preserving your data periodically
analyzing information across packages

you can use the API of CKAN (the core software stack of ERIC) to automate these tasks with a programming language of your choosing.

Authentication#

Many tasks you might want to automate (such as uploading data) require you to authenticate, so that CKAN can check whether you are authorized to, for instance, upload data to a certain package. For that you need a token. If you do not have one yet, please contact rdm@eawag.ch and we will generate one for you.
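A minimal sketch of how a token could be kept out of your scripts: it reads the token from an environment variable and builds the Authorization header that the examples on this page send. The variable name ERIC_API_TOKEN and the helper name build_auth_headers are assumptions made for illustration; neither ERIC nor CKAN prescribes them.

```python
import os


def build_auth_headers(env_var: str = "ERIC_API_TOKEN") -> dict:
    """Return request headers carrying the API token, if one is configured.

    `env_var` is a hypothetical variable name chosen for this sketch.
    """
    token = os.environ.get(env_var, "").strip()
    # Public data can be read without a token, so an empty dict is a valid result.
    return {"Authorization": token} if token else {}
```

Keeping the token in the environment means it does not end up in notebooks or version control by accident.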

Important

CKAN has some limitations when uploading large files (>8GB). If you need to upload files of that size, please contact rdm@eawag.ch.

Examples#

Below you will find some examples of how to use the API with Python.

Retrieving information about a package#

For this we can use the package_show endpoint the CKAN API offers.

First we’ll define a function that can request information from the CKAN API and returns a dictionary:

import json
from urllib.request import urlopen, Request

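def request_json_data(url: str, token: str | None = None) -> dict:
    """Send a GET request to the CKAN API and return the decoded JSON response."""
    # Only send the Authorization header when a token is provided.
    headers = {} if token is None else {'Authorization': token}
    with urlopen(Request(url, headers=headers)) as response:
        return json.loads(response.read().decode())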
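CKAN wraps the payload of an API call in an envelope with a success flag, as the package_show output on this page shows. The sketch below unwraps such a dictionary and raises if the call was reported as failed. The helper name unwrap_ckan_response is hypothetical, and the assumption that a failed call describes the problem under an error key follows the general CKAN API convention rather than anything specific to ERIC.

```python
def unwrap_ckan_response(payload: dict):
    """Return payload["result"], or raise if CKAN reported a failure.

    For package_show the result is a dictionary describing the package.
    """
    if not payload.get("success", False):
        # Failed calls are assumed to describe the problem under "error".
        raise RuntimeError(f"CKAN request failed: {payload.get('error')}")
    return payload["result"]
```

Combined with the function above, this could be used as unwrap_ckan_response(request_json_data(url)).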

Then we can request the data. As we are reading from a public dataset, we do not need an API token. Note how the URL is composed:

host = "https://opendata.eawag.ch/"  # The url of the public data repository
api_endpoint = "api/3/action/package_show"
endpoint_parameter = "id"
parameter_value = "data-for-geringste-konzentrationen-grosste-wirkung"

url = f"{host}{api_endpoint}?{endpoint_parameter}={parameter_value}"

package_data = request_json_data(url)
package_data

{'help': 'https://opendata.eawag.ch/api/3/action/help_show?name=package_show',
 'success': True,
 'result': {'author': ['Rösch, Andrea',
   'Beck, Birgit',
   'Doppler, Tobias',
   'Junghans, Marion',
   'Hollender, Juliane',
   'Stamm, Christian',
   'Singer, Heinz'],
  'author_email': None,
  'citation': 'Rösch, A., Beck, B., Doppler, T., Junghans, M., Hollender, J., Stamm, C., &amp; Singer, H. (2019). <i>Data for: Geringste Konzentrationen – Grösste Wirkung</i> [Data set]. Eawag: Swiss Federal Institute of Aquatic Science and Technology. https://doi.org/10.25678/0001C7',
  'citation_publication': 'Rösch, A., Beck, B., Hollender, J., Stamm, C., Singer, H., Doppler, T., & Junghans, M. (2019). Geringe Konzentrationen mit grosser Wirkung. Nachweis von Pyrethroid- und Organophosphatinsektiziden in Schweizer Bächen im pg l<sup>-1</sup>-Bereich. <i>Aqua &amp; Gas</i>, 99(11), 54-66.',
  'creator_user_id': '064a4293-f097-4005-98d5-65b49b35ccf3',
  'doi': '10.25678/0001c7',
  'geographic_name': ['Ballmoosbach',
   'Chrümmlisbach',
   'Ron',
   'Le Bainoz',
   'Boiron de Morges',
   'Beggingerbach'],
  'has_part': [],
  'id': '50eafcd5-27c2-40a1-95d6-fc671262ee92',
  'id_internal': '',
  'is_part_of': [],
  'isopen': False,
  'ispublication': 'true',
  'license_id': None,
  'license_title': None,
  'maintainer': 'Doppler, Tobias',
  'maintainer_email': 'Tobias.Doppler@eawag.ch',
  'metadata_created': '2019-11-08T10:40:10.034952',
  'metadata_modified': '2020-03-31T21:04:00.642138',
  'name': 'data-for-geringste-konzentrationen-grosste-wirkung',
  'notes': 'In sechs kleinen bis mittelgrossen Fliessgewässern wurden die für aquatische Organismen extrem toxischen Pyrethroid- und Organophosphatinsektizide mittels einer Spezialanalytik bis in den Picogramm pro-Liter Bereich quantifiziert. An fünf der sechs untersuchten Standorte überschritten die gemessenen Insektizidkonzentrationen regelmässig chronische und zum Teil akute Qualitätskriterien und die chronische Mischungsrisiko¬bewertung zeigte während 43-100% des Untersuchungszeitraums hohe Risiken für die Invertebratengemeinschaft an. Werden Pyrethroid- und Organophosphatinsektizide nicht in die Beurteilung der Gewässerqualität miteinbezogen, kann das Gesamtrisiko für aquatische Organismen erheblich unterschätzt werden. ',
  'notes-2': '',
  'num_resources': 3,
  'num_tags': 6,
  'open_data': 'true',
  'organization': {'id': 'ad8c7050-d39a-41fb-83a8-563aae035ee7',
   'name': 'environmental-analytical-chemistry',
   'title': 'Environmental Analytical Chemistry',
   'type': 'organization',
   'description': 'Research in the group of environmental analytical chemistry focuses on development of novel methods for the analysis of organic contaminants in the aquatic environment.',
   'image_url': 'https://www.eawag.ch/fileadmin/_processed_/csm_bitmap_c95554bc94.png',
   'created': '2019-09-18T14:11:45.718963',
   'is_organization': True,
   'approval_status': 'approved',
   'state': 'active'},
  'owner_org': 'ad8c7050-d39a-41fb-83a8-563aae035ee7',
  'private': False,
  'publicationlink': '',
  'publicationlink_dora': 'https://www.dora.lib4ri.ch/eawag/islandora/object/eawag:19547',
  'publicationlink_url': 'https://www.eawag.ch/fileadmin/Domain1/News/2019/11/04/eawag_pyrethroide_ag_roesch.pdf',
  'review_level': 'none',
  'reviewed_by': '',
  'spatial': '{"type": "MultiPoint","coordinates": [[7.48096592772721,47.045354199371], [7.50967817301686,47.1233907382883],[6.81776618702072,46.8062051006771],[8.28338717661554,47.1648541498338],[8.52261321519673,47.7651854094541],[6.47793693774811,46.4938695571918] ]}',
  'state': 'active',
  'status': 'complete',
  'substances': ['Chlorpyrifos (SBPBAQFWLVIOKP-UHFFFAOYSA-N)',
   'Chlorpyrifos-methyl (HRBKVYFZANMGRE-UHFFFAOYSA-N)',
   'Tefluthrin (ZFHGXWPMULPQSE-SZGBIDFHSA-N)',
   'Bifenthrin (OXCDWLBJSLVWHB-LKRLXIKPSA-N)',
   'Cypermethrin (KAATUXNTWXVJKI-UHFFFAOYSA-N)',
   'Etofenprox (YREQHYQNNWYQCJ-UHFFFAOYSA-N)',
   'λ-Cyhalothrin (ZXQYGBMAQZUVMI-QQDHXZELSA-N)',
   'Cyfluthrin (OFHFONYRMVKULH-WNYJFNBPSA-N)',
   'Deltamethrin (OWZREIFADZCYQD-NSHGMRRFSA-N)',
   'Permethrin (RLLPVAHGXHCWKJ-UHFFFAOYSA-N)',
   'Cyphenothrin (FJDPATXIBIBRIM-UHFFFAOYSA-N)',
   'Empenthrin (YUGWDVYLFSETPE-JLHYYAGUSA-N)',
   'Fenvalerat (NYPJDWWKZLNGGM-UHFFFAOYSA-N)',
   'Metofluthrin (KVIZNNVXXNFLMU-UHFFFAOYSA-N)',
   'Phenothrin (SBNFWQZLDJGRLK-UHFFFAOYSA-N)',
   'Transfluthrin (DDVNRFNDOPPVQJ-HQJQHLMTSA-N)',
   'Allethrin (ZCVAOQKBXKSDMS-UHFFFAOYSA-N)',
   'Imiprothrin (VPRAQYXPZIFIOH-PYMCNQPYSA-N)',
   'Prallethrin (SMKRKQBMYOFFMU-UHFFFAOYSA-N)',
   'Tetramethrin (CXBMCYHAMVGWJQ-UHFFFAOYSA-N)',
   'Acrinathrin (YLFSVIMMRPNPFK-WEQBUNFVSA-N)',
   'τ-Fluvalinat (INISTDXBRIBGOC-UHFFFAOYSA-N)'],
  'substances_generic': ['pesticides', 'insecticides', 'pyrethroids'],
  'systems': ['stream'],
  'tags_string': 'pesticides,Insecticides,pyrethroids,stream,toxicity,trace analytics',
  'taxa': [],
  'taxa_generic': [],
  'timerange': ['2018-03 TO 2018-10', '2017-03 TO 2017-10'],
  'title': 'Data for: Geringste Konzentrationen – Grösste Wirkung',
  'type': 'dataset',
  'url': 'https://doi.org/10.25678/0001c7/',
  'variables': ['concentration'],
  'version': None,
  'resources': [{'allowed_users': '',
    'cache_last_updated': None,
    'cache_url': None,
    'citation': '',
    'created': '2019-11-08T10:40:10.532527',
    'datastore_active': False,
    'description': '',
    'format': 'TXT',
    'hash': '763481b753a85ee813652c7f6cd7ea389b7c67220af59809c9a117ce4230fd9d',
    'hashtype': 'sha256',
    'id': '016e7298-77dc-4a2d-b73e-1b68df23d038',
    'last_modified': '2019-11-08T10:40:10.495978',
    'metadata_modified': '2019-11-08T10:40:10.532527',
    'mimetype': 'text/plain',
    'mimetype_inner': None,
    'name': 'README.txt',
    'package_id': '50eafcd5-27c2-40a1-95d6-fc671262ee92',
    'position': 0,
    'resource_type': 'Text',
    'restricted_level': 'public',
    'size': 2766,
    'state': 'active',
    'url': 'https://opendata.eawag.ch/dataset/50eafcd5-27c2-40a1-95d6-fc671262ee92/resource/016e7298-77dc-4a2d-b73e-1b68df23d038/download/readme.txt',
    'url_type': 'upload'},
   {'allowed_users': '',
    'cache_last_updated': None,
    'cache_url': None,
    'citation': '',
    'created': '2019-11-08T10:40:11.052886',
    'datastore_active': True,
    'description': '',
    'format': 'XLSX',
    'hash': 'd81cd74c943dc3623a09d8b913a5b2b848106a44caeaafda61161d31fc82243a',
    'hashtype': 'sha256',
    'id': 'f4e375c5-8cd5-4e7c-8ed2-1fe32409a002',
    'last_modified': '2019-11-08T10:40:11.014235',
    'metadata_modified': '2019-11-08T10:40:11.052886',
    'mimetype': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    'mimetype_inner': None,
    'name': 'Pyrethroids2018.xlsx',
    'package_id': '50eafcd5-27c2-40a1-95d6-fc671262ee92',
    'position': 1,
    'resource_type': 'Dataset',
    'restricted_level': 'public',
    'size': 91055,
    'state': 'active',
    'url': 'https://opendata.eawag.ch/dataset/50eafcd5-27c2-40a1-95d6-fc671262ee92/resource/f4e375c5-8cd5-4e7c-8ed2-1fe32409a002/download/pyrethroids2018.xlsx',
    'url_type': 'upload'},
   {'allowed_users': '',
    'cache_last_updated': None,
    'cache_url': None,
    'citation': '',
    'created': '2019-11-08T10:40:11.613710',
    'datastore_active': True,
    'description': '',
    'format': 'XLSX',
    'hash': 'a22f9cbd1e563cb22daba6ac15503b9c4dcba01224156581274e70973e2ab4b0',
    'hashtype': 'sha256',
    'id': '35c4dcfb-a4bf-4dc4-82cf-e3360d0f08e8',
    'last_modified': '2019-11-08T10:40:11.574932',
    'metadata_modified': '2019-11-08T10:40:11.613710',
    'mimetype': 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    'mimetype_inner': None,
    'name': 'Pyrethroids2017.xlsx',
    'package_id': '50eafcd5-27c2-40a1-95d6-fc671262ee92',
    'position': 2,
    'resource_type': 'Dataset',
    'restricted_level': 'public',
    'size': 57648,
    'state': 'active',
    'url': 'https://opendata.eawag.ch/dataset/50eafcd5-27c2-40a1-95d6-fc671262ee92/resource/35c4dcfb-a4bf-4dc4-82cf-e3360d0f08e8/download/pyrethroids2017.xlsx',
    'url_type': 'upload'}],
  'tags': [{'display_name': 'Insecticides',
    'id': '6ef54430-e446-40d6-988b-1b0fd7fbd4c4',
    'name': 'Insecticides',
    'state': 'active',
    'vocabulary_id': None},
   {'display_name': 'pesticides',
    'id': '0e5fa288-9c9b-4e3d-9ae4-b747113b0b3f',
    'name': 'pesticides',
    'state': 'active',
    'vocabulary_id': None},
   {'display_name': 'pyrethroids',
    'id': '2a07cdc5-d030-4119-a165-2ee32c62ef05',
    'name': 'pyrethroids',
    'state': 'active',
    'vocabulary_id': None},
   {'display_name': 'stream',
    'id': '9df710c2-6240-423d-8ce0-310acf883170',
    'name': 'stream',
    'state': 'active',
    'vocabulary_id': None},
   {'display_name': 'toxicity',
    'id': 'be58b419-5bdc-4800-8b59-7bd59e5fe8f0',
    'name': 'toxicity',
    'state': 'active',
    'vocabulary_id': None},
   {'display_name': 'trace analytics',
    'id': '8244b5de-5afc-49aa-8ca6-7e5c51e7ac74',
    'name': 'trace analytics',
    'state': 'active',
    'vocabulary_id': None}],
  'groups': [],
  'relationships_as_subject': [],
  'relationships_as_object': []}}

A lot of data is returned. Let’s extract only the resource links for this data package.

resource_urls = [resource["url"] for resource in package_data["result"]["resources"]]
resource_urls

['https://opendata.eawag.ch/dataset/50eafcd5-27c2-40a1-95d6-fc671262ee92/resource/016e7298-77dc-4a2d-b73e-1b68df23d038/download/readme.txt',
 'https://opendata.eawag.ch/dataset/50eafcd5-27c2-40a1-95d6-fc671262ee92/resource/f4e375c5-8cd5-4e7c-8ed2-1fe32409a002/download/pyrethroids2018.xlsx',
 'https://opendata.eawag.ch/dataset/50eafcd5-27c2-40a1-95d6-fc671262ee92/resource/35c4dcfb-a4bf-4dc4-82cf-e3360d0f08e8/download/pyrethroids2017.xlsx']
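The same pattern works for any other field of the response. As a small sketch, the hypothetical helper summarize_resources below collects name, format and size for each resource. It assumes every resource entry carries these three keys, as they do in the output above. A reduced stand-in dictionary is used so the example runs without network access.

```python
def summarize_resources(package: dict) -> list[tuple[str, str, int]]:
    """Return (name, format, size in bytes) for every resource of a package."""
    return [
        (resource["name"], resource["format"], resource["size"])
        for resource in package["result"]["resources"]
    ]


# Reduced stand-in for the response shown above, so the sketch runs offline.
example_package = {
    "result": {
        "resources": [
            {"name": "README.txt", "format": "TXT", "size": 2766},
            {"name": "Pyrethroids2018.xlsx", "format": "XLSX", "size": 91055},
        ]
    }
}

for name, file_format, size in summarize_resources(example_package):
    print(f"{name:<22} {file_format:<5} {size:>8} B")
```

Calling summarize_resources(package_data) on the real response should work the same way.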

Downloading resources#

In the previous example we used the package_show endpoint of the CKAN API to extract the links of the resources. In this example we will download those resources.

def download_resource(url: str, file_path: str, token: str | None = None, chunk_size: int = 1024) -> None:
    """Download url to file_path in chunks, so the file is never held fully in memory."""
    headers = {} if token is None else {'Authorization': token}
    with urlopen(Request(url, headers=headers)) as response:
        with open(file_path, 'wb') as file:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                file.write(chunk)

With the download_resource function we can iterate over the previously extracted resource links and download them.

for url in resource_urls:
    file_path = f"/tmp/{url.split('/')[-1]}"
    download_resource(url, file_path)
    print(f"Successfully saved resource at: {file_path}")

Successfully saved resource at: /tmp/readme.txt
Successfully saved resource at: /tmp/pyrethroids2018.xlsx
Successfully saved resource at: /tmp/pyrethroids2017.xlsx
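The package_show output above lists a hash and a hashtype ('sha256') for every resource. The sketch below shows one way a download could be checked against that value. It assumes the hash field holds the SHA-256 digest of the file content, which the field names suggest but which is not confirmed here. The helper names are hypothetical.

```python
import hashlib


def sha256_of_file(file_path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(file_path, "rb") as file:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(file_path: str, resource: dict) -> bool:
    """Compare a downloaded file with the hash listed in its resource entry."""
    if resource.get("hashtype") != "sha256":
        raise ValueError(f"Unsupported hash type: {resource.get('hashtype')}")
    return sha256_of_file(file_path) == resource["hash"].lower()
```

After each download, matches_published_hash could be called with the file path and the matching entry of package_data["result"]["resources"].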

Uploading resources#

In this scenario we assume you created a package on ERIC/internal called data-for-project-x and now you want to upload your many resources.

Important

This procedure will require an API Token.

Note

Uploads take longer than the progress bar suggests: the bar reaches 100% after roughly a quarter of the time needed for the whole process to finish. You will only notice this for large files. The reasons are fairly technical and are explained below.

Technical reasons!

A file’s journey from your computer across the network to its final “resting place” passes through several proxies. Each of these proxies passes the data on to the next. The progress bar only reflects the time taken to upload the data to the first proxy. The additional waiting time is the time the remaining proxies need to pass the data from one to the next.

For ease of use we will install 3 libraries via pip install ...:

requests
requests_toolbelt
tqdm

The function below can be used to upload your data.

import pathlib

import tqdm
import requests
from requests_toolbelt.multipart.encoder import (
    MultipartEncoder,
    MultipartEncoderMonitor,
)


class TqdmProgressCallback:
    def __init__(self, total_size, filename):
        self.bar = tqdm.tqdm(
            total=total_size,
            unit="B",
            unit_scale=True,
            desc=f"Uploading {filename}",
        )
    def __call__(self, monitor):
        self.bar.update(monitor.bytes_read - self.bar.n)
        self.bar.refresh()

    def close(self):
        self.bar.close()

def upload_resource(
    file_path: pathlib.Path,
    package_id: str,
    token: str,
    description: str = "",
    resource_type: str = "Dataset",
    restricted_level: str = "public",
    state: str = "active",
    host: str = "https://data.eawag.ch",
    endpoint: str = "/api/3/action/resource_create",
):
    
    file_name = file_path.name
    file_size = file_path.stat().st_size
    with open(file_path, "rb") as file_stream:
        encoder = MultipartEncoder(
            fields={
                "upload": (
                    file_name,
                    file_stream,
                    "application/octet-stream",
                ),
                "package_id": package_id,
                "name": file_name,
                "description": description,
                "state": state,
                "size": str(file_size),
                "resource_type": resource_type,
                "restricted_level": restricted_level,
            }
        )

        progress_callback = TqdmProgressCallback(file_size, file_name)
        monitor = MultipartEncoderMonitor(encoder, progress_callback)

        headers = {"Authorization": token, "Content-Type": monitor.content_type}

        try:
            response = requests.post(
                f"{host}{endpoint}",
                data=monitor,
                headers=headers,
                auth=None,
                stream=True,
            )
        finally:
            # Close the progress bar even if the request itself fails.
            progress_callback.close()
        response.raise_for_status()

Note

File paths should be passed as pathlib.Path objects to the upload_resource function.

Let’s try it out. I prepared a folder full of test files.

In this example we iterate over the contents of the “/tmp/upload-test” folder and upload every entry that is a file. To do this we also need the name of the package we want to upload to (“data-for-project-x”), which is passed as package_id, and a valid token.

your_token = "..."  # you must provide your token here
your_package_id = "data-for-project-x"  # you must provide your package id here
data_package_folder = pathlib.Path("/tmp/upload-test/")

for candidate in data_package_folder.iterdir():
    if not candidate.is_file():
        continue
    upload_resource(
        file_path=candidate,
        package_id=your_package_id,
        token=your_token,
        description=f"This is the description for file {candidate}",
    )

Uploading random_file_15: 15.7MB [00:01, 9.06MB/s]                        
Uploading random_file_14: 14.7MB [00:01, 7.55MB/s]                        
Uploading random_file_13: 13.6MB [00:01, 7.41MB/s]                        
Uploading random_file_12: 12.6MB [00:01, 7.69MB/s]                        
Uploading random_file_11: 11.5MB [00:01, 7.53MB/s]                        
Uploading random_file_10: 10.5MB [00:01, 7.33MB/s]                        
Uploading random_file_9: 9.44MB [00:01, 6.16MB/s]                         
Uploading random_file_8: 8.39MB [00:01, 5.47MB/s]                         
Uploading random_file_7: 7.34MB [00:01, 5.52MB/s]                         
Uploading random_file_6: 6.29MB [00:01, 5.13MB/s]                         
Uploading random_file_5: 5.24MB [00:01, 4.67MB/s]                         
Uploading random_file_4: 4.20MB [00:01, 3.74MB/s]                         
Uploading random_file_3: 3.15MB [00:01, 3.08MB/s]                         
Uploading random_file_2: 2.10MB [00:01, 1.71MB/s]                         
Uploading random_file_1: 1.05MB [00:01, 916kB/s]
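If a batch upload is interrupted, rerunning the loop would send every file again. The sketch below shows one way to pick only the files that still need uploading. It assumes resource names equal file names (which is what upload_resource above sets) and treats the ">8GB" limit as 8 * 1000**3 bytes, which is an assumption about how that limit is counted. The helper name select_files_to_upload is hypothetical.

```python
import pathlib


def select_files_to_upload(
    folder: pathlib.Path,
    existing_names: set[str],
    max_bytes: int = 8 * 1000**3,  # assumed reading of the ">8GB" limit
) -> list[pathlib.Path]:
    """Return files in folder that are not yet in the package and not too large."""
    selected = []
    for candidate in sorted(folder.iterdir()):
        if not candidate.is_file():
            continue
        if candidate.name in existing_names:
            continue  # a resource with this name already exists
        if candidate.stat().st_size > max_bytes:
            continue  # very large files are better coordinated with rdm@eawag.ch
        selected.append(candidate)
    return selected
```

The set of existing names could be built from a package_show response, for example {resource["name"] for resource in package_data["result"]["resources"]}. For a non-public package that request would need your token.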