(automation-guide)=
# Automation

When dealing with repetitive tasks such as:
 + uploading a large number of files
 + creating many packages
 + preserving your data periodically
 + analyzing information across packages 

you can use the **[API](https://docs.ckan.org/en/2.9/api/)** of CKAN (the core software stack of ERIC) to automate these tasks with a programming language of your choosing.

## Authentication
Many tasks you might want to automate (such as uploading data) require you to authenticate, so that CKAN can check whether you are authorized to, for instance, upload data to a certain package. For this you need a **[token](https://docs.ckan.org/en/2.9/api/#authentication-and-api-tokens)**. If you do not have one yet, please contact [rdm@eawag.ch](mailto:rdm@eawag.ch) and we will generate one for you.
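
A common way to keep the token out of your scripts is to read it from an environment variable. The sketch below assumes a variable named `CKAN_API_TOKEN` and a helper called `build_auth_headers`; both names are hypothetical and chosen only for this example. It builds the same `Authorization` header that the examples on this page send.

```python
import os


def build_auth_headers(env_var: str = "CKAN_API_TOKEN") -> dict:
    """Return request headers containing the API token, if one is set.

    The variable name CKAN_API_TOKEN is an assumption made for this sketch.
    """
    token = os.environ.get(env_var)
    # Public data can be read without a token, so an empty dict is a valid result.
    return {"Authorization": token} if token else {}
```

With this approach the token never appears in a notebook or in version control; it only has to be exported in the shell that runs the script.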

```{important}
CKAN has some limitations when uploading large files (**>8GB**). If you need to upload files of that size, please contact [rdm@eawag.ch](mailto:rdm@eawag.ch).
```
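
Before starting a batch upload, it can help to check whether any file exceeds this limit. The following is a minimal sketch; the helper name `files_over_limit` is hypothetical, and it assumes that "8GB" means 8 × 1000³ bytes, which may not match the exact server-side limit.

```python
import pathlib

# Assumption: "8GB" is read as 8 * 1000**3 bytes; adjust if the limit is meant in GiB.
SIZE_LIMIT_BYTES = 8 * 1000**3


def files_over_limit(
    folder: pathlib.Path, limit: int = SIZE_LIMIT_BYTES
) -> list[pathlib.Path]:
    """Return all files directly inside `folder` that are larger than `limit` bytes."""
    return sorted(
        path
        for path in folder.iterdir()
        if path.is_file() and path.stat().st_size > limit
    )
```

Any file reported by this check is a candidate for contacting the address above rather than uploading through the API.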

## Examples

Below you will find some examples of how to use the API with **Python**.

### Retrieving information about a package
For this we can use the [package_show](https://docs.ckan.org/en/2.9/api/#ckan.logic.action.get.package_show) endpoint the CKAN API offers.

First we'll define a function that requests information from the CKAN API and returns the response as a dictionary:

In [2]:
import json
from urllib.request import urlopen, Request

def request_json_data(url: str, token: str | None = None) -> dict:
    headers = {} if token is None else {'Authorization': token}
    with urlopen(Request(url, headers=headers)) as response:
        return json.loads(response.read().decode())

Then we can request the data. Because we are reading from a public dataset, we do not need an API token. Note how the URL is composed:

In [3]:
host = "https://opendata.eawag.ch/"  # The url of the public data repository
api_endpoint = "api/3/action/package_show"
endpoint_parameter = "id"
parameter_value = "data-for-geringste-konzentrationen-grosste-wirkung"

url = f"{host}{api_endpoint}?{endpoint_parameter}={parameter_value}"

package_data = request_json_data(url)
package_data

{'help': 'https://opendata.eawag.ch/api/3/action/help_show?name=package_show',
 'success': True,
 'result': {'author': ['Rösch, Andrea',
   'Beck, Birgit',
   'Doppler, Tobias',
   'Junghans, Marion',
   'Hollender, Juliane',
   'Stamm, Christian',
   'Singer, Heinz'],
  'author_email': None,
  'citation': 'Rösch, A., Beck, B., Doppler, T., Junghans, M., Hollender, J., Stamm, C., &amp; Singer, H. (2019). <i>Data for: Geringste Konzentrationen – Grösste Wirkung</i> [Data set]. Eawag: Swiss Federal Institute of Aquatic Science and Technology. https://doi.org/10.25678/0001C7',
  'citation_publication': 'Rösch, A., Beck, B., Hollender, J., Stamm, C., Singer, H., Doppler, T., & Junghans, M. (2019). Geringe Konzentrationen mit grosser Wirkung. Nachweis von Pyrethroid- und Organophosphatinsektiziden in Schweizer Bächen im pg l<sup>-1</sup>-Bereich. <i>Aqua &amp; Gas</i>, 99(11), 54-66.',
  'creator_user_id': '064a4293-f097-4005-98d5-65b49b35ccf3',
  'doi': '10.25678/0001c7',
  'geographic_nam

A lot of data is returned. Let's extract only the resource links for this data package.

In [3]:
resource_urls = [resource["url"] for resource in package_data["result"]["resources"]]
resource_urls

['https://opendata.eawag.ch/dataset/50eafcd5-27c2-40a1-95d6-fc671262ee92/resource/016e7298-77dc-4a2d-b73e-1b68df23d038/download/readme.txt',
 'https://opendata.eawag.ch/dataset/50eafcd5-27c2-40a1-95d6-fc671262ee92/resource/f4e375c5-8cd5-4e7c-8ed2-1fe32409a002/download/pyrethroids2018.xlsx',
 'https://opendata.eawag.ch/dataset/50eafcd5-27c2-40a1-95d6-fc671262ee92/resource/35c4dcfb-a4bf-4dc4-82cf-e3360d0f08e8/download/pyrethroids2017.xlsx']
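
The URL composition and the handling of the response envelope (`help`, `success`, `result`) shown above can be wrapped in two small helpers. This is a sketch under assumptions: the names `build_action_url` and `unwrap_result` are hypothetical, and the `error` key read on failure follows the response layout described in the CKAN API documentation.

```python
from urllib.parse import urlencode, urljoin


def build_action_url(host: str, action: str, **params: str) -> str:
    """Compose a CKAN action URL: <host>/api/3/action/<action>?<params>."""
    # Normalise the host so it works with or without a trailing slash.
    url = urljoin(host.rstrip("/") + "/", f"api/3/action/{action}")
    # urlencode takes care of escaping special characters in parameter values.
    return f"{url}?{urlencode(params)}" if params else url


def unwrap_result(response: dict):
    """Return the 'result' part of a response, or raise if 'success' is not True."""
    if not response.get("success"):
        raise RuntimeError(f"CKAN request failed: {response.get('error')}")
    return response["result"]
```

Combined with `request_json_data`, the resource links could then be obtained with `unwrap_result(request_json_data(url))["resources"]`, where `url` comes from `build_action_url(host, "package_show", id=parameter_value)`.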

### Downloading resources
In the previous example we used the [package_show](https://docs.ckan.org/en/2.9/api/#ckan.logic.action.get.package_show) endpoint of the CKAN API to extract the links of the resources. In this example we will download those resources.

In [1]:
def download_resource(url: str, file_path: str, token: str | None = None, chunk_size: int = 1024) -> None:
    headers = {} if token is None else {'Authorization': token}
    with urlopen(Request(url, headers=headers)) as response:
        with open(file_path, 'wb') as file:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                file.write(chunk)


With the `download_resource` function we can iterate over the previously extracted resource links and download them.

In [5]:
for url in resource_urls:
    file_path = f"/tmp/{url.split('/')[-1]}"
    download_resource(url, file_path)
    print(f"Successfully saved resource at: {file_path}")

Successfully saved resource at: /tmp/readme.txt
Successfully saved resource at: /tmp/pyrethroids2018.xlsx
Successfully saved resource at: /tmp/pyrethroids2017.xlsx
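
The loop above derives the file name with `url.split('/')[-1]`, which works for the links shown here. A slightly more defensive variant is sketched below; the helper name `local_path_for` is hypothetical. It ignores query strings, decodes percent-encoded characters, and refuses URLs that do not end in a usable file name.

```python
import pathlib
from urllib.parse import unquote, urlsplit


def local_path_for(url: str, target_dir: pathlib.Path) -> pathlib.Path:
    """Derive a local file path inside `target_dir` from the last path segment of `url`."""
    # urlsplit separates the path from any query string or fragment.
    name = pathlib.PurePosixPath(unquote(urlsplit(url).path)).name
    # Reject empty names and ".." so the result always stays inside target_dir.
    if name in ("", ".."):
        raise ValueError(f"Cannot derive a file name from {url!r}")
    return target_dir / name
```

In the download loop, `file_path` could then be computed as `local_path_for(url, pathlib.Path("/tmp"))`.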


### Uploading resources
In this scenario we assume you created a package on **ERIC/internal** called `data-for-project-x` and now you want to upload your many resources.

```{important}
This procedure will require an **API Token**.
```

::::{note}
Uploads take longer than the progress bar suggests. The progress bar reaches **100% after about 1/4 of the time needed for the process to finish**. The reasons are technical; if you want to know why, please click below. **You will only notice this for large files.**
:::{admonition} Technical reasons!
:class: dropdown
A file's journey from your computer across the network to its final "resting place" passes through several [proxies](https://docs.nginx.com/nginx/admin-guide/web-server/reverse-proxy/). Each of these proxies passes the data on to the next. The time shown in the progress bar is the time taken to upload the data to the first proxy. The additional time you have to wait is the time it takes for the other proxies to copy the data from one to the next.
:::
::::


For ease of use we will install three libraries via `pip install ...`:
 + `requests`
 + `requests_toolbelt`
 + `tqdm`

The function below can be used to upload your data.

In [6]:
!pip install tqdm requests requests_toolbelt


In [7]:
import pathlib

import tqdm
import requests
from requests_toolbelt.multipart.encoder import (
    MultipartEncoder,
    MultipartEncoderMonitor,
)


class TqdmProgressCallback:
    """Progress bar that a MultipartEncoderMonitor can call during an upload."""

    def __init__(self, total_size, filename):
        self.bar = tqdm.tqdm(
            total=total_size,
            unit="B",
            unit_scale=True,
            desc=f"Uploading {filename}",
        )

    def __call__(self, monitor):
        # Advance the bar by the number of bytes read since the last call.
        self.bar.update(monitor.bytes_read - self.bar.n)
        self.bar.refresh()

    def close(self):
        self.bar.close()

def upload_resource(
    file_path: pathlib.Path,
    package_id: str,
    token: str,
    description: str = "",
    resource_type: str = "Dataset",
    restricted_level: str = "public",
    state: str = "active",
    host: str = "https://data.eawag.ch",
    endpoint: str = "/api/3/action/resource_create",
):
    
    file_name = file_path.name
    file_size = file_path.stat().st_size
    with open(file_path, "rb") as file_stream:
        encoder = MultipartEncoder(
            fields={
                "upload": (
                    file_name,
                    file_stream,
                    "application/octet-stream",
                ),
                "package_id": package_id,
                "name": file_name,
                "description": description,
                "state": state,
                "size": str(file_size),
                "resource_type": resource_type,
                "restricted_level": restricted_level,
            }
        )

        progress_callback = TqdmProgressCallback(file_size, file_name)
        monitor = MultipartEncoderMonitor(encoder, progress_callback)

        headers = {"Authorization": token, "Content-Type": monitor.content_type}

        try:
            response = requests.post(
                f"{host}{endpoint}",
                data=monitor,
                headers=headers,
                auth=None,
                stream=True,
            )
        finally:
            # Close the progress bar even if the request fails.
            progress_callback.close()
        # Raise an exception if CKAN answered with an HTTP error status.
        response.raise_for_status()


```{note}
File paths should be passed as `pathlib.Path` objects to the `upload_resource` function.
```

Let's try it out. I prepared a folder full of test files.

In [8]:
!rm -rd "/tmp/upload-test"

rm: cannot remove '/tmp/upload-test': No such file or directory


In [9]:
!mkdir -p "/tmp/upload-test"
!for i in $(seq 1 15); do fallocate -l ${i}M "/tmp/upload-test/random_file_$i"; done
!tree "/tmp/upload-test"

/tmp/upload-test
├── random_file_1
├── random_file_10
├── random_file_11
├── random_file_12
├── random_file_13
├── random_file_14
├── random_file_15
├── random_file_2
├── random_file_3
├── random_file_4
├── random_file_5
├── random_file_6
├── random_file_7
├── random_file_8
└── random_file_9

1 directory, 15 files
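
The shell commands above rely on `fallocate`, which is only available on Linux. As a rough, portable alternative, the same kind of test files can be created from Python. This is a sketch; the helper name `create_test_files` is hypothetical, and it mirrors the naming and sizes used above (file `i` is `i` MiB long).

```python
import pathlib


def create_test_files(
    folder: pathlib.Path, count: int = 15, unit: int = 1024 * 1024
) -> list[pathlib.Path]:
    """Create `count` files named random_file_<i>, where file i is i * unit bytes long."""
    folder.mkdir(parents=True, exist_ok=True)
    created = []
    for i in range(1, count + 1):
        path = folder / f"random_file_{i}"
        with open(path, "wb") as file:
            # truncate() extends the empty file to the requested size.
            file.truncate(i * unit)
        created.append(path)
    return created
```

Calling `create_test_files(pathlib.Path("/tmp/upload-test"))` should produce a folder equivalent to the one listed above.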


In this example, we'll iterate over the contents of the `/tmp/upload-test` folder and upload every entry that is a file. To do this, we also need the ID of the package we want to upload to (`data-for-project-x`) and a valid token.

In [10]:
your_token = "..."  # you must provide your token here
your_package_id = "data-for-project-x"  # you must provide your package ID here
data_package_folder = pathlib.Path("/tmp/upload-test/")

```python
for candidate in data_package_folder.iterdir():
    if not candidate.is_file():
        continue
    upload_resource(
        file_path=candidate,
        package_id=your_package_id,
        token=your_token,
        description=f"This is the description for file {candidate}",
    )
```

```
Uploading random_file_15: 15.7MB [00:01, 9.06MB/s]                        
Uploading random_file_14: 14.7MB [00:01, 7.55MB/s]                        
Uploading random_file_13: 13.6MB [00:01, 7.41MB/s]                        
Uploading random_file_12: 12.6MB [00:01, 7.69MB/s]                        
Uploading random_file_11: 11.5MB [00:01, 7.53MB/s]                        
Uploading random_file_10: 10.5MB [00:01, 7.33MB/s]                        
Uploading random_file_9: 9.44MB [00:01, 6.16MB/s]                         
Uploading random_file_8: 8.39MB [00:01, 5.47MB/s]                         
Uploading random_file_7: 7.34MB [00:01, 5.52MB/s]                         
Uploading random_file_6: 6.29MB [00:01, 5.13MB/s]                         
Uploading random_file_5: 5.24MB [00:01, 4.67MB/s]                         
Uploading random_file_4: 4.20MB [00:01, 3.74MB/s]                         
Uploading random_file_3: 3.15MB [00:01, 3.08MB/s]                         
Uploading random_file_2: 2.10MB [00:01, 1.71MB/s]                         
Uploading random_file_1: 1.05MB [00:01, 916kB/s]
```
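
When uploading many files, a single failed request stops the loop above, because `upload_resource` raises on HTTP errors. One possible way to keep going and review the failures afterwards is sketched below. The helper name `upload_all` is hypothetical; it takes the upload function as an argument, so it makes no assumptions about the server.

```python
import pathlib
from typing import Callable, Iterable


def upload_all(
    paths: Iterable[pathlib.Path],
    upload: Callable[[pathlib.Path], None],
) -> tuple[list[pathlib.Path], dict[pathlib.Path, Exception]]:
    """Call `upload` for every path and collect failures instead of aborting."""
    succeeded, failed = [], {}
    for path in paths:
        try:
            upload(path)
        except Exception as error:  # deliberately broad: record and carry on
            failed[path] = error
        else:
            succeeded.append(path)
    return succeeded, failed
```

It could be used as `upload_all(files, lambda path: upload_resource(path, your_package_id, your_token))`, where `files` is the list of files to upload. The returned dictionary maps each file that could not be uploaded to the exception that was raised, so those files can be retried later.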