Tactical Tech Show
— tools, technologies, concepts —
Research Data & Software Training Series — 2020-01-30

Harald von Waldow <harald.vonwaldow@eawag.ch>
Michael Köbberich <michael.koebberich@eawag.ch>



Point your device to https://pollev.com/actionlake240
  • Agree to cookies
  • If asked for login: "skip"
  • Answer first question:
    Empa or Eawag?

Scripting & Programming

R, Python and Julia
Which one?


R: statistics, data wrangling, non-NN related ML, visualization, high quality packages, "friendly" ecosystem
Python: data wrangling, visualization, NN related ML, beginner-friendly, extremely versatile
Julia: very fast, own simulation codes, matrix algebra, numerical solvers, differential equations (h/t Andreas)
What is everybody else in your lab and in your field using?



Python2 versus Python3

  • Incompatible!
  • ... but lots of help to port from 2 to 3
  • Use Python 3 unless you need legacy packages
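
A few of the incompatibilities, shown as a minimal sketch. The code runs under Python 3; the comments describe what Python 2 did instead. The variable names are only illustrative.

```python
# Python 3 behaviour; comments note what Python 2 did differently.

# print is a function in Python 3 (it was a statement in Python 2).
print("hello")

# "/" is true division in Python 3; Python 2 truncated when both were ints.
half = 7 / 2          # 3.5 in Python 3 (3 in Python 2)
floor_half = 7 // 2   # 3 in both versions

# Text is Unicode by default; bytes are a separate type.
text = "Köbberich"
raw = text.encode("utf-8")   # explicit conversion from str to bytes
```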

Python encapsulation



Python packages you need to learn

NumPy: Fundamental package for scientific computing: matrices (arrays), linear algebra
pandas: Fundamental package for data analysis: Data Frame (similar to R) and Time Series. Data sorting & preparing.
plotnine: Recommended for visualization. A clone of R's ggplot2. Traditional alternative: Matplotlib/Seaborn.



Python packages for ML / AI

Python is "industry-standard" when it comes to ANN, CNN, RNN, GAN, ..., "deep-anything".

scikit-learn: Classical statistics & ML: regression, classification, clustering, decision-trees, validation, SVM, EOF, ...
Caffe: Fast CNN framework by Berkeley AI Research Lab.
PyTorch: Deep Learning Framework by Facebook
Keras + TensorFlow
Deep learning high-level interface by Google.



Python IDEs


R packages you need to learn

The Tidyverse

  • dplyr,
  • tidyr,
  • tibble,
  • ggplot2, ...

Modern packages for "Data Science" (Hadley Wickham). Check out the free book.


R goodies

Find R packages
CRAN Task Views : https://cran.r-project.org/web/views
RStudio Desktop
Web-based interactive visualizations
Shiny. Example: ExPanD. Also a cloud service.
Dynamic report generation, literate programming
A snapshot of all package versions (h/t Andreas)

Coding Tips


Steps towards better code

  1. A script without structure.
  2. Code is for humans: Comment and structure into sections!
  3. DRY, generalize: Write functions!
  4. Make code re-usable: Write a library!
  5. Get rid of hard-coded parameters: Make scripts take arguments -> docopt
  6. Read a book: The Pragmatic Programmer
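
Steps 3 and 5 can be sketched as follows. The slide recommends docopt; this sketch uses argparse from the standard library instead, only to stay free of dependencies. The function scale and the option --factor are hypothetical names chosen for illustration.

```python
import argparse


def scale(values, factor):
    """Multiply every value by factor (step 3: a reusable function)."""
    return [v * factor for v in values]


def parse_args(argv=None):
    """Step 5: read parameters from the command line instead of hard-coding."""
    parser = argparse.ArgumentParser(description="Scale a list of numbers.")
    parser.add_argument("values", nargs="+", type=float, help="numbers to scale")
    parser.add_argument("--factor", type=float, default=2.0, help="scaling factor")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    result = scale(args.values, args.factor)
    print(result)
    return result


if __name__ == "__main__":
    main()
```

Saved under a name of your choice, e.g. scale.py, it would be called as: python scale.py 1 2 --factor 3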

Version control


Learn Git!



Why Git?

  • It is the de-facto standard today and the most powerful option.
  • A lot of infrastructure and workflows are connected with Git repositories.
    • Workshop: "clone this repo, please"
    • Open Source collaboration
    • Continuous Integration tools
    • JupyterLab - binder - nbviewer



Git, what for?

  • Version control for software development, duh
  • Convenient and efficient backup
  • Organization of and collaboration on text-documents
  • Trace the provenance of any data

Text & Presentations





No presentation tool is fun.

  • One alternative to PowerPoint is a web-based presentation.
  • Like this one
  • One tool to make them: reveal.js
  • For GUI oriented people: slides.com
  • Triple-safety: On the web, on your laptop, on a stick, as PDF
  • Also good for live-notetaking

Git and Writing


Collaborating on text-documents
using Markdown & Git

  • Only realistic in a few fields:
    co-authors must know at least a little Markdown & Git.
  • There are commercial cloud-based offers with a GUI:
    e.g. authorea.com
  • Good for coordinating the input of many contributors:
    • See changes by everybody immediately
    • Keep record of changes automatically
    • Get help with merging multiple contributions

Data collection & cleaning

Data collection in the field



  • => Form design
  • => Data entry
  • Complex, e.g. conditional forms possible
  • Completely Open Source & community run
  • Self-hosting possible
  • Works for disaster relief in the developing world
    => might even work in DE



  • Always have at least
    2 copies in existence
  • Rule of three:
    • 3 copies
    • 2 different storage media
    • 1 off-site backup

Pay particular attention to transfer situations, e.g.
datalogger -> laptop -> field station
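
One way to guard a transfer step is to compare checksums of source and copy. The following is a minimal sketch, not a complete backup tool; the function names sha256sum and copy_verified are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def sha256sum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_verified(src, dst):
    """Copy src to dst and raise if the copy differs from the original."""
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    if sha256sum(src) != sha256sum(dst):
        raise IOError(f"checksum mismatch after copying {src} to {dst}")
    return dst
```

Delete the original on the datalogger only after the copy has been verified.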



Automate as much as possible

  • Laptop: Use an automatically scheduled script
    -> Backup to external drive / memory-stick.
    -> Carry backup-medium separately during transport.
    -> Use reverse differential backup if possible,
       e.g. rdiff-backup.
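
A script of this kind could look like the sketch below. It is a plain incremental copy, not a reverse differential backup as rdiff-backup provides, and it keeps no old versions. Scheduling is left to the operating system (cron, Task Scheduler). The name backup_tree is hypothetical.

```python
import shutil
from pathlib import Path


def backup_tree(source, target):
    """Copy every file under source that is missing or older in target."""
    source, target = Path(source), Path(target)
    copied = []
    for src in sorted(source.rglob("*")):
        if not src.is_file():
            continue
        dst = target / src.relative_to(source)
        # Copy when the backup lacks the file or holds an older version.
        if not dst.exists() or dst.stat().st_mtime < src.stat().st_mtime:
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 also preserves modification times
            copied.append(dst)
    return copied
```

A second run right after the first copies nothing, because the modification times then match.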

Data cleaning


Interactively clean, sort, transform messy data:


Data organization & Databases

File system


File naming

  • Have a clearly defined naming- and organization scheme!
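
A scheme is easiest to keep when a function enforces it. The sketch below assumes one possible scheme (ISO date first, then project, site and description); the scheme and the name make_filename are illustrative, not a recommendation from the talk.

```python
import re
from datetime import date


def make_filename(project, site, measured_on, description, ext="csv"):
    """Build names of the form <ISO date>_<project>_<site>_<description>.<ext>"""

    def slug(text):
        # Lower-case and replace anything that is not a-z or 0-9 with "-".
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    parts = [measured_on.isoformat(), slug(project), slug(site), slug(description)]
    return "_".join(parts) + "." + ext
```

With the ISO date in front, an alphabetical file listing is also chronological.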

Tabular data


Avoid Excel files (like the plague)



Excel is good

If you use Excel, read
Broman & Woo (2018). Data Organization in Spreadsheets



  • Use simple tab- or comma-separated files
  • Make sure they are UTF-8 encoded, if they contain special characters
  • To export a spreadsheet to csv, use LibreOffice (because Excel ... you know)
  • Use open formats provided by your analysis language
    ( R: save(), Python: pickle.dump() )
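
Writing and reading a UTF-8 encoded CSV file with Python's standard library could look like this sketch. The rows are made-up illustration values and the helper names write_csv and read_csv are hypothetical.

```python
import csv

# Made-up example rows; the site names contain non-ASCII characters.
rows = [
    {"site": "Zürich", "temperature_c": 4.2},
    {"site": "Dübendorf", "temperature_c": 3.9},
]


def write_csv(path, rows):
    """Write a list of dicts as comma-separated, UTF-8 encoded text."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path):
    """Read the file back; note that every value comes back as a string."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))
```

Stating the encoding explicitly avoids depending on the platform default.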



When you spend a lot of time in your analysis code to

load the right files & extract the right data

maybe you need a database.

Good for one local user (with write permissions). Nice GUI for table-design & query-forms. Good for interactive use.
SQLite: Good for one local user (with write permissions). Integrated into RSQLite (R) and sqlite3 (Python). Good for programmatic use.
Multi-user, server-based. E.g. PostgreSQL, MariaDB, MS SQL Server
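
For the programmatic case, a minimal sketch with the sqlite3 module from Python's standard library. The table layout, the values and the function names are made up for illustration.

```python
import sqlite3


def build_demo_db(path=":memory:"):
    """Create a small database; ':memory:' keeps it in RAM for this demo."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS measurement "
        "(site TEXT, day TEXT, temperature_c REAL)"
    )
    # Made-up example rows.
    con.executemany(
        "INSERT INTO measurement VALUES (?, ?, ?)",
        [("A", "2020-01-29", 3.5), ("A", "2020-01-30", 4.5), ("B", "2020-01-30", 6.0)],
    )
    con.commit()
    return con


def mean_temperature(con, site):
    """Let the database select and aggregate instead of the analysis code."""
    row = con.execute(
        "SELECT AVG(temperature_c) FROM measurement WHERE site = ?", (site,)
    ).fetchone()
    return row[0]  # None if the site has no rows
```

Passing a file name instead of ":memory:" stores the database in a single file on disk.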


Virtual Machines


Run another operating system on your Desktop

  • Main system: Windows, run Linux for development
  • Main system: Linux, run Windows for Outlook, MS-Office, ...
  • Share a whole development machine with complete setup


  • VirtualBox (Oracle, free)
  • VMware Workstation Player/Pro


  • Needs resources just like a separate computer
  • Needs a complete setup of an OS from scratch





Difference to VM

  • Lots of containers on one machine possible
  • Fast start and stop


  1. Try out complex software (e.g. Taiga)
  2. Develop software with "big" dependencies, e.g. databases
  3. Have the setup in code (as Dockerfile)
  4. Deploy exactly the same environment to any host