Eawag Research Data Management Project

General Data Management

Article: Karl W. Broman and Kara H. Woo (2017).
Data organization in spreadsheets.

An absolute must-read if you have to use Excel in your research.

Abstract (Broman and Woo, 2017)
Spreadsheets are widely used software tools for data entry, storage, analysis, and visualization. Focusing on the data entry and storage aspects, this paper offers practical recommendations for organizing spreadsheet data to reduce errors and ease later analyses. The basic principles are: be consistent, write dates like YYYY-MM-DD, don't leave any cells empty, put just one thing in a cell, organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row), create a data dictionary, don't include calculations in the raw data files, don't use font color or highlighting as data, choose good names for things, make backups, use data validation to avoid data entry errors, and save the data in plain text files.
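
A few of these principles can be checked mechanically once the data is saved as plain text. The following is a minimal Python sketch, not taken from Broman and Woo's paper, that flags non-rectangular rows, empty cells, and dates not written as YYYY-MM-DD in CSV text. The function names check_table and is_iso_date, the sample column names, and the choice to report problems as (line number, message) pairs are assumptions made for illustration.

```python
import csv
import io
import re
from datetime import datetime

# Exactly four digits, two digits, two digits, separated by hyphens.
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value):
    """True if value is exactly YYYY-MM-DD and a real calendar date."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


def check_table(text, date_columns=()):
    """Return (line_number, message) pairs for problems found in CSV text.

    Hypothetical helper: it checks only three of the principles
    (single rectangle, no empty cells, YYYY-MM-DD dates).
    """
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # assumes a single header row is present
    for row in reader:
        line_no = reader.line_num
        if len(row) != len(header):
            problems.append((line_no, "row length differs from header"))
            continue
        for name, value in zip(header, row):
            value = value.strip()
            if value == "":
                problems.append((line_no, f"empty cell in column '{name}'"))
            elif name in date_columns and not is_iso_date(value):
                problems.append((line_no, f"date not YYYY-MM-DD in column '{name}'"))
    return problems


# Made-up sample data: line 3 has a non-ISO date, line 4 an empty cell.
SAMPLE = (
    "site,date,temp_c\n"
    "A,2017-03-01,4.2\n"
    "B,01/03/2017,5.1\n"
    "C,2017-03-03,\n"
)

for line_no, message in check_table(SAMPLE, date_columns=["date"]):
    print(f"line {line_no}: {message}")
```

The remaining principles (consistency, good names, a data dictionary) need human judgment and are not covered by this sketch.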

File Formats
File Formats: It is difficult to give comprehensive advice on file formats suited for archiving and digital preservation. The Library of Congress Recommended Formats Statement is a good starting point.

Rules of Thumb

  • Avoid proprietary formats (prefer standardized ones).
  • Avoid pseudo-standards such as xlsx and docx. You can recognize them by the lack of multiple independent implementations across different applications and platforms.
  • Anything "plain text" that is not pure ASCII should be UTF-8 encoded.
  • Simpler is better (as long as no information is lost).
  • If your original data comes in a "low-quality" format (e.g. MP3, GIF, ...), just keep it in that format; converting it does not restore the lost information.
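
The UTF-8 rule above can be applied when reading legacy text files. The Python sketch below tries UTF-8 first and falls back to a single legacy encoding. The function names and the default fallback of Windows-1252 (cp1252) are assumptions for illustration; the right fallback depends on where the file came from, and bytes that are undefined in the fallback encoding will still raise an error.

```python
def decode_text(raw, fallback="cp1252"):
    """Decode bytes as UTF-8 if valid, else with the assumed fallback.

    Returns a (text, encoding_used) pair.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


def to_utf8(raw, fallback="cp1252"):
    """Return the same text re-encoded as UTF-8 bytes."""
    text, _ = decode_text(raw, fallback)
    return text.encode("utf-8")


# "Zürich" stored in Windows-1252 is not valid UTF-8.
legacy = "Z\u00fcrich".encode("cp1252")
print(decode_text(legacy))
print(to_utf8(legacy))
```

Pure ASCII input is already valid UTF-8, so it passes through unchanged.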

File Organization
File Organization: Some rules and helpful hints about how to store files on your computer that will make your life easier during the course of research. Slides from the MIT Data Management Group.

RDM Guide: CESSDA ERIC provides a hands-on guide to elementary data management. It targets the social sciences but contains enough general know-how to be worth reading for other fields.

Book: Rutger Vos and Pedro Fernandes (2017).
Open Science, Open Data, Open Source:
21st century research skills for the life sciences.

An excellent resource, not only for life sciences, that covers a lot of ground.

Book: Garrett Grolemund and Hadley Wickham (2017).
R for Data Science. Online.

Not really about "science", but an excellent tutorial and reference for data handling in R.

Book (40 pages): Van den Eynden, V, et al. (2011).
Managing and Sharing Data; a best practice guide for researchers. UK Data Archive.

Contains many practical hints.

Software: OpenRefine (formerly Google Refine)

State-of-the-art power tool to clean, explore and transform large, messy datasets. A free desktop application for Windows, Mac and Linux. Check this out the moment you begin to swear at Excel (at the latest).

Project: Research Data Life-Cycle Management (DLCM): From Pilot Implementations to National Services.

A large, national-level project that aims to provide guidelines, training activities, policy support and various tools to help researchers and their institutions with all aspects of data management.

Data Management Plans

Guide: The Eawag Data Management Guide: Instructions for creating data management plans for SNSF proposals.

Extensive step-by-step instructions for preparing an SNSF-compliant DMP. Partly Eawag-specific (some cut & paste snippets), but also useful for other Swiss researchers.

Resources & Guide: Help with data management plans from the DLCM Project.

Various useful links for DMP preparation. In particular, there is the DLCM Template for the SNSF Data Management Plan, a 30-page document with step-by-step instructions.

Online Tools: There are various web applications that help to create and manage data management plans. These tools include templates that reflect the requirements of various funding agencies. Unfortunately, the Swiss National Science Foundation has not yet integrated its DMP template into any of these tools.
  • Software: DMPRoadmap. The joint codebase from DMPTool (University of California Curation Center, UC3) and DMPonline (Digital Curation Centre, DCC). Probably the most established tool today. Open source / Ruby.
  • Software: RDMO. A DMP web application by the Research Data Management Organiser (RDMO) project, which is funded by the DFG and supported and used by a large number of German institutions. This software is pretty new and still a bit rough around the edges but has a great number of advanced features. Open source / Python.
  • Service: DMPTool. Public service by UC3.
  • Service: DMPonline. Public service by DCC.
  • Examples: Public plans. Hundreds of public DMPs from DMPTool.

Metadata Standards

Directory: The Metadata Standards Directory, maintained by the RDA Metadata Standards Directory Working Group, lists a large number of (mostly) domain-specific metadata standards. Look here first if you want a system for annotating your data that is established in your field of research.

Resources for RDM professionals


The 4TU.Centre for Research Data and in particular the Research Data Services of the TU Delft Library have long-standing experience and impressive know-how.

Presentation with interesting numbers by Alastair Dunning
(PDF version of the original PowerPoint file)


Book: Johnston, Lisa R. (2017). Curating Research Data Volume Two: A Handbook of Current Practice. Association of College & Research Libraries.

A very extensive checklist for people who establish an institutional research data repository.


Survey: TU Delft and EPFL have published an RDM survey conducted among their researchers (n = 659 across both institutions). Although it covers only engineering and computer science faculties, it is a good starting point for an empirical picture of the status quo.

It would be nice if other institutions used a compatible survey.

Data Curation Network

The Data Curation Network is a Sloan-funded project that aims to conceptualize and develop a “network of expertise” model for U.S. academic libraries to collectively provide data curation services to support digital research data deposit into repositories for open access and reuse.

In addition to a down-to-earth conceptual model, they provide a wealth of recent empirical information, advice, checklists & workflows.