Commit de2be794 authored by Vilem Ded

slides 2020-03-12_IT101-DM

parent 6d7e6b71
# Code versioning
<div style="position:absolute; width:40%">
* Current standard for code versioning
* Maintain versions of your code as it develops
* Local system, which does not require an online repository
* Repositories allow distributed development
<img align="middle" height="300px" src="slides/img/Git-logo.png">
<div class="fragment" style="position:absolute; left:50%; width:40%">
* Recommended, supported repository
* Allows tracking of issues
* Ready for continuous integration - code checked on commits to the repository.
* [](
**Use at LCSB**
* All analysis code should be in a repository
* Minimally at submission of a manuscript
* Better daily
* Even better "analyses chunkwise"
<aside class="notes">
Policy! - code in central repository
# Data housekeeping
## File names
<div style="display:flex; position:static; width:100%">
<div class="fragment" data-fragment-index="0" style="position:static; width:30%">
### General principles
* Machine readable
* Human readable
* Plays well with default ordering
<div class="fragment" data-fragment-index="1" style="position:absolute; left:33%; width:30%">
### Separators
* No spaces
* Underscore to separate
* Hyphen to combine
<div class="fragment" data-fragment-index="2" style="position:absolute; left:66%; width:30%">
### Date format follows **ISO 8601**<br>
<div class="fragment" data-fragment-index="3" style="width:100%; position:static">
<div style="position:absolute;width:55%">
<b>Bad</b> names
Finacial detailes BIocore 19/11/12.xls
Hlad.jez.M-L-průtoky JíObj.z Ohře-od 10-2011.xlsx
<div style="position:relative;width:55%; bottom:20%; left:50%">
<b>Good</b> names
<div class="fragment" data-fragment-index="3" style="width:100%;">
From Jenny Bryan by CC-BY
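
The naming rules above (no spaces, underscore to separate fields, hyphen to combine words, ISO 8601 date first) can be sketched in a few lines of Python. This is only an illustration: the helper name `make_filename` and the example fields are assumptions, not part of any LCSB tool.

```python
import re
from datetime import date


def make_filename(day: date, *parts: str, ext: str) -> str:
    """Join an ISO 8601 date and descriptive fields into one file name."""

    def slug(text: str) -> str:
        # Lowercase, then turn every run of characters that is not an
        # ASCII letter or digit (spaces, dots, accents) into one hyphen.
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # Underscores separate fields; hyphens combine words inside a field.
    fields = [day.isoformat()] + [slug(part) for part in parts]
    return "_".join(fields) + "." + ext.lstrip(".")


print(make_filename(date(2020, 3, 12), "IT101", "Data housekeeping", ext="md"))
# -> 2020-03-12_it101_data-housekeeping.md
```

Because the date comes first and is written year-month-day, plain alphabetical sorting of such names is also chronological sorting.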
# Data housekeeping
## File organization
* Have folder organization conventions for your **group**
* Per Paper
* Per Study/Project
* Per Collaborator
* Keep <b>readme files</b> for data
* Title
* Date of Creation/Receipt
* Instrument or software specific information
* People involved
* Relations between multiple files/folders
* Separate files you are actively working on from the old ones
* Orient newcomers to the group's conventions
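
One way to make readme files a habit is to generate a skeleton whenever a new data folder is created. The sketch below is an assumption about how this could look; the file name `README.txt`, the `TODO` placeholders and both function names are hypothetical. The field list is taken from the bullets above.

```python
from datetime import date
from pathlib import Path

# Fields recommended on the slide for a data readme.
README_FIELDS = [
    "Title",
    "Date of creation/receipt",
    "Instrument or software specific information",
    "People involved",
    "Relations between multiple files/folders",
]


def readme_text(title: str, received: date) -> str:
    """Return a readme skeleton; unknown fields are left as TODO."""
    lines = [
        f"{README_FIELDS[0]}: {title}",
        f"{README_FIELDS[1]}: {received.isoformat()}",
    ]
    lines += [f"{field}: TODO" for field in README_FIELDS[2:]]
    return "\n".join(lines) + "\n"


def write_readme(folder: Path, title: str, received: date) -> Path:
    """Create README.txt in folder unless one already exists."""
    target = folder / "README.txt"
    if not target.exists():  # never overwrite a readme someone has filled in
        target.write_text(readme_text(title, received), encoding="utf-8")
    return target
```

The remaining `TODO` entries are meant to be completed by the person who created or received the data, while the details are still fresh.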
# Data housekeeping
<div style="position:absolute">
## When working
* Clarify and separate source and intermediate data
* Keep data copies to a **minimum**
* Cleanup post-analysis
* Cleanup copies created for presentations or for sharing
* Hand over data to a new responsible person when leaving
<div style="position:relative;left:50%; width:40%">
<img src="slides/img/cleaning-table.jpg" height="450px">
# Data housekeeping
## End of project
* data should be kept as a single copy on server-side storage
* no copies on desktops and external devices
* non-proprietary formats
* minimal metadata:
* source
* context of generation
* data structure
* content
* sensitive data (e.g. whole genome) **must** be encrypted
* If not specified otherwise, data must be kept for **10 years** following project end for reproducibility purposes
<aside class="notes">
Note: sometimes it is hard to find or understand a dataset that is only 10 days old
## In doubt on data archival?
Contact R<sup>3</sup> for support on archival of datasets using tickets:
* Home > Catalog > LCSB > Biocore: Application services > Request for: Support
# Data housekeeping - Summary
## Server is your friend!
* Allows a consistent backup policy for your datasets
* Keeps number of copies to minimum
* Specification of clear access rights
* High accessibility
* Data are discoverable
* Server can't be stolen
## General guidelines
* Use institutional media for storage of **all** data
* Research data (particularly sensitive data) should be in a single source location
* Enable encryption for data stored on movable media
* Clarify and separate source and intermediate data
* Disable write access to relevant source data (read-only)
* Backup research data!
* Install antivirus software
* Generate checksums
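
The guideline "Disable write access to relevant source data (read-only)" can be applied to a whole folder with a short script. This is a minimal sketch using only the Python standard library; the function name `make_read_only` is an assumption. On shared servers, access rights are usually better managed by the storage administrators.

```python
import os
import stat
from pathlib import Path

# Write-permission bits for owner, group and others.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(root: Path) -> int:
    """Clear the write bits of every file under root.

    Returns the number of files whose permissions were changed.
    """
    changed = 0
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        mode = stat.S_IMODE(path.stat().st_mode)
        if mode & WRITE_BITS:
            os.chmod(path, mode & ~WRITE_BITS)
            changed += 1
    return changed
```

Running it a second time on the same folder changes nothing and returns 0, so it is safe to repeat after new source files arrive.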
# Data and metadata
<div style="display:grid;grid-gap:100px;grid-template-columns: 40% 40%">
<div >
## Data
* "*information in digital form that can be transmitted or processed*"
<p align="right">-- Merriam-Webster dictionary</p>
* "*information in an electronic form that can be stored and processed by a computer*"
<p align="right">-- Cambridge dictionary</p>
## Metadata
* data describing other data
* information that is given to describe or help you use other information
* metadata are data
* can be processed and analyzed
<div class="fragment">
## Metadata examples:
<div style="position:absolute">
<li> LabBook </li>
<li> author/owner of the data</li>
<li> origin of the data</li>
<li> data type</li>
<div style="position:absolute;left:25%">
<li> description of content </li>
<li> modification date </li>
<li> description of modification </li>
<li> location </li>
<div style="position:relative;left:50%;top:0.7em">
<li> calibration readings</li>
<li> software/firmware version</li>
<li> data purpose</li>
<li> means of creation</li>
<div class="fragment">
<center style="color:red">!Insufficient metadata make the data useless!</center>
<aside class="notes">
Sometimes metadata collection takes more time than data collection
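
Since metadata are data, they can be stored in a machine-readable file next to the dataset they describe. The sketch below writes a small JSON "sidecar" file holding a few of the example fields listed above. The `.meta.json` suffix, the field names and the function names are assumptions made for illustration, not an LCSB standard.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def describe(data_file: Path, author: str, origin: str, description: str) -> dict:
    """Collect basic metadata for one data file."""
    info = data_file.stat()
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "file": data_file.name,
        "author": author,
        "origin": origin,
        "description": description,
        "size_bytes": info.st_size,
        "modified": modified.isoformat(),  # ISO 8601 timestamp in UTC
    }


def write_sidecar(data_file: Path, author: str, origin: str, description: str) -> Path:
    """Write <name>.meta.json next to the data file and return its path."""
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    record = describe(data_file, author, origin, description)
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

A file written this way can itself be processed and analyzed, for example to list every dataset received from one collaborator.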
# LCSB research data
three categories:
* **Primary data**
* scientific data
* measurements, images, observations, notes, surveys, ...
* models, software codes, libraries, ...
* metadata directly describing the data
* data dictionaries
* format, version, coverage descriptions, ...
* **Research record**
* description of the research process, including experiment
* experiment set-up
* followed protocols
* ...
* **Project accompanying documentation**
* ethical approvals, information on consent
* collaboration agreements
* intellectual property ownership
* other relevant documentation
# Typical flow of data
<div style="display:grid;grid-gap:10px;grid-template-columns: 30% 20% 30%;
grid-auto-flow:column;grid-template-rows: repeat(4,auto);position:relative;left:8%">
<div class="content-box fragment" data-fragment-index="1">
<div class="box-title red">Source data</div>
<div class="content">
* Experimental results
* Large data sets
* Manually collected data
* External
<div class="content-box fragment" data-fragment-index="2">
<div class="box-title yellow">Intermediate</div>
<div class="content">
* Derived data
* Tidy data
* Curated sets
<div class="content-box fragment" data-fragment-index="3">
<div class="box-title blue">Analyses</div>
<div class="content">
* Exploratory
* Model building
* Hypothesis testing
<div class="content-box fragment" data-fragment-index="4">
<div class="box-title green">Dissemination</div>
<div class="content">
* Manuscript, report, presentation, ...
<img src="slides/img/data-flow_sources.png" height=60%>
<img src="slides/img/data-flow_transformation.png" height=60%>
<img src="slides/img/data-flow_chart.png" height=60%>
<img src="slides/img/data-flow_paper.png" height=60%>
<div class="content-box fragment" data-fragment-index="5">
<div class="box-title red">Preserve</div>
<div class="content">
* Version data sets
* Backup
* Protect
<div class="content-box fragment" data-fragment-index="6">
<div class="box-title yellow">Reproduce</div>
<div class="content">
* Automate your builds
* Use workflow tools (e.g. Snakemake)
<div class="content-box fragment" data-fragment-index="7">
<div class="box-title blue">Trace</div>
<div class="content">
* Multiple iterations.
* Code versioning (Git)
<div class="content-box fragment" data-fragment-index="8">
<div class="box-title green">Track</div>
<div class="content">
* Through multiple versions
<aside class="notes">
flow of the data is downstream (mostly), but you are going back and forth
applies to all data (financial report, lab safety assessment)
# FAIR (meta)data principles
* dates back to 2014
* well accepted by scientific community
* necessity in data driven science
* officially embraced by EU and G20
* required by funding agencies and journal publishers
<img src="slides/img/fair-principles.png" height="400px">
# LCSB How-Tos
<iframe data-src="" height="600px" width="1200px"></iframe>
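# Packages: data.table (tables), ggplot2 >= 3.3.0 (plots), ggpubr (ggarrange)
library(data.table)
library(ggplot2)
library(ggpubr)
iris <- data.table(iris)
# Keep rows 1-103 only, so the species have unequal group sizes (50, 50, 3)
iris <- iris[1:103]
# Bar chart of the mean: hides the spread and the group size
g1 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_bar(aes(fill = Species), stat = "summary", fun = "mean") +
  guides(fill = "none") + ylim(c(0, 8)) + ylab("Mean Sepal Length")
# Box plot: shows the spread
g2 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(fill = Species)) +
  ylim(c(0, 8)) + guides(fill = "none") + ylab("Sepal Length")
# Box plot with jittered points: also shows the number of observations
g3 <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot(aes(fill = Species)) +
  ylim(c(0, 8)) + geom_point(position = "jitter") +
  guides(fill = "none") + ylab("Sepal Length")
# Arrange the three panels side by side, then save the combined figure
p <- ggarrange(g1, g2, g3, nrow = 1)
ggsave(filename = "../plot-data.png", plot = p, device = "png", width = 12, height = 6)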
# IT101 - Working with computers
<br>IT101 - Working with computers<br>
## March 12th, 2020
<div style="top: 6em; left: 0%; position: absolute;">
<img src="theme/img/lcsb_bg.png">
<div style="top: 5em; left: 60%; position: absolute;">
<img src="slides/img/r3-training-logo.png" height="200px">
Vilem Ded<br>
Data Steward<br><br>
<i>Luxembourg Centre for Systems Biomedicine</i>
# Data housekeeping
## Available data storage
<div class='fragment' style="position:absolute">
<img src="slides/img/LCSB_storages_full.png" height="750px">
<div class='fragment' style="position:absolute">
<img src="slides/img/LCSB_storages_personal-crossed.png" height="750px">
# Data ingestion/transfer
## Receiving and sending data
<img height="450px" style="position:relative;left:10%" src="slides/img/banned_exchange_channels.png"><br>
<div style="position:absolute; left:10%;width:30%">
## E-mail is not for data transfer
* Avoid transfer of any data by e-mail
* E-mail is a poor repository
* Data can be read on passage
<div class="fragment" style="left:50%; width:30%; position:absolute">
## Exchanging data
* Share on Atlas server
* OwnCloud share (LCSB - BioCore)
* DropIt service (SIU)
* AsperaWeb share for sensitive data
# Data ingestion/transfer
Data can be corrupted:
* (non-)malicious modification
* faulty file transfer
* disk corruption
<div class="fragment">
### Solution
* disable write access to the source data
* Generate checksums!
<div style="position:absolute;left:40%">
<img src="slides/img/checksum.png" width="500px">
<div class="fragment" style="position:relative; left:0%">
## When to generate checksums?
* before data transfer
- new dataset from collaborator
- upload to remote repository
* long term storage
- master version of dataset
- snapshot of data for publication
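
A checksum can be generated and verified with Python's standard `hashlib` module. This is a minimal sketch: the choice of SHA-256 and the function names `sha256sum` and `verify` are assumptions, since the slides do not prescribe an algorithm or a tool.

```python
import hashlib
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file as a hex string.

    The file is read in chunks so large datasets do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, expected: str) -> bool:
    """Compare a file against a checksum recorded earlier."""
    return sha256sum(path) == expected.strip().lower()
```

The sender computes the checksum before the transfer and passes it on through a separate channel. The receiver runs `verify` on the received file; a result of `False` means the file was modified or corrupted on the way.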
# Introduction
<div class="fragment" style="position:absolute">
<img height="450px" src="slides/img/wordcloud.png"><br>
## Learning objectives
* How to manage your data
* How to look at and analyze your data
* Solving issues with computers
* Reproducibility in the research data life cycle
<div class="fragment" style="position:relative;left:50%; width:40%">
<div >
<img height="405px" src="slides/img/rudi_balling.jpg"><br>
Prof. Dr. Rudi Balling, director
## Pertains to practically all people at LCSB
* Scientists
* PhD candidates
* Technicians
* Administrators
## Overview
0. Introduction - learning objectives + targeted audience
1. Data workflow
1. Ingestion:
* receiving/sending/sharing data
* file naming
* checksums
* backup
1. making data tidy
* what is a table
1. Learning to code workflows and analyses - excel files, coding
1. Code versioning and reproducibility
1. Visualization
* see the data
1. problem solving
* guide
* rubberducking
* google for help
* oracle
1. R3 team
1. Acknowledgment
1. data minimization
# Problem solving
A guide for solving computing issues
1. Express the problem
* Write down what you want to achieve
2. Search for help
* Read **FAQs**, **help pages** and the **official documentation** well before turning to Google
* Use stack exchange, forums and related resources carefully
3. Ask an expert
* You have to submit the problem in writing
* The Oracle answers a question only once, or if it finds the problem interesting
* If you supply a trivial problem, it will stop answering
* Available Oracles
* Service Now @ [] (Uni and LCSB helpdesk)
* Stack Overflow and other online sites
* Local experts
# Responsible and Reproducible Research (R<sup>3</sup>)
## What is R<sup>3</sup>?
A multifaceted change management
process built on 3 pillars:
- R3 pathfinder
- R3 school
- R3 accelerator
Common link module: R3 clinic
<div style="top: -1em; left: 50%; position: absolute;">
<img src="slides/img/3pillars-full.png">
<aside class="notes">
Pathfinder - policies, data management changes<br>
School - courses, howtos, trainings<br>
Accelerator - advanced teams and their boost/support, CI/CD setup<br>
Clinic - hands-on, meetings in groups, code review + suggestions<br>
## R<sup>3</sup> Training
* LCSB's Monthly Data Management and Data Protection training
* ELIXIR Luxembourg's Best practices in research data management and stewardship - 27th May 2020
* R<sup>3</sup> school Git basics - every 4 months
<aside class="notes">
Direct newcomers to this monthly training
# Responsible and Reproducible Research (R<sup>3</sup>)