# Data housekeeping
## File names
<div  style="display:flex; position:static; width:100%">
<div class="fragment" data-fragment-index="0" style="position:static; width:30%">

### General principles  
  * Machine readable
  * Human readable
  * Plays well with default ordering
</div>
<div class="fragment" data-fragment-index="1" style="position:absolute; left:33%; width:30%">

### Separators  
  * No spaces
  * Underscore to separate
  * Hyphen to combine
  
</div>
<div class="fragment" data-fragment-index="2" style="position:absolute; left:66%; width:30%">

### Date format follows **ISO 8601**<br>

  2018-12-03<br> 
  2018-12-06_1700  

</div>
</div>
 

<div class="fragment" data-fragment-index="3" style="width:100%; position:static">
<div style="position:absolute;width:55%">
<b>Bad</b> names

```bash
 PhD-project-Jan19 alldata_final.foo
 Finacial detailes BIocore 19/11/12.xls
 ATACseq1Londonmapped.bam
 Hlad.jez.M-L-průtoky JíObj.z Ohře-od 10-2011.xlsx
```
</div>
<div style="position:relative;width:55%; bottom:20%; left:50%">
<b>Good</b> names

```bash
Iris-setosa_samples_1927-05-12.csv
PI102_Mouse12_EEG_2018-11-03_1245.tsv
Bioinfiniti_FullProposal_2018-11-15_1655.do
```
</div>
</div>
<br>
<br>
<div class="fragment" data-fragment-index="3" style="width:100%;">
From Jenny Bryan, under CC-BY  
(https://speakerdeck.com/jennybc/how-to-name-files)
</div>
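
The naming rules above can be applied mechanically. The following is a minimal sketch, not a prescribed tool: the helper `make_filename` and its parameters are hypothetical names introduced here for illustration. It joins name parts with underscores, replaces spaces inside a part with hyphens, and appends the date in the `YYYY-MM-DD` or `YYYY-MM-DD_HHMM` form shown on the slide.

```python
import re
from datetime import datetime


def make_filename(parts, when, extension, with_time=False):
    """Build a file name from parts, a date, and an extension (hypothetical helper)."""
    cleaned = []
    for part in parts:
        # No spaces: whitespace inside a part becomes a hyphen (hyphen combines).
        part = re.sub(r"\s+", "-", part.strip())
        # Underscores are reserved for separating parts, so replace stray ones.
        part = part.replace("_", "-")
        cleaned.append(part)
    # Date stamp in the style used on the slide: 2018-12-03 or 2018-12-06_1700.
    stamp = when.strftime("%Y-%m-%d_%H%M" if with_time else "%Y-%m-%d")
    return "_".join(cleaned + [stamp]) + "." + extension.lstrip(".")


if __name__ == "__main__":
    print(make_filename(["PI102", "Mouse12", "EEG"],
                        datetime(2018, 11, 3, 12, 45), "tsv", with_time=True))
    # prints PI102_Mouse12_EEG_2018-11-03_1245.tsv
```

Because the year comes first and every field is zero-padded, names that share the same prefix sort chronologically under default alphabetical ordering.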



# Data housekeeping
## File organization
* Have folder organization conventions for your **group**
  * Per Paper
  * Per Study/Project 
  * Per Collaborator
* Keep <b>readme files</b> for data  
  * Title
  * Date of Creation/Receipt
  * Instrument or software specific information
  * People involved
  * Relations between multiple files/folders 

* Separate files you are actively working on from the old ones  
* Orient newcomers to the group's conventions
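
One way to make readme files routine is to generate them from a template. The sketch below is only an illustration under assumptions: the function `write_readme`, the file name `README.txt`, and the plain-text layout are hypothetical choices, not a group standard. It records the fields listed above.

```python
import tempfile
from datetime import date
from pathlib import Path


def write_readme(folder, title, created, instrument, people, relations):
    """Write a plain-text readme with the fields listed on the slide (hypothetical helper)."""
    lines = [
        f"Title: {title}",
        f"Date of creation/receipt: {created.isoformat()}",
        f"Instrument or software: {instrument}",
        "People involved: " + ", ".join(people),
        "Relations between files/folders:",
    ]
    # One indented line per relation between files or folders.
    lines += [f"  - {item}" for item in relations]
    path = Path(folder) / "README.txt"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


if __name__ == "__main__":
    # Demonstration in a temporary folder; all values are made-up placeholders.
    with tempfile.TemporaryDirectory() as folder:
        readme = write_readme(
            folder,
            title="Example dataset",
            created=date(2018, 11, 3),
            instrument="Example instrument, software version unknown",
            people=["Person A", "Person B"],
            relations=["raw/ holds source data", "derived/ is computed from raw/"],
        )
        print(readme.read_text(encoding="utf-8"))
```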



# Data housekeeping
<div style="position:absolute">

## When working 
  * Clarify and separate source and intermediate data
  * Keep data copies to a **minimum**
  * Clean up after analysis
  * Clean up copies created for presentations or for sharing
</div> 
<div style="position:relative;left:50%; width:40%">
<img src="slides/img/cleaning-table.jpg" height="450px">
</div>



# Data housekeeping
## End of project
  * Hand over data to a new responsible person when leaving
  * Keep data as a single copy on server-side storage 
    * No copies on desktops and external devices
  * Use non-proprietary formats
  * Provide at least minimal metadata
  * Sensitive data (e.g. whole genome) **must** be encrypted
  <br/>
  <br/>
  * If not specified otherwise, data must be kept for **10 years** following project end for reproducibility purposes
<aside class="notes">
Note: sometimes it is hard to find or understand a dataset that is only 10 days old
</aside>
 
## In doubt about data archival?
Contact R<sup>3</sup> for support with archiving datasets by opening a ticket:
  * https://service.uni.lu/sp
  * Home > Catalog > LCSB > Biocore: Application services > Request for: Support

<div style="position:absolute; width:45%; left:50%; top:28em; text-align:right">
<a href="https://howto.lcsb.uni.lu/?policies:LCSB-POL-BIC-03" style="color:grey; font-size:0.8em;">Research Data Retention and Archival Policy</a>
</div>



# Data housekeeping - Summary
## Server is your friend!
  * Allows a consistent backup policy for your datasets
  * Keeps the number of copies to a minimum
  * Allows specification of clear access rights
  * High accessibility
  * Data are discoverable
  * Server can't be stolen
  
## General guidelines
  * Use institutional media for storage of **all** data
  * Research data (particularly sensitive data) should be kept in a single source location
  * Enable encryption for data stored on movable media
  * Clarify and separate source and intermediate data
  * Disable write access to relevant source data (read-only)
  * Back up research data!
  * Install anti-virus software
  * Generate checksums
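
Two of these guidelines, generating checksums and making source data read-only, are easy to script. The following is a minimal sketch using only the Python standard library; the function names `sha256sum` and `make_read_only` are hypothetical, and SHA-256 is an assumed choice of algorithm rather than one prescribed here.

```python
import hashlib
import os
import stat
import tempfile
from pathlib import Path


def sha256sum(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_read_only(path):
    """Clear the write permission bits for user, group, and others."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


if __name__ == "__main__":
    # Demonstration on a throwaway file in a temporary folder.
    with tempfile.TemporaryDirectory() as folder:
        source = Path(folder) / "source-data_2018-12-03.csv"
        source.write_bytes(b"sample,value\nA,1\n")
        print(sha256sum(source), source.name)
        make_read_only(source)
```

Storing the digest next to the data lets anyone confirm later that a copy or archive is still identical to the source: recompute the checksum and compare the two values. On Windows, `os.chmod` only toggles the read-only flag, so access rights on a shared server are better managed by the storage system itself.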