---
layout: page
permalink: /internal/good-practice/file_naming/
shortcut: good-practice:file_naming
redirect_from:
  - /cards/good-practice:file_naming
  - /internal/cards/good-practice:file_naming
---
# Naming files
(Re)naming a file is a very easy operation, usually one or two clicks away (*right click + rename, F2, ...*). Maybe that's why people do not pay enough attention to choosing a proper file name, even though it has a big impact on their ability to find files later and to understand what they contain.

A good file name follows three basic principles:

 * machine readable
 * human readable
 * plays well with default ordering

## Machine readable
  Special characters can have different meanings in different operating systems or software. The most commonly found are

**#$%&'(")*+,-./:&#59;<=>?@[\]^_`{|}~**
  and
whitespace characters like **space** or **tab**.

  The only two special characters recommended in file names are the hyphen "**-**" and the underscore "**_**". Use underscores to separate fields and hyphens to combine words within a field.
  The file name

```
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv
```

already tells us the date of creation (2013-06-26), the assay (BRAFWTNEGASSAY), the sample set (Plasmid-Cellline-100-1MutantFraction) and the well (A01). The following names, on the other hand,

```
2013-06-26-BRAFWTNEGASSAY-Plasmid-Cellline-100-1MutantFraction-A01.csv
.csv
2013_06_26_BRAFWTNEGASSAY_Plasmid_Cellline_100_1MutantFraction_A01.csv
```
are much more prone to misinterpretation, because the separator no longer tells you where one field ends and the next begins.
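
To see why the choice of separator matters, here is a minimal Python sketch that drops the extension and splits the rest of a name on a given separator. The helper name `split_fields` is ours for illustration and not part of any library.

```python
def split_fields(name: str, separator: str) -> list[str]:
    """Drop the extension and split the rest of the name on `separator`."""
    stem = name.rsplit(".", 1)[0]
    return stem.split(separator)

good = "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv"
hyphens_only = "2013-06-26-BRAFWTNEGASSAY-Plasmid-Cellline-100-1MutantFraction-A01.csv"

print(split_fields(good, "_"))
# ['2013-06-26', 'BRAFWTNEGASSAY', 'Plasmid-Cellline-100-1MutantFraction', 'A01']

print(len(split_fields(hyphens_only, "-")))
# 9 pieces: the date and the sample set have been torn apart
```

With underscores between fields and hyphens inside them, one split recovers exactly the four fields. With a single separator for everything, a program (or a colleague) has to guess which pieces belong together.
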
#### Accented characters
  Your language might be rich in accented or other special characters,
  but both your colleagues and your machines will have a hard time working with them.
  Letters like **ç**, **ä**, **ô**,
  **ě**, **ŕ**, etc. require a specific encoding and can cause trouble when used in file names.
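
If names with accented letters already exist, one option is to transliterate them to plain ASCII. The following is a minimal sketch using Python's standard `unicodedata` module; the function name `to_ascii` is ours. Letters that have no Unicode decomposition (for example **ø** or **ł**) are simply dropped by this approach, so check the result before renaming anything.

```python
import unicodedata

def to_ascii(name: str) -> str:
    """Strip accents: decompose each letter (NFKD), then drop the non-ASCII combining marks."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("çäôěŕ"))
# caoer
print(to_ascii("2019-12-11_mesures-échantillons.csv"))
# 2019-12-11_mesures-echantillons.csv
```
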


Beware of typos, and avoid using several names that differ only slightly unless the difference carries real meaning. The following file names are all distinct, but can you tell exactly where they differ?

```
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFractions_B03.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Celline-100-1MutantFraction_B03.csv
2013-06-26_BRAFWTNEGASSAY_Plazmid-Cellline-100-1MutantFraction_B03.csv
```
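
Differences like these are easier to spot by machine than by eye. As a rough sketch, assuming you have one correctly spelled name to compare against, the hypothetical helper `unexpected_parts` below reports every part of a name that does not occur in that reference.

```python
import re

def unexpected_parts(name: str, reference: str) -> list[str]:
    """Return the parts of `name` (split on '_', '-' and '.') that do not occur in `reference`."""
    known = set(re.split(r"[_.-]", reference))
    return [part for part in re.split(r"[_.-]", name) if part not in known]

reference = "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B03.csv"
suspects = [
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFractions_B03.csv",
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Celline-100-1MutantFraction_B03.csv",
    "2013-06-26_BRAFWTNEGASSAY_Plazmid-Cellline-100-1MutantFraction_B03.csv",
]

for name in suspects:
    print(unexpected_parts(name, reference))
# ['1MutantFractions']
# ['Celline']
# ['Plazmid']
```
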

#### Exploiting machine readable names
You may already have collected a lot of files for your project, or you may have received a big dataset from one of your collaborators. In that case, consider organizing and renaming them so that they comply with your new or existing naming policy.
If the names are consistent and you don't want to lose time renaming them by hand, you can use a dedicated tool (e.g. [PSRenamer](https://www.powersurgepub.com/products/psrenamer/index.html)) or simple commands in your command line (e.g. **rename** on Linux and Mac, where it may have to be installed first, or **ren** on Windows).
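
A scripted rename is another option. The sketch below is only an illustration under our own assumptions: the cleaning rule (whitespace becomes a hyphen, every other special character is dropped) and the function names `clean_name` and `plan_renames` are ours, not an established tool. It deliberately renames nothing and only returns a plan that you can inspect first.

```python
import re
from pathlib import Path

def clean_name(name: str) -> str:
    """Turn whitespace runs into a hyphen and drop everything except letters, digits, '.', '_' and '-'."""
    name = re.sub(r"\s+", "-", name.strip())
    return re.sub(r"[^A-Za-z0-9._-]", "", name)

def plan_renames(folder: Path) -> list[tuple[Path, Path]]:
    """Return (old, new) path pairs for files whose name would change. Nothing is renamed here."""
    plan = []
    for path in sorted(folder.iterdir()):
        new_name = clean_name(path.name)
        # Skip folders, names that are already clean, and names that would become empty.
        if path.is_file() and new_name and new_name != path.name:
            plan.append((path, path.with_name(new_name)))
    return plan

print(clean_name("BRAF assay (plate 2).csv"))
# BRAF-assay-plate-2.csv
```

To apply a plan, loop over `plan_renames(Path("."))` and call `old.rename(new)` for each pair, after checking that no two files map to the same new name.
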

Once your skills develop, you will be able to use machine readable file names to perform more advanced operations, e.g. searching with regular expressions.
Imagine a folder with thousands of files. Running a simple R command
```
flist <- list.files(pattern = "Plasmid")
```
will give you all file names containing the word "Plasmid".

```
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A02.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A03.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B01.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B02.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B03.csv
```

This result can easily be processed further into a metadata table by splitting each name at the underscores and the dot:

```
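# str_split_fixed() returns a character matrix, so convert it to a data frame before naming the columns
flist_df <- as.data.frame(stringr::str_split_fixed(flist, "[_\\.]", 5))
names(flist_df) <- c("Date", "Assay", "Sample_set", "Well", "Format")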
```

| Date         | Assay            | Sample_set                             | Well  | Format |
|--------------|------------------|----------------------------------------|-------|--------|
| "2013-06-26" | "BRAFWTNEGASSAY" | "Plasmid-Cellline-100-1MutantFraction" | "A01" | "csv"  |
| "2013-06-26" | "BRAFWTNEGASSAY" | "Plasmid-Cellline-100-1MutantFraction" | "A02" | "csv"  |
| "2013-06-26" | "BRAFWTNEGASSAY" | "Plasmid-Cellline-100-1MutantFraction" | "A03" | "csv"  |
| "2013-06-26" | "BRAFWTNEGASSAY" | "Plasmid-Cellline-100-1MutantFraction" | "B01" | "csv"  |
| "2013-06-26" | "BRAFWTNEGASSAY" | "Plasmid-Cellline-100-1MutantFraction" | "B02" | "csv"  |
| "2013-06-26" | "BRAFWTNEGASSAY" | "Plasmid-Cellline-100-1MutantFraction" | "B03" | "csv"  |

Of course, similarly simple and powerful commands can be found in every programming language/interpreter (Python, Bash, ...).
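
As an illustration, here is a rough Python counterpart of the two R steps above (filter by word, then split into fields). The names `FIELDS` and `parse_name` are ours, and the list of names is typed in by hand; in practice it would come from something like `os.listdir()`.

```python
import re

FIELDS = ("Date", "Assay", "Sample_set", "Well", "Format")

def parse_name(name: str) -> dict[str, str]:
    """Split a file name at underscores and dots into the five named fields."""
    parts = re.split(r"[_.]", name, maxsplit=4)
    return dict(zip(FIELDS, parts))

names = [
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv",
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_B03.csv",
    "README.md",
]

# Keep only the names containing "Plasmid", then turn each one into a row of metadata.
rows = [parse_name(name) for name in names if "Plasmid" in name]
print(len(rows))        # 2
print(rows[0]["Well"])  # A01
```
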

#### Case sensitivity
It is generally recommended **not** to use upper case letters.
Firstly, matching patterns and splitting names that mix upper and lower case letters is harder and more error prone. Secondly, the default file systems on Windows and Mac are case insensitive, while those on Linux are usually case sensitive, so names that differ only in case can clash when files move between systems.

If you really want to extend the hyphen-underscore semantic separation, you can use so-called [**camelCase**](https://en.wikipedia.org/wiki/Camel_case), which replaces the spaces between words by upper-casing the first letter of each following word.
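
Names that differ only in case are easy to detect before they cause trouble. A minimal sketch, with a function name (`case_collisions`) of our own choosing, groups names by their case-folded form and reports every group with more than one member:

```python
from collections import defaultdict

def case_collisions(names: list[str]) -> list[list[str]]:
    """Group names that are identical once upper/lower case is ignored."""
    groups = defaultdict(list)
    for name in names:
        groups[name.casefold()].append(name)
    return [group for group in groups.values() if len(group) > 1]

print(case_collisions(["Plasmid_A01.csv", "plasmid_a01.csv", "plasmid_b01.csv"]))
# [['Plasmid_A01.csv', 'plasmid_a01.csv']]
```
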




#### Machine readable names allow us to:
 * easily search for files later
 * easily narrow file lists based on names
 * easily extract info from file names, e.g. by splitting

Remember that the rules on machine readability also apply to naming your **folders** (now containing your nicely named files). In fact, it is good practice to stick to these rules even when naming **variables** in your data files.
## Human readable

  * Be specific. It is generally better to create a longer file name that fulfils its purpose than to use short abbreviations that are hard to grasp for your colleagues, or even for yourself after some time. Stay away from cryptic names and non-standard or unclear abbreviations.

| Bad name                  | Better name                                           |
| ------------------------- | ----------------------------------------------------- |
| myabstract.txt            | John-White_Sensitivity-of-PLFA-analyses_abstract.txt  |
| samples_project_start.csv | PA324_samples_2019-12-11.csv                          |
| ms_cresp_final.doc        | John-White_Cell-respiration-manuscript_2019-12-11.doc |
| fig_1.png                 | John-White_Cell-respiration_fig-1_2019-12-11.png      |

  * Usually, the file extension already tells you something about the file itself.
  Here are some examples of file names which are unnecessarily long and could easily be shortened:
  `````
  Iris-setosa_table.csv
  video_2019_annual-meeting.avi
  2019-12-11_notes.log
  ATAC_seq1_London_mapped.bam
  A2452_description-tutorial.info
  `````
  * Never use suffixes (or prefixes) like **"final"**, **"old"**, **"new"**, **"current"**, **"obsolete"**, **"recent"**, **"latest"**, **"best"**...
  A file rarely stays in such a state for long; it will change sooner or later anyway.

  * The name should naturally explain why the file exists. If you have to search for additional information (either by asking your colleagues or by reading some README files), the file name is probably not well chosen. Name the file in a way that even a total stranger could understand it easily.

  * Leave out meaningless or redundant words, e.g. "the", "and", "a", "file", "data" ...

  * Do not be too creative, do not pun and stay professional. Bad examples:

  ```
  bio-rect_UM.csv - data related to bio-reactors at University of Michigan
  PEPA_d-pic.jpeg - a fourth picture from your paper on Performance Evaluation Process Algebra
  ```
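
Some of the rules above can be checked automatically. The sketch below flags the status words and filler words listed in this section; the word lists are copied from the text, while the function name `flag_words` and the idea of running such a check are our own suggestion.

```python
import re

STATUS_WORDS = {"final", "old", "new", "current", "obsolete", "recent", "latest", "best"}
FILLER_WORDS = {"the", "and", "a", "file", "data"}

def flag_words(name: str) -> list[str]:
    """Return the discouraged words found in a file name, ignoring its extension."""
    stem = name.rsplit(".", 1)[0]
    words = [word.lower() for word in re.split(r"[_-]", stem) if word]
    return sorted(word for word in words if word in STATUS_WORDS | FILLER_WORDS)

print(flag_words("ms_cresp_final.doc"))
# ['final']
print(flag_words("John-White_Cell-respiration_fig-1_2019-12-11.png"))
# []
```
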
#### Semantic versioning
If your files or documents change very often and you want to track the versions manually instead of using dedicated version control software<!-- TODO: link to GIT howto-card -->, you can follow the semantic versioning scheme widely used in software development.
It is based on adding several numbers (three by default) as a suffix to your file name, where:

  * the first number, called the **MAJOR** version, is increased once the document has undergone **significant changes**
  * the second number, called the **MINOR** version, is incremented once some new information is added to the document or something is deleted
  * the last number, called **PATCH**, refers to very minor changes like fixing typos or rephrasing a sentence.

  The numbers can be preceded by the letter "V" to indicate that version information follows.
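
The scheme can be sketched in a few lines of Python. Several details here are assumptions of ours rather than part of the text above: the suffix layout `_v<MAJOR>-<MINOR>-<PATCH>` (lower case "v", hyphens between the numbers to stay within the recommended characters), the function name `bump`, and the resetting of the lower numbers to zero, which follows the usual semantic versioning convention.

```python
import re

# Assumed suffix layout: "<name>_v<MAJOR>-<MINOR>-<PATCH>.<extension>"
VERSION_RE = re.compile(r"_v(\d+)-(\d+)-(\d+)(?=\.[^.]+$)")

def bump(name: str, part: str) -> str:
    """Return `name` with its MAJOR, MINOR or PATCH number increased by one."""
    match = VERSION_RE.search(name)
    if match is None:
        raise ValueError(f"no version suffix found in {name!r}")
    major, minor, patch = (int(number) for number in match.groups())
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unknown part {part!r}")
    return f"{name[:match.start()]}_v{major}-{minor}-{patch}{name[match.end():]}"

print(bump("Cell-respiration-manuscript_v1-2-3.doc", "minor"))
# Cell-respiration-manuscript_v1-3-0.doc
```
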


Human readable names allow us to:
  * easily understand what the file is and what it contains
  * easily share files with others

## Default ordering
Built-in tools (e.g. the file explorer) allow you to order files by name in alphanumerical order. Make the most of this feature.

  * Put the terms in general-to-specific order. That way, you will have files grouped in logical order and related files will be naturally close to each other.
    ```
    Ares-hordeum_samples_redundant_2010-05-12.csv
    Ares-triticum_samples_redundant_2010-04-12.csv
    Iris-setosa_samples_1927-05-12.csv
    Iris-setosa_samples_1954-06-24.csv
    Iris-versicolor_samples_1945-04-12.csv
    ```
  * Put the date first to get chronological ordering:
    ```
    2013-06-26_Plasmid_A01.csv
    2014-06-26_Plasmid_C02.csv
    2015-06-30_Plasmid_A03.csv
    2015-07-12_Plasmid_B01.csv
    2015-07-13_Plasmid_B02.csv
    2015-11-10_Plasmid_B03.csv
    ```
  * Put a number defining the explicit order first. Remember that the ordering is done character by character, not by the value of the whole number, so add leading zeros to make sure that the ordering stays correct as the number of your files grows.
    ```
    01_Plasmid_A01_2013-06-26.csv
    02_Plasmid_C02_2014-06-26.csv
    03_Plasmid_A03_2015-06-30.csv
    10_Plasmid_B01_2015-07-12.csv
    11_Plasmid_B02_2015-07-13.csv
    25_Plasmid_B03_2015-11-10.csv
    ```
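
The effect of leading zeros is easy to reproduce. This small Python sketch sorts the same three made-up file names once without and once with zero padding; plain string sorting compares character by character, which is the behaviour described above (some file managers apply a smarter "natural" ordering, but you cannot rely on it everywhere).

```python
numbers = [1, 2, 10]

unpadded = sorted(f"{n}_Plasmid.csv" for n in numbers)
padded = sorted(f"{n:02d}_Plasmid.csv" for n in numbers)

print(unpadded)  # ['10_Plasmid.csv', '1_Plasmid.csv', '2_Plasmid.csv']
print(padded)    # ['01_Plasmid.csv', '02_Plasmid.csv', '10_Plasmid.csv']
```
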

## Dates
  Including the date in your file names allows you to sort them easily and to find exactly the one you want in a very short time.
Remember that recording dates using anything other than numbers (e.g. month abbreviations) can, depending on the language, result in formats like "*11dic2019*" or "*11Dez2019*", which may not be recognized as dates at all.
It is much better to use a purely numeric format, but even then a date can be written in endless variations which are hard to read or, more importantly, ambiguous, like the **11th of December 2019** in the following examples:
```
  19/11/12
  19/12/11
  20191112
  11.12.2019
  11-12-19
  ...
```
  Luckily, there is a standard date format, YYYY-MM-DD ([*ISO 8601*](https://en.wikipedia.org/wiki/ISO_8601)), which complies nicely with all three principles above. Therefore, the **only** correct format for the 11th of December 2019 is:
 ```
  2019-12-11
 ```
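
Most programming languages can produce and parse this format directly. A minimal Python sketch using the standard `datetime` module follows; the helper `leading_date` is our own, and it assumes the date occupies the first ten characters of the file name.

```python
from datetime import date

def leading_date(name: str) -> date:
    """Parse the YYYY-MM-DD prefix of a file name; raises ValueError if it is not a valid date."""
    return date.fromisoformat(name[:10])

print(date(2019, 12, 11).isoformat())
# 2019-12-11

names = [
    "2015-07-12_Plasmid_B01.csv",
    "2013-06-26_Plasmid_A01.csv",
    "2014-06-26_Plasmid_C02.csv",
]
# Plain alphabetical sorting of ISO-dated names is already chronological.
print(sorted(names) == sorted(names, key=leading_date))
# True
```
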
<!-- TODO: stability of names in shared repository which is not read-only - e.g. someone gets nuts and starts to rename everything. Dangerous if there is any analyses link directly to a file. -->
<!-- TODO: do some guidelines/rules/recommendations apply to different classes of files - source code, data, documents -->
## Final notes
When starting your project or creating a new repository, give yourself time to set up a proper naming scheme.
Remember that it should also be accepted by your teammates and other collaborators accessing the files.
To make sharing the naming scheme as easy as possible, don't forget to document it and include it in the policies of your group/project.

Adopting the proposed recommendations might seem like a lot of work now.
But it will pay off once your projects get more complex and your skills evolve. Choosing good names takes time but saves more than it takes.

If you don't agree with the naming rules adopted in your group, either follow them anyway or make an effort to change them for the whole group.
**Consistency** is much more important than your preferred naming.

# Resources
 * Jenny Bryan's [slides](https://speakerdeck.com/jennybc/how-to-name-files) on "Naming things" from the Reproducible Science Workshop, Duke, 2015
 * Semantic versioning - [semverdoc.org](https://semverdoc.org/)
 * LCSB *IT101* training [presentation](https://git-r3lab.uni.lu/R3/howto-cardsrds/uploads/738930b9a533a2f308cc62c431d9246f/it101.html)