Commit ae8a8e75 authored by Laurent Heirendt

Merge branch 'formatting-fn' into 'develop'

formatting of file naming card

See merge request !121
parents a09bf719 aeb4d844
Pipeline #23305 passed with stages
in 1 minute and 49 seconds
......@@ -7,63 +7,65 @@ redirect_from:
- /external/cards/integrity:naming
---
# Naming files
(Re)naming a file is a very easy operation, usually one or two clicks away (*right click + rename, F2, ...*). Maybe that's why people do not pay enough attention to choosing a proper file name, even though it can have a big impact on their ability to find those files later and to understand what they contain.
A good file name follows three basic principles:
* machine readable
* human readable
* plays well with default ordering
## Machine readable
Special characters can have different meanings in different operating systems or software. The most commonly found are
**#$%&'(")*+,-./:&#59;<=>?@[\]^_`{|}~**
and
whitespace characters such as **space** or **tab**.
The only two special characters recommended in file names are the hyphen "**-**" and the underscore "**_**". You can use the underscore to separate fields and the hyphen to combine words within a field.
The file name `2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv`
already gives us some information about the date of creation (2013-06-26), assay (BRAFWTNEGASSAY), sample set (Plasmid-Cellline-100-1MutantFraction) and well (A01), while the following names
```text
2013-06-26-BRAFWTNEGASSAY-Plasmid-Cellline-100-1MutantFraction-A01.csv
.csv
2013_06_26_BRAFWTNEGASSAY_Plasmid_Cellline_100_1MutantFraction_A01.csv
```
are much more prone to misinterpretation.
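The separator convention can also be checked by a machine. The following is a minimal Python sketch, not part of the original card: the allowed character set and the helper names `is_machine_readable` and `split_fields` are assumptions made for illustration. It shows why the name using both separators splits into the four intended fields while the other two do not.

```python
import re

# Assumed policy: letters, digits, hyphen and underscore, followed by one extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")


def is_machine_readable(file_name: str) -> bool:
    """Return True if the whole name uses only the characters allowed above."""
    return SAFE_NAME.fullmatch(file_name) is not None


def split_fields(file_name: str) -> list:
    """Drop the extension and split the rest at underscores; hyphens stay inside a field."""
    stem = file_name.rsplit(".", 1)[0]
    return stem.split("_")


good = "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv"
only_hyphens = "2013-06-26-BRAFWTNEGASSAY-Plasmid-Cellline-100-1MutantFraction-A01.csv"
only_underscores = "2013_06_26_BRAFWTNEGASSAY_Plasmid_Cellline_100_1MutantFraction_A01.csv"

print(len(split_fields(good)))              # 4 fields: date, assay, sample set, well
print(len(split_fields(only_hyphens)))      # 1 field: nothing to split on
print(len(split_fields(only_underscores)))  # 9 fields: date and sample set fall apart
```

All three names pass `is_machine_readable`, so the character check alone is not enough; the field count is what reveals the ambiguous ones.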
## Accented characters
Your language might be very rich in accented or special characters,
but both your colleagues and your machines will have a hard time working with them.
Special letters like **ç**, **ä**, **ô**,
**ě**, **ŕ**, etc. require special encoding and might cause trouble when used in file names.
Beware of typos and avoid using multiple names that vary in small ways unless the difference carries real meaning. The following file names are distinct, but can you tell exactly where they differ?
```text
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFractions_B03.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Celline-100-1MutantFraction_B03.csv
2013-06-26_BRAFWTNEGASSAY_Plazmid-Cellline-100-1MutantFraction_B03.csv
```
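Differences like these are hard to spot by eye but easy to flag automatically. The sketch below is one possible approach, assuming Python's standard `difflib` module; the function name `near_duplicates` and the threshold of 0.9 are illustrative choices, not a recommendation from the original card.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_duplicates(names, threshold=0.9):
    """Return pairs of distinct names whose similarity ratio reaches `threshold`."""
    pairs = []
    for first, second in combinations(names, 2):
        if first == second:
            continue
        # ratio() is 1.0 for identical strings and drops as the strings diverge.
        if SequenceMatcher(None, first, second).ratio() >= threshold:
            pairs.append((first, second))
    return pairs


names = [
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFractions_B03.csv",
    "2013-06-26_BRAFWTNEGASSAY_Plasmid-Celline-100-1MutantFraction_B03.csv",
    "2013-06-26_BRAFWTNEGASSAY_Plazmid-Cellline-100-1MutantFraction_B03.csv",
]
for first, second in near_duplicates(names):
    print(first, "<->", second)
```

Every pair in this list is reported, which is a hint that the names should be reviewed before any analysis relies on them.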
## Exploiting machine readable names
You may already have a lot of files collected for your project, or you may have received a big dataset from one of your collaborators. In that case, you might think about organizing and renaming them to comply with your new or existing naming policy.
If the names are consistent and you don't want to lose time renaming them by hand, you may try dedicated tools (e.g. [PSRenamer](https://www.powersurgepub.com/products/psrenamer/index.html)) or simple commands in your command line (**rename** for Mac and Linux, **ren** for Windows).
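Bulk renaming can also be scripted. The following Python sketch is only an illustration under stated assumptions: the helper names `plan_renames` and `apply_renames` are hypothetical, and replacing whitespace with a hyphen is just an example rule. It first builds a plan that can be reviewed and only then touches the files.

```python
import re
from pathlib import Path


def plan_renames(folder, pattern, replacement):
    """Map each file in `folder` to its proposed new path; nothing is renamed yet."""
    plan = {}
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        new_name = re.sub(pattern, replacement, path.name)
        if new_name != path.name:
            plan[path] = path.with_name(new_name)
    return plan


def apply_renames(plan):
    """Rename the files, refusing to overwrite a file that already exists."""
    for old, new in plan.items():
        if new.exists():
            raise FileExistsError(f"{new} already exists")
        old.rename(new)
```

A typical use would be `plan = plan_renames("data", r"\s+", "-")`, printing the plan for review, and calling `apply_renames(plan)` only once the proposed names look right.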
Once your skills develop, you will be able to use machines and machine readable file names to perform advanced operations on them, e.g. searching with regular expressions.
Imagine a folder with thousands of files. Running a simple R command
```R
flist <- list.files(pattern = "Plasmid")
```
will give you all file names containing the word "Plasmid".
```text
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A02.csv
2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A03.csv
......@@ -74,7 +76,7 @@ will give you all file names containing word "Plasmid".
This result can easily be processed further into a metadata table by splitting each name at the underscores and the dot:
```R
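# str_split_fixed() returns a character matrix, so convert it to a data frame before naming the columns
flist_df <- as.data.frame(stringr::str_split_fixed(flist, "[_\\.]", 5))
names(flist_df) <- c("Date", "Assay", "Sample_set", "Well", "Format")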
```
......@@ -90,24 +92,24 @@ names(flist_df) <- c("Date", "Assay", "Sample_set", "Well", "Format")
Of course, similarly simple and powerful commands can be found in every programming language/interpreter (Python, Bash, ...)
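As an illustration of the same idea in Python, here is a minimal sketch using only the standard library. The function names `parse_name` and `list_metadata` are hypothetical, and the five field names are taken from the R example above; names that do not follow the convention raise an error instead of being silently misparsed.

```python
from pathlib import Path

FIELDS = ("Date", "Assay", "Sample_set", "Well", "Format")


def parse_name(file_name):
    """Split a file name at the underscores and the final dot into named fields."""
    stem, _, extension = file_name.rpartition(".")
    parts = stem.split("_") + [extension]
    if len(parts) != len(FIELDS):
        raise ValueError(f"{file_name!r} does not follow the naming convention")
    return dict(zip(FIELDS, parts))


def list_metadata(folder, keyword="Plasmid"):
    """Collect the metadata of every file in `folder` whose name contains `keyword`."""
    paths = sorted(Path(folder).glob(f"*{keyword}*"))
    return [parse_name(path.name) for path in paths if path.is_file()]


record = parse_name("2013-06-26_BRAFWTNEGASSAY_Plasmid-Cellline-100-1MutantFraction_A01.csv")
print(record["Date"], record["Well"], record["Format"])
```

The resulting list of dictionaries can be handed to any table library, or written out with the standard `csv` module.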
## Case sensitivity
It is generally recommended **not** to use upper case letters.
Firstly, matching patterns and splitting names with upper case letters is much harder and more error prone. Another drawback is that the default file systems on Windows and macOS are case insensitive, unlike those typically used on Linux.
If you really want to extend the hyphen-underscore semantic separation, you can use the so-called [**camelCase**](https://en.wikipedia.org/wiki/Camel_case): removing the spaces between words and upper-casing their first letters instead.
#### Machine readable names allow us to:
* easily search for files later
* easily narrow file lists based on names
* easily extract info from file names, e.g. by splitting
Remember that the rules on machine readability also apply to naming your **folders** (now containing your nicely named files). In fact, it is good practice to stick to these rules even when naming **variables** in your data files.
## Human readable
* Be specific. It is generally better to create longer file name which is fulfilling its purpose than using short abbreviations which might be hard to grasp by your colleagues, eventually by yourself after some time. Stay away from cryptic names and non-standard or unclear abbreviations.
| Bad name | Better name |
| ------------------------- | ----------------------------------------------------- |
......@@ -116,92 +118,110 @@ Remember that the rules on machine readability apply also for naming your **fold
| ms_cresp_final.doc | John-White_Cell-respiration-manuscript_2019-12-11.doc |
| fig_1.png | John-White_Cell-respiration_fig-1_2019-12-11.png |
* Usually, the file extension already tells you something about the file itself.
Here are some examples of file names that are unnecessarily long and could easily be shortened:
```text
Iris-setosa_table.csv
video_2019_annual-meeting.avi
2019-12-11_notes.log
ATAC_seq1_London_mapped.bam
A2452_description-tutorial.info
```
* Never use suffixes (or prefixes) like **"final"**, **"old"**, **"new"**, **"current"**, **"obsolete"**, **"recent"**, **"latest"**, **"best"**...
A file is hardly ever in such a state for long, and it will change sooner or later anyway.
* The name should naturally explain why the file exists. If you have to search for additional information (either by asking your colleagues or by reading README files), the file name is probably not chosen properly. Name the file in a way that even a total stranger could understand it easily.
* Leave out meaningless or redundant words, e.g. "the", "and", "a", "file", "data" ...
* Do not be too creative, do not pun and stay professional. Bad examples:
```text
bio-rect_UM.csv - data related to bio-reactors at University of Michigan
PEPA_d-pic.jpeg - a fourth picture from your paper on Performance Evaluation Process Algebra
```
## Semantic versioning
If your files or documents change very often and you want to track the versions manually instead of using some sophisticated versioning software<!-- TODO: link to GIT howto-card -->, you might follow the semantic versioning scheme widely used in software development.
It is based on adding several numbers (three is the standard) as a suffix of your file name, where:
* the first number, called the **MAJOR** version, is increased once the document has undergone **significant changes**
* the second number, called the **MINOR** version, is incremented once some new information is added to the document or something is deleted
* the last number, called **PATCH**, should refer to very minor changes like fixing typos or rephrasing a sentence.
These can be preceded by the letter "V" to indicate that version information follows.
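The scheme above can be sketched in a few lines of Python. Several things here are assumptions rather than part of the original card: the suffix is written as `_V1-2-3` directly before the extension (hyphens are used so that the dot stays reserved for the extension), the helper name `bump` is hypothetical, and the lower numbers are reset to zero on a bump, as is the convention in semantic versioning.

```python
import re

# Assumed suffix format: _V<MAJOR>-<MINOR>-<PATCH> directly before the extension.
VERSION = re.compile(r"_V(\d+)-(\d+)-(\d+)(?=\.[^.]+$)")


def bump(file_name, part):
    """Return the file name with the chosen part of its version increased."""
    match = VERSION.search(file_name)
    if match is None:
        raise ValueError(f"no version suffix found in {file_name!r}")
    major, minor, patch = (int(number) for number in match.groups())
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError("part must be 'major', 'minor' or 'patch'")
    suffix = f"_V{major}-{minor}-{patch}"
    return file_name[:match.start()] + suffix + file_name[match.end():]


print(bump("John-White_Cell-respiration-manuscript_V1-2-3.doc", "minor"))
```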
Human readable names allow us to:
* easily understand what the file is and what it contains
* easily share files with others
## Default ordering
Built-in tools (e.g. the file explorer) allow you to order files by name in alphanumerical order. Make the most of this great feature.
* Put the terms in general-to-specific order. That way, you will have files grouped in logical order and related files will be naturally close to each other.
```text
Ares-hordeum_samples_redundant_2010-05-12.csv
Ares-triticum_samples_redundant_2010-04-12.csv
Iris-setosa_samples_1927-05-12.csv
Iris-setosa_samples_1954-06-24.csv
Iris-versicolor_samples_1945-04-12.csv
```
* Put the date first to get chronological ordering:
```text
2013-06-26_Plasmid_A01.csv
2014-06-26_Plasmid_C02.csv
2015-06-30_Plasmid_A03.csv
2015-07-12_Plasmid_B01.csv
2015-07-13_Plasmid_B02.csv
2015-11-10_Plasmid_B03.csv
```
* Put a number defining the explicit order first. Remember that the ordering is done character by character, not by the value of the whole number, so you might want to add leading zeros to make sure the ordering stays correct as the number of your files grows.
```text
01_Plasmid_A01_2013-06-26.csv
02_Plasmid_C02_2014-06-26.csv
03_Plasmid_A03_2015-06-30.csv
10_Plasmid_B01_2015-07-12.csv
11_Plasmid_B02_2015-07-13.csv
25_Plasmid_B03_2015-11-10.csv
```
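The effect of leading zeros is easy to reproduce. The following Python sketch uses made-up file names and a hypothetical helper `pad_prefix`; it compares plain string sorting with and without zero padding of the numeric prefix.

```python
def pad_prefix(file_name, width=2):
    """Zero-pad the number in front of the first underscore, if there is one."""
    prefix, separator, rest = file_name.partition("_")
    if not prefix.isdigit():
        return file_name
    return prefix.zfill(width) + separator + rest


names = ["10_Plasmid_B01.csv", "2_Plasmid_C02.csv", "1_Plasmid_A01.csv"]

# Character-by-character comparison puts "10_..." in front of "1_..." and "2_...".
print(sorted(names))

# With leading zeros, the alphabetical order matches the numerical order.
print(sorted(pad_prefix(name) for name in names))
```

The `width` should be chosen with the expected number of files in mind: two digits stop sorting correctly once the hundredth file arrives.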
## Dates
Including the date in your file names allows you to sort them easily and to find exactly the one you want in a very short time.
Remember that recording dates using anything other than numbers (e.g. month abbreviations) can, depending on the language, result in formats like "*11dic2019*" or "*11Dez2019*", which might not be recognized as dates at all.
It is much better to use a purely numeric format, but even then a date can be written in endless variations that are hard to read or, more importantly, ambiguous, like the date **11th of December 2019** in the following examples:
```text
19/11/12
19/12/11
20191112
11.12.2019
11-12-19
...
```
Luckily, there is a standard date format, YYYY-MM-DD ([*ISO 8601*](https://en.wikipedia.org/wiki/ISO_8601)), which complies nicely with all three principles above. Therefore, the **only** correct format of the 11th of December 2019 is:
```text
2019-12-11
```
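Most programming languages can produce and read this format directly. Below is a small Python sketch based on the standard `datetime` module; the helper names `dated_name` and `extract_date` are hypothetical, and the sketch assumes the date is the first field of the file name.

```python
from datetime import date


def dated_name(day, description, extension):
    """Build a file name that starts with the ISO 8601 date."""
    return f"{day.isoformat()}_{description}.{extension}"


def extract_date(file_name):
    """Read the leading YYYY-MM-DD back into a date object."""
    return date.fromisoformat(file_name[:10])


names = [
    dated_name(date(2019, 12, 11), "notes", "log"),
    dated_name(date(2013, 6, 26), "Plasmid_A01", "csv"),
    dated_name(date(2015, 11, 10), "Plasmid_B03", "csv"),
]

# Sorting the names as plain text gives the same order as sorting by the real dates.
print(sorted(names) == sorted(names, key=extract_date))
```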
<!-- TODO: stability of names in shared repository which is not read-only - e.g. someone gets nuts and starts to rename everything. Dangerous if there is any analyses link directly to a file. -->
<!-- TODO: do some guidelines/rules/recommendations apply to different classes of files - source code, data, documents -->
## Final notes
When starting your project or creating a new repository, give yourself time to set up a proper naming design.
Remember that it should also be accepted by your teammates and other collaborators accessing the files.
To make dissemination of the naming design as easy as possible, don't forget to document it and include it in the policies of your group/project.
......@@ -212,7 +232,8 @@ But the truth is that it will pay off once the projects get more complex and you
If you don't agree with the naming rules adopted in your group, either follow them anyway or make an effort to change them for everyone.
**Consistency** is much more important than your preferred naming.
## Resources
* Jenny Bryan's [slides](https://speakerdeck.com/jennybc/how-to-name-files) on "Naming things" from Reproducible Science Workshop, Duke, 2015
* Semantic versioning - [semverdoc.org](https://semverdoc.org/)
* LCSB *IT101* training [presentation](https://git-r3lab.uni.lu/R3/howto-cardsrds/uploads/738930b9a533a2f308cc62c431d9246f/it101.html)