Commit bbf6a822 authored by Leon-Charles Tranchevent

Updated README files.

parent 8bcecb01
# Objectives
The objectives of this step are to analyse the quality of the raw data of the project and to create reports that can be visually inspected to decide which samples should be kept and which could (or should) be discarded.
The objectives of this step are to analyze the quality of the raw data and to create reports that can be visually inspected to decide which samples should be kept and which should be discarded.
# Details and instructions
Affymetrix datasets are analysed using R and the 'arrayQualityMetrics' package, which produces HTML reports with figures and descriptions. Agilent datasets are analysed using home-made scripts, which create figures but do not summarize the QC in a single document. Illumina datasets are not controlled so far since we do not have the complete raw data. Similarly, the RNA-seq datasets are not controlled. The Makefile contains the commands to launch all jobs on the cluster.
Affymetrix datasets are analysed using R and the 'arrayQualityMetrics' package, which produces HTML reports with figures and descriptions. Agilent datasets are analysed using home-made scripts, which create figures but do not summarize the QC in a single document. The Makefile contains the commands to launch all jobs on the cluster.
```
make clean_outputs
make run_qc
......
......@@ -2,9 +2,9 @@
The objectives of this step are to preprocess the raw data and to save them for further use.
# Details and instructions
All Affymetrix datasets are preprocessed using SCAN and GC-RMA. Illumina and Agilent datasets are preprocessed using dedicated R libraries (limma and beadarray). The RNA-seq data are just post-processed since the raw data are not available. The Makefile contains the commands to launch all jobs on the cluster.
All Affymetrix datasets are preprocessed using SCAN and GC-RMA. Illumina and Agilent datasets are preprocessed using dedicated R libraries (limma and beadarray). The RNA-seq data are just post-processed since the preprocessing took place beforehand. The Makefile contains the commands to launch all jobs on the cluster.
The data are stored back as TSV files.
The data are stored as TSV files.
```
make clean_outputs
make preprocess
......@@ -26,4 +26,4 @@ make doc
```
# Prerequisites
In general, the only prerequisite is to have the raw data (mostly from GEO) in the Data folder. For the array-based datasets, there should be one folder per dataset, with a '/RAW/' folder with the raw data as CEL files. For the RNA-seq dataset, the pre-processed data should be available as TSV files containing read counts. For SCAN in particular, it is not necessary to run the quality control before running the preprocessing since arrays are preprocessed independently (unless one has many problematic arrays, which would take CPU time for nothing). For all the other methods however, this is not the case, and the bad-quality arrays need to be filtered before pre-processing (arrays are not treated individually).
In general, the only prerequisite is to have the raw data (mostly from GEO) in the Data folder. For the array-based datasets, there should be one folder per dataset, with a '/RAW/' folder with the raw data as CEL files. For the RNA-seq dataset, the pre-processed data should be available as TSV files containing read counts. For SCAN in particular, it is not necessary to run the quality control before running the preprocessing since arrays are preprocessed independently (unless one has many problematic arrays, which would take CPU time for nothing). For all the other methods however, this is not the case, and the bad-quality arrays need to be filtered before pre-processing.
# Objectives
The objectives of this step is to predict the gender /age of the patients whose gender / age is not indicated in the clinical annotations.
The objectives of this step are to predict the gender / age of the patients whose gender / age is not indicated in the clinical annotations.
# Details and instructions
All datasets are used regardless of whether there exist samples with missing clinical annotations. This is motivated by the fact that we also want to estimate the overall accuracy of the predictions. The Makefile contains the commands to get the data from Biomart and then make the predictions. Plots are made and whether the predicted data should be used is left to the user to decide (manually).
......@@ -14,4 +14,4 @@ make doc
```
# Prerequisites
The prerequisites are to have the raw data (mostly from GEO) in the Data folder and the pre-processed data from step 02. There should be one folder per dataset, with a '/RAW/' folder with the raw data as CEL files (array data) or a TSV file with the pre-processed data (RNA-seq). In addition, a "ClinicalData.tsv" file should contain the clinical annotations (including of course gender and age).
The prerequisites are to have the raw data (mostly from GEO) in the Data folder and the pre-processed data from step 02. There should be one folder per dataset, with a '/RAW/' folder with the raw data as CEL files (array data) or a TSV file with the pre-processed data (RNA-seq). In addition, a "ClinicalData.tsv" file should contain the clinical annotations (including of course gender and age).
\ No newline at end of file
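The layout described in these prerequisites can be checked before launching the step. The following is only a sketch: the function names are hypothetical and not part of the repository, and it assumes that `ClinicalData.tsv` sits directly inside each dataset folder, which the README does not state explicitly.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repository).
# Assumption: ClinicalData.tsv is stored directly in each dataset folder.

# Succeeds if the dataset folder has a RAW/ folder (array data) or at least
# one TSV file other than ClinicalData.tsv (pre-processed RNA-seq data).
has_expression_input() {
  local dataset="$1" f
  [ -d "$dataset/RAW" ] && return 0
  for f in "$dataset"/*.tsv; do
    [ -e "$f" ] || continue
    [ "$(basename "$f")" = "ClinicalData.tsv" ] && continue
    return 0
  done
  return 1
}

# Prints one line per problem found and returns non-zero if any dataset
# folder under the given data directory is incomplete.
check_dataset_layout() {
  local data_dir="$1" status=0 dataset name
  for dataset in "$data_dir"/*/; do
    [ -d "$dataset" ] || continue
    dataset="${dataset%/}"
    name="$(basename "$dataset")"
    if ! has_expression_input "$dataset"; then
      echo "$name: no RAW/ folder and no pre-processed TSV file"
      status=1
    fi
    if [ ! -f "$dataset/ClinicalData.tsv" ]; then
      echo "$name: missing ClinicalData.tsv"
      status=1
    fi
  done
  return "$status"
}
```

It could be called as `check_dataset_layout /path/to/Data`, where the path is a placeholder for the local data folder.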
# Objectives
The first objective of this step is to clean the datasets, i.e., remove the arrays that have been flagged during the previous steps due to various errors (referred to as QC-I, PROC-no-converg, QC-II-SCAN, QC-II-GCRMA, CLIN-no-age or CLIN-no-gender). A second objective is to check the expression levels of pre-defined biomarkers.
The first objective of this step is to clean the datasets, i.e., remove the arrays that have been flagged during the previous steps due to various errors (referred to as QC-I, PROC-no-converg, QC-II-SCAN, QC-II-GCRMA, CLIN-no-age or CLIN-no-gender). The second objective is to define which probes are going to be used to represent which genes for each platform (removing complex or ambiguous matches).
# Details and instructions
The datasets are processed one by one to remove the bad arrays and update the clinical files accordingly. The clinical data is also updated to include gender predictions. The necessary information is described in the local configuration file, which needs to be manually updated based on the outputs of the previous steps. Please note that the 'Batch.tsv' file is also updated since it might be used after preprocessing (for limma).
......@@ -13,11 +13,9 @@ The age and category balances are then checked for all datasets. No statistics i
make check
```
Then, the expression levels of the various biomarkers are checked. We first need to map the probes to the genes for all platforms, and then analyze the pre-defined biomarkers.
The most appropriate gene-probe match is defined (per dataset/platform individually).
```
make ps
make match
make biomarker
```
For manual inspection, a document that contains all figures is then generated.
......
# Objectives
The objective of this step is to identify the differentially expressed genes of each dataset, and for various comparisons of interest (e.g., female vs male, disease versus control).
This produces gene lists that can then be used for the pathway and network analyses.
The objective of this step is to identify the differentially expressed genes of each dataset, and for various comparisons of interest (e.g., female vs male, disease versus control). This produces gene lists that can then be used for the pathway and network analyses.
# Details and instructions
The datasets are processed one by one to identify differentially expressed genes (using limma). The following analyses are performed:
......@@ -12,18 +11,24 @@ The datasets are processed one by one to identify differentially expressed genes
- Female patients vs male patients
- Female controls vs male controls
- (Female patients vs male patients) vs (Female controls vs male controls) [equivalent to #5]
Various plots and lists are created in the process. The results are summarized at the probe level but with gene annotations.
```
make clean_outputs
make run_limma
```
Then, the expression levels of the various biomarkers are checked.
```
make biomarker
```
A document that contains all figures can then be generated.
```
make doc
```
Notice that not all datasets are analyzed using the exact same method. The co-factors might be different depending on whether there are replicates, potential batch effects, and complete clinical annotations (such as age). In addition, some datasets do not have enough samples in a given category (for instance, female PD patients) to perform the complex analyses that take this category into account.
Notice that not all datasets are analyzed using the exact same method. The co-factors might be different depending on whether there are replicates, potential batch effects, and complete clinical annotations (such as age). In addition, some datasets do not have enough samples in a given category (for instance, female PD patients) to perform the complex analyses that take this category into account. This means that for some datasets not all analyses are performed.
# Prerequisites
The only prerequisite is to have the preprocessed and cleaned data for all datasets (Step 04).
The only prerequisite is to have the preprocessed and cleaned data for all datasets (Step 04).
\ No newline at end of file
# Objectives
The objective of this step is to integrate the results of the differential expression analysis across several datasets in order to identify similarities and overlaps and to identify robust DEGs.
The objective of this step is to perform the meta-analysis, *i.e.*, to integrate the results of the differential expression analysis across several datasets in order to identify robust DEGs.
# Details and instructions
The datasets are first summarized at the gene level (limma analyses are performed at the probe level). Conflicts and non-unique mappings are handled to create a unique list of DEGs (still per dataset).
......@@ -16,7 +16,7 @@ make analyse
make check
```
We create the final gene expression matrices (to share data outside of the project).
We create the final gene expression matrices.
```
make gexpr
```
......@@ -26,7 +26,7 @@ A document that contains many but not all figures can then be generated.
make doc
```
Finally, the gene lists to be further analyzed are created. The idea there is to split the gender-specific genes (for bothe males and females) and the gender-dimorphic genes.
Finally, the gene lists to be further analyzed are created. The idea there is to split the sex-specific genes (for both males and females) and the sex-dimorphic genes.
```
make rankings
```
......
......@@ -2,20 +2,20 @@
The objective of this step is to perform the enrichment analyses of the DEGs.
# Details and instructions
The significant genes are analyzed to identify functional terms that are enriched (be it functions, pathways or diseases).
The genes are first ranked as to identify the gender dymorphic and gender specific genes.
The genes are analyzed to identify functional terms that are enriched (functions, pathways or diseases).
The sex-dimorphic and sex-specific genes are derived from the previous analyses.
```
make clean_outputs
make prep
```
We then perform the enrichment with several tools, including ClusterProfiler (self-contained, GSEA), ROntoTools (self-contained, GSEA + network topology), Gene2Pathways (competitive methods) and PathFindR (ORA + network). Notice that the pf_enrich command is run on frodo, all other commands are run on iris (issue with PathFindR on iris).
We then perform the enrichment with several tools, including ClusterProfiler (self-contained, GSEA), ROntoTools (self-contained, GSEA + network topology), Gene2Pathways (competitive methods) and PathFindR (ORA + network).
```
make enrich
make pf_enrich
```
Once the results of PathFindR are retrieved from frodo, we can create a merge since several tools rely on the same ontologies.
We can combine the results of several tools that rely on the same ontologies.
```
make merge
```
......@@ -26,7 +26,4 @@ make gsp
```
# Prerequisites
A prerequisite is to have the results of the integration (step 06).
#TODO
This steps needs to be completely changed as to incorporate the latest change in the workflow (new rankings and so on).
A prerequisite is to have the results of the integration (step 06).
\ No newline at end of file
......@@ -2,10 +2,7 @@
The objective of this step is to investigate the potential regulators behind the observed DEGs.
# Details and instructions
For DoRothEA and CARNIVAL, it is important to make sure that the local repositories are up-to-date and contain the relevant data.
The folder is ~/Data/GeneDER/Original/Else/DoRothEA.
We first start by selecting the genes we want to investigate. This is based on the PI values again, selecting a not so conservative threshold as to have enough genes to start with. This also means that we will need to check the differential expression of the genes at the end as well since some genes might be only weakly differentially expressed.
We first start by selecting the genes we want to investigate.
```
make clean_outputs
make prep
......@@ -40,26 +37,5 @@ Last, we merge the enrichment results altogether. These files can then be manual
make refineEnrich
```
# Summary of manual changes as of 2020/02/20
In the Female_mapping_refined.tsv file (3 updates):
HIST1H2AC --> HIST1H2AC Histone H2AC
HIST1H2BD --> HIST1H2BD HIST1H2BD
HIST2H2BE --> HIST2H2BE HIST2H2BE
In the Male_mapping_refined.tsv file (1 update):
IARS --> IARS IleRS
In the Gender_disease_status_mapping_refined.tsv file (0 update).
In the PDvsControl_mapping_refined.tsv (6 updates):
HARS --> HARS SYH
HIST1H1C --> HIST1H1C Histone H1.2
HIST1H2AC --> HIST1H2AC Histone H2AC
HIST1H2BD --> HIST1H2BD HIST1H2BD
LINC00889 --> {} (remove line)
LOC100996756 --> {} (remove line)
In the extendedTF_mapping_refined.tsv file (0 update).
# Prerequisites
A prerequisite is to have the results of the functional enrichment (Step 17), so as to rely on the same input as GSEA.
A prerequisite is to have the results of the functional enrichment (Step 17), so as to rely on the same input as GSEA.
\ No newline at end of file
CODE_FOLDER=/home/users/ltranchevent/Projects/GeneDER/Analysis/Code_style_check/
clean:
	@rm -rf *~
check:
	@Rscript --vanilla ${CODE_FOLDER}/lint_all.R
#!/usr/bin/env Rscript
# ================================================================================================
# Libraries
# ================================================================================================
library("lintr")
lint_it <- function(rscript) {
  errors <- lint(rscript, cache = FALSE,
                 linters = with_defaults(line_length_linter(100),
                                         cyclocomp_linter(complexity_limit = 500)))
  for (error in errors) {
    print(error)
  }
  message(paste0("[", Sys.time(), "] ", rscript, " analyzed."))
}
# ================================================================================================
# Main
# ================================================================================================
# Main code.
r_scripts <- system("ls ../*/*R", intern = TRUE)
for (r_script in r_scripts) {
  lint_it(r_script)
}
# Rlibs, scripts and utils.
r_scripts <- system("ls ../libs/*/*R", intern = TRUE)
for (r_script in r_scripts) {
  lint_it(r_script)
}
# Package ArrayUtils.
r_scripts <- system("ls ../../../Rlibs/ArrayUtils/R/*R", intern = TRUE)
for (r_script in r_scripts) {
  lint_it(r_script)
}
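The script above collects its inputs with `ls` and a glob such as `../*/*R`, which matches any file name ending in `R` at a fixed depth. As a possible alternative, the sketch below lists the same kind of files with `find`, matching on the `.R` extension. The function name `list_r_scripts` is hypothetical, and `-mindepth`/`-maxdepth` are assumed to be available (they are supported by GNU and BSD `find`, but are not part of POSIX).

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): list the *.R files found
# exactly `depth` levels below `root`, sorted for a reproducible order.
list_r_scripts() {
  local root="$1" depth="$2"
  find "$root" -mindepth "$depth" -maxdepth "$depth" -type f -name '*.R' | sort
}
```

For instance, `list_r_scripts .. 2` would cover roughly the same files as `ls ../*/*R`, restricted to names that end in `.R`.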
affy
AnnotationDbi
arrayQualityMetrics
ArrayUtils
beadarray
Biobase
CARNIVAL
clusterProfiler
dendextend
devtools
doParallel
DOSE
edgeR
gcrma
GEOquery
gep2pep
ggfortify
ggplot2
ggpubr
graph
GSEABase
heatmaply
hgfocus.db
hgu133a.db
hgu133plus2.db
hgug4112a.db
huex10sttranscriptcluster.db
hugene10sttranscriptcluster.db
illuminaHumanv3.db
limma
lintr
massiR
methods
missRanger
msigdbr
oligo
org.Hs.eg.db
pathfindR
pathview
preprocessCore
progeny
RColorBrewer
ReactomePA
readr
readxl
reshape2
ROntoTools
SCAN.UPC
statmod
stats
stringr
stringr
survcomp
sva
tidyverse
tidyverse
topconfects
u133x3p.db
utils
viper
vsn
yaml
\ No newline at end of file
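The package list above contains repeated entries (`stringr` and `tidyverse` each appear twice). A minimal sketch for spotting such repeats is shown below; the function name `duplicated_entries` is hypothetical, and the file is assumed to hold one package name per line.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print every line that
# occurs more than once in the given file, once per duplicated value.
duplicated_entries() {
  sort "$1" | uniq -d
}
```

It could be run as `duplicated_entries Confs/packages`; an empty output means that no package is listed twice.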
FOLDER=/home/users/ltranchevent/Projects/GeneDER/Analysis/
OUTPUT_FOLDER=/home/users/ltranchevent/Data/GeneDER/Analysis/
ANNEX=/home/users/ltranchevent/Projects/GeneDER/Documents/WorkReport/Annexes_p3/
clean:
	@rm -rf *~
code_loc_hpc:
	@rsync -vazur --delete /home/leon/Projects/GeneDER/Analysis/ iris:/home/users/ltranchevent/Projects/GeneDER/Analysis/
code_hpc_loc:
	@rsync -vazur --delete iris:/home/users/ltranchevent/Projects/GeneDER/Analysis/ /home/leon/Projects/GeneDER/Analysis/
code_loc_frd:
	@rsync -vazur --delete /home/leon/Projects/GeneDER/Analysis/ frodo:/home/leon-charles.tranchevent/Projects/GeneDER/Analysis/
code_frd_loc:
	@rsync -vazur --delete frodo:/home/leon-charles.tranchevent/Projects/GeneDER/Analysis/ /home/leon/Projects/GeneDER/Analysis/
lib_loc_hpc:
	@rsync -vazur --delete /home/leon/Projects/Rlibs/ iris:/home/users/ltranchevent/Projects/Rlibs/
lib_hpc_loc:
	@rsync -vazur --delete iris:/home/users/ltranchevent/Projects/Rlibs/ /home/leon/Projects/Rlibs/
lib_loc_frd:
	@rsync -vazur --delete /home/leon/Projects/Rlibs/ frodo:/home/leon-charles.tranchevent/Projects/Rlibs/
lib_frd_loc:
	@rsync -vazur --delete frodo:/home/leon-charles.tranchevent/Projects/Rlibs/ /home/leon/Projects/Rlibs/
data_loc_hpc:
	@rsync -vazur /home/leon/Data/GeneDER/Analysis/ iris:/home/users/ltranchevent/Data/GeneDER/Analysis/
	@rsync -vazur /home/leon/Data/GeneDER/Original/ iris:/home/users/ltranchevent/Data/GeneDER/Original/
data_hpc_loc:
	@rsync -vazur iris:/home/users/ltranchevent/Data/GeneDER/Analysis/ /home/leon/Data/GeneDER/Analysis/
	@rsync -vazur iris:/home/users/ltranchevent/Data/GeneDER/Original/ /home/leon/Data/GeneDER/Original/
data_loc_frd:
	@rsync -vazur /home/leon/Data/GeneDER/Analysis/ frodo:/home/leon-charles.tranchevent/Data/GeneDER/Analysis/
	@rsync -vazur /home/leon/Data/GeneDER/Original/ frodo:/home/leon-charles.tranchevent/Data/GeneDER/Original/
data_frd_loc:
	@rsync -vazur frodo:/home/leon-charles.tranchevent/Data/GeneDER/Analysis/ /home/leon/Data/GeneDER/Analysis/
	@rsync -vazur frodo:/home/leon-charles.tranchevent/Data/GeneDER/Original/ /home/leon/Data/GeneDER/Original/
install_lib:
#ii
	@module load lang/R/3.6.0-foss-2019a-bare
	@Rscript --vanilla ${FOLDER}/libs/utils/install.R
up_annex:
	@cp ${OUTPUT_FOLDER}/02/results_summary.pdf ${ANNEX}/02_summary_results.pdf
	@cp ${OUTPUT_FOLDER}/03/results_summary.pdf ${ANNEX}/03_summary_results.pdf
	@cp ${OUTPUT_FOLDER}/04/results_summary.pdf ${ANNEX}/04_summary_results.pdf
	@cp ${OUTPUT_FOLDER}/05/results_summary.pdf ${ANNEX}/05_summary_results.pdf
	@cp ${OUTPUT_FOLDER}/06/results_summary_a.pdf ${ANNEX}/06a_summary_results.pdf
	@cp ${OUTPUT_FOLDER}/06/results_summary_b.pdf ${ANNEX}/06b_summary_results.pdf
	@cp ${OUTPUT_FOLDER}/14/results_summary.pdf ${ANNEX}/14_summary_results.pdf
	@cp ${OUTPUT_FOLDER}/15/results_summary.pdf ${ANNEX}/15_summary_results.pdf
	@cp ${OUTPUT_FOLDER}/16/results_summary_a.pdf ${ANNEX}/16a_summary_results.pdf
	@cp ${OUTPUT_FOLDER}/16/results_summary_b.pdf ${ANNEX}/16b_summary_results.pdf
# GeneDER project
This project focuses on the analysis of gender based differences in Parkinson's disease. More detail about the project can be found in the project reports. This repository contains the code associated with the project.
# GeneDER
## Summary
The idea is to study several expression datasets (both microarray- and sequencing-based) as to detect genes that behave differently between males and females during the course of Parkinson's disease. These genes are then be investigated globally though network- and pathway-based methods as to identify potential key functions/modules/pathways.
## Table of contents
* [Introduction](#introduction)
* [Content](#content)
* [Data](#data)
* [Requirements](#requirements)
* [License](#license)
* [Citation](#citation)
## Data
The original data have been extracted from GEO or were produced in-house. Details can be found in the configuration files (for instance ./Confs/datasets_config.yml).
## Introduction
This repository contains the code necessary to run the analyses described in the article titled "Systems level meta-analysis of disease-associated molecular gender differences in Parkinson’s disease" authored by Léon-Charles Tranchevent, Rashi Halder and Enrico Glaab.
This project focuses on a meta-analysis of transcriptomics datasets of Parkinson's disease patients and controls in order to identify variations associated with both disease status and biological sex. These variations are then further investigated through functional enrichment and regulatory network analyses.
## Content
The workflow is split into eight sequential steps, each associated with a corresponding folder. There is an additional folder for the configuration files (*e.g.*, to indicate where to find the data and to define the parameters of the analyses). Each step is briefly described below, and each associated folder also contains its own README file.
## Prerequisites
Most of the code is currently composed of R and bash scripts. Makefiles are used to store the main commands. Steps are numbered and can be run sequentially using the dedicated Makefiles (more details are indicated in the dedicated README files). The code has been tested on my local machine, and then run on the iris cluster (excepted one job which has to run on frodo since it relies on a tool that does not run on iris). Note that this project relies on various R packages including BioConductor, Affy, SCAN.UPC, arrayQualityMetrics, limma, tidyverse as well as the ArrayUtils set of functions.
1. The quality control of the raw expression data is performed.
2. The raw expression data is preprocessed and another quality control is performed afterwards.
3. The clinical annotations are investigated in order to identify whether missing values can be predicted.
4. The processed data and associated clinical annotations are prepared taking into account the observations from the previous steps (*i.e.*, samples to remove because of the quality control, predicted clinical values to add).
5. For each dataset, two differential expression analyses are performed using respectively only the male samples and only the female samples. For both analyses, patients and controls are compared so that the models identify the genes that are differentially expressed between female patients and female controls (or between male patients and male controls).
6. The meta-analyses are performed by integrating the results of the differential expression analyses across datasets (but again separately for each sex). By comparing the male and female results, the female-specific, male-specific and sex-dimorphic genes are then defined.
7. Functional enrichment of the meta-analysis results is performed.
8. Regulatory networks around the key differentially expressed genes are reconstructed.
## Authors
* **Léon-Charles Tranchevent**
* **Enrico Glaab**
## Data
The datasets used in our study have been extracted from the [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/). The code can be used to analyze other datasets as long as the raw data and the associated clinical data are available. The configuration of the meta-analysis (*i.e.*, which datasets to include) can be found in the configuration folder `Confs/`.
## Requirements
The code consists of R and bash scripts. In addition, makefiles are used to illustrate exactly how the scripts were used in our meta-analysis. This project relies on various R and BioConductor packages (see the full list in the file `Confs/packages`). It also relies on the ArrayUtils set of functions, whose repository can be found [here](https://git-r3lab.uni.lu/bds/geneder/arrayutils).
## License
This project is currently not publicly available and therefore is not yet licensed.
## Acknowledgments
* GEO data providers.
* Drs Middleton and Miller for providing additional clinical data.
* Dr Hendrickx for discussion about the GEO datasets.
* Dr Rauschenberger for discussion about the statistics.
* Dr Ali for discussion about the regulatory networks.
* Drs Cantuti-Castelvetri and Standaert for their help regarding some expression datasets.
The code is available under the GNU General Public License (GPLv3).
## Citation
If you found this code useful, please cite our article: **Systems level meta-analysis of disease-associated molecular gender differences in Parkinson’s disease**, Tranchevent LC., Halder R. and Glaab E., *manuscript submitted*