Commit 4aa9bf44 authored by Leon-Charles Tranchevent

Merge branch 'develop' into 'master'

Updated license and documentation for PPC - cleaning of some unnecessary files

See merge request !3
parents 3bf5aec2 d4e5a666
......@@ -38,4 +38,4 @@ make refineEnrich
```
# Prerequisites
A prerequisite is to have the results of the functional enrichment (Step 17), as to rely on the same input than GSEA.
\ No newline at end of file
A prerequisite is to have the results of the functional enrichment (Step 07), so as to rely on the same input as GSEA.
\ No newline at end of file
......@@ -10,24 +10,12 @@ clean_outputs:
s_ana:
@sbatch --time=0-02:30:00 --mem=64GB ${CODE_FOLDER}analyse_seurat.sh GSE157783
@sbatch --time=0-20:00:00 --mem=196GB -p bigmem ${CODE_FOLDER}analyse_seurat.sh GSE178265
#s_ana_stest:
# @sbatch --time=0-00:20:00 --mem=8GB ${CODE_FOLDER}sanalyse_seurat.sh GSE157783s
# @sbatch --time=0-00:20:00 --mem=8GB ${CODE_FOLDER}sanalyse_seurat.sh GSE178265s
#s_ana_mtest:
# @sbatch --time=0-02:00:00 --mem=64GB ${CODE_FOLDER}analyse_seurat.sh GSE157783m
# @sbatch --time=0-02:00:00 --mem=64GB ${CODE_FOLDER}analyse_seurat.sh GSE178265m
l_ana:
@sbatch --time=0-17:00:00 --mem=64GB ${CODE_FOLDER}analyse_liger.sh GSE157783
@sbatch --time=2-00:00:00 --mem=400GB -p bigmem ${CODE_FOLDER}analyse_liger.sh GSE178265
#l_ana_stest:
# @sbatch --time=0-00:20:00 --mem=6GB ${CODE_FOLDER}sanalyse_liger.sh GSE157783s
# @sbatch --time=0-00:20:00 --mem=6GB ${CODE_FOLDER}sanalyse_liger.sh GSE178265s
#l_ana_mtest:
# @sbatch --time=0-02:00:00 --mem=64GB ${CODE_FOLDER}analyse_liger.sh GSE157783m
# @sbatch --time=0-02:00:00 --mem=64GB ${CODE_FOLDER}analyse_liger.sh GSE178265m
prep:
@sbatch ${CODE_FOLDER}prepare_rankings_seurat.sh GSE157783
#@sbatch ${CODE_FOLDER}prepare_rankings_seurat.sh GSE178265
@sbatch ${CODE_FOLDER}prepare_rankings_seurat.sh GSE178265
enrich:
@sbatch ${CODE_FOLDER}CP_enrich_seurat.sh GSE157783
#@sbatch ${CODE_FOLDER}CP_enrich_seurat.sh GSE178265
@sbatch ${CODE_FOLDER}CP_enrich_seurat.sh GSE178265
# Objectives
The objectives of this step is to analyse the single-cell data (PD versus control) for male and female samples.
The objective of this step is to analyse the single-cell data (PD versus control) for male and female samples. This step includes the quality control and data pre-processing as well as the differential expression and functional enrichment analyses. Due to the low number of datasets (*i.e.*, only two), there is no meta-analysis.
# Details and instructions
Beware, we use a version of SEurat that needs spatstat v1.64 not v2.x.
Note that the Seurat version we use relies on the R package spatstat v1.64 and not the latest v2.x.
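As a minimal sketch (assuming the `remotes` package is available), spatstat can be pinned to the 1.64 series before (re)installing Seurat:
```
Rscript -e 'remotes::install_version("spatstat", version = "1.64-1", repos = "https://cloud.r-project.org")'
```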
The single-cell datasets are analysed using a Seurat pipeline.
```
make clean_outputs
make s_ana
```
Alternatively, the same datasets can be analysed using a mixed Seurat / Liger pipeline.
```
make l_ana
```
At this stage, the datasets have been pre-processed and the differential genes have been identified.
For each dataset, the sex-specific and sex-dimorphic genes are first identified and then used to perform functional enrichment using Fisher's method (GSEA cannot be used as the differential expression analyses do not provide whole-genome rankings).
```
make prep
make enrich
```
# Prerequisites
A prerequisite is to have the scRNA data ready in the project data folder.
A prerequisite is to have the scRNA data ready in the project data folder and the configuration file set up correctly (as for the bulk transcriptomics data in step 01).
\ No newline at end of file
# To compute the overlap in DEGs between various differential analyses.
# I/Os
OUTPUT_FOLDER=/home/users/ltranchevent/Data/GeneDER/Analysis/09/GSE157783_seurat/
META_FOLDER=/home/users/ltranchevent/Data/GeneDER/Analysis/06/
# Declare the arrays.
##declare -a cells=('Oligodendrocytes' 'Excitatory' 'Microglia' 'OPCs' 'Astrocytes' 'Endothelial cells' 'Pericytes' 'Inhibitory' 'Ependymal' 'CADPS2+ neurons' 'GABA' 'DaNs');
declare -a cells=('Oligodendrocytes' 'Excitatory' 'Microglia' 'Astrocytes');
declare -a sexes=('F' 'M');
##declare -a methods=('wilcox' 'negbinom' 'poisson');
declare -a methods=('poisson');
declare -a refs=('SNage_PDVsControl_males_max-avg_all_pivalue_rankings.tsv' 'SNage_PDVsControl_males_max-avg_gs_pivalue_rankings.tsv' 'SNage_PDVsControl_females_max-avg_all_pivalue_rankings.tsv' 'SNage_PDVsControl_females_max-avg_gs_pivalue_rankings.tsv' 'SNage_PDVsControl_females_max-avg_gd_pivalueref_rankings.tsv');
# I - scRNA-seq only: per method, per gender, across all celltype pairs.
for method in "${methods[@]}"
do
for sex in "${sexes[@]}"
do
for cell_1 in "${cells[@]}"
do
for cell_2 in "${cells[@]}"
do
if [ "${cell_1}" = "${cell_2}" ]
then
# Skip that case since we do not want to compare cell_1 with itself.
continue
fi
# Compute the overlap between the DEG lists of the two cell types.
tag="[${method}][${sex}][${cell_1} vs ${cell_2}]"
fn1="${OUTPUT_FOLDER}DEG_${cell_1}_${sex}_PDvsCTRL_${method}.tsv"
fn2="${OUTPUT_FOLDER}DEG_${cell_2}_${sex}_PDvsCTRL_${method}.tsv"
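# Count the unique significant genes in each DEG table (column 5 is assumed to hold the (adjusted) p-value and column 6 the gene symbol; the 'Gene' header line is skipped).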
wc1=$(awk '{if ($5 <= 0.05) print $0}' ${fn1} | cut -f 6 | grep -v Gene | sort -u | wc -l)
wc2=$(awk '{if ($5 <= 0.05) print $0}' ${fn2} | cut -f 6 | grep -v Gene | sort -u | wc -l)
echo "${tag} First size: ${wc1}"
echo "${tag} Second size: ${wc2}"
wc3=$(comm -1 -2 <(awk '{if ($5 <= 0.05) print $0}' ${fn1} | cut -f 6 | grep -v Gene | sort -u) <(awk '{if ($5 <= 0.05) print $0}' ${fn2} | cut -f 6 | grep -v Gene | sort -u) | wc -l)
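# Note: bc uses integer arithmetic by default (scale 0), so the percentages below are truncated to whole numbers.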
p1=$(bc <<< 100*${wc3}/${wc1})
p2=$(bc <<< 100*${wc3}/${wc2})
echo "${tag} Overlap size: ${wc3} (${p1}% // ${p2}%)"
##echo "${tag} Overlap genes:"
##comm -1 -2 <(cut -f 6 ${fn1} | grep -v Gene | sort -u) <(cut -f 6 ${fn2} | grep -v Gene | sort -u)
echo ""
done # for each cell_2
done # for each cell_1
done # for each sex
done # for each method
echo ""
echo ""
echo ""
#
# II - scRNA-seq only: per method, per celltype, across sexes.
for method in "${methods[@]}"
do
for cell in "${cells[@]}"
do
# Compute the overlap between the female and male DEG lists.
tag="[${method}][${cell}][F vs M]"
fn1="${OUTPUT_FOLDER}DEG_${cell}_F_PDvsCTRL_${method}.tsv"
fn2="${OUTPUT_FOLDER}DEG_${cell}_M_PDvsCTRL_${method}.tsv"
wc1=$(awk '{if ($5 <= 0.05) print $0}' ${fn1} | cut -f 6 | grep -v Gene | sort -u | wc -l)
wc2=$(awk '{if ($5 <= 0.05) print $0}' ${fn2} | cut -f 6 | grep -v Gene | sort -u | wc -l)
echo "${tag} First size: ${wc1}"
echo "${tag} Second size: ${wc2}"
wc3=$(comm -1 -2 <(awk '{if ($5 <= 0.05) print $0}' ${fn1} | cut -f 6 | grep -v Gene | sort -u) <(awk '{if ($5 <= 0.05) print $0}' ${fn2} | cut -f 6 | grep -v Gene | sort -u) | wc -l)
p1=$(bc <<< 100*${wc3}/${wc1})
p2=$(bc <<< 100*${wc3}/${wc2})
echo "${tag} Overlap size: ${wc3} (${p1}% // ${p2}%)"
##echo "${tag} Overlap genes:"
##comm -1 -2 <(cut -f 6 ${fn1} | grep -v Gene | sort -u) <(cut -f 6 ${fn2} | grep -v Gene | sort -u)
echo ""
done # for each cell type.
done # for each method
echo ""
echo ""
echo ""
#
# III - scRNA-seq only: per method, per celltype, per sex, overlap with the results of the meta-analysis.
for ref in "${refs[@]}"
do
for method in "${methods[@]}"
do
for cell in "${cells[@]}"
do
for sex in "${sexes[@]}"
do
# Compute the overlap between the scRNA-seq DEG list and the meta-analysis gene list.
tag="[${method}][${sex}][${cell} vs ${ref}]"
fn1="${OUTPUT_FOLDER}DEG_${cell}_${sex}_PDvsCTRL_${method}.tsv"
fn2="${META_FOLDER}${ref}"
wc1=$(awk '{if ($5 <= 0.05) print $0}' ${fn1} | cut -f 6 | grep -v Gene | sort -u | wc -l)
wc2=$(awk '{if ($4 <= 0.05) print $0}' ${fn2} | cut -f 1 | grep -v Gene | sort -u | wc -l)
echo "${tag} First size: ${wc1}"
echo "${tag} Second size: ${wc2}"
wc3=$(comm -1 -2 <(awk '{if ($5 <= 0.05) print $0}' ${fn1} | cut -f 6 | grep -v Gene | sort -u) <(awk '{if ($4 <= 0.05) print $0}' ${fn2} | cut -f 1 | grep -v Gene | sort -u) | wc -l)
p1=$(bc <<< 100*${wc3}/${wc1})
p2=$(bc <<< 100*${wc3}/${wc2})
echo "${tag} Overlap size: ${wc3} (${p1}% // ${p2}%)"
#echo "${tag} Overlap genes:"
#comm -1 -2 <(cut -f 6 ${fn1} | grep -v Gene | sort -u) <(awk '{if ($4 <= 0.05) print $0}' ${fn2} | cut -f 1 | grep -v Gene | sort -u)
echo ""
done # for each sex
done # for each cell
done # for each method
done # for each ref
# awk '{if ($4 <= 0.05) print $0}' /home/leon/Data/GeneDER/Analysis/06/SNage_PDVsControl_males_max-avg_all_pivalue_rankings.tsv
# ==> 1330
# awk '{if ($4 <= 0.05) print $0}' /home/leon/Data/GeneDER/Analysis/06/SNage_PDVsControl_males_max-avg_gs_pivalue_rankings.tsv
# ==> 947
# awk '{if ($4 <= 0.05) print $0}' /home/leon/Data/GeneDER/Analysis/06/SNage_PDVsControl_females_max-avg_all_pivalue_rankings.tsv
# ==> 27
# awk '{if ($4 <= 0.05) print $0}' /home/leon/Data/GeneDER/Analysis/06/SNage_PDVsControl_females_max-avg_gs_pivalue_rankings.tsv
# ==> 13
# awk '{if ($4 <= 0.05) print $0}' /home/leon/Data/GeneDER/Analysis/06/SNage_PDVsControl_females_max-avg_gd_pivalueref_rankings.tsv
# ==> 53
#!/bin/bash -l
#SBATCH -J geneder:09:liger
#SBATCH --mail-type=all
#SBATCH --mail-user=leon-charles.tranchevent@uni.lu
#SBATCH -N 1
#SBATCH --ntasks-per-socket=14
#SBATCH --ntasks-per-node=28
#SBATCH -c 1
#SBATCH --mem=64GB
#SBATCH --time=0-04:00:00
#SBATCH -p batch
#SBATCH --qos=normal
echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
echo ""
# Defining global parameters.
OUTPUT_FOLDER=/home/users/ltranchevent/Data/GeneDER/Analysis/09/
CODE_FOLDER=/home/users/ltranchevent/Projects/GeneDER/Analysis/09-Single-cell_analysis/
REF=${1}
# Loading modules.
module load lang/R/3.6.2-foss-2019b-bare
# Load configuration
source ../libs/conf/confSH.sh
create_variables ../Confs/datasets_config.yml
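# create_variables is expected to expose the YAML entries as bash variables/arrays (e.g., scdatasets__dataset_name), which are iterated below.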
# For all scRNA datasets
nbScDatasets=${#scdatasets__dataset_name[@]}
for (( i=0; i<$nbScDatasets; i++ ))
do
datasetName=${scdatasets__dataset_name[$i]}
if [ "${datasetName}" == "${REF}" ]
then
echo "== Job $i started (${datasetName}) =="
rm -rf ${OUTPUT_FOLDER}${datasetName}_liger/
mkdir ${OUTPUT_FOLDER}${datasetName}_liger/
Rscript --vanilla ${CODE_FOLDER}sanalyse_liger.R ${datasetName} > ${OUTPUT_FOLDER}${datasetName}_liger/analysis_log.out 2> ${OUTPUT_FOLDER}${datasetName}_liger/analysis_log.err
echo "== Job $i ended (${datasetName}) =="
fi
done
# Moving the slurm log file to data
mv ${CODE_FOLDER}/slurm-*out ${OUTPUT_FOLDER}/
#!/bin/bash -l
#SBATCH -J geneder:09:seurat
#SBATCH --mail-type=all
#SBATCH --mail-user=leon-charles.tranchevent@uni.lu
#SBATCH -N 1
#SBATCH --ntasks-per-socket=14
#SBATCH --ntasks-per-node=28
#SBATCH -c 1
#SBATCH --mem=64GB
#SBATCH --time=0-04:00:00
#SBATCH -p batch
#SBATCH --qos=normal
echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
echo ""
# Defining global parameters.
OUTPUT_FOLDER=/home/users/ltranchevent/Data/GeneDER/Analysis/09/
CODE_FOLDER=/home/users/ltranchevent/Projects/GeneDER/Analysis/09-Single-cell_analysis/
REF=${1}
# Loading modules.
module load lang/R/3.6.2-foss-2019b-bare
# Load configuration
source ../libs/conf/confSH.sh
create_variables ../Confs/datasets_config.yml
# For all scRNA datasets
nbScDatasets=${#scdatasets__dataset_name[@]}
for (( i=0; i<$nbScDatasets; i++ ))
do
datasetName=${scdatasets__dataset_name[$i]}
if [ "${datasetName}" == "${REF}" ]
then
echo "== Job $i started (${datasetName}) =="
rm -rf ${OUTPUT_FOLDER}${datasetName}_seurat/
mkdir ${OUTPUT_FOLDER}${datasetName}_seurat/
Rscript --vanilla ${CODE_FOLDER}sanalyse_seurat.R ${datasetName} > ${OUTPUT_FOLDER}${datasetName}_seurat/analysis_log.out 2> ${OUTPUT_FOLDER}${datasetName}_seurat/analysis_log.err
echo "== Job $i ended (${datasetName}) =="
fi
done
# Moving the slurm log file to data
mv ${CODE_FOLDER}/slurm-*out ${OUTPUT_FOLDER}/
......@@ -10,6 +10,7 @@ dendextend
devtools
doParallel
DOSE
dplyr
edgeR
gcrma
GEOquery
......@@ -44,16 +45,18 @@ ReactomePA
readr
readxl
reshape2
rliger
ROntoTools
SCAN.UPC
sctransform
Seurat
SeuratWrappers
statmod
stats
stringr
stringr
survcomp
sva
tidyverse
tidyverse
topconfects
u133x3p.db
utils
......
# GeneDER
## Table of contents
# Table of contents
* [Introduction](#introduction)
* [Content](#content)
* [Data](#data)
* [Requirements](#requirements)
* [License](#license)
* [Instructions](#instructions)
## Introduction
This repository contains the code necessary to run the analyses described in the article titled "Meta-analysis of disease-associated molecular gender differences in Parkinson’s disease" authored by Léon-Charles Tranchevent, Rashi Halder and Enrico Glaab.
# Introduction
This repository contains the code necessary to run the analyses described in the article titled "Meta-analysis of disease-associated gender differences in transcriptome profiles for Parkinson’s disease" authored by Léon-Charles Tranchevent, Rashi Halder and Enrico Glaab.
This project focuses on a meta-analysis of transcriptomics datasets of Parkinson's disease patients and controls in order to identify variations associated with both disease status and biological sex. These variations are then further investigated through functional enrichment and regulatory network analyses.
## Content
The workflow is split into eight sequential steps, each one is associated with a corresponding folder. There is an additional folder for the configuration files (*e.g.*, to indicate where to find the data and to define the parameters of the analyses). Each step is briefly described below, but each associated folder also contains its own README file.
# Content
The workflow is split into nine sequential steps, each associated with a corresponding folder. There is an additional folder for the configuration files (*e.g.*, to indicate where to find the data and to define the parameters of the analyses). Each step is briefly described below, but more precise instructions are provided in other sections.
1. The quality control of the raw expression data is performed.
2. The raw expression data is preprocessed and a final quality control is performed afterwards.
......@@ -23,13 +22,289 @@ The workflow is split into eight sequential steps, each one is associated with a
6. The meta-analyses are performed by integrating the results of the differential expression analyses across datasets (again separately for each sex). By comparing the male and female results, the female-specific, male-specific and sex-dimorphic genes are then defined.
7. Functional enrichment of the meta-analysis results is performed.
8. Regulatory networks around the key differentially expressed genes are reconstructed.
9. The processing and analysis of single-cell RNA sequencing datasets for validation purposes.
9. Single-cell RNA sequencing datasets are processed and analysed in order to investigate whether the variations observed in bulk transcriptomics datasets are also present in single-cell datasets.
## Data
# Data
The datasets used in our study have been extracted from the [Gene Expression Omnibus](https://www.ncbi.nlm.nih.gov/geo/). The code can be used to analyze other datasets as long as the raw data and the associated clinical data is available. The configuration of the meta-analysis (*i.e.*, which datasets to include) can be found in the configuration folder `Confs/`.
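As an illustration, the dataset configuration can be listed from the shell with the same helper used by the cluster scripts (a minimal sketch, assuming it is run from the repository root and that the scRNA datasets are listed under `scdatasets` in the YAML file):
```
source libs/conf/confSH.sh
create_variables Confs/datasets_config.yml
echo "Configured scRNA datasets: ${scdatasets__dataset_name[*]}"
```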
## Requirements
# Requirements
The code consists of R and bash scripts. In addition, makefiles are used to illustrate how the scripts were exactly used in our meta-analysis. This project relies on various R and BioConductor packages (see the full list in the file `Confs/packages`). It also relies on the ArrayUtils set of functions for which the repository can be found [here](https://git-r3lab.uni.lu/bds/geneder/arrayutils).
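For instance, the listed dependencies could be installed in one go (a hedged sketch, assuming `Confs/packages` contains one package name per line and that BiocManager can resolve both CRAN and Bioconductor packages):
```
Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install(readLines("Confs/packages"))'
```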
## License
# License
The code is available under the MIT License.
# Instructions
## Step 01: quality control
### Objectives
The objectives of this step are to analyze the quality of the raw data and to create reports that can be visually inspected to decide which samples should be kept and which should be discarded.
### Details and instructions
Affymetrix datasets are analysed using R and the 'arrayQualityMetrics' package, which produces HTML reports with figures and descriptions. Agilent datasets are analysed using home-made scripts, which create figures but do not summarize the QC in a single document. The Makefile contains the commands to launch all jobs.
```
make clean_outputs
make run_qc
```
### Prerequisites
Since this is the first step of the analysis, the only prerequisite is to have the raw data (mostly from GEO) in the Data folder. There should be one folder per dataset, with a TSV file containing the clinical data ('ClinicalData.tsv') and a '/RAW/' folder with the raw data (which should not be compressed, unless it is a GEO series matrix file).
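For instance, for a hypothetical dataset GSEXXXXX (placeholder accession; the exact location of the Data folder depends on the configuration), the expected layout can be checked with:
```
ls Data/GSEXXXXX/ClinicalData.tsv   # clinical annotations (TSV)
ls Data/GSEXXXXX/RAW/               # raw data (e.g., CEL files or a GEO series matrix file)
```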
## Step 02: Pre-processing
### Objectives
The objectives of this step are to preprocess the raw data and to save them for further use.
### Details and instructions
All Affymetrix datasets are preprocessed using SCAN and GC-RMA. Illumina and Agilent datasets are preprocessed using dedicated R libraries (limma and beadarray). The RNA-seq data are only post-processed since the preprocessing was performed beforehand. The Makefile contains the commands to launch all jobs.
The data are stored as TSV files.
```
make clean_outputs
make preprocess
```
For the datasets analyzed through SCAN, we can then plot the preprocessing results (to identify the arrays for which the SCAN modeling might have failed). We do not do any post-processing of the normalization procedure for Illumina or Agilent arrays (QC is run in all cases anyway).
```
make get_log
```
We can then investigate the effect of applying a variance stabilization method on the data (*e.g.*, to control heteroscedasticity).
```
make vsn
```
Finally, we create a report about the processing and the variance stabilization methods so that we can manually check which normalization procedure makes more sense or whether it is indeed necessary to correct for variance bias.
```
make doc
```
### Prerequisites
In general, the only prerequisite is to have the raw data (mostly from GEO) in the Data folder. For the array-based datasets, there should be one folder per dataset, with a '/RAW/' folder containing the raw data as CEL files. For the RNA-seq dataset, the pre-processed data should be available as TSV files containing read counts. For SCAN in particular, it is not necessary to run the quality control before the preprocessing since arrays are preprocessed independently (unless there are many problematic arrays, which would waste CPU time). For all the other methods, however, this is not the case, and the bad-quality arrays need to be filtered out before pre-processing.
## Step 03: missing value prediction
### Objectives
The objectives of this step are to predict the gender/age of the patients whose gender/age is not indicated in the clinical annotations.
### Details and instructions
All datasets are used regardless of whether they contain samples with missing clinical annotations. This is motivated by the fact that we also want to estimate the overall accuracy of the predictions. The Makefile contains the commands to get the data from BioMart and then make the predictions. Plots are created, and whether the predicted data should be used is left to the user to decide (manually).
```
make clean_outputs
make predict
```
A PDF document that contains all the figures is then created for manual inspection.
```
make doc
```
### Prerequisites
The prerequisites are to have the raw data (mostly from GEO) in the Data folder and the pre-processed data from step 02. There should be one folder per dataset, with a '/RAW/' folder containing the raw data as CEL files (array data) or a TSV file with the pre-processed data (RNA-seq). In addition, a 'ClinicalData.tsv' file should contain the clinical annotations (including, of course, gender and age).
## Step 04: dataset preparation
### Objectives
The first objective of this step is to clean the datasets, *i.e.*, to remove the arrays that have been flagged during the previous steps due to various errors (referred to as QC-I, PROC-no-converg, QC-II-SCAN, QC-II-GCRMA, CLIN-no-age or CLIN-no-gender). The second objective is to define which probes are going to be used to represent which genes for each platform (removing complex or ambiguous matches).
### Details and instructions
The datasets are processed one by one to remove the bad arrays and update the clinical files accordingly. The clinical data is also updated to include the gender predictions. The necessary information is described in the local configuration file, which needs to be manually updated based on the outputs of the previous steps. Please note that the 'Batch.tsv' file is also updated since it might be used after preprocessing (for limma).
```
make clean_outputs
make data
```
The age and category balances are then checked for all datasets. No statistics are derived, but plots are created for manual inspection.
```
make check
```
The most appropriate gene-probe match is defined (per dataset/platform individually).
```
make match
```
For manual inspection, a document that contains all figures is then generated.
```
make doc
```
### Prerequisites
The prerequisites are to have the preprocessed data for all datasets (Step 02) as well as the predicted gender information (Step 03).
## Step 05: differential expression
### Objectives
The objective of this step is to identify the differentially expressed genes of each dataset for various comparisons of interest (e.g., female vs male, disease versus control). This produces gene lists that can then be used for the pathway and network analyses.
### Details and instructions
The datasets are processed one by one to identify differentially expressed genes (using limma). The following analyses are performed:
- Females vs males
- PD patients vs controls
- Female PD patients vs female controls
- Male PD patients vs male controls
- (Female PD patients vs female controls) vs (Male PD patients vs male controls)
- Female patients vs male patients
- Female controls vs male controls
- (Female patients vs male patients) vs (Female controls vs male controls) [equivalent to #5]
Various plots and lists are created in the process. The results are summarized at the probe level but with gene annotations.
```
make clean_outputs
make run_limma
```
Then, the expression levels of the various biomarkers are checked.
```
make biomarker
```
A document that contains all figures can then be generated.
```
make doc
```
Notice that not all datasets are analyzed using the exact same method. The co-factors might be different depending on whether there are replicates, potential batch effects, and complete clinical annotations (such as age). In addition, some datasets do not have enough samples in a given category (for instance, female PD patients) to perform the complex analyses that take this category into account. This means that for some datasets not all analyses are performed.
### Prerequisites
The only prerequisite is to have the preprocessed and cleaned data for all datasets (Step 04).
## Step 06: meta-analysis / data integration
### Objectives
The objective of this step is to perform the meta-analysis, *i.e.*, to integrate the results of the differential expression analyses across several datasets in order to identify robust DEGs.
### Details and instructions
The datasets are first summarized at the gene level (the limma analyses are performed at the probe level). Conflicts and non-unique mappings are handled to create a unique list of DEGs (still per dataset).
```
make clean_outputs
make summarize
```
The results are lists of DEGs (instead of differentially expressed probes) with NA for the genes that are not present in some of the datasets.
The integration itself is then computed and results are analyzed and checked.
```
make integrate
make analyse
make check
```
We create the final gene expression matrices.
```
make gexpr
```
A document that contains the main figures (but not all) can then be generated.
```
make doc
```
Finally, the gene lists to be further analyzed are created. The idea is to separate the sex-specific genes (for both males and females) from the sex-dimorphic genes.
```
make rankings
```
### Prerequisites
A prerequisite is to have the results of the limma analysis for all datasets (Step 05).
## Step 07: functional enrichment
### Objectives
The objective of this step is to perform the enrichment analyses of the DEGs.
### Details and instructions
The genes are analyzed to identify functional terms that are enriched (functions, pathways or diseases).
The sex-dimorphic and sex-specific genes are derived from the previous analyses.
```
make clean_outputs
make prep
```
We then perform the enrichment with several tools, including ClusterProfiler (self-contained, GSEA), ROntoTools (self-contained, GSEA + network topology), Gene2Pathways (competitive methods) and PathFindR (ORA + network).
```
make enrich
make pf_enrich
```
We can combine the results of several tools that rely on the same ontologies.
```
make merge
```
Ultimately, we also derive the gender-specific pathways (gsp) from the enrichment results based on the non-specific genes.
```
make gsp
```
### Prerequisites
A prerequisite is to have the results of the integration (step 06).
## Step 08: regulatory network analysis
### Objectives
The objective of this step is to investigate the potential regulators behind the observed DEGs.
### Details and instructions
We start by selecting the genes we want to investigate.
```
make clean_outputs
make prep
```
We prepare the DoRothEA database by downloading its latest version from OmniPath. We also prepare the gene lists (DoRothEA and our data might not use the same symbols for the same genes) in order to increase the coverage. This is done by matching against BioMart (considered as the reference here).
```
make getdb
```
At this stage, it is important to check the Dorothea_geneids_unmapped_mapped_clean.tsv file to make sure that the extra matches are correct. This file contains the genes for which two different ids are used in DoRothEA and in our data.
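For a quick manual check, the file can for instance be displayed as an aligned table (assuming a tab-separated layout, as the extension suggests):
```
column -t -s $'\t' Dorothea_geneids_unmapped_mapped_clean.tsv | less -S
```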
We create the DoRothEA GRN (simply extracting the regulatory interactions between genes, making sure we match as many genes as possible). We also run an enrichment analysis to detect whether there is an enrichment/depletion of TF targets in our lists.
```
make GRNdot
make enrich
```
GeneGO results are obtained via the dedicated web interface, but we refine them here automatically. In particular, we remove genes that were added to the GeneGO GRN for no particular reason; we instead want to focus on genes from our original lists. We also take care of matching the GeneGO gene names to our gene names.
```
make preprefine
make refineGRN1
```
At this stage, it is crucial to manually curate all XXXXX_mapping_refined.tsv files. These files (and the associated XXXXX_mapping_help.txt) contain the information necessary to resolve the conflicts and missing matches. I use GeneCards and NCBI EntrezGene to further check the matches. Details are provided in the section below.
When the GeneGO matching is ready, we actually refine the GRN accordingly (basically removing the numerous unnecessary edges).
```
make refineGRN2
```
Lastly, we merge all the enrichment results. These files can then be manually inspected to identify interesting patterns.
```
make refineEnrich
```
### Prerequisites
A prerequisite is to have the results of the functional enrichment (Step 07), so as to rely on the same input as GSEA.
## Step 09: single-cell analysis
### Objectives
The objective of this step is to analyse the single-cell data (PD versus control) for male and female samples. This step includes the quality control and data pre-processing as well as the differential expression and functional enrichment analyses. Due to the low number of datasets (*i.e.*, only two), there is no meta-analysis.
### Details and instructions