Commit 6bfe3fb1 authored by Leon-Charles Tranchevent's avatar Leon-Charles Tranchevent
Browse files

Refactoring the code for the identification of the...

Refactoring the code for the identification of the gender-specific/gender-dimorphic genes [second and last part].
parent 0fab72eb
......@@ -26,5 +26,10 @@ A document that contains many but not all figures can then be generated.
make doc
```
Finally, the gene lists to be further analyzed are created. The idea there is to split the gender-specific genes (for bothe males and females) and the gender-dimorphic genes.
```
make rankings
```
# Prerequisites
A prerequisite is to have the results of the limma analysis for all datasets (Step 05).
#!/usr/bin/env Rscript
# ================================================================================================
# Libraries
# ================================================================================================
library("yaml")
library("tidyverse")
library("stringr")
source("../libs/conf/confR.R")
source("../libs/utils/utils.R")
message(paste0("[", Sys.time(), "] Libraries loaded."))
# ================================================================================================
# Configuration
# ================================================================================================
options(bitmapType = "cairo")
config <- read_config(config_dirs = c("../Confs/", "./"))
output_data_dir <- paste0(config$global_data_dir, config$local_data_dir)
input_data_dir <- paste0(config$global_data_dir, config$local_input_data_dir)
input_edata_dir <- paste0(config$global_data_dir, config$local_exp_data_dir)
input_rsedata_dir <- paste0(config$global_data_dir, config$local_rs_exp_data_dir)
message(paste0("[", Sys.time(), "] Configuration done."))
# ================================================================================================
# Functions
# ================================================================================================
# ================================================================================================
# Main
# ================================================================================================
# The aim of this script is to apply the second set of filters to the results of the integration.
# The filters for the male and female analyses are the following:
# - The specificity index should be higher than 1.10.
# - The gene should be either dimorphic (up in one gender, down in the other) or the absolute
# value of the log fold-change should be higher for the gender of interest (to be
# gender-specific).
# - The nominal P value for the gender of interest should be lower then 1 (only to remove
# the genes that are then ties with all the same P value of 1). This is because this script
# also produces a general ranking that will be then used for GSEA.
# The filters for the gender disease status analysis are the following:
# - We first filter the matrix as to keep only the genes that are potentially dimorphic (based
# on the female and male analyses). This is done on the raw results as to be able to recompute
# the adjusted P values. We then retain only the genes that pass the first filters (as in the
# non raw results). Namely the filters are:
# - The sign of the female and male log fold-changes should be different.
# - The dataset overlap between the female and male analysis should be strictly higher than 50%.
# - The genes should have confidence scores in females and males above the respective thresholds
# (like for the female and male analysis except that here we filter the GDS data).
# - The nominal P value should be lower then 1 (only to remove the genes that are then ties
# with all the same P value of 1). This is because this script also produces a general ranking
# that will be then used for GSEA.
#
# The results of the analyses are then pre-processed in order to derive the lists of gender-specfiic
# and gender-dimorphic genes. The rules are the following:
# - The gender-specific genes are coming from the female and male analyses.
# - The gender-dimorphic genes are coming from all three analyses.
# - A gender-specific gene needs to be identified as specific before or it is potentially dimorphic
# but the nominal P value in the other gender is 1 (so we can consider that the opposite LFC is
# not meaningful).
# - A gender-dimorphic is therefore a gene identified as potentially dimorphic but for which the
# nominal P value is less than 1 (so the regulation might be significant).
# - The lists are then joined in order to check that the male and female gender-specific lists are
# disjoint and to solve the overlap between the three lists of gender-dimorphic genes. This
# is based on the best pi values accross all lists.
specificity_threshold <- 1.1
pval_threshold <- 1
overlap_threshold <- 0.5
# We only do the SNage integration scheme.
i <- 3
integration <- config$integrations[[i]]
int_criteria <- str_split(integration$criteria, ",")[[1]]
# We perform the analysis only for the "VSN" case.
j <- 1
vsn <- config$variance_methods[[j]]
# We use the first gene-probe selection method (max-avg) only.
l <- 1
selection <- config$selections[[l]]
# A- we first handle the results of the female analysis. We read the enriched results
# and apply the second filters.
k <- 5
limma_analysis <- config$limma_analyses[[k]]
res_females_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_analysis$name, "_", selection$name, "_integration_enriched.tsv")
res_females <- read.delim(res_females_fn,
quote = "",
stringsAsFactors = FALSE)
res_females_dimorphic <- res_females %>%
filter(as.numeric(ranking_value) >= specificity_threshold) %>%
filter(as.numeric(int_csss_pval_marotmayer) < pval_threshold) %>%
filter(sign(as.numeric(ref_logFC)) != sign(as.numeric(ctrl_logFC)))
res_females_specific <- res_females %>%
filter(as.numeric(ranking_value) >= specificity_threshold) %>%
filter(as.numeric(int_csss_pval_marotmayer) < pval_threshold) %>%
filter(sign(as.numeric(ref_logFC)) == sign(as.numeric(ctrl_logFC))) %>%
filter(abs(as.numeric(ref_logFC)) > abs(as.numeric(ctrl_logFC)))
res_females_filtered <- rbind(res_females_specific, res_females_dimorphic)
rm(res_females_fn, res_females, res_females_dimorphic, res_females_specific)
res_females_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_analysis$name, "_", selection$name, "_integration_filtered.tsv")
write.table(res_females_filtered, res_females_fn,
quote = FALSE,
sep = "\t",
row.names = FALSE)
rm(res_females_fn)
# We defined the purely gender-specific genes and the genes that could potentially be
# gender-dimorphic. The rules are the ones mentioned above.
fres_females_specific_p1 <- res_females_filtered %>%
filter(sign(as.numeric(ref_logFC)) == sign(as.numeric(ctrl_logFC)))
fres_females_specific_p2 <- res_females_filtered %>%
filter(sign(as.numeric(ref_logFC)) != sign(as.numeric(ctrl_logFC))) %>%
filter(as.numeric(ctrl_Pval) == 1)
fres_females_dimorphic <- res_females_filtered %>%
filter(sign(as.numeric(ref_logFC)) != sign(as.numeric(ctrl_logFC))) %>%
filter(as.numeric(ctrl_Pval) < 1)
fres_females_specific <- rbind(fres_females_specific_p1, fres_females_specific_p2)
rm(res_females_filtered, fres_females_specific_p1, fres_females_specific_p2)
rm(k, limma_analysis)
# B- we then handle the results of the male analysis. We read the enriched results
# and apply the second filters.
k <- 6
limma_analysis <- config$limma_analyses[[k]]
res_males_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_analysis$name, "_", selection$name, "_integration_enriched.tsv")
res_males <- read.delim(res_males_fn,
quote = "",
stringsAsFactors = FALSE)
res_males_dimorphic <- res_males %>%
filter(as.numeric(ranking_value) >= specificity_threshold) %>%
filter(as.numeric(int_csss_pval_marotmayer) < pval_threshold) %>%
filter(sign(as.numeric(ref_logFC)) != sign(as.numeric(ctrl_logFC)))
res_males_specific <- res_males %>%
filter(as.numeric(ranking_value) >= specificity_threshold) %>%
filter(as.numeric(int_csss_pval_marotmayer) < pval_threshold) %>%
filter(sign(as.numeric(ref_logFC)) == sign(as.numeric(ctrl_logFC))) %>%
filter(abs(as.numeric(ref_logFC)) > abs(as.numeric(ctrl_logFC)))
res_males_filtered <- rbind(res_males_specific, res_males_dimorphic)
rm(res_males_fn, res_males, res_males_dimorphic, res_males_specific)
res_males_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_analysis$name, "_", selection$name, "_integration_filtered.tsv")
write.table(res_males_filtered, res_males_fn,
quote = FALSE,
sep = "\t",
row.names = FALSE)
rm(res_males_fn)
# We defined the purely gender-specific genes and the genes that could potentially be
# gender-dimorphic. The rules are the ones mentioned above.
fres_males_specific_p1 <- res_males_filtered %>%
filter(sign(as.numeric(ref_logFC)) == sign(as.numeric(ctrl_logFC)))
fres_males_specific_p2 <- res_males_filtered %>%
filter(sign(as.numeric(ref_logFC)) != sign(as.numeric(ctrl_logFC))) %>%
filter(as.numeric(ctrl_Pval) == 1)
fres_males_dimorphic <- res_males_filtered %>%
filter(sign(as.numeric(ref_logFC)) != sign(as.numeric(ctrl_logFC))) %>%
filter(as.numeric(ctrl_Pval) < 1)
fres_males_specific <- rbind(fres_males_specific_p1, fres_males_specific_p2)
rm(res_males_filtered, fres_males_specific_p1, fres_males_specific_p2)
rm(k, limma_analysis)
# We check that the overlap between the two specific files is 0 (otherwise we may
# have an issue that would need to be solved.
if (length(intersect(fres_females_specific$SYMBOL, fres_males_specific$SYMBOL)) != 0) {
message(paste0("[", Sys.time(), "] There is some overlap between the the female and male",
" gender-specific gene lists, which is unexpected."))
}
# We select only the fields we want to have in the final TSV file.
fres_females_specific <- fres_females_specific %>%
select("SYMBOL", "GSE20159", "GSE20163", "GSE20164", "GSE20292", "GSE26927", "GSE49036",
"GSE8397", "NBB", "P_GSE20159", "P_GSE20163", "P_GSE20164", "P_GSE20292", "P_GSE26927",
"P_GSE49036", "P_GSE8397", "P_NBB", "ref_logFC", "ref_Pval", "ref_adj_Pval", "ctrl_logFC",
"ctrl_Pval", "ctrl_adj_Pval", "confidence_score", "ranking_value")
fres_males_specific <- fres_males_specific %>%
select("SYMBOL", "GSE20159", "GSE20163", "GSE20292", "GSE26927", "GSE49036", "GSE8397",
"Moreira", "NBB", "P_GSE20159", "P_GSE20163", "P_GSE20292", "P_GSE26927", "P_GSE49036",
"P_GSE8397", "P_Moreira", "P_NBB", "ref_logFC", "ref_Pval", "ref_adj_Pval", "ctrl_logFC",
"ctrl_Pval", "ctrl_adj_Pval", "confidence_score", "ranking_value")
# We save the results (gender-specific).
k <- 5
limma_analysis <- config$limma_analyses[[k]]
res_females_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_analysis$name, "_", selection$name, "_genelist_female_genderspecific.tsv")
write.table(fres_females_specific, res_females_fn,
quote = FALSE,
sep = "\t",
row.names = FALSE)
rm(res_females_fn, k, limma_analysis)
k <- 6
limma_analysis <- config$limma_analyses[[k]]
res_males_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_analysis$name, "_", selection$name, "_genelist_male_genderspecific.tsv")
write.table(fres_males_specific, res_males_fn,
quote = FALSE,
sep = "\t",
row.names = FALSE)
rm(res_males_fn, k, limma_analysis)
# C- We take care of the GDS results. We read the raw_enriched results.
k <- 7
limma_analysis <- config$limma_analyses[[k]]
res_gds_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_analysis$name, "_", selection$name, "_integration_raw_enriched.tsv")
res_gds <- read.delim(res_gds_fn,
quote = "",
stringsAsFactors = FALSE)
# As indicated above, we restrict the analysis to the dimorphic genes with enough dataset overlap
# between the two genders.
res_gds_dimorphic <- res_gds %>%
filter(sign(as.numeric(F.wavg_csss_logFC)) != sign(as.numeric(M.wavg_csss_logFC))) %>%
filter(as.numeric(overlap_percentage) > overlap_threshold)
rm(res_gds)
# We recompute the adjusted P-values as to consider only these dimorphic genes.
original_pvalues <- as.numeric(res_gds_dimorphic$int_csss_pval_marotmayer)
res_gds_dimorphic$RE.int_csss_adj_pval_marotmayer <- p.adjust(original_pvalues, method = "BH")
res_gds_dimorphic$RE.gds_pi_value <- abs(res_gds_dimorphic$wavg_csss_logFC) *
-log10(res_gds_dimorphic$int_csss_pval_marotmayer)
rm(original_pvalues)
# We run again the first set of filters since we started from the raw results.
# These are equivalent to the filters in the integrate_datasets.R script (#492).
# In this case (GDS data), we have 6 datasets.
conf_threshold <- config$confidence_threshold / 6
nb_db_min <- max(config$nb_min_pval, config$perc_min_pval * 6)
res_gds_dimorphic <- res_gds_dimorphic %>%
filter(as.numeric(confidence_score) >= conf_threshold) %>%
filter(as.numeric(nb_csss_datasets) >= nb_db_min)
rm(conf_threshold, nb_db_min)
# The confidence thresholds are based on the number of data sources we have, 8 in our case
# for both the female and male analyses.
conf_threshold <- config$confidence_threshold / 8
nb_db_min <- max(config$nb_min_pval, config$perc_min_pval * 8)
res_gds_dimorphic <- res_gds_dimorphic %>%
filter(as.numeric(F.confidence_score) >= conf_threshold) %>%
filter(as.numeric(M.confidence_score) >= conf_threshold) %>%
filter(as.numeric(F.nb_csss_datasets) >= nb_db_min) %>%
filter(as.numeric(M.nb_csss_datasets) >= nb_db_min) %>%
filter(as.numeric(int_csss_pval_marotmayer) < pval_threshold)
rm(conf_threshold, nb_db_min)
# We save the filtered results.
res_gds_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_analysis$name, "_", selection$name, "_integration_filtered.tsv")
write.table(res_gds_dimorphic, res_gds_fn,
quote = FALSE,
sep = "\t",
row.names = FALSE)
rm(res_gds_fn)
# We start by filtering the GDS results as to keep only the GDS data and remove the
# results of the male and female analyses.
res_gds_dimorphic <- res_gds_dimorphic %>%
select("SYMBOL",
#"GSE20159", "GSE20163", "GSE20292", "GSE26927", "GSE49036", "GSE8397", "NBB",
#"P_GSE20159", "P_GSE20163", "P_GSE20292", "P_GSE26927", "P_GSE49036", "P_GSE8397",
#"P_NBB",
"wavg_csss_logFC", "int_csss_pval_marotmayer", "int_csss_adj_pval_marotmayer",
"RE.int_csss_adj_pval_marotmayer",
#"confidence_score",
"RE.gds_pi_value")
# We then do the same for the male and female dimorphic results.
fres_females_dimorphic_short <- fres_females_dimorphic %>%
select("SYMBOL", "ref_logFC", "ref_Pval", "ref_adj_Pval", "ref_pivalue")
fres_males_dimorphic_short <- fres_males_dimorphic %>%
select("SYMBOL", "ref_logFC", "ref_Pval", "ref_adj_Pval", "ref_pivalue")
# We do the merge keeping all genes (because some genes only have results for one
# of the analysis, some for two, and some for three).
fres_dimorphic <- merge(merge(res_gds_dimorphic, fres_females_dimorphic_short,
by = "SYMBOL",
all = TRUE),
fres_males_dimorphic_short,
by = "SYMBOL",
all = TRUE,
suffixes = c(".F", ".M"))
fres_dimorphic <- fres_dimorphic %>%
mutate(bestpi = pmax(fres_dimorphic$RE.gds_pi_value, fres_dimorphic$ref_pivalue.F,
fres_dimorphic$ref_pivalue.M, na.rm = TRUE)) %>%
mutate(tmp1 = ifelse(is.na(RE.gds_pi_value), "", ifelse(bestpi == RE.gds_pi_value, "GDS", "")),
tmp2 = ifelse(is.na(ref_pivalue.F), "", ifelse(bestpi == ref_pivalue.F, "F", "")),
tmp3 = ifelse(is.na(ref_pivalue.M), "", ifelse(bestpi == ref_pivalue.M, "M", ""))) %>%
mutate(source = paste0(tmp1, tmp2, tmp3)) %>%
select(!starts_with("tmp"))
# We save the filtered results.
res_gds_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_analysis$name, "_", selection$name, "_genelist_genderdimorphic.tsv")
write.table(fres_dimorphic, res_gds_fn,
quote = FALSE,
sep = "\t",
row.names = FALSE)
rm(res_gds_fn)
rm(k, limma_analysis)
# We check that the overlap between the gender-specific and gender-dimorphic files is 0 (otherwise
# we may have an issue that would need to be solved.
if (length(intersect(fres_females_specific$SYMBOL, fres_dimorphic$SYMBOL)) != 0) {
message(paste0("[", Sys.time(), "] There is some overlap between the the female and male",
" gender-specific gene lists, which is unexpected."))
}
# TODO HERE THEREsI OVERLAOP + CEJHCK MALES AS WELL...
# We clean the workspace and log the session details for reproducibility.
rm(specificity_threshold, pval_threshold, overlap_threshold)
rm(l, selection, vsn, j, integration, int_criteria, i)
rm(config, output_data_dir, input_data_dir, input_edata_dir, input_rsedata_dir)
sessionInfo()
......@@ -125,6 +125,7 @@ rm(k, limma_analysis)
# We merge the female and male results as to identify the genes that are potentially gender-specific
# and the ones that are potentially gender-dimorphic. This decision is based on the log fold-changes.
res_FM <- merge(x = res_females, y = res_males, by = "Gene", all = TRUE, suffixes = c(".F",".M"))
rm(res_females, res_males)
# The common potential gender-specific genes are present in both the female and male analyses,
# have the same regulation (up/down).
......@@ -140,6 +141,7 @@ unique_males_potGS <- res_FM %>% filter(is.na(log_fold_change.F))
# regulation (up/down).
potGD <- res_FM %>% filter(!is.na(log_fold_change.F), !is.na(log_fold_change.M)) %>%
filter(sign(log_fold_change.F) != sign(log_fold_change.M))
rm(res_FM)
# We then need to refine the potential gender-specific and gender-dimorphic genes based on various
# criteria. We start with the gender-specific genes and more precisely with the common potential
......@@ -159,12 +161,14 @@ common_males_potGS <- merge(common_potGS,
by = "Gene", all.x = TRUE) %>%
filter(gender_specific_score >= specificity_threshold) %>%
filter(abs(log_fold_change.M) >= abs(log_fold_change.F))
rm(spec_females, spec_males, common_potGS, specificity_threshold)
# We check that no gene is in common between common_females_potGS and common_males_potGS.
# This could happen if the gender specificity scores are not computed correctly.
m <- paste0("Genes specific for both males and females: ",
dim(merge(common_females_potGS, common_males_potGS, by = "Gene"))[1])
message(paste0("[", Sys.time(), "] ", m, "."))
rm(m)
# The unique_females_potGS and unique_males_potGS do not need any filtering since
# the signal is present in only one gender to start with.
......@@ -174,6 +178,7 @@ GS_females <- rbind(common_females_potGS, unique_females_potGS %>%
mutate(gender_specific_score = NA))
GS_males <- rbind(common_males_potGS, unique_males_potGS %>%
mutate(gender_specific_score = NA))
rm(common_females_potGS, common_males_potGS,unique_females_potGS, unique_males_potGS)
# We also create lists of both gender-specific and gender-dimorphic genes for each gender.
GSGD_females <- rbind(GS_females, potGD %>%
......@@ -188,129 +193,60 @@ GSGD_males <- rbind(GS_males, potGD %>%
# genes.
potGDsel <- merge(potGD, data_over, by = "Gene", all.x = TRUE) %>%
filter(overlap_percentage > overlap_threshold)
potGD_notsuitableforGDS <- merge(potGD, data_over, by = "Gene", all.x = TRUE) %>%
filter(overlap_percentage <= overlap_threshold)
# We then want to recompute the adjusted P values of the GDS analysis.
# The idea is to restrict the GDS analysis to only the potentially gender-dimorphic genes.
# We can not however rely on the potentially gender-dimorphic genes we have collected so far
# since these are obtained through filtering already. We need to get bak to the raw results
# of the integration, identify the potentially gender-dimorphic genes from there and
# correct the P values accordingly.
# We start by reading all raw results for the three limma analyses of interest.
# Starting with the female analysis.
k <- 5
limma_analysis <- config$limma_analyses[[k]]
raw_res_females_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_analysis$name, "_", selection$name, "_integration_raw.tsv")
raw_res_females <- read.delim(raw_res_females_fn,
row.names = 1,
quote = "",
stringsAsFactors = FALSE) %>%
mutate(Gene = SYMBOL, log_fold_change = wavg_csss_logFC, P_value = int_csss_pval_marotmayer,
adj_P_value = int_csss_adj_pval_marotmayer) %>%
select(Gene, nb_csss_datasets, nb_conf_datasets, nb_miss_datasets, confidence_score,
log_fold_change, P_value, adj_P_value)
rm(raw_res_females_fn, k, limma_analysis)
# Now the raw results for males.
k <- 6
limma_analysis <- config$limma_analyses[[k]]
raw_res_males_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_analysis$name, "_", selection$name, "_integration_raw.tsv")
raw_res_males <- read.delim(raw_res_males_fn,
row.names = 1,
quote = "",
stringsAsFactors = FALSE) %>%
mutate(Gene = SYMBOL, log_fold_change = wavg_csss_logFC, P_value = int_csss_pval_marotmayer,
adj_P_value = int_csss_adj_pval_marotmayer) %>%
select(Gene, nb_csss_datasets, nb_conf_datasets, nb_miss_datasets, confidence_score,
log_fold_change, P_value, adj_P_value)
rm(raw_res_males_fn, k, limma_analysis)
# Finally, the raw results of the full analysis (GDS).
k <- 7
limma_analysis <- config$limma_analyses[[k]]
raw_res_gds_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_analysis$name, "_", selection$name, "_integration_raw.tsv")
raw_res_gds <- read.delim(raw_res_gds_fn,
row.names = 1,
quote = "",
stringsAsFactors = FALSE) %>%
mutate(Gene = SYMBOL, log_fold_change = wavg_csss_logFC, P_value = int_csss_pval_marotmayer,
adj_P_value = int_csss_adj_pval_marotmayer) %>%
select(Gene, nb_csss_datasets, nb_conf_datasets, nb_miss_datasets, confidence_score,
log_fold_change, P_value, adj_P_value)
rm(raw_res_gds_fn, k, limma_analysis)
#potGD_notsuitableforGDS <- merge(potGD, data_over, by = "Gene", all.x = TRUE) %>%
# filter(overlap_percentage <= overlap_threshold)
rm(data_over, overlap_threshold)
# We merge the female and male raw results as we did before for the clean results.
raw_res_FM <- merge(x = raw_res_females, y = raw_res_males, by = "Gene", all = TRUE, suffixes = c(".F",".M"))
# The idea was once to re-adjust the adjusted P values of the GDS analysis but it is no
# longer the case.
# Like before, the potential gender-dimorphic genes are present in both analyses but with a
# different regulation (up/down). This decision is based on the log fold-changes and we should
# never consider the P values as these might be weakly correlated with the P values we want to
# adjust.
raw_potGD <- raw_res_FM %>% filter(!is.na(log_fold_change.F), !is.na(log_fold_change.M)) %>%
filter(sign(log_fold_change.F) != sign(log_fold_change.M))
# We can now recompute the P values for these potential raw gender-dimorphic genes only.
raw_res_gds_potGDonly <- raw_res_gds %>% filter(Gene %in% raw_potGD$Gene)
raw_res_gds_potGDonly$RE.adj_P_value <- p.adjust(raw_res_gds_potGDonly$P_value, method = "BH")
# TO REM
# X <- raw_res_gds %>% filter(Gene %in% raw_potGD$Gene)
# Y <- raw_res_gds %>% filter(!Gene %in% raw_potGD$Gene)
# plot_data <- rbind(X %>% select(P_value) %>% mutate(Genes = "Dimorphic") %>% mutate(logP = -log10(P_value)),
# Y %>% select(P_value) %>% mutate(Genes = "Non dimorphic") %>% mutate(logP = -log10(P_value)))
#
# myplot <- ggplot(plot_data, aes(x = Genes,
# y = logP,
# colour = Genes)) +
# #geom_boxplot(lwd = 1.2, outlier.shape = NA) +
# geom_violin(lwd = 1.2, draw_quantiles = c(0.25, 0.5, 0.75)) +
# theme(legend.position = "bottom",
# axis.title.x = element_blank(),
# axis.title.y = element_text(size = 16,
# face = "bold"),
# axis.text.x = element_blank(),
# axis.text.y = element_text(size = 16,
# face = "bold"),
# plot.title = element_text(size = 16,
# hjust = 0.5,
# face = "bold"),
# panel.background = element_blank(),
# axis.line = element_line(colour = "black", size = 1.5),
# panel.grid.major = element_line(colour = "lightgrey"),
# strip.background = element_rect(fill = "black"),
# strip.text = element_text(colour = "white",
# size = 12,
# face = "bold"),
# legend.text = element_text(size = 16,
# face = "bold")) +
# ggtitle("[GDS model] Nominal P values")
# print(myplot)
# END TO REM
# And we remove the genes from raw that did not pass the filters anyway (to keep only
# the genes in the clean version of the results).
new_res_gds <- raw_res_gds_potGDonly %>% filter(Gene %in% res_gds$Gene)
# We filter back these new GDS results using our original list of potential gender-dimorphic genes.
resGDS_potGDsel_only <- merge(new_res_gds, potGDsel, by = "Gene", all.y = TRUE)
# We filter the GDS results using our list of potential gender-dimorphic genes.
resGDS_potGDsel_only <- merge(res_gds, potGDsel, by = "Gene", all.y = TRUE)
rm(res_gds)
GD <- resGDS_potGDsel_only %>% filter(!is.na(log_fold_change))
whattodowiththem2 <- resGDS_potGDsel_only %>% filter(is.na(log_fold_change))
# Objectives: use the GDS signals as reference and only compare to F/M in the paper
# (to strengthen the results - aka confirmatory). So focus on the 1547 genes indeed.
# Complement GSEA with Fisher.
# Recompute the P-values ?
#potGD_notinGDS <- resGDS_potGDsel_only %>% filter(is.na(log_fold_change))
rm(resGDS_potGDsel_only, potGDsel, potGD)
# We save the files we have generated, starting with the female gender-specific genes.
GS_females_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_none_", selection$name, "_genelist_genderspecific_females.tsv")
write.table(GS_females, GS_females_fn,
quote = FALSE,
sep = "\t",
row.names = FALSE)
rm(GS_females_fn,GS_females)
# And the male gender-specific genes.
GS_males_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_none_", selection$name, "_genelist_genderspecific_males.tsv")
write.table(GS_males, GS_males_fn,
quote = FALSE,
sep = "\t",
row.names = FALSE)
rm(GS_males_fn, GS_males)
# The gender-dimorphic genes.
GD_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_none_", selection$name, "_genelist_genderdimorphic.tsv")
write.table(GD, GD_fn,
quote = FALSE,
sep = "\t",
row.names = FALSE)
rm(GD_fn, GD)
# The complete female list (gender-specific + gender-dimorphic).
GSGD_females_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_none_", selection$name, "_genelist_genderspecific_genderdimorphic_females.tsv")
write.table(GSGD_females, GSGD_females_fn,
quote = FALSE,
sep = "\t",
row.names = FALSE)
rm(GSGD_females_fn, GSGD_females)
# The complete male list (gender-specific + gender-dimorphic).
GSGD_males_fn <- paste0(output_data_dir, integration$name, "_", vsn$name, "_none_", selection$name, "_genelist_genderspecific_genderdimorphic_males.tsv")
write.table(GSGD_males, GSGD_males_fn,
quote = FALSE,
sep = "\t",
row.names = FALSE)
rm(GSGD_males_fn, GSGD_males)
# We clean the workspace and log the session details for reproducibility.
rm(l, selection, vsn, j, integration, int_criteria, i)
......
......@@ -29,10 +29,5 @@ Rscript --vanilla ${CODE_FOLDER}/compute_gender_specificity.R > ${OUTPUT_FOLDER}
Rscript --vanilla ${CODE_FOLDER}/compute_dataset_overlap.R > ${OUTPUT_FOLDER}dataset_overlap_log.out 2> ${OUTPUT_FOLDER}dataset_overlap_log.err
Rscript --vanilla ${CODE_FOLDER}/merge_and_filter_rankings.R > ${OUTPUT_FOLDER}merge_filter_log.out 2> ${OUTPUT_FOLDER}merge_filter_log.err
# We then run the final filtering as to obtain the final gene lists for the downstream analyses (and the manuscript).
Rscript --vanilla ${CODE_FOLDER}/apply_second_filters.R > ${OUTPUT_FOLDER}second_filters_log.out 2> ${OUTPUT_FOLDER}second_filters_log.err
# Moving the slurm log file to data
mv ${CODE_FOLDER}/slurm-*out ${OUTPUT_FOLDER}/