Commit 892ed6fc authored by Leon-Charles Tranchevent's avatar Leon-Charles Tranchevent
Browse files

Code refactoring and optimization for step 01.

parent a07ae9cd
......@@ -2,12 +2,11 @@
The objectives of this step are to analyse the quality of the raw data of the project and to create reports that can be visually inspected to decide which samples should be kept and which could (or should) be discarded.
# Details and instructions
Affymetrix datasets are analysed using R and the 'arrayQualityMetrics' package, which produces HTML reports with figures and descriptions. Agilent datasets are analysed using home-made scripts, which creates figures but does not summarize the QC in a single document. Illumina datasets are not controlled so far since we do not have the complete raw data.
The Makefile contains the commands to launch all jobs on the cluster.
Affymetrix datasets are analysed using R and the 'arrayQualityMetrics' package, which produces HTML reports with figures and descriptions. Agilent datasets are analysed using home-made scripts, which creates figures but does not summarize the QC in a single document. Illumina datasets are not controlled so far since we do not have the complete raw data. Similarly, the RNA-seq datasets are not controlled. The Makefile contains the commands to launch all jobs on the cluster.
```
make clean_outputs
make run_qc
```
# Prerequisites
Since this is the first step of the analysis, the only prerequisite is to have the raw data (mostly from GEO) in the Data folder. There should be one folder per dataset, with a TSV file containing the clinical data ('ClinicalData.tsv' and a '/RAW/' folder with the raw data (should not be compressed unless a GEO series matrix file).
Since this is the first step of the analysis, the only prerequisite is to have the raw data (mostly from GEO) in the Data folder. There should be one folder per dataset, with a TSV file containing the clinical data ('ClinicalData.tsv' and a '/RAW/' folder with the raw data (should not be compressed unless a GEO series matrix file).
\ No newline at end of file
......@@ -37,11 +37,19 @@ for (i in seq_len(length(config$datasets))) {
dataset <- config$datasets[[i]]
dataset_name <- dataset$dataset_name
# We only do the QC when it is necessary.
if (dataset$first_qc != 'TRUE') {
message(paste0("[", Sys.time(), "] [", dataset_name, "] QC not necessary."))
next
}
# Clinical data can be used to identify bias in the experiments.
# By default, we have the disease status, but we can also have the gender.
dataset_groups <- c("Disease.status")
if (dataset$clinical_descriptors == "DG") {
dataset_groups <- c("Disease.status", "Gender")
} else if (dataset$clinical_descriptors == "DGA") {
dataset_groups <- c("Disease.status", "Gender")
}
# We run the quality control of the current dataset (if necessary).
......@@ -55,6 +63,7 @@ for (i in seq_len(length(config$datasets))) {
message(paste0("[", Sys.time(), "] [", dataset_name, "] Starting QC."))
ArrayUtils::run_quality_control_on_raw(raw_data_subdir,
output_data_subdir,
use_oligo = dataset$use_oligo,
platform = dataset$platform,
phenotype_groups = dataset_groups,
verbose = TRUE)
......
......@@ -5,7 +5,7 @@
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=12GB
#SBATCH --time=0-00:45:00
#SBATCH --time=0-00:50:00
#SBATCH -p batch
#SBATCH --qos=qos-batch
......
INPUT_FOLDER=/home/users/ltranchevent/Data/GeneDER/Analysis/04/
OUTPUT_FOLDER=/home/users/ltranchevent/Data/GeneDER/Analysis/06/
CODE_FOLDER=/home/users/ltranchevent/Projects/GeneDER/Analysis/06-Data_integration/
clean:
@rm -rf *~
clean_outputs:
@rm -rf ${OUTPUT_FOLDER}*
@cp ${INPUT_FOLDER}Combined_probe_matching.tsv ${OUTPUT_FOLDER}Combined_probe_matching.tsv
summarize:
@sbatch ${CODE_FOLDER}summarize.sh
integrate:
@sbatch ${CODE_FOLDER}integrate.sh
analyse:
@sbatch ${CODE_FOLDER}analyse.sh
check:
@sbatch ${CODE_FOLDER}check.sh
gexpr:
@sbatch ${CODE_FOLDER}create_gene_expr_matrices.sh
doc:
@sbatch ${CODE_FOLDER}doc.sh
# Objectives
The objectives of this step is to integrate the results of the differential expression analysis across several datasets in order to identify similarities and overlaps and to identify robust DEGs.
# Details and instructions
The datasets are first summarized at the gene level (limma analyses are performed at the probe level). Conflicts and non unique mappings are handled to create a unique list of DEGS (per dataset still).
```
make clean_outputs
make summarize
```
The results are lists of DEGs (instead of differentially expressed probes) with NA for the genes that are not present in some of the datasets.
The integration itself is then computed and results are analyzed and checked.
```
make integrate
make analyse
make check
```
We create the final gene expression matrices (to share data outside of the project).
```
make gexpr
```
A document that contains many but not all figures can then be generated.
```
make doc
```
# Prerequisites
A prerequisite is to have the results of the limma analysis for all datasets (Step 05).
#!/bin/bash -l
#SBATCH -J geneder:06:analyse
#SBATCH --mail-type=all
#SBATCH --mail-user=leon-charles.tranchevent@uni.lu
#SBATCH -N 1
#SBATCH -n 4
#SBATCH --time=0-1:00:00
#SBATCH -p batch
#SBATCH --qos=qos-batch
echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
echo "== Node list: ${SLURM_NODELIST}"
echo "== Submit dir. : ${SLURM_SUBMIT_DIR}"
echo ""
# Defining global parameters.
OUTPUT_FOLDER=/home/users/ltranchevent/Data/GeneDER/Analysis/06/
CODE_FOLDER=/home/users/ltranchevent/Projects/GeneDER/Analysis/06-Data_integration/
# Loading modules.
module load lang/R/3.6.0-foss-2019a-bare
# Actual jobs
Rscript --vanilla ${CODE_FOLDER}analyse_integration_results.R > ${OUTPUT_FOLDER}analyse_log.out 2> ${OUTPUT_FOLDER}analyse_log.err
Rscript --vanilla ${CODE_FOLDER}analyse_integration_results_figures.R > ${OUTPUT_FOLDER}analyse_fig_log.out 2> ${OUTPUT_FOLDER}analyse_fig_log.err
# Moving the slurm log file to data
mv ${CODE_FOLDER}slurm-*out ${OUTPUT_FOLDER}
#!/usr/bin/env Rscript
# ================================================================================================
# Libraries
# ================================================================================================
library("yaml")
library("ggplot2")
library("tidyverse")
library("stringr")
source("../libs/conf/confR.R")
source("../libs/utils/utils.R")
message(paste0("[", Sys.time(), "] Libraries loaded."))
# ================================================================================================
# Configuration
# ================================================================================================
options(bitmapType = "cairo")
config <- read_config(config_dirs = c("../Confs/", "./"))
output_data_dir <- paste0(config$global_data_dir, config$local_data_dir)
input_data_dir <- paste0(config$global_data_dir, config$local_input_data_dir)
message(paste0("[", Sys.time(), "] Configuration done."))
# ================================================================================================
# Functions
# ================================================================================================
#' @title Function to save the lists of significant genes.
#'
#' @description This function saves the significant genes from a data-frame. It saves three
#' files, one with all significant genes, one with opnly up-regulated genes and one with
#' only down regulated genes. It then logs some basic counts.
#'
#' @param data The data-frame that contains the significant genes.
#' @param file_prefix A striong representing the prefix of the file names to be used to
#' save the significant genes.
save_sign_genes <- function(data, file_prefix) {
# We filter the data.
sign_genes <- data %>%
dplyr::mutate(diff_exp = ifelse(sign(wavg_csss_logFC) == 1, "up", "down")) %>% # nolint
dplyr::select(SYMBOL, diff_exp)
sign_genes_up <- sign_genes %>% dplyr::filter(diff_exp == "up") # nolint
sign_genes_down <- sign_genes %>% dplyr::filter(diff_exp == "down") # nolint
# We define filenames.
sign_genes_fn <- paste0(file_prefix, "sign_genes.tsv")
sign_genes_up_fn <- paste0(file_prefix, "sign_genes_up.tsv")
sign_genes_down_fn <- paste0(file_prefix, "sign_genes_down.tsv")
# We write down data.
write.table(sign_genes,
file = sign_genes_fn,
sep = "\t",
quote = FALSE,
col.names = FALSE,
row.names = FALSE)
write.table(sign_genes_up$SYMBOL,
file = sign_genes_up_fn,
sep = "\t",
quote = FALSE,
col.names = FALSE,
row.names = FALSE)
write.table(sign_genes_down$SYMBOL,
file = sign_genes_down_fn,
sep = "\t",
quote = FALSE,
col.names = FALSE,
row.names = FALSE)
# We log some infos for the user.
loc_log <- paste0(" Total: ", dim(sign_genes)[1],
" Up: ", dim(sign_genes_up)[1],
" (", round((dim(sign_genes_up) / dim(sign_genes))[1], 2) * 100,
"%) Down: ", dim(sign_genes_down)[1],
" (", round((dim(sign_genes_down) / dim(sign_genes))[1], 2) * 100,
"%)")
log_prefix <- stringr::str_split(file_prefix, "/")[[1]]
message(paste0("[", Sys.time(), "][", log_prefix[length(log_prefix)], "]", loc_log, "."))
rm(loc_log, log_prefix, sign_genes, sign_genes_up, sign_genes_down)
rm(sign_genes_fn, sign_genes_up_fn, sign_genes_down_fn)
}
# ================================================================================================
# Main
# ================================================================================================
# The objective here is to save the lists of significant genes for each configuration (from the
# overall TSV tables). In addition, plots are created to check the number of contributing
# datasources.
# We count the number of significant genes for all cases.
all_counts <- vector()
all_count_rnames <- vector()
# As usual, we loop over the integrations / variances / limmas / selections / combinations.
# We start by considering all integration schemes one by one.
for (i in seq_len(length(config$integrations))) {
# We get the integration name for that scheme.
integration <- config$integrations[[i]]
# We repeat the analysis for all VSN usages.
for (j in seq_len(length(config$variance_methods))) {
# We get the current VSN configuration.
vsn <- config$variance_methods[[j]]
# We loop over the different limma analyses.
for (k in seq_len(length(config$limma_analyses))) {
# We define the current limma analysis.
limma_comparison <- config$limma_analyses[[k]]$name
# We have structures to keep all data, regardless of the
# gene-probe selection nmethod in order to make comparisons later.
sign_genes_all <- list()
selection_names <- vector()
combination_names <- vector()
# We loop over the different gene-probe selection methods.
for (l in seq_len(length(config$selections))) {
# We define the current gene-probe selection method.
selection <- config$selections[[l]]
selection_names <- c(selection_names, selection$name)
# We define the file prefix for this configuration defined
# based on all above loops.
file_prefix <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_comparison, "_", selection$name, "_")
# We read the results of the integration (contains results for all P value combination
# methods).
int_results_fn <- paste0(file_prefix, "integration.tsv")
integration_results <- read.table(int_results_fn,
header = TRUE,
sep = "\t",
row.names = 1,
as.is = TRUE)
rm(int_results_fn)
# We create the data structures to later plot the use counts.
results_core_matrix_full <- integration_results %>%
dplyr::select(-starts_with("sum_of")) %>%
dplyr::select(-starts_with("int_")) %>%
dplyr::select(-starts_with("P_")) %>%
dplyr::select(-starts_with("confidence")) %>%
dplyr::select(-starts_with("wavg")) %>%
dplyr::select(-starts_with("nb_")) %>%
dplyr::select(-starts_with("SYMBOL"))
# We filter out to non consensus log fold changes only for the use count plots.
results_csss_lfc <- rep(integration_results$wavg_csss_logFC, dim(results_core_matrix_full)[2])
dim(results_csss_lfc) <- dim(results_core_matrix_full)
results_core_matrix_full[sign(results_core_matrix_full) != sign(results_csss_lfc)] <- NA
nb_nas <- c(0, sapply(results_core_matrix_full,
function(y) sum(is.na(y))))
use_count_full <- dim(results_core_matrix_full)[1] - nb_nas
names(use_count_full)[1] <- "Total"
use_counts <- data.frame(Dataset = names(use_count_full), Count = use_count_full)
rm(results_core_matrix_full, nb_nas, use_count_full)
# Plotting of the use counts (aka the number of P values that were combined via the
# integration).
use_counts_fn <- paste0(file_prefix, "use_counts.png")
png(use_counts_fn)
myplot <- ggplot(use_counts,
aes(x = Dataset, y = Count, colour = Dataset, fill = Dataset)) +
geom_bar(stat = "identity", lwd = 1.5) +
theme(legend.position = "bottom",
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
panel.grid.major = element_line(colour = "lightgrey"),
strip.background = element_rect(fill = "black"),
strip.text = element_text(colour = "white"),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank())
print(myplot)
dev.off()
rm(myplot, use_counts_fn, use_counts)
# Clean the df for further use (now we only need the global P values).
integration_results <- integration_results %>%
dplyr::select(SYMBOL, wavg_csss_logFC, starts_with("int_csss_adj_pval_")) #nolint
# We then study the various P value combination methods one by one.
all_local_counts <- vector()
for (o in seq_len(length(config$p_val_combinations))) {
pval_comb <- config$p_val_combinations[[o]]
combination_names <- c(combination_names, pval_comb$name)
# List the significant genes if any.
sign_genes <- integration_results %>%
dplyr::filter(.data[[paste0("int_csss_adj_pval_",
pval_comb$name)]] < config$p_val_sign_thres)
# Update the global structures (for comparison).
local_index <- 3 * (l - 1)
sign_genes_all[[local_index + o]] <- sign_genes
all_local_counts <- c(all_local_counts, dim(sign_genes)[1])
rm(local_index)
# We plot the eCDFs of the P values (if we have at least one gene).
if (dim(sign_genes)[1] > 0) {
sign_genes_4ecdf <- sign_genes %>%
dplyr::mutate(diff_exp = ifelse(sign(wavg_csss_logFC) == 1, "up", "down")) #nolint
ecdf_fn <- paste0(file_prefix, pval_comb$name, "_eCDFs.png")
png(ecdf_fn)
myplot <- ggplot(sign_genes_4ecdf,
aes_string(x = paste0("int_csss_adj_pval_", pval_comb$name),
colour = "diff_exp")) +
stat_ecdf(pad = FALSE, geom = "smooth") +
theme(legend.position = "bottom",
axis.title.x = element_blank(),
plot.title = element_text(size = 15,
hjust = 0.5),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
panel.grid.major = element_line(colour = "lightgrey"),
strip.background = element_rect(fill = "black"),
strip.text = element_text(colour = "white")) +
scale_color_brewer(palette = "Dark2") +
ggtitle("ECDF of P values for up and down genes.")
print(myplot)
dev.off()
rm(sign_genes_4ecdf, ecdf_fn, myplot)
} # End if there is at least one significant DEG.
# We now save the individual gene lists (even if enmpty).
save_sign_genes(sign_genes, paste0(file_prefix, pval_comb$name, "_"))
rm(pval_comb, sign_genes)
} # End for all P value combination methods.
rm(o, integration_results)
# We reformat the count matrix as to later be able to compare the different
# P value combination methods and gene selection methods.
all_counts <- rbind(all_counts, all_local_counts)
all_count_rnames <- c(all_count_rnames, paste(integration$name, vsn$name, limma_comparison,
selection$name, sep = "-"))
rm(all_local_counts, selection, file_prefix)
} # End of for each gene-probe selection.
rm(l)
# At this stage, we want to compare the gene-probe selection / P value combination methods.
# We start by looking at the overlaps (raw counts and percentages).
# n covers the gene-probe selection methods and the P value combination methods.
# Here n = 2 * 3 = 6.
n <- length(config$p_val_combinations) * length(config$selections)
overlap <- rep(1, n * n)
dim(overlap) <- c(n, n)
colnames(overlap) <- rep("_", n)
rm(n)
for (m in seq_len(length(sign_genes_all) - 1)) {
for (n in (m + 1):length(sign_genes_all)) {
set1 <- sign_genes_all[[m]]$SYMBOL
set2 <- sign_genes_all[[n]]$SYMBOL
inter_set1_set2 <- intersect(set1, set2)
perc_over <- 100 * length(inter_set1_set2) / min(length(set1), length(set2))
overlap[m, n] <- length(inter_set1_set2)
overlap[n, m] <- paste0(round(perc_over, 0), "\\%")
if (is.na(perc_over)) {
overlap[n, m] <- ""
}
colnames(overlap)[m] <- paste0(" (", length(set1), ")")
colnames(overlap)[n] <- paste0(" (", length(set2), ")")
rm(set1, set2, inter_set1_set2, perc_over)
}
rm(n)
}
rm(m, sign_genes_all)
# We save these data (overlaps).
subnames <- rep(selection_names, each = length(config$p_val_combinations))
names <- paste0(subnames, "-", combination_names)
colnames(overlap) <- paste0(names, colnames(overlap))
rownames(overlap) <- names
overlap[is.na(overlap)] <- 0
file_prefix <- paste0(output_data_dir, integration$name, "_", vsn$name, "_",
limma_comparison, "_")
overlap_fn <- paste0(file_prefix, "overlap.tsv") # not good file_prefix or ?
write.table(overlap, file = overlap_fn, sep = "\t", quote = FALSE, col.names = NA)
rm(subnames, names, overlap, overlap_fn, selection_names)
rm(limma_comparison, file_prefix)
} # End of for each limma analysis.
rm(k, vsn)
} # End for each variance configuration.
rm(j, integration)
} # End for each data integration scheme.
rm(i)
# We adjust the global count table.
rownames(all_counts) <- str_replace_all(all_count_rnames, "_", "-")
colnames(all_counts) <- combination_names[1:(length(combination_names) / length(config$selections))]
all_counts_fn <- paste0(output_data_dir, "all_counts.tsv")
write.table(all_counts, file = all_counts_fn, sep = "\t", quote = FALSE, col.names = NA)
rm(all_counts, all_counts_fn, all_count_rnames, combination_names)
# We log the session details for reproducibility.
rm(config, output_data_dir, input_data_dir)
sessionInfo()
#!/usr/bin/env Rscript
# ================================================================================================
# Libraries
# ================================================================================================
library("yaml")
library("tidyverse")
library("stringr")
source("../libs/conf/confR.R")
source("../libs/utils/utils.R")
message(paste0("[", Sys.time(), "] Libraries loaded."))
# ================================================================================================
# Configuration
# ================================================================================================
options(bitmapType = "cairo")
config <- read_config(config_dirs = c("../Confs/", "./"))
output_data_dir <- paste0(config$global_data_dir, config$local_data_dir)
input_data_dir <- paste0(config$global_data_dir, config$local_input_data_dir)
input_edata_dir <- paste0(config$global_data_dir, config$local_exp_data_dir)
message(paste0("[", Sys.time(), "] Configuration done."))
# ================================================================================================
# Functions
# ================================================================================================
#' @title Plot the differential expression of top hits.
#'
#' @description This function create one plot for each top hit (gene) of a given limma
#' comparison (obtained for a given configuration). Each plot contains the expression
#' values across samples (grouped per dataset and per relevant clinical descriptor).
#'
#' @param output_data_dir A string to indicate the folder in which plots should be saved.
#' @param integration_name A string that contains the name of the current integration
#' scheme.
#' @param limma_analysis A complete configuration of the Limma comparison.
#' @param selection_name A string corresponding to the used gene--probe selection
#' method.
#' @param palette The color palette to use for plotting.
#' @param clin_data_all A list that contains one data-frame per datasets with its
#' clinical data.
#' @param exp_data_all A list that contains one data-frame per datasets with its
#' expression profiles (not transformed).
#' @param zexp_data_all A list that contains one data-frame per datasets with its
#' expression profiles (Z-scores).
#' @param m_indx_map A named list representing a map between two dataset indexes,
#' namely the index of the global dataset configuration (all datasets) and the list
#' of relevant datasets for this specific configuration. The list contains the second
#' index (relevant only) while its names are for the first index (all datasets).
#' @param nb_hist A numeric to restrict the plotting to the top hits only. Default
#' to 10.
plot_top_hits <- function(output_data_dir, integration_name, limma_analysis, selection_name,
palette, clin_data_all, exp_data_all, zexp_data_all, m_indx_map, config,
nb_hits = 10) {
# We set the I/Os.
file_prefix_short <- paste0(output_data_dir, integration_name, "_", vsn$name, "_",
limma_analysis$name, "_")
# Read the matching data.
matching_data_fn <- paste0(file_prefix_short, "matching_data.tsv")
matching_data <- read.table(matching_data_fn,
header = TRUE,
sep = "\t",
row.names = NULL,
as.is = TRUE)
rm(matching_data_fn)
message(paste0("[", Sys.time(), "] Matching data read."))
# We read the results of the integration.
int_results_fn <- paste0(file_prefix_short, selection_name, "_integration.tsv")
results_data_df <- read.table(int_results_fn,
header = TRUE,
sep = "\t",
row.names = NULL,
as.is = TRUE)
rm(int_results_fn)
message(paste0("[", Sys.time(), "] Integration results read."))
# We plot the top hits defined by each P value combination method.
for (o in seq_len(length(config$p_val_combinations))) {
# We define the I/Os.
pval_comb <- config$p_val_combinations[[o]]
file_prefix <- paste0(file_prefix_short, selection_name, "_", pval_comb$name, "_")
# We identify the top hits.
top_hits <- results_data_df %>%
dplyr::arrange(!! sym(paste0("int_csss_adj_pval_", pval_comb$name)),
!! sym(paste0("int_csss_pval_", pval_comb$name))) %>%
dplyr::select(SYMBOL, !! sym(paste0("int_csss_adj_pval_", pval_comb$name))) %>%
head(nb_hits)
# We plot each gene sequentially.
for (p in seq_len(dim(top_hits)[1])) {
# We get the relevant probe identifiers.
gene <- top_hits[p, ]
matching_data_gene <- matching_data %>%
dplyr::filter(genes == gene$SYMBOL) %>%
dplyr::select(ends_with(str_replace(selection_name, "-", ".")))
# We do all datasets one by one and collect the clinical and expression data.
full_profiles <- vector()
full_zprofiles <- vector()
for (ii in seq_len(length(clin_data_all))) {
# We define the second index as to switch between the global list (full)
# and current list (selection).
jj <- m_indx_map[[ii]]
# We get the dataset details (name), and retrieve the correct probe id.
dataset <- config$datasets[[jj]]
dataset_name <- dataset$dataset_name
dataset_arraytype <- stringr::str_replace(dataset$array_type, "-", ".")
probeset_tag <- paste0(dataset_arraytype, "_",
stringr::str_replace(selection_name, "-", "."))
probe_id <- matching_data_gene[[probeset_tag]]
rm(dataset, probeset_tag, dataset_arraytype, jj)
# We read the clinical data.
clin_data <- clin_data_all[[ii]] %>%
dplyr::mutate(gender_disease_status = paste(Gender, Disease.status, sep = ".")) %>%
dplyr::select("Disease.status", "Gender", "gender_disease_status")
# We read the expression data (expression levels).
# The dataset might not have a probe for that gene.
exp_data <- t(exp_data_all[[ii]][1, ])
exp_data[, 1] <- NA