Commit 1ae85b89 authored by Leon-Charles Tranchevent

Merge branch 'develop' into 'master'.

parents 15a9bb14 f26f07fa
# Objectives
-The objectives of this step are to analyse the quality of the raw data of the project and to create reports that can be visually inspected to decide which samples should be kept and which could (or should) be discarded.
+The objectives of this step are to analyze the quality of the raw data and to create reports that can be visually inspected to decide which samples should be kept and which should be discarded.
# Details and instructions
-Affymetrix datasets are analysed using R and the 'arrayQualityMetrics' package, which produces HTML reports with figures and descriptions. Agilent datasets are analysed using home-made scripts, which creates figures but does not summarize the QC in a single document. Illumina datasets are not controlled so far since we do not have the complete raw data.
-The Makefile contains the commands to launch all jobs on the cluster.
+Affymetrix datasets are analyzed using R and the 'arrayQualityMetrics' package, which produces HTML reports with figures and descriptions. Agilent datasets are analyzed using home-made scripts, which create figures but do not summarize the QC in a single document. The Makefile contains the commands to launch all jobs on the cluster.
```
make clean_outputs
make run_qc
```
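For reference, a minimal sketch of what such a QC run looks like in plain R, assuming a single dataset folder laid out as described under Prerequisites below (the pipeline itself presumably wraps this inside ArrayUtils::run_quality_control_on_raw; paths and column names are illustrative):
```
# Hedged sketch: QC of one Affymetrix dataset with arrayQualityMetrics.
library(affy)
library(arrayQualityMetrics)

raw_dir <- "Data/GSE00000/RAW/"                     # hypothetical dataset folder
eset <- affy::ReadAffy(celfile.path = raw_dir)
pd <- read.delim("Data/GSE00000/ClinicalData.tsv",  # clinical annotations
                 row.names = 1)
Biobase::pData(eset) <- pd                          # rows must match the CEL files
arrayQualityMetrics(expressionset = eset,
                    outdir = "Analysis/01/GSE00000/",
                    force = TRUE,                   # overwrite a previous report
                    do.logtransform = TRUE,         # raw intensities are not logged
                    intgroup = "Disease.status")    # grouping used to color the plots
```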
# Prerequisites
Since this is the first step of the analysis, the only prerequisite is to have the raw data (mostly from GEO) in the Data folder. There should be one folder per dataset, with a TSV file containing the clinical data ('ClinicalData.tsv') and a '/RAW/' folder with the raw data (which should not be compressed, unless it is a GEO series matrix file).
@@ -42,11 +42,19 @@ for (i in seq_len(length(config$datasets))) {
dataset_groups <- c("Disease.status")
if (dataset$clinical_descriptors == "DG") {
dataset_groups <- c("Disease.status", "Gender")
} else if (dataset$clinical_descriptors == "DGA") {
dataset_groups <- c("Disease.status", "Gender")
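+# Age ("A") is presumably continuous and therefore not used as an extra grouping factor.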
}
# We run the quality control of the current dataset (if necessary).
if (selected_dataset_name == "" || selected_dataset_name == dataset_name) {
+# We only do the QC when it is necessary.
+if (dataset$first_qc != "TRUE") {
+message(paste0("[", Sys.time(), "] [", dataset_name, "] QC not necessary."))
+next
+}
# We define the I/Os
raw_data_subdir <- paste0(raw_data_dir, dataset_name, "/")
output_data_subdir <- paste0(output_data_dir, dataset_name, "/")
@@ -55,6 +63,7 @@ for (i in seq_len(length(config$datasets))) {
message(paste0("[", Sys.time(), "] [", dataset_name, "] Starting QC."))
ArrayUtils::run_quality_control_on_raw(raw_data_subdir,
output_data_subdir,
+use_oligo = dataset$use_oligo,
platform = dataset$platform,
phenotype_groups = dataset_groups,
verbose = TRUE)
@@ -5,9 +5,9 @@
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --mem=12GB
-#SBATCH --time=0-00:45:00
+#SBATCH --time=0-01:30:00
#SBATCH -p batch
-#SBATCH --qos=qos-batch
+#SBATCH --qos=normal
echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
@@ -20,8 +20,7 @@ OUTPUT_FOLDER=/home/users/ltranchevent/Data/GeneDER/Analysis/01/
CODE_FOLDER=/home/users/ltranchevent/Projects/GeneDER/Analysis/01-Quality_control/
# Loading modules.
mu
-module load lang/R/3.6.0-foss-2018a-X11-20180131-bare
+module load lang/R/3.6.0-foss-2019a-bare
# Load configuration
source ../libs/conf/confSH.sh
@@ -7,11 +7,9 @@ clean_outputs:
@rm -rf ${OUTPUT_FOLDER}*
preprocess:
@sbatch ${CODE_FOLDER}preprocess.sh
-vsn:
-@sbatch ${CODE_FOLDER}stabilize_variance.sh
+vsn_raw:
+@sbatch ${CODE_FOLDER}stabilize_variance_raw.sh
get_log:
@sbatch ${CODE_FOLDER}get_log.sh
+vsn:
+@sbatch ${CODE_FOLDER}stabilize_variance.sh
doc:
@sbatch ${CODE_FOLDER}doc.sh
@@ -2,27 +2,28 @@
The objectives of this step are to preprocess the raw data and to save them for further use.
# Details and instructions
-All Affymetrix datasets are preprocessed using R and the 'SCAN-UPC' package. Illumina and Agilent datasets are preprocessed using dedicated R libraries (limma and beadarrays). The Makefile contains the commands to launch all jobs on the cluster. The data are stored back as TSV files.
+All Affymetrix datasets are preprocessed using SCAN and GC-RMA. Illumina and Agilent datasets are preprocessed using dedicated R libraries ('limma' and 'beadarray'). The RNA-seq data are only post-processed since the preprocessing took place beforehand. The Makefile contains the commands to launch all jobs on the cluster.
+The data are stored as TSV files.
```
make clean_outputs
make preprocess
```
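As an illustration, a minimal SCAN run for a single hypothetical dataset could look as follows (the pipeline itself goes through ArrayUtils::preprocess_data and the APT-based GC-RMA script; paths are assumptions):
```
# Hedged sketch: SCAN normalization of one Affymetrix dataset with SCAN.UPC.
library(SCAN.UPC)

eset <- SCAN.UPC::SCAN("Data/GSE00000/RAW/*.CEL.gz")   # hypothetical CEL files
write.table(Biobase::exprs(eset),
            file = "Analysis/02/GSE00000/GSE00000_normalized_SCAN.tsv",
            sep = "\t", quote = FALSE)
```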
-We can then make a plot of the preprocessing results (to identify the arrays for which SCAN modelling might have failed). The list of datasets is hard-coded in the bash script 'prepare_log.sh'. We don't do any post-processing of the normalization procedure for Illumina or Agilent arrays.
+For the datasets analyzed through SCAN, we can then make a plot of the preprocessing results (to identify the arrays for which SCAN modeling might have failed). We don't do any post-processing of the normalization procedure for Illumina or Agilent arrays (QC is run in all cases anyway).
```
make get_log
```
-We can then investigate the effect of applying a variance stabilization method on the data
+We can then investigate the effect of applying a variance stabilization method on the data (i.e., to control heteroscedasticity).
```
make vsn
make vsn_raw
```
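The core of that check, stripped of the pipeline's bookkeeping, is sketched below (assuming one normalized TSV file; the real run_vsn function adds retries and rescaling):
```
# Hedged sketch: check and apply variance stabilization on a normalized matrix.
library(vsn)

exp_data <- as.matrix(read.delim("Analysis/02/GSE00000/GSE00000_normalized_SCAN.tsv",
                                 row.names = 1))     # hypothetical file
vsn::meanSdPlot(exp_data)                # mean-sd trend before stabilization
exp_data_vsn <- vsn::justvsn(exp_data)   # fit and apply the VSN transformation
vsn::meanSdPlot(exp_data_vsn)            # should now be approximately flat
```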
-Finally, we create a report about the variance stabilization method so that we can manually check whether it is necessary to correct for variance bias.
+Finally, we create a report about the processing and the variance stabilization methods so that we can manually check which normalization procedure makes more sense or whether it is indeed necessary to correct for variance bias.
```
make doc
```
# Prerequisites
-In general, the only prerequisite is to have the raw data (mostly from GEO) in the Data folder. There should be one folder per dataset, with a '/RAW/' folder with the raw data as CEL files. In particular, it is not necessary to run the quality control before running the preprocessing since arrays are preprocessed independently (unless one has many problematic arrays, which would take CPU time for nothing). However, for Illumina and Agilent arrays, this is not the case, and the bad quality arrays needs to be filtered before pre-processing (arrays are not treated individually).
+In general, the only prerequisite is to have the raw data (mostly from GEO) in the Data folder. For the array-based datasets, there should be one folder per dataset, with a '/RAW/' folder with the raw data as CEL files. For the RNA-seq dataset, the pre-processed data should be available as TSV files containing read counts. For SCAN in particular, it is not necessary to run the quality control before running the preprocessing since arrays are preprocessed independently (unless one has many problematic arrays, which would waste CPU time). For all the other methods, however, this is not the case, and the bad-quality arrays need to be filtered out before pre-processing.
@@ -4,9 +4,9 @@
#SBATCH --mail-user=leon-charles.tranchevent@uni.lu
#SBATCH -N 1
#SBATCH -n 1
-#SBATCH --time=0-00:01:00
+#SBATCH --time=0-00:02:00
#SBATCH -p batch
-#SBATCH --qos=qos-batch
+#SBATCH --qos=normal
echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
@@ -6,7 +6,7 @@
#SBATCH -n 1
#SBATCH --time=0-00:02:00
#SBATCH -p batch
-#SBATCH --qos=qos-batch
+#SBATCH --qos=normal
echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
@@ -19,8 +19,7 @@ OUTPUT_FOLDER=/home/users/ltranchevent/Data/GeneDER/Analysis/02/
CODE_FOLDER=/home/users/ltranchevent/Projects/GeneDER/Analysis/02-Preprocessing/
# Loading modules.
mu
-module load lang/R/3.6.0-foss-2018a-X11-20180131-bare
+module load lang/R/3.6.0-foss-2019a-bare
# Load configuration
source ../libs/conf/confSH.sh
@@ -29,17 +29,17 @@ rm(output_data_dir)
# Load the prepared SCAN log file with the number of iterations and proportion of background
# probes per sample for each dataset.
-SCAN_log <- read_delim(input_data_file, "\t", escape_double = FALSE, trim_ws = TRUE)
+scan_log <- read_delim(input_data_file, "\t", escape_double = FALSE, trim_ws = TRUE)
message(paste0("[", Sys.time(), "] Data read."))
# We prepare the data for ggplot by adding extra columns to define classes.
-D <- SCAN_log %>%
+d <- scan_log %>%
dplyr::mutate(class_tag = ifelse(convergence < 1,
ifelse(prop_back < 0.85,
"Normal",
"High background (>85% probes)")
, "No convergence"))
-rm(SCAN_log)
+rm(scan_log)
message(paste0("[", Sys.time(), "] Data filtered."))
# Scatter plot of the number of iterations vs the proportion of background probes per dataset.
@@ -47,7 +47,7 @@ message(paste0("[", Sys.time(), "] Data filtered."))
# Another distinction is made for points with a convergence higher than 1.0 (i.e., considered
# as no convergence).
png(output_figure_file, width = 1600, height = 1600, units = "px", res = 300)
-myplot <- ggplot(D, aes(x = nb_iterations, y = prop_back, colour = class_tag)) +
+myplot <- ggplot(d, aes(x = nb_iterations, y = prop_back, colour = class_tag)) +
geom_jitter(size = 3, alpha = 0.66) +
facet_wrap(dataset ~ .) +
theme(legend.position = "bottom", legend.title = element_blank(),
@@ -63,8 +63,8 @@ rm(output_figure_file, myplot)
message(paste0("[", Sys.time(), "] Plot created."))
# We save the file for later use (just in case)
-write_csv(D, path = input_data_file)
-rm(D, input_data_file)
+write_csv(d, path = input_data_file)
+rm(d, input_data_file)
message(paste0("[", Sys.time(), "] Data saved."))
# We log the session details for reproducibility.
@@ -44,6 +44,8 @@ for (i in seq_len(length(config$datasets))) {
dataset_groups <- c("Disease.status")
if (dataset$clinical_descriptors == "DG") {
dataset_groups <- c("Disease.status", "Gender")
} else if (dataset$clinical_descriptors == "DGA") {
dataset_groups <- c("Disease.status", "Gender")
}
# We preprocess the current dataset (if necessary).
@@ -90,7 +92,7 @@ for (i in seq_len(length(config$datasets))) {
row.names = 1,
quote = "",
stringsAsFactors = FALSE)
-# We are not running apt-gcrma so we rely on the ArrayLib functions.
+# We are not running apt-gcrma so we rely on the ArrayUtils functions.
esets <- ArrayUtils::preprocess_data(raw_data_subdir,
output_data_files,
platform = dataset$platform,
@@ -101,7 +103,7 @@ for (i in seq_len(length(config$datasets))) {
verbose = TRUE)
rm(exprs_raw, exprs_fn)
} else {
-# We are not running apt-gcrma so we rely on the ArrayLib functions.
+# We are not running apt-gcrma so we rely on the ArrayUtils functions.
esets <- ArrayUtils::preprocess_data(raw_data_subdir,
output_data_files,
platform = dataset$platform,
@@ -139,7 +141,7 @@ for (i in seq_len(length(config$datasets))) {
output_data_subdir_nobc,
phenotype_groups = dataset_groups,
verbose = TRUE)
-rm(output_data_subdir_nobc)
+rm(output_data_subdir_nobc, esets)
message(paste0("[", Sys.time(), "] [", dataset_name, "][", normalization_method,
"] QC (without batch effect correction) finished."))
} else {
@@ -163,7 +165,7 @@ for (i in seq_len(length(config$datasets))) {
row.names = 1,
quote = "",
stringsAsFactors = FALSE)
-# We are not running apt-gcrma so we rely on the ArrayLib functions.
+# We are not running apt-gcrma so we rely on the ArrayUtils functions.
eset <- ArrayUtils::preprocess_data(raw_data_subdir,
output_data_file,
platform = dataset$platform,
......@@ -174,7 +176,7 @@ for (i in seq_len(length(config$datasets))) {
verbose = TRUE)
rm(exprs_raw, exprs_fn)
} else {
-# We are not running apt-gcrma so we rely on the ArrayLib functions.
+# We are not running apt-gcrma so we rely on the ArrayUtils functions.
eset <- ArrayUtils::preprocess_data(raw_data_subdir,
output_data_file,
platform = dataset$platform,
@@ -198,16 +200,17 @@ for (i in seq_len(length(config$datasets))) {
output_data_subdir_nobc,
phenotype_groups = dataset_groups,
verbose = TRUE)
-rm(output_data_subdir_nobc)
+rm(output_data_subdir_nobc, eset)
message(paste0("[", Sys.time(), "] [", dataset_name, "][", normalization_method,
"] QC (without batch effect correction) finished."))
} # End if dataset has known batches.
message(paste0("[", Sys.time(), "] [", dataset_name, "][", normalization_method,
"] Ended."))
-rm(raw_data_subdir, output_data_subdir, normalization, normalization_method)
+rm(output_data_subdir, normalization, normalization_method)
} # End for each normalization.
rm(j)
} # End if dataset should be analysed.
-rm(dataset, dataset_name, dataset_groups)
+rm(dataset, dataset_name, dataset_groups, raw_data_subdir)
} # End for each dataset.
# We log the session details for reproducibility.
@@ -4,9 +4,9 @@
#SBATCH --mail-user=leon-charles.tranchevent@uni.lu
#SBATCH -N 1
#SBATCH -n 20
-#SBATCH --time=0-06:30:00
+#SBATCH --time=0-07:00:00
#SBATCH -p batch
-#SBATCH --qos=qos-batch
+#SBATCH --qos=normal
echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
@@ -19,8 +19,7 @@ OUTPUT_FOLDER=/home/users/ltranchevent/Data/GeneDER/Analysis/02/
CODE_FOLDER=/home/users/ltranchevent/Projects/GeneDER/Analysis/02-Preprocessing/
# Loading modules.
mu
-module load lang/R/3.6.0-foss-2018a-X11-20180131-bare
+module load lang/R/3.6.0-foss-2019a-bare
# Load configuration
source ../libs/conf/confSH.sh
@@ -28,6 +28,7 @@ do
# We get the correct CDF file based on the dataset / platform configurations.
datasetArrayType=${datasets__array_type[$i]}
cdfName="NA"
mpsName="NA"
nbPlatforms=${#platforms__platform_name[@]}
for (( j=0; j<$nbPlatforms; j++ ))
do
@@ -35,14 +36,22 @@ do
if [ "${platformName}" == "${datasetArrayType}" ]
then
cdfName=${platforms__cdf_name[$j]}
+mpsName=${platforms__mps_name[$j]}
fi
done
-# We prepare and run the APT command.
-apt_cmd="${apt_script} -a ${apt_script_method} -d ${global_raw_data_dir}Platforms/${cdfName} -o ${OUTPUT_FOLDER}apt_gcrma_temp ${global_raw_data_dir}${datasetName}/RAW/*"
+# We prepare the APT command. We take care of the cases for which we need to
+# summarize the data at the transcript level (exon arrays). Otherwise, we summarize at
+# the probeset level (by default).
+if [ "${mpsName}" == "NA" ]
+then
+apt_cmd="${apt_script} -a ${apt_script_method} -d ${global_raw_data_dir}Platforms/${cdfName} -o ${OUTPUT_FOLDER}apt_gcrma_temp ${global_raw_data_dir}${datasetName}/RAW/*"
+else
+apt_cmd="${apt_script} -a ${apt_script_method} -d ${global_raw_data_dir}Platforms/${cdfName} -m ${global_raw_data_dir}Platforms/${mpsName} -o ${OUTPUT_FOLDER}apt_gcrma_temp ${global_raw_data_dir}${datasetName}/RAW/*"
+fi
+# We run the APT command and we rename / copy the result file to the real apt folder.
eval "$apt_cmd"
-# We rename and copy the result file to the real apt folder.
mv ${OUTPUT_FOLDER}apt_gcrma_temp/gc-correction.scale-intensities.rma-bg.quant-norm.pm-only.med-polish.summary.txt ${OUTPUT_FOLDER}apt_gcrma/${datasetName}.tsv
# We clean up the temporary folder for that run.
@@ -34,21 +34,21 @@ message(paste0("[", Sys.time(), "] Configuration done."))
# ================================================================================================
#' @title Run the VSN method for a given configuration.
#'
#' @description This method takes a given configuration and runs VSN on it to try to stabilize the
#' variance. It creates several plots before and after VSN is applied and also saves the entire
#' expression matrix after the correction.
#'
#' @param output_data_subdir The folder in which the expression data is stored. It will also be
#' used to store the figures and the corrected data.
#' @param dataset_name The name of the dataset to consider.
#' @param normalization_name The name of the normalization with which the dataset was
#' pre-processed.
#' @param batchcorrection_tag The batch correction tag used to name the files.
run_vsn <- function(output_data_subdir, dataset_name, normalization_name, batchcorrection_tag) {
# We read the expression data.
input_file_name <- paste0(output_data_subdir, dataset_name, "_normalized_",
-normalization$name, "_", batchcorrection_tag, ".tsv")
+normalization_name, "_", batchcorrection_tag, ".tsv")
exp_data <- as.matrix(read.delim(input_file_name, row.names = 1))
exp_data_min <- min(min(exp_data))
@@ -62,12 +62,12 @@ run_vsn <- function(output_data_subdir, dataset_name, normalization_name, batchc
# We plot the mean vs sdev before variance stabilization.
output_file_name <- paste0(output_data_subdir, dataset_name, "_normalized_",
-normalization$name, "_", batchcorrection_tag, "_meansd_ranks.png")
+normalization_name, "_", batchcorrection_tag, "_meansd_ranks.png")
png(output_file_name)
vsn::meanSdPlot(exp_data)
dev.off()
output_file_name <- paste0(output_data_subdir, dataset_name, "_normalized_",
-normalization$name, "_", batchcorrection_tag, "_meansd_vals.png")
+normalization_name, "_", batchcorrection_tag, "_meansd_vals.png")
png(output_file_name)
vsn::meanSdPlot(exp_data, ranks = FALSE)
dev.off()
@@ -83,7 +83,7 @@ run_vsn <- function(output_data_subdir, dataset_name, normalization_name, batchc
rm(exp_data_majperc)
# We sometimes need to make several attempts since the GC-RMA data can lead to errors
-# because they might contain many ties.
+# because they might contain many ties (quantiles are used internally).
exp_data_vsn <- NULL
for (k in 1:5) {
exp_data_vsn <- tryCatch({
@@ -106,16 +106,16 @@ run_vsn <- function(output_data_subdir, dataset_name, normalization_name, batchc
dim(noise) <- c(dim(exp_data)[1], dim(exp_data)[2])
exp_data <- exp_data + noise
}
-} # End of loop on k of 10 iterations.
+} # End of loop on k of 5 iterations.
# We plot the mean vs sdev after variance stabilization.
output_file_name <- paste0(output_data_subdir, dataset_name, "_normalized_",
-normalization$name, "_", batchcorrection_tag, "_meansd_ranks_vsn.png")
+normalization_name, "_", batchcorrection_tag, "_meansd_ranks_vsn.png")
png(output_file_name)
vsn::meanSdPlot(exp_data_vsn)
dev.off()
output_file_name <- paste0(output_data_subdir, dataset_name, "_normalized_",
-normalization$name, "_", batchcorrection_tag, "_meansd_vals_vsn.png")
+normalization_name, "_", batchcorrection_tag, "_meansd_vals_vsn.png")
png(output_file_name)
vsn::meanSdPlot(exp_data_vsn, ranks = FALSE)
dev.off()
@@ -129,7 +129,7 @@ run_vsn <- function(output_data_subdir, dataset_name, normalization_name, batchc
# Scatter plot of the processed data versus the scaled vsn data.
output_file_name <- paste0(output_data_subdir, dataset_name, "_normalized_",
-normalization$name, "_", batchcorrection_tag, "_data_vs_vsn.png")
+normalization_name, "_", batchcorrection_tag, "_data_vs_vsn.png")
png(output_file_name)
plot(exp_data, exp_data_vsn_scl)
dev.off()
@@ -137,7 +137,7 @@ run_vsn <- function(output_data_subdir, dataset_name, normalization_name, batchc
# We save the data.
output_file_name <- paste0(output_data_subdir, dataset_name, "_normalized_",
-normalization$name, "_", batchcorrection_tag, "_vsn.tsv")
+normalization_name, "_", batchcorrection_tag, "_vsn.tsv")
utils::write.table(exp_data_vsn_scl, file = output_file_name, sep = "\t", quote = FALSE)
rm(scaling_factor, exp_data_min, exp_data_max)
rm(exp_data_vsn, exp_data_vsn_scl, output_file_name)
@@ -4,9 +4,9 @@
#SBATCH --mail-user=leon-charles.tranchevent@uni.lu
#SBATCH -N 1
#SBATCH -n 20
-#SBATCH --time=0-01:00:00
+#SBATCH --time=0-01:10:00
#SBATCH -p batch
-#SBATCH --qos=qos-batch
+#SBATCH --qos=normal
echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
@@ -19,8 +19,7 @@ OUTPUT_FOLDER=/home/users/ltranchevent/Data/GeneDER/Analysis/02/
CODE_FOLDER=/home/users/ltranchevent/Projects/GeneDER/Analysis/02-Preprocessing/
# Loading modules.
mu
-module load lang/R/3.6.0-foss-2018a-X11-20180131-bare
+module load lang/R/3.6.0-foss-2019a-bare
# Load configuration
source ../libs/conf/confSH.sh
......
#!/usr/bin/env Rscript
# ================================================================================================
# Libraries
# ================================================================================================
library("yaml")
library("ArrayUtils")
library("Biobase")
library("vsn")
library("affy")
library("preprocessCore")
source("../libs/conf/confR.R")
message(paste0("[", Sys.time(), "] Libraries loaded."))
# ================================================================================================
# Configuration
# ================================================================================================
options(bitmapType = "cairo")
config <- read_config(config_dirs = c("../Confs/", "./"))
raw_data_dir <- config$global_raw_data_dir
output_data_dir <- paste0(config$global_data_dir, config$local_data_dir)
# Optional command line parameter to process only one dataset.
args <- commandArgs(trailingOnly = TRUE)
selected_dataset_name <- ""
if (length(args) > 0) {
selected_dataset_name <- args[1]
}
rm(args)
message(paste0("[", Sys.time(), "] Configuration done."))
# ================================================================================================
# Main
# ================================================================================================
# We do all datasets one by one.
for (i in seq_len(length(config$datasets))) {
# We get the dataset details.
dataset <- config$datasets[[i]]
dataset_name <- dataset$dataset_name
# We preprocess the current dataset (if necessary).
if (selected_dataset_name == "" || selected_dataset_name == dataset_name) {
# If we do not have an Affymetrix dataset, we skip.
# For exploration purposes only, we may include ways to analyse Agilent and
# Illumina datasets should we decide to use vsn on raw data.
if (dataset$platform != "Affymetrix") {
next;
}
# We define the I/Os.
raw_data_subdir <- paste0(raw_data_dir, dataset_name, "/", "RAW/")
output_data_subdir <- paste0(output_data_dir, dataset_name, "/")
# We have no batch, so only one run.
message(paste0("[", Sys.time(), "] [", dataset_name,
"] Starting analyzing raw data."))
# We read the expression data.
raw_file_list <- affy::list.celfiles(raw_data_subdir, full.names = TRUE)
batch <- affy::ReadAffy(filenames = raw_file_list)
raw_exprs <- exprs(batch)
rm(raw_file_list, batch)
# We plot the mean vs sdev before variance stabilization.
output_file_name <- paste0(output_data_subdir, dataset_name, "_raw_meansd_ranks.png")
png(output_file_name)
vsn::meanSdPlot(log2(raw_exprs))
dev.off()
output_file_name <- paste0(output_data_subdir, dataset_name, "_raw_meansd_vals.png")
png(output_file_name)
vsn::meanSdPlot(log2(raw_exprs), ranks = FALSE)
dev.off()
rm(output_file_name)
# We run vsn on the raw data.
raw_exprs_vsn <- vsn::vsn2(raw_exprs, calib = "affine")
# We plot the mean vs sdev after variance stabilization.
output_file_name <- paste0(output_data_subdir, dataset_name, "_raw_meansd_ranks_vsn.png")
png(output_file_name)
vsn::meanSdPlot(raw_exprs_vsn)
dev.off()
output_file_name <- paste0(output_data_subdir, dataset_name, "_raw_meansd_vals_vsn.png")
png(output_file_name)
vsn::meanSdPlot(raw_exprs_vsn, ranks = FALSE)
dev.off()
rm(output_file_name)
rm(raw_exprs, raw_exprs_vsn)
message(paste0("[", Sys.time(), "] [", dataset_name, "] Ended."))
rm(raw_data_subdir, output_data_subdir)
} # End if dataset should be analysed.
rm(dataset, dataset_name)
} # End for each dataset.
# We log the session details for reproducibility.
rm(i, selected_dataset_name, raw_data_dir, output_data_dir, config)
sessionInfo()
# Objectives
-The objectives of this step is to predict the gender /age of the patients whose gender / age is not indicated in the clinical annotations.
+The objectives of this step are to predict the gender / age of the patients whose gender / age is not indicated in the clinical annotations.
# Details and instructions
-All datasets are used regardless of whether there exists samples with missing clinical annotations. This is motivated by the fact that we also want to etimate the overall accuracy of the predictions. The Makefile contains the commands to get the data from Biomart and then make the predictions. Plots are made and whether the predicted data should be used is left to the user to decide (manually).
+All datasets are used regardless of whether there exist samples with missing clinical annotations. This is motivated by the fact that we also want to estimate the overall accuracy of the predictions. The Makefile contains the commands to get the data from Biomart and then make the predictions. Plots are made, and whether the predicted data should be used is left to the user to decide (manually).
```
make clean_outputs
make predict
```
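The Biomart retrieval itself is automated by the Makefile; a hedged sketch of such a query is shown below (dataset and attribute names are assumptions, not the pipeline's exact call):
```
# Hedged sketch: fetch chromosome annotations from Biomart, e.g. to identify
# sex-linked genes that can support the gender predictions.
library(biomaRt)

mart <- biomaRt::useMart("ensembl", dataset = "hsapiens_gene_ensembl")
annot <- biomaRt::getBM(attributes = c("ensembl_gene_id", "hgnc_symbol",
                                       "chromosome_name"),
                        filters = "chromosome_name",
                        values = c("X", "Y"),
                        mart = mart)
head(annot)
```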
@@ -13,4 +14,4 @@ make doc
```
# Prerequisites
-The prerequisites are to have the raw data (mostly from GEO) in the Data folder and the pre-processed data from step 02. There should be one folder per dataset, with a '/RAW/' folder with the raw data as CEL files and a "ClinicalData.tsv" file for the annotations (including of course gender and age).
+The prerequisites are to have the raw data (mostly from GEO) in the Data folder and the pre-processed data from step 02. There should be one folder per dataset, with a '/RAW/' folder with the raw data as CEL files (array data) or a TSV file with the pre-processed data (RNA-seq). In addition, a "ClinicalData.tsv" file should contain the clinical annotations (including, of course, gender and age).
\ No newline at end of file
@@ -6,7 +6,7 @@
#SBATCH -n 1
#SBATCH --time=0-00:01:00
#SBATCH -p batch
-#SBATCH --qos=qos-batch
+#SBATCH --qos=normal
echo "== Starting run at $(date)"
echo "== Job ID: ${SLURM_JOBID}"
@@ -2,7 +2,7 @@ local_input_data_dir: !!str '02/'
local_data_dir: !!str '03/'
local_code_dir: !!str '03-Predict_missing/'
-# Paramters for the age predicton models.
+# Parameters for the age prediction models.
age_model_feat_min_cor: 0.65
age_model_feat_min_nb: 250
age_model_feat_max_nb: 1000
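For context, the three 'age_model_feat_*' parameters above plausibly drive a feature selection like the sketch below, mirroring the estimate_importance() helper shown further down (function and variable names are illustrative, not the pipeline's actual code):
```
# Hedged sketch: select age-model features using the configured thresholds.
select_age_features <- function(exprs_mat, age,
                                min_cor = 0.65, min_nb = 250, max_nb = 1000) {
  # Importance = maximum absolute Pearson/Spearman correlation with age.
  imp <- apply(exprs_mat, 1, function(x) {
    max(abs(cor(x, age, method = "pearson")),
        abs(cor(x, age, method = "spearman")))
  })
  ranked <- names(sort(imp, decreasing = TRUE))
  # Keep probes above the correlation cutoff, bounded by min/max counts.
  nb <- min(max(sum(imp >= min_cor), min_nb), max_nb)
  ranked[seq_len(nb)]
}
```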
......
@@ -32,14 +32,15 @@ message(paste0("[", Sys.time(), "] Configuration done."))
# Functions
# ================================================================================================
#' @title Function that estimates the importance of a feature.
#'
#' @description This function computes two correlations (Pearson and Spearman) between a feature
-#' vector and a response vector. It then return the maximum absolute value among the two
-#' (as to take into account both correlation and anti-correlation).
-#'
+#' vector and a response vector. It then returns the maximum absolute value of the two
+#' (so as to take into account both correlation and anti-correlation, as well as linear and
+#' non-linear correlations).
+#'
#' @param x The feature vector.
#' @param resp The response vector.
#' @return The maximum absolute correlation between x and resp.
estimate_importance <- function(x, resp) {
pcor <- cor(x, resp, method = "pearson")
scor <- cor(x, resp, method = "spearman")
@@ -66,8 +67,9 @@ for (i in seq_len(length(config$datasets))) {
# We load the data (clinical).
pheno_data <- Biobase::pData(ArrayUtils::load_clinical_data(raw_data_subdir))
pheno_data <- pheno_data %>% dplyr::mutate(cel_name = row.names(pheno_data)) %>%
dplyr::mutate(Age = as.numeric(as.character(Age)))