Commit ae29e367 authored by Leon-Charles Tranchevent

Updated the cleaning procedure to include the meta-datasets.

parent 1d2dcc12
......@@ -106,16 +106,16 @@ for (i in 1:length(local_config$datasets)) {
# By default, we have the disease status, but we can also have the gender.
dataset_groups <- c("Disease.status")
if (dataset_config[[3]] == "DG") {
if (dataset_config[3] == "DG") {
dataset_groups <- c("Disease.status", "Gender")
}
# We run the quality control of the current dataset (if necessary).
if (selected_dataset_name == "" || selected_dataset_name == dataset_config[[1]]) {
run_quality_control(dataset_config[[1]],
if (selected_dataset_name == "" || selected_dataset_name == dataset_config[1]) {
run_quality_control(dataset_config[1],
raw_data_dir,
output_data_dir,
dataset_config[[2]],
dataset_config[2],
dataset_groups)
}
}
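
The hunk above replaces `[[ ]]` with `[ ]` when indexing `dataset_config`. The two operators behave the same on an atomic character vector but differently on a list, so the change is only safe if each dataset entry is loaded as a character vector. Whether the configuration loader used here does so is an assumption, not something the diff shows. A minimal base R sketch of the difference, using made-up entries:

```r
# Assumption: one dataset entry, first as a character vector, then as a list.
config_vec <- c("GSE8397", "GSM208634.cel.gz", "")
config_list <- list("GSE8397", "GSM208634.cel.gz", "")

# On an atomic vector, `[` and `[[` both return a length-one character value.
vec_single <- config_vec[1]
vec_double <- config_vec[[1]]

# On a list, `[` keeps the list wrapper, whereas `[[` extracts the element.
list_single <- config_list[1]
list_double <- config_list[[1]]

print(identical(vec_single, vec_double))
print(is.list(list_single))
print(is.character(list_double))
```

If the entries were lists, a comparison such as `selected_dataset_name == dataset_config[1]` would still work through coercion, but functions expecting a plain string could receive a one-element list instead.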
......
......@@ -16,4 +16,5 @@ make run_meta
```
# Prerequisites
The only prerequisite is to have the raw data (mostly from GEO) in the Data folder. There should be one folder per dataset, with a '/RAW/' folder containing the raw data as CEL files. In particular, it is not necessary to run the quality control before running the preprocessing since arrays are preprocessed independently (unless one expects many problematic arrays).
For the generic datasets, the only prerequisite is to have the raw data (mostly from GEO) in the Data folder. There should be one folder per dataset, with a '/RAW/' folder containing the raw data as CEL files. In particular, it is not necessary to run the quality control before running the preprocessing since arrays are preprocessed independently (unless one expects many problematic arrays).
For the meta-datasets, it is additionally necessary to have analysed the individual datasets first, so that the bad arrays (QC-I, SCAN-no-converg) can be removed.
......@@ -339,18 +339,18 @@ for (i in 1:length(local_config$datasets)) {
dataset_config <- local_config$datasets[[i]]
# We define the dataset specific I/Os.
dataset_raw_data_dir <- paste0(raw_data_dir, dataset_config[[1]], "/", "RAW/")
dataset_raw_data_dir <- paste0(raw_data_dir, dataset_config[1], "/", "RAW/")
# We run the quality control of the current dataset (if necessary).
if (selected_dataset_name == "" || selected_dataset_name == dataset_config[[1]]) {
if (selected_dataset_name == "" || selected_dataset_name == dataset_config[1]) {
# We load the data (clinical, X- and Y-chromosome probes, SCAN and RMA data).
pheno_data <- get_clinical_data(dataset_config[[1]], raw_data_dir)
c(array_xprobes, array_yprobes) %<-% get_xy_probes(dataset_config[[3]], output_data_dir)
c(exprs_scan_mat, exprs_scan_eset) %<-% get_scan_data(dataset_config[[1]], input_data_dir)
c(exprs_rma_mat, exprs_rma_eset) %<-% get_rma_data(dataset_config[[1]],
pheno_data <- get_clinical_data(dataset_config[1], raw_data_dir)
c(array_xprobes, array_yprobes) %<-% get_xy_probes(dataset_config[3], output_data_dir)
c(exprs_scan_mat, exprs_scan_eset) %<-% get_scan_data(dataset_config[1], input_data_dir)
c(exprs_rma_mat, exprs_rma_eset) %<-% get_rma_data(dataset_config[1],
dataset_raw_data_dir,
compressed = dataset_config[[2]])
compressed = dataset_config[2])
# Subset the expression matrices with the X- and Y-chromosome probes.
exprs_scan_mat_xonly <- exprs_scan_mat[rownames(array_xprobes), ]
......
# Objectives
The objective of this step is to clean the datasets, i.e., to remove the arrays that were flagged during the previous steps due to various errors (referred to as QC-I, SCAN-no-converg, QC-II or CLIN-no-gender).
This produces clean datasets that can be used for further analyses.
# Details and instructions
The 9 datasets are processed one by one to remove the bad arrays and update the clinical files accordingly. The clinical data is also updated to include gender predictions (from the local configuration file).
```
make cleanO
make run
```
The meta-datasets are already half-clean (because bad arrays can influence batch correction), so we perform the other half of the cleaning here (QC-II or CLIN-no-gender). Please note that the 'Batch.tsv' file is not updated since it is not used after preprocessing with SCAN.
# Prerequisites
The prerequisites are to have the SCAN data for all datasets and meta-datasets (Step 02) as well as the predicted gender information (Step 03).
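
The cleaning described above amounts to dropping the flagged arrays from the expression data and keeping the clinical table in sync. The following is a minimal base R sketch of that idea, not the repository's actual implementation. The matrix, the clinical table and the array names are made up, and the comma-separated list of bad arrays is assumed to follow the layout of the second field in the configuration file.

```r
# Toy expression matrix: probes in rows, arrays in columns (made-up values).
exprs_mat <- matrix(
  1:12, nrow = 3,
  dimnames = list(c("p1", "p2", "p3"),
                  c("a1.CEL", "a2.CEL", "a3.CEL", "a4.CEL"))
)

# Toy clinical table with one row per array (made-up values).
pheno <- data.frame(
  Disease.status = c("PD", "Control", "PD", "Control"),
  row.names = colnames(exprs_mat),
  stringsAsFactors = FALSE
)

# Assumption: bad arrays are given as one comma-separated string.
bad_field <- "a2.CEL,a4.CEL"
bad_arrays <- strsplit(bad_field, ",", fixed = TRUE)[[1]]

# Keep only the arrays that were not flagged, in both objects.
keep <- !(colnames(exprs_mat) %in% bad_arrays)
clean_mat <- exprs_mat[, keep, drop = FALSE]
clean_pheno <- pheno[keep, , drop = FALSE]

print(colnames(clean_mat))
print(rownames(clean_pheno))
```

Using `drop = FALSE` keeps the matrix and data frame structure even if only one array survives the filtering.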
......@@ -114,7 +114,7 @@ for (i in 1:length(local_config$datasets)) {
dataset_config <- local_config$datasets[[i]]
# We run the quality control of the current dataset (if necessary).
if (selected_dataset_name == "" || selected_dataset_name == dataset_config[[1]]) {
if (selected_dataset_name == "" || selected_dataset_name == dataset_config[1]) {
# We load the data (clinical and preprocessed data).
pheno_data <- Biobase::pData(get_clinical_data(dataset_config[1], raw_data_dir))
......
......@@ -11,3 +11,5 @@ datasets:
- ['GSE8397', 'GSM208634.cel.gz,GSM208637.cel.gz,GSM208638.cel.gz,GSM208654.cel.gz', '']
- ['Simunovic', 'park_1143_01.cel,park_1148_01.cel', '']
- ['E-MEXP-1416', '', '']
- ['HG-U133A', 'GSM506020.CEL.gz', 'GSM506004.CEL;Gender;M,GSM506010.CEL;Gender;F,GSM506011.CEL;Gender;M,GSM506012.CEL;Gender;F']
- ['HG-U133_Plus_2', 'GSM503950_C_1074p_SNc.CEL.gz,GSM503951_C_1271p_SNc.CEL.gz,GSM503956_C_3603_SNc.CEL.gz', '']
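
The third field of the 'HG-U133A' entry above appears to hold comma-separated `array;field;value` triples, such as predicted gender per array. That reading is an assumption based on the entry itself. Under that assumption, a minimal base R sketch for turning such a string into a table could look like this, where `parse_annotations` is a hypothetical helper and not a function from the repository:

```r
# Hypothetical helper: parse "array;field;value" triples separated by commas.
parse_annotations <- function(field) {
  if (is.na(field) || field == "") {
    # An empty field means there is nothing to update for this dataset.
    return(data.frame(array = character(0), field = character(0),
                      value = character(0), stringsAsFactors = FALSE))
  }
  entries <- strsplit(field, ",", fixed = TRUE)[[1]]
  parts <- strsplit(entries, ";", fixed = TRUE)
  data.frame(
    array = vapply(parts, function(p) p[1], character(1)),
    field = vapply(parts, function(p) p[2], character(1)),
    value = vapply(parts, function(p) p[3], character(1)),
    stringsAsFactors = FALSE
  )
}

# Usage with two of the triples from the configuration entry.
annotations <- parse_annotations("GSM506004.CEL;Gender;M,GSM506010.CEL;Gender;F")
print(annotations)
```

Returning an empty data frame for an empty field lets callers loop over all datasets without special-casing those that carry no annotations.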