The objectives of this step are to analyse the quality of the raw data of the project and to create reports that can be visually inspected to decide which samples should be kept and which could (or should) be discarded.
# Details and instructions
Affymetrix datasets are analysed using R and the 'arrayQualityMetrics' package, which produces HTML reports with figures and descriptions. Agilent datasets are analysed using home-made scripts, which create figures but do not summarize the QC in a single document. Illumina datasets are not quality-controlled so far, since we do not have the complete raw data. Similarly, the RNA-seq datasets are not quality-controlled.
The Makefile contains the commands to launch all jobs on the cluster.
```
make clean_outputs
make run_qc
```
# Prerequisites
Since this is the first step of the analysis, the only prerequisite is to have the raw data (mostly from GEO) in the Data folder. There should be one folder per dataset, containing a TSV file with the clinical data ('ClinicalData.tsv') and a '/RAW/' folder with the raw data (which should not be compressed, unless it is a GEO series matrix file).
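The layout described above can be verified before launching the jobs. The following is a minimal sketch in Python, not part of the pipeline: the function name `check_dataset_layout` is hypothetical, and the check only assumes what is stated above (one folder per dataset, a 'ClinicalData.tsv' file and a non-empty 'RAW' folder). It does not check whether the raw files are compressed.

```python
from pathlib import Path


def check_dataset_layout(data_dir):
    """Return a dict mapping each dataset folder name to a list of problems.

    Datasets that match the expected layout do not appear in the result.
    """
    problems = {}
    for dataset in sorted(Path(data_dir).iterdir()):
        if not dataset.is_dir():
            continue  # ignore stray files at the top level of the Data folder
        issues = []
        if not (dataset / "ClinicalData.tsv").is_file():
            issues.append("missing ClinicalData.tsv")
        raw = dataset / "RAW"
        if not raw.is_dir():
            issues.append("missing RAW folder")
        elif not any(raw.iterdir()):
            issues.append("RAW folder is empty")
        if issues:
            problems[dataset.name] = issues
    return problems
```

Calling `check_dataset_layout("Data")` from the project root (assuming the folder is indeed named `Data`) returns an empty dict when every dataset folder is complete.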
The objective of this step is to integrate the results of the differential expression analysis across several datasets in order to identify similarities and overlaps, and to identify robust DEGs.
# Details and instructions
The datasets are first summarized at the gene level (the limma analyses are performed at the probe level). Conflicts and non-unique mappings are handled to create a unique list of DEGs (still one list per dataset).
```
make clean_outputs
make summarize
```
The results are lists of DEGs (instead of differentially expressed probes), with NA for the genes that are not present in a given dataset.
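To illustrate the idea, here is a minimal Python sketch of a probe-to-gene summarization followed by an alignment across datasets. It is not the code run by `make summarize`: the function names are hypothetical, and the conflict rule is an assumption made for the example (probes mapping to zero or several genes are dropped, and each gene keeps its probe with the smallest adjusted p-value). The actual pipeline may resolve conflicts differently.

```python
def collapse_probes_to_genes(probe_rows):
    """Collapse probe-level results to one record per gene.

    probe_rows: iterable of (probe_id, gene_symbols, log_fc, adj_p), where
    gene_symbols is the list of genes the probe maps to.
    Assumed rule (not taken from the pipeline): probes mapping to zero or
    several genes are dropped, and each gene keeps its most significant probe.
    """
    best = {}
    for probe_id, genes, log_fc, adj_p in probe_rows:
        if len(genes) != 1:
            continue  # unmapped probe or non-unique mapping
        gene = genes[0]
        if gene not in best or adj_p < best[gene]["adj_p"]:
            best[gene] = {"probe": probe_id, "log_fc": log_fc, "adj_p": adj_p}
    return best


def align_datasets(per_dataset):
    """Build a gene-by-dataset table of log fold changes.

    per_dataset: dict mapping a dataset name to the output of
    collapse_probes_to_genes. None plays the role of NA for genes that
    are absent from a dataset.
    """
    all_genes = sorted(set().union(*(genes.keys() for genes in per_dataset.values())))
    table = {}
    for gene in all_genes:
        table[gene] = {
            name: (genes[gene]["log_fc"] if gene in genes else None)
            for name, genes in per_dataset.items()
        }
    return table
```

Keeping an explicit missing value, rather than dropping the gene, makes it possible to distinguish "not measured in this dataset" from "measured but not differentially expressed" in later steps.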
The integration itself is then computed, and the results are analysed and checked.
```
make integrate
make analyse
make check
```
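As an illustration of what "overlap" between datasets can mean, the following Python sketch counts in how many datasets each gene is reported as a DEG and keeps the genes found in at least a given number of datasets. This simple vote-counting rule is an assumption chosen for the example and is not necessarily the method implemented by `make integrate`; the function names and the `min_datasets` threshold are hypothetical.

```python
from collections import Counter


def count_overlaps(deg_sets):
    """Count, for each gene, the number of datasets in which it is a DEG.

    deg_sets: dict mapping a dataset name to a collection of gene symbols.
    """
    counts = Counter()
    for genes in deg_sets.values():
        counts.update(set(genes))  # set() guards against duplicated symbols
    return counts


def robust_degs(deg_sets, min_datasets=2):
    """Return the sorted genes that are DEGs in at least min_datasets datasets."""
    counts = count_overlaps(deg_sets)
    return sorted(gene for gene, n in counts.items() if n >= min_datasets)
```

A rule of this kind ignores the direction and size of the effect; a fuller treatment would also require the sign of the fold change to agree across datasets.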
The final gene expression matrices are then created, so that the data can be shared outside of the project.
```
make gexpr
```
A document that contains many but not all figures can then be generated.
```
make doc
```
# Prerequisites
A prerequisite is to have the results of the limma analysis for all datasets (Step 05).