


To learn more about this project, read the wiki.


This repository contains scripts and accompanying explanations for the analysis of metagenomic data from the infant microbiome during the first week of life.
The parts of the workflow are listed below. The link in each sub-heading leads to a description of that part and of how its scripts are connected. The scripts themselves are linked from the entries listed under each part.

Part 1: Read preprocessing, assembly, gene prediction and read mapping using IMP

The first steps of this analysis are performed by the Integrated Meta-omic Pipeline (IMP).

Part 2: Curation of assembled contigs to remove artefactual sequences

In this part of the analysis, the contigs of each sample are pre-binned together with contigs from a contamination control sample, and sample contigs that cluster with the contaminant sequences are removed.

fastaExtractCutRibosomal1000.pl
calculateContigLength.pl
160920_autocluster_contamination_1000.R
fastaExtractCutRibosomalNoCutoff.pl
160920_autocluster_contamination_1.R
fastaExtractrRNA4dom.pl
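
The removal rule described above can be illustrated with a minimal Python sketch. It is not taken from the repository's Perl or R scripts: the function name, the input format (a mapping from contig ID to cluster label) and the example contig IDs are all hypothetical, and the clustering itself is assumed to have been done already.

```python
def remove_contaminant_clusters(cluster_of, control_contigs):
    """Return sample contigs that do not share a cluster with any control contig.

    cluster_of: dict mapping contig ID -> cluster label (hypothetical format)
    control_contigs: set of contig IDs from the contamination control sample
    """
    # A cluster counts as contaminated if it holds at least one control contig.
    contaminated = {cluster_of[c] for c in control_contigs if c in cluster_of}
    return sorted(
        contig
        for contig, cluster in cluster_of.items()
        if contig not in control_contigs and cluster not in contaminated
    )


# Toy example: cluster 2 contains a control contig, so sample_c3 is dropped.
cluster_of = {
    "sample_c1": 1,
    "sample_c2": 1,
    "sample_c3": 2,
    "control_c1": 2,
    "sample_c4": 3,
}
kept = remove_contaminant_clusters(cluster_of, {"control_c1"})
print(kept)
```

In this sketch a single control contig is enough to discard a whole cluster; the actual scripts may apply a different criterion.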


Part 3: Annotation of genes with KEGG orthologous groups (KOs) and counting of reads per KO

Here, hidden Markov models (HMMs) are used to annotate the predicted genes with KOs. The number of reads mapping to each KO in each sample is then counted for later differential analysis.

161114_filter_gff_Pooled_noconta_1.R
hmmscan_addBest2gff.pl
161129_filter_prokka_Pooled_noconta_1.R
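
The counting step can be sketched in Python as a sum of per-gene read counts over the genes assigned to each KO. This is an illustration only, not the logic of the R scripts listed above: the function name, both input dictionaries and the gene IDs are hypothetical, and genes without a KO annotation are assumed to be skipped.

```python
from collections import Counter


def reads_per_ko(gene_to_ko, gene_read_counts):
    """Sum per-gene read counts into per-KO totals for one sample.

    gene_to_ko: dict mapping gene ID -> KO identifier (annotated genes only)
    gene_read_counts: dict mapping gene ID -> number of mapped reads
    """
    totals = Counter()
    for gene, reads in gene_read_counts.items():
        ko = gene_to_ko.get(gene)
        if ko is not None:  # genes without a KO annotation are ignored
            totals[ko] += reads
    return dict(totals)


# Toy example: gene4 has no KO and does not contribute to any total.
gene_to_ko = {"gene1": "K00001", "gene2": "K00001", "gene3": "K00002"}
gene_read_counts = {"gene1": 10, "gene2": 5, "gene3": 7, "gene4": 3}
ko_counts = reads_per_ko(gene_to_ko, gene_read_counts)
print(ko_counts)
```

Running this once per sample would give one count vector per sample, which is the usual input shape for a differential analysis.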


Part 4: Binning of contigs into population-level genomes

Contigs are binned using the algorithm developed for the MuSt study of type 1 diabetes. The scripts have been adapted for use with IMP output and previous curation of contigs.

fastaExtractCutRibosomal1000.pl
fastaExtractCutRibosomalNoCutoff.pl
160921_autoClust_noConta_evil_1000.R
160921_autoClust_noConta_evil_1.R


Part 5: Linking of population-level genomes over different samples using phylogenetic marker genes

In this step, the bins from the individual samples are connected based on the relatedness of the phylogenetic marker genes.

Part 6: Taxonomic annotation of the bins

We use PhyloPhlAn to assign a taxonomy to every bin.

Part 7: Linking strains and population-level genomes over different samples using SNV-patterns

We use StrainPhlAn on the reads to find strains that are shared between several samples and that may be linked to the reconstructed genomes.

Part 8: Analysis of intra- and inter-population variability

In this step, POGENOM is used to examine the intra- and inter-population variability of the population-level genomes shared between samples, based on single nucleotide variants (SNVs).

Coverage_Bins.R
getSize
Breadth.R
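
The script names above suggest that the depth and breadth of coverage of each bin are computed as part of this step. A minimal Python sketch of those two quantities follows, assuming per-position read depths are already available as a list. The function name, the `min_depth` parameter and the example values are hypothetical and do not reflect the contents of the R scripts.

```python
def coverage_stats(depths, min_depth=1):
    """Return (mean depth, breadth) for a list of per-position read depths.

    Breadth is the fraction of positions covered by at least `min_depth` reads.
    """
    if not depths:
        raise ValueError("depths must contain at least one position")
    n_positions = len(depths)
    mean_depth = sum(depths) / n_positions
    breadth = sum(1 for d in depths if d >= min_depth) / n_positions
    return mean_depth, breadth


# Toy example: 5 positions, 2 of them uncovered.
mean_depth, breadth = coverage_stats([0, 0, 4, 6, 10])
print(mean_depth, breadth)
```

Thresholds on both values are a common way to decide whether a genome is covered well enough in a sample for variant analysis; the cutoffs used in this project are not stated here.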


Part 9: mOTU analysis

To obtain a taxonomic overview, the relative abundances of metagenomic operational taxonomic units (mOTUs) are calculated from curated reads.
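
As an illustration of what a relative abundance is, the Python sketch below normalises per-mOTU counts so that they sum to 1 within a sample. It is not the mOTU profiling procedure itself: the function name, the input dictionary and the mOTU labels are hypothetical.

```python
def relative_abundance(counts):
    """Convert per-mOTU counts for one sample into fractions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("total count is zero; relative abundance is undefined")
    return {motu: count / total for motu, count in counts.items()}


# Toy example with three hypothetical mOTUs.
rel = relative_abundance({"mOTU_a": 30, "mOTU_b": 10, "mOTU_c": 60})
print(rel)
```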