For every assembly, phylogenetic marker genes were called from the gene predictions using fetchMG.pl (http://vm-lux.embl.de/~mende/fetchMG/cite.html).
```
perl fetchMG.pl -o marker_genes -t 4 -d contig.Prodigal.fna -m extraction contig.Prodigal.faa
```
The mOTU software comes with a collection of phylogenetic marker genes padded with their genomic context ( _mOTU.v1.padded_ ). To annotate the phylogenetic marker genes found in our assembly, we aligned our genes using BLAST with the genes in this collection belonging to mOTUs that had been detected in the sample of interest in the read-level analysis (standard mOTU workflow), took the taxonomic annotation of the best hit and assigned it to our gene. We call this the best present hit. In a second, alternative workflow, we took the taxonomic annotation of the best hit overall, without the limitation to the previously detected mOTUs. Here is the detailed description of how this was done.
Besides the marker gene collection, the mOTU software uses a mapping file _mOTU.v1.padded.motu.map_ which contains the gene Sequence ID, the length???, the COG ID and an internal mOTU ID. There is also the mapping file _mOTU.v1.map.txt_ containing the Sequence ID, the COG ID, the internal marker mOTU ID, and the mOTU linkage Group Name. In addition, _mOTU.v1.padded.motu.linkage.map_ contains the internal marker mOTU ID and the taxonomy on the following levels: Domain, Phylum, Class, Order, Family, Genus, and Species, and, if applicable, the Species cluster, the Species cluster annotation and/or the mOTU species annotation (only the last column contains both Cluster and motu_linkage_group annotations).
We set up a directory for each COG present in the mapping files. We then put the Sequence IDs from the mapping file _mOTU.v1.map.txt_ into the directory of the COG they belong to.
```
for cog in `cut -f 3 mOTU.v1.padded.motu.map | sort | uniq`; do mkdir $cog;done
for cog in `cut -f 3 mOTU.v1.padded.motu.map | sort | uniq`; do grep $cog mOTU.v1.map.txt | cut -f 1 > $cog/$cog.mOTU.v1.map.txt ;done
```
There is also a file with the coordinates of the marker genes within the padded gene collection called _mOTU.v1.padded.coord_. From this, we extracted the coordinates for each gene and placed them in a file in the respective COG directory. The coordinates were then used to extract the gene sequences (without the padding) from the main database and place a fasta file with all sequences belonging to one COG in the respective directory.
```
for cog in `cut -f 3 mOTU.v1.padded.motu.map | sort | uniq`; do for gene in $(cat $cog/$cog.mOTU.v1.map.txt); do grep $gene -w mOTU.v1.padded.coord >> $cog/$cog.mOTU.v1.padded.coord ;done; done
for cog in `cut -f 3 mOTU.v1.padded.motu.map | sort | uniq`; do cd $cog/; perl fastaExtractWithCoordBase1.pl ../mOTU.v1.padded $cog.mOTU.v1.padded.coord >> $cog.mOTU.v1.padded; cd ../; done
```
We also set up a directory for every sample, into which we copied the nucleotide sequences of the phylogenetic marker genes called on our assemblies (one file per COG).
```
for cog in `cut -f 3 mOTU.v1.padded.motu.map | sort | uniq`; do cp marker_genes/$cog.fna . ; done
```
This could already be used to find the best hit out of all mOTU marker genes, but we wanted to make use of the read-level analysis and limit the search only to those mOTUs that had been detected there. Therefore, we went into the results of the mOTU analysis, took the table _mOTU.counts.gz_, unpacked it, and extracted the mOTUs that were found. As the mOTU analysis had been done for metagenomic and metatranscriptomic reads, we combined both lists.
```
ln -s RESULTS/mOTU.counts.gz metaG.mOTU.counts.gz
gzip -d -c metaG.mOTU.counts.gz > metaG.mOTU.counts
grep motus.processing -v metaG.mOTU.counts | cut -f 1 > mOTUs.present
ln -s RESULTS/mOTU.counts.gz metaT.mOTU.counts.gz
gzip -d -c metaT.mOTU.counts.gz > metaT.mOTU.counts
grep motus.processing -v metaT.mOTU.counts | cut -f 1 >> mOTUs.present
cat mOTUs.present | sort | uniq > mOTUs.present.all
```
With the help of the R script `getValidGenes1.R`, we then searched the mapping files _mOTU.v1.padded.motu.linkage.map_ and _mOTU.v1.map.txt_ for the genes belonging to the mOTUs found in the sample. The genes were then extracted from the marker gene collection in COG-wise fashion into a new fasta file.
```
for cog in `cut -f 3 ../../mOTU.v1.padded.motu.map | sort | uniq`
do
Rscript ../../getValidGenes1.R $cog
perl testFastaExtract.pl ../../$cog/$cog.mOTU.v1.padded $cog.present.mOTU.genenames.txt > $cog.present.genes.fna
done
```
> (The leading `../../` in this example comes from the directory structure, with a directory per sample (ordered by family). It appears here because it is also used in the R script.)
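For orientation, here is a minimal sketch of what `getValidGenes1.R` might look like for one COG; how the detected mOTU identifiers are matched against the linkage map is an assumption here, so consult the original script for the authoritative logic.
```
# hedged sketch only -- the matching of detected mOTUs to internal IDs is assumed
cog <- commandArgs(trailingOnly = TRUE)[1]

present <- readLines("mOTUs.present.all")                                # mOTUs detected at read level
linkage <- read.delim("../../mOTU.v1.padded.motu.linkage.map",
                      header = FALSE, stringsAsFactors = FALSE)          # internal mOTU ID + taxonomy
map <- read.delim("../../mOTU.v1.map.txt", header = FALSE,
                  stringsAsFactors = FALSE,
                  col.names = c("seqID", "cog", "mOTU", "linkageGroup")) # gene-to-mOTU mapping

# internal marker mOTU IDs whose annotation matches one of the detected mOTUs
presentIDs <- linkage[[1]][apply(linkage, 1, function(x) any(x %in% present))]

# sequence IDs of this COG that belong to a detected mOTU
genes <- map$seqID[map$cog == cog & map$mOTU %in% presentIDs]
writeLines(as.character(genes), paste0(cog, ".present.mOTU.genenames.txt"))
```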
We then made a separate BLAST database from the restricted gene set for each COG and ran blastn for every gene of our assembly annotated with this COG. Already at this step, we limited the results to the best hit.
```
for cog in `cut -f 3 mOTU.v1.padded.motu.map | sort | uniq`
do
makeblastdb -in $cog.present.genes.fna -dbtype nucl -parse_seqids
blastn -db $cog.present.genes.fna -query $cog.fna -outfmt 7 -out $cog.bestPresentHits.tsv -max_target_seqs 1
done
```
The annotations of the best hit on the different taxonomic levels are then taken from _mOTU-LG.v1.annotations.txt_, _mOTU.v1.padded.motu.linkage.map_ and _mOTU.v1.map.txt_ using the R script `getValidGenes2new.R`.
In the alternative workflow, which finds the best hit out of the whole marker gene collection, we just built a blast database with all marker genes (per COG).
```
for cog in `cut -f 3 mOTU.v1.padded.motu.map | sort | uniq`
do
cd $cog/
makeblastdb -in $cog.mOTU.v1.padded -dbtype nucl -parse_seqids
cd ../
done
```
We then performed a blastn search for each sample using the predicted genes with this COG annotation from our assemblies. The taxonomy of this hit was again filtered from _mOTU-LG.v1.annotations.txt_, _mOTU.v1.padded.motu.linkage.map_ and _mOTU.v1.map.txt_ , this time using the script `getHitPhylogenyNew.R`.
```
for cog in `cut -f 3 ../../mOTU.v1.padded.motu.map | sort | uniq`
do
blastn -db ../../$cog/$cog.mOTU.v1.padded -query $cog.fna -outfmt 7 -out $cog.bestHits.tsv -max_target_seqs 1
Rscript ../../getHitPhylogenyNew.R $cog
done
```
Finally, we gathered all annotations of marker genes within the binned population-level genomes. We found that in most cases, annotation was unanimous.
One of the challenges in the MuSt was to link genes with functions of interest to their genomic contexts. One option to achieve this is reference-independent binning. During his time in the Eco-Systems Biology group at LCSB, Cédric Laczny (https://github.com/claczny?tab=repositories) developed VizBin (http://claczny.github.io/VizBin/), which uses Barnes-Hut Stochastic Neighbour Embedding of high-dimensional kmer frequency data to build 2D maps of contigs. Within VizBin, clusters within these maps can be picked manually quite nicely. However, in the MuSt we wanted to pick as many clusters as possible from each sample and we had 36 samples with multi-omic data (and 53 with metagenomic assemblies), so manual picking was really not an option. In addition, we wanted the cluster picking to be informed by metagenomic coverage and the presence of single-copy essential genes, which was not possible with VizBin back then (it is now!).
The essential single-copy genes compiled by Dupont et al., 2011 (http://www.nature.com/ismej/journal/v6/n6/full/ismej2011189a.html) were called using the HMMs (https://github.com/MadsAlbertsen/multi-metagenome/blob/master/R.data.generation/essential.hmm) provided by Mads Albertsen (http://madsalbertsen.github.io/multi-metagenome/), via a wrapper script around HMMER (http://hmmer.janelia.org/).
```
hmmsearch --tblout hmm.orfs.hits --cut_tc --notextw essential.hmm contig.Prodigal.faa > hmmer.out
```
(If this does not work for you, it may be because you have spaces in the fasta headers of your gene predictions. You can do `cut -f1 -d" " contig.Prodigal.faa > contig.Prodigal.renamed.faa` to help yourself before running HMMER on the renamed-file).
There is one further trick we played here: Cédric had noticed quite early on in working with VizBin and its predecessors (http://www.nature.com/articles/srep04516) that 16S and 23S rRNAs are so strongly conserved that they do not cluster with their genomes, unless the flanking regions are long enough to significantly impact the genomic signature of the contig. We therefore removed 16S and 23S genes as far as possible without cutting contigs to lengths below 1000 nt (which we need for a good signature). This cutting is done using .gff outputs from Barrnap (http://www.vicbioinformatics.com/software.barrnap.shtml) and a Perl script (which uses BioPerl (http://bioperl.org/); we therefore first format the contigs using the FASTX-Toolkit's fasta_formatter (http://hannonlab.cshl.edu/fastx_toolkit/) to avoid lines longer than 2^15 characters). The resulting fasta file includes all contigs with a length of at least 1000 nt. Names are conserved.
```
fasta_formatter -i contigs.fa -o contigs.formatted.fa -w 80
fastaExtractCutRibosomal1000.pl contigs.formatted.fa rRNAgenes.euk.gff rRNAgenes.arc.gff rRNAgenes.bac.gff rRNAgenes.mito.gff contigs.1000.rRNAcut.fa rRNAcutting.log
```
Then, we ran the kmer frequency analysis on the cut contigs and the dimension reduction as implemented in VizBin non-interactively. As the coordinates of the 2D maps are returned in order, but without a name, a file with the contig names is also generated.
```
bash runBHSNE.sh contigs.1000.rRNAcut.fa .
grep ">" contigs.1000.rRNAcut.fa > contigs.1000.rRNAcut.names.txt
```
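The two outputs can be joined in R, for example like this (a minimal sketch; the name of the coordinate file written by `runBHSNE.sh`, here assumed to be `points.txt`, depends on the wrapper):
```
# combine the BH-SNE coordinates (returned in input order) with the contig names
coords  <- read.table("points.txt", col.names = c("x", "y"))           # assumed output name of runBHSNE.sh
contigs <- sub("^>", "", readLines("contigs.1000.rRNAcut.names.txt"))  # strip the fasta ">" prefix
bhsne   <- data.frame(contig = contigs, coords, stringsAsFactors = FALSE)
```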
We then used DBSCAN (http://www.dbs.ifi.lmu.de/Publikationen/Papers/KDD-96.final.frame.pdf), as implemented in R (https://cran.r-project.org/web/packages/fpc/index.html), to bin clusters of contigs. Afterwards, we analyze the number and uniqueness of single-copy essential genes within the bins. If essential genes are present (more or less) in single copies, the clusters are accepted. If there are duplicates of the essential genes, we use the [average metagenomic coverage depth](calculating-coverage.md) of the contigs to separate the contigs into sub-bins with unimodal coverage distributions. The script uses Hartigan's dip test, as implemented in the R-package diptest (https://cran.r-project.org/web/packages/diptest/index.html), to test for unimodality and a normal mixture model from the R-package mixtools (https://cran.r-project.org/web/packages/mixtools/index.html) to find a cut-off between groups of contigs with unimodal, (on a log scale) normally distributed coverage. Sub-bins are re-analyzed for uniqueness of the single-copy essential genes and the whole process is repeated three times with increasing stringency applied to the reachable points. All of these steps are performed by the script `autoCluster.R`.
```
Rscript autoCluster.R
```
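To illustrate the coverage-based splitting step, here is a minimal sketch of the dip test plus mixture-model idea; `binCov` (the average metagenomic coverage depths of the contigs in one bin) is a hypothetical input, and `autoCluster.R` remains the authoritative implementation.
```
library(diptest)   # Hartigan's dip test for unimodality
library(mixtools)  # normal mixture models

logCov <- log10(binCov)  # binCov: hypothetical vector of per-contig average coverage depths

# a small dip-test p-value suggests that the coverage distribution is not unimodal
if (dip.test(logCov)$p.value < 0.05) {
  # fit a two-component normal mixture and take the midpoint between the
  # component means as a simple cut-off between the two coverage groups
  fit    <- normalmixEM(logCov, k = 2)
  cutoff <- mean(fit$mu)
  subBin <- ifelse(logCov < cutoff, "sub-bin 1", "sub-bin 2")
}
```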
This script uses an R workspace which we created previously and which contains all the information used during the binning. `makeWSvarAnnoCorrect.R` documents how this workspace was built. I don't recommend running this script, as it is full of paths to files which will only be found on our system.
The outputs of `autoCluster.R` include several files which document the cut-offs and reachability estimates used in the different iterations. Most importantly, a table with the cluster/bin membership of every contig is produced ( _contigs2clusters.tsv_ ). In addition, a 2D BH-SNE map with the final bins is created and for the relatively well recovered genomes (bins with >67% of the single-copy essential genes) bed files for the contigs and genes are created, as well as a visual evaluation of the length, coverage, taxonomy and functional content of the genomes.
The autoCluster method is being developed further. A version to be used with the output of IMP, the integrated metagenomic and metatranscriptomic assembly and analysis pipeline developed by Shaman Narayanasamy and Yohan Jarosz, is available.
In the MuSt, we used metagenomic/metatranscriptomic coverage for two main purposes: In the context of [binning contigs](automatic-clustering.md) into population-level genomes, we used average depth of metagenomic coverage of contigs. In the context of expression analysis, we used total numbers of fragments mapping to genes with functional annotations.
Both measures were inferred from mapping reads to genes or contigs.
We mapped (filtered, trimmed) metagenomic reads to contigs using Bowtie2 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml).
```
bowtie2-build contigs.fa contigs
bowtie2 -x contigs -1 reads.screened.hg19.pair.1.fq -2 reads.screened.hg19.pair.2.fq -U reads.screened.hg19.single.fq -S DNAonContigs.sam -p 6 &>DNAonContigs.log
```
Similarly, metagenomic reads can be mapped to genes.
```
bowtie2-build genes.rRNA.fa genes.rRNA
bowtie2 -x genes.rRNA -1 reads.screened.hg19.pair.1.fq -2 reads.screened.hg19.pair.2.fq -U reads.screened.hg19.single.fq -S DNAonGenesrRNA.sam -p 6 &>DNAonGenesrRNA.log
```
As we had stranded metatranscriptomic libraries, the mapping of the RNA reads looked a little different. To get reads mapping to the genes in sense (assuming we use the index created above):
```
bowtie2 -x genes.rRNA --fr --nofw -1 reads.screened.hg19.pair.1.fq -2 reads.screened.hg19.pair.2.fq -U reads.screened.hg19.single.fq -S RNAonGenesrRNA.sam -p 6 &>RNAonGenesrRNA.log
```
Finding all reads that map antisense to genes could be done like this:
```
bowtie2 -x genes.rRNA --fr --norc -1 reads.screened.hg19.pair.1.fq -2 reads.screened.hg19.pair.2.fq -U reads.screened.hg19.single.fq -S revcompRNAonGenesrRNA.sam -p 6 &>revcompRNAonGenesrRNA.log
```
Now, to get the average depth of coverage (in this case of the contigs with metagenomic reads):
```
samtools view -bS DNAonContigs.sam > DNAonContigs.bam
samtools sort DNAonContigs.bam DNAonContigs.sorted
samtools depth DNAonContigs.sorted.bam > DNAonContigs.depth.txt
calculateCoverageAndGaps2.pl contigs.fa DNAonContigs.depth.txt > DNAonContigs.cov.tsv
```
The last line uses a custom script which determines the length of each contig and uses it to calculate the average depth of coverage from the output of samtools depth, because the latter omits positions that are not covered at all.
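A minimal sketch of the same calculation in R (the Perl script above is what we actually used; Biostrings is only used here to obtain the contig lengths, and fasta headers are assumed to contain only the contig names):
```
library(Biostrings)  # to read contig lengths from the fasta file

# per-position depths from 'samtools depth'; positions with zero coverage are absent
depth <- read.table("DNAonContigs.depth.txt", col.names = c("contig", "pos", "depth"))

# full contig lengths, so uncovered positions still count towards the average
contigs   <- readDNAStringSet("contigs.fa")
contigLen <- setNames(width(contigs), names(contigs))

# average depth of coverage = sum of per-position depths / total contig length
sumDepth <- tapply(depth$depth, depth$contig, sum)
aveCov   <- sumDepth / contigLen[names(sumDepth)]
```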
To get the number of mapping (e.g. metatranscriptomic) fragments for each gene, we went to the Patrick May School of Wizardry and One-liners (http://wwwen.uni.lu/lcsb/people/patrick_may):
```
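# strip the SAM header, keep read name + reference (gene), de-duplicate so each
# fragment is counted once per gene, count fragments per gene, and reformat the
# "count gene" output of uniq -c into "gene<TAB>count":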
grep -v -P "^\@" RNAonGenesrRNA.sam | cut -f 1,3 | sort | uniq | cut -f 2 | sort | uniq -c | perl -ane '$_=~/^\s+(\d+) (.+)$/;chomp($2); print "$2\t$1\n"; ' >RNAonGenesrRNA.mapped.tsv
```
This table can then be used to sum up the fragments mapping to genes with certain [functional annotations](functional-annotations.md) which can be used for differential analysis.
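A minimal sketch of this summing step in R (the two-column layout assumed for the gene-to-KO table is an assumption; within the MuSt, the parsing scripts described under [functional annotations](functional-annotations.md) were used instead):
```
# sum metatranscriptomic fragments per KO
annot  <- read.delim("geneAnnotations.tsv", header = FALSE, col.names = c("gene", "KO"))  # hypothetical gene-to-KO table
mapped <- read.delim("RNAonGenesrRNA.mapped.tsv", header = FALSE, col.names = c("gene", "fragments"))
perKO  <- aggregate(fragments ~ KO, data = merge(annot, mapped, by = "gene"), FUN = sum)
```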
In addition to the described purposes, the metagenomic or metatranscriptomic coverage of reads mapping to marker genes was used to infer community structures. For this purpose we applied the mOTU workflow (http://www.bork.embl.de/software/mOTU/), which forms part of MOCAT (http://vm-lux.embl.de/~kultima/MOCAT/index.html). Average depths of metagenomic or metatranscriptomic coverage were also used for general assessment of metagenomic and metatranscriptomic representations of genes, functions or contig bins within a sample. In very few instances, we compared these between samples after normalization to the total number of mapping nucleotides (sum of mapping lengths of all reads).
To annotate the predictions of protein coding genes with functions, we used different sets of HMM databases (for KEGG KOs, enzymes from MetaCyc and Uniprot, PFAM families, and TIGRFAM families). The databases for Pfam were downloaded from (INSERT LINK) and for TIGRFAM from (INSERT LINK). The other databases were compiled by ???
HMMER 3.1 (http://hmmer.janelia.org/) was run on each database to find hits ( _contig.Prodigal.faa_ is the file with amino acid sequences of the protein coding gene predictions).
```
hmmsearch --cpu 12 --noali --notextw --tblout contig.Prodigal.faa.kegg.hmmscan KO.hmm contig.Prodigal.faa >/dev/null
hmmsearch --cpu 12 --noali --notextw --tblout contig.Prodigal.faa.metacyc.hmmscan metacyc.hmm contig.Prodigal.faa >/dev/null
hmmsearch --cpu 12 --noali --notextw --tblout contig.Prodigal.faa.Pfam-A.hmmscan Pfam-A.hmm contig.Prodigal.faa >/dev/null
hmmsearch --cpu 12 --noali --notextw --tblout contig.Prodigal.faa.tigrpfam.hmmscan tigrpfam.hmm contig.Prodigal.faa >/dev/null
hmmsearch --cpu 12 --noali --notextw --tblout contig.Prodigal.faa.swissprot.hmmscan swissprot.hmm contig.Prodigal.faa >/dev/null
```
The output from HMMER was parsed using a script (written by Venkata?) which returns all hits in a better, tab-separated format.
```
perl consolidate_hmmscan_results.pl contig.Prodigal.faa contig.Prodigal.faa.kegg.hmmscan contig.Prodigal.faa.metacyc.hmmscan contig.Prodigal.faa.Pfam-A.hmmscan contig.Prodigal.faa.swissprot.hmmscan contig.Prodigal.faa.tigrpfam.hmmscan
```
There is a second version of this script which does exactly the same, but only for the KO annotations (`consolidate_hmmscan_results_justKEGG.pl`).
The formatted output is then parsed once more to retain only the best annotation out of all databases, or to get the top or top x percent of best hits from any database. To get the best annotation per gene, we can use `150310_MUST_hmmBestAll.py`.
```
python 150310_MUST_hmmBestAll.py contig.Prodigal.faakegg.tsv contig.Prodigal.faametacyc.tsv contig.Prodigal.faaswissprot.tsv contig.Prodigal.faapfam.tsv contig.Prodigal.faatigrpfam.tsv -g $(grep ">" contig.Prodigal.faa | wc -l)
```
Alternatively, or additionally (in the case of MuSt), we can get the best annotation per database:
```
python 150705_MUST_hmmParsePfam.py contig.Prodigal.faapfam.tsv pfamID -g $(grep ">" contig.Prodigal.faa | wc -l) -k
python 150705_MUST_hmmParse.py contig.Prodigal.faametacyc.tsv metaCycID -g $(grep ">" contig.Prodigal.faa | wc -l)
python 150705_MUST_hmmParse.py contig.Prodigal.faaswissprot.tsv swissprotEC -g $(grep ">" contig.Prodigal.faa | wc -l)
python 150705_MUST_hmmParse.py contig.Prodigal.faatigrpfam.tsv tigrID -g $(grep ">" contig.Prodigal.faa | wc -l)
python 150705_MUST_hmmParse.py contig.Prodigal.faakegg.tsv KO -g $(grep ">" contig.Prodigal.faa | wc -l)
```
As a last alternative, we can annotate genes with the nodes of a [reconstructed metabolic network](reconstructed-KO-network.md) based on the KO annotations. The script for this uses a table (provided here in `150705_KOs_in_NW.tsv`) with the link between nodes and KOs (which node contains which KOs), which is built by the network reconstruction script.
```
python 150705_MUST_keggParseNW.py contig.Prodigal.faakegg.tsv -g $(grep ">" contig.Prodigal.faa | wc -l)
```
As you can see, I always use the -g option of these scripts, which sets a sample-specific threshold for annotations to be accepted. In accordance with the HMMER manual, this is log2 of the number of predicted genes.
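As a worked example of that threshold (how the scripts apply it internally is not shown here):
```
nGenes    <- length(grep("^>", readLines("contig.Prodigal.faa")))  # number of predicted genes
threshold <- log2(nGenes)                                          # e.g. log2(100000) is roughly 16.6 bits
```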
Now we have the annotations for each gene, but we might also like to know how many reads (representing how many fragments) map to the genes with a functional annotation in order to do some expression analysis. This can be done using a very similar family of scripts, which also take a gene-wise read [coverage](calculating-coverage.md) table as input. In the MuSt, the gene expression analysis was based on the best annotation out of all databases:
```
python 150322_bestHmmReadParse.py contig.Prodigal.faakegg.tsv contig.Prodigal.faametacyc.tsv contig.Prodigal.faaswissprot.tsv contig.Prodigal.faapfam.tsv contig.Prodigal.faatigrpfam.tsv DNAonGenesrRNA.cov.tsv -g $(grep ">" contig.Prodigal.faa | wc -l)
python 150322_bestHmmReadParse.py contig.Prodigal.faakegg.tsv contig.Prodigal.faametacyc.tsv contig.Prodigal.faaswissprot.tsv contig.Prodigal.faapfam.tsv contig.Prodigal.faatigrpfam.tsv RNAonGenesrRNA.cov.tsv -g $(grep ">" contig.Prodigal.faa | wc -l)
```
There is also a version of this script which calculates the average coverage depth of each function from all genes annotated with this function (`150415_bestHmmAveCovParse.py`). The results of this were not used within the MuSt.
```
#python 150415_bestHmmAveCovParse.py contig.Prodigal.faakegg.tsv contig.Prodigal.faametacyc.tsv contig.Prodigal.faaswissprot.tsv contig.Prodigal.faapfam.tsv contig.Prodigal.faatigrpfam.tsv DNAonGenesrRNA.cov.tsv -g $(grep ">" contig.Prodigal.faa | wc -l)
#python 150415_bestHmmAveCovParse.py contig.Prodigal.faakegg.tsv contig.Prodigal.faametacyc.tsv contig.Prodigal.faaswissprot.tsv contig.Prodigal.faapfam.tsv contig.Prodigal.faatigrpfam.tsv RNAonGenesrRNA.cov.tsv -g $(grep ">" contig.Prodigal.faa | wc -l)
```
Alternatively, only single databases can be used. In the case of MuSt, we used the KO database.
```
python 150630_keggReadParse.py contig.Prodigal.faakegg.tsv DNAonGenesrRNA.mapped.tsv -g $(grep ">" contig.Prodigal.faa | wc -l) -p 100.0
python 150630_keggReadParse.py contig.Prodigal.faakegg.tsv RNAonGenesrRNA.mapped.tsv -g $(grep ">" contig.Prodigal.faa | wc -l) -p 100.0
```
For the faecal samples of the MuSt that were analyzed on the metagenomic, metatranscriptomic and metaproteomic levels, we assembled a total of more than 26 million contigs containing approximately 30 million genes. For each of these contigs, we had interesting data, such as length, metagenomic coverage, position and coverage of variants, membership in a bin, position on a BH-SNE map, putative taxonomy and of course the genes they harbour. For the genes, we had similar data, such as the position and sense on the contig, coverage by metagenomic and metatranscriptomic reads, detection of the protein product, functional annotation(s), essentiality, and taxonomic annotations.
My favourite way of visualizing and statistically analyzing data is by using R. But even sample-wise datasets are too large for R to store, search and filter comfortably. Of course, there are some summarizing data sets, such as the abundance of every mOTU in every sample (matrix of 500 x 36) or the expression of every KO in every sample (matrix of a few thousand x 36), which can easily be worked with in R. But to find, for example, the taxonomy of every gene annotated as K00001 in every sample, along with its expression on the metatranscriptomic and metaproteomic level, would be very difficult (or rather: time consuming) using pure R. In addition, someone else might want to use a different tool to analyze the data.
I finally decided to build a MongoDB database (https://www.mongodb.org/), mainly because the contigs and genes naturally form a nested structure that is well representable by Mongo documents and because I hoped that Mongo would scale well for adaptation to a larger project with a biologically meaningful sample size. I was also intrigued by the fact that it has 2D indexing which can be used for searching, but I have actually never gotten around to implementing the BH-SNE maps as an interface to the database. MongoDBs can be accessed with a large number of tools or programming languages (I have only used R and python here) and it is nice and fast.
A python script (final version is `150928_mongifyMust.py`) was used to fill the database with most of the metaG, metaT and metaP data created within the MuSt, with the exception of the actual sequences. This script is made for the MuSt with its particular system of sorting data by families and then samples. It consists mostly of paths to the files that were created one by one without the use of the database. Therefore, reusing this script in any other environment is not recommended. The resulting structure with the names of the fields is displayed here for your reference if you want to use some of the code to access it from R:
[database structure](http://git-r3lab.uni.lu/anna.buschart/MuStMultiomics/blob/master/figS11_DB.png)
The script `151020_funOIMongoWS.R` exemplifies how the database can be accessed from R to retrieve expression data for genes with a function of interest and the contig bins these genes belong to. To just give the most important points, here is what always needs to be done to interact with a MongoDB using the rmongodb-package (https://cran.r-project.org/web/packages/rmongodb/index.html).
```
library(rmongodb)
mongo <- mongo.create()
db <- "mydb"
coll <- "mydb.must"
# access the database here using the custom function getExprTab and the function of interest:
exTab <- getExprTab(funOI)
mongo.destroy(mongo)
```
The function `getExprTab()` then fetches the data into two lists via two aggregation pipelines:
```
unwindGenes <- mongo.bson.from.JSON('{"$unwind": "$genes"}')
matchAnno <- mongo.bson.from.JSON(paste('{"$match": {"genes.bestAnnotation": "',funOI,'"} }',sep=""))
matchProt <- mongo.bson.from.JSON(paste('{"$match": {"genes.proteinIdentification": "uniquely"} }',sep=""))
groupDNARNAname <- mongo.bson.from.JSON('{"$group": {"_id": "$_id", "aveCovDNA": {"$push": "$genes.aveCovDNA"},"sample":{"$push":"$sample"},"gene":{"$push":"$genes.gene"},"cluster":{"$push": "$cluster"},"aveCovRNA":{"$push": "$genes.aveCovRNAfw"}}}')
projectDNARNAname <- mongo.bson.from.JSON('{"$project": {"_id": 0, "gene":1,"aveCovDNA": 1, "cluster" :1,"aveCovRNA":1,"sample":1}}')
groupProtname <- mongo.bson.from.JSON('{"$group": {"_id": "$_id", "gene": {"$push": "$genes.gene"},"sample":{"$push":"$sample"},"proteinIdentification":{"$push":"$genes.proteinIdentification"},"proteinArea":{"$push":"$genes.proteinArea"}}}')
projectProtname <- mongo.bson.from.JSON('{"$project": {"_id": 0, "gene":1,"proteinIdentification":1,"proteinArea":1,"sample":1}}')
genesA <- mongo.bson.to.list(mongo.aggregation(mongo,coll,list(matchAnno,unwindGenes,matchAnno,groupDNARNAname,projectDNARNAname)))$result
genesP <- mongo.bson.to.list(mongo.aggregation(mongo,coll,list(matchAnno,unwindGenes,matchAnno,matchProt,groupProtname,projectProtname)))$result
```
As not all genes are found on the protein level, the proteins have their own pipeline.
The list with the data that is present for all genes is transformed into a data.frame like this:
```
res <- do.call(rbind,lapply(genesA,function(x) cbind(x$sample,x$gene,x$cluster,x$aveCovDNA,x$aveCovRNA)))
res <- data.frame("sample"=res[,1],"gene"=res[,2],"cluster"=res[,3],"aveCovDNA"=as.numeric(res[,4]),"aveCovRNA"=as.numeric(res[,5]),stringsAsFactors=F)
```
If there is data for the protein level, this is merged to the data.frame, otherwise 0 is returned for the protein abundance:
```
if(length(genesP)>0){
pFeat <- do.call(rbind,lapply(genesP,function(x) cbind(x$sample,x$gene, x$proteinArea)))
pFeat[sapply(pFeat[,3],length)>1,3] <- sapply(pFeat[sapply(pFeat[,3],length)>1,3], mean)
pFeat <- data.frame("sample"=unlist(pFeat[,1]),"gene"=unlist(pFeat[,2]),"proteinArea"=as.numeric(unlist(pFeat[,3])),stringsAsFactors=F)
aFeat <- merge(res,pFeat,by=c(1,2),all.x=T)
}else{
aFeat <- res
aFeat$proteinArea <- 0
}
aFeat[is.na(aFeat)] <- 0
return(aFeat)
```
Another example is given in `150928_MUST_relatedClusterWSFromMongo.R`. Here, a function is used which returns all genes of one bin ("cluster", clusOI) of contigs in a sample (sampleID) with their functions and levels on the different omic levels. This again uses aggregation pipelines with the following steps:
```
matchFun <- mongo.bson.from.JSON(paste('{"$match": {"genes.bestAnnotation": {"$exists": "true"}} }',sep=""))
matchCluster <- mongo.bson.from.JSON(paste('{"$match": {"cluster": "',clusOI,'"} }',sep=""))
matchSample <- mongo.bson.from.JSON(paste('{"$match": {"sample": "',sampleID,'"} }',sep=""))
matchProt <- mongo.bson.from.JSON(paste('{"$match": {"genes.proteinIdentification": "uniquely"} }',sep=""))
unwindGenes <- mongo.bson.from.JSON('{"$unwind": "$genes"}')
unwindProteins <- mongo.bson.from.JSON('{"$unwind": "$genes.proteinIdentification"}')
groupGeneA <- mongo.bson.from.JSON('{"$group": {"_id": "$_id", "gene": {"$push": "$genes.gene"}, "aveCovDNA":{"$push":"$genes.aveCovDNA"}, "aveCovRNA":{"$push":"$genes.aveCovRNAfw"}, "readsRNA":{"$push":"$genes.readsRNAfw"},"fun":{"$push":"$genes.bestAnnotation"}}}')
groupGeneProt <- mongo.bson.from.JSON('{"$group": {"_id": "$_id", "gene": {"$push": "$genes.gene"}, "proteinIdentification":{"$push":"$genes.proteinIdentification"}, "proteinArea":{"$push":"$genes.proteinArea"}}}')
projectGeneA <- mongo.bson.from.JSON('{"$project": {"_id": 0, "fun":1, "gene":1, "aveCovDNA":1,"aveCovRNA":1,"readsRNA":1}}')
projectGeneProt <- mongo.bson.from.JSON('{"$project": {"_id": 0, "gene":1,"proteinIdentification":1,"proteinArea":1}}')
```
The results are fed into lists.
```
genesA <- mongo.bson.to.list(mongo.aggregation(mongo,coll,list(matchCluster,matchSample,unwindGenes,matchFun,groupGeneA,projectGeneA)))$result
genesP <- mongo.bson.to.list(mongo.aggregation(mongo,coll,list(matchCluster,matchSample,unwindGenes,matchFun,matchProt,groupGeneProt,projectGeneProt)))$result
```
These are again converted to data.frames using `do.call`, `rbind` and `lapply`:
```
tFeat <- do.call(rbind,lapply(genesA,function(x) cbind(x$gene,x$fun,x$aveCovDNA,x$aveCovRNA,x$readsRNA)))
tFeat[sapply(tFeat[,2],length)>1,2] <- sapply(tFeat[sapply(tFeat[,2],length)>1,2],function(x) paste(x,sep=";",collapse=";"))
tFeat <- data.frame("gene"=unlist(tFeat[,1]),"bestAnno"=unlist(tFeat[,2]),"aveCovDNA"=as.numeric(unlist(tFeat[,3])),"aveCovRNA"=as.numeric(unlist(tFeat[,4])),"readsRNA"=as.numeric(unlist(tFeat[,5])),stringsAsFactors=F)
```
The scripts `virusGenesMongo.R` and `eukaryoticGenesMongo.R` exemplify how functional annotations and coverage data for genes can be accessed by the gene ID and the sample ID.
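The core of such a look-up is a simple query by sample and gene ID, roughly like this (the field names follow the database structure linked above; the sample and gene identifiers are placeholders):
```
library(rmongodb)
mongo <- mongo.create()
coll <- "mydb.must"

# find the contig document containing a given gene in a given sample
query <- mongo.bson.from.JSON('{"sample": "SAMPLE_ID", "genes.gene": "GENE_ID"}')
doc <- mongo.find.one(mongo, coll, query)
if (!is.null(doc)) str(mongo.bson.to.list(doc))  # inspect annotation and coverage fields

mongo.destroy(mongo)
```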
Similarly, the contribution of all binned genome-level populations to the abundance of nodes in a reconstructed network can be retrieved using the script `getNWexprMongoAllSamples.R`.
For the MuSt multi-omics analysis, we built sample-specific search databases. To do this, we used the genome sequences of the probands who donated the faecal samples to personalize the human proteins in the search databases. In addition, we used the genes predicted from the assemblies of metagenomic and metatranscriptomic reads of each sample. However, assemblies don't reflect the strain-level variation encountered in a microbial community but rather a consensus. Therefore we also called variants. First, metagenomic and metatranscriptomic reads were mapped to the contigs (see [calculating coverage](calculating-coverage.md)). The names of the resulting .bam-files were saved in `bams.DNARNA.txt`. We then used Platypus (http://www.well.ox.ac.uk/platypus; version 0.7.9.1) to call variants:
```
Platypus.py callVariants --refFile=contigs.fa --bamFiles=bams.DNARNA.txt --nCPU=12 -o DNARNAonContigs.vcf
```
bgzip and tabix (https://github.com/samtools/htslib) were used to compress the .vcf-files.
```
bgzip DNARNAonContigs.vcf
tabix DNARNAonContigs.vcf.gz
```
vcftools (https://vcftools.github.io/man_latest.html) was used to filter the variants:
```
vcftools --gzvcf DNARNAonContigs.vcf.gz --remove-filtered GOF --remove-filtered QD --remove-filtered hp10 --remove-filtered Q20 --remove-filtered MQ --remove-filtered SC --remove-filtered QualDepth --remove-filtered badReads --recode --stdout | bgzip > DNARNAonContigsFiltered.vcf.gz
bgzip -d DNARNAonContigsFiltered.vcf.gz
```
For the next step, we changed some code written by Nic Pinel (http://www.eafit.edu.co/docentes-investigadores/Paginas/nicolas-pinel.aspx) for his project with Emilie Muller (http://www.nature.com/ncomms/2014/141126/ncomms6603/full/ncomms6603.html). Besides the .vcf, this script uses the .tab output of Prodigal (https://github.com/hyattpd/prodigal/releases/), which had been used to call the genes, and the assembled contigs (which were formatted into blocks of width 80 using the fasta_formatter of the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/)).
```
perl variant_annotateRepairedTab.pl -v DNARNAonContigsFiltered.vcf -a gene.tab -s contigs.formatted.fa -p
```
This script outputs two fasta files with the nucleotide and amino acid sequences of all predicted proteins, including the original predictions and the variants, as well as a .tab file with the variant gene positions in the Prodigal style. The resulting .tab-file was concatenated with the original tab file. This was used to remove incomplete protein fragments which would not form tryptic peptides.
```
cat gene.tab contigs.formatted.variants.tab >> genes.variants.tab
perl trypsinStartEnd.pl contigs.formatted.variants.faa genes.variants.tab >> genes.variants.endsRemoved.faa
```
Before the microbial protein predictions were concatenated with the human proteins, they were renamed to have a unique ID starting with sp|1xxxxxxx|.
```
perl /home/users/aheintzbuschart/myScripts/rename4proteomics.pl genes.variants.endsRemoved.faa 1 >> genes.variants.endsRemoved.Renamed.faa
```
In the MuSt, we used a previously established method (http://www.nature.com/articles/npjbiofilms20157) to reconstruct a community-wide metabolic network from KEGG orthologous groups (KOs) by forming edges between KO nodes if the KOs share reactants (a minimal sketch of this edge-forming step follows below). The reconstruction essentially uses the same script as in the previous project (INSERT LINK), with only one significant change in the combination of KOs into nodes. (In both projects, KOs of enzymes which have exactly the same reactants are combined into one node. The rationale behind this is that KOs with the exact same reactants are often subunits of the same enzyme. Within the concept of the network reconstructions and the topological analyses, these cases are better represented by single nodes than by tight clusters of KOs.) In the MuSt workflow, the KOs are only combined into nodes if they have adjacent KO numbers. This is due to the fact that KOs forming the subunits of one enzyme usually have adjacent numbers, while KOs with the exact same reactants but non-adjacent numbers are often iso-enzymes or even enzymes catalyzing inverse reactions. To use the script `140630_MUST_NW.R`, several tables have to be downloaded from http://rest.kegg.jp:
http://rest.kegg.jp/link/Ko/pathway
http://rest.kegg.jp/link/Ko/rn
http://rest.kegg.jp/link/rn/rp
http://rest.kegg.jp/link/rp/rc/
http://rest.kegg.jp/link/rc/rn
http://rest.kegg.jp/list/rp
http://rest.kegg.jp/list/cpd
We also use the file ko2des_clean.txt which contains the descriptions for every KO in the HMM database.
The script outputs the reconstructed network, as well as a file with all KOs and nodes in the network. This can be used to find the expression levels of genes [functionally annotated](functional-annotations.md) with these KOs and to restrict the network to only those nodes represented by (expressed) genes.
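To illustrate only the edge-forming step (the node-combination rules live in `140630_MUST_NW.R`), here is a minimal sketch with igraph, assuming the KEGG link files above were saved as `ko_rn.txt` and `rn_rp.txt`:
```
library(igraph)

ko2rn <- read.delim("ko_rn.txt", header = FALSE, col.names = c("ko", "rn"))  # KO -> reaction
rn2rp <- read.delim("rn_rp.txt", header = FALSE, col.names = c("rn", "rp"))  # reaction -> reactant pair

# map each KO to the reactant pairs of its reactions
ko2rp <- unique(merge(ko2rn, rn2rp, by = "rn")[, c("ko", "rp")])

# connect two KOs whenever they share at least one reactant pair
pairs <- merge(ko2rp, ko2rp, by = "rp")
edges <- unique(pairs[pairs$ko.x < pairs$ko.y, c("ko.x", "ko.y")])

nw <- graph_from_data_frame(edges, directed = FALSE)
```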
Based on expression levels, sub-modules with significant differences under different conditions (such as samples from different individuals or families, dependent on disease etc.) can be described. We used the R-package BioNet (https://bioconductor.org/packages/release/bioc/html/BioNet.html) for this. BioNet can be applied using the program Heinz (https://github.com/ls-cwi/heinz) or an R-implementation (FastHeinz). Functions to either use the pure-R version of BioNet with output from DESeq2 (https://bioconductor.org/packages/release/bioc/html/DESeq2.html) or to call Heinz on the command line from R (using `runHeinz.sh`) can be found in `140630_MUST_NW.R`.
Functions supplied with BioNet were then used to make a graphical representation of the highest-scoring module.
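A minimal sketch of the pure-R (FastHeinz) route, assuming `nw` is the reconstructed network as an igraph object and `pvals` is a named vector of per-node p-values taken from the DESeq2 results:
```
library(BioNet)

fb     <- fitBumModel(pvals, plot = FALSE)   # beta-uniform mixture model of the p-values
scores <- scoreNodes(nw, fb, fdr = 0.001)    # node scores at the chosen FDR
module <- runFastHeinz(nw, scores)           # highest-scoring connected sub-module
plotModule(module, scores = scores)          # simple graphical representation of the module
```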
To make a fancier plot, which also displays the relative gene/transcript/protein copy numbers linked to the binned population-level genome reconstructions, further functions can be found in `plotModules_omicLevels.R`. These functions, in turn, use data retrieved from a [mongo database](mongo-database.md). These plots are very rich in specific information, so the script should not be expected to work without this information.
MG-RAST (http://metagenomics.anl.gov/) can be used in many ways to analyze and annotate metagenomic (or metatranscriptomic) data. Independent of the kind of sequences that were uploaded, annotations are achieved using sequence alignment via BLAST. Annotations can be accessed using the MG-RAST API (http://api.metagenomics.anl.gov/api.html) or the "Download" page.
Within the MuSt project, we analyzed both functional and taxonomic annotations. However, we found that functional annotation can be improved by [using HMMs](functional-annotations.md). In addition, we used the genomic context of the contigs and the binning of the contigs to establish the taxonomy of prokaryotic (mostly bacterial) genes. Therefore, the taxonomic assignment of potential eukaryotic and viral genes was the main application of MG-RAST within the MuSt.
So we uploaded the gene predictions (protein coding and bacterial rRNAs). If you want to check out our data, you can find the accession numbers in MGRASTaccessions.tsv. We then downloaded the RefSeq-based organism annotations via the Download-page. The downloaded tables were mined for eukaryotic genes (except for remaining human sequences) and viral genes (except the phiX174 genome, which is spiked into illumina sequencing libraries) using the R-script `MGRASTgeneLevelTax.R`. As one gene can be aligned to genes of several taxa, we kept the assignment with the highest bitscore for each gene. In addition, we kept all assignments with a score that was at least 80 % of the highest bitscore for this gene. To find the lowest common ancestor (LCA) and the species, genus, family, order, class, phylum and kingdom level taxonomy of each gene, the script calls the script `parse_taxbrowser_MGRAST.py`. This script relies heavily on a parser written by Romain Studer (http://evosite3d.blogspot.de/2013/06/browsing-ncbi-taxonomy-with-python.html) and the NCBI taxonomy. The output of the R-script is a table for each sample with the eukaryotic genes and a table with the viral genes (and some temporary files). Each table contains the gene name, the LCA, and the names at species, genus, family, order, class, phylum and kingdom level.
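The bitscore filter at the heart of `MGRASTgeneLevelTax.R` boils down to something like this (the column names of the downloaded organism table are assumptions):
```
# keep, for every gene, the best hit plus all hits with at least 80 % of the best bitscore
hits <- read.delim("mgrast_organism_annotations.tsv", stringsAsFactors = FALSE)
bestPerGene <- tapply(hits$bitscore, hits$gene, max)
kept <- hits[hits$bitscore >= 0.8 * bestPerGene[hits$gene], ]
```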
All additional information about these genes, such as coverage or detection at the protein level, were retrieved from the [Mongo database](mongo-database.md).