to gather contig clusters by [related phylogenetic marker genes in a phylogenetic tree](phylogenetic-marker-genes-trees.md):
* [150819_MUST_tree.R](150819_MUST_tree.R)
* [150816_getPhyloMarkers.R](150816_getPhyloMarkers.R)
to [reconstruct](reconstructed-KO-network.md) a metabolic network from KOs and analyse it:
* [140630_MUST_NW.R](140630_MUST_NW.R)
For the faecal samples of the MuSt that were analyzed on the metagenomic, metatranscriptomic and metaproteomic levels, we assembled a total of more than 26 million contigs containing approximately 30 million genes. For each of these contigs, we had interesting data, such as length, metagenomic coverage, position and coverage of variants, membership in a bin, position on a BH-SNE map, putative taxonomy and of course the genes they harbour. For the genes, we had similar data, such as the position and sense on the contig, coverage by metagenomic and metatranscriptomic reads, detection of the protein product, functional annotation(s), essentiality, and taxonomic annotations.
My favourite way of visualizing and statistically analyzing data is by using [R](https://www.r-project.org/). But even sample-wise datasets are too large for R to store, search and filter comfortably. Of course, there are some summary data sets, such as the abundance of every mOTU in every sample (a 500 x 36 matrix) or the expression of every KO in every sample (a matrix of a few thousand x 36), which can easily be worked with in R. But finding, for example, the taxonomy of every gene annotated as K00001 in every sample, along with its expression on the metatranscriptomic and metaproteomic levels, would be very difficult (or rather: time-consuming) using pure R. In addition, someone else might want to use a different tool to analyze the data.
I finally decided to build a MongoDB database (https://www.mongodb.org/), mainly because the contigs and genes naturally form a nested structure that is well represented by Mongo documents, and because I hoped that MongoDB would scale well for adaptation to a larger project with a biologically meaningful sample size. I was also intrigued by its 2D indexing, which can be used for searching, but I have actually never gotten around to implementing the BH-SNE maps as an interface to the database. MongoDB can be accessed with a large number of tools and programming languages (I have only used R and Python here) and it is nice and fast.
A Python script (final version is [`150928_mongifyMust.py`](150928_mongifyMust.py)) was used to fill the database with most of the metaG, metaT and metaP data that was created within the MuSt, with the exception of the actual sequences. This script is made for the MuSt with its particular system of sorting data by families and then samples, and it consists mostly of paths to files that were created one by one without the use of the database. Therefore, reusing this script in any other environment is not recommended. The resulting structure, with the names of the fields, is displayed here for your reference if you want to use some of the code to access it from R:
[database structure](http://git-r3lab.uni.lu/anna.buschart/MuStMultiomics/blob/master/figS11_DB.png)
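To give a feeling for the nested contig/gene structure, here is a purely illustrative sketch of a single contig document, written as it could be constructed from R with rmongodb. The field names and values are hypothetical placeholders rather than the actual MuSt schema (the real field names are shown in the figure above); only `bestAnnotation` also appears in the query examples further down.
```
library(rmongodb)

# purely illustrative contig document with one nested gene; all field names except
# "bestAnnotation" are placeholders, not the real MuSt schema (see the figure above)
exampleContig <- mongo.bson.from.JSON('{
  "sample": "M01-1",
  "contig": "contig_1234",
  "length": 15342,
  "cluster": "bin_042",
  "genes": [ { "gene": "contig_1234_gene_7",
               "bestAnnotation": ["K00001"],
               "aveCovDNA": 12.3,
               "aveCovRNA": 4.5,
               "readsRNA": 87 } ]
}')
```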
The script [`151020_funOIMongoWS.R`](151020_funOIMongoWS.R) exemplifies how the database can be accessed from R to retrieve expression data for genes with a function of interest and the contig bins these genes belong to. To give just the most important points, here is what always needs to be done to interact with a MongoDB using the rmongodb package (https://cran.r-project.org/web/packages/rmongodb/index.html).
```
library(rmongodb)

# open a connection to the MongoDB instance (default settings, i.e. localhost) and check it
mongo <- mongo.create()
mongo.is.connected(mongo)
```
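Once connected, simple look-ups can already be done with `mongo.find.all()`. The following is only a minimal sketch, not code from the script: the collection name `must.contigs` and the field name `sample` are assumptions, while `genes.bestAnnotation` is the field also used in the aggregation steps below.
```
# minimal sketch: find contigs in one sample that carry a gene annotated with a KO of interest;
# "must.contigs" and "sample" are assumed names, "genes.bestAnnotation" is used in the real pipelines
query  <- mongo.bson.from.JSON('{"sample": "M01-1", "genes.bestAnnotation": "K00001"}')
fields <- mongo.bson.from.JSON('{"genes.$": 1, "cluster": 1}')  # return only the matching gene and its bin
hits   <- mongo.find.all(mongo, "must.contigs", query, fields = fields)
```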
Another example is given in [`150928_MUST_relatedClusterWSFromMongo.R`](150928_MUST_relatedClusterWSFromMongo.R). Here, a function is used which returns all genes of one bin ("cluster", clusOI) of contigs in a sample (sampleID), with their functions and their levels on the different omic levels. This again uses aggregation pipelines with the following steps:
```
matchFun <- mongo.bson.from.JSON('{"$match": {"genes.bestAnnotation": {"$exists": "true"}} }')
# ... (the remaining pipeline stages, running the aggregation and cleaning up the result are
#      not shown here; see the script for the full function)
tFeat <- data.frame("gene"=unlist(tFeat[,1]), "bestAnno"=unlist(tFeat[,2]),
                    "aveCovDNA"=as.numeric(unlist(tFeat[,3])), "aveCovRNA"=as.numeric(unlist(tFeat[,4])),
                    "readsRNA"=as.numeric(unlist(tFeat[,5])), stringsAsFactors=FALSE)
```
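A pipeline built from such stages is then handed to `mongo.aggregation()`. As a rough sketch only (the collection name and the names of the later stages are assumptions; the real stages are defined in the script):
```
# run the aggregation pipeline; unwindGenes, matchCluster and projectFields stand in for the
# further stages built in the script, and "must.contigs" is an assumed collection name
pipeline <- list(matchFun, unwindGenes, matchCluster, projectFields)
res      <- mongo.aggregation(mongo, "must.contigs", pipeline)  # returns a mongo.bson object
resList  <- mongo.bson.to.list(res)                             # convert to an R list before building tFeat
```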
The scripts [`virusGenesMongo.R`](virusGenesMongo.R) and [`eukaryoticGenesMongo.R`](eukaryoticGenesMongo.R) exemplify how functional annotations and coverage data for genes can be accessed by the gene ID and the sample ID.
Similarly, the contribution of all binned genome-level populations to the abundance of nodes in a reconstructed network can be retrieved using the script [`getNWexprMongoAllSamples.R`](getNWexprMongoAllSamples.R).
This workflow makes use of the functional annotation data and the completeness information stored in the [mongo database](mongo-database.md), collects the amino acid sequences of all complete phylogenetic marker gene predictions in all samples and builds a phylogenetic tree based on a multiple sequence alignment for each class of marker genes. The information about membership of the marker genes in a contig [cluster](automatic-clustering.md) (binned reconstructed population-level genome) is retained, so closely related reconstructed genomes from different samples can be identified as such.
In the first step, an R script is run to retrieve all complete genes annotated as one of the mOTU [marker genes](annotate-phylogenetic-marker-genes.md) - COG0012, COG0016, COG0018, COG0172, COG0215, COG0495, COG0525, COG0533, COG0541, COG0552 - or __rpoB__ (TIGR02013) from the [mongo database](mongo-database.md). It returns a table with the gene IDs, the sample IDs and the contig cluster IDs of the complete marker genes for each sample and class of marker gene. (The script is documented [here](150816_getPhyloMarkers.R), but it depends on the specific file structure used in this project, so don't expect it to run anywhere else.)
```
Rscript 150816_getPhyloMarkers.R
```
This will write a tab-separated table for each group of closely related genes. The tables contain the IDs of the samples and of the contig clusters (population-level genomes) which contain a phylogenetic marker gene from the group of related genes. The files are named by a number that derives from the phylogenetic tree and the marker gene.
Another useful package for phylogenetic tree visualization in R is [geiger](http://www.webpages.uidaho.edu/~lukeh/software.html). Using this and ape, we can visualize a tree for each marker gene and colour it according to the sample of origin, or we can extract specific parts of the tree, for example leaves formed by phylogenetic marker genes which belong to contig clusters with weak classification. Both examples are performed in the script [`150819_MUST_tree.R`](150819_MUST_tree.R), which contains some nice functions for visualization. The script is, however, based on the MuSt sample IDs and MuSt file structure and should not be run as is on other data sets.
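As a hedged illustration of the colouring idea (this is not code from `150819_MUST_tree.R`; the tree file name and the tip-label format are assumptions), a tree for one marker gene could be coloured by sample of origin roughly like this:
```
library(ape)

# assumed input: a newick tree for one marker gene whose tip labels start with the sample ID,
# e.g. "M01-1.contig_1234_gene_7" -- both the file name and the label format are hypothetical
tree    <- read.tree("COG0012_markerGenes.treefile")
samples <- sub("\\..*$", "", tree$tip.label)            # sample ID = label up to the first dot
tipCols <- rainbow(length(unique(samples)))
plot(tree, type = "fan", cex = 0.4,
     tip.color = tipCols[as.integer(factor(samples))])  # one colour per sample of origin
```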
In the MuSt, we used a previously established method (http://www.nature.com/articles/npjbiofilms20157) to reconstruct a community-wide metabolic network from KEGG orthologous groups (KOs) by forming edges between KO nodes if the KOs share reactants. The reconstruction essentially uses the same script as in the previous project, with only one significant change in the combination of KOs into nodes. (In both projects, KOs of enzymes which have exactly the same reactants are combined into one node. The rationale behind this is that KOs with the exact same reactants are often subunits of the same enzyme. Within the concept of the network reconstructions and the topological analyses, these cases are better represented by single nodes than by tight clusters of KOs.) In the MuSt workflow, the KOs are only combined into one node if they also have adjacent KO numbers. This is due to the fact that KOs forming the subunits of one enzyme usually have adjacent numbers, whereas KOs with the exact same reactants but non-adjacent numbers are often iso-enzymes or even enzymes catalyzing inverse reactions. To use the script [`140630_MUST_NW.R`](140630_MUST_NW.R), several tables have to be downloaded from http://rest.kegg.jp (a rough sketch of the edge-forming step is given after the list of inputs below):
http://rest.kegg.jp/link/Ko/pathway
http://rest.kegg.jp/link/Ko/rn
http://rest.kegg.jp/link/rn/rp
http://rest.kegg.jp/link/rp/rc/
http://rest.kegg.jp/link/rc/rn
http://rest.kegg.jp/list/rp
http://rest.kegg.jp/list/cpd
We also use the file [ko2des_clean.txt](ko2des_clean.txt) which contains the descriptions for every KO in the HMM database.
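The following is only a rough sketch of the edge-forming idea described above, not the original script: it assumes a two-column table `ko2reactant.tsv` (KO, reactant) that has already been derived from the KEGG tables listed above, and it omits the combination of KOs with identical reactants and adjacent numbers into single nodes.
```
library(igraph)

# assumed input: one row per KO-reactant pair, derived beforehand from the KEGG files listed above
ko2cpd <- read.delim("ko2reactant.tsv", header = FALSE,
                     col.names = c("ko", "cpd"), stringsAsFactors = FALSE)

# every pair of KOs that shares at least one reactant becomes an undirected edge
pairs <- merge(ko2cpd, ko2cpd, by = "cpd")
pairs <- unique(pairs[pairs$ko.x < pairs$ko.y, c("ko.x", "ko.y")])
nw    <- graph_from_data_frame(pairs, directed = FALSE)
```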
The script outputs the reconstructed network, as well as a file with all KOs and nodes in the network. This can be used to find the expression levels of genes [functionally annotated](functional-annotations.md) with these KOs and to restrict the network to only those nodes represented by (expressed) genes.
Based on expression levels, sub-modules with significant differences under different conditions (such as samples from different individuals or families, disease status etc.) can be described. We used the R package BioNet (https://bioconductor.org/packages/release/bioc/html/BioNet.html) for this. BioNet can be applied using the program Heinz (https://github.com/ls-cwi/heinz) or an R implementation (FastHeinz). Functions to either use the pure-R version of BioNet with output from DESeq2 (https://bioconductor.org/packages/release/bioc/html/DESeq2.html) or to call Heinz on the command line from R (using [`runHeinz.sh`](runHeinz.sh)) can be found in [`140630_MUST_NW.R`](140630_MUST_NW.R).
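As a minimal sketch of the pure-R route (not code from `140630_MUST_NW.R`; it assumes `pvals` is a named vector of per-node p-values from the differential expression analysis and `network` is the reconstructed network as an igraph object):
```
library(BioNet)

# fit a beta-uniform mixture model to the node p-values, score the nodes and
# search for the highest-scoring sub-module with the R implementation of Heinz (FastHeinz)
fb     <- fitBumModel(pvals, plot = FALSE)
scores <- scoreNodes(network, fb, fdr = 0.01)
module <- runFastHeinz(network, scores)
```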
Functions supplied with BioNet can then be used to make a graphical representation of the highest-scoring module.
To make a fancier plot, which also displays the relative gene/transcript/protein copy numbers linked to the binned population-level genome reconstructions, further functions can be found in [`plotModules_omicLevels.R`](plotModules_omicLevels.R). These functions, in turn, use data retrieved from a [mongo database](mongo-database.md). These plots are very rich in specific information, so the script should not be expected to work without this information.