Gitlab is now using https://gitlab.lcsb.uni.lu as it's primary address. Please update your bookmarks. FAQ.

Commit d0fd5cfe authored by Anna Buschart's avatar Anna Buschart
Browse files

add tree-based analysis

parent e458d5fd
#This script contains two functions (zoomSample, zoomTaxa) which can be used to colour phylogenetic trees based on the samples of origin
# of the phylogenetic marker genes forming its leaves. They also pull different sub-groups from the tree,
# based on earlier taxonomic annotations or based on the samples of origin.
# Examples for usage of these functions are shown in lines 87ff.
#The script would need phylogenetic trees in Newick format (see line 98) and additional information on the chosen colours, samples of origin
# and the contig clusters the genes building the tree belong to (see line 88).
#The script is based on the file structure and sample IDs of the MuSt project and should not be expected to run on other data as is!!!
#written by Anna Heintzbuschart, August 2015
library(ape)
library(geiger)
library(RColorBrewer)
zoomSample <- function (phy, focus, subtree = FALSE, col.sub = Mcol[visFac], col.tip = compCol, markerGene = mark, ...) {
if (!is.list(focus))
focus <- list(focus)
n <- length(focus)
for (i in 1:n) if (is.character(focus[[i]]))
focus[[i]] <- which(phy$tip.label %in% focus[[i]])
if (is.function(col.sub)) {
col.sub <- if (deparse(substitute(col.sub)) == "grey")
grey(1:n/n)
else col.sub(n)
}
ext <- vector("list", n)
extcol <- vector("list", n)
extname <- names(focus)
for (i in 1:n) {
ext[[i]] <- drop.tip(phy, phy$tip.label[-focus[[i]]], subtree = subtree, rooted = TRUE)
extcol[[i]] <- unlist(sapply(ext[[i]]$tip.label,function(x) col.tip[which(phy$tip.label==x)]))
ext[[i]]$tip.label <- gsub("M.+_","",gsub("_Contig.+","",ext[[i]]$tip.label))
}
nc <- round(sqrt(n)) + 1
nr <- ceiling(sqrt(n))
M <- matrix(0, nr, nc)
x <- c(rep(1, nr), 2:(n + 1))
M[1:length(x)] <- x
layout(M, c(1, rep(3/(nc - 1), nc - 1)))
phy$tip.label <- rep("", length(phy$tip.label))
colo <- rep("black", dim(phy$edge)[1])
for (i in 1:n) colo[which.edge(phy, focus[[i]])] <- col.sub[i]
plot.phylo(phy, edge.color = colo, ...)
mtext(markerGene,3,1)
for (i in 1:n){
plot.phylo(ext[[i]], edge.color = col.sub[i], tip.color = extcol[[i]])
mtext(extname[i],3,0,col=col.sub[i],font=2)
}
}
zoomTaxa <- function (phy, focus, subtree = FALSE, col.sub = rainbow, col.tip = sampCol, ...) {
if (!is.list(focus))
focus <- list(focus)
n <- length(focus)
for (i in 1:n) if (is.character(focus[[i]]))
focus[[i]] <- which(phy$tip.label %in% focus[[i]])
if (is.function(col.sub)) {
col.sub <- if (deparse(substitute(col.sub)) == "grey")
grey(1:n/n)
else col.sub(n)
}
ext <- vector("list", n)
extcol <- vector("list", n)
for (i in 1:n) {
ext[[i]] <- drop.tip(phy, phy$tip.label[-focus[[i]]], subtree = subtree, rooted = TRUE)
extcol[[i]] <- unlist(sapply(ext[[i]]$tip.label,function(x) col.tip[which(phy$tip.label==x)]))
ext[[i]]$tip.label <- gsub("_Contig.+","",ext[[i]]$tip.label)
}
nc <- round(sqrt(n)) + 1
nr <- ceiling(sqrt(n))
M <- matrix(0, nr, nc)
x <- c(rep(1, nr), 2:(n + 1))
M[1:length(x)] <- x
layout(M, c(1, rep(3/(nc - 1), nc - 1)))
phy$tip.label <- rep("", length(phy$tip.label))
colo <- rep("black", dim(phy$edge)[1])
for (i in 1:n) colo[which.edge(phy, focus[[i]])] <- col.sub[i]
plot.phylo(phy, edge.color = colo, ...)
for (i in 1:n){
plot.phylo(ext[[i]], edge.color = col.sub[i], tip.color = extcol[[i]])
}
}
#####
cI <- readRDS("clusterInf.RDS")
cis <- cI[cI$cluster!="S",]
McolA <- c(brewer.pal(7,"Blues")[3:7],brewer.pal(7,"Oranges")[3:7],brewer.pal(6,"Purples")[3:6],brewer.pal(8,"Greens")[3:8]) #all 4 families
Mcol <- McolA[-c(5,11)]
visFac <- c(rep(1:3,each=3),rep(4,times=2),rep(5,times=3),rep(6,times=2),rep(7:8,each=3),rep(9,times=2),10:13,rep(14,times=2),15,rep(16,times=2),17,rep(18,times=2))
samples <- unique(cI$sample)
compCol <- colorRampPalette(brewer.pal(11,"Spectral"))(109)[109:1]
markers <- c("rpoB","COG0012","COG0016","COG0018","COG0172","COG0215","COG0495","COG0525","COG0533","COG0541","COG0552")
for(mark in markers){
tall <- read.tree(paste(mark,".allCompleteGenes.faa.final_tree.nw",sep=""))
ci <- cis[cis$cluster!="N"&apply(cis[,1:2],1,function(x) paste(x,sep="_",collapse="_")) %in%
gsub("_Contig.+","",tall$tip.label),]
tiplist <- list()
for(sample in samples){
tiplist[[sample]] <- which(gsub("_.+","",tall$tip.label)==sample)
}
comptip <- unlist(sapply(gsub("_Contig.+","",tall$tip.label),function(x){
if(grepl("S",x)) "grey75" else if(grepl("N",x)) "grey50" else {
compCol[ci$uniqueEss[apply(ci[,1:2],1,function(x) paste(x,sep="_",collapse="_")) == x]+1]
}
}))
samptip <- unlist(sapply(gsub("_.+","",tall$tip.label),function(x) Mcol[visFac][which(samples==x)]))
pdf(paste(mark,"perSampleTrees.pdf",sep="."),width=15,height=18,pointsize=8)
zoomSample(tall,tiplist,col.sub=Mcol[visFac],col.tip=comptip)
dev.off()
pdf(paste(mark,"unclassPhylaTree.pdf",sep=""),width=11,height=8,pointsize=8)
zoomTaxa(tall,which(gsub("_Contig.+","",tall$tip.label) %in% apply(ci[grep("NA",ci$phylaMotuPresent),1:2],1,
function(x) paste(x,sep="_",collapse="_"))),
col.tip=samptip)
mtext("Unclassified phyla",1,1,font=2)
mtext(mark,3,1)
dev.off()
}
......@@ -34,6 +34,9 @@ to automatically [cluster](automatic-clustering.md) contigs based on nucleotide
* autoCluster.R
* fastaExtractCutRibosomal1000.pl
* makeWSvarAnnoCorrect.R
to gather contig clusters by [related phylogenetic marker genes in a phylogenetic tree](phylogenetic-marker-genes-trees.md):
* 150819_MUST_tree.R
to [reconstruct](reconstructed-KO-network.md) a metabolic network from KOs and analyse it:
* 140630_MUST_NW.R
......
This workflow makes use of the functional annotation data and the completeness information stored in the [mongo database](mongo-database.md), collects the amino acid sequences of all complete phylogenetic marker gene predictions in all samples and builds a phylogenetic tree based on multiple sequence alignment for each class of marker genes. The information about membership of the marker genes in a contig [cluster](automatic-clustering.md) (binned reconstructed population-level genome) is retained, so closely related reconstructed genomes from different samples can be indentified as such.
In the first step, an R-script is run to retrieve all complete genes annotated as one of the mOTU [marker genes](annotate-phylogenetic-marker-genes.md) - COG0012, COG0016, COG0018, COG0172, COG0215, COG0495, COG0525, COG0533, COG0541, COG0552 - or __rpoB__ (TIGR02013) from the [mongo database](mongo-database.md). It returns a table with the gene IDs, the sample IDs and the contig cluster IDs of the complete marker genes for each sample and class of marker gene. (The script is documented here, but it depends on the specific file structure used in this project, so don't expect it to run anywhere else).
```
Rscript 150816_getPhyloMarkers.R
```
This table is then used as input to a perl script which extracts the genes of interest from the fasta file containing all gene predictions of a sample (names of samples are part of the combinedIds file). We append the output of the script to the same file, so we have one fasta file with all complete genes for each class of marker genes.
```
for lib in `cut -f 2 combinedIds`
do
cd $lib/
for marker in COG0012 COG0016 COG0018 COG0172 COG0215 COG0495 COG0525 COG0533 COG0541 COG0552
do
perl fastaProteinExtractAddSampleCluster.pl $marker.faa $marker.geneNamesClusters.tsv >> ../../trees/$marker.allCompleteGenes.faa
done
perl fastaProteinExtractAddSampleCluster.pl genes.faa rpoB.geneNamesClusters.tsv >> ../../trees/rpoB.allCompleteGenes.faa
cd ../
done
cd ../../trees/
```
Now, we use the [ETE toolkit](http://etetoolkit.org/) to build a phylogenetic tree for each of the marker gene classes.
```
for gene in rpoB COG0012 COG0016 COG0018 COG0172 COG0215 COG0495 COG0525 COG0541 COG0552 COG0533
do
echo $gene
python ete.py build --noimg --cpu 1 -a $gene.allCompleteGenes.faa -o allCompleteGenes.muscle_default-none-none-fasttree_default -w muscle_default-none-none-fasttree_default
done
```
The output of ETE includes a tree in Newick file format _.allCompleteGenes.faa.final_tree.nw_. We can read this into R using the package [ape](http://ape-package.ird.fr/).
To find groups of marker genes which are close to each other, we can search the tree for polytomous points with more than 7 leaves.
```
library(ape)
library(geiger)
markers <- c("rpoB","COG0012","COG0016","COG0018","COG0172","COG0215","COG0495","COG0525","COG0533","COG0541","COG0552")
for(mark in markers){
tall <- read.tree(paste(mark,".allCompleteGenes.faa.final_tree.nw",sep=""))
degree <- tabulate(tall$edge[, 1])
target <- which(degree > 7)
for(group in target){
write.table(data.frame("sample"=gsub("_.+","",tips(tall,group)),
"cluster"=gsub("M.+_","",gsub("_Contig.+","",tips(tall,group))),stringsAsFactors=F),
paste(mark,group,"clusters.tsv",sep="."),row.names=F,quote=F,sep="\t")
}
}
```
This will write a tab-separated table for each group of closely related genes. The tables contain the ID of samples and the ID of the contig clusters (population-level genome) which contain a phylogenetic marker gene from the group of related genes. The files are named by a number that derives from the phylogenetic tree and the marker gene.
Another useful package for phylogenetic tree visualization in R is [geiger](http://www.webpages.uidaho.edu/~lukeh/software.html). Using this and ape, we can visualize a tree for each marker gene and colour it according to the sample of origin. Or we can extract specific parts of the tree, like for example leaves formed by phylogenetic marker genes which belong to contig clusters with weak classification. Both examples are performed in the script _150819_MUST_tree.R_, which contains some nice functions for visualization. The script is however based on the MuSt sample IDs and MuSt file structure and should not be run as is on other data sets.
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment