
Commit e41873fd authored by Patrick May

docs by anna

parent 2ab0c238
Pipeline #26066 failed with stages
in 0 seconds
......@@ -8,12 +8,16 @@ IMP3 output: Analysis
Annotation
----------
The results of the :ref:`Analysis <step_analysis>` are written into the ``Analysis`` directory within the defined ``outputdir``, see
IMP3 writes the results of the :ref:`Analysis <step_analysis>` into the ``Analysis`` directory within the defined ``outputdir``, see
:ref:`configuration <configuration>`.
The gene annotation results are in the ``annotation`` subdirectory:
The gene annotation results are in the ``annotation`` subdirectory.
The `GFF3 <http://gmod.org/wiki/GFF3>`_ file ``annotation_CDS_RNA_hmms.gff`` is the final annotation file and contains all gene annotations (including mRNAs, rRNAs and tRNAs) with the functional annotations and gene descriptions.
............
The GFF file
............
The `GFF3 <http://gmod.org/wiki/GFF3>`_ file ``annotation_CDS_RNA_hmms.gff`` is the final annotation file and contains all gene annotations (including mRNAs, rRNAs, tRNAs and CRISPR arrays) with the functional annotations and gene descriptions.
The file is in `standard GFF format <https://www.ensembl.org/info/website/upload/gff.html>`_ with nine columns:
- 1: contig ID,
......@@ -24,17 +28,23 @@ The file is in `standard GFF format <https://www.ensembl.org/info/website/upload
- 6: a score ("." if no score is reported),
- 7: the feature's direction or sense ("+" or "-"),
- 8: frame (0 for CDS, "." for other features),
- 9: attributes (each attribute starts with a key followed by "=", e.g. "ID=", attributes are separated by ";"
- 9: attributes (each attribute starts with a key followed by "=", e.g. "ID=", attributes are separated by ";" )
The most important attributes of column *9* are:
- **ID**, which is used in other gene-based outputs,
- **partial**, which contains information from the used gene-caller(`Prodigal <https://github.com/hyattpd/Prodigal>`_) about whether the open reading frames are complete:
- **partial**, which contains information from the gene-caller (`Prodigal <https://github.com/hyattpd/Prodigal>`_) about whether the open reading frames are complete:
- *00* means both start and stop codons were found,
- *11* means neither were found,
- *01* means the right-most end is incomplete - i.e. missing stop-codon for +strand features, missing start-codon for -strand features,
- *10* means the left-most end is incomplete - i.e. missing start-codon for +strand features, missing stop-codon for -strand features, and
- **the results of the HMM searches**, named with the HMM database name provided in the **config file**, e.g. ``essential=`` for the essential genes.
- **the results of the HMM searches**, named with the HMM database name provided in the **config file**, e.g. *essential=* for the essential genes.
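As an illustration of the column 9 layout described above, the following Python sketch splits an attribute string into key/value pairs and looks up the meaning of the ``partial`` code. The example line, the gene ID, the HMM hit and the helper names are invented for illustration and are not part of IMP3.

```python
# Meaning of Prodigal's "partial" codes as listed above.
PARTIAL_CODES = {
    "00": "complete (start and stop codon found)",
    "11": "neither start nor stop codon found",
    "01": "right-most end incomplete",
    "10": "left-most end incomplete",
}


def parse_attributes(column9):
    """Split a GFF attribute string into a dict of key -> value."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing ";"
        key, _, value = field.partition("=")
        attributes[key] = value
    return attributes


def parse_gff_line(line):
    """Return the nine GFF columns, with column 9 parsed into a dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 columns, found {len(columns)}")
    columns[8] = parse_attributes(columns[8])
    return columns


# Hypothetical example line; all values are made up.
example = (
    "test_contig_1\tProdigal\tCDS\t3\t302\t.\t+\t0\t"
    "ID=gene_00001;partial=10;essential=TIGR00001"
)
fields = parse_gff_line(example)
attrs = fields[8]
print(attrs["ID"], "-", PARTIAL_CODES[attrs["partial"]])
```

Comment lines starting with "#" are not handled here and would need to be skipped before calling ``parse_gff_line``.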
The functional annotation of the genes by `HMMer <http://hmmer.org/>`_ produces a number of intermediary outputs. With the results summarized
in ``annotation_CDS_RNA_hmms.gff``, the intermediary files are archived and compressed in ``intermediary.tar.gz``.
......................
Reads per gene / group
......................
The annotated features are used to determine the numbers of reads mapping to the features and to groups of features that share the
same functional annotation using `featureCounts <http://bioinf.wehi.edu.au/featureCounts/>`_:
......@@ -50,24 +60,30 @@ same functional annotation using `featureCounts <http://bioinf.wehi.edu.au/featu
- In addition, summaries of the numbers of reads that were mapped and overlapped with the features are found in the respective files ending on ``tsv.summary``.
- If only **metaT** reads were used as input, there will be no data for rRNA, because rRNA has been filtered out.
The *Fasta* file for downstream **metaP** analysis is called ``proteomics.final.faa``. The *Fasta* header for the protein sequences is in a format that
should work with most proteomics search engines and analysis tools, in particular the `MetaProteomeAnalyzer <http://www.mpa.ovgu.de/>`_ . If the user chose to add
host proteins (see :ref:`configuration <configuration>`), the provided host protein *Fasta* file should also be consistent with the requirements of the proteomics software used.
As an intermediary step (without host proteins) a file named ``proteomics.proteins.faa`` is generated.
.......................................
Proteomics databases and gene sequences
.......................................
IMP3 output can be used for downstream **metaP** analysis:
- ``proteomics.final.faa``: the FASTA file for downstream **metaP** analysis. The FASTA header for the protein sequences is in a format that
should work with most proteomics search engines and analysis tools, in particular the `MetaProteomeAnalyzer <http://www.mpa.ovgu.de/>`_ .
- As an intermediary step (without host proteins), a file named ``proteomics.proteins.faa`` is generated.
The proteomics file is a cleaned version of ``prokka.faa``. `prokka <https://github.com/tseemann/prokka>`_ outputs a few more files:
- ``prokka.ffn`` with the CDS,
- ``prokka.fna`` with the contigs (also present in ``prokka.fsa`` with a slightly different header),
- ``prokka.log``, a logfile,
- ``prokka.txt``, a summary of the number of analysed contigs and annotated features,
- ``prokka.tsv``, a tabular output with all features,
- ``prokka.tbl``, a sort-of flat version of the same information,
- ``prokka.gff``, a `GFF <http://gmod.org/wiki/GFF3>`_ file with lots of commented lines (starting with "#"), which is actually the foundation of the `GFF <http://gmod.org/wiki/GFF3>`_ file described above.
The proteomics file is a cleaned version of ``prokka.faa``. In addition to the amino-acid sequences, `prokka <https://github.com/tseemann/prokka>`_
also outputs the CDS as ``prokka.ffn``, the contigs in ``prokka.fna`` and - with a slightly different header - in ``prokka.fsa``, as well
as a logfile ``prokka.log``, a summary of the number of analysed contigs and annotated features ``prokka.txt``, a tabular output
with all features ``prokka.tsv``, and a sort-of flat version of the same information in ``prokka.tbl``, and a `GFF <http://gmod.org/wiki/GFF3>`_ file with
lots of commented lines (starting with "#") ``prokka.gff``, which is actually the foundation of the `GFF <http://gmod.org/wiki/GFF3>`_ file described above.
An intermediary step between this file and the final `GFF <http://gmod.org/wiki/GFF3>`_ is ``annotation.filt.gff`` which contains all the information of the
original prokka output minus the comments. Depending on the planned further analysis steps, the user may also see indices for some of the sequences
(``prokka.<faa|ffn>.<suffix>``). If the IMP3 :ref:`Binning <step_binning>` is run, IMP3 will add a file containing the link between the gene IDs
as given by `prokka <https://github.com/tseemann/prokka>`_ and the format required by `DASTool <https://github.com/cmks/DAS_Tool>`_, called ``annotation_CDS_RNA_hmms.contig2ID.tsv``.
The functional annotation of the genes by `HMMer <http://hmmer.org/>`_ produces a number of intermediary outputs. With the results summarized
in ``annotation_CDS_RNA_hmms.gff``, the intermediary files are archived and compressed in ``intermediary.tar.gz``.
----
SNPs
----
......
......@@ -7,13 +7,14 @@ IMP3 output: Assembly
During the :ref:`Assembly step <step_assembly>` (or if the user provided an existing assembly, see :ref:`input <input_options>`) all outputs will be written to
the ``Assembly`` directory within the defined ``outputdir`` directory (see :ref:`configuration <configuration>`).
- The final output is a *Fasta* file of the assembled contigs ``<mg|mt|mgmt>.assembly.merged.fa`` [\*] . The *Fasta* headers contain the **sample name** as given in the :ref:`config file <configuration>`, separated by an underscore, *contig*, another underscore, and a number, e.g. *test_contig_1*.
- The final output is a FASTA file of the assembled contigs ``<mg|mt|mgmt>.assembly.merged.fa`` [\*] . The FASTA headers contain the **sample name** as given in the :ref:`config file <configuration>`, separated by an underscore, *contig*, another underscore, and a number, e.g. *test_contig_1*.
- The **index** files of the final contig file are also generated, namely for `BWA <http://bio-bwa.sourceforge.net/>`_ (suffixes ``amb``, ``ann``, ``bwt``, ``pac``, and ``sa``), `Samtools <http://www.htslib.org/doc/faidx.html>`_ (``fai``) and bioperl (``index``).
- A ``bed3`` file is also stored for later access.
**Note**: *some of these files are produced by the* :ref:`Analysis step <step_analysis>` *, so they are not all generated during the* :ref:`Assembly step <step_assembly>`*.*
**Note**: *some of these files are produced by the* :ref:`Analysis step <step_analysis>` *, so they will not be present after running only the* :ref:`Assembly step <step_assembly>`.
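The contig naming scheme described above (sample name, ``_contig_``, number) can be taken apart again with a few lines of Python. This is a minimal sketch; the function name is hypothetical and it assumes that the header follows the scheme exactly.

```python
def split_contig_name(header):
    """Split '<sample>_contig_<number>' into (sample, number)."""
    # Drop a leading ">" and anything after the first whitespace.
    name = header.lstrip(">").split()[0]
    # Split at the LAST "_contig_" so that sample names containing
    # underscores are kept intact.
    sample, separator, number = name.rpartition("_contig_")
    if not separator or not number.isdigit():
        raise ValueError(f"unexpected contig name: {name!r}")
    return sample, int(number)


print(split_contig_name(">test_contig_1"))  # ('test', 1)
```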
During the :ref:`Assembly step <step_assembly>`, the **processed** reads will be mapped back to the final contigs and the alignment is stored
``<mg|mt>.reads.sorted.bam`` and index ``<mg|mt>.reads.sorted.bam.bai``. The `BAM <https://genome.sph.umich.edu/wiki/BAM>`_ files are sorted.
During the :ref:`Assembly step <step_assembly>`, IMP3 maps the **processed** reads back to the final contigs and stores the alignment as
``<mg|mt>.reads.sorted.bam`` with the index ``<mg|mt>.reads.sorted.bam.bai``. The `BAM <https://genome.sph.umich.edu/wiki/BAM>`_ files are sorted
by contig name and position.
The :ref:`Assembly step <step_assembly>` actually consists of a large number of sub-steps (:ref:`iterative assembly <step_assembly>`) generating a huge
amount of intermediate result files and directories that will be archived and compressed into ``intermediary.tar.gz``.
......@@ -26,11 +27,11 @@ Stats from Assembly step
The :ref:`Assembly step <step_assembly>` will also collect some summary statistics and save them to the ``Stats`` directory:
- The GC-content of the final contigs is recorded in the tab-separated file ``<mg|mt|mgmt>/<mg|mt|mgmt>.assembly.gc_content.txt``, which contains a header and holds two columns for the contig names and GC in percent (i.e. 0-100), respectively.
- The GC-content of the final contigs is recorded in the tab-separated file ``<mg|mt|mgmt>/<mg|mt|mgmt>.assembly.gc_content.txt``, which contains a header and holds two columns for the contig names and GC in percent (i.e. 0-100), respectively.
- The length of the contigs is provided in a tab-separated file ``<mg|mt|mgmt>/<mg|mt|mgmt>.assembly.length.txt`` with two simple columns, contig names and lengths.
- The numbers of the read mapping are kept in the ``Stats`` subdirectories with the **metaG** and/or **metaT** (``mg/`` or ``mt/``):
- The stats on the read mapping are kept in the ``Stats`` subdirectories with the **metaG** and/or **metaT** (``mg/`` or ``mt/``):
- ``<mg|mt>/<mg|mt|mgmt>.assembly.contig_flagstat.txt`` contains the numeric part of the `samtools flagstat <http://www.htslib.org/doc/samtools-flagstat.html>`_ output.
- The average depth of coverage for each set of reads for all contigs that have at least one read mapping to them is given in ``<mg|mt>/<mg|mt|mgmt>.assembly.contig_depth.txt``. This file is headerless and tab-separated, with the contig names in the first column and the average depth of coverage in the second.
- the average depth of coverage for each set of reads for all contigs that have at least one read mapping to them is given in ``<mg|mt>/<mg|mt|mgmt>.assembly.contig_depth.txt``. This file is headerless and tab-separated, with the contig names in the first column and the average depth of coverage in the second.
- Based on this file, the :ref:`Binning step <step_binning>` adds ``<mg|mt>/<mg|mt|mgmt>.assembly.contig_depth.0.txt``, which also contains lines for contigs with zero coverage.
- The file ``<mg|mt>/<mg|mt|mgmt>.assembly.contig_coverage.txt`` contains the processed output of `bedtools genomeCoverageBed <https://bedtools.readthedocs.io/en/latest/content/tools/genomecov.html>`_. It contains seven tab-separated columns:
- 1: contig names,
......
......@@ -20,7 +20,7 @@ Overview of IMP3 output
Logs
While running, IMP3 will write the intermediary output and final results into the user-defined **output directory** (``outputdir``, see :ref:`configuration <configuration>`).
Before finishing IMP3 will compress some of the intermediary steps to reduce space. Finally, the IMP3 workflow will generate
Before finishing, IMP3 will compress some of the intermediary steps to reduce space. Finally, the IMP3 workflow will generate
summaries and visualizations (see :ref:`configuration <configuration>` if defined).
An overview of all outputs, files and their directory structure is given below:
......
This diff is collapsed.
......@@ -8,11 +8,11 @@ IMP3 is designed to perform integrated analyses of **metaG** and **metaT** data.
and mapped reads.
A typical :ref:`workflow <steps_overview>` starts with a pair of files of paired-end **metaG** reads and a pair
of **metaT** reads (both in `Fastq <https://en.wikipedia.org/wiki/FASTQ_format>`_ format, either gzipped or not). Alternatively, IMP3 can take only **metaG**
of **metaT** reads (both in `FASTQ <https://en.wikipedia.org/wiki/FASTQ_format>`_ format, either gzipped or not). Alternatively, IMP3 can take only **metaG**
or only **metaT** reads.
If the data is already assembled and the contigs need to get annotated and binned into metagenomics-assembled genomes (**MAGs**),
IMP3 also takes assemblies (in *Fasta* format) in addition to alignments (`BAM <https://genome.sph.umich.edu/wiki/BAM>`_ files) or trimmed reads (`Fastq <https://en.wikipedia.org/wiki/FASTQ_format>`_ format)
If the data is already assembled and the contigs should be annotated and binned into metagenomics-assembled genomes (**MAGs**),
IMP3 also takes assemblies (in FASTA format) in addition to alignments (`BAM <https://genome.sph.umich.edu/wiki/BAM>`_ files) or trimmed reads (`FASTQ <https://en.wikipedia.org/wiki/FASTQ_format>`_ format)
as input.
All inputs are defined in the :ref:`config file <configuration>`.
......@@ -21,9 +21,10 @@ All inputs are defined in the :ref:`config file <configuration>`.
Reads
-----
The ``Metagenomics`` and ``Metatranscriptomics`` input fields expect two or three `Fastq <https://en.wikipedia.org/wiki/FASTQ_format>`_ or gzipped `Fastq <https://en.wikipedia.org/wiki/FASTQ_format>`_ files,
The ``Metagenomics`` and ``Metatranscriptomics`` input fields expect two or three `FASTQ <https://en.wikipedia.org/wiki/FASTQ_format>`_ or
gzipped `FASTQ <https://en.wikipedia.org/wiki/FASTQ_format>`_ files,
separated by a space. The first two files should be the forward and reverse reads, if IMP3 analysis should start from pre-processing reads. The reads
need to be in the same order in both files. The singleton read file is given last if available.
need to be in the same order in both files.
For processing the original raw `Fastq <https://en.wikipedia.org/wiki/FASTQ_format>`_ files, the following setting should be used:
......@@ -38,8 +39,10 @@ For processing the original raw `Fastq <https://en.wikipedia.org/wiki/FASTQ_form
Alignment_metagenomics: ""
Alignment_metatranscriptomics: ""
If the user has **already pre-processed** reads, either to the :ref:`Assembly <step_assembly>` step or for further analysis,
the user can pass one additional single ends file. **Note**: *In this case, the user should **not** include ``preprocessing`` in the IMP :ref:`steps <steps_overview>`.*
If the user has **already pre-processed** reads, either for the :ref:`Assembly <step_assembly>` step or for further analysis,
the user can pass one additional single ends file. The singleton read file is given last.
**Note**: *In this case, the user should NOT include* ``preprocessing`` *in the IMP* :ref:`steps <steps_overview>`.
.. code-block:: yaml
......@@ -55,7 +58,9 @@ the user can pass one additional single ends file. **Note**: *In this case, the
The user can also give only **metaG** or only **metaT** reads to IMP3, either raw or pre-processed.
If the user wishes to perform a **metaG** assembly including long reads, a single file of **already pre-processed** long-read data (in `Fastq <https://en.wikipedia.org/wiki/FASTQ_format>`_ format) can be provided. In this case, additionally the sequencing method
(possible values are ``nanopore`` and ``pacbio``) can be added. **Note**: *The long reads are only used by IMP3, if ``metaspades`` is chosen as assembler.*
(possible values are ``nanopore`` and ``pacbio``) can be added.
**Note**: *The long reads are only used by IMP3, if* ``metaspades`` *is chosen as assembler.*
.. code-block:: yaml
......@@ -73,9 +78,9 @@ Contigs and alignments
----------------------
If the user already has an assembly and would like to use IMP3 to annotate genes, perform binning and/or determine contig-level
taxonomy, the contigs can be used in input in *Fasta* format. In addition, the user can give reads either in `Fastq <https://en.wikipedia.org/wiki/FASTQ_format>`_ files or already aligned
as `BAM <https://genome.sph.umich.edu/wiki/BAM>`_ files. For `Fastq <https://en.wikipedia.org/wiki/FASTQ_format>`_ files, the same limitations apply as discussed above.
The `BAM <https://genome.sph.umich.edu/wiki/BAM>`_ files should be sorted by contig name.
taxonomy, the contigs can be used as input in FASTA format. In addition, the user can give reads either in `FASTQ <https://en.wikipedia.org/wiki/FASTQ_format>`_ files or already aligned
as `BAM <https://genome.sph.umich.edu/wiki/BAM>`_ files. For `FASTQ <https://en.wikipedia.org/wiki/FASTQ_format>`_ files, the same limitations apply as discussed above.
The `BAM <https://genome.sph.umich.edu/wiki/BAM>`_ files should be sorted by contig name and coordinate.
.. code-block:: yaml
......
......@@ -5,7 +5,6 @@ Running and configuring IMP3
.. toctree::
:maxdepth: 1
:caption: Running and configuring IMP3
:hidden:
:name: running_IMP3
run
......
......@@ -5,12 +5,38 @@
IMP3 steps: Analysis
====================
In the Analysis step, IMP3 calls open reading frames, rRNA and tRNA genes, and annotates CRISPR repeats.
Open reading frames are functionally annotated. The numbers of reads mapping to each gene and to each functional
group of genes are calculated. The steps are described in more detail below.
.. _prokkaC:
-----------------
Customized prokka
-----------------
IMP3 uses `prokka <http://www.vicbioinformatics.com/software.prokka.shtml>`_ to call
open reading frames (ORF), rRNA and tRNA genes, and annotate CRISPR repeats. Internally,
`prokka <http://www.vicbioinformatics.com/software.prokka.shtml>`_ calls `prodigal <https://github.com/hyattpd/Prodigal>`_
for the ORF calling, `barrnap <https://github.com/tseemann/barrnap>`_ for the rRNA regions,
`ARAGORN <https://academic.oup.com/nar/article/32/1/11/1194008>`_ for tRNA loci and
`MinCED <https://github.com/ctSkennerton/minced>`_ to detect CRISPR arrays.
Prokka forces prodigal to only call complete genes. Due to the fragmented nature of metagenomic contigs, it
is preferable to also allow partial genes. IMP3's customized prokka allows prodigal to call incomplete
ORFs and records whether prodigal detected start and stop codons. One side-effect of this is that the
amino acid sequences prokka returns in ``prokka.faa`` are badly formatted. The prokka amino acid sequences also don't start with M
if prodigal called genes with an alternative start codon. Both issues are corrected in another IMP3 step as part of the metaproteomics
preparations (``proteomics.proteins.faa``).
Prokka would usually provide some functional analyses by aligning the called ORFs to some databases. However, this analysis
is optimized for speed, meaning that genes that have been annotated with one database are not annotated with the next. This
leads to genes potentially having inconsistent annotations, and it would be impossible for the user to find out what would have
been the best hit in another database. Since IMP3 does :ref:`functional annotations <HMMs>` of all genes with any database
the user chooses to reach consistent annotations, we've disabled the prokka-based annotation.
Prokka also spends considerable time to convert its output into genbank format. As IMP3 has no need for genbank-formatted
data, we've disabled this.
.. _HMMs:
......@@ -20,6 +46,8 @@ Annotation with HMMs
.. _variants_metaP:
---------------------------
......
......@@ -5,3 +5,57 @@
IMP3 steps: Assembly
====================
IMP3 uses the iterative assembly approach of the original IMP. Like IMP, IMP3 can perform hybrid assemblies of metagenomic
and metatranscriptomic reads. In addition, IMP3 can perform purely metagenomic assemblies, ignoring metatranscriptomic reads. IMP3
can also perform another kind of hybrid assembly, namely of short and long metagenomic reads. Obviously, IMP3 also performs
iterative assemblies of only metagenomic or only metatranscriptomic reads.
After the assembly, reads are mapped back to the assembly.
.. _step_mono_iter_assembly:
-----------------------------------------------------------
Iterative assembly: metagenomic OR metatranscriptomic reads
-----------------------------------------------------------
In the simplest case, the user provides only one kind of short reads (metaG or metaT). IMP will then use the assembler defined
by the user (`Megahit <http://www.metagenomics.wiki/tools/assembly/megahit>`_
or `MetaSpades <http://cab.spbu.ru/software/meta-spades/>`_) to try to assemble all reads. After the assembly, the reads are mapped
back to the assembled contigs. Reads that did not map will be given to the same assembler again.
The set of contigs from the second assembly will be merged with the first set using the overlap assembler
`CAP3 <http://seq.cs.iastate.edu/cap3.html>`_. The IMP developers
have `shown <https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1116-8>`_ that this approach leads to longer
contigs without compromising assembly quality. Finally, the metaG or metaT reads are mapped
back to the final set of contigs by `BWA <http://bio-bwa.sourceforge.net/bwa.shtml>`_.
The same approach is performed for the metaG reads, if the user provides metaG and metaT but does not choose ``hybrid`` as assembly option.
In this case, both metaG and metaT reads are mapped to the final metagenomic assembly.
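The control flow of the iterative assembly described above can be sketched in Python as follows. This is not IMP3's implementation: the assembler, the read mapper and the contig merger are passed in as placeholder callables, and the toy stand-ins below only mimic the behaviour of the real tools on made-up read labels.

```python
from collections import Counter


def iterative_assembly(reads, assemble, map_reads, merge):
    """Two-round iterative assembly.

    assemble(reads) -> list of contigs
    map_reads(reads, contigs) -> set of reads that mapped
    merge(contigs_a, contigs_b) -> merged list of contigs
    All three callables are placeholders for external tools.
    """
    first_contigs = assemble(reads)
    mapped = map_reads(reads, first_contigs)
    # Reads that did not map go into a second assembly round.
    unmapped = [read for read in reads if read not in mapped]
    second_contigs = assemble(unmapped) if unmapped else []
    final_contigs = merge(first_contigs, second_contigs)
    # Finally, all reads are mapped back to the merged contigs.
    final_mapped = map_reads(reads, final_contigs)
    return final_contigs, final_mapped


# Toy stand-ins: a read is labelled by its genome of origin ("A1" comes
# from genome "A"); the toy assembler only recovers the most abundant genome.
def toy_assemble(reads):
    if not reads:
        return []
    genome, _ = Counter(read[0] for read in reads).most_common(1)[0]
    return [genome]


def toy_map(reads, contigs):
    return {read for read in reads if read[0] in contigs}


def toy_merge(contigs_a, contigs_b):
    return sorted(set(contigs_a) | set(contigs_b))


reads = ["A1", "A2", "A3", "B1", "B2"]
contigs, mapped = iterative_assembly(reads, toy_assemble, toy_map, toy_merge)
print(contigs)  # ['A', 'B']
```

In the toy run, the less abundant genome "B" is only recovered in the second round, which is the situation the iterative approach is meant to address.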
-------------------------------------------------
Hybrid assembly: long and short metagenomic reads
-------------------------------------------------
One of the assemblers built into IMP3, `MetaSpades <http://cab.spbu.ru/software/meta-spades/>`_,
is able to perform long/short-hybrid assemblies. If the user provides both kinds
of metagenomic reads, the two sets of reads are co-assembled in the first round of the iterative assembly.
Only the short reads are mapped back to the assembled contigs and, as described :ref:`above <step_mono_iter_assembly>`,
a second assembly will be attempted with the unmapped reads, and both sets of contigs will eventually be merged.
.. _step_hybrid_iter_assembly:
------------------------------------------------------------------------------
Iterative multi-omic hybrid assembly: metagenomic and metatranscriptomic reads
------------------------------------------------------------------------------
This approach was developed for the `original IMP <https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1116-8>`_.
It is currently only implemented for the `Megahit <http://www.metagenomics.wiki/tools/assembly/megahit>`_ assembler.
First, the metatranscriptomic reads are assembled. The metatranscriptomic reads are mapped back to the contigs using
`BWA <http://bio-bwa.sourceforge.net/bwa.shtml>`_.
Reads that don't map are extracted and assembly is attempted a second time. The contigs from both assemblies are
supplied as reads to the hybrid assembly, together with all metagenomic and metatranscriptomic reads. All metagenomic and
metatranscriptomic reads are mapped against the resulting contigs. Metagenomic and metatranscriptomic reads that
don't map are extracted and are co-assembled. The resulting contigs are merged with the first set of hybrid contigs using
the `CAP3 <http://seq.cs.iastate.edu/cap3.html>`_ overlap assembler.
Finally, both metaG and metaT reads are mapped back to the final set of contigs.
......@@ -5,3 +5,51 @@
IMP3 steps: Preprocessing
=========================
Preprocessing of reads consists of one to three steps: trimming,
removal of ribosomal RNAs, and filtering of reads mapping to one or more reference genomes.
----------------
Trimming
----------------
Trimming is performed on both metaG and metaT reads. Trimming is performed by `Trimmomatic <http://www.usadellab.org/cms/?page=trimmomatic>`_.
Trimmomatic trimming includes the removal of the defined adapters, removal of low-quality bases at the beginning and/or ends of the reads,
and/or truncation of reads if the quality in a sliding window becomes too low, and/or complete removal of a read if the remaining length is
too short.
Trimmomatic may produce singletons from paired-end data, when one read is completely removed due to quality reasons. The singletons produced
from the first and second reads are concatenated into one file. After the Trimmomatic step, there are therefore always
three output files for paired-end data, ``r1``, ``r2``, and ``se``.
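How paired-end data end up in the three outputs can be illustrated with a small Python sketch. It does not call Trimmomatic; it assumes that a read removed completely by trimming is represented by ``None``, and the function name and sequences are made up.

```python
def split_trimmed_pairs(pairs):
    """Sort trimmed read pairs into r1, r2 and se lists.

    `pairs` holds (forward, reverse) tuples; a read that was removed
    completely during trimming is represented by None.
    """
    r1, r2 = [], []
    se_forward, se_reverse = [], []
    for forward, reverse in pairs:
        if forward is not None and reverse is not None:
            # Both partners survived: the pair stays together.
            r1.append(forward)
            r2.append(reverse)
        elif forward is not None:
            se_forward.append(forward)
        elif reverse is not None:
            se_reverse.append(reverse)
        # If both partners were removed, nothing is kept.
    # Singletons from first and second reads are concatenated.
    return r1, r2, se_forward + se_reverse


pairs = [("ACGT", "TTGA"), ("GGCA", None), (None, "CCAT"), (None, None)]
r1, r2, se = split_trimmed_pairs(pairs)
print(r1, r2, se)  # ['ACGT'] ['TTGA'] ['GGCA', 'CCAT']
```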
A user-defined step is the removal of trailing Gs that are commonly introduced by Nextseq machines when the sequenced DNA
fragment is shorter than the number of bases added during the sequencing run. This is accomplished by
`cutadapt <https://cutadapt.readthedocs.io/en/v2.7/guide.html>`_
with setting ``--nextseq-trim`` and is performed on all reads, if requested in the config file.
------------
rRNA removal
------------
rRNA reads are separated from other metaT reads by `SortMeRNA <https://bioinfo.lifl.fr/RNA/sortmerna/>`_.
The reason is that rRNA is highly abundant in total RNA but doesn't
assemble readily with the default settings of the assemblers in IMP3. Very commonly, rRNA is actually depleted during library
preparation, with different success for different source organisms, making the rRNA abundance even less interpretable. While the rRNA
removal step removes the rRNA reads from further processing, they are kept in separate files for potential use outside of IMP3.
-------------------
Reference filtering
-------------------
Commonly, users will want to remove reads that map to one or more reference genomes, e.g. a host genome in a gut
microbiome or a known contaminant. IMP3 achieves this step for metaG and metaT reads by mapping against the files chosen
by the user with BWA. Only reads that do not map are kept. BWA is actually run independently on the paired-end data and singletons.
Both partners of a set of paired reads where one partner maps to the host genome are removed from the final data set.
If the user supplies more than one reference genome for filtering, the reads that did not map to the first reference will be
mapped against the second. The reads that did not map to this one will be mapped against the third reference and so on.
Currently reads that map to the reference genome(s) are not kept, nor are the alignments.
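The filtering logic described above (sequential references, removal of both partners of a pair, separate handling of singletons) can be sketched in Python. The ``maps_to`` callable is a placeholder for the alignment with BWA; the substring test used in the example is only a toy stand-in, and all sequences are invented.

```python
def filter_against_references(pairs, singletons, references, maps_to):
    """Keep only reads that map to none of the references.

    maps_to(read, reference) -> bool is a placeholder for read mapping.
    References are applied one after the other, so each reference only
    sees the reads that did not map to the previous ones.
    """
    for reference in references:
        # A pair is dropped as soon as one of its partners maps.
        pairs = [
            (forward, reverse)
            for forward, reverse in pairs
            if not (maps_to(forward, reference) or maps_to(reverse, reference))
        ]
        # Singletons are filtered independently of the pairs.
        singletons = [
            read for read in singletons if not maps_to(read, reference)
        ]
    return pairs, singletons


def toy_maps_to(read, reference):
    """Toy stand-in for an aligner: exact substring match."""
    return read in reference


references = ["AAAACCCC", "GGGGTTTT"]
pairs = [("AAAA", "TGCA"), ("CATG", "GTAC"), ("ACGT", "TTTT")]
singletons = ["CCCC", "AGCT"]
kept_pairs, kept_singletons = filter_against_references(
    pairs, singletons, references, toy_maps_to
)
print(kept_pairs, kept_singletons)  # [('CATG', 'GTAC')] ['AGCT']
```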
-----------
Cleaning up
-----------
Input and intermediary FASTQ files are gzipped at the end of the preprocessing step.
......@@ -5,3 +5,5 @@
IMP3 steps: Taxonomy
====================
Additionally, `kraken2 <https://ccb.jhu.edu/software/kraken2/>`_ is
run on the contigs from the :ref:`Assembly <step_assembly>` step.
\ No newline at end of file