Preprocessing: kneaddata for reads filtering
I propose using `kneaddata` for read filtering.
This tool aims to perform principled in silico separation of bacterial reads from these "contaminant" reads, be they from the host, from bacterial 16S sequences, or other user-defined sources.
- can be installed via
- can use multiple references for filtering
- outputs reads mapped to each given reference in separate FASTQ files
- can optionally run `fastqc` for the input/output FASTQ files
The rRNA filtering step could be included there as well or it could still be a separate rule. With or without the rRNA filtering, this would reduce the code complexity considerably: there would be no need for those "chained" FASTQ files with multiple filtering-suffixes in their names.
The trimming step included in `kneaddata` can, and has to, be skipped, because the optional poly-G trimming has to be done prior to filtering.
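To make the proposal more concrete, here is a minimal sketch of how a single rule could assemble the `kneaddata` call with several reference databases and with trimming skipped. The helper `build_kneaddata_cmd`, the sample paths and the database directories are hypothetical. The flags `--reference-db`, `--output`, `--threads` and `--bypass-trim` are taken from the `kneaddata` command-line help as I remember it. The input flag names differ between releases, so they are passed in by the caller and should be checked against the installed version.

```python
from typing import List, Sequence


def build_kneaddata_cmd(
    input_args: Sequence[str],
    reference_dbs: Sequence[str],
    output_dir: str,
    threads: int = 1,
) -> List[str]:
    """Assemble a kneaddata command line (hypothetical helper, not part of kneaddata)."""
    # Input flags are supplied by the caller because their names depend on
    # the kneaddata version in use (assumption: check `kneaddata --help`).
    cmd = ["kneaddata", *input_args]
    # One --reference-db per filtering reference (e.g. host, rRNA).
    for db in reference_dbs:
        cmd += ["--reference-db", db]
    cmd += ["--output", output_dir, "--threads", str(threads)]
    # Skip the built-in trimming: poly-G trimming is done upstream.
    cmd.append("--bypass-trim")
    return cmd


# Example with made-up paths; the reads are assumed to be already trimmed.
cmd = build_kneaddata_cmd(
    input_args=["--input1", "sample1_R1.trimmed.fastq", "--input2", "sample1_R2.trimmed.fastq"],
    reference_dbs=["db/host", "db/rrna"],
    output_dir="results/kneaddata/sample1",
    threads=4,
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` or joined into the `shell:` string of a rule. Whether the rRNA database is included in `reference_dbs` or handled by a separate rule is then a one-line change.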