Ribosomal RNA Filtering
Now we have some high quality reads to work with, our next step is to filter our reads. We’ll use these filtered reads in our lesson on functional analysis.
Why filter reads?
We’re going to be looking at the functional capacity of our microbial communities AKA what genes are being expressed in what abundances. That means we’ll really only be interested in mRNA, the RNA that codes for those proteins.
Our samples were treated to remove free nucleotides and tRNAs as part of the extraction process. They were also enriched for mRNAs using an enrichment kit which binds and removes rRNA fragments. But just to make sure there aren’t any rRNA fragments left, we’re going to do some more filtering.
For our lesson on taxonomic analysis we will use the unfiltered reads, since in this case it doesn’t matter whether the reads are mRNA or rRNA.
Filtering with SortMeRNA
We will use a command line tool called SortMeRNA to do this filtering.
SortMeRNA uses a large RNA database and a complex algorithm to decide whether each read is rRNA or not. It sorts reads into two files - one for rRNA, one for “other” (which in our case is most likely to be mRNA).
Making an output directory
The first thing we should do is make a directory to store our outputs from SortMeRNA. We’ll use the same mkdir
command we used in the last episode.
First, make sure you’re in the cs_course
directory.
Code
cd ~/cs_course
Now we can make a new directory for SortMeRNA outputs inside our existing results folder.
Code
mkdir -p results/sortmerna
Building the SortMeRNA command
We’re ready to start thinking about our command! Let’s have a look at the help documentation for SortMeRNA.
Code
sortmrna -h
Output
Usage: sortmerna -ref FILE [-ref FILE] -reads FWD_READS [-reads REV_READS] [OPTIONS]:
[REQUIRED]
--ref PATH Required Reference file (FASTA) absolute or relative path.
Use multiple times, once per a reference file
--reads PATH Required Raw reads file (FASTA/FASTQ/FASTA.GZ/FASTQ.GZ).
Use twice for files with paired reads.
The file extensions are Not important. The program automatically
recognizes the file format as flat/compressed, fasta/fastq
[COMMON]
--workdir PATH Optional Workspace directory USRDIR/sortmerna/run/
Default structure: WORKDIR/
idx/ (References index)
kvdb/ (Key-value storage for alignments)
out/ (processing output)
readb/ (pre-processed reads/index)
--kvdb PATH Optional Directory for Key-value database WORKDIR/kvdb
KVDB is used for storing the alignment results.
--idx-dir PATH Optional Directory for storing Reference index. WORKDIR/idx
--readb PATH Optional Storage for pre-processed reads WORKDIR/readb/
Directory storing the split reads, or the random access index of compressed reads
--fastx BOOL Optional Output aligned reads into FASTA/FASTQ file
--sam BOOL Optional Output SAM alignment for aligned reads.
--SQ BOOL Optional Add SQ tags to the SAM file
--blast STR Optional output alignments in various Blast-like formats
Sample values: '0' - pairwise
'1' - tabular (Blast - m 8 format)
'1 cigar' - tabular + column for CIGAR
'1 cigar qcov' - tabular + columns for CIGAR and query coverage
'1 cigar qcov qstrand' - tabular + columns for CIGAR, query coverage,
and strand
--aligned STR/BOOL Optional Aligned reads file prefix [dir/][pfx] WORKDIR/out/aligned
Directory and file prefix for aligned output i.e. each
output file goes into the specified directory with the given prefix.
The appropriate extension: (fasta|fastq|blast|sam|etc) is automatically added.
Both 'dir' and 'pfx' are optional.
The 'dir' can be a relative or an absolute path.
If 'dir' is not specified, the output is created in the WORKDIR/out/
If 'pfx' is not specified, the prefix 'aligned' is used
Examples:
'-aligned $MYDIR/dir_1/dir_2/1' -> $MYDIR/dir_1/dir_2/1.fasta
'-aligned dir_1/apfx' -> $PWD/dir_1/apfx.fasta
'-aligned dir_1/' -> $PWD/aligned.fasta
'-aligned apfx' -> $PWD/apfx.fasta
'-aligned (no argument)' -> WORKDIR/out/aligned.fasta
--other STR/BOOL Optional Non-aligned reads file prefix [dir/][pfx] WORKDIR/out/other
Directory and file prefix for non-aligned output i.e. each
output file goes into the specified directory with the given prefix.
The appropriate extension: (fasta|fastq|blast|sam|etc) is automatically added.
Must be used with 'fastx'.
Both 'dir' and 'pfx' are optional.
The 'dir' can be a relative or an absolute path.
If 'dir' is not specified, the output is created in the WORKDIR/out/
If 'pfx' is not specified, the prefix 'other' is used
Examples:
'-other $MYDIR/dir_1/dir_2/1' -> $MYDIR/dir_1/dir_2/1.fasta
'-other dir_1/apfx' -> $PWD/dir_1/apfx.fasta
'-other dir_1/' -> $PWD/dir_1/other.fasta
'-other apfx' -> $PWD/apfx.fasta
'-other (no argument)' -> aligned_out/other.fasta
i.e. the same output directory
as used for aligned output
--num_alignments INT Optional Positive integer (INT >=0).
If used with '-no-best' reports first INT alignments per read reaching
E-value threshold, which allows to lower the CPU time and memory use.
Otherwise outputs INT best alignments.
If INT = 0, all alignments are output
--no-best BOOL Optional Disable best alignments search False
The 'best' alignment is the highest scoring alignment out of All alignments of a read,
and the read can potentially be aligned (reaching E-value threshold) to multiple reference
sequences.
By default the program searches for best alignments i.e. performs an exhaustive search
over all references. Using '-no-best' will make the program to search just
the first N alignments, where N is set using '-num_alignments' i.e. 1 by default.
--min_lis INT Optional Search only alignments that have the LIS 2
of at least N seeds long
LIS stands for Longest Increasing Subsequence. It is computed using seeds, which
are k-mers common to the read and the reference sequence. Sorted sequences of such seeds
are used to filter the candidate references prior performing the Smith-Waterman alignment.
--print_all_reads BOOL Optional Output null alignment strings for non-aligned reads False
to SAM and/or BLAST tabular files
--paired BOOL Optional Flags paired reads False
If a single reads file is provided, use this option to indicate
the file contains interleaved paired reads when neither
'paired_in' | 'paired_out' | 'out2' | 'sout' are specified.
--paired_in BOOL Optional Flags the paired-end reads as Aligned, False
when either of them is Aligned.
With this option both reads are output into Aligned FASTA/Q file
Must be used with 'fastx'.
Mutually exclusive with 'paired_out'.
--paired_out BOOL Optional Flags the paired-end reads as Non-aligned, False
when either of them is non-aligned.
With this option both reads are output into Non-Aligned FASTA/Q file
Must be used with 'fastx'.
Mutually exclusive with 'paired_in'.
--out2 BOOL Optional Output paired reads into separate files. False
Must be used with 'fastx'.
If a single reads file is provided, this options implies interleaved paired reads
When used with 'sout', four (4) output files for aligned reads will be generated:
'aligned-paired-fwd, aligned-paired-rev, aligned-singleton-fwd, aligned-singleton-rev'.
If 'other' option is also used, eight (8) output files will be generated.
--sout BOOL Optional Separate paired and singleton aligned reads. False
To be used with 'fastx'.
If a single reads file is provided, this options implies interleaved paired reads
Cannot be used with 'paired_in' | 'paired_out'
--zip-out STR/BOOL Optional Controls the output compression '-1'
By default the report files are produced in the same format as the input i.e.
if the reads files are compressed (gz), the output is also compressed.
The default behaviour can be overriden by using '-zip-out'.
The possible values: '1/true/t/yes/y'
'0/false/f/no/n'
'-1' (the same format as input - default)
The values are Not case sensitive i.e. 'Yes, YES, yEs, Y, y' are all OK
Examples:
'-reads freads.gz -zip-out n' : generate flat output when the input is compressed
'-reads freads.flat -zip-out' : compress the output when the input files are flat
--match INT Optional SW score (positive integer) for a match. 2
--mismatch INT Optional SW penalty (negative integer) for a mismatch. -3
--gap_open INT Optional SW penalty (positive integer) for introducing a gap. 5
--gap_ext INT Optional SW penalty (positive integer) for extending a gap. 2
-e DOUBLE Optional E-value threshold. 1
Defines the 'statistical significance' of a local alignment.
Exponentially correllates with the Minimal Alignment score.
Higher E-values (100, 1000, ...) cause More reads to Pass the alignment threshold
-F BOOL Optional Search only the forward strand. False
-N BOOL Optional SW penalty for ambiguous letters (N's) scored
as --mismatch
-R BOOL Optional Search only the reverse-complementary strand. False
There are two required arguments for SortMeRNA - reference (--ref
) and reads (--reads
).
The reference is a database containing rRNA sequences which helps SortMeRNA determine whether reads are rRNA or not. We’ve already downloaded the rRNA database onto the instance.
The reads are our trimmed FASTQ files. Since our reads are paired, we have to give this argument twice and specify each file separately. The program will take their ‘pairedness’ into account while it is sorting.
We’ll also use some other optional arguments to customise the command and make our output easier to work with later:
--workdir
tells SortMeRNA to put all outputs in a directory of our choice--threads
tells it how many cores/processors to use--aligned
tells it what we want the file containing aligned (i.e. rRNA) reads to be called--other
tells it what we want unaligned (i.e. non-rRNA) reads to be called--fastx
tells it we want our output files in fastq format--paired-out
tells it that if one read in a pair is aligned and the other is not, both reads in the pair should be classed as unaligned
Once again, our command is going to be quite long, so we’ll use \
to separate it into multiple lines. We’ll also use &>
to redirect the terminal outputs to a file, like we did with Cutadapt. This time, though, there’s one more step - telling the command to run in the background.
Running a command in the background
All the commands we have run so far have been in the “foreground”, meaning they’ve been run directly in the terminal window, the prompt disappears and we can’t use the terminal again until the command is finished.
Commands can also be run in the “background” so the prompt is returned before the command is finished and we can continue using our terminal. Commands run in the background are often called “jobs”. A major advantage of running a long job in the background is that you can log out of your instance without killing the process.
SortMeRNA can take a while to run, which is why we’re going to take advantage of this.
If you run a job in the foreground it will stop as soon as you log out of the instance! This could cause a problem if you momentarily have unstable internet or your computer runs out of battery. Running long commands in the background means you are protected from these circumstances and means you can do other things in the terminal in the meantime.
To run a command in the background, you add an ampersand (&
) symbol to the end. When you press enter, your prompt will immediately return, but the code is still running.
Running SortMeRNA
We’re now ready to run our command!
First make sure you’re in your cs_course
folder.
Code
cd ~/cs_course
Then enter the full command:
Code
sortmerna --ref databases/sortmerna_databases/smr_v4.3_default_db.fasta \
--reads data/trimmed_reads/SRR6820491_1.trim.fastq \
--reads data/trimmed_reads/SRR6820491_2.trim.fastq \
--workdir results/sortmerna \
--aligned results/sortmerna/SRR6820491_rrna \
--other results/sortmerna/SRR6820491_mrna \
--threads 8 \
--fastx --paired-out \
&> results/sortmerna/SRR6820491_sortmerna.log &
Once you press enter your prompt should immediately return and you will see a terminal printout below your command that looks something like this:
Code
[1] 274852
The first number (in square brackets) is your ‘job number’. The second number is a ‘process ID (PID)’. These numbers allow you to refer to your job should you need to.
There are many different ways to run jobs in the background in a terminal. How you run these commands will depend on the computing resources (and their fair use policies) you are using. The main options include:
&
, which we’ve covered here. Depending on the infrastructure you’re running the command on, you may also need to usenohup
to prevent the background job from being killed when you close the terminal.
- The command line program
screen
, which allows you to create a shell session that can be completely detached from a terminal and re-attached when needed. - Queuing system - many shared computing resources, like the High Performance Computing (HPC) clusters owned by some Universities, operate a queuing system (e.g. SLURM or SGE) so each user gets their fair share of computing resources. With these you submit your command / job to the queueing system, which will then handle when to run the job on the resources available.
As we’re running the command in the background we no longer see the output on the terminal but we can still check on the progress of the command. There are two ways to do this.
- Using the command
jobs
to view what is running - Examining the log file created by SortMeRNA using
less
Checking progress: jobs
The jobs command is used to list the jobs that you are running in the background and in the foreground. If the prompt is returned with no information no commands are being run.
Code
jobs
Output
[1]+ Running sortmerna --ref databases/sortmerna_databases/smr_v4.3_default_db.fasta --reads data/trimmed_reads/SRR6820491_1.trim.fastq --reads data/trimmed_reads/SRR6820491_2.trim.fastq --workdir results/sortmerna --aligned results/sortmerna/SRR6820491_rrna --other results/sortmerna/SRR6820491_mrna --threads 8 --fastx --paired-out &> results/sortmerna/SRR6820491_sortmerna.log &
The [1]
is the job number. If you need to stop the job running, you can use kill %1
, where 1 is the job number.
Checking progress: the log file
We can also look at the log file we generated to store the command’s terminal outputs. Using less
we can navigate through this file.
Code
cd results/sortmerna
less SRR6820491_sortmerna.log
The contents of the file will depend on how far through the process SortMeRNA is. At the start of an assembly you’ll probably see something like this:
Output
[process:1393] === Options processing starts ... ===
Found value: sortmerna
Found flag: --ref
Found value: databases/sortmerna_database/smr_v4.3_default_db.fasta of previous flag: --ref
Found flag: --reads
Found value: data/trimmed_reads/SRR6820491_1.trim.fastq.gz of previous flag: --reads
Found flag: --reads
Found value: data/trimmed_reads/SRR6820491_2.trim.fastq.gz of previous flag: --reads
Found flag: --workdir
Found value: results/sortmerna/SRR6820491_temp/ of previous flag: --workdir
Found flag: --aligned
Found value: results/sortmerna/SRR6820491_temp/SRR6820491_rrna of previous flag: --aligned
Found flag: --other
Found value: results/sortmerna/SRR6820491_temp/SRR6820491_mrna of previous flag: --other
Found flag: --threads
Found value: 8 of previous flag: --threads
Found flag: --fastx Found value: --paired-out of previous flag: --fastx
The file won’t update until you close and open it again, so you’ll need to keep checking back to see what progress is being made. Different steps in the process take different amounts of time so it might appear stuck. However, it is almost certainly still running if it was run in the background.
less
:
key | action |
---|---|
Space | to go forward |
b | to go backward |
g | to go to the beginning |
G | to go to the end |
q | to quit |
See Lesson 2 - Working with Files and Directories for a full overview on using less
.
SortMeRNA is likely to take about 70 minutes to finish, so feel free to leave this running while you do something else and come back to it later. You don’t need to remain connected to the instance during this time but once you have disconnected from the instance it does mean you can no longer use jobs
to track the job.
Determining if SortMeRNA has finished
After leaving it for an hour or so, SortMeRNA should have finished running.
If you remained connected to the instance during the process you will be able to tell it has finished because you get the following output in your terminal when the command has finished.
Output
[1]+ Done sortmerna --ref databases/sortmerna_databases/smr_v4.3_default_db.fasta --reads data/trimmed_reads/SRR6820491_1.trim.fastq --reads data/trimmed_reads/SRR6820491_2.trim.fastq --workdir results/sortmerna --aligned results/sortmerna/SRR6820491_rrna --other results/sortmerna/SRR6820491_mrna --threads 8 --fastx --paired-out &> results/sortmerna/SRR6820491_sortmerna.log &
This message won’t be displayed if you disconnected from the instance for whatever reason during the assembly process. However, you can still examine the log file in the sortmrna
directory. If the assembly has finished the log file will have …
Move to the sortmerna
directory and use less
to examine the contents of the log file:
Code
cd ~/cs_course/results/sortmerna/
less SRR6820491_sortmerna.log
Navigate to the end of the file using G. You should see something like:
Output
[writeReports:268] === done Reports in 213.494 sec ===
SortMeRNA Outputs
Output
idx kvdb readb SRR6820491_mrna.fq SRR6820491_rrna.fq SRR6820491_rrna.log SRR6820491_sortmerna.log