Ribosomal RNA Filtering

Now we have some high quality reads to work with, our next step is to filter our reads. We’ll use these filtered reads in our lesson on functional analysis.

Why filter reads?

We’re going to be looking at the functional capacity of our microbial communities AKA what genes are being expressed in what abundances. That means we’ll really only be interested in mRNA, the RNA that codes for those proteins.

Our samples were treated to remove free nucleotides and tRNAs as part of the extraction process. They were also enriched for mRNAs using an enrichment kit which binds and removes rRNA fragments. But just to make sure there aren’t any rRNA fragments left, we’re going to do some more filtering.

For our lesson on taxonomic analysis we will use the unfiltered reads, since in this case it doesn’t matter whether the reads are mRNA or rRNA.

Filtering with SortMeRNA

We will use a command line tool called SortMeRNA to do this filtering.

SortMeRNA uses a large RNA database and a complex algorithm to decide whether each read is rRNA or not. It sorts reads into two files - one for rRNA, one for “other” (which in our case is most likely to be mRNA).

Making an output directory

The first thing we should do is make a directory to store our outputs from SortMeRNA. We’ll use the same mkdir command we used in the last episode.

First, make sure you’re in the cs_course directory.

Code

cd ~/cs_course

Now we can make a new directory for SortMeRNA outputs inside our existing results folder.

Code

mkdir -p results/sortmerna

Building the SortMeRNA command

We’re ready to start thinking about our command! Let’s have a look at the help documentation for SortMeRNA.

Code

sortmrna -h

Output — SortMeRNA seq help documentation

Output


 Usage:   sortmerna -ref FILE [-ref FILE] -reads FWD_READS [-reads REV_READS] [OPTIONS]:
    [REQUIRED]
    --ref             PATH        Required  Reference file (FASTA) absolute or relative path.
       Use multiple times, once per a reference file

    --reads           PATH        Required  Raw reads file (FASTA/FASTQ/FASTA.GZ/FASTQ.GZ).
       Use twice for files with paired reads.
       The file extensions are Not important. The program automatically
       recognizes the file format as flat/compressed, fasta/fastq

    [COMMON]
    --workdir         PATH        Optional  Workspace directory                         USRDIR/sortmerna/run/
       Default structure: WORKDIR/
                              idx/   (References index)
                              kvdb/  (Key-value storage for alignments)
                              out/   (processing output)
                              readb/ (pre-processed reads/index)


    --kvdb            PATH        Optional  Directory for Key-value database            WORKDIR/kvdb
       KVDB is used for storing the alignment results.

    --idx-dir         PATH        Optional  Directory for storing Reference index.      WORKDIR/idx

    --readb           PATH        Optional  Storage for pre-processed reads             WORKDIR/readb/

       Directory storing the split reads, or the random access index of compressed reads

    --fastx           BOOL        Optional  Output aligned reads into FASTA/FASTQ file
    --sam             BOOL        Optional  Output SAM alignment for aligned reads.

    --SQ              BOOL        Optional  Add SQ tags to the SAM file

    --blast           STR         Optional  output alignments in various Blast-like formats
       Sample values: '0'                    - pairwise
                      '1'                    - tabular (Blast - m 8 format)
                      '1 cigar'              - tabular + column for CIGAR
                      '1 cigar qcov'         - tabular + columns for CIGAR and query coverage
                      '1 cigar qcov qstrand' - tabular + columns for CIGAR, query coverage,
                                               and strand

    --aligned         STR/BOOL    Optional  Aligned reads file prefix [dir/][pfx]       WORKDIR/out/aligned
       Directory and file prefix for aligned output i.e. each
       output file goes into the specified directory with the given prefix.
       The appropriate extension: (fasta|fastq|blast|sam|etc) is automatically added.
       Both 'dir' and 'pfx' are optional.
       The 'dir' can be a relative or an absolute path.
       If 'dir' is not specified, the output is created in the WORKDIR/out/
       If 'pfx' is not specified, the prefix 'aligned' is used
       Examples:
       '-aligned $MYDIR/dir_1/dir_2/1' -> $MYDIR/dir_1/dir_2/1.fasta
       '-aligned dir_1/apfx'           -> $PWD/dir_1/apfx.fasta
       '-aligned dir_1/'               -> $PWD/aligned.fasta
       '-aligned apfx'                 -> $PWD/apfx.fasta
       '-aligned  (no argument)'       -> WORKDIR/out/aligned.fasta

    --other           STR/BOOL    Optional  Non-aligned reads file prefix [dir/][pfx]   WORKDIR/out/other
       Directory and file prefix for non-aligned output i.e. each
       output file goes into the specified directory with the given prefix.
       The appropriate extension: (fasta|fastq|blast|sam|etc) is automatically added.
       Must be used with 'fastx'.
       Both 'dir' and 'pfx' are optional.
       The 'dir' can be a relative or an absolute path.
       If 'dir' is not specified, the output is created in the WORKDIR/out/
       If 'pfx' is not specified, the prefix 'other' is used
       Examples:
       '-other $MYDIR/dir_1/dir_2/1' -> $MYDIR/dir_1/dir_2/1.fasta
       '-other dir_1/apfx'           -> $PWD/dir_1/apfx.fasta
       '-other dir_1/'               -> $PWD/dir_1/other.fasta
       '-other apfx'                 -> $PWD/apfx.fasta
       '-other  (no argument)'       -> aligned_out/other.fasta
                                        i.e. the same output directory
                                        as used for aligned output

    --num_alignments  INT         Optional  Positive integer (INT >=0).

       If used with '-no-best' reports first INT alignments per read reaching
       E-value threshold, which allows to lower the CPU time and memory use.
       Otherwise outputs INT best alignments.
       If INT = 0, all alignments are output

    --no-best         BOOL        Optional  Disable best alignments search                          False
       The 'best' alignment is the highest scoring alignment out of All alignments of a read,
       and the read can potentially be aligned (reaching E-value threshold) to multiple reference
       sequences.
       By default the program searches for best alignments i.e. performs an exhaustive search
       over all references. Using '-no-best' will make the program to search just
       the first N alignments, where N is set using '-num_alignments' i.e. 1 by default.

    --min_lis         INT         Optional  Search only alignments that have the LIS                2
                                            of at least N seeds long
       LIS stands for Longest Increasing Subsequence. It is computed using seeds, which
       are k-mers common to the read and the reference sequence. Sorted sequences of such seeds
       are used to filter the candidate references prior performing the Smith-Waterman alignment.

    --print_all_reads BOOL        Optional  Output null alignment strings for non-aligned reads     False
                                            to SAM and/or BLAST tabular files
    --paired          BOOL        Optional  Flags paired reads                                      False
        If a single reads file is provided, use this option to indicate
        the file contains interleaved paired reads when neither
        'paired_in' | 'paired_out' | 'out2' | 'sout' are specified.


    --paired_in       BOOL        Optional  Flags the paired-end reads as Aligned,                  False
                                            when either of them is Aligned.
        With this option both reads are output into Aligned FASTA/Q file
        Must be used with 'fastx'.
        Mutually exclusive with 'paired_out'.

    --paired_out      BOOL        Optional  Flags the paired-end reads as Non-aligned,              False
                                            when either of them is non-aligned.
        With this option both reads are output into Non-Aligned FASTA/Q file
        Must be used with 'fastx'.
        Mutually exclusive with 'paired_in'.

    --out2            BOOL        Optional  Output paired reads into separate files.                False
       Must be used with 'fastx'.
       If a single reads file is provided, this options implies interleaved paired reads
       When used with 'sout', four (4) output files for aligned reads will be generated:
       'aligned-paired-fwd, aligned-paired-rev, aligned-singleton-fwd, aligned-singleton-rev'.
       If 'other' option is also used, eight (8) output files will be generated.

    --sout            BOOL        Optional  Separate paired and singleton aligned reads.            False
       To be used with 'fastx'.
       If a single reads file is provided, this options implies interleaved paired reads
       Cannot be used with 'paired_in' | 'paired_out'

    --zip-out         STR/BOOL    Optional  Controls the output compression                        '-1'
       By default the report files are produced in the same format as the input i.e.
       if the reads files are compressed (gz), the output is also compressed.
       The default behaviour can be overriden by using '-zip-out'.
       The possible values: '1/true/t/yes/y'
                            '0/false/f/no/n'
                            '-1' (the same format as input - default)
       The values are Not case sensitive i.e. 'Yes, YES, yEs, Y, y' are all OK
       Examples:
       '-reads freads.gz -zip-out n' : generate flat output when the input is compressed
       '-reads freads.flat -zip-out' : compress the output when the input files are flat

    --match           INT         Optional  SW score (positive integer) for a match.                2
    
    --mismatch        INT         Optional  SW penalty (negative integer) for a mismatch.          -3

    --gap_open        INT         Optional  SW penalty (positive integer) for introducing a gap.    5

    --gap_ext         INT         Optional  SW penalty (positive integer) for extending a gap.      2

    -e                DOUBLE      Optional  E-value threshold.                                      1
       Defines the 'statistical significance' of a local alignment.
       Exponentially correllates with the Minimal Alignment score.
       Higher E-values (100, 1000, ...) cause More reads to Pass the alignment threshold

    -F                BOOL        Optional  Search only the forward strand.                         False

    -N                BOOL        Optional  SW penalty for ambiguous letters (N's) scored
                                            as --mismatch
                                            
    -R                BOOL        Optional  Search only the reverse-complementary strand.           False

There are two required arguments for SortMeRNA - reference (--ref) and reads (--reads).

The reference is a database containing rRNA sequences which helps SortMeRNA determine whether reads are rRNA or not. We’ve already downloaded the rRNA database onto the instance.

The reads are our trimmed FASTQ files. Since our reads are paired, we have to give this argument twice and specify each file separately. The program will take their ‘pairedness’ into account while it is sorting.

We’ll also use some other optional arguments to customise the command and make our output easier to work with later:

--workdir tells SortMeRNA to put all outputs in a directory of our choice
--threads tells it how many cores/processors to use
--aligned tells it what we want the file containing aligned (i.e. rRNA) reads to be called
--other tells it what we want unaligned (i.e. non-rRNA) reads to be called
--fastx tells it we want our output files in fastq format
--paired-out tells it that if one read in a pair is aligned and the other is not, both reads in the pair should be classed as unaligned

Once again, our command is going to be quite long, so we’ll use \ to separate it into multiple lines. We’ll also use &> to redirect the terminal outputs to a file, like we did with Cutadapt. This time, though, there’s one more step - telling the command to run in the background.

Running a command in the background

All the commands we have run so far have been in the “foreground”, meaning they’ve been run directly in the terminal window, the prompt disappears and we can’t use the terminal again until the command is finished.

Commands can also be run in the “background” so the prompt is returned before the command is finished and we can continue using our terminal. Commands run in the background are often called “jobs”. A major advantage of running a long job in the background is that you can log out of your instance without killing the process.

SortMeRNA can take a while to run, which is why we’re going to take advantage of this.

Warning

If you run a job in the foreground it will stop as soon as you log out of the instance! This could cause a problem if you momentarily have unstable internet or your computer runs out of battery. Running long commands in the background means you are protected from these circumstances and means you can do other things in the terminal in the meantime.

To run a command in the background, you add an ampersand (&) symbol to the end. When you press enter, your prompt will immediately return, but the code is still running.

Running SortMeRNA

We’re now ready to run our command!

First make sure you’re in your cs_course folder.

Code

cd ~/cs_course

Then enter the full command:

Code

sortmerna --ref databases/sortmerna_databases/smr_v4.3_default_db.fasta \
          --reads data/trimmed_reads/SRR6820491_1.trim.fastq \
          --reads data/trimmed_reads/SRR6820491_2.trim.fastq \
          --workdir results/sortmerna \
          --aligned results/sortmerna/SRR6820491_rrna \
          --other results/sortmerna/SRR6820491_mrna \
          --threads 8 \
          --fastx --paired-out \
          &> results/sortmerna/SRR6820491_sortmerna.log &

Once you press enter your prompt should immediately return and you will see a terminal printout below your command that looks something like this:

Code

[1] 274852

The first number (in square brackets) is your ‘job number’. The second number is a ‘process ID (PID)’. These numbers allow you to refer to your job should you need to.

Running commands on different servers

There are many different ways to run jobs in the background in a terminal. How you run these commands will depend on the computing resources (and their fair use policies) you are using. The main options include:

&, which we’ve covered here. Depending on the infrastructure you’re running the command on, you may also need to use nohup to prevent the background job from being killed when you close the terminal.
The command line program screen, which allows you to create a shell session that can be completely detached from a terminal and re-attached when needed.
Queuing system - many shared computing resources, like the High Performance Computing (HPC) clusters owned by some Universities, operate a queuing system (e.g. SLURM or SGE) so each user gets their fair share of computing resources. With these you submit your command / job to the queueing system, which will then handle when to run the job on the resources available.

As we’re running the command in the background we no longer see the output on the terminal but we can still check on the progress of the command. There are two ways to do this.

Using the command jobs to view what is running
Examining the log file created by SortMeRNA using less

Checking progress: `jobs`

The jobs command is used to list the jobs that you are running in the background and in the foreground. If the prompt is returned with no information no commands are being run.

Code

jobs

Output

[1]+  Running      sortmerna --ref databases/sortmerna_databases/smr_v4.3_default_db.fasta --reads data/trimmed_reads/SRR6820491_1.trim.fastq --reads data/trimmed_reads/SRR6820491_2.trim.fastq --workdir results/sortmerna --aligned results/sortmerna/SRR6820491_rrna --other results/sortmerna/SRR6820491_mrna --threads 8 --fastx --paired-out &> results/sortmerna/SRR6820491_sortmerna.log &

The [1] is the job number. If you need to stop the job running, you can use kill %1, where 1 is the job number.

Checking progress: the log file

We can also look at the log file we generated to store the command’s terminal outputs. Using less we can navigate through this file.

Code

cd results/sortmerna
less SRR6820491_sortmerna.log

The contents of the file will depend on how far through the process SortMeRNA is. At the start of an assembly you’ll probably see something like this:

Output

[process:1393] === Options processing starts ... ===
Found value: sortmerna
Found flag: --ref
Found value: databases/sortmerna_database/smr_v4.3_default_db.fasta of previous flag: --ref
Found flag: --reads
Found value: data/trimmed_reads/SRR6820491_1.trim.fastq.gz of previous flag: --reads
Found flag: --reads
Found value: data/trimmed_reads/SRR6820491_2.trim.fastq.gz of previous flag: --reads
Found flag: --workdir
Found value: results/sortmerna/SRR6820491_temp/ of previous flag: --workdir
Found flag: --aligned
Found value: results/sortmerna/SRR6820491_temp/SRR6820491_rrna of previous flag: --aligned
Found flag: --other
Found value: results/sortmerna/SRR6820491_temp/SRR6820491_mrna of previous flag: --other
Found flag: --threads
Found value: 8 of previous flag: --threads
Found flag: --fastx
Found value: --paired-out of previous flag: --fastx

The file won’t update until you close and open it again, so you’ll need to keep checking back to see what progress is being made. Different steps in the process take different amounts of time so it might appear stuck. However, it is almost certainly still running if it was run in the background.

Navigation commands in less:

key	action
`Space`	to go forward
`b`	to go backward
`g`	to go to the beginning
`G`	to go to the end
`q`	to quit

See Lesson 2 - Working with Files and Directories for a full overview on using less.

SortMeRNA is likely to take about 70 minutes to finish, so feel free to leave this running while you do something else and come back to it later. You don’t need to remain connected to the instance during this time but once you have disconnected from the instance it does mean you can no longer use jobs to track the job.

Determining if SortMeRNA has finished

After leaving it for an hour or so, SortMeRNA should have finished running.

If you remained connected to the instance during the process you will be able to tell it has finished because you get the following output in your terminal when the command has finished.

Output

[1]+  Done      sortmerna --ref databases/sortmerna_databases/smr_v4.3_default_db.fasta --reads data/trimmed_reads/SRR6820491_1.trim.fastq --reads data/trimmed_reads/SRR6820491_2.trim.fastq --workdir results/sortmerna --aligned results/sortmerna/SRR6820491_rrna --other results/sortmerna/SRR6820491_mrna --threads 8 --fastx --paired-out &> results/sortmerna/SRR6820491_sortmerna.log &

This message won’t be displayed if you disconnected from the instance for whatever reason during the assembly process. However, you can still examine the log file in the sortmrna directory. If the assembly has finished the log file will have …

Move to the sortmerna directory and use less to examine the contents of the log file:

Code

cd ~/cs_course/results/sortmerna/
less SRR6820491_sortmerna.log

Navigate to the end of the file using G. You should see something like:

Output

[writeReports:268] === done Reports in 213.494 sec ===

SortMeRNA Outputs

Output

idx  kvdb  readb  SRR6820491_mrna.fq  SRR6820491_rrna.fq  SRR6820491_rrna.log  SRR6820491_sortmerna.log