Quality of Raw Reads
Getting started
The first thing we need to do is log in to our cloud instance.
1. Open the cloudspan folder you created for the course
Open your file manager and navigate to the cloudspan folder (hint: we recommended you make the folder in your Desktop directory - but you might have made it somewhere else). If you cannot find the folder, you can remind yourself where it is stored by looking at the absolute path you wrote down in the previous episode.
The folder should contain the login key file we downloaded in the previous episode and nothing else.
2. Right-click and open your machine’s command line interface
Now we can open the command line.
Windows users:
- Right click anywhere inside the blank space of the file manager, then select Git Bash Here. A new window will open - this is your command line interface, also known as the shell or the terminal. It will automatically open with your cloudspan directory as the working directory.
Mac users, you have two options:
- EITHER: Open Terminal in one window and type cd followed by a space. Do not press enter! Now open Finder in another window. Drag and drop the cloudspan folder from the Finder to the Terminal. You should see the file path leading to your cloudspan folder appear. Now press enter to navigate to the folder.
- OR: Open Terminal and type cd followed by the absolute path that leads to your cloudspan folder. Press enter.
3. Log in to your instance
Code
ssh -i login-key-instanceNNN.pem csuser@instanceNNN.cloud-span.aws.york.ac.uk
Be sure to replace NNN with your own number, twice.
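If you want to check that the login worked, a couple of standard shell commands will confirm who and where you are on the instance:

Code
whoami   # should print: csuser
pwd      # should print: /home/csuser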
Reminder: our file structure
Before we start, here’s a reminder of what our file structure looks like as a hierarchy tree:
Keep this in mind as we continue to navigate the file system, and don’t hesitate to refer back to it if needed.
Before we can start working properly with our data we need to do some quality control to make sure our data is as high quality as possible. There are lots of ways errors could be introduced during the sampling and sequencing process, such as RNA becoming degraded or bases being called incorrectly.
Quality control (QC) has two main steps:
- FastQC to examine the quality of the reads
- Cutadapt to trim off low quality bases and remove adapter sequences
Quality control using FastQC
In previous lessons we had a look at our data files and found they were in FASTQ format, a common format for sequencing data. We used grep to look for 'bad reads' containing more than three consecutive Ns and put these reads into their own text file.
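For reference, the kind of command we used looked something like this (a sketch - the exact pattern and file names from that episode may have differed):

Code
# find reads with four or more consecutive Ns and save the whole FASTQ record;
# -B1 grabs the header line before the match, -A2 the two lines after it
grep -B1 -A2 NNNN SRR6820491_1.fastq > bad_reads.txt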
This could be rather time consuming, and counting Ns is not a very nuanced measure of quality. Luckily, there's a better way! Rather than assessing every read in the raw data by hand, we can use FastQC to visualise the quality of the whole sequencing file.
About FastQC
Rather than looking at quality scores for each individual read, FastQC looks at quality collectively across all reads within a sample. The image below shows one FastQC-generated plot that indicates a very high quality sample:
The x-axis displays the base position (bp) in the read, and the y-axis shows quality scores. In this example, the sample contains reads that are 40 bp long.
Each position has a box-and-whisker plot showing the distribution of quality scores for all reads at that position.
- The horizontal red line indicates the median quality score.
- The yellow box shows the 1st to 3rd quartile range (this means that 50% of reads have a quality score that falls within the range of the yellow box at that position).
- The whiskers show the absolute range, which covers the lowest (0th quartile) to highest (4th quartile) values.
- The plot background is also color-coded to identify good (green), acceptable (yellow), and bad (red) quality scores.
Now let’s take a look at a quality plot on the other end of the spectrum.
Here, we see positions within the read in which the boxes span a much wider range. Also, quality scores drop quite low into the “bad” range, particularly on the tail end of the reads. The FastQC tool produces several other diagnostic plots to assess sample quality, in addition to the one plotted above.
Getting started with FastQC
First, we are going to organise our results by creating a directory to contain the output of all of the results we generate in this course.
The mkdir command can be used to make a new directory. Using the -p flag for mkdir allows it to create a new directory, even if one of the parent directories doesn't already exist. It also suppresses errors if the directory already exists, without overwriting that directory.
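If you want to see the difference -p makes, here is a quick illustration using a throwaway directory name:

Code
mkdir demo/nested      # fails: parent directory 'demo' does not exist yet
mkdir -p demo/nested   # succeeds: creates demo and demo/nested in one step
mkdir -p demo/nested   # succeeds again: no error even though it already exists
rm -r demo             # tidy up the throwaway directory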
Return to your home directory (/home/csuser):
Code
cd
Create the directory results inside cs_course:
Code
mkdir -p cs_course/results
You might want to use ls to check those directories have been made.
We are going to have lots of outputs so it makes sense to organise them with some more subdirectories. Let's make one called qc (for quality control).
Code
mkdir -p cs_course/results/qc
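If you'd like to confirm the structure, ls with the -R flag lists a directory and all of its subdirectories; you should see qc listed inside results.

Code
ls -R cs_course/results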
Now we have created the directories we are ready to start with quality control.
FastQC has been installed on your instance. We can run it with the -h flag to display the help documentation and find out what parameters are available:
Code
fastqc -h
Output
FastQC - A high throughput sequence QC analysis tool
SYNOPSIS
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
[-c contaminant file] seqfile1 .. seqfileN
DESCRIPTION
FastQC reads a set of sequence files and produces from each one a quality
control report consisting of a number of different modules, each one of
which will help to identify a different potential type of problem in your
data.
If no files to process are specified on the command line then the program
will start as an interactive graphical application. If files are provided
on the command line then the program will run with no user interaction
required. In this mode it is suitable for inclusion into a standardised
analysis pipeline.
The options for the program as as follows:
-h --help Print this help file and exit
-v --version Print the version of the program and exit
-o --outdir Create all output files in the specified output directory.
Please note that this directory must exist as the program
will not create it. If this option is not set then the
output file for each sequence file is created in the same
directory as the sequence file which was processed.
--casava Files come from raw casava output. Files in the same sample
group (differing only by the group number) will be analysed
as a set rather than individually. Sequences with the filter
flag set in the header will be excluded from the analysis.
Files must have the same names given to them by casava
(including being gzipped and ending with .gz) otherwise they
won't be grouped together correctly.
--nano Files come from nanopore sequences and are in fast5 format. In
this mode you can pass in directories to process and the program
will take in all fast5 files within those directories and produce
a single output file from the sequences found in all files.
--nofilter If running with --casava then don't remove read flagged by
casava as poor quality when performing the QC analysis.
--extract If set then the zipped output file will be uncompressed in
the same directory after it has been created. By default
this option will be set if fastqc is run in non-interactive
mode.
-j --java Provides the full path to the java binary you want to use to
launch fastqc. If not supplied then java is assumed to be in
your path.
--noextract Do not uncompress the output file after creating it. You
should set this option if you do not wish to uncompress
the output when running in non-interactive mode.
--nogroup Disable grouping of bases for reads >50bp. All reports will
show data for every base in the read. WARNING: Using this
option will cause fastqc to crash and burn if you use it on
really long reads, and your plots may end up a ridiculous size.
You have been warned!
--min_length Sets an artificial lower limit on the length of the sequence
to be shown in the report. As long as you set this to a value
greater or equal to your longest read length then this will be
the sequence length used to create your read groups. This can
be useful for making directly comaparable statistics from
datasets with somewhat variable read lengths.
-f --format Bypasses the normal sequence file format detection and
forces the program to use the specified format. Valid
formats are bam,sam,bam_mapped,sam_mapped and fastq
-t --threads Specifies the number of files which can be processed
simultaneously. Each thread will be allocated 250MB of
memory so you shouldn't run more threads than your
available memory will cope with, and not more than
6 threads on a 32 bit machine
-c Specifies a non-default file which contains the list of
--contaminants contaminants to screen overrepresented sequences against.
The file must contain sets of named contaminants in the
form name[tab]sequence. Lines prefixed with a hash will
be ignored.
-a Specifies a non-default file which contains the list of
--adapters adapter sequences which will be explicity searched against
the library. The file must contain sets of named adapters
in the form name[tab]sequence. Lines prefixed with a hash
will be ignored.
-l Specifies a non-default file which contains a set of criteria
--limits which will be used to determine the warn/error limits for the
various modules. This file can also be used to selectively
remove some modules from the output all together. The format
needs to mirror the default limits.txt file found in the
Configuration folder.
-k --kmers Specifies the length of Kmer to look for in the Kmer content
module. Specified Kmer length must be between 2 and 10. Default
length is 7 if not specified.
-q --quiet Supress all progress messages on stdout and only report errors.
-d --dir Selects a directory to be used for temporary files written when
generating report images. Defaults to system temp directory if
not specified.
BUGS
Any bugs in fastqc should be reported either to simon.andrews@babraham.ac.uk or in www.bioinformatics.babraham.ac.uk/bugzilla/
This documentation tells us that to run FastQC we use the format fastqc seqfile1 seqfile2 .. seqfileN. Running the command will produce some files. By default, these are placed in the working directory from where you ran the command. We will use the -o option to specify a different directory for the output instead.
FastQC can accept multiple file names as input, so we can use the *.fastq wildcard to run FastQC on both of the FASTQ files at the same time.
First, navigate to your cs_course/ directory.
Code
cd ~/cs_course
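If you're not sure what the *.fastq wildcard will match, you can ask the shell to expand it before running anything; echo simply prints the expanded list. You should see both SRR6820491_1.fastq and SRR6820491_2.fastq.

Code
echo data/raw_reads/*.fastq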
Now you can enter the command, using -o to tell FastQC to put its output files into our newly-made qc/ directory. We'll also use the -t flag to speed up the process by telling FastQC to run on 8 threads (processing cores).
You might be wondering why we're running our command from the cs_course directory and not the place where the data is stored (~/cs_course/data/raw_reads/), or where we want our outputs to end up (~/cs_course/results/qc/). The reason is that it's best practice not to run commands from the same folder as your data, in case you accidentally do something which would overwrite your data files. From cs_course we can easily "see" both our data and results directories and refer to them with relative paths.
Over the rest of the course we will be moving all over the file tree to look at files, but we will always return to ~/cs_course to run commands. We encourage you to do the same in your own work!
Code
fastqc data/raw_reads/*.fastq -o results/qc/ -t 8
Press enter and you will see an automatically updating output message telling you the progress of the analysis. It should start like this:
Output
Started analysis of SRR6820491_1.fastq
Approx 5% complete for SRR6820491_1.fastq
Approx 10% complete for SRR6820491_1.fastq
Approx 15% complete for SRR6820491_1.fastq
Approx 20% complete for SRR6820491_1.fastq
Approx 25% complete for SRR6820491_1.fastq
Approx 30% complete for SRR6820491_1.fastq
In total, it should take around ten minutes for FastQC to run on our fastq files (however, this will depend on the size and number of files you give it). When the analysis completes, your prompt will return. So your screen will look something like this:
Output
Approx 75% complete for SRR6820491_2.fastq
Approx 80% complete for SRR6820491_2.fastq
Approx 85% complete for SRR6820491_2.fastq
Approx 90% complete for SRR6820491_2.fastq
Approx 95% complete for SRR6820491_2.fastq
Analysis complete for SRR6820491_2.fastq
$
Looking at FastQC outputs
The FastQC program has created four new files (two for each .fastq file) within our results/qc/ directory. We can see them by listing the contents of the qc folder:
Code
ls results/qc/
Output
SRR6820491_1_fastqc.html SRR6820491_1_fastqc.zip SRR6820491_2_fastqc.html SRR6820491_2_fastqc.zip
For each input FASTQ file, FastQC has created a .zip file and a .html file. The .zip file extension indicates that this is actually a compressed set of multiple output files. A summary report for our data is in the .html file.
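We won't need the .zip files in this lesson, but if you're curious about what's inside, they can be unpacked on the instance (this assumes the unzip tool is available, as it is on most Linux systems):

Code
cd ~/cs_course
unzip results/qc/SRR6820491_1_fastqc.zip -d results/qc/
ls results/qc/SRR6820491_1_fastqc   # plain-text versions of the report, e.g. summary.txt and fastqc_data.txt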
If we were working on our local computers, we’d be able to look at each of these HTML files by opening them in a web browser.
However, these files are currently sitting on our remote AWS instance, where our local computer can't see them. And since we are only logging into the AWS instance via the command line, there is no web browser set up to display these files either.
So the easiest way to look at these webpage summary reports will be to transfer them to our local computers (i.e. your laptop).
To do this we will use the scp command. scp stands for 'secure copy protocol', and is a widely used UNIX tool for moving files between computers. You must run scp in the local terminal on your laptop, not on your AWS instance.
The scp command takes this form:
Code
scp <file I want to copy> <where I want the copy to be placed>
You need to start a second terminal window that is not logged into the cloud instance and ensure you are in your cloudspan directory. This is important because it contains your .pem file, which will allow the scp command access to your AWS instance to copy the file.
- Open your file manager and navigate to the cloudspan folder (which should contain the login key file)
- Open your machine's command line interface:
  - Windows users: right click anywhere inside the blank space of the file manager, then select Git Bash Here.
  - Mac users: open Terminal and type cd followed by the absolute path that leads to your cloudspan folder. Press enter.
- Check that you are in the right folder using pwd
Now use scp to download the file. We need to add in our .pem file, like when we log in, and we'll also use the symbol . (this directory) to tell the scp command where to deposit the downloaded files.
The command will look something like:
Code
scp -i login-key-instanceNNN.pem csuser@instanceNNN.cloud-span.aws.york.ac.uk:~/cs_course/results/qc/*_fastqc.html .
Remember to replace NNN with your instance number.
As the files are downloading you will see output like:
Output
SRR6820491_1_fastqc.html 100% 543KB 1.8MB/s 00:00
SRR6820491_2_fastqc.html  47% 539KB 1.8MB/s 00:00
Once the files have downloaded, use File Explorer (Windows) or Finder (Mac) to find the files and open them - they should open up in your browser.
If you had trouble downloading and viewing the files you can view them here: SRR6820491_1_fastqc.html and SRR6820491_2_fastqc.html
First we will look at the "Per base sequence quality" graphs for SRR6820491_1.fastq and SRR6820491_2.fastq.
[Per base sequence quality plots: SRR6820491_1.fastq (left) and SRR6820491_2.fastq (right)]
The x-axis displays the base position in the read, and the y-axis shows quality scores. The blue line represents the mean quality score at each position, averaged across all reads.
In SRR6820491_1 (the forward reads) the overall quality is fairly high, although it seems to be lower at the start and end of some reads. At the end of the read in particular there appears to be quite a bit of variation.
In SRR6820491_2 (reverse reads) the quality is much more variable though the mean quality remains high. Some reads are dipping into the medium and low quality sections towards the end.
We definitely want to do some trimming to make sure the lower quality bases are removed from our reads. First though, we should also have a look at the “Adapter Content” graph which will show us where adapter sequences occur in the reads.
Adapter sequences are short sequences that are added to the sample during preparation of the DNA library to aid sequencing. They therefore don't tell us anything biologically important and should be removed if they are present in high numbers. They might also be removed for certain applications, such as when the base sequence needs to be particularly accurate.
[Adapter Content plots: SRR6820491_1.fastq (left) and SRR6820491_2.fastq (right)]
These graphs show us that a high proportion of our reads contain adapters, so we’ll need to trim these out too. Helpfully, FastQC has identified that the adapters present are Illumina Universal adapters, so we can easily find out what the sequence to trim out will be.
Trimming reads
So, we’ve looked at our read files and discovered that some of the bases are low quality and a high proportion of them still contain adapters. Now we’ll learn how to use a program called Cutadapt to filter poor quality reads and trim poor quality bases.
Cutadapt Options
Cutadapt has a variety of options to trim your reads. If we run the following command, we can see some of our options.
Code
cutadapt -h
There are many parameters, so we will show just the top of the output, which explains the usage; you should look through the rest and familiarise yourself with the options available. Importantly, the output shows the required parameters and the version of the software you are using. You should always record which versions of software you used, and which parameters you ran them with, in order to make your analysis reproducible.
Output
cutadapt version 3.5
Copyright (C) 2010-2021 Marcel Martin <marcel.martin@scilifelab.se>
cutadapt removes adapter sequences from high-throughput sequencing reads.
Usage:
cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
For paired-end reads: cutadapt -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq in1.fastq in2.fastq
This output shows us that the usage differs for single-end and paired-end reads. For single-end data we need to use:
- -a to specify the sequence of the adapter to be removed
- -o to specify the name of the trimmed output
- the name of the raw sequence file we want to be trimmed
For paired-end data, Cutadapt expects:
- two adapter sequences for trimming, specified by -a and -A
- two (different) names for the trimmed outputs, specified by -o and -p
- the two files to be trimmed
There are lots of other options to add too, but these are the compulsory ones.
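Putting the compulsory pieces together, a minimal paired-end call has this shape (everything in capitals or lower-case placeholders mirrors the usage line from the help text above; these are not real file names):

Code
cutadapt -a ADAPT1 -A ADAPT2 -o out1.trim.fastq -p out2.trim.fastq in1.fastq in2.fastq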
The first input file typically contains _1 or _R1 in its name; the second input file typically contains _2 or _R2.
For more information about the arguments used, see the Cutadapt documentation.
Running Cutadapt
To begin, navigate to your cs_course directory:
Code
cd ~/cs_course/
We need to make a new directory to keep our trimmed reads in. We'll put it inside the data directory, on the same level as our existing raw_reads directory.
Code
mkdir data/trimmed_reads
We should also make a directory to keep our Cutadapt log file in, in case we need to look back at it in the future. We'll put that inside results/.
Code
mkdir results/cutadapt
Now we can start constructing our command. Here are the flags and options we’ll be using:
flag/option | meaning
---|---
-q 20 | Bases need to have a quality (phred) score of at least 20 to be maintained
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA | The adapter sequence to trim from the forward reads, based on Illumina's Universal Adapters
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT | The sequence to trim from the reverse reads, also based on Illumina's Universal Adapters
-o <FILEPATH> | The name of the output file for the trimmed forward reads
-p <FILEPATH> | The name of the output file for the trimmed reverse reads
-m 150 | Remove reads shorter than 150 bp after trimming has occurred
-j 8 | Run the process on 8 threads (to speed up the process)
By default, Cutadapt will print updates on its progress to the console (like we saw with FastQC). It will also print a summary of how many reads were trimmed or kept, and other useful information about its process. Ideally we want to keep a copy of this summary so we can refer to it in future.
We'll therefore add an instruction to the end of the command telling Cutadapt to 'redirect' its console output to a file. The way to do this is &> <filepath>. This will ensure we keep a good record of what we've been doing. Be careful though! If there is an error in your command you'll have to look inside the log to find out what it is, as it won't be printed to the console like it usually would.
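Once you have run the full command (shown below), a quick way to check that it succeeded without opening the whole log is to inspect the exit status, or to peek at the last few lines of the log with tail:

Code
echo $?                                      # exit status of the last command; 0 means it finished without error
tail results/cutadapt/SRR6820491_fastq.log   # the last few lines of the log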
This command is quite long! When typing a long command into your terminal, you can use the \ character to separate code chunks onto separate lines. This can make your code more readable.
Here’s what our final command looks like:
Code
cutadapt -q 20 \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA \
-A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT \
-o data/trimmed_reads/SRR6820491_1.trim.fastq \
-p data/trimmed_reads/SRR6820491_2.trim.fastq \
-m 150 \
-j 8 \
data/raw_reads/SRR6820491_1.fastq \
data/raw_reads/SRR6820491_2.fastq &> results/cutadapt/SRR6820491_fastq.log
The adapter sequences we provide are very long, so you may want to copy and paste them from this page into your command. However, you cannot paste in the shell using Ctrl + V like you usually can. Instead you have two options:
- right click and select paste
- hover the mouse pointer over the terminal window and press the middle mouse button
The same is true for copying in the shell - Ctrl + C will not work. Instead, highlight whatever you want to copy, then right click and select copy.
You can copy and paste the full command if you wish, but we recommend typing it out yourself to help you remember what each of the flags does.
When you have submitted the command, your prompt will disappear for a couple of minutes. When it comes back, your command has finished running.
Let’s open up our log file and see the summary.
Code
less results/cutadapt/SRR6820491_fastq.log
Output
This is cutadapt 3.5 with Python 3.8.10
Command line parameters: -q 20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o data/trimmed_reads/SRR6820491_1.trim.fastq -p data/trimmed_reads/SRR6820491_2.trim.fastq -m 150 -j 8 data/raw_reads/SRR6820491_1.fastq data/raw_reads/SRR6820491_2.fastq
Processing reads on 8 cores in paired-end mode ...
Finished in 70.26 s (5 us/read; 12.04 M reads/minute).
=== Summary ===
Total read pairs processed: 14,099,347
Read 1 with adapter: 4,874,180 (34.6%)
Read 2 with adapter: 4,716,762 (33.5%)
Pairs that were too short: 5,913,632 (41.9%)
Pairs written (passing filters): 8,185,715 (58.1%)
Total basepairs processed: 4,258,002,794 bp
Read 1: 2,129,001,397 bp
Read 2: 2,129,001,397 bp
Quality-trimmed: 46,148,219 bp (1.1%)
Read 1: 11,786,248 bp
Read 2: 34,361,971 bp
Total written (filtered): 2,470,608,528 bp (58.0%)
Read 1: 1,235,692,634 bp
Read 2: 1,234,915,894 bp
Use the output from your Cutadapt command to answer the following questions.
1. How many reads had adapters in the R1 reads?
2. What total percent of reads were trimmed/filtered?
Solution
1. 4,874,180 (34.6%)
2. 58%
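If you'd rather pull these numbers straight out of the log than scroll through it, grep can do the work (the quoted strings match lines in the summary above; run this from ~/cs_course):

Code
grep "with adapter" results/cutadapt/SRR6820491_fastq.log
grep "Pairs written" results/cutadapt/SRR6820491_fastq.log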
We can confirm that we have our output files:
Code
cd data/trimmed_reads
ls
Output
SRR6820491_1.trim.fastq SRR6820491_2.trim.fastq
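As a final sanity check you can count the reads in one of the trimmed files directly: each FASTQ record is four lines, so the line count divided by four gives the number of reads.

Code
echo $(( $(wc -l < SRR6820491_1.trim.fastq) / 4 ))   # should match the "Pairs written" count in the Cutadapt summary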
Great! Now we have some high quality samples to work with as we move onto our next step: filtering out ribosomal RNA.
Now that our samples have gone through quality control, they should perform better on the quality tests run by FastQC. Go ahead and re-run FastQC on your trimmed FASTQ files and visualise the HTML files to see whether your per base sequence quality is higher after trimming.
Start by making a new folder inside your cutadapt directory to keep the FastQC outputs in.
Code
cd ~/cs_course
mkdir results/cutadapt/trimmed_fastqc
Now you can run FastQC on your trimmed reads.
Code
fastqc data/trimmed_reads/*.fastq -o results/cutadapt/trimmed_fastqc -t 8
In the terminal on your local machine (not the one logged into the AWS instance) use SCP to download the files.
Code
scp -i login-key-instanceNNN.pem csuser@instanceNNN.cloud-span.aws.york.ac.uk:/home/csuser/cs_course/results/cutadapt/trimmed_fastqc/*.html .
Then take a look at the html files in your browser.
After trimming and filtering, our overall quality is higher with no quality scores dipping into the red (low quality) zone. We now have a distribution of sequence lengths, and most of the sequences contain no adapters. Some of the metrics are still marked as “failed” (have a red cross next to them) but none that are relevant to our workflow. Cutadapt has worked well!