QC polished assembly
Overview
Teaching: 30 min
Exercises: 40 minQuestions
Why would we quality control (QC) an assembly?
How can we perform QC on an assembly?
What metrics can we compare between assemblies to understand the quality of an assembly?
Objectives
Understand the terms N50, misassembly and largest contig.
Understand what factors might affect the quality of an assembly.
Use the help documentation to work out an appropriate flag for seqkit
Apply seqkit to assess multiple assemblies
Use metaQUAST to identify the quality of an assembly.
Why QC a metagenome assembly?
We now have a polished assembly. The next thing to do is to perform quality control (QC) checks to make sure we can be confident in our results and conclusions about the taxonomic, metabolic or functional composition of the communities that we are studying. This also allows us to check whether our efforts to improve the assembly with polishing were successful.
The quality of metagenome assemblies is typically lower than it is for single genome assemblies due to the use of long reads.
In this episode we explain what is meant by assembly quality and how to examine and compare the quality of your metagenome assemblies using seqkit stats
and MetaQUAST
What do we mean by assembly quality?
There are several variables that affect assembly quality. Let’s go over them in more detail.
Contiguity
A high quality assembly is highly contiguous meaning there are long stretches of the genome that have been successfully pieced together. The opposite of contiguous is fragmented.
Contiguity is strongly correlated with both the technology used and the quality of the original DNA used. “Short read”-only assemblies are often very fragmented as it is much more difficult to assemble the short reads into a contiguous assembly. With long reads it is easier to span bits of a genome that are tricky to reassemble, like repeats. However, some preparation methods, such as bead-beating, result in long-reads which are relatively short.
The extent to which contiguity matters depends on your research question. If you need to identify a large structural difference, high contiguity is important.
Degree of duplication
A duplication is when there is more than one copy of a particular region of the genome in the assembly. If a genome is much bigger than expected then there may have been duplication. Removing duplicated regions from assemblies is difficult but is made easier after the process of binning.
Degree of completeness
The more complete an assembly, the higher quality it is. Sometimes there an regions of the assembly that are unexpectedly missing, meaning the metagenome is incomplete.
Chimeric Contigs
Chimeric contigs are when contigs belonging to different genomes get stuck together as one continuous piece. This is caused during assembly, and can be controlled by the parameters used to generate the assembly. While it is difficult to identify chimeras, it is worth considering the parameters we use for the polishing and assembly steps because inappropriate parameters can result in problems such as these.
Low base quality
Low base quality happens when mutations are present in reads that do not reflect actual biological variation. This happens more in long reads due to a higher error rate. However, this is outweighed by the fact that using long reads for metagenome assemblies results in higher overall quality due to higher contiguity. This is why we ‘polished’ our genome in the last episode by comparing the draft assembly to raw short reads.
Using seqkit to generate summary statistics of an assembly
After finishing the draft assembly we used seqkit stats
to see some basic statistics about the assembly. We will be using it again to get summary statistics for all three of the assemblies (unpolished, long read polished and short read polished). We can then use those statistics to examine the polishing process.
We can review the help documentation for seqkit stats.
cd ~/cs_course/analysis/
seqkit stats --help
seqkit stats help documentation
The N50 length
N50 is a metric indicating the distribution of contig lengths in an assembly. The value of N50 indicates that 50% of the total sequence is in contigs that are the that size or larger. For a more thorough introduction to N50, read What’s N50? By The Molecular Ecologist.
It indicates the average size of the contigs the assembly software has produced. A higher N50 length means that more of the assembly is in longer fragments. That means the chunks of sequence produced by the assembler are, on average, larger.
seqkit stats
has an option to calculate the N50 length. Use the seqkit stats
help documentation to answer the exercise below.
Exercise 1: Flag to get the N50 length
a) Using the help documentation, what flag can we add to get the N50 length for this assembly?
b) What would the new command be if we added this flag?
Bonus exercise: What flag would enable us to save the output table in a tabular (i.e. tsv) format?Solution
Next, run the command on the original draft assembly (~/cs_course/analysis/assembly/assembly.fasta
) to calculate the N50 length and answer the questions below about the output.
Hint: Seqkit stats N50 command
Exercise 2: Calculating the N50 length
a) What is the output if we run the new command from the above exercise?
b) What new statistics do we get that we didn’t have with the original command?
c) What is the N50 length of this assembly?Bonus exercise: Looking at the information available online for Seqkit stats, can you work out what the extra statistics other than N50 tell us?
Solution
Generating statistics for all three assemblies
Instead of passing just one FASTA file to seqkit stats
we can use all three FASTA files at once.
First we need to navigate into the analysis directory.
cd ~/cs_course/analysis/
Within the analysis directory, these three files are:
- Draft assembly generated by Flye in
assembly/assembly.fasta
- Long-read polished assembly generated by Medaka in
medaka/consensus.fasta
- Short-read polished assembly generated by Pilon in
pilon/pilon.fasta
This makes our command:
seqkit stats -a assembly/assembly.fasta medaka/consensus.fasta pilon/pilon.fasta
file | format | type | num_seqs | sum_len | min_len | avg_len | max_len | Q1 | Q2 | Q3 | sum_gap | N50 | Q20(%) | Q30(%) | GC(%) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
assembly/assembly.fasta | FASTA | DNA | 154 | 15042667 | 3164 | 97679.7 | 6068626 | 7305.0 | 13291.0 | 34356.0 | 0 | 2976488 | 0.00 | 0.00 | 52.39 |
medaka/consensus.fasta | FASTA | DNA | 154 | 15062138 | 3142 | 97806.1 | 6074421 | 7166.0 | 13265.5 | 34010.0 | 0 | 2991856 | 0.00 | 0.00 | 52.30 |
pilon/pilon.fasta | FASTA | DNA | 154 | 15070837 | 3144 | 97862.6 | 6074532 | 7175.0 | 13279.5 | 34010.0 | 0 | 2992063 | 0.00 | 0.00 | 52.27 |
Exercise 3: Comparing the Assemblies
Using the seqkit output for all three assemblies, compare the statistics for each of the three assemblies. What has changed across the two rounds of polishing? (From assembly>medaka>pilon)
Solution
We can compare these basic assembly statistics in this way. However, these may not tell the full story as there will also have been changes to the overall sequence (e.g. correcting individual base errors).
Using metaQUAST to further assess assembly quality
We will use MetaQUAST to further evaluate our metagenomic assemblies. MetaQUAST is based on the QUAST genome quality tool but accounts for high species divesity and misassemblies.
MetaQUAST assesses the quality of assemblies using alignments to close references. This means we need to determine which references are appropriate for our data. MetaQUAST can automatically select reference genomes to align the assembly to, but it does not always pick the most appropriate references. As we know what organisms make up our metagenome we will be supplying a file containing the references we want to use instead. If you use MetaQUAST on your own data you could use the default references MetaQUAST selects or provide your own if you have an idea what organisms could be in your dataset.
So far we’ve been able to do a lot of work with files that already exist, but now we need to write our own file using the text editor Nano.
Text editors
Text editors, like nano, “notepad” on Windows or “TextEdit” on Mac are used to edit any plain text files. Plain text files are those that contain only characters, not images or formatting.
We are using nano because it is one of the least complex Unix text editors. However, many programmers use Emacs or Vim (both of which require more time to learn), or a graphical editor such as Gedit.
No matter what editor you use, you will need to know where it searches for and saves files. If you start it from the shell, it will (probably) use your current working directory as its default location.
Writing files
First, we create a directory for the metaQUAST output in the analysis directory.
cd ~/cs_course/analysis/
mkdir metaquast
cd metaquast
To open nano we type the command nano
followed by the name of the text file we want to generate.
nano reference_genomes.txt
You should see something like this:
The text at the bottom of the screen shows the keyboard shortcuts for performing various tasks in nano
. We will talk more about how to interpret this information soon.
When you press enter your terminal should change. You should see a white bar at the top with GNU nano 4.8
and some suggested commands at the bottom of the page.
There should also be a white box which indicates where your cursor is.
Paste the following list of organism names into this file.
Bacillus subtilis
Cryptococcus neoformans
Enterococcus faecalis
Escherichia coli str. K-12 substr. MG1655
Lactobacillus fermentum
Listeria monocytogenes EGD-e
Pseudomonas aeruginosa
Saccharomyces cerevisiae
Salmonella enterica subsp. enterica serovar Typhimurium str. LT2
Staphylococcus aureus
Once we’re happy with our text, we can press Ctrl-O (press the Ctrl or Control key and, while holding it down, press the O key) to write our data to disk. You will the be prompted with File Name to Write: reference_genomes.txt
as we named the file when we first used the command we don’t need to change this name and can press Enter to save the file. Once our file is saved, we can use Ctrl-X to quit the nano
editor and
return to the shell.
Control, Ctrl, or ^ Key
The Control key is also called the “Ctrl” key. There are various ways in which using the Control key may be described. For example, you may see an instruction to press the Ctrl key and, while holding it down, press the X key, described as any of:
Control-X
Control+X
Ctrl-X
Ctrl+X
^X
C-x
In
nano
, along the bottom of the screen you’ll see^G Get Help ^O WriteOut
. This means that you can use Ctrl-G to get help and Ctrl-O to save your file.If you are using a Mac, you might be more familiar with the
Command
key, which is labelled with a ⌘ . But you will often use the theCtrl
key when working in a Terminal.
Copying and pasting in Git bash
Most people will want to use Ctrl+C and Ctrl+V to copy and paste. However in GitBash these shortcuts have other functions. Ctrl+C interrupts the currently running command and Ctrl+V tells the terminal to treat every keystroke as a literal character, so will add shortcuts like Ctrl+C as characters. Instead you can copy and paste in two ways:
- Keyboard: Use Shift and the left/right arrows to select text and press Enter to copy. You can paste the text by pressing Insert.
- Mouse: Left click and drag to highlight text, then right click to copy. Move the cursor to where you want to paste and right click to paste.
You should then be able to see this file when you ls
and view it using less
.
ls
reference_genomes.txt
Once we have our list of reference genomes we can run MetaQUAST on the original assembly and the two polished assemblies.
First we should look at the help documentation to work out which commands are right for us.
metaquast.py -h
MetaQUAST help documentation
From this we can see we need --references-list
to supply our list of reference organisms (this is the file we just made called reference_genomes.txt
), followed by our three assemblies separated by a space.
metaquast.py --references-list reference_genomes.txt ../assembly/assembly.fasta ../medaka/consensus.fasta ../pilon/pilon.fasta
This should take around 5 minutes so we will be leaving it running in the foreground.
Once starting the command you should see something like this and metaQUAST will start downloading the reference species we specified in our files.
Version: 5.2.0
System information:
OS: Linux-3.10.0
Python version: 3.10.5
CPUs number: 4
Started: 2022-09-26 17:26:14
Logging to analysis/metaquast/quast_results/results_2022_09_26_17_26_13/metaquast.log
NOTICE: Maximum number of threads is set to 10 (use --threads option to set it manually)
Contigs:
Pre-processing...
1 ../assembly/assembly.fasta ==> assembly
2 ../medaka/consensus.fasta ==> consensus
3 ../pilon/pilon.fasta ==> pilon
List of references was provided, starting to download reference genomes from NCBI...
2022-09-26 17:26:31
2022-09-26 17:26:31
Trying to download found references from NCBI. Totally 10 organisms to try.
Bacillus_subtilis | successfully downloaded (total 1, 9 more to go)
Cryptococcus_neoformans | successfully downloaded (total 2, 8 more to go)
Enterococcus_faecalis | successfully downloaded (total 3, 7 more to go)
Escherichia_coli_str._K-12_substr._MG1655 | successfully downloaded (total 4, 6 more to go)
Lactobacillus_fermentum | successfully downloaded (total 5, 5 more to go)
Listeria_monocytogenes_EGD-e | successfully downloaded (total 6, 4 more to go)
Pseudomonas_aeruginosa | successfully downloaded (total 7, 3 more to go)
Saccharomyces_cerevisiae | successfully downloaded (total 8, 2 more to go)
Salmonella_enterica_subsp._enterica_serovar_Typhimurium_str._LT2 | successfully downloaded (total 9, 1 more to go)
Staphylococcus_aureus | successfully downloaded (total 10, 0 more to go)
Once MetaQUAST has finished you should see an output like:
MetaQUAST finished.
Log is saved to analysis/metaquast/quast_results/results_2022_09_26_17_26_13/metaquast.log
Finished: 2022-09-26 17:31:00
Elapsed time: 0:04:45.735581
Total NOTICEs: 13; WARNINGs: 1; non-fatal ERRORs: 0
Thank you for using QUAST!
We can now navigate into the quast_results
directory MetaQUAST generated. Within this directory there will be a folder for each time MetaQUAST has been run in this directory. We can navigate then navigate into the results file generated which will be in the format results_YYYY_MM_DD_HH_MM_SS
. (There is also a symbolic link called latest
in this directory that is a shortcut to the newest MetaQUAST run).
cd quast_results/results_YYYY_MM_DD_HH_MM_SS/
ls
combined_reference icarus.html icarus_viewers metaquast.log not_aligned quast_downloaded_references report.html runs_per_reference summary
Once in this directory, we can see that MetaQUAST has generated multiple different files.
If you want to explore all the files you can download this whole directory using scp
, with -r
flag to download all directories and what they contain. This will require ~500MB of space.
However, most of this information is in the report.html
file so we can download only that one instead.
As this is a HTML file we will first need to download it to our local computer in order to open it.
scp -i login-key-instanceNNN.pem csuser@instanceNNN.cloud-span.aws.york.ac.uk:~/cs_course/analysis/metaquast/quast_results/results_YYYY_MM_DD_HH_MM_SS/report.html .
Make sure you replace both the NNN with your particular number and also the directory name of the results file.
The results.html
file relies on some of the other files generated by MetaQUAST so with only the one file you won’t have full functionality but we can still view the information we want.
If you haven’t managed to download the file you can view our example report.html
You should take a few minutes to explore the file before answering the following exercise.
Understanding metaQUAST output
The output is organised into several sections, with a column for each assembly. The worst scoring column is shown in red, the best in blue.
Genome statistics
This section determines the quality of the assembly based on the size of the total assembly. The duplication ratio is the amount of the total assembly that is represented more than once. The more fragmented the assembly, the higher this value will be. See What makes an assembly bad section above to see further details.
Misassemblies
Missassemblies are when pieces in an assembly are overlapped in the incorrect way. In the metaQUAST manual, they define missassemblies in the four following ways (taken from the manual):
misassemblies is the number of positions in the contigs (breakpoints) that satisfy one of the following criteria.
- the left flanking sequence aligns over 1 kbp away from the right flanking sequence on the reference;
- flanking sequences overlap on more than 1 kbp;
- flanking sequences align to different strands or different chromosomes;
- flanking sequences align on different reference genomes (MetaQUAST only)
These can be caused by errors in the assembly, or they can be caused my structural variation in the sample, such as inversions, relocations and translocation.
Mismatches
This is where there are incorrect bases in the contigs making up the assembly. The summary gives you the total number of mismatches per 100kbp of sequence and short insertions and deletions (indels) per 100kbp.
Statistics without reference
The statistics without a reference are based on the full assembly rather than comparing the assembly to the species that you have asked it to compare to. The largest contig is an indicator of how good the overall assembly is. If the largest contig is small, it means that the rest of the assembly is of a poor quality.
There is a detailed description of each of the outputs in the metaQUAST manual which you can use to understand the output more fully.
Comparing assemblies using MetaQUAST
Using the above image how has the iterative polishing from assembly > consensus > pilon improved the assembly?
Answer
Key Points
The N50 is the contig length of the 50th percentile, meaning that 50% of the contigs are at least this length in the assembly
A misassembly is when a portion of the assembly is incorrectly put back together
The largest contig is the longest contiguous piece in the assembly
Seqkit can generate summary statistics that will tell us the N50, largest contig and the number of gaps
metaQUAST can generate additional information in a report which can be used to identify misassemblies