Cloud-SPAN logo.
  • Home
  • Precourse Instructions
  • About
  1. Polishing
  2. Polishing an assembly
  • Files and Directories
    • Understanding your file system
    • Logging onto the Cloud
    • Introducing the Shell
  • Using the Command Line
    • Navigating Files and Directories
    • Working with Files and Directories
    • Redirection
  • QC & Assembly
    • Introduction to Metagenomics
    • Assessing Read Quality, Trimming and Filtering
    • Metagenome Assembly
  • Polishing
    • Polishing an assembly
    • QC polished assembly
  • Binning & Functional Annotation
    • Metagenome Binning
    • QC of metagenome bins
    • Functional annotation
  • Taxonomic Annotations
    • Taxonomic Assignment
    • Diversity Tackled With R
    • Taxonomic Analysis with R
  • Extras
    • Data
    • Glossary
    • Workflow Reference

On this page

  • Polishing with short reads
  • Why bother polishing?
  • Polishing an assembly with long reads
  • Polishing with short reads
    • Generating the Pilon input files
    • Running Pilon
  • Edit this page
  • Report an issue
  1. Polishing
  2. Polishing an assembly

Polishing an assembly

In the previous episode we generated a draft assembly using Flye from our long read Nanopore data.

Long reads can span regions which make them difficult to assemble with short reads such as regions with large repeats. Despite this, some long reads will be misassembled. In addition, the base accuracy of long reads is lower than that of short reads and some bases will be incorrectly assigned. Consequently it is common to correct these errors by “polishing” an assembly. We will use two polishing strategies:

  1. Polishing with long reads using Medaka. This maps the raw long reads to the assembly to identify contigs that have been incorrectly joined.
  2. Polishing with short reads using Pilon which uses the more accurate short read data to correct incorrectly called bases in the assembly. More detail on the advantages and disadvantages of short and long read sequencing is covered in our Statistically useful experimental design workshop in Platform choice.

Polishing with short reads

Diagram showing overlap of reads for polishing.

Diagram showing overlap of reads for polishing.

In the diagram, the long read assembly is shown at the top. The four short reads shown below the assembly have been aligned to it. The third position in the assembly is A but the three short reads that cover this region contain a T. The assembly probably contains a miscalled base but it can be corrected - or polished - at this position with the higher accuracy short read information. Typically short read polishing would be carried out three times with the base accuracy increasing each time. However, to reduce the compute requirements and the time required to finish the assembly, we will be performing it just once.

Why bother polishing?

How important polishing is to your analysis will depend on what you need it for. Usually we generate metagenome assemblies so we can compare the sequences to a database and find out what taxa they belong to.

You might NOT need to polish your assembly if:

  • you only need to taxa to the genus level (meaning single incorrect bases are not important)

You DO need to polish your assembly if:

  • you want to identify taxa to the species level (if possible). This is a common requirement since one of the main advantages of whole genome sequencing over amplicon sequencing is that you can assign annotations to the species level. We will cover Taxonomic annotations later in the course.
  • you want to generate protein predictions or identify protein structure domains to determine the functionality of metagenomes. This is discussed in more detail in Watson and Warr (2019): Errors in long-read assemblies can critically affect protein prediction.

Polishing an assembly with long reads

First we will polish the draft Flye assembly using the filtered raw long reads. As with the assembly, we need to use polishing software that is especially written for long read raw reads.

Medaka is a command line tool built by Oxford Nanopore Technologies which will polish an assembly by generating a consensus from raw Nanopore sequences using a recurrent neural network.

We will be using one Medaka command, medaka_consensus. This pipeline will first align the raw reads to the draft assembly, then process this alignment to generate a pileup. The pileup is presented to a recurrent neural network in order to produce a consensus sequence.

Medaka is installed on the AWS instance. Let’s take a look at the help page for medaka_consensus:

Code
medaka_consensus -h
Output --- medaka_consensus Help
medaka 1.7.2

Assembly polishing via neural networks. Medaka is optimized
to work with the Flye assembler.

medaka_consensus [-h] -i <fastx> -d <fasta>

     -h  show this help text.
     -i  fastx input basecalls (required).
     -d  fasta input assembly (required).
     -o  output folder (default: medaka).
     -g  don't fill gaps in consensus with draft sequence.
     -r  use gap-filling character instead of draft sequence (default: None)
     -m  medaka model, (default: r941_min_hac_g507).
         Choices: r103_fast_g507 r103_hac_g507 r103_min_high_g345 r103_min_high_g360 r103_prom_high_g360 r103_sup_g507 r1041_e82_400bps_fast_g615 r1041_e82_400bps_hac_g615 r1041_e82_400bps_sup_g615 r104_e81_fast_g5015 r104_e81_hac_g5015 r104_e81_sup_g5015 r104_e81_sup_g610 r10_min_high_g303 r10_min_high_g340 r941_e81_fast_g514 r941_e81_hac_g514 r941_e81_sup_g514 r941_min_fast_g303 r941_min_fast_g507 r941_min_hac_g507 r941_min_high_g303 r941_min_high_g330 r941_min_high_g340_rle r941_min_high_g344 r941_min_high_g351 r941_min_high_g360 r941_min_sup_g507 r941_prom_fast_g303 r941_prom_fast_g507 r941_prom_hac_g507 r941_prom_high_g303 r941_prom_high_g330 r941_prom_high_g344 r941_prom_high_g360 r941_prom_high_g4011 r941_prom_sup_g507 r941_sup_plant_g610
         Alternatively a .tar.gz/.hdf file from 'medaka train'.
     -f  Force overwrite of outputs (default will reuse existing outputs).
     -x  Force recreation of alignment index.
     -t  number of threads with which to create features (default: 1).
     -b  batchsize, controls memory use (default: 100).

To use Medaka we need to specify certain parameters in the command, like we did when we ran Flye last session. The help page tells us that the basic format for medaka is medaka_consensus [-h] -i <fastx> -d <fasta>1, indicating that the -i and -d flags are mandatory.

Let’s have a look at the flags and options we’re going to use:

Flag/option Meaning Our input
-i Input basecalls (i.e. what we are polishing with) -i data/nano_fastq/ERR5000342_sub12_filtered
-d Input assembly (i.e. what is being polished) -d results/assembly/assembly.fasta
-m Neural network model to use (described in the documentation) -m r941_min_hac_g507
-o Output directory -o results/medaka
-t Number of threads -t 8

Once again we will run this command in the background and redirect the output to a file. This means we add &> to redirect the output and & to the very end to make it run in the background.

Before we start let’s make sure we’re in the cs_course folder and create a directory called medaka for our output in the results folder.

Code
cd ~/cs_course
mkdir results/medaka

Now it’s time to run Medaka!

Code
medaka_consensus \
-i data/nano_fastq/ERR5000342_sub12_filtered.fastq \
-d results/assembly/assembly.fasta \
-m r941_min_hac_g507 \
-o results/medaka \
-t 8 \
&> results/medaka/medaka.out &

We can check the command is running using jobs:

Code
jobs

If it is successfully running you should see an output like:

Output
[1]+  Running                 medaka_consensus -i data/nano_fastq/ERR3152367_sub5_filtered.fastq -d results/assembly/assembly.fasta -m r941_min_hac_g507 -o results/medaka -t 8 &> results/medaka/medaka.out &

We can also look in the output file (medaka.out) to check the progress of the command. In this case Medaka will take about three hours to run.

Code
less results/medaka/medaka.out

If the Medaka command has been run correctly you will see something like this near the start of the output:

Output
Checking program versions
This is medaka 1.7.2
This is medaka 1.7.2
Program    Version            Required   Pass
bcftools   1.16-20-g6f4732b   1.11       True
bgzip      1.16-16-g3c6f83f   1.11       True
minimap2   2.24               2.11       True
samtools   1.16.1-33-gbd942a0 1.11       True
tabix      1.16-16-g3c6f83f   1.11       True
Aligning basecalls to draft
Constructing minimap index.
[M::mm_idx_gen::0.515*0.99] collected minimizers
[M::mm_idx_gen::0.648*1.40] sorted minimizers
[M::main::0.877*1.29] loaded/built the index for 146 target sequence(s)
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 146
[M::mm_idx_stat::0.910*1.28] distinct minimizers: 2598367 (94.68% are singletons); average occurrences: 1.076; average spacing: 5.350; total length: 14953273
Help!

Medaka may give you a warning along the lines of:

Output
2023-03-31 12:22:21.572954: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-03-31 12:22:21.573012: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2023-03-31 12:22:24.690162: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Couldnot load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2023-03-31 12:22:24.690198: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.

Don’t worry, you can safely ignore this warning wherever it pops up. It is telling us that it couldn’t load a library required for parallel computing using GPUs. We are not using a GPU setup and so this warning is irrelevant.

Help!

If you have entered the command incorrectly you will usually find out quite quickly! An easy way to tell is if you get an output in your terminal that starts with ‘Exit’ rather than ‘Done’:

Output
[1]+  Exit 1                  medaka_consensus -i data/nano_fastq/ERR5000342_sub12_filtered.fastq -d results/assembly/assembly.fasta -m r941_min_hac_g507 -o results/medaka -t 8 & results/medaka/medaka.out

Another way to tell is by looking inside the medaka.out log file. You will get some of the usual text output as Medaka loads its dependencies but once it starts processing the command it will tell you that something is wrong, e.g.

Output
Creating fai index file /home/csuser/cs_course/results/assembly/assembly.fasta.fai
[E::fai_build3_core] Failed to open the file /home/csuser/cs_course/results/assembly/assembly.fasta
[faidx] Could not build fai index /home/csuser/cs_course/results/assembly/assembly.fasta.fai
Failed to run alignment of reads to draft.

Things to check if your command fails:

  • spelling/typos — have you spelled the command correctly and typed the directory names without error?
  • paths — are all of your absolute/relative paths complete and accurate? (tip: using tab completion can help with peace of mind here, as you can check that your path definitely follows the right trail)
  • flags — have you included all the mandatory flags, including the one (-) or two (--) dashes that usually accompany them?
  • output — does the output in your log/.out file give you any clues as to where your error lies? e.g. in the example above the line “Failed to open the file /home/csuser/cs_course/results/assembly/assembly.fasta” suggests that this path might be wrong somehow.

If you really can’t find your mistake then the your instructors and/or other participants will be able to help you sort it out, so there is no need to panic!

Medaka first looks for the other programs that it needs (known as dependencies) and their versions. These dependencies are installed on the AWS instance. Once it confirms they are present, it begins by aligning the raw reads (basecalls) to the assembly using minimap.

Once Medaka has completed the end of the file (which you can skip to by typing G) will contain something like:

Code
less results/medaka/medaka.out
Output
[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files.
[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files.
[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files.
[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files.
[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files.
[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files.
[20:51:16 - DataIndx] Loaded 1/1 (100.00%) sample files.
[20:51:17 - DataIndx] Loaded 1/1 (100.00%) sample files.
[20:51:17 - DataIndx] Loaded 1/1 (100.00%) sample files.
[20:51:17 - DataIndx] Loaded 1/1 (100.00%) sample files.
Polished assembly written to medaka/consensus.fasta, have a nice day.

Once Medaka has completed (~ 3 hours on average) we can take a look at the files it has generated.

Code
cd results/medaka
ls
Output
calls_to_draft.bam  calls_to_draft.bam.bai  consensus.fasta  consensus.fasta.gaps_in_draft_coords.bed  consensus_probs.hdf

Medaka has created multiple files:

  • calls_to_draft.bam - a BAM file containing the alignment of the raw reads (basecalls) to the draft assembly
  • calls_to_draft.bam.bai — an index file of the above BAM file
  • consensus.fasta — the consensus sequence, or polished assembly in our case in FASTA format
  • consensus.fasta.gaps_in_draft_coords.bed — a BED file containing information about the location of any gaps in the consensus sequence which can be used when visualising the assembly
  • consensus_probs.hdf — a file that contains the output of the neural network calculations and is not an output for end-users, so we don’t need to worry about this file

In our case we’re interested in the polished assembly, so we want the consensus.fasta file.

BAM and SAM Files

A SAM file, is a tab-delimited text file that contains information for each individual read and its alignment to the genome. The paper by Heng Li et al. provides the full specification.

The compressed binary version of SAM is called a BAM file. We use this version to reduce size and to allow for indexing, which enables efficient random access of the data contained within the file.

The file begins with a header, which is optional. The header describes the source of data, reference sequence, method of alignment etc. - these will change depending on the aligner being used. Following the header is the alignment section. Each line that follows corresponds to alignment information for a single read. There are 11 mandatory fields for essential mapping information and a variable number of other fields for aligner specific information.

See Genomics - Variant Calling for a deeper dive.

Polishing with short reads

We will be using the program Pilon to further polish the draft assembly using the raw short reads. Pilon will improve a draft assembly by filling gaps, fixing misassemblies and correcting bases. You can read more about how it works in the paper Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement.

Bioinformatics programs are not built equally. Some programs, like Flye or Medaka, require very few input files as they will generate any that they need within the pipeline. Some programs however, require a lot of user input to generate the input files that are needed.

Pilon is in the latter group of bioinformatics software, so we will need to do some pre-processing using other programs to create some of the inputs needed.

Generating the Pilon input files

We will first use the program BWA to generate an alignment of the raw short reads against the draft genome in consensus.fasta. The steps will be:

  1. indexing the polished assembly, consensus.fasta with bwa index. Indexing allows the aligner to quickly find potential alignment sites for query sequences in a genome, which saves time during alignment.
  2. creating a directory for the outputs of Pilon
  3. aligning the short reads (the illumina data) to the assembly, consensus.fasta with bwa mem
  4. converting the short read alignment to a BAM file samtools view
  5. sorting the short read alignment with samtools sort
  6. indexing the short read alignment with samtools index

Make sure you are in the cs_course folder and use bwa index to index the consensus assembly:

Code
cd ~/cs_course
bwa index results/medaka/consensus.fasta

This should only take a few seconds to complete so we don’t need to run the job in the background. Once the indexing is complete you should see an output like:

Output
[bwa_index] Pack FASTA... 0.51 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 5.86 seconds elapse.
[bwa_index] Update BWT... 0.10 sec
[bwa_index] Pack forward-only FASTA... 0.10 sec
[bwa_index] Construct SA from BWT and Occ... 1.81 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index consensus.fasta
[main] Real time: 8.704 sec; CPU: 8.395 sec

This will generate five additional files in the medaka directory with the file extensions .pac, .bwt, .ann, .amb and .sa. These files are used by BWA in step 3.

Next we will make a directory in the results directory for our pilon polishing outputs:

Code
mkdir results/pilon

We will now do steps 3, 4 and 5 in one go by chaining them together with pipes.

Chaining together commands with a pipe

It is possible to chain together commands in Linux (and Unix) using a command known as “pipe”. This allows the output from one command to be directly passed as input to other command without the need for using intermediate files. This is useful when the intermediate file is not needed and keeps your workspace tidy. The pipe command is the character | which is obtained with ⇧ Shift + \ on most keyboards.

You can use multiple pipes in a single line of commands but data will only go from the left to the right:

command1 | command2 | command3 | .... |

We will be using two pipes to join three separate steps. First we will align the raw reads to the draft assembly, then convert the output to BAM format, before finally sorting this alignment to generate a sorted BAM file. Chaining the steps together together will only generate one final output file, avoiding the need to generate large intermediary files we don’t need again between the other two steps.

  1. we will align the short reads (the illumina data) to the assembly, consensus.fasta with bwa mem: bwa mem -t 8 results/medaka/consensus.fasta data/illumina_fastq/ERR4998593_1.fastq data/illumina_fastq/ERR4998593_2.fastq
  2. convert the short read alignment alignment to a BAM file with samtools view: samtools view - -Sb
  3. sort the short read alignment with samtools sort: samtools sort - -@4 -o pilon/short_read_alignment.bam

Here are the various flags/options used in these commands and what they mean:

Command Flag/option Meaning
bwa mem -t 8 [input assembly] [input short read file(s)] -t 8 Number of threads (8)
samtools view - -Sb - Take piped output from bwa mem as input
-Sb Convert from SAM to BAM format
samtools sort - -@8 -o [filename] - Take piped output from samtools view as input
-@8 Number of threads (8)
-o [filename] Output a file with name [filename]

This will take around 60 minutes so we will use redirection and & to print the output to a file and run the command in the background (just like in the previous lesson). We will also wrap our whole three-fold command in brackets so we run all three steps in the background.

Add the pipes between these commands and run:

Code
(bwa mem -t 8 results/medaka/consensus.fasta data/illumina_fastq/ERR4998593_1.fastq data/illumina_fastq/ERR4998593_2.fastq | samtools view - -Sb | samtools sort - -@8 -o results/pilon/short_read_alignment.bam) &> results/pilon/alignment.out &

Once the command is running, you can check the process of this job by looking at the alignment.out file.

Code
less results/pilon/alignment.out
Output
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 529802 sequences (80000102 bp)...
[M::process] read 529802 sequences (80000102 bp)...
[M::mem_pestat] # candidate unique pairs for (FF, FR, RF, RR): (2, 14035, 29, 8)
[M::mem_pestat] skip orientation FF as there are not enough pairs
[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (333, 416, 538)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 948)
[M::mem_pestat] mean and std.dev: (435.53, 162.74)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 1153)
[M::mem_pestat] analyzing insert size distribution for orientation RF...
[M::mem_pestat] (25, 50, 75) percentile: (891, 2304, 6802)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 18624)
[M::mem_pestat] mean and std.dev: (3479.45, 2918.28)

The command should take around 45 minutes to run. Once completed, the end of the alignment.out file should contain something like:

Output
low and high boundaries for proper pairs: (1, 1222)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_process_seqs] Processed 182762 reads in 37.961 CPU sec, 5.397 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 8 ~/cs_course/results/medaka/consensus.fasta ~/cs_course/data/illumina_fastq/ERR4998593_1.fastq ~/cs_course/data/illumina_fastq/ERR4998593_2.fastq
[main] Real time: 2207.540 sec; CPU: 14851.383 sec
[bam_sort_core] merging from 7 files and 4 in-memory blocks...

We have now generated the short_read_alignment.bam file - this is a binary file (meaning it’s not human readable) so we won’t be checking its contents.

Now carry out step 6, index the alignment:

Code
samtools index results/pilon/short_read_alignment.bam
Something to think about

Why didn’t we include this command in the sequence of pipes in the previous step?

The answer is that we will need access to the BAM file produced for further analysis. If we included this step as part of a pipe the intermediate BAM file would not be saved.

This command will take around one or two minutes so we don’t need to run it in the background.

Once your prompt has returned you should also have a file named short_read_alignment.bam.bai in your pilon directory, which is the index.

Running Pilon

Now we have generated the necessary input files we can finally run Pilon.

Pilon is installed on the AWS instance and we can view the help documentation using:

Code
pilon --help
Output — Pilon help Documentation
Pilon version 1.24 Thu Jan 28 13:00:45 2021 -0500

   Usage: pilon --genome genome.fasta [--frags frags.bam] [--jumps jumps.bam] [--unpaired 
                unpaired.bam]  [...other options...]
          pilon --help for option details


         INPUTS:
           --genome genome.fasta
              The input genome we are trying to improve, which must be the reference used
              for the bam alignments.  At least one of --frags or --jumps must also be given.
           --frags frags.bam
              A bam file consisting of fragment paired-end alignments, aligned to the --genome
              argument using bwa or bowtie2.  This argument may be specified more than once.
           --jumps jumps.bam
              A bam file consisting of jump (mate pair) paired-end alignments, aligned to the
              --genome argument using bwa or bowtie2.  This argument may be specified more than once.
           --unpaired unpaired.bam
              A bam file consisting of unpaired alignments, aligned to the --genome argument
              using bwa or bowtie2.  This argument may be specified more than once.
           --bam any.bam
              A bam file of unknown type; Pilon will scan it and attempt to classify it as one
              of the above bam types.
           --nanopore ont.bam
              A bam file containing Oxford Nanopore read alignments. Experimental.
           --pacbio pb.bam
              A bam file containing Pacific Biosciences read alignments. Experimental.
         OUTPUTS:
           --output prefix
              Prefix for output files
           --outdir directory
              Use this directory for all output files.
           --changes
              If specified, a file listing changes in the <output>.fasta will be generated.
           --vcf
              If specified, a vcf file will be generated
           --vcfqe
               If specified, the VCF will contain a QE (quality-weighted evidence) field rather
               than the default QP (quality-weighted percentage of evidence) field.
           --tracks
               This options will cause many track files (*.bed, *.wig) suitable for viewing in
               a genome browser to be written.
         CONTROL:
           --variant
              Sets up heuristics for variant calling, as opposed to assembly improvement;
              equivalent to "--vcf --fix all,breaks".
           --chunksize
              Input FASTA elements larger than this will be processed in smaller pieces not to
              exceed this size (default 10000000).
           --diploid
              Sample is from diploid organism; will eventually affect calling of heterozygous SNPs
           --fix fixlist
              A comma-separated list of categories of issues to try to fix:
                "snps": try to fix individual base errors;
                "indels": try to fix small indels;
                "gaps": try to fill gaps;
                "local": try to detect and fix local misassemblies;
                "all": all of the above (default);
                "bases": shorthand for "snps" and "indels" (for back compatibility);
                "none": none of the above; new fasta file will not be written.
              The following are experimental fix types:
                "amb": fix ambiguous bases in fasta output (to most likely alternative);
                "breaks": allow local reassembly to open new gaps (with "local");
                "circles": try to close circular elements when used with long corrected reads;
                "novel": assemble novel sequence from unaligned non-jump reads.
           --dumpreads
              Dump reads for local re-assemblies.
           --duplicates
              Use reads marked as duplicates in the input BAMs (ignored by default).
           --iupac
              Output IUPAC ambiguous base codes in the output FASTA file when appropriate.
           --nonpf
              Use reads which failed sequencer quality filtering (ignored by default).
           --targets targetlist
              Only process the specified target(s).  Targets are comma-separated, and each target
              is a fasta element name optionally followed by a base range.
              Example: "scaffold00001,scaffold00002:10000-20000" would result in processing all of 
              scaffold00001 and coordinates 10000-20000 of scaffold00002.
              If "targetlist" is the name of a file, each line will be treated as a target
              specification.
           --verbose
              More verbose output.
           --debug
              Debugging output (implies verbose).
           --version
              Print version string and exit.
         HEURISTICS:
           --defaultqual qual
              Assumes bases are of this quality if quals are no present in input BAMs (default 10).
           --flank nbases
              Controls how much of the well-aligned reads will be used; this many bases at each
              end of the good reads will be ignored (default 10).
           --gapmargin
              Closed gaps must be within this number of bases of true size to be closed (100000)
           --K
              Kmer size used by internal assembler (default 47).
           --mindepth depth
              Variants (snps and indels) will only be called if there is coverage of good pairs
              at this depth or more; if this value is >= 1, it is an absolute depth, if it is a
              fraction < 1, then minimum depth is computed by multiplying this value by the mean
              coverage for the region, with a minumum value of 5 (default 0.1: min depth to call
              is 10% of mean coverage or 5, whichever is greater).
           --mingap
              Minimum size for unclosed gaps (default 10)
           --minmq
              Minimum alignment mapping quality for a read to count in pileups (default 0)
           --minqual
              Minimum base quality to consider for pileups (default 0)
           --nostrays
              Skip making a pass through the input BAM files to identify stray pairs, that is,
              those pairs in which both reads are aligned but not marked valid because they have
              inconsistent orientation or separation. Identifying stray pairs can help fill gaps
              and assemble larger insertions, especially of repeat content.  However, doing so
              sometimes consumes considerable memory.

You can read more about the possible outputs Pilon can produce in the Wiki.

We can see there are many different options for pilon. We will be using the defaults for our assembly.

  • --genome — this will be the output assembly from Medaka
  • --frags — the short reads we used to create the BAM alignment were paired-end fragments, so we need to specify this using this flag
  • --outdir — this specifies a directory for all the output

Check you are in the cs_course folder and run Pilon:

Code
cd ~/cs_course
pilon --genome results/medaka/consensus.fasta --frags results/pilon/short_read_alignment.bam --outdir results/pilon &> results/pilon/pilon.out &

We can again keep track of the analysis by looking at the pilon.out file with less.

Code
less results/pilon/pilon.out

The top of the file:

Output
Pilon version 1.24 Thu Jan 28 13:00:45 2021 -0500
Genome: medaka/consensus.fasta
Fixing snps, indels, gaps, local
Input genome size: 18930486
Scanning BAMs
short_read_alignment.bam: 68579742 reads, 0 filtered, 7344428 mapped, 6036898 proper, 167134 stray, FR 100% 401+/-206, max 1018
Processing contig_1057:1-17080
frags short_read_alignment.bam: coverage 36
Total Reads: 9147, Coverage: 36, minDepth: 5
Confirmed 10586 of 17080 bases (61.98%)
Corrected 234 snps; 40 ambiguous bases; corrected 22 small insertions totaling 38 bases, 101 small deletions totaling 141 bases
# Attempting to fix local continuity breaks
# fix break: contig_1057:780-2830 0 -0 +0 NoSolution
# fix break: contig_1057:3081-4075 0 -0 +0 NoSolution
# fix break: contig_1057:4312-5132 0 -0 +0 NoSolution
# fix break: contig_1057:5367-5452 0 -0 +0 NoSolution
# fix break: contig_1057:5798-6336 0 -0 +0 NoSolution
# fix break: contig_1057:6798-8683 0 -0 +0 NoSolution
# fix break: contig_1057:8886-9132 0 -0 +0 NoSolution
# fix break: contig_1057:9394-12244 0 -0 +0 NoSolution

When Pilon finishes (around 1.5 hours) the end of the file will contain something like:

Output
Writing updated contig_1125_pilon to ./pilon.fasta
Writing updated contig_885_pilon to ./pilon.fasta
Writing updated contig_1130_pilon to ./pilon.fasta
Writing updated contig_1121_pilon to ./pilon.fasta
Writing updated contig_697_pilon to ./pilon.fasta
Mean frags coverage: 30
Mean total coverage: 30

Navigate into the pilon directory and have a look at the output files Pilon has produced.

Code
cd results/pilon
ls
Output
alignment.out  pilon.fasta short_read_alignment.bam  short_read_alignment.bam.bai

We can see pilon has produced a fasta file pilon.fasta, which is the newly polished assembly. This file is now our assembly.

In the next episode we will assess the quality of this assembly and compare its quality to that of the unpolished assembly.

Recommended reading:

While you’re waiting for the polishing to finish, here’s some things you might want to read about:

  • Comparison of combined assembly and polishing methods Trycycler: consensus long-read assemblies for bacterial genomes
  • Polishing strategy for ONT and Pacbio Hifi reads Polishing high-quality genome assemblies
  • Comparison of polishing of ONT data with alignment free tool Jasper compared to POLCA, NextPolish and ntEdit JASPER: a fast genome polishing tool that improves accuracy and creates population-specific reference genomes
  • Comparison of short read polishers including pilon to the polisher Polypolish Polypolish: Short-read polishing of long-read bacterial genome assemblies
  • Pilon short read polisher paper Pilon: An Integrated Tool for Comprehensive Microbial Variant Detection and Genome Assembly Improvement
  • Accuracy of polishers including medaka for nanopore data Nanpore consensus quality
  • Comparison of nanopore polishing tools Comparative evaluation of Nanopore polishing tools for microbial genome assembly and polishing strategies for downstream analysis
Back to top
Polishing
QC polished assembly

Licensed under CC-BY 4.0 2021–24 by Cloud-SPAN

  • Edit this page
  • Report an issue
Cookie Preferences