QC of metagenome bins

Overview

Teaching: 50 min
Exercises: 10 min

Questions

How can we assess the quality of the metagenome bins?

Objectives

Check the quality of the Metagenome-Assembled Genomes (MAGs).

Understanding MIMAG quality standards.

Quality check

The quality of a metagenome-assembled genome (MAG) or bin is highly dependent on several things:

the depth of sequencing
the abundance of the organism in the community
how successful the assembly was
how successful the polishing (if used) was

In order to determine the quality of a MAG we can look at two different metrics. These are:

completeness (i.e. how much of the genome is captured in the MAG?) and
contamination (i.e. do all the sequences in the MAG belong to the same organism?).

We can use the program CheckM to determine the quality of MAGs. CheckM uses a collection of domain and lineage-specific markers to estimate completeness and contamination of a MAG. This short YouTube video by Dr Robert Edwards explains how CheckM uses a hidden Markov model to calculate the level of contamination and completeness of bins, based on marker gene sets.

CheckM has multiple different workflows available which are appropriate for different datasets. See CheckM documentation on Workflows for more information.

We will be using the lineage-specific workflow here. lineage_wf places your bins in a reference tree to determine which lineage it corresponds to. This allows it to use the appropriate marker genes to estimate quality parameters.

First let’s move into our analysis folder.

cd ~/cs_course/analysis/

CheckM has been pre-installed on the instance so we can check the help documentation for the lineage-specific workflow using the 'h tag..

checkm lineage_wf -h

CheckM help documentation

usage: checkm lineage_wf [-h] [-r] [--ali] [--nt] [-g] [-u UNIQUE] [-m MULTI]
                         [--force_domain] [--no_refinement]
                         [--individual_markers] [--skip_adj_correction]
                         [--skip_pseudogene_correction]
                         [--aai_strain AAI_STRAIN] [-a ALIGNMENT_FILE]
                         [--ignore_thresholds] [-e E_VALUE] [-l LENGTH]
                         [-f FILE] [--tab_table] [-x EXTENSION] [-t THREADS]
                         [--pplacer_threads PPLACER_THREADS] [-q]
                         [--tmpdir TMPDIR]
                         bin_input output_dir

Runs tree, lineage_set, analyze, qa

positional arguments:
  bin_input             directory containing bins (fasta format) or path to file describing genomes/genes - tab separated in 2 or 3 columns [genome ID, genome fna, genome translation file (pep)]
  output_dir            directory to write output files

optional arguments:
  -h, --help            show this help message and exit
  -r, --reduced_tree    use reduced tree (requires <16GB of memory) for determining lineage of each bin
  --ali                 generate HMMER alignment file for each bin
  --nt                  generate nucleotide gene sequences for each bin
  -g, --genes           bins contain genes as amino acids instead of nucleotide contigs
  -u, --unique UNIQUE   minimum number of unique phylogenetic markers required to use lineage-specific marker set (default: 10)
  -m, --multi MULTI     maximum number of multi-copy phylogenetic markers before defaulting to domain-level marker set (default: 10)
  --force_domain        use domain-level sets for all bins
  --no_refinement       do not perform lineage-specific marker set refinement
  --individual_markers  treat marker as independent (i.e., ignore co-located set structure)
  --skip_adj_correction
                        do not exclude adjacent marker genes when estimating contamination
  --skip_pseudogene_correction
                        skip identification and filtering of pseudogenes
  --aai_strain AAI_STRAIN
                        AAI threshold used to identify strain heterogeneity (default: 0.9)
  -a, --alignment_file ALIGNMENT_FILE
                        produce file showing alignment of multi-copy genes and their AAI identity
  --ignore_thresholds   ignore model-specific score thresholds
  -e, --e_value E_VALUE
                        e-value cut off (default: 1e-10)
  -l, --length LENGTH   percent overlap between target and query (default: 0.7)
  -f, --file FILE       print results to file (default: stdout)
  --tab_table           print tab-separated values table
  -x, --extension EXTENSION
                        extension of bins (other files in directory are ignored) (default: fna)
  -t, --threads THREADS
                        number of threads (default: 1)
  --pplacer_threads PPLACER_THREADS
                        number of threads used by pplacer (memory usage increases linearly with additional threads) (default: 1)
  -q, --quiet           suppress console output
  --tmpdir TMPDIR       specify an alternative directory for temporary files

Example: checkm lineage_wf ./bins ./output

This readout tells us what we need to include in the command:

the x flag telling CheckM the format of our bins (fa)
the directory that contains the bins (pilon.fasta.metabat-bins1500-YYYMMDD_HHMMSS/)
the directory that we want the output to be saved in (checkm/)
the --reduced_tree flag to limit the memory requirements
the -f flag to specify an output file name/format
the --tab_table flag so the output is in a tab-separated format
the -t flag to set the number of threads used to four, which is the number we have on our instance

As a result our command looks like this:

checkm lineage_wf -x fa binning/pilon.fasta.metabat-bins1500-YYYMMDD_HHMMSS/ checkm/ --reduced_tree -t 4 --tab_table -f MAGs_checkm.tsv

When the run ends (it should take around 8 minutes) we can open our results file.

less MAGs_checkm.tsv

Bin Id  Marker lineage  # genomes       # markers       # marker sets   0       1       2       3       4       5+      Completeness    Contamination   Strain heterogeneity
bin.1   k__Bacteria (UID203)    5449    104     58      100     4       0       0       0       0       2.19    0.00    0.00
bin.2   k__Bacteria (UID203)    5449    104     58      57      47      0       0       0       0       70.69   0.00    0.00
bin.3   root (UID1)     5656    56      24      56      0       0       0       0       0       0.00    0.00    0.00
bin.4   g__Bacillus (UID864)    93      711     241     595     116     0       0       0       0       6.42    0.00    0.00
bin.5   o__Pseudomonadales (UID4488)    185     813     308     1       807     5       0       0       0       99.68   0.61    0.00
bin.6   c__Bacilli (UID285)     586     325     181     1       324     0       0       0       0       99.45   0.00    0.00

Running this workflow is equivalent to running six separate CheckM commands. The CheckM documentation explains this is more detail.

Exercise 1: Downloading the tsv file

Fill in the blanks to complete the code you need to download the MAGs_checkm.tsv to your local computer using SCP:
scp -i ___ csuser@instanceNNN.cloud-span.aws.york.ac.uk.:___/cs_course/analysis/MAGs_checkm.tsv ____
Solution

In a terminal logged into your local machine type:
scp -i login-key-instanceNNN.pem csuser@instanceNNN.cloud-span.aws.york.ac.uk:~/cs_course/analysis/MAGs_checkm.tsv <the destination directory of your choice>

How much contamination we can tolerate and how much completeness we need depends on the scientific question being tackled.

To help us, we can use a standard called Minimum Information about a Metagenome-Assembled Genome (MIMAG), developed by the Genomics Standard Consortium. You can read more about MIMAG in this 2017 paper.

As part of the standard,a framework to determine MAG quality from statistics is outlined. A MAG can be assigned one of three different metrics: High, Medium or Low quality draft metagenome assembled genomes.

See the table below for an overview of each category.

Quality Category	Completeness	Contamination	rRNA/tRNA encoded
High	> 90%	≤ 5%	Yes (≥ 18 tRNA and all rRNA)
Medium	≥ 50%	≤ 10%	No
Low	< 50%	≤ 10%	No

We have already determined the completeness and contamination of each of our MAGs using CheckM. Next we will use a program to determine which rRNA and tRNAs are present in each MAG.

Note that due to the difficulty in assembly of short-read metagenomes, often just a completeness of >90% and a contamination of ≤ 5% is treated as a good quality MAG.

Exercise 2: Explore the quality of the obtained MAGs

Once you have downloaded the MAGs_checkm.tsv file, you can open it in Excel or another spreadsheet program. If you didn’t manage to download the file, or do not have an appropriate program to view it in you can see or download our example file here.

Looking at the results of our quality checks, what category would each of our MAGs fall into (ignore the tRNA and rRNA requirement for now)?

Solution

There are two potential high quality draft metagenome assembled genomes in bin.5 and bin.6. We also have one medium quality draft MAG in bin.2. Finally, there are three low quality MAGs in bin.1, bin.3 and bin.4.

Your bins may have different names/numbers to these but you should still see similar results.

Key Points

CheckM can be used to evaluate the quality of each Metagenomics-Assembled Genome.

We can use the percentage contamination and completion to identify the quality of these bins.

There are MIMAG standards which can be used to categorise the quality of a MAG.

Many MAGs will be incomplete, but that does not mean that this data is not still useful for downstream analysis.

previous episode

Binning and Functional Annotation

next episode

QC of metagenome bins

Overview

Quality check

CheckM help documentation

Exercise 1: Downloading the tsv file

Solution

Exercise 2: Explore the quality of the obtained MAGs

Solution

Key Points

previous episode

next episode