Functional annotation

What is functional annotation?

Now we have our binned MAGs, we can start to think about what functions genes contained within their genomes do. We can do this via functional annotation - a way to collect information about and describe a DNA sequence.

Next lesson we will talk about taxonomic annotation, which tells us which organisms are present in the metagenome assembly. This lesson, however, we will do some brief functional annotation to get more information about the potential metabolic capacity of the organism we are annotating. This is possible because there is software available which uses features in DNA sequences to predict where genes start and end, allowing us to predict which genes are in our MAGs.

A high quality functional annotation is important because it is very useful for lots of downstream analyses. For instance, if we were looking for genes that have a particular function, we would only be able to do that if we were able to predict the location of the genes in these assemblies.

For example, the paper this data is pulled from uses functional annotation of MAGs to look for genes associated with denitrification pathways. The abundance of these genes is then linked to N₂O flux rates at different sites.

In this lesson we will only be doing a very small amount of functional annotation using the tool Prokka for rapid prokaryotic genome annotation. This is intended as a taster to give you an idea what you can use your MAGs for. There are many other routes to be taken regarding functional annotation, some of which will be discussed briefly at the end of this episode.

As with taxonomic annotation, effectiveness is determined by the database that the MAG sequence is being compared to. If you do not use the appropriate database you may not end up with many annotated sequences. In particular, Prokka (the tool we will use in this episode) annotates archaea and bacterial genomes. If you are trying to annotate a fungal genome or a eukaryote, you will need to use something different.

Using Prokka for functional annotation

Software choices

We are using Prokka here as it is still the software most commonly used. However, the program is no longer being updated. One recent alternative that is being actively developed is Bakta.

Prokka identifies candidate genes in a iterative process. First it uses Prodigal (another command line tool) to find candidate genes.These are then compared against databases of known protein sequences in order to determine their function. If you like, you can read more about Prokka in this 2014 paper.

Prokka has been pre-installed on our instance. First, let’s create a directory inside results where we can store our outputs from Prokka.

Code

cd ~/cs_course
mkdir results/prokka

For now we will annotate just one MAG at a time with Prokka. In the previous episode we produced 90 MAGs of varying quality. In this example, we will start with the MAG bin.45.fa, as this MAG had fairly high completeness (57.76%) and only 1.72% contamination. You should choose your own “best” bin to use (one with good completeness and low contamination).

Before we start we’ll need to activate a conda environment to run the software.

Activating an environment

Environments are a way of installing a piece of software so that it is isolated, so that things installed within an environment, do not affect other software installed at system wide level. For some pieces of software, the requirements for different dependency versions, such different versions of python mean this is an easy way to have multiple pieces of software installed without conflicts. One popular way to manage environments is to use conda which is a popular environment manager. We will not discuss using conda in detail, so for further information of how to use it, here is a Carpentries course that covers how to use conda in more detail.

For this course we have created a conda environment containing prokka. In order to use this we will need to use the conda activate command:

Code

conda activate prokka

You will be able to tell you have activated your environment because your prompt should go from looking like this, with (base) at the beginning…

Code

(base) csuser@instance001:~ $

…to having (prokka) at the beginning. If you forget whether you are in an the prokka environment, look back to see what the prompt looks like.

Code

(prokka) csuser@instance001:~ $

Now let’s take a look at the help page for Prokka using the -h flag.

Code

prokka -h

Output — Prokka Help documentation

Name:
  Prokka 1.12 by Torsten Seemann <torsten.seemann@gmail.com>
Synopsis:
  rapid bacterial genome annotation
Usage:
  prokka [options] <contigs.fasta>General:
  --help            This help
  --version         Print version and exit
  --docs            Show full manual/documentation
  --citation        Print citation for referencing Prokka
  --quiet           No screen output (default OFF)
  --debug           Debug mode: keep all temporary files (default OFF)
Setup:
  --listdb          List all configured databases
  --setupdb         Index all installed databases
  --cleandb         Remove all database indices
  --depends         List all software dependencies
Outputs:
 --outdir [X]      Output folder [auto] (default '')
 --force           Force overwriting existing output folder (default OFF)
 --prefix [X]      Filename output prefix [auto] (default '')
 --addgenes        Add 'gene' features for each 'CDS' feature (default OFF)
 --addmrna         Add 'mRNA' features for each 'CDS' feature (default OFF)
 --locustag [X]    Locus tag prefix [auto] (default '')
 --increment [N]   Locus tag counter increment (default '1')
 --gffver [N]      GFF version (default '3')
 --compliant       Force Genbank/ENA/DDJB compliance: --addgenes --mincontiglen 200 --centre XXX (default OFF)
 --centre [X]      Sequencing centre ID. (default '')
 --accver [N]      Version to put in Genbank file (default '1')
Organism details:
  --genus [X]       Genus name (default 'Genus')
  --species [X]     Species name (default 'species')
  --strain [X]      Strain name (default 'strain')
  --plasmid [X]     Plasmid name or identifier (default '')
 Annotations:
  --kingdom [X]     Annotation mode: Archaea|Bacteria|Mitochondria|Viruses (default 'Bacteria')
  --gcode [N]       Genetic code / Translation table (set if --kingdom is set) (default '0')
  --gram [X]        Gram: -/neg +/pos (default '')
  --usegenus        Use genus-specific BLAST databases (needs --genus) (default OFF)
  --proteins [X]    FASTA or GBK file to use as 1st priority (default '')
  --hmms [X]        Trusted HMM to first annotate from (default '')
  --metagenome      Improve gene predictions for highly fragmented genomes (default OFF)
  --rawproduct      Do not clean up /product annotation (default OFF)
  --cdsrnaolap      Allow [tr]RNA to overlap CDS (default OFF)
Computation:
  --cpus [N]        Number of CPUs to use [0=all] (default '8')
  --fast            Fast mode - only use basic BLASTP databases (default OFF)
  --noanno          For CDS just set /product="unannotated protein" (default OFF)
  --mincontiglen [N] Minimum contig size [NCBI needs 200] (default '1')
  --evalue [n.n]    Similarity e-value cut-off (default '1e-06')
  --rfam            Enable searching for ncRNAs with Infernal+Rfam (SLOW!) (default '0')
  --norrna          Don't run rRNA search (default OFF)
  --notrna          Don't run tRNA search (default OFF)
  --rnammer         Prefer RNAmmer over Barrnap for rRNA prediction (default OFF)

Looking at the help page tells us how to construct our basic command, which has these arguments: - --outdir tells Prokka where to put the output - --prefix tells Prokka what to call the outputs (in this case, the name of the bin will suffice) - finally we need to provide the file to be annotated

Prokka produces multiple different file types, which you can see in the table below. We are mainly interested in .faa, .ffn and .tsv but many of the other files are useful for submission to different databases.

Suffix	Description of file contents
.fna	FASTA file of original input contigs (nucleotide)
.faa	FASTA file of translated coding genes (protein)
.ffn	FASTA file of all genomic features (nucleotide)
.fsa	Contig sequences for submission (nucleotide)
.tbl	Feature table for submission
.sqn	Sequin editable file for submission
.gbk	Genbank file containing sequences and annotations
.gff	GFF v3 file containing sequences and annotations
.log	Log file of Prokka processing output
.txt	Annotation summary statistics
.tsv	Tab-separated file of all features: locus_tag,ftype,len_bp,gene,EC_number,COG,product

Code

prokka --outdir results/prokka/bin.45 --prefix bin.45 results/binning/assembly_ERR5000342.fasta.metabat-bins1500-YYYMMDD_HHMMSS/bin.45.fa

This should take around 1-2 minutes on the instance so we will not be running the command in the background.

Exercise 1: Recap of Prokka command

Test yourself! What do each of these parts of the command signal?

--outdir bin.45
--prefix bin.45
results/binning/assembly_ERR5000342.fasta.metabat-bins1500-YYYMMDD_HHMMSS/bin.45.fa

Solution

bin.45 is the name of the directory where Prokka will place its output files
bin.45 will be the name of each output file e.g. bin.45.tsv or bin.45.faa
This is the file path for the file we want Prokka to annotate

When you initially run the command you should see similar to the following.

Output

[11:58:55] This is prokka 1.12
[11:58:55] Written by Torsten Seemann <torsten.seemann@gmail.com>
[11:58:55] Homepage is https://github.com/tseemann/prokka
[11:58:55] Local time is Wed Mar 22 11:58:55 2023
[11:58:55] You are csuser
[11:58:55] Operating system is linux
[11:58:55] You have BioPerl 1.006924
[11:58:55] System has 8 cores.
[11:58:55] Will use maximum of 8 cores.
[11:58:55] Annotating as >>> Bacteria <<<
[11:58:55] Generating locus_tag from 'results/binning/assembly_ERR5000342.fasta.metabat-bins1500-YYYMMDD_HHMMSS/bin.45.fa' contents.

And you should see the following when the command has finished:

Output

[12:00:28] Output files:
[12:00:28] bin.45/bin.45.fna
[12:00:28] bin.45/bin.45.faa
[12:00:28] bin.45/bin.45.ffn
[12:00:28] bin.45/bin.45.fsa
[12:00:28] bin.45/bin.45.err
[12:00:28] bin.45/bin.45.sqn
[12:00:28] bin.45/bin.45.txt
[12:00:28] bin.45/bin.45.gbk
[12:00:28] bin.45/bin.45.tsv
[12:00:28] bin.45/bin.45.gff
[12:00:28] bin.45/bin.45.log
[12:00:28] bin.45/bin.45.tbl
[12:00:28] Annotation finished successfully.
[12:00:28] Walltime used: 1.55 minutes
[12:00:28] If you use this result please cite the Prokka paper:
[12:00:28] Seemann T (2014) Prokka: rapid prokaryotic genome annotation. Bioinformatics. 30(14):2068-9.
[12:00:28] Type 'prokka --citation' for more details.
[12:00:28] Share and enjoy!

Now prokka has finished running, we can exit the conda environment using the conda deactivate command.

Code

conda deactivate

Your prompt should return from something like this:

Code

(prokka) csuser@metagenomicsT3instance04:~ $ conda deactivate

to this:

Code

(base) csuser@metagenomicsT3instance04:~ $

If we navigate into the bin.45 output file we can use ls to see that Prokka has generated many files.

Code

cd results/prokka/bin.45
ls

Output

bin.45.err  bin.45.faa  bin.45.ffn  bin.45.fna  bin.45.fsa  bin.45.gbk  bin.45.gff  bin.45.log  bin.45.sqn  bin.45.tbl  bin.45.tsv  bin.45.txt

As mentioned previously, the two files we are most interested in are those with the extension .tsv and .faa:

the .tsv file contains information about every gene identified by Prokka
the .faa file is a FASTA file containing the amino acid sequence of every gene that has been identified.

We can take a look at the .tsv file using head.

Code

head bin.45.tsv

Output

locus_tag       ftype   gene    EC_number       product
DDJNKIGN_00001  CDS     hypothetical protein
DDJNKIGN_00002  CDS     hypothetical protein
DDJNKIGN_00003  CDS     macA    Macrolide export protein MacA
DDJNKIGN_00004  CDS     hypothetical protein
DDJNKIGN_00005  CDS     hypothetical protein
DDJNKIGN_00006  CDS     hypothetical protein
DDJNKIGN_00007  CDS     hypothetical protein
DDJNKIGN_00008  CDS     pstA_1  Phosphate transport system permease protein PstA
DDJNKIGN_00009  CDS     pstA_2  Phosphate transport system permease protein PstA

This file gives us a list of all the sequences that Prokka has identified as being protein-coding, along with the gene name (if there is one) and the protein product (again, if there is one).

You will notice that some of the output are labelled simply “hypothetical protein”. This means the locus in questions looks like a protein-coding gene, but there isn’t a match for it in any of the databases used by Prokka to label genes.

Others have a gene and product name, meaning Prokka was able to successfully identify them as a specific gene. The product column tells you the name of the protein this gene codes for.

We can then look at the .faa file to see the sequences of these proteins.

Code

head bin.45.faa

Output

>DDJNKIGN_00001 hypothetical protein
MTSSTVINTLVAAQTPILKQNLRPVSVWLHHCGLGGVQASWIQFRDSLRQAIIDALSAAG
MTDCMNELKYRWGL
>DDJNKIGN_00002 hypothetical protein
MQPRPGIPFAGALVPLSTFNKTALRSNSIDLTNPPQLEPFTRREQYRIVVSGDEPDCDDT
LELPVWDCDLIRKCYEVSYHKARLDYYGPAAPFSPKDMTSFRGSSRQCWERTERLRSAGC
TTSRPINCLRQILNVSWTKNMSAVLAGGLLQGLRPEPQLDPAWAAFFALPDIEITSLRST
GTSSPDRTRSRKRTPSAESRRPWRCHGPQPVLPG
>DDJNKIGN_00003 Macrolide export protein MacA
MTSKHIGMVAGAMAFIAAGVGCARSRTAAAGDERPAVSVVKIARGDLSQGLTLAAEFRPF

What next?

Now we have information about the various genes (and the proteins they code for) present in one of our bins. What can we do with this information?

Relating genes to an online database

There are tools available which allow you to visualise the proteins in your bin and how they fit into different metabolic pathways. Some of these are available through your browser.

One such tool is BlastKOALA, where you can upload the .faa file we just looked at and get back a breakdown of the proteins mapped to the KEGG database (a database of molecular interaction maps). Start by downloading the .faa file to your local maching using scp:

Code

scp -i login-key-instanceNNN.pem csuser@instanceNNN.cloud-span.aws.york.ac.uk:~/cs_course/results/prokka/bin.45/bin.45.faa

You can then upload this file onto BlastKOALA. BlastKOALA is a tool which can annotate the sequences with K numbers. These then relate back to the KEGG database.

a screenshot of the blastKoala upload page — .

You should click on the “Choose file” button and navigate to where your *.faa file is located on your computer.

Choose the Prokaryotes database as the one to search - you can run it again with eukaryotes and/or viruses if you like later. However, we’re mostly expecting to see prokaryotes so this will be the most useful output.

Once you have pressed submit you should be re-directed to a screen that says “Request accepted”. You will also be sent an email with two links, one to submit the job and one to cancel. Make sure you press the submit link as your job will not be running without it! If you haven’t received an email, check your spam.

Once you have pressed the submit link in the email you should be redirected to a BlastKOALA page that says “Job submitted”. This is an online server shared by lots of people, so your job has to queue with other jobs before it can be executed. This may take a while. You will recieve an email when the job has completed.

Once the job has been completed you will receive a link by email. From this you can explore the annotated MAG. You can view/download the data and use the KEGG mapper reconstruct pathway option to see how these genes interact with each other.

a screenshot of the blastKoala output — .

Using an annotation tool like this can help you understand more about the genes and pathways present in your sample(s). For example, as previously described, the paper this data is pulled from uses functional annotation of MAGs to look for genes associated with denitrification pathways.

Building a tree from the 16S sequence

Prokka is able to identify 16S sequences present in our MAGs. This can be used to build a quick taxonomic tree to see what organisms our MAG is related to.

First we will search for the presence of 16S sequences in the Prokka output.

While still logged into the instance, navigate to the Prokka output directory you generated earlier (~/cs_course/results/prokka/bin.45). Once in that directory we can search for sequenced identified as being 16S in the .tsv file using grep:

Code

grep 16S *.tsv

You should get a result similar to the below. Yours may differ slightly depending on the MAG you ran.

Output

NBEAANKK_00310  CDS     yrrK    3.1.-.- Putative pre-16S rRNA nuclease
NBEAANKK_02181  rRNA    16S ribosomal RNA

Note

If you don’t get an output here it may be that your MAG doesn’t have any 16S sequences present. In the case of this lesson, that means you have run Prokka on a less complete MAG than the one we used. You should double check your output from CheckM and pick a MAG that is highly complete to run through Prokka instead.

Our output shows that there is 1 full size 16S ribosomal RNA genes present in our data, and a putative rRNA nuclease (which we can discount as it isn’t an rRNA).

The next step is to pull out the sequence of one of these 16S rRNA genes and run it through a BLAST database. This is possible using the .ffn file which gives the sequences in nucleotide format. We’ll need to search the .ffn file for the tag associated with our gene of interest (the first column in the output above).

We do this using seqkit with a grep option, which you can read more about here). Here’s the format of this command:

Code

seqkit grep -p <prokka_id> <prokka.ffn>

So in our case this command would be:

Code

$ seqkit grep -p NBEAANKK_02181 bin.45.ffn

You can see the output of the command below:

Output

>NBEAANKK_02181 16S ribosomal RNA
ACGAGAGTTTGATCCTGGCTCAGAGCGAACGCTGGCGGCGTGCTTAACACATGCAAGTCG
AGTGCGCGCCTGTAGCAATACAGGTGGCGCACGGCGCACGGGTGCGTAACACGTGGGTAA
TCTGCCCTTCGATGGGGAATAACTCGCCGAAAGGCGAGCTAATTCCGCATAACATTCCGA
GAACTTTGGTTTTTGGATTCAAAGGCGTAAGTCGTCGGAGGAGGAGCCCGCGCACGATTA
GCTAGTTGGTGAGGTAACGGCTCACCAAGGCTATGATCGTTAGCTGGTCTGAGAGGATGG
CCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATC
TTGCACAATGGGCGAAAGCCTGATGCAGCGACGCCGCGTGGGGGATGAAGCTTTTCGGAG
TGTAAACCCCTTTCGACCCGGACGAATGCCTCGCAAGAGGACTGACGGTACGGGTATAAG
AAGCCCCGGCTAACTACGTGCCAGCAGCCGCGGTAAGACGTAGGGGGCCAGCGTTGCTCG
GAATTACTGGGTGTAAAGGGTTCGTAGGCGGTGTGGCAAGTCGGGAGTGAAATCTCTGGG
CTCAACCCAGAGGCTGCTTCCGAAACTGCTGTGCTTGAGTGTGGGAGAGGCGCGTGGAAT
TGCAGGTGTAGCGGTGAAATGCGTAGATATCTGCAGGAACACCCGTGGCGAAAGCGGCGC
GCTGGACCACAACTGACGCTGAGGAACGAAAGCTAGGGGAGCAAACAGGATTAGATACCC
TGGTAGTCCTAGCCCTAAACGATCAGGACTTGGGGTGCCGCCCGTTCGGGCGTCGTCCCG
GAGCTAACGCGTTAAGTCCTGCACCTGGGGAGTACGGTCGCAAGACTGAAACTCAAAGGA
ATTGACGGGGGCCCGCACAAGCGGTGGAACATGTGGTTCAATTCGACGCTACGCGAGGAA
CCTTACCTGGGCTCGAAATGCTTATGACCAGCTGTAGAAATACGGCCTTCCCGCAAGGGA
CAGGAGTATAGGCGCTGCATGGCTGTCGTCAGCTCGTGCCGTGAGGTGTTGGGTTAAGTC
CCGCAACGAGCGCAACCCCTGCACGTAGTTGCCACTCCGCAAGGAGGGAACTCTACGTGG
ACTGCTCCGGATAACGGAGAGGAAGGTGGGGATGACGTCAAGTCCGCATGGCCTTTATGT
CCAGGGCTACACACGTGTTACAATGCAGGGTACAAACCGTTGCCAACCCGCGAGGGGGAG
CTAATCGGAAAAAACTCTGCTCAGTTCGGATTGCAGTCTGCAACTCGACTGCATGAAGCC
GGAATCGCTAGTAATGGCGTATCAGATCGACGCCGTGAATACGTTCCCGGGCCTTGTACA
CACCGCCCGTCACATCACGAAAGTGAGTTGTACTAGAAGTCGTCACGCTGACCGCAAGGA
GGCAGACGCCCAAGGTATGACCCATGATTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGA
GAACCTGCGGCTGGATCACCTCCTTT

Now we have the 16S rRNA sequence we can upload this to BLAST and search the 16S database to see what organisms this MAG relates to.

BLAST

We will be using BLAST (Basic Local Alignment Search Tool) which is an algorithm to find regions of similarity between biological sequences. BLAST is a very popular program in bioinformatics so you may be familiar with the online BLAST server run by NCBI.

We will be using the online server, available at BLAST.

Select the button that says “Nucleotide BLAST”.

Under “Choose Search Set” set the database to “rRNA/ITS database”. Then, copy and paste the 16S sequence from your instance into the box at the top titled “Enter accession number(s), gi(s), or FASTA sequence(s)”

Your screen should look something like the below:

Now, click the blue BLAST button!

Your job will then be added to a queue of other jobs until there is space for it to run. Usually this is only a couple of minutes, especially as the sequence length and database size we are using are small. Make sure you leave the tab open while you wait so you can see your results when they arrive.

Your ouput should look like this:

From here you can explore the sequences that were aligned to your 16S sequence using the “Descriptions”, “Graphic Summary”, “Alignments” and “Taxonomy” tabs. You can also browse the “Distance tree of results” to see where your 16S sequence lies in relation to other species.

Exercise 2: Understand the blast output

You will now have a 16S sequence from the MAG that you have chosen. Use the output from your BLAST search to answer these questions.

What do you think is the most likely annotation for your MAG? ?
Which columns in the BLAST output do you think are the most important for selecting which is the best hit?
Now you have identified this MAG, try repeating the process for the other bins and see which organisms they belong to. Which are the best hits for these MAGs?

Solution

This will vary depending on the MAG you have picked but it will be one of the first hits in the output. The “closest” match will probably be the one with the highest total score. However these are not the only columns worth using to identify the best hit.
Other columns worth looking at (because the top hit may not be the best) are the query cover, percent identity and the E-value.
- percent identity is how similar the query sequence (your input) is to the target AKA how many characters are identical. Higher percent identity = more similar sequences
- query coverage is the percentage of the query sequence that overlaps the target. If there is only a small overlap then the match is less significant, even if it has a very high percent identity. We want as much of the two sequences to be identical as possible.
- E-value is the number of matches you would expect to see by chance. This is dependent on the size of the database. Lower E-value = less likely to be by chance = a better match.