Cloud-SPAN logo. Cloud-SPAN logo.
  • Home
  • Precourse Instructions
  • About
  1. Functional Annotation
  2. Extracting Functional Information
  • Files and Directories
    • Understanding your file system
    • Logging onto the Cloud
    • Introducing the Shell
  • Using the Command Line
    • Navigating Files and Directories
    • Working with Files and Directories
    • Redirection
  • QC & RNA pre-processing
    • Introduction to Metatranscriptomics
    • Quality of Raw Reads
    • Ribosomal RNA Filtering
  • Taxonomic Annotation
    • Extracting a Community Profile
    • Visualising Community Structure
  • Functional Annotation
    • Extracting Functional Information
    • Downstream analysis of taxonomic and functional profiles
  • Extras
    • Data
    • Glossary
    • Workflow Reference

On this page

  • What is Functional Profiling?
  • What is Humann?
    • Running Humann
  • Humann outputs
    • Normalising abundances
    • Identifying gene families in pathways
    • Group gene families into GO terms
  • Edit this page
  • Report an issue
  1. Functional Annotation
  2. Extracting Functional Information

Extracting Functional Information

In the previous lesson we used Metaphlan to assess the taxonomic profile and discover which microbes are active in our sample. Now we know “who” is in our sample, it’s time to find out “what” they are doing, that is to say the functional information. To do so we shall use another piece of software called Humann.

What is Functional Profiling?

Functional profiling is the identification of the function of a microbiome. With metagenomic data (DNA) a functional profile will indicate which genes and gene pathways can be expressed, but in metatranscriptomics the analysis of mRNA we can identify which genes pathways are expressed. In other words, this lets us see which gene pathways are active in the microbiome, and how relatively active they are.

Gene pathways, sometimes called metabolic pathways, are collections of genes that code for enzymes that carry out certain biological functions such as carbohydrate metabolism, amino acid biosynthesis, stress response, or antibiotic resistance. Similarly, gene families are evolutionarily related genes that perform similar functions. Because of the natural genetic diversity of microbial life it is easier to search for gene families and pathways than individual genes.

Usually gene expression is measured in relative abundance rather than absolute as the sequenced RNA is only a subset of the true total biological pool. RNA extraction is never 100% efficient as molecules can be lost during cell lysis, purification, library prep, and sequencing prep. Sequencing depth is always limited, and so even RNA molecules present in the sequencing plate may not be detected if they are lower in count. As such absolute abundances will not necessarily be a meaningful measure, whereas the relative expression of gene pathways can be informative and useful for the comparison of function between samples or treatments.

What is Humann?

HUMAnN (The HMP Unified Metabolic Analysis Network) is another piece of software from the BioBakery suite of software as detailed in the previous lesson. It is a pipeline for accurate determination of either the abundance or presence/absence of microbial gene pathways in a community. It can use either DNA (metagenomic) or RNA (metatranscriptomic) data.

It works by finding the abundance of orthologous (performing approximately the same function) gene families in the data, using the KEGG Orthology (KO) catalog by default. It does this by aligning the transcript reads firstly against the Chocophlan database to get nucleotide-level hits, then any unaligned reads are translated into amino acid sequences and then aligned against a UniRef database of protein sequences.

It also finds gene pathways (defined as two or more genes) using KEGG Pathways or GO (Gene Ontology) terms. Abundance of each pathway (how many copies there are present in the data) as well as presence/absence can be outputted by Humann.

Running Humann

First of all let’s make sure we’re in our working directory of cs_course.

Code
cd /home/csuser/cs_course

Then let’s make a new directory to store out Humann results in.

Code
mkdir -p analysis/humann

Humann is a very extensive programme with many features. Let’s run the help function to see what it can do:

Code
humann --help
NoteOutput — Humann help documentation
Output
$ humann --help
usage: humann [-h] -i <input.fastq> -o <output> [--threads <1>] [--version] [-r] [--bypass-nucleotide-index]
              [--bypass-nucleotide-search] [--bypass-prescreen] [--bypass-translated-search]
              [--taxonomic-profile <taxonomic_profile.tsv>] [--memory-use {minimum,maximum}]
              [--input-format {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}] [--search-mode {uniref50,uniref90}]
              [-v] [--metaphlan <metaphlan>] [--metaphlan-options <metaphlan_options>] [--prescreen-threshold <0.01>]
              [--bowtie2 <bowtie2>] [--bowtie-options <bowtie_options>] [--nucleotide-database <nucleotide_database>]
              [--nucleotide-identity-threshold <0.0>] [--nucleotide-query-coverage-threshold <90.0>]
              [--nucleotide-subject-coverage-threshold <50.0>] [--diamond <diamond>] [--diamond-options <diamond_options>]
              [--evalue <1.0>] [--protein-database <protein_database>] [--rapsearch <rapsearch>]
              [--translated-alignment {usearch,rapsearch,diamond}]
              [--translated-identity-threshold <Automatically: 50.0 or 80.0, Custom: 0.0-100.0>]
              [--translated-query-coverage-threshold <90.0>] [--translated-subject-coverage-threshold <50.0>] [--usearch <usearch>]
              [--gap-fill {on,off}] [--minpath {on,off}] [--pathways {metacyc,unipathway}]
              [--pathways-database <pathways_database.tsv>] [--xipe {on,off}] [--annotation-gene-index <3>]
              [--id-mapping <id_mapping.tsv>] [--remove-temp-output] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
              [--o-log <sample.log>] [--output-basename <sample_name>] [--output-format {tsv,biom}] [--output-max-decimals <10>]
              [--remove-column-description-output] [--remove-stratified-output]

HUMAnN : HMP Unified Metabolic Analysis Network 3

optional arguments:
  -h, --help            show this help message and exit

[0] Common settings:
  -i <input.fastq>, --input <input.fastq>
                        input file of type {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}
                        [REQUIRED]
  -o <output>, --output <output>
                        directory to write output files
                        [REQUIRED]
  --threads <1>         number of threads/processes
                        [DEFAULT: 1]
  --version             show program's version number and exit

[1] Workflow refinement:
  -r, --resume          bypass commands if the output files exist
  --bypass-nucleotide-index
                        bypass the nucleotide index step and run on the indexed ChocoPhlAn database
  --bypass-nucleotide-search
                        bypass the nucleotide search steps
  --bypass-prescreen    bypass the prescreen step and run on the full ChocoPhlAn database
  --bypass-translated-search
                        bypass the translated search step
  --taxonomic-profile <taxonomic_profile.tsv>
                        a taxonomic profile (the output file created by metaphlan)
                        [DEFAULT: file will be created]
  --memory-use {minimum,maximum}
                        the amount of memory to use
                        [DEFAULT: minimum]
  --input-format {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}
                        the format of the input file
                        [DEFAULT: format identified by software]
  --search-mode {uniref50,uniref90}
                        search for uniref50 or uniref90 gene families
                        [DEFAULT: based on translated database selected]
  -v, --verbose         additional output is printed

[2] Configure tier 1: prescreen:
  --metaphlan <metaphlan>
                        directory containing the MetaPhlAn software
                        [DEFAULT: $PATH]
  --metaphlan-options <metaphlan_options>
                        options to be provided to the MetaPhlAn software
                        [DEFAULT: "-t rel_ab"]
  --prescreen-threshold <0.01>
                        minimum percentage of reads matching a species
                        [DEFAULT: 0.01]

[3] Configure tier 2: nucleotide search:
  --bowtie2 <bowtie2>   directory containing the bowtie2 executable
                        [DEFAULT: $PATH]
  --bowtie-options <bowtie_options>
                        options to be provided to the bowtie software
                        [DEFAULT: "--very-sensitive"]
  --nucleotide-database <nucleotide_database>
                        directory containing the nucleotide database
                        [DEFAULT: /home/ubuntu/cs_course/data/chocophlanhumann/chocophlan]
  --nucleotide-identity-threshold <0.0>
                        identity threshold for nuclotide alignments
                        [DEFAULT: 0.0]
  --nucleotide-query-coverage-threshold <90.0>
                        query coverage threshold for nucleotide alignments
                        [DEFAULT: 90.0]
  --nucleotide-subject-coverage-threshold <50.0>
                        subject coverage threshold for nucleotide alignments
                        [DEFAULT: 50.0]

[3] Configure tier 2: translated search:
  --diamond <diamond>   directory containing the diamond executable
                        [DEFAULT: $PATH]
  --diamond-options <diamond_options>
                        options to be provided to the diamond software
                        [DEFAULT: "--top 1 --outfmt 6"]
  --evalue <1.0>        the evalue threshold to use with the translated search
                        [DEFAULT: 1.0]
  --protein-database <protein_database>
                        directory containing the protein database
                        [DEFAULT: /home/ubuntu/cs_course/data/chocophlanhumann/uniref/uniref]
  --rapsearch <rapsearch>
                        directory containing the rapsearch executable
                        [DEFAULT: $PATH]
  --translated-alignment {usearch,rapsearch,diamond}
                        software to use for translated alignment
                        [DEFAULT: diamond]
  --translated-identity-threshold <Automatically: 50.0 or 80.0, Custom: 0.0-100.0>
                        identity threshold for translated alignments
                        [DEFAULT: Tuned automatically (based on uniref mode) unless a custom value is specified]
  --translated-query-coverage-threshold <90.0>
                        query coverage threshold for translated alignments
                        [DEFAULT: 90.0]
  --translated-subject-coverage-threshold <50.0>
                        subject coverage threshold for translated alignments
                        [DEFAULT: 50.0]
  --usearch <usearch>   directory containing the usearch executable
                        [DEFAULT: $PATH]

[5] Gene and pathway quantification:
  --gap-fill {on,off}   turn on/off the gap fill computation
                        [DEFAULT: on]
  --minpath {on,off}    turn on/off the minpath computation
                        [DEFAULT: on]
  --pathways {metacyc,unipathway}
                        the database to use for pathway computations
                        [DEFAULT: metacyc]
  --pathways-database <pathways_database.tsv>
                        mapping file (or files, at most two in a comma-delimited list) to use for pathway computations
                        [DEFAULT: metacyc database ]
  --xipe {on,off}       turn on/off the xipe computation
                        [DEFAULT: off]
  --annotation-gene-index <3>
                        the index of the gene in the sequence annotation
                        [DEFAULT: 3]
  --id-mapping <id_mapping.tsv>
                        id mapping file for alignments
                        [DEFAULT: alignment reference used]

[6] More output configuration:
  --remove-temp-output  remove temp output files
                        [DEFAULT: temp files are not removed]
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        level of messages to display in log
                        [DEFAULT: DEBUG]
  --o-log <sample.log>  log file
                        [DEFAULT: temp/sample.log]
  --output-basename <sample_name>
                        the basename for the output files
                        [DEFAULT: input file basename]
  --output-format {tsv,biom}
                        the format of the output files
                        [DEFAULT: tsv]
  --output-max-decimals <10>
                        the number of decimals to output
                        [DEFAULT: 10]
  --remove-column-description-output
                        remove the description in the output column
                        [DEFAULT: output column includes description]
  --remove-stratified-output
                        remove stratification from output
                        [DEFAULT: output is stratified]
flag/option meaning
--input analysis/sortmerna/SRR6820491_mrna.fq specifies the input file containing our filtered mRNA reads from SortMeRNA. We use the filtered reads here because rRNA sequences do not contain protein-coding information and would slow down functional annotation considerably.
--output analysis/humann/ tells HUMAnN where to write all output files generated during the run.
--taxonomic-profile analysis/metaphlan/metaphlan_output.txt provides the MetaPhlAn taxonomic profile so HUMAnN only searches database entries relevant to the taxa already identified. This greatly improves runtime efficiency.
--threads 8 tells HUMAnN to use all 8 CPU cores available on our instance.
&> analysis/humann/log.txt & redirects all terminal output to a log file and runs the process in the background.

And so, the command we are going to run Humann with looks like this:

Code
humann --input analysis/sortmerna/SRR6820491_mrna.fq \
--output analysis/humann/ \
--taxonomic-profile analysis/metaphlan/metaphlan_output.txt \
--threads 8 \
&> analysis/humann/log.txt &

As we are looking at community function we no longer need the rRNA sequences as they do not contain gene coding regions and so will slow the run down considerably. This means we shall be using the filtered output of SortMeRNA.

In order to speed up the run further we feed our Metaphlan output into the command with --taxonomic-profile, which tells Humann to only look at database entries relevant to the taxa we identified with Metaphlan.

For space-saving and efficiency we are also using the UniRef90 EC filtered database, a version which only contains proteins associated with a level-4 enzyme commision category (ones with precisely identified substrates). Whilst this may not identify a broader number of transcribed proteins, it will be more informative for pathway analysis.

This command will take up to a few hours to run, but hopefully you managed to run it at the end of last session!

Tip

The UniRef databases can be extremely large, with over 100GB being standard. Our instances are not quite large enough to accommodate the full size database we would usually need. This is because for costing purposes only we’re working on slightly smaller instances for this course. It is entirely possible to expand the disk space of an AWS instance, and during your own analyses we recommend about 1TB of storage (or more if you are processing more than ~20 samples).

Humann outputs

We can check that Humann has ran correctly by checking the log like so:

Code
less analysis/humann/log.txt
Output
Output files will be written to: /home/csuser/cs_course/analysis/humann
Removing spaces from identifiers in input file ...


Total species selected from prescreen: 0

Selected species explain 0.00% of predicted community composition



No species were selected from the prescreen.
Because of this the custom ChocoPhlAn database is empty.
This will result in zero species-specific gene families and pathways.



Running diamond ........


Aligning to reference database: uniref90_201901b_ec_filtered.dmnd

Total bugs after translated alignment: 1
unclassified: 10677087 hits

Total gene families after translated alignment: 3140

Unaligned reads after translated alignment: 35.3608922734 %


Computing gene families ...

Computing pathways abundance and coverage ...

Output files created:
/home/csuser/cs_course/analysis/humann/SRR6820491_mrna_genefamilies.tsv
/home/csuser/cs_course/analysis/humann/SRR6820491_mrna_pathabundance.tsv
/home/csuser/cs_course/analysis/humann/SRR6820491_mrna_pathcoverage.tsv

If your log looks like this then congratulations! Humann has ran and successfully outputted the results.

Humann will produce five outputs:

Code
ls -F analysis/humann/
Output
log.txt                           
SRR6820491_mrna_pathabundance.tsv
SRR6820491_mrna_genefamilies.tsv  
SRR6820491_mrna_pathcoverage.tsv
SRR6820491_mrna_humann_temp/

A directory of temporary files

This will contain multiple files generated by humann during the analysis, including aligned reads, intermediate mapping files, and a log file. Some other pieces of software clear their temporary files afterwards, but Humann leaves them intact. This can be useful for troubleshooting, but takes up a large portion of you storage space. If you are happy the run is complete you can delete these files to save space

Tabular file with gene families and abundances

This is a functional profile consisting of two columns: the identified gene families, and their abundances. We can inspect it with:

Code
less analysis/humann/SRR6820491_mrna_genefamilies.tsv

It should look a bit like this:

Output
# Gene Family   SRR6820491_mrna_Abundance-RPKs
UNMAPPED        2259677.0000000000
UniRef90_C8C186 94213.0855549303
UniRef90_C8C186|unclassified    94213.0855549303
UniRef90_A0A098YGS3     90112.0680509912
UniRef90_A0A098YGS3|unclassified        90112.0680509912
UniRef90_A6B9Y3 82876.7853519426
UniRef90_A6B9Y3|unclassified    82876.7853519426
UniRef90_A0A343J0E6     75794.0211237180
UniRef90_A0A343J0E6|unclassified        75794.0211237180
UniRef90_A0A141RML6     73640.8387018042
UniRef90_A0A141RML6|unclassified        73640.8387018042
UniRef90_A6BBS7 69930.2570132110
UniRef90_A6BBS7|unclassified    69930.2570132110
UniRef90_R0JKC3 57483.5504630520
UniRef90_R0JKC3|unclassified    57483.5504630520
UniRef90_A0A127SFT5     54307.5293219123
UniRef90_A0A127SFT5|unclassified        54307.5293219123
UniRef90_A0A127SHY8     48265.0559379932
UniRef90_A0A127SHY8|unclassified        48265.0559379932
UniRef90_Q79DR3 40958.2935396859
UniRef90_Q79DR3|unclassified    40958.2935396859
UniRef90_P62593 39834.9879483638

...

Gene family abundance is reported in RPK (reads per kilobase), a measure which normalises for gene length to reflect the actual copy number in the community. Gene families are reported in order of abundance from most abundant to least. The “UNMAPPED” reads are reads which remain unidentified after both the nucleotide and the translated protein alignment steps. It treats all unmapped reads as a gene of size 1 kilobase for calculating the RPK.

NoteChallenge

Have a look at the output above and try and answer these questions:

  1. What is the most abundant gene family?
  2. What is the abundance of the UNMAPPED reads?
CautionSolution
  1. UniRef90_C8C186
  2. 2,259,677 RPK

Tabular file with gene pathways and their abundance

This is a functional profile consisting of two columns: identified gene pathways and their abundances.

Code
less analysis/humann/SRR6820491_mrna_pathabundance.tsv

It should look a bit like this:

Output
# Pathway       SRR6820491_mrna_Abundance
UNMAPPED        644218.0056648480
UNINTEGRATED    369564.0910171091
UNINTEGRATED|unclassified       369564.0910171091
PWY-1042: glycolysis IV 5122.9174199680
PWY-1042: glycolysis IV|unclassified    5122.9174199680
PWY-5695: inosine 5'-phosphate degradation      2522.4955682189
PWY-5695: inosine 5'-phosphate degradation|unclassified 2522.4955682189
PWY-7221: guanosine ribonucleotides de novo biosynthesis        2393.1160596558
PWY-7221: guanosine ribonucleotides de novo biosynthesis|unclassified   2393.1160596558
ANAGLYCOLYSIS-PWY: glycolysis III (from glucose)        2219.7642712425
ANAGLYCOLYSIS-PWY: glycolysis III (from glucose)|unclassified   2219.7642712425
PWY-7208: superpathway of pyrimidine nucleobases salvage        2183.2076526440
PWY-7208: superpathway of pyrimidine nucleobases salvage|unclassified   2183.2076526440
PWY-7228: superpathway of guanosine nucleotides de novo biosynthesis I  2161.1939689820
PWY-7228: superpathway of guanosine nucleotides de novo biosynthesis I|unclassified     2161.1939689820
PWY-5100: pyruvate fermentation to acetate and lactate II       2124.4245452763
PWY-5100: pyruvate fermentation to acetate and lactate II|unclassified  2124.4245452763
PWY-5484: glycolysis II (from fructose 6-phosphate)     1959.7318701297
PWY-5484: glycolysis II (from fructose 6-phosphate)|unclassified        1959.7318701297
PWY-6609: adenine and adenosine salvage III     1889.9623784489
PWY-6609: adenine and adenosine salvage III|unclassified        1889.9623784489

...

Tabular file with gene pathways and their coverage

We shall not be using this file today, but this gives a qualitative measure that a specific metabolic pathway is present in the microbial community. The measure ranges from 0 (not present) to 1 (present)

Log file

Finally, our log file should have captured all the terminal messages Humann prints during its run. It’s worth checking to see if the run has ran smoothly. As we checked before, if it has you should see the following at the end:

Output
Output files created: /home/ubuntu/cs_course/analysis/humann/SRR6820491_mrna_genefamilies.tsv /home/ubuntu/cs_course/analysis/humann/SRR6820491_mrna_pathabundance.tsv /home/ubuntu/cs_course/analysis/humann/SRR6820491_mrna_pathcoverage.tsv

Normalising abundances

It’s worth noting that gene family abundances are measured in RPKs (Reads Per Kilobase), and so are normalised for read length but not sampling depth. This means that our output may still be biased by sampling depth. Whilst some analyses such as strain profiling may use RPKs, most of the time it is better to work with normalised abundances.

To normalise our data we can use the humann_renorm_table command, with the following options:

flag/option meaning
-i analysis/humann/SRR6820491_mrna_genefamilies.tsv specifies the HUMAnN gene family abundance table we want to normalise.
-u relab converts abundances into relative abundance values rather than absolute RPKs.
-m community tells HUMAnN to normalise abundances across the entire microbial community rather than within individual taxa.
--update-snames updates the sample names in the output file to indicate the new normalisation method.
-o analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv specifies the name of the output file containing the normalised abundances.

Making sure we’re still in our cs_course directory, let’s run the command on our gene families

Code
humann_renorm_table \
  -i analysis/humann/SRR6820491_mrna_genefamilies.tsv \
  -u relab \
  -m community \
  --update-snames \
  -o analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv

And again but with our pathway abundances (all we changed here was in the input -i and output -o):

Code
humann_renorm_table \
  -i analysis/humann/SRR6820491_mrna_pathabundance.tsv \
  -u relab \
  -m community \
  --update-snames \
  -o analysis/humann/SRR6820491_mrna_pathabundance_normalised.tsv

Take a look at the output using:

Code
less analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv
Output
# Gene Family   SRR6820491_mrna_Abundance-RELAB
UNMAPPED        0.589358
UniRef90_C8C186 0.0245722
UniRef90_C8C186|unclassified    0.0245722
UniRef90_A0A098YGS3     0.0235026
UniRef90_A0A098YGS3|unclassified        0.0235026
UniRef90_A6B9Y3 0.0216155
UniRef90_A6B9Y3|unclassified    0.0216155
UniRef90_A0A343J0E6     0.0197682
UniRef90_A0A343J0E6|unclassified        0.0197682
UniRef90_A0A141RML6     0.0192066
UniRef90_A0A141RML6|unclassified        0.0192066
UniRef90_A6BBS7 0.0182389
UniRef90_A6BBS7|unclassified    0.0182389
UniRef90_R0JKC3 0.0149926
UniRef90_R0JKC3|unclassified    0.0149926
UniRef90_A0A127SFT5     0.0141642
UniRef90_A0A127SFT5|unclassified        0.0141642
UniRef90_A0A127SHY8     0.0125883
UniRef90_A0A127SHY8|unclassified        0.0125883
UniRef90_Q79DR3 0.0106825
UniRef90_Q79DR3|unclassified    0.0106825
UniRef90_P62593 0.0103896

...

You’ll note the abundances are now relative (in %) rather than absolute RPKs. This will allow us to more reliably compare the gene expressions in our sample to other studies.

NoteChallenge

Have a look at the normalised gene families file.

  1. What percentage of sequences do not have an assigned gene family?
  2. What is the relative abundance of the most abundant gene family?
CautionSolution
  1. 58.93%
  2. 2.5%

Identifying gene families in pathways

Now we have normalised abundance data on our gene families we can go one step further and see which genes are involved in the pathways that Humann has identified. HUMAnN can provide this information using the pathway and gene family tables.

We will “unpack” the pathways to reveal the gene families involved. This adds another level of detail to the pathway abundance table.

We can generate a table linking pathways to the gene families that contribute to them using the humann_unpack_pathways. There’s a few options we have to specify:

flag/option meaning
--input-genes analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv specifies the normalised gene family abundance table.
--input-pathways analysis/humann/SRR6820491_mrna_pathabundance_normalised.tsv specifies the normalised pathway abundance table.
-o analysis/humann/unpacked_pathways_normalised.tsv tells HUMAnN where to write the unpacked pathway output.

Let’s run the command!

Code
humann_unpack_pathways \
  --input-genes analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv \
  --input-pathways analysis/humann/SRR6820491_mrna_pathabundance_normalised.tsv \
  -o analysis/humann/unpacked_pathways_normalised.tsv

This will produce a table that looks like this:

Code
less analysis/humann/unpacked_pathways_normalised.tsv
Output
# Pathway       SRR6820491_mrna_Abundance-RELAB
ANAEROFRUCAT-PWY: homolactic fermentation       0.00115042
ANAEROFRUCAT-PWY|unclassified   0.00115042
ANAEROFRUCAT-PWY|unclassified|UniRef90_F4LQU7   3.06411e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A376LF36       2.18255e-07
ANAEROFRUCAT-PWY|unclassified|UniRef90_D1L5J9   1.53932e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_F4LWJ5   3.82403e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_B5Y7Q7   0.000806969
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A0U9HE71       4.4773e-06
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A1B1YMK6       4.50368e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A1B1YMK2       4.01626e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A376RRT5       1.11938e-06
ANAEROFRUCAT-PWY|unclassified|UniRef90_M1YXF4   7.12165e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A1H3A0T4       2.5612e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A3DCA8   0.000505282
ANAEROFRUCAT-PWY|unclassified|UniRef90_B5Y768   0.000249563
ANAEROFRUCAT-PWY|unclassified|UniRef90_B5Y8V1   0.000376358
ANAEROFRUCAT-PWY|unclassified|UniRef90_K0FCI7   8.40225e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A3DBX9   4.01119e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A2X1LMF5       3.11421e-07
ANAEROFRUCAT-PWY|unclassified|UniRef90_R1AXJ4   4.86346e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_G8LY94   5.4377e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_B5Y840   0.00023678

...

From this table:

  • The pathway row shows the total community abundance for the pathway.

  • The species-stratified row shows how much each species contributes.

  • The gene family row shows the individual UniRef90 genes and their abundances.

It’s important to note that this output varies because humann_unpack_pathways uses a probabilistic approach to distribute pathway abundance among the many potential contributing genes in the “unclassified” pool. This means each time the command runs it may pick one of many equally contributing genes to appear in the pathway.

NoteChallenge
  1. In the results above (as your results may be slightly different!), which gene family contributes to the most abundant pathway with an associated family in this sample?
CautionSolution
  1. UniRef90_F4LQU7

Group gene families into GO terms

Our outputs contain many different gene IDs in long lists that can be extremely laborious and time-consuming to look up and analyse. Finding which ones are important or interesting can be a difficult task, but we can instead use GO terms to simplify this process.

Gene Ontology (GO) terms are high-level, structured, standardised, and species-agnostic functional descriptors. GO terms describe gene/protein function across three independent high-level categories:

  • Biological Process (pathways/events) e.g. cell division

  • Molecular Function (biochemical activity) e.g. kinase

  • Cellular Component (location) e.g. membrane

Using GO terms will reduce the complexity of functional annotation outputs and highlight the biological functions of interest in our analysis.

We can use HUMAnN’s regrouping utility humann_regroup_table to map gene families to GO terms. Let’s build the command.

flag/option meaning
-i analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv specifies the normalised gene family abundance table to regroup.
-g uniref90_go tells HUMAnN to regroup UniRef90 gene families into Gene Ontology (GO) categories.
-o analysis/humann/SRR6820491_mrna_go.tsv specifies the output file for the regrouped GO term abundances.
Code
humann_regroup_table \
  -i analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv \
  -g uniref90_go \
  -o analysis/humann/SRR6820491_mrna_go.tsv

After this has ran we can then turn the quite cryptic GO term IDs into human-readable titles using the renaming utility humann_rename_table.

flag/option meaning
-i analysis/humann/SRR6820491_mrna_go.tsv specifies the regrouped GO table to rename.
-n go tells HUMAnN to use the GO naming database for annotation.
-o analysis/humann/SRR6820491_mrna_go_renamed.tsv specifies the output file containing human-readable GO term names.
Code
humann_rename_table \
  -i analysis/humann/SRR6820491_mrna_go.tsv \
  -n go \
  -o analysis/humann/SRR6820491_mrna_go_renamed.tsv

Let’s inspect the output using:

Code
less analysis/humann/SRR6820491_mrna_go_renamed.tsv
Output
# Gene Family   SRR6820491_mrna_Abundance-RELAB
UNMAPPED        0.589358
UNGROUPED       0.006087598206
UNGROUPED|unclassified  0.006087598206
GO:0000015: [CC] phosphopyruvate hydratase complex      0.0007034955
GO:0000015: [CC] phosphopyruvate hydratase complex|unclassified 0.0007034955
GO:0000034: [MF] adenine deaminase activity     3.99883e-06
GO:0000034: [MF] adenine deaminase activity|unclassified        3.99883e-06
GO:0000049: [MF] tRNA binding   0.0031198066590000005
GO:0000049: [MF] tRNA binding|unclassified      0.0031198066590000005
GO:0000103: [BP] sulfate assimilation   2.56775e-05
GO:0000103: [BP] sulfate assimilation|unclassified      2.56775e-05
GO:0000105: [BP] histidine biosynthetic process 0.0014610406499999996
GO:0000105: [BP] histidine biosynthetic process|unclassified    0.0014610406499999996
GO:0000107: [MF] imidazoleglycerol-phosphate synthase activity  0.00021605209
GO:0000107: [MF] imidazoleglycerol-phosphate synthase activity|unclassified     0.00021605209
GO:0000155: [MF] phosphorelay sensor kinase activity    0.0005325315649999998
GO:0000155: [MF] phosphorelay sensor kinase activity|unclassified       0.0005325315649999998
GO:0000156: [MF] phosphorelay response regulator activity       3.5137059999999997e-05
GO:0000156: [MF] phosphorelay response regulator activity|unclassified  3.5137059999999997e-05

...
NoteChallenge

From our GO Terms outputs:

  1. Which of the GO terms is least abundant?
  2. Which of the GO terms related to Molecular Functions is the most abundant?
CautionSolution
  1. methanofuran biosynthetic process
  2. adenine deaminase activity
TipAnother note on interpreting an isolated Metatranscriptome

When analysing our RNA reads from a metatranscriptomics dataset by itself there is a crucial confounding factor we must consider.

Any given transcript could appear more abundant in one sample compared to another for one of two reasons:

  • The gene is more highly expressed in that sample

  • The copy number of that gene is higher in that sample

Without accompanying metagenomic data it is impossible to truly detect which one of these scenarios is true. This is a recurrent challenge in analysing any metatranscriptome in isolation.

The best case scenario would be to have a metagenomic DNA dataset from the same sample and run Metaphlan on that, then feed that taxonomic profile output into Humann (as metagenomic data is more reliable for quanitifying species). After that running Humann on both metagenomic and metatranscriptomic data will give us gene family copy numbers as well as expression levels, which together can then be used to normalise our dataset and provide a more reliable measure of expression abundance.

Because we do not have an attached metagenomic dataset here, we must be careful in our interpretation of our analysis!

Back to top
Functional Annotation
Downstream analysis of taxonomic and functional profiles

Licensed under CC-BY 4.0 2021–26 by Cloud-SPAN

  • Edit this page
  • Report an issue
Cookie Preferences