Extracting Functional Information

In the previous lesson we used Metaphlan to assess the taxonomic profile and discover which microbes are active in our sample. Now we know “who” is in our sample, it’s time to find out “what” they are doing, that is to say the functional information. To do so we shall use another piece of software called Humann.

What is Functional Profiling?

Functional profiling is the identification of the function of a microbiome. With metagenomic data (DNA) a functional profile will indicate which genes and gene pathways can be expressed, but in metatranscriptomics the analysis of mRNA we can identify which genes pathways are expressed. In other words, this lets us see which gene pathways are active in the microbiome, and how relatively active they are.

Gene pathways, sometimes called metabolic pathways, are collections of genes that code for enzymes that carry out certain biological functions such as carbohydrate metabolism, amino acid biosynthesis, stress response, or antibiotic resistance. Similarly, gene families are evolutionarily related genes that perform similar functions. Because of the natural genetic diversity of microbial life it is easier to search for gene families and pathways than individual genes.

Usually gene expression is measured in relative abundance rather than absolute as the sequenced RNA is only a subset of the true total biological pool. RNA extraction is never 100% efficient as molecules can be lost during cell lysis, purification, library prep, and sequencing prep. Sequencing depth is always limited, and so even RNA molecules present in the sequencing plate may not be detected if they are lower in count. As such absolute abundances will not necessarily be a meaningful measure, whereas the relative expression of gene pathways can be informative and useful for the comparison of function between samples or treatments.

What is Humann?

HUMAnN (The HMP Unified Metabolic Analysis Network) is another piece of software from the BioBakery suite of software as detailed in the previous lesson. It is a pipeline for accurate determination of either the abundance or presence/absence of microbial gene pathways in a community. It can use either DNA (metagenomic) or RNA (metatranscriptomic) data.

It works by finding the abundance of orthologous (performing approximately the same function) gene families in the data, using the KEGG Orthology (KO) catalog by default. It does this by aligning the transcript reads firstly against the Chocophlan database to get nucleotide-level hits, then any unaligned reads are translated into amino acid sequences and then aligned against a UniRef database of protein sequences.

It also finds gene pathways (defined as two or more genes) using KEGG Pathways or GO (Gene Ontology) terms. Abundance of each pathway (how many copies there are present in the data) as well as presence/absence can be outputted by Humann.

Running Humann

First of all let’s make a new directory to store out Humann results in.

Code

mkdir -p analysis/humann

Humann is a very extensive programme with many features. Let’s run:

Code

humann --help

Output — Humann help documentation

Output

$ humann --help
usage: humann [-h] -i <input.fastq> -o <output> [--threads <1>] [--version] [-r] [--bypass-nucleotide-index]
              [--bypass-nucleotide-search] [--bypass-prescreen] [--bypass-translated-search]
              [--taxonomic-profile <taxonomic_profile.tsv>] [--memory-use {minimum,maximum}]
              [--input-format {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}] [--search-mode {uniref50,uniref90}]
              [-v] [--metaphlan <metaphlan>] [--metaphlan-options <metaphlan_options>] [--prescreen-threshold <0.01>]
              [--bowtie2 <bowtie2>] [--bowtie-options <bowtie_options>] [--nucleotide-database <nucleotide_database>]
              [--nucleotide-identity-threshold <0.0>] [--nucleotide-query-coverage-threshold <90.0>]
              [--nucleotide-subject-coverage-threshold <50.0>] [--diamond <diamond>] [--diamond-options <diamond_options>]
              [--evalue <1.0>] [--protein-database <protein_database>] [--rapsearch <rapsearch>]
              [--translated-alignment {usearch,rapsearch,diamond}]
              [--translated-identity-threshold <Automatically: 50.0 or 80.0, Custom: 0.0-100.0>]
              [--translated-query-coverage-threshold <90.0>] [--translated-subject-coverage-threshold <50.0>] [--usearch <usearch>]
              [--gap-fill {on,off}] [--minpath {on,off}] [--pathways {metacyc,unipathway}]
              [--pathways-database <pathways_database.tsv>] [--xipe {on,off}] [--annotation-gene-index <3>]
              [--id-mapping <id_mapping.tsv>] [--remove-temp-output] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
              [--o-log <sample.log>] [--output-basename <sample_name>] [--output-format {tsv,biom}] [--output-max-decimals <10>]
              [--remove-column-description-output] [--remove-stratified-output]

HUMAnN : HMP Unified Metabolic Analysis Network 3

optional arguments:
  -h, --help            show this help message and exit

[0] Common settings:
  -i <input.fastq>, --input <input.fastq>
                        input file of type {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}
                        [REQUIRED]
  -o <output>, --output <output>
                        directory to write output files
                        [REQUIRED]
  --threads <1>         number of threads/processes
                        [DEFAULT: 1]
  --version             show program's version number and exit

[1] Workflow refinement:
  -r, --resume          bypass commands if the output files exist
  --bypass-nucleotide-index
                        bypass the nucleotide index step and run on the indexed ChocoPhlAn database
  --bypass-nucleotide-search
                        bypass the nucleotide search steps
  --bypass-prescreen    bypass the prescreen step and run on the full ChocoPhlAn database
  --bypass-translated-search
                        bypass the translated search step
  --taxonomic-profile <taxonomic_profile.tsv>
                        a taxonomic profile (the output file created by metaphlan)
                        [DEFAULT: file will be created]
  --memory-use {minimum,maximum}
                        the amount of memory to use
                        [DEFAULT: minimum]
  --input-format {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}
                        the format of the input file
                        [DEFAULT: format identified by software]
  --search-mode {uniref50,uniref90}
                        search for uniref50 or uniref90 gene families
                        [DEFAULT: based on translated database selected]
  -v, --verbose         additional output is printed

[2] Configure tier 1: prescreen:
  --metaphlan <metaphlan>
                        directory containing the MetaPhlAn software
                        [DEFAULT: $PATH]
  --metaphlan-options <metaphlan_options>
                        options to be provided to the MetaPhlAn software
                        [DEFAULT: "-t rel_ab"]
  --prescreen-threshold <0.01>
                        minimum percentage of reads matching a species
                        [DEFAULT: 0.01]

[3] Configure tier 2: nucleotide search:
  --bowtie2 <bowtie2>   directory containing the bowtie2 executable
                        [DEFAULT: $PATH]
  --bowtie-options <bowtie_options>
                        options to be provided to the bowtie software
                        [DEFAULT: "--very-sensitive"]
  --nucleotide-database <nucleotide_database>
                        directory containing the nucleotide database
                        [DEFAULT: /home/ubuntu/cs_course/data/chocophlanhumann/chocophlan]
  --nucleotide-identity-threshold <0.0>
                        identity threshold for nuclotide alignments
                        [DEFAULT: 0.0]
  --nucleotide-query-coverage-threshold <90.0>
                        query coverage threshold for nucleotide alignments
                        [DEFAULT: 90.0]
  --nucleotide-subject-coverage-threshold <50.0>
                        subject coverage threshold for nucleotide alignments
                        [DEFAULT: 50.0]

[3] Configure tier 2: translated search:
  --diamond <diamond>   directory containing the diamond executable
                        [DEFAULT: $PATH]
  --diamond-options <diamond_options>
                        options to be provided to the diamond software
                        [DEFAULT: "--top 1 --outfmt 6"]
  --evalue <1.0>        the evalue threshold to use with the translated search
                        [DEFAULT: 1.0]
  --protein-database <protein_database>
                        directory containing the protein database
                        [DEFAULT: /home/ubuntu/cs_course/data/chocophlanhumann/uniref/uniref]
  --rapsearch <rapsearch>
                        directory containing the rapsearch executable
                        [DEFAULT: $PATH]
  --translated-alignment {usearch,rapsearch,diamond}
                        software to use for translated alignment
                        [DEFAULT: diamond]
  --translated-identity-threshold <Automatically: 50.0 or 80.0, Custom: 0.0-100.0>
                        identity threshold for translated alignments
                        [DEFAULT: Tuned automatically (based on uniref mode) unless a custom value is specified]
  --translated-query-coverage-threshold <90.0>
                        query coverage threshold for translated alignments
                        [DEFAULT: 90.0]
  --translated-subject-coverage-threshold <50.0>
                        subject coverage threshold for translated alignments
                        [DEFAULT: 50.0]
  --usearch <usearch>   directory containing the usearch executable
                        [DEFAULT: $PATH]

[5] Gene and pathway quantification:
  --gap-fill {on,off}   turn on/off the gap fill computation
                        [DEFAULT: on]
  --minpath {on,off}    turn on/off the minpath computation
                        [DEFAULT: on]
  --pathways {metacyc,unipathway}
                        the database to use for pathway computations
                        [DEFAULT: metacyc]
  --pathways-database <pathways_database.tsv>
                        mapping file (or files, at most two in a comma-delimited list) to use for pathway computations
                        [DEFAULT: metacyc database ]
  --xipe {on,off}       turn on/off the xipe computation
                        [DEFAULT: off]
  --annotation-gene-index <3>
                        the index of the gene in the sequence annotation
                        [DEFAULT: 3]
  --id-mapping <id_mapping.tsv>
                        id mapping file for alignments
                        [DEFAULT: alignment reference used]

[6] More output configuration:
  --remove-temp-output  remove temp output files
                        [DEFAULT: temp files are not removed]
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        level of messages to display in log
                        [DEFAULT: DEBUG]
  --o-log <sample.log>  log file
                        [DEFAULT: temp/sample.log]
  --output-basename <sample_name>
                        the basename for the output files
                        [DEFAULT: input file basename]
  --output-format {tsv,biom}
                        the format of the output files
                        [DEFAULT: tsv]
  --output-max-decimals <10>
                        the number of decimals to output
                        [DEFAULT: 10]
  --remove-column-description-output
                        remove the description in the output column
                        [DEFAULT: output column includes description]
  --remove-stratified-output
                        remove stratification from output
                        [DEFAULT: output is stratified]

Although there are many different possible ways we can use Humann, the core arguments are very simple:

--input tells it where our sorted mRNA input is
--output tells it where to output the files

We will also include a few more

--threads 8 tells it to use all 8 CPU cores of our instance
--taxonomic-profile tells it were the output of our Metaphlan run is
&> analysis/humann/log.txt & to get the terminal to redirect messages to a log

And so, the command we are going to run Humann with looks like this:

Code

humann --input analysis/sortmerna/SRR6820491_mrna.fq \
--output analysis/humann/ \
--taxonomic-profile analysis/metaphlan/metaphlan_output.txt \
--threads 8 \
&> analysis/humann/log.txt &

As we are looking at community function we no longer need the rRNA sequences as they do not contain gene coding regions and so will slow the run down considerably. This means we shall be using the filtered output of SortMeRNA.

In order to speed up the run further we feed our Metaphlan output into the command with --taxonomic-profile, which tells Humann to only look at database entries relevant to the taxa we identified with Metaphlan.

For space-saving and efficiency we are also using the UniRef90 EC filtered database, a version which only contains proteins associated with a level-4 enzyme commision category (ones with precisely identified substrates). Whilst this may not identify a broader number of transcribed proteins, it will be more informative for pathway analysis.

This command will take up to a few hours to run, but hopefully you managed to run it at the end of last session!

Note

The UniRef databases can be extremely large, with over 100GB being standard. Our instances are not quite large enough to accommodate the full size database we would usually need. This is because for costing purposes only we’re working on slightly smaller instances for this course. It is entirely possible to expand the disk space of an AWS instance, and during your own analyses we recommend about 1TB of storage (or more if you are processing more than ~20 samples).

Humann outputs

Humann will produce five outputs:

A directory of temporary files

This will contain multiple files generated by humann during the analysis, including aligned reads, intermediate mapping files, and a log file. Some other pieces of software clear their temporary files afterwards, but Humann leaves them intact. This can be useful for troubleshooting, but takes up a large portion of you storage space. If you are happy the run is complete you can delete these files to save space

Tabular file with gene families and abundances

This is a functional profile consisting of two columns: the identified gene families, and their abundances. We can inspect it with:

Code

less analysis/humann/SRR6820491_mrna_genefamilies.tsv

It should look a bit like this:

Output

\# Gene Family SRR6820491_mrna_Abundance-RPKs\
UNMAPPED 2255586.0000000000\
UniRef90_C8C186 93805.0748916235\
UniRef90_A0A098YGS3 89602.4872409256\
UniRef90_A6B9Y3 82482.4155008284\
UniRef90_A0A343J0E6 75487.8135721117\
UniRef90_A0A141RML6 73343.1679686251\
UniRef90_A6BBS7 69594.8394299401\
UniRef90_R0JKC3 57188.9820833425\
UniRef90_A0A127SFT5 54088.4634226876

...

Gene family abundance is reported in RPK (reads per kilobase), a measure which normalises for gene length to reflect the actual copy number in the community. Gene families are reported in order of abundance from most abundant to least. The “UNMAPPED” reads are reads which remain unidentified after both the nucleotide and the translated protein alignment steps. It treats all unmapped reads as a gene of size 1 kilobase for calculating the RPK.

Note

Have a look at the output above and try and answer these questions:

What is the most abundant gene family?
What is the abundance of the UNMAPPED reads?

Solution

UniRef90_C8C186
2,255,586 RPK

Tabular file with gene pathways and their abundance

This is a functional profile consisting of two columns: identified gene pathways and their abundances.

Code

less analysis/humann/SRR6820491_mrna_pathabundance.tsv

It should look a bit like this:

Output

#Pathway SRR6820491_mrna_Abundance\
UNMAPPED 643040.1584233012\
UNINTEGRATED 372056.7564164627\
UNINTEGRATED\|unclassified 372056.7564164627\
PWY-1042: glycolysis IV 5122.3409693358\
PWY-1042: glycolysis IV\|unclassified 5122.3409693358\
PWY-5695: inosine 5'-phosphate degradation 2522.4660551855\
PWY-5695: inosine 5'-phosphate degradation\|unclassified 2522.4660551855\
PWY-7221: guanosine ribonucleotides de novo biosynthesis 2393.1160596558\
PWY-7221: guanosine ribonucleotides de novo biosynthesis\|unclassified 2393.1160596558\
ANAGLYCOLYSIS-PWY: glycolysis III (from glucose) 2219.7232237331\
ANAGLYCOLYSIS-PWY: glycolysis III (from glucose)\|unclassified 2219.7232237331

...

Tabular file with gene pathways and their coverage

We shall not be using this file today, but this gives a qualitative measure that a specific metabolic pathway is present in the microbial community. The measure ranges from 0 (not present) to 1 (present)

Log file

Finally, our log file should have captured all the terminal messages Humann prints during its run. It’s worth checking to see if the run has ran smoothly. If it has you should see the following at the end:

Output

Output files created: /home/ubuntu/cs_course/analysis/humann/SRR6820491_mrna_genefamilies.tsv /home/ubuntu/cs_course/analysis/humann/SRR6820491_mrna_pathabundance.tsv /home/ubuntu/cs_course/analysis/humann/SRR6820491_mrna_pathcoverage.tsv

Normalising abundances

It’s worth noting that gene family abundances are measured in RPKs and so are normalised for read length but not sampling depth. This means that our output may still be biased by sampling depth. Whilst some analyses such as strain profiling may use RPKs, most of the time it is better to work with normalised abundances.

To normalise our data we can use the following command:

Code

humann_renorm_table -i analysis/humann/SRR6820491_mrna_genefamilies.tsv -u relab -m community --update-snames -o analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv

And the same for pathway abundances:

Code

humann_renorm_table -i analysis/humann/SRR6820491_mrna_pathabundance.tsv -u relab -m community --update-snames -o analysis/humann/SRR6820491_mrna_pathabundance_normalised.tsv

Take a look at the output using:

Code

less analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv

Output

#Gene Family SRR6820491_mrna_Abundance-RELAB 
UNMAPPED 0.587576 
UniRef90_C8C186 0.024436 
UniRef90_A0A098YGS3 0.0233413 
UniRef90_A6B9Y3 0.0214865 
UniRef90_A0A343J0E6 0.0196644 
UniRef90_A0A141RML6 0.0191057 
UniRef90_A6BBS7 0.0181293 
UniRef90_R0JKC3 0.0148976 
...

You’ll note the abundances are now relative (in %) rather than absolute RPKs. This will allow us to more reliably compare the gene expressions in our sample to other studies.

Note

Have a look at the normalised gene families file.

What percentage of sequences do not have an assigned gene family?
What is the relative abundance of the most abundant gene family?

Solution

58.76%
2.4%

Identifying gene families in pathways

Now we have normalised abundance data on our gene families we can go one step further and see which genes are involved in the pathways that Humann has identified. HUMAnN can provide this information using the pathway and gene family tables.

We will “unpack” the pathways to reveal the gene families involved. This adds another level of detail to the pathway abundance table.

We can generate a table linking pathways to the gene families that contribute to them using the “unpack” function like so:

Code

humann_unpack_pathways --input-genes analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv --input-pathways analysis/humann/SRR6820491_mrna_pathabundance_normalised.tsv -o analysis/humann/unpacked_pathways_normalised.tsv

This will produce a table that looks like this:

Output

#Pathway SRR6820491_mrna_Abundance-RELAB\
ANAEROFRUCAT-PWY: homolactic fermentation 0.00114903\
ANAEROFRUCAT-PWY\|unclassified 0.00114903\
ANAEROFRUCAT-PWY\|unclassified\|UniRef90_B5Y768 0.000249259\
ANAEROFRUCAT-PWY\|unclassified\|UniRef90_M1ZLX7 1.74048e-05\
ANAEROFRUCAT-PWY\|unclassified\|UniRef90_R1AXJ4 4.85755e-05\
ANAEROFRUCAT-PWY\|unclassified\|UniRef90_D9PUH5 3.73373e-06\
ANAEROFRUCAT-PWY\|unclassified\|UniRef90_A0A1M6IZD8 2.86919e-05\
ANAEROFRUCAT-PWY\|unclassified\|UniRef90_B5Y8I1 0.000413006\
ANAEROFRUCAT-PWY\|unclassified\|UniRef90_F4LQU7 3.06039e-05\
ANAEROFRUCAT-PWY\|unclassified\|UniRef90_B5Y8B9 0.000497463\
ANAEROFRUCAT-PWY\|unclassified\|UniRef90_A0A2X1LMF5 3.11042e-07\
...

From this table:

The pathway row shows the total community abundance for the pathway.
The species-stratified row shows how much each species contributes.
The gene family row shows the individual UniRef90 genes and their abundances.

Note

Which gene family contributes to the most abundant pathway with an associated family in this sample?

Solution

UniRef90_B5Y768

Group gene families into GO terms

Our outputs contain many different gene IDs in long lists that can be extremely laborious and time-consuming to look up and analyse. Finding which ones are important or interesting can be a difficult task, but we can instead use GO terms to simplify this process.

Gene Ontology (GO) terms are high-level, structured, standardised, and species-agnostic functional descriptors. GO terms describe gene/protein function across three independent high-level categories:

Biological Process (pathways/events) e.g. cell division
Molecular Function (biochemical activity) e.g. kinase
Cellular Component (location) e.g. membrane

Using GO terms will reduce the complexity of functional annotation outputs and highlight the biological functions of interest in our analysis.

We can use HUMAnN’s regrouping utility to map gene families to GO terms. The command looks like this:

Code

humann_regroup_table -i analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv -g uniref90_go -o analysis/humann/SRR6820491_mrna_go.tsv

After this has ran we can then turn the quite cryptic GO term IDs into human-readable titles using the renaming utility:

Code

humann_rename_table -i analysis/humann/SRR6820491_mrna_go.tsv -n go -o analysis/humann/SRR6820491_mrna_go_renamed.tsv

Let’s inspect the output using:

Code

less analysis/humann/SRR6820491_mrna_go_renamed.tsv

Note

From our GO Terms outputs:

Which of the GO terms is least abundant?
Which of the GO terms related to Molecular Functions is the most abundant?

Solution

methanofuran biosynthetic process
adenine deaminase activity

Another note on interpreting an isolated Metatranscriptome

When analysing our RNA reads from a metatranscriptomics dataset by itself there is a crucial confounding factor we must consider.

Any given transcript could appear more abundant in one sample compared to another for one of two reasons:

The gene is more highly expressed in that sample
The copy number of that gene is higher in that sample

Without accompanying metagenomic data it is impossible to truly detect which one of these scenarios is true. This is a recurrent challenge in analysing any metatranscriptome in isolation.

The best case scenario would be to have a metagenomic DNA dataset from the same sample and run Metaphlan on that, then feed that taxonomic profile output into Humann (as metagenomic data is more reliable for quanitifying species). After that running Humann on both metagenomic and metatranscriptomic data will give us gene family copy numbers as well as expression levels, which together can then be used to normalise our dataset and provide a more reliable measure of expression abundance.

Because we do not have an attached metagenomic dataset here, we must be careful in our interpretation of our analysis!