Extracting Functional Information
In the previous lesson we used Metaphlan to assess the taxonomic profile and discover which microbes are active in our sample. Now we know “who” is in our sample, it’s time to find out “what” they are doing, that is to say the functional information. To do so we shall use another piece of software called Humann.
What is Functional Profiling?
Functional profiling is the identification of the function of a microbiome. With metagenomic data (DNA) a functional profile will indicate which genes and gene pathways can be expressed, but in metatranscriptomics the analysis of mRNA we can identify which genes pathways are expressed. In other words, this lets us see which gene pathways are active in the microbiome, and how relatively active they are.
Gene pathways, sometimes called metabolic pathways, are collections of genes that code for enzymes that carry out certain biological functions such as carbohydrate metabolism, amino acid biosynthesis, stress response, or antibiotic resistance. Similarly, gene families are evolutionarily related genes that perform similar functions. Because of the natural genetic diversity of microbial life it is easier to search for gene families and pathways than individual genes.
Usually gene expression is measured in relative abundance rather than absolute as the sequenced RNA is only a subset of the true total biological pool. RNA extraction is never 100% efficient as molecules can be lost during cell lysis, purification, library prep, and sequencing prep. Sequencing depth is always limited, and so even RNA molecules present in the sequencing plate may not be detected if they are lower in count. As such absolute abundances will not necessarily be a meaningful measure, whereas the relative expression of gene pathways can be informative and useful for the comparison of function between samples or treatments.
What is Humann?
HUMAnN (The HMP Unified Metabolic Analysis Network) is another piece of software from the BioBakery suite of software as detailed in the previous lesson. It is a pipeline for accurate determination of either the abundance or presence/absence of microbial gene pathways in a community. It can use either DNA (metagenomic) or RNA (metatranscriptomic) data.
It works by finding the abundance of orthologous (performing approximately the same function) gene families in the data, using the KEGG Orthology (KO) catalog by default. It does this by aligning the transcript reads firstly against the Chocophlan database to get nucleotide-level hits, then any unaligned reads are translated into amino acid sequences and then aligned against a UniRef database of protein sequences.
It also finds gene pathways (defined as two or more genes) using KEGG Pathways or GO (Gene Ontology) terms. Abundance of each pathway (how many copies there are present in the data) as well as presence/absence can be outputted by Humann.
Running Humann
First of all let’s make sure we’re in our working directory of cs_course.
Code
cd /home/csuser/cs_courseThen let’s make a new directory to store out Humann results in.
Code
mkdir -p analysis/humannHumann is a very extensive programme with many features. Let’s run the help function to see what it can do:
Code
humann --helpOutput
$ humann --help
usage: humann [-h] -i <input.fastq> -o <output> [--threads <1>] [--version] [-r] [--bypass-nucleotide-index]
[--bypass-nucleotide-search] [--bypass-prescreen] [--bypass-translated-search]
[--taxonomic-profile <taxonomic_profile.tsv>] [--memory-use {minimum,maximum}]
[--input-format {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}] [--search-mode {uniref50,uniref90}]
[-v] [--metaphlan <metaphlan>] [--metaphlan-options <metaphlan_options>] [--prescreen-threshold <0.01>]
[--bowtie2 <bowtie2>] [--bowtie-options <bowtie_options>] [--nucleotide-database <nucleotide_database>]
[--nucleotide-identity-threshold <0.0>] [--nucleotide-query-coverage-threshold <90.0>]
[--nucleotide-subject-coverage-threshold <50.0>] [--diamond <diamond>] [--diamond-options <diamond_options>]
[--evalue <1.0>] [--protein-database <protein_database>] [--rapsearch <rapsearch>]
[--translated-alignment {usearch,rapsearch,diamond}]
[--translated-identity-threshold <Automatically: 50.0 or 80.0, Custom: 0.0-100.0>]
[--translated-query-coverage-threshold <90.0>] [--translated-subject-coverage-threshold <50.0>] [--usearch <usearch>]
[--gap-fill {on,off}] [--minpath {on,off}] [--pathways {metacyc,unipathway}]
[--pathways-database <pathways_database.tsv>] [--xipe {on,off}] [--annotation-gene-index <3>]
[--id-mapping <id_mapping.tsv>] [--remove-temp-output] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
[--o-log <sample.log>] [--output-basename <sample_name>] [--output-format {tsv,biom}] [--output-max-decimals <10>]
[--remove-column-description-output] [--remove-stratified-output]
HUMAnN : HMP Unified Metabolic Analysis Network 3
optional arguments:
-h, --help show this help message and exit
[0] Common settings:
-i <input.fastq>, --input <input.fastq>
input file of type {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}
[REQUIRED]
-o <output>, --output <output>
directory to write output files
[REQUIRED]
--threads <1> number of threads/processes
[DEFAULT: 1]
--version show program's version number and exit
[1] Workflow refinement:
-r, --resume bypass commands if the output files exist
--bypass-nucleotide-index
bypass the nucleotide index step and run on the indexed ChocoPhlAn database
--bypass-nucleotide-search
bypass the nucleotide search steps
--bypass-prescreen bypass the prescreen step and run on the full ChocoPhlAn database
--bypass-translated-search
bypass the translated search step
--taxonomic-profile <taxonomic_profile.tsv>
a taxonomic profile (the output file created by metaphlan)
[DEFAULT: file will be created]
--memory-use {minimum,maximum}
the amount of memory to use
[DEFAULT: minimum]
--input-format {fastq,fastq.gz,fasta,fasta.gz,sam,bam,blastm8,genetable,biom}
the format of the input file
[DEFAULT: format identified by software]
--search-mode {uniref50,uniref90}
search for uniref50 or uniref90 gene families
[DEFAULT: based on translated database selected]
-v, --verbose additional output is printed
[2] Configure tier 1: prescreen:
--metaphlan <metaphlan>
directory containing the MetaPhlAn software
[DEFAULT: $PATH]
--metaphlan-options <metaphlan_options>
options to be provided to the MetaPhlAn software
[DEFAULT: "-t rel_ab"]
--prescreen-threshold <0.01>
minimum percentage of reads matching a species
[DEFAULT: 0.01]
[3] Configure tier 2: nucleotide search:
--bowtie2 <bowtie2> directory containing the bowtie2 executable
[DEFAULT: $PATH]
--bowtie-options <bowtie_options>
options to be provided to the bowtie software
[DEFAULT: "--very-sensitive"]
--nucleotide-database <nucleotide_database>
directory containing the nucleotide database
[DEFAULT: /home/ubuntu/cs_course/data/chocophlanhumann/chocophlan]
--nucleotide-identity-threshold <0.0>
identity threshold for nuclotide alignments
[DEFAULT: 0.0]
--nucleotide-query-coverage-threshold <90.0>
query coverage threshold for nucleotide alignments
[DEFAULT: 90.0]
--nucleotide-subject-coverage-threshold <50.0>
subject coverage threshold for nucleotide alignments
[DEFAULT: 50.0]
[3] Configure tier 2: translated search:
--diamond <diamond> directory containing the diamond executable
[DEFAULT: $PATH]
--diamond-options <diamond_options>
options to be provided to the diamond software
[DEFAULT: "--top 1 --outfmt 6"]
--evalue <1.0> the evalue threshold to use with the translated search
[DEFAULT: 1.0]
--protein-database <protein_database>
directory containing the protein database
[DEFAULT: /home/ubuntu/cs_course/data/chocophlanhumann/uniref/uniref]
--rapsearch <rapsearch>
directory containing the rapsearch executable
[DEFAULT: $PATH]
--translated-alignment {usearch,rapsearch,diamond}
software to use for translated alignment
[DEFAULT: diamond]
--translated-identity-threshold <Automatically: 50.0 or 80.0, Custom: 0.0-100.0>
identity threshold for translated alignments
[DEFAULT: Tuned automatically (based on uniref mode) unless a custom value is specified]
--translated-query-coverage-threshold <90.0>
query coverage threshold for translated alignments
[DEFAULT: 90.0]
--translated-subject-coverage-threshold <50.0>
subject coverage threshold for translated alignments
[DEFAULT: 50.0]
--usearch <usearch> directory containing the usearch executable
[DEFAULT: $PATH]
[5] Gene and pathway quantification:
--gap-fill {on,off} turn on/off the gap fill computation
[DEFAULT: on]
--minpath {on,off} turn on/off the minpath computation
[DEFAULT: on]
--pathways {metacyc,unipathway}
the database to use for pathway computations
[DEFAULT: metacyc]
--pathways-database <pathways_database.tsv>
mapping file (or files, at most two in a comma-delimited list) to use for pathway computations
[DEFAULT: metacyc database ]
--xipe {on,off} turn on/off the xipe computation
[DEFAULT: off]
--annotation-gene-index <3>
the index of the gene in the sequence annotation
[DEFAULT: 3]
--id-mapping <id_mapping.tsv>
id mapping file for alignments
[DEFAULT: alignment reference used]
[6] More output configuration:
--remove-temp-output remove temp output files
[DEFAULT: temp files are not removed]
--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
level of messages to display in log
[DEFAULT: DEBUG]
--o-log <sample.log> log file
[DEFAULT: temp/sample.log]
--output-basename <sample_name>
the basename for the output files
[DEFAULT: input file basename]
--output-format {tsv,biom}
the format of the output files
[DEFAULT: tsv]
--output-max-decimals <10>
the number of decimals to output
[DEFAULT: 10]
--remove-column-description-output
remove the description in the output column
[DEFAULT: output column includes description]
--remove-stratified-output
remove stratification from output
[DEFAULT: output is stratified]| flag/option | meaning |
|---|---|
--input analysis/sortmerna/SRR6820491_mrna.fq |
specifies the input file containing our filtered mRNA reads from SortMeRNA. We use the filtered reads here because rRNA sequences do not contain protein-coding information and would slow down functional annotation considerably. |
--output analysis/humann/ |
tells HUMAnN where to write all output files generated during the run. |
--taxonomic-profile analysis/metaphlan/metaphlan_output.txt |
provides the MetaPhlAn taxonomic profile so HUMAnN only searches database entries relevant to the taxa already identified. This greatly improves runtime efficiency. |
--threads 8 |
tells HUMAnN to use all 8 CPU cores available on our instance. |
&> analysis/humann/log.txt & |
redirects all terminal output to a log file and runs the process in the background. |
And so, the command we are going to run Humann with looks like this:
Code
humann --input analysis/sortmerna/SRR6820491_mrna.fq \
--output analysis/humann/ \
--taxonomic-profile analysis/metaphlan/metaphlan_output.txt \
--threads 8 \
&> analysis/humann/log.txt &As we are looking at community function we no longer need the rRNA sequences as they do not contain gene coding regions and so will slow the run down considerably. This means we shall be using the filtered output of SortMeRNA.
In order to speed up the run further we feed our Metaphlan output into the command with --taxonomic-profile, which tells Humann to only look at database entries relevant to the taxa we identified with Metaphlan.
For space-saving and efficiency we are also using the UniRef90 EC filtered database, a version which only contains proteins associated with a level-4 enzyme commision category (ones with precisely identified substrates). Whilst this may not identify a broader number of transcribed proteins, it will be more informative for pathway analysis.
This command will take up to a few hours to run, but hopefully you managed to run it at the end of last session!
The UniRef databases can be extremely large, with over 100GB being standard. Our instances are not quite large enough to accommodate the full size database we would usually need. This is because for costing purposes only we’re working on slightly smaller instances for this course. It is entirely possible to expand the disk space of an AWS instance, and during your own analyses we recommend about 1TB of storage (or more if you are processing more than ~20 samples).
Humann outputs
We can check that Humann has ran correctly by checking the log like so:
Code
less analysis/humann/log.txtOutput
Output files will be written to: /home/csuser/cs_course/analysis/humann
Removing spaces from identifiers in input file ...
Total species selected from prescreen: 0
Selected species explain 0.00% of predicted community composition
No species were selected from the prescreen.
Because of this the custom ChocoPhlAn database is empty.
This will result in zero species-specific gene families and pathways.
Running diamond ........
Aligning to reference database: uniref90_201901b_ec_filtered.dmnd
Total bugs after translated alignment: 1
unclassified: 10677087 hits
Total gene families after translated alignment: 3140
Unaligned reads after translated alignment: 35.3608922734 %
Computing gene families ...
Computing pathways abundance and coverage ...
Output files created:
/home/csuser/cs_course/analysis/humann/SRR6820491_mrna_genefamilies.tsv
/home/csuser/cs_course/analysis/humann/SRR6820491_mrna_pathabundance.tsv
/home/csuser/cs_course/analysis/humann/SRR6820491_mrna_pathcoverage.tsvIf your log looks like this then congratulations! Humann has ran and successfully outputted the results.
Humann will produce five outputs:
Code
ls -F analysis/humann/Output
log.txt
SRR6820491_mrna_pathabundance.tsv
SRR6820491_mrna_genefamilies.tsv
SRR6820491_mrna_pathcoverage.tsv
SRR6820491_mrna_humann_temp/A directory of temporary files
This will contain multiple files generated by humann during the analysis, including aligned reads, intermediate mapping files, and a log file. Some other pieces of software clear their temporary files afterwards, but Humann leaves them intact. This can be useful for troubleshooting, but takes up a large portion of you storage space. If you are happy the run is complete you can delete these files to save space
Tabular file with gene families and abundances
This is a functional profile consisting of two columns: the identified gene families, and their abundances. We can inspect it with:
Code
less analysis/humann/SRR6820491_mrna_genefamilies.tsvIt should look a bit like this:
Output
# Gene Family SRR6820491_mrna_Abundance-RPKs
UNMAPPED 2259677.0000000000
UniRef90_C8C186 94213.0855549303
UniRef90_C8C186|unclassified 94213.0855549303
UniRef90_A0A098YGS3 90112.0680509912
UniRef90_A0A098YGS3|unclassified 90112.0680509912
UniRef90_A6B9Y3 82876.7853519426
UniRef90_A6B9Y3|unclassified 82876.7853519426
UniRef90_A0A343J0E6 75794.0211237180
UniRef90_A0A343J0E6|unclassified 75794.0211237180
UniRef90_A0A141RML6 73640.8387018042
UniRef90_A0A141RML6|unclassified 73640.8387018042
UniRef90_A6BBS7 69930.2570132110
UniRef90_A6BBS7|unclassified 69930.2570132110
UniRef90_R0JKC3 57483.5504630520
UniRef90_R0JKC3|unclassified 57483.5504630520
UniRef90_A0A127SFT5 54307.5293219123
UniRef90_A0A127SFT5|unclassified 54307.5293219123
UniRef90_A0A127SHY8 48265.0559379932
UniRef90_A0A127SHY8|unclassified 48265.0559379932
UniRef90_Q79DR3 40958.2935396859
UniRef90_Q79DR3|unclassified 40958.2935396859
UniRef90_P62593 39834.9879483638
...Gene family abundance is reported in RPK (reads per kilobase), a measure which normalises for gene length to reflect the actual copy number in the community. Gene families are reported in order of abundance from most abundant to least. The “UNMAPPED” reads are reads which remain unidentified after both the nucleotide and the translated protein alignment steps. It treats all unmapped reads as a gene of size 1 kilobase for calculating the RPK.
Have a look at the output above and try and answer these questions:
- What is the most abundant gene family?
- What is the abundance of the UNMAPPED reads?
- UniRef90_C8C186
- 2,259,677 RPK
Tabular file with gene pathways and their abundance
This is a functional profile consisting of two columns: identified gene pathways and their abundances.
Code
less analysis/humann/SRR6820491_mrna_pathabundance.tsvIt should look a bit like this:
Output
# Pathway SRR6820491_mrna_Abundance
UNMAPPED 644218.0056648480
UNINTEGRATED 369564.0910171091
UNINTEGRATED|unclassified 369564.0910171091
PWY-1042: glycolysis IV 5122.9174199680
PWY-1042: glycolysis IV|unclassified 5122.9174199680
PWY-5695: inosine 5'-phosphate degradation 2522.4955682189
PWY-5695: inosine 5'-phosphate degradation|unclassified 2522.4955682189
PWY-7221: guanosine ribonucleotides de novo biosynthesis 2393.1160596558
PWY-7221: guanosine ribonucleotides de novo biosynthesis|unclassified 2393.1160596558
ANAGLYCOLYSIS-PWY: glycolysis III (from glucose) 2219.7642712425
ANAGLYCOLYSIS-PWY: glycolysis III (from glucose)|unclassified 2219.7642712425
PWY-7208: superpathway of pyrimidine nucleobases salvage 2183.2076526440
PWY-7208: superpathway of pyrimidine nucleobases salvage|unclassified 2183.2076526440
PWY-7228: superpathway of guanosine nucleotides de novo biosynthesis I 2161.1939689820
PWY-7228: superpathway of guanosine nucleotides de novo biosynthesis I|unclassified 2161.1939689820
PWY-5100: pyruvate fermentation to acetate and lactate II 2124.4245452763
PWY-5100: pyruvate fermentation to acetate and lactate II|unclassified 2124.4245452763
PWY-5484: glycolysis II (from fructose 6-phosphate) 1959.7318701297
PWY-5484: glycolysis II (from fructose 6-phosphate)|unclassified 1959.7318701297
PWY-6609: adenine and adenosine salvage III 1889.9623784489
PWY-6609: adenine and adenosine salvage III|unclassified 1889.9623784489
...Tabular file with gene pathways and their coverage
We shall not be using this file today, but this gives a qualitative measure that a specific metabolic pathway is present in the microbial community. The measure ranges from 0 (not present) to 1 (present)
Log file
Finally, our log file should have captured all the terminal messages Humann prints during its run. It’s worth checking to see if the run has ran smoothly. As we checked before, if it has you should see the following at the end:
Output
Output files created: /home/ubuntu/cs_course/analysis/humann/SRR6820491_mrna_genefamilies.tsv /home/ubuntu/cs_course/analysis/humann/SRR6820491_mrna_pathabundance.tsv /home/ubuntu/cs_course/analysis/humann/SRR6820491_mrna_pathcoverage.tsvNormalising abundances
It’s worth noting that gene family abundances are measured in RPKs (Reads Per Kilobase), and so are normalised for read length but not sampling depth. This means that our output may still be biased by sampling depth. Whilst some analyses such as strain profiling may use RPKs, most of the time it is better to work with normalised abundances.
To normalise our data we can use the humann_renorm_table command, with the following options:
| flag/option | meaning |
|---|---|
-i analysis/humann/SRR6820491_mrna_genefamilies.tsv |
specifies the HUMAnN gene family abundance table we want to normalise. |
-u relab |
converts abundances into relative abundance values rather than absolute RPKs. |
-m community |
tells HUMAnN to normalise abundances across the entire microbial community rather than within individual taxa. |
--update-snames |
updates the sample names in the output file to indicate the new normalisation method. |
-o analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv |
specifies the name of the output file containing the normalised abundances. |
Making sure we’re still in our cs_course directory, let’s run the command on our gene families
Code
humann_renorm_table \
-i analysis/humann/SRR6820491_mrna_genefamilies.tsv \
-u relab \
-m community \
--update-snames \
-o analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsvAnd again but with our pathway abundances (all we changed here was in the input -i and output -o):
Code
humann_renorm_table \
-i analysis/humann/SRR6820491_mrna_pathabundance.tsv \
-u relab \
-m community \
--update-snames \
-o analysis/humann/SRR6820491_mrna_pathabundance_normalised.tsvTake a look at the output using:
Code
less analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsvOutput
# Gene Family SRR6820491_mrna_Abundance-RELAB
UNMAPPED 0.589358
UniRef90_C8C186 0.0245722
UniRef90_C8C186|unclassified 0.0245722
UniRef90_A0A098YGS3 0.0235026
UniRef90_A0A098YGS3|unclassified 0.0235026
UniRef90_A6B9Y3 0.0216155
UniRef90_A6B9Y3|unclassified 0.0216155
UniRef90_A0A343J0E6 0.0197682
UniRef90_A0A343J0E6|unclassified 0.0197682
UniRef90_A0A141RML6 0.0192066
UniRef90_A0A141RML6|unclassified 0.0192066
UniRef90_A6BBS7 0.0182389
UniRef90_A6BBS7|unclassified 0.0182389
UniRef90_R0JKC3 0.0149926
UniRef90_R0JKC3|unclassified 0.0149926
UniRef90_A0A127SFT5 0.0141642
UniRef90_A0A127SFT5|unclassified 0.0141642
UniRef90_A0A127SHY8 0.0125883
UniRef90_A0A127SHY8|unclassified 0.0125883
UniRef90_Q79DR3 0.0106825
UniRef90_Q79DR3|unclassified 0.0106825
UniRef90_P62593 0.0103896
...You’ll note the abundances are now relative (in %) rather than absolute RPKs. This will allow us to more reliably compare the gene expressions in our sample to other studies.
Have a look at the normalised gene families file.
- What percentage of sequences do not have an assigned gene family?
- What is the relative abundance of the most abundant gene family?
- 58.93%
- 2.5%
Identifying gene families in pathways
Now we have normalised abundance data on our gene families we can go one step further and see which genes are involved in the pathways that Humann has identified. HUMAnN can provide this information using the pathway and gene family tables.
We will “unpack” the pathways to reveal the gene families involved. This adds another level of detail to the pathway abundance table.
We can generate a table linking pathways to the gene families that contribute to them using the humann_unpack_pathways. There’s a few options we have to specify:
| flag/option | meaning |
|---|---|
--input-genes analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv |
specifies the normalised gene family abundance table. |
--input-pathways analysis/humann/SRR6820491_mrna_pathabundance_normalised.tsv |
specifies the normalised pathway abundance table. |
-o analysis/humann/unpacked_pathways_normalised.tsv |
tells HUMAnN where to write the unpacked pathway output. |
Let’s run the command!
Code
humann_unpack_pathways \
--input-genes analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv \
--input-pathways analysis/humann/SRR6820491_mrna_pathabundance_normalised.tsv \
-o analysis/humann/unpacked_pathways_normalised.tsvThis will produce a table that looks like this:
Code
less analysis/humann/unpacked_pathways_normalised.tsvOutput
# Pathway SRR6820491_mrna_Abundance-RELAB
ANAEROFRUCAT-PWY: homolactic fermentation 0.00115042
ANAEROFRUCAT-PWY|unclassified 0.00115042
ANAEROFRUCAT-PWY|unclassified|UniRef90_F4LQU7 3.06411e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A376LF36 2.18255e-07
ANAEROFRUCAT-PWY|unclassified|UniRef90_D1L5J9 1.53932e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_F4LWJ5 3.82403e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_B5Y7Q7 0.000806969
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A0U9HE71 4.4773e-06
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A1B1YMK6 4.50368e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A1B1YMK2 4.01626e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A376RRT5 1.11938e-06
ANAEROFRUCAT-PWY|unclassified|UniRef90_M1YXF4 7.12165e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A1H3A0T4 2.5612e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A3DCA8 0.000505282
ANAEROFRUCAT-PWY|unclassified|UniRef90_B5Y768 0.000249563
ANAEROFRUCAT-PWY|unclassified|UniRef90_B5Y8V1 0.000376358
ANAEROFRUCAT-PWY|unclassified|UniRef90_K0FCI7 8.40225e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A3DBX9 4.01119e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_A0A2X1LMF5 3.11421e-07
ANAEROFRUCAT-PWY|unclassified|UniRef90_R1AXJ4 4.86346e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_G8LY94 5.4377e-05
ANAEROFRUCAT-PWY|unclassified|UniRef90_B5Y840 0.00023678
...From this table:
The pathway row shows the total community abundance for the pathway.
The species-stratified row shows how much each species contributes.
The gene family row shows the individual UniRef90 genes and their abundances.
It’s important to note that this output varies because humann_unpack_pathways uses a probabilistic approach to distribute pathway abundance among the many potential contributing genes in the “unclassified” pool. This means each time the command runs it may pick one of many equally contributing genes to appear in the pathway.
- In the results above (as your results may be slightly different!), which gene family contributes to the most abundant pathway with an associated family in this sample?
- UniRef90_F4LQU7
Group gene families into GO terms
Our outputs contain many different gene IDs in long lists that can be extremely laborious and time-consuming to look up and analyse. Finding which ones are important or interesting can be a difficult task, but we can instead use GO terms to simplify this process.
Gene Ontology (GO) terms are high-level, structured, standardised, and species-agnostic functional descriptors. GO terms describe gene/protein function across three independent high-level categories:
Biological Process (pathways/events) e.g. cell division
Molecular Function (biochemical activity) e.g. kinase
Cellular Component (location) e.g. membrane
Using GO terms will reduce the complexity of functional annotation outputs and highlight the biological functions of interest in our analysis.
We can use HUMAnN’s regrouping utility humann_regroup_table to map gene families to GO terms. Let’s build the command.
| flag/option | meaning |
|---|---|
-i analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv |
specifies the normalised gene family abundance table to regroup. |
-g uniref90_go |
tells HUMAnN to regroup UniRef90 gene families into Gene Ontology (GO) categories. |
-o analysis/humann/SRR6820491_mrna_go.tsv |
specifies the output file for the regrouped GO term abundances. |
Code
humann_regroup_table \
-i analysis/humann/SRR6820491_mrna_genefamilies_normalised.tsv \
-g uniref90_go \
-o analysis/humann/SRR6820491_mrna_go.tsvAfter this has ran we can then turn the quite cryptic GO term IDs into human-readable titles using the renaming utility humann_rename_table.
| flag/option | meaning |
|---|---|
-i analysis/humann/SRR6820491_mrna_go.tsv |
specifies the regrouped GO table to rename. |
-n go |
tells HUMAnN to use the GO naming database for annotation. |
-o analysis/humann/SRR6820491_mrna_go_renamed.tsv |
specifies the output file containing human-readable GO term names. |
Code
humann_rename_table \
-i analysis/humann/SRR6820491_mrna_go.tsv \
-n go \
-o analysis/humann/SRR6820491_mrna_go_renamed.tsvLet’s inspect the output using:
Code
less analysis/humann/SRR6820491_mrna_go_renamed.tsvOutput
# Gene Family SRR6820491_mrna_Abundance-RELAB
UNMAPPED 0.589358
UNGROUPED 0.006087598206
UNGROUPED|unclassified 0.006087598206
GO:0000015: [CC] phosphopyruvate hydratase complex 0.0007034955
GO:0000015: [CC] phosphopyruvate hydratase complex|unclassified 0.0007034955
GO:0000034: [MF] adenine deaminase activity 3.99883e-06
GO:0000034: [MF] adenine deaminase activity|unclassified 3.99883e-06
GO:0000049: [MF] tRNA binding 0.0031198066590000005
GO:0000049: [MF] tRNA binding|unclassified 0.0031198066590000005
GO:0000103: [BP] sulfate assimilation 2.56775e-05
GO:0000103: [BP] sulfate assimilation|unclassified 2.56775e-05
GO:0000105: [BP] histidine biosynthetic process 0.0014610406499999996
GO:0000105: [BP] histidine biosynthetic process|unclassified 0.0014610406499999996
GO:0000107: [MF] imidazoleglycerol-phosphate synthase activity 0.00021605209
GO:0000107: [MF] imidazoleglycerol-phosphate synthase activity|unclassified 0.00021605209
GO:0000155: [MF] phosphorelay sensor kinase activity 0.0005325315649999998
GO:0000155: [MF] phosphorelay sensor kinase activity|unclassified 0.0005325315649999998
GO:0000156: [MF] phosphorelay response regulator activity 3.5137059999999997e-05
GO:0000156: [MF] phosphorelay response regulator activity|unclassified 3.5137059999999997e-05
...From our GO Terms outputs:
- Which of the GO terms is least abundant?
- Which of the GO terms related to Molecular Functions is the most abundant?
- methanofuran biosynthetic process
- adenine deaminase activity
When analysing our RNA reads from a metatranscriptomics dataset by itself there is a crucial confounding factor we must consider.
Any given transcript could appear more abundant in one sample compared to another for one of two reasons:
The gene is more highly expressed in that sample
The copy number of that gene is higher in that sample
Without accompanying metagenomic data it is impossible to truly detect which one of these scenarios is true. This is a recurrent challenge in analysing any metatranscriptome in isolation.
The best case scenario would be to have a metagenomic DNA dataset from the same sample and run Metaphlan on that, then feed that taxonomic profile output into Humann (as metagenomic data is more reliable for quanitifying species). After that running Humann on both metagenomic and metatranscriptomic data will give us gene family copy numbers as well as expression levels, which together can then be used to normalise our dataset and provide a more reliable measure of expression abundance.
Because we do not have an attached metagenomic dataset here, we must be careful in our interpretation of our analysis!