Extracting a Community Profile
Taxonomic Profiling
Community structure is the data that tells us which species or taxa are present and what their abundances are. This taxonomic information is important as it tells us which microbes within the microbiome are the most likely to be driving the function. This kind of analysis is often called taxonomic profiling.
Taxonomic profiling can be achieved by a few different approaches. These include:
Using 16S/18S rRNA sub-unit sequencing to identify Operation Taxonomic Units (OTUs), a sort of proxy for species. This approach is also called “amplicon sequencing” or “amplicon analysis”. Whilst this approach usually utilises targeted PCR amplification of 16S/18S sequences, it is possible to use our aligned reads from SortMeRNA to yield our rRNA sequences in traditional amplicon analysis tools.
Using gene alignment software to assign taxonomy to reads by aligning them to marker genes from curated databases.
In this tutorial we shall be following the second of the approaches detailed above. In the previous lesson we filtered our samples to separate out rRNA and mRNA sequences, however for our taxonomic analysis we shall be using our unfiltered data.
We shall be using a tool called MetaPhlan, which aligns sequences with reference genomes in order to estimate which taxon those sequences originate from. Because rRNA is often used as a taxonomic marker, we will want to use our unfiltered data. The filtered data will be used instead for our functional analysis next lesson.
(For more on different methodologies see the introduction to lesson 3).
What is Metaphlan?
MetaPhlan (Metagenomic Phylogenetic Analysis) is one tool from the Biobakery suite of software, which all have bakery related names. Whilst this is a fun naming convention it can be somewhat confusing as to what each tool does! You may see many different tools mentioned in the literature but possibly be unclear on what they do. Here’s a rundown of the tools relevant to this section of the course.
MetaPhlAn is a tool for profiling microbial communities in metagenomic samples by identifying the species present and their relative abundances. Instead of analyzing full genomes, MetaPhlAn targets unique marker genes specific to each species, making it faster and computationally lighter than other approaches. We will be using MetaPhlan 4, the fourth iteration of the programme.
MetaPhlan uses another tool called Bowtie2, which is a tool for aligning reads against long reference sequences. Bowtie2 excels at aligning sequences of between 50-100 bp in length, perfect for our purposes.
You may have also seen “Chocophlan” listed among the Biobakery tools too. Unlike metaphlan, this is not actually a programme, but a database! Chocophlan serves as a pangenome reference database configured specifically to work with other biobakery tools. In effect this is a massive list of genes, their sequences, and their most likely organism of origin.
And so, MetaPhlan uses bowtie2 to map sequenced reads to the Chocophlan database of marker genes, determining species by the proportion of aligned reads. This approach allows researchers to quickly study microbiome composition without having to perform separate sequencing to determine taxonomy.
Humann is a tool used for functional annotation, which we shall explore next lesson.
Running Metaphlan
First of all, ensuring we are in our ~/cs_course directory, let’s make a directory to store all of our Metaphlan outputs:
Code
mkdir analysis/metaphlanHere’s what the Metaphlan command looks like:
Code
metaphlan data/trimmed_reads/SRR6820491_merged.fastq \
--input_type fastq \
--bowtie2db databases/chocophlandb/ \
--bowtie2out analysis/metaphlan/metaphlan_bowtie2.bz2 \
--nproc 8 \
--add_viruses \
--tax_lev 's' \
--min_cu_len 2000 \
--read_min_len 70 \
--stat_q 0.1 \
--output_file analysis/metaphlan/metaphlan_output.txt \
--min_ab 0.1 \
&> analysis/metaphlan/log.txt &Note: if you want to re-run MetaPhlan you’ll have to delete the original outputs
It may look like there’s a lot of complicated parts to this command, but we can break it down piece by piece to understand exactly what we are about to run. Remember, most programmes have a manual or help section that can be accessed by using the ‘-h’ flag. Let’s bring it up now.
Code
metaphlan -hThere are many possible input arguments for metaphlan, which can seem overwhelming at first. However, there’s no need to know them all! We only need to use those which are relevant to our analysis.
Referring to the manual can be useful to toggle different settings on and off, for example changing the flag of tax_lev from s to g will change the taxonomic level we’re searching at from species to genus.
Let’s build our command now.
data/trimmed_reads/SRR6820491_merged.fastq specifies our input file containing our trimmed reads. We use this rather than our sorted mRNA from sortmeRNA as MetaPhlan identifies species based on a wide range of marker genes, some of which may be filtered out by sortmeRNA.
--input_type fastq tells MetaPhlan the format of our input file.
--bowtie2db databases/chocophlandb/ specifies the location of the reference marker gene database that MetaPhlAn will use for alignment.
--output_file analysis/metaphlan/metaphlan_output.txt tells the program where to write the taxonomic profile.
We will also include a few additional arguments.
--bowtie2out analysis/metaphlan/metaphlan_bowtie2.bz2 saves the intermediate Bowtie2 alignment output.
--nproc 8 tells the program to use all 8 CPU cores of our instance.
--add_viruses allows viral marker genes to be included in the taxonomic profiling.
--tax_lev 's' tells MetaPhlAn to report results at the species level.
--min_cu_len 2000 sets the minimum total nucleotide length for clade-specific markers.
--read_min_len 70 ensures that only reads of at least 70 bp are analysed.
--stat_q 0.1 sets the statistical quantile used for robust average abundance estimation.
--min_ab 0.1 specifies that only clades with at least 0.1% relative abundance will be reported.
Finally,
&> analysis/metaphlan/log.txt & redirects all terminal output to a log file and runs the process in the background.
Now we have a better idea of what we’re asking MetaPhlAn to do, let’s run it. Be sure that you’re in your cs_course directory so that the commands point to the correct input and output directories! Go ahead and run the command in your terminal:
Code
metaphlan data/trimmed_reads/SRR6820491_merged.fastq \
--input_type fastq --bowtie2db databases/chocophlandb/ \
--bowtie2out analysis/metaphlan/metaphlan_bowtie2.bz2 \
--nproc 8 \
--add_viruses \
--tax_lev 's' \
--min_cu_len 2000 \
--read_min_len 70 \
--stat_q 0.1 \
--min_ab 0.1 \
--output_file analysis/metaphlan/metaphlan_output.txt \
&> analysis/metaphlan/log.txt &This should take about 30-45 minutes to run. It takes a while as each sequence is compared to the full database one by one! Our test data here only consists of one sample, but in your research you may be working on many many more. As such a “real” Metaphlan run may take much longer.
A note on analysing community data from Metatranscriptomes
This approach of analysing RNA reads is largely the same as the approach we would use for analysing DNA reads. However, this approach does have one main caveat, as explained here.
Because Metaphlan quantifies species abundance based on the number of reads that match species-specific marker genes it may seem sensible to assume a higher abundance of reads means a higher abundances of individuals or genomes of that species, but this is not necessarily true.
When analysing metagenomic DNA data it is safe to assume each genome corresponds to a single copy of a given marker gene (assuming that gene appears once within the genome). However, for metatranscriptomic RNA data that same assumption cannot be made as markers may be transcribed more or less than the average transcription rate for that species. This means the abundance Metaphlan calculates may be greater or smaller than that species’ contribution to the overall transcript pool.
In effect we are observing a measure of the marker gene transcript pool (not the overall transcript pool of all expressed genes), and this should be taken into account when analysing the outputs of Metaphlan. Remember it is always important to consider how our data is linked to biological reality!
In a real world Metaphlan run it is most likely you would set the taxonomic level for the search to be Kingdom and below, so why are we setting it to just species for this tutorial?
Metaphlan will output results for reads it can match with confidence to a set of marker genes. However some of those matches will only be resolved down to a higher taxonomic level as some genes are shared between all members of, for example, a certain Family. If only those shared genes are matched then it cannot be determined which Genus that read is from, and so it Metaphlan will report only to the Family level.
This can result in outputs that are much larger and much more complex to interpret. For simplicity’s sake we shall only deal with species level outputs for this tutorial, but we will include a pre-made output file for you to explore in the next episode.
Interpreting the Metaphlan output
From our Metaphlan run we have a few main outputs:
Tabular file containing the community profile (metaphlan_output.txt)
- Each line of this file will have four “columns”:
the closest matched taxon (to the species level)
the matched taxon NCBI ID
the relative abundance of the taxon
additional species that may share the same species group marker as the closest matched taxon
- This file is used for both interpretation of the data and as an input for the functional annotation later on.
Bowtie2 alignment file (SRR6820491.bowtie2.bz2)
This file is a compressed archive of intermediate files from the alignment process. It contains read IDs, marker gene matches, and alignment scores.
We don’t need it for this pipeline, but could be useful if you wanted to know how efficiently Metaphlan has mapped your reads to the database.
It can also be used as an input if you need to re-run Metaphlan, which saves time as it will skip the mapping step.
Execution log file (metaphlan.log)
- This log contains messages that would otherwise appear in the terminal that we have redirected using the &> command. We can use this to track if there have been any errors or warnings if things go wrong.
Let’s have a look at our community profile output using less.
Code
less analysis/metaphlan/metaphlan_output.txt#mpa_vJan25_CHOCOPhlAnSGB_202503 #/home/ubuntu/.local/bin/metaphlan data/trimmed_reads/SRR6820491_merged.fastq --input_type fastq --bowtie2out analysis/metaphlan/SRR6820491_s.bowtie2.bz2 --bowtie2db data/chocophlandb/ -t rel_ab --tax_lev s -o analysis/metaphlan/SRR6820491_profile_vJan25_s.txt --nproc 8 --add_viruses --stat_q 0.1 --min_ab 0.1
#16372080 reads processed
#SampleID Metaphlan_Analysis #clade_name NCBI_tax_id relative_abundance additional_species
s__Coprothermobacter_proteolyticus 35786 81.43426
s__Acetivibrio_thermocellus 1515 12.51698
s__Tepidanaerobacter_sp_GT38 2722793 5.52737
s__Methanothermobacter_thermautotrophicus 145262 0.50341
s__Escherichia_coli 562 0.01798
From the output here, try and answer these questions:
How many taxons have been identified?
What different taxonomic levels do we have access to with MetaPhlan?
What species are found in our sample?
Can you tell if only bacteria have been identified in our sample?
Our output has 5 lines, meaning we have 5 taxons identified.
We only have species-level access in our run. Changing the
--tax_levflag would allow us to identify a broader set of taxa (e.g., genus, family, kingdom).In our sample we identified:
Coprothermobacter proteolyticus, Acetivibrio thermocellus, Tepidanaerobacter sp. GT38, Methanothermobacter thermautotrophicus, and Escherichia coli.Trick question! Although you can manually verify each taxon, you would need to adjust the
--tax_levflag to show the kingdom (k__) level to formally confirm whether only bacteria are present.
Whilst it is possible to interpret information from this table alone, it is far easier to try to do so from visualised data. In the next episode we shall be using Krona and Graphlan to generate visualisations of Metaphlan’s output.