Glossary

Key Points

Taxonomic Assignment	Taxonomy can be assigned by comparing sequences against a database of annotated genomes. These databases are not exhaustive, so many sequences may remain unassigned, depending on the samples you are analysing. Taxonomic assignment can be done using Kraken2. Pavian is a web-based tool that can be used to visualize the assigned taxa.
Diversity Tackled With R	α diversity measures the diversity within a metagenome. β diversity measures the difference in diversity between metagenomes. A Biological Observation Matrix (BIOM) table is a matrix of counts and is generated from the Kraken output using `kraken-biom`. The `phyloseq` package can be used to analyse metagenome diversity using the BIOM table.
Taxonomic Analysis with R	The R package `phyloseq` has a function `psmelt()` to make dataframes from `phyloseq` objects. A Venn diagram can be used to show the shared and unique compositions of samples. Plotting relative abundance allows you to compare samples with differing numbers of reads.
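
The ideas in the key points above (relative abundance, α and β diversity, shared and unique taxa) can be illustrated with a few lines of arithmetic. The lesson itself does this in R with `phyloseq`. The sketch below is plain Python instead, uses made-up counts for two hypothetical samples, and picks the Shannon index and Bray-Curtis dissimilarity as one common choice each for α and β diversity. It is an illustration of the concepts, not the lesson's workflow.

```python
import math

# Hypothetical read counts per taxon for two samples (made-up numbers).
counts = {
    "sample_A": {"Proteobacteria": 50, "Firmicutes": 30, "Bacteroidetes": 20},
    "sample_B": {"Proteobacteria": 10, "Firmicutes": 10, "Actinobacteria": 20},
}


def relative_abundance(sample):
    """Convert raw counts to proportions that sum to 1."""
    total = sum(sample.values())
    return {taxon: n / total for taxon, n in sample.items()}


def shannon(sample):
    """Shannon index: one common alpha diversity measure (within a sample)."""
    props = relative_abundance(sample).values()
    return -sum(p * math.log(p) for p in props if p > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity on relative abundances:
    one common beta diversity measure (between samples)."""
    ra, rb = relative_abundance(a), relative_abundance(b)
    taxa = set(ra) | set(rb)
    # With proportions, each sample sums to 1, so BC = 1 - sum of minima.
    shared_mass = sum(min(ra.get(t, 0.0), rb.get(t, 0.0)) for t in taxa)
    return 1.0 - shared_mass


a, b = counts["sample_A"], counts["sample_B"]

# Taxa shared by both samples and taxa unique to each (what a Venn diagram shows).
shared_taxa = set(a) & set(b)
unique_to_a = set(a) - set(b)
unique_to_b = set(b) - set(a)

print("Relative abundance A:", relative_abundance(a))
print("Shannon A:", round(shannon(a), 3), "Shannon B:", round(shannon(b), 3))
print("Bray-Curtis A vs B:", round(bray_curtis(a, b), 3))
print("Shared taxa:", sorted(shared_taxa))
```

Because both samples are converted to proportions first, the comparison is not distorted by sample_A having 100 reads and sample_B only 40.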

accession: a unique identifier assigned to each sequence or set of sequences
categorical variable: a variable that takes on a fixed number of values that are names or labels. Variables can be classified as categorical (aka, qualitative) or quantitative (aka, numerical).
cleaned data: data that has been manipulated post-collection to remove errors or inaccuracies, introduce desired formatting changes, or otherwise prepare the data for analysis
conditional formatting: formatting that is applied to a specific cell or range of cells depending on a set of criteria
CSV (comma separated values) format: a plain text file format in which values are separated by commas
factor: a variable that takes on a limited number of possible values (i.e. categorical data)
Gb: gigabyte of file storage or file size (more strictly abbreviated GB, since Gb can also denote gigabit)
Gbase: a gigabase represents one billion nucleic acid bases (Gbp may indicate one billion base pairs of nucleic acid)
headers: names at the tops of columns that describe the column contents (sometimes optional)
metadata: data which describes other data
NGS: common acronym for “Next Generation Sequencing”, a term currently being replaced by “High Throughput Sequencing”
null value: a value used to record observations missing from a dataset
observation: a single measurement or record of the object being recorded (e.g. the weight of a particular mouse)
plain text: unformatted text
quality assurance: any process which checks data for validity during entry
quality control: any process which removes problematic data from a dataset
raw data: data that has not been manipulated and represents actual recorded values
rich text: formatted text (e.g. text that appears bolded, colored or italicized)
string: a collection of characters (e.g. “thisisastring”)
TSV (tab separated values) format: a plain text file format in which values are separated by tabs
variable: a category of data being collected on the object being recorded (e.g. a mouse’s weight)
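
Several of the terms above (TSV format, headers, observation, variable, null value, categorical variable) can be seen together in one small table. The sketch below is a minimal illustration in Python using only the standard library. The table contents, the column names and the choice of `NA` as the null marker are all assumptions made for the example.

```python
import csv
import io

# Hypothetical TSV content: plain text, values separated by tabs.
# In practice this would be a file on disk.
raw = (
    "sample_id\ttreatment\tweight_g\n"
    "m1\tcontrol\t21.5\n"
    "m2\ttreated\tNA\n"
    "m3\tcontrol\t19.0\n"
)

# Assumed markers for null values (missing observations) in this table.
NULL_VALUES = {"", "NA"}

reader = csv.DictReader(io.StringIO(raw), delimiter="\t")

observations = []  # one dict per row, i.e. per observation
for row in reader:
    weight = row["weight_g"]
    # Quantitative variable: convert to a number, keeping nulls as None.
    row["weight_g"] = None if weight in NULL_VALUES else float(weight)
    observations.append(row)

# Headers: the names at the tops of the columns.
headers = reader.fieldnames

# Categorical variable: "treatment" takes a fixed set of labels.
levels = sorted({row["treatment"] for row in observations})

print("Headers:", headers)
print("Observations:", len(observations))
print("Treatment levels:", levels)
```

Each row is one observation, each column is one variable, and the missing weight for `m2` is kept as an explicit null rather than being silently read as zero.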
