QC & Assembly: Glossary

Key Points

Introduction to Metagenomics
  • Genomics looks at the whole genome content of an organism.

  • Unlike genomic samples, metagenomes contain multiple organisms within one sample.

  • In metagenomes, the organisms are not usually present in equal abundance, except in mock communities.

  • We can identify the organisms present in a sample using either amplicon sequencing or whole metagenome sequencing. Amplicon sequencing is cheaper and quicker, but it limits the downstream analysis that can be done with the data.

  • Metagenomes differ in complexity, which is determined by how many organisms they contain.

  • Different platforms allow us to perform different analyses. Which platform is suitable depends on the question you are asking.

Logging onto the Cloud
  • You can use one set of log-in credentials for many instances.

  • Logging off an instance is not the same as turning off an instance.

Assessing Read Quality, Trimming and Filtering
  • Quality encodings vary across sequencing platforms.

  • It is important to know the quality of our data to be able to make decisions in the subsequent steps.

  • Data cleaning is essential at the beginning of metagenomics workflows.

  • Due to differences in the sequencing technology, Nanopore data must be handled differently.
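
As an illustration of the point about quality encodings, here is a minimal sketch of how a FASTQ quality string can be decoded into Phred scores. It assumes the Phred+33 offset (used by Sanger and recent Illumina output); other encodings use a different offset, which is why the offset is a parameter. The function names and the example quality string are hypothetical and chosen only for this illustration.

```python
# Sketch only: assumes Phred+33 encoding unless another offset is passed in.

def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Convert each quality character to its integer Phred score."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


# Hypothetical quality string for a five-base read.
example_quality = "II5+#"
scores = phred_scores(example_quality)

for ch, q in zip(example_quality, scores):
    print(f"{ch!r}: Q{q}, error probability {error_probability(q):.4f}")
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to 1 in 10,000, which is one way to judge how strict trimming and filtering thresholds should be.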

Metagenome Assembly
  • Assembly merges raw reads into contigs.

  • Flye can be used as a metagenomic assembler.

  • Certain statistics can be used to describe the quality of an assembly.
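
One commonly reported assembly statistic is the N50. The sketch below shows one way it could be computed from a list of contig lengths, alongside a few simpler summary values. The function names and the example lengths are hypothetical, and dedicated assembly assessment tools report these and many more metrics.

```python
# Sketch only: summary statistics from a list of contig lengths (in bases).

def n50(contig_lengths: list[int]) -> int:
    """Return the N50: the contig length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    raise AssertionError("unreachable for a non-empty list")


def assembly_summary(contig_lengths: list[int]) -> dict[str, int]:
    """Collect a few basic descriptive statistics for an assembly."""
    return {
        "contigs": len(contig_lengths),
        "total_length": sum(contig_lengths),
        "longest_contig": max(contig_lengths),
        "n50": n50(contig_lengths),
    }


# Hypothetical contig lengths for illustration.
example_lengths = [100, 200, 300, 400, 500]
print(assembly_summary(example_lengths))
```

In general, fewer contigs and a larger N50 suggest a more contiguous assembly, although neither value says anything about whether the contigs are correct.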

Glossary

accession
a unique identifier assigned to each sequence or set of sequences
categorical variable
Variables can be classified as categorical (also called qualitative) or quantitative (also called numerical). Categorical variables take on a fixed number of values that are names or labels.
cleaned data
data that has been manipulated post-collection to remove errors or inaccuracies, introduce desired formatting changes, or otherwise prepare the data for analysis
conditional formatting
formatting that is applied to a specific cell or range of cells depending on a set of criteria
CSV (comma separated values) format
a plain text file format in which values are separated by commas
factor
a variable that takes on a limited number of possible values (i.e. categorical data)
Gb
gigabyte of file storage or file size (more strictly abbreviated GB, since Gb can also denote a gigabit)
Gbase
a gigabase represents one billion nucleic acid bases (Gbp may indicate one billion base pairs of nucleic acid)
headers
names at the tops of columns that describe the column contents (sometimes optional)
metadata
data which describes other data
NGS
common acronym for “Next Generation Sequencing”, a term currently being replaced by “High Throughput Sequencing”
null value
a value used to record observations missing from a dataset
observation
a single measurement or record of the object being recorded (e.g. the weight of a particular mouse)
plain text
unformatted text
quality assurance
any process which checks data for validity during entry
quality control
any process which removes problematic data from a dataset
raw data
data that has not been manipulated and represents actual recorded values
rich text
formatted text (e.g. text that appears bolded, colored or italicized)
string
a collection of characters (e.g. “thisisastring”)
TSV (tab separated values) format
a plain text file format in which values are separated by tabs
variable
a category of data being collected on the object being recorded (e.g. a mouse’s weight)