Principles of experimental design: Glossary

Key Points

Platforms available and what is best for my experiment?
  • Short read sequencing is high throughput and can generate many millions of reads. These reads are usually 150-300 bp long and have high base accuracy

  • Long read sequencing is lower throughput than short read sequencing in terms of the number of reads produced. However, the reads are many kilobases (kb) long

  • If in doubt about design, ask the sequencing facility you are using

Understanding experimental design
  • Most experiments need positive and negative controls

  • Technical replicates ensure our measures are reproducible

  • Biological replicates ensure our results generalise

  • Coverage is the percentage of the reference genome sequenced

  • Depth is the average number of times a base is sequenced

  • There is a trade-off between sequencing depth and the need for controls and replicates
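
The depth and coverage definitions above can be illustrated with a small calculation. This is a minimal sketch, not a standard tool: the function names, the read counts, and the genome size are hypothetical, and "covered" is assumed here to mean a position sequenced by at least one read.

```python
def mean_depth(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced (expected depth)."""
    return num_reads * read_length / genome_size


def breadth_of_coverage(per_base_depth):
    """Percentage of reference positions covered by at least one read (assumed definition)."""
    covered = sum(1 for depth in per_base_depth if depth > 0)
    return 100 * covered / len(per_base_depth)


# Hypothetical run: 2 million reads of 150 bp against a 5 Mb genome.
print(mean_depth(2_000_000, 150, 5_000_000))  # 60.0

# Hypothetical per-base depths for a toy 10-base reference.
depths = [0, 3, 5, 4, 0, 2, 6, 7, 1, 0]
print(breadth_of_coverage(depths))  # 70.0
print(sum(depths) / len(depths))    # 2.8
```

With a fixed sequencing budget, adding replicates or controls divides the total number of reads among more samples, which lowers num_reads per sample and therefore the depth each one receives.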

Statistical Analysis
  • Type I errors are false positives: we reject a null hypothesis that is actually true.

  • Type II errors are false negatives: we fail to reject a null hypothesis that is actually false.

  • We can reduce Type I errors by using a more stringent alpha threshold.

  • We can reduce Type II errors by increasing the power of an experiment, for example by increasing the number of replicates used.

  • High throughput experiments require us to use multiple testing correction. Two popular methods are the Bonferroni correction, which is the more stringent, and the Benjamini-Hochberg FDR procedure, which is less stringent.
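
To make the difference in stringency concrete, the two corrections can be sketched in plain Python. This is an illustrative sketch under stated assumptions: the function names and the list of p-values are made up for the example, and a real analysis would normally rely on an established statistics package.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a test only if its p-value is at most alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from eight tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]

print(sum(p <= 0.05 for p in p_values))   # 5 pass an uncorrected 0.05 threshold
print(sum(bonferroni(p_values)))          # 1 passes Bonferroni
print(sum(benjamini_hochberg(p_values)))  # 2 pass Benjamini-Hochberg
```

Bonferroni controls the chance of any false positive across all tests, whereas Benjamini-Hochberg controls the expected proportion of false positives among the rejected tests, which is why it retains more results here.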

Glossary

accession
a unique identifier assigned to each sequence or set of sequences
categorical variable
Variables can be classified as categorical (aka, qualitative) or quantitative (aka, numerical). Categorical variables take on a fixed number of values that are names or labels.
cleaned data
data that has been manipulated post-collection to remove errors or inaccuracies, introduce desired formatting changes, or otherwise prepare the data for analysis
conditional formatting
formatting that is applied to a specific cell or range of cells depending on a set of criteria
CSV (comma separated values) format
a plain text file format in which values are separated by commas
factor
a variable that takes on a limited number of possible values (i.e. categorical data)
Gb
gigabyte of file storage or file size (more commonly abbreviated GB; Gb can also denote a gigabit)
Gbase
a gigabase represents one billion nucleic acid bases (Gbp may indicate one billion base pairs of nucleic acid)
headers
names at the tops of columns that describe the column contents (sometimes optional)
metadata
data which describes other data
NGS
common acronym for “Next Generation Sequencing”, currently being replaced by “High Throughput Sequencing”
null value
a value used to record observations missing from a dataset
observation
a single measurement or record of the object being recorded (e.g. the weight of a particular mouse)
plain text
unformatted text
quality assurance
any process which checks data for validity during entry
quality control
any process which removes problematic data from a dataset
raw data
data that has not been manipulated and represents actual recorded values
rich text
formatted text (e.g. text that appears bolded, colored or italicized)
string
a collection of characters (e.g. “thisisastring”)
TSV (tab separated values) format
a plain text file format in which values are separated by tabs
variable
a category of data being collected on the object being recorded (e.g. a mouse’s weight)
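
Several of the terms above (CSV, TSV, headers, observation, variable, null value, string) can be seen together in a short example. This is a minimal sketch using Python's standard csv module; the mouse table is made-up data, and treating an empty field as a null value is an assumption chosen for illustration.

```python
import csv
import io

# Hypothetical plain text table: the first line holds the headers,
# and each following line is one observation.
csv_text = "mouse_id,sex,weight_g\nM01,F,21.4\nM02,M,\nM03,F,19.8\n"

# The same table in TSV form: only the separator differs.
tsv_text = csv_text.replace(",", "\t")


def read_table(text, delimiter):
    """Parse delimited plain text into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


rows = read_table(csv_text, ",")
assert rows == read_table(tsv_text, "\t")

for row in rows:
    # Every field is read as a string; an empty string is treated as a null value here.
    weight = float(row["weight_g"]) if row["weight_g"] else None
    print(row["mouse_id"], row["sex"], weight)
```

Here weight_g is a quantitative variable, sex is a categorical variable (a factor), and the missing weight for M02 is recorded as a null value rather than as 0.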