Data processing and analysis

In the previous lesson, you performed quality control on your reads. In this lesson, we will align those reads to a reference genome and finish by identifying and visualizing variation among the samples.

As you progress through this lesson, keep in mind that even if you never use this exact workflow in your own research, you will be learning general skills for working with command-line bioinformatics tools. What you learn here should help you approach a variety of bioinformatics tools with confidence and work more efficiently.

We will also learn how automation helps when a similar workflow has to be repeated many times, and we will apply it to our own analysis.


This lesson assumes a working understanding of the bash shell. If you haven’t already completed the Using the Command Line lesson, and aren’t familiar with the bash shell, please review those materials before starting this lesson.

This lesson also assumes some familiarity with biological concepts, including the structure of DNA, nucleotide abbreviations, and the concept of genomic variation within a population.

This lesson uses data hosted on an Amazon Web Services (AWS) instance created from an Amazon Machine Image (AMI). Course participants will be given information on how to log in to the instance during the course. Information on preparing for the course is provided on the Cloud-SPAN Genomics Course setup page.


00:00  1. Trimming and Filtering: How can I get rid of sequence data that doesn’t meet my quality standards?
00:55  2. Variant Calling Workflow: How do I find sequence variants between my sample and a reference genome?
01:55  3. Automating a Variant Calling Workflow: How can I make my workflow more efficient and less error-prone?
02:40  Finish

The actual schedule may vary slightly depending on the topics and exercises chosen by the instructor.