Project management for cloud genomics

Good data organisation is the foundation of any research project. It sets you up well for an analysis and makes it easier to come back to the project later. It also allows you to share your work with collaborators, including your most important collaborator - future you.

Cartoon: a person working at a computer is asked by someone offstage, "How is the analysis going?" The person at the computer replies, "Can't understand the data... and the data collector does not answer my emails or calls." Person offstage: "That's terrible! So cruel! Who did collect the data? I will sack them!" Person at the computer: "Um... I did, 3 years ago."

Organising a project that includes sequencing involves many components. There’s the “metadata” about the experimental setup and conditions, the measurements of experimental parameters, the sequencing preparation and sample information, the sequences themselves, and the files and workflow of any bioinformatics analysis.

Much of the information in a sequencing project is digital, and we need to keep track of our digital records just as carefully as we keep a lab notebook and a sample freezer.

In this lesson, we’ll go through the project organisation and documentation needed for an efficient bioinformatics workflow. This will make you a more effective bioinformatics researcher and prepare your data and project for publication. Grant agencies and publishers increasingly require this information.

In this lesson, we’ll be using data from a famous study of experimental evolution using E. coli. More information about this dataset is available here. In this study there are several types of files:

spreadsheet data from the experiment that tracks the strains and their phenotype over time
spreadsheet data with information on the samples that were sequenced - the names of the samples, how they were prepared and the sequencing conditions
the sequence data

Throughout the analysis, we will generate more files from the steps in the bioinformatics pipeline and documentation on the tools and parameters that we used.

Getting Started

This lesson assumes no prior experience with the tools covered in the course. However, learners are expected to have some familiarity with biological concepts, including the concept of genomic variation within a population, as well as some basic experience using a command line interface to navigate file systems.

For a beginner-level overview of the command line, see the Cloud-SPAN Prenomics pages. If you are unsure whether your skills/experience are sufficient, why not try our self-assessment quiz to test your knowledge?

This lesson is part of a course that uses data hosted on an Amazon Web Services (AWS) instance created from an Amazon Machine Image (AMI). Workshop participants will be given information on how to log in to the instance during the course. Information on preparing for the course is provided on the Cloud-SPAN Genomics setup page.

Schedule

Time	Episode	Questions
00:00	1. Data Tidiness	What metadata should I collect? How should I structure my sequencing data and metadata?
00:30	2. Planning for NGS Projects	How do I plan and organise a genome sequencing project? What information does a sequencing facility need? What are the guidelines for data storage?
01:00	3. Examining Data on the NCBI SRA Database	How do I access public sequencing data?
01:30	4. Why of cloud computing	What is cloud computing? What are the tradeoffs of cloud computing?
01:35	5. Logging onto the Cloud	How do I connect to an AWS instance?
02:20	Finish
