Platforms available and what is best for my experiment?

Overview

Teaching: 20 min
Exercises: 20 min
Questions
  • What platform should we choose?

  • What things influence platform choice or design if platform is fixed?

Objectives
  • Understand the difference between long and short read sequencing

  • Understand that different applications may require one or both types of sequencing

  • Know that you may not necessarily have control over the experimental design, but you should still be able to identify its good and bad parts

Outline

When we talk about sequencing, we can generally group it into either long read or short read sequencing. These correspond to third generation and second generation sequencing technologies, respectively. Sanger sequencing was the first generation of sequencing and could sequence longer fragments than today's short read sequencing, but it was low throughput. Second generation sequencing, or next generation sequencing (NGS) as it is more commonly called, is a much higher throughput method, but with reads typically 150-300 bp long. As of July 2022, a NextSeq 550 high-output run was capable of generating up to 800 million paired-end reads. Currently, Illumina sequencing is the predominant short read technology. Because the reads are short, the DNA/cDNA needs to be broken into smaller fragments during the library preparation process, either through chemical means or through sonication. This is a key difference between short and long read technologies.
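To put the read numbers above in context, the following is a minimal sketch of how read count and read length translate into total sequenced bases and average depth of coverage. The function names and the 3 Gb genome size are illustrative assumptions, not values from this lesson, and real runs rarely reach the theoretical maximum.

```python
def total_bases(n_reads: int, read_length_bp: int) -> int:
    """Total number of bases produced by a run (reads x read length)."""
    return n_reads * read_length_bp


def mean_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average number of times each base of the genome is sequenced."""
    return total_bases(n_reads, read_length_bp) / genome_size_bp


# 800 million reads of 150 bp each, the upper figure quoted above
run_yield = total_bases(800_000_000, 150)

# Hypothetical genome of 3 Gb (an assumption chosen for illustration)
depth = mean_coverage(800_000_000, 150, 3_000_000_000)

print(f"Yield: {run_yield / 1e9:.0f} Gb, mean coverage: {depth:.0f}x")
```

In practice the run is usually shared between several samples, so the depth per sample is this figure divided by the number of samples.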

Long read sequencing is third generation sequencing; the current leading technologies are nanopore and PacBio. Unlike short read sequencing, these samples do not need fragmenting, so depending on the quality of the sample, reads are often several kb in length, and 10-30 kb reads are common for good quality samples. Currently the longest read length for a nanopore run is 2.3 Mb, see here.

Which type of sequencing is the most appropriate for me?

There are benefits to both kinds of sequencing. Short reads have a higher base accuracy but are harder to align accurately, whereas long reads are easier to align accurately but have a lower base accuracy. Depending on your application, you may need to use one or both in combination. In some cases, for example if you’re using publicly available data, you will have no control over how the data is generated. Another factor to consider is what facilities you have access to at your institution. If your institution is a centre of excellence for nanopore sequencing, for instance, there will be more local expertise in handling these samples.

How will my research question impact on the appropriateness of the platform/method of sequencing used?

There are lots of ways in which your research question can impact on what sequencing method you will use. Some of the following are worth considering before choosing your method:

Sequencing Technologies at a glance

Platform | Generation | Read length | Sequence accuracy | Other advantages | Other disadvantages
Illumina | second-generation NGS | Short, 150-300 bp | high | Relatively cheap; can sequence fragmented DNA | Not suitable for sequences with many repetitive elements or for methylation signatures; relatively slow
PacBio | third-generation NGS | Long, 13,000-20,000 bp with a max of 300,000 bp | low | Easier library preparation; suitable for methylation signatures; quick | Relatively expensive
Oxford Nanopore | third-generation NGS | Long, 10,000-30,000 bp with a max of 2.3 million bp | low | Portability; easier library preparation; quick; suitable for methylation signatures |

Sequencing facilities

Sequencing methods evolve rapidly. However, if you are based in the UK, there are established sequencing facilities that may generate your data. Some of the bigger centres are listed below. They should be able to advise you on the most up-to-date library preparation methods, how much data is required for your downstream analysis and any other requirements for sequencing:

University of York, Technology Facility

University of Glasgow, Polyomics facility

University of Sheffield, Genomics core facility

University of Liverpool, Centre of Genomic Research
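Before asking a facility how much data you need, a rough estimate can be a useful starting point. The sketch below works out how many reads are needed to reach a target depth of coverage. The function name, the 5 Mb genome, the 30x target and the read lengths are hypothetical values chosen for illustration; the facility will know what is appropriate for your application.

```python
def reads_needed(genome_size_bp: int, target_coverage: int, read_length_bp: int) -> int:
    """Smallest number of reads giving at least the target mean coverage."""
    bases_required = genome_size_bp * target_coverage
    # Integer ceiling division, so a partial read counts as a whole read
    return -(-bases_required // read_length_bp)


# Hypothetical 5 Mb genome sequenced to 30x (illustrative assumptions)
short_reads = reads_needed(5_000_000, 30, 150)      # 150 bp short reads
long_reads = reads_needed(5_000_000, 30, 15_000)    # 15 kb long reads

print(f"Short reads needed: {short_reads:,}")
print(f"Long reads needed:  {long_reads:,}")
```

The same depth needs far fewer long reads than short reads, which is why read counts alone are not a fair way to compare platforms.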

How to deal with data from experiments you haven’t designed?

You may be dealing with data that you haven’t generated yourself, and so have had no control over the experimental design. This could be data that is already within your lab, or data from a repository such as the European Nucleotide Archive (ENA) or the Sequence Read Archive (SRA).

If this is the case, you may want to think about a few of the following points. We will cover some of these, such as replication and controls, in greater detail later, so don’t worry if you don’t understand them now.

We should also be able to answer these additional questions later today

Key Points

  • Short read sequencing is high throughput and can generate many millions of reads. These reads are usually 150-300 bp long and have a high base accuracy

  • Long read sequencing is lower throughput than short read sequencing in terms of the number of reads. However, the reads are many kb in length

  • If in doubt about design, ask the facility you are using