Sequence Data Quality Control


This lab is aimed at giving you a flavour of some of the steps that needs to be taken in most sequencing projects and provides examples from several sequencing technologies. The idea is to give you an feel for how you can handle and process the sequence data that you recieve from the sequencer. Although some parts of the lab and certain tools may be specific to a particular sequencing technology or application, the overall philosophy on handling your sequencing data is the same and has already proven to withstand the test of time.

Logging in to the virtual machine

Log in to the virtual machine as instructed previously. We will need graphics in this session to view the results files so if you are using ssh, you will need to use the -Y option.

Before we start there are some notations in the following text that may need a brief introduction. UNIX commands such as pwd will be in italics in the text while commands that you need to type will look like this:

 #This is a comment that is not executed

Note: In order to give you the chance to solve the exercise on your own the answers/tips are hidden by default. Click on the small arrow to reveal the answer. Try for yourself first and then look at the provided answers to maximise your learning experience!

Click to see the hidden answer This is the hidden answer!

OK, lets get to work! The lab will guide you through a number of important steps and considerations when working with sequence data.


Lets say that you just got your freshly baked sequences back from the sequencing centre. Everyone in the project is excited about the new insights that the data will bring.

So what do you do now?

This question is the essence of what this lab is all about!

For practical reasons we will use test datasets here, but most steps would apply to your favourite sequencing project.

All datasets are provided in the following directory:


Quality control

Why do we need quality control?

While the quality of the sequences we are getting is constantly improving there are a number of reasons why it is important to perform at least some basic quality controls. Modern sequencing technologies can generate a huge number of sequence reads in a single experiment. However, no sequencing technology is perfect, and each instrument will generate different types and amounts of error. Therefore, it is necessary to understand, identify and exclude error types that may impact the interpretation of downstream analysis.

Note: As you probably noticed in the above bullets, the study design and your research project is central to your interpretation of the QC results. It is important to emphasise that you need to treat your bioinformatics project with the same scientific approach as you would a wet lab experiment.


Exercise 1

First lets make a directory called qcLab to work in: Hint: use mkdir to do this (and only use the hidden answer if you do not manage on your own!

Make a directory called qcLab and change into the new directory
mkdir ~/qcLab
cd ~/qcLab

There are a mulitude of software options for quality control of your sequence reads. Some are R packages such as ShortRead (bioconductor package that can actually do QC, filtering as well as trimming) while others are software that you run in your terminal.

FastQC is a java based quality control tool that can be used using commands in the terminal or by using a graphical interface. You can run fastqc as a graphical software or directly in the command line.

Start fastqc

It may take a short while to get the graphics presented on your screen so be a bit patient here.


Once you have selected the file, fastqc will start running (it will take a few minutes so a short break is a good idea).

Exercise 2

Now we will look at Illumina sequencing data from the Earth Microbiome Project. This is an environmental sample sequenced for 16S amplicons. This data is in the file EMP_Misc_16v4EMP_NoIndex_L005_R1_001-sample.fastq.gz.

Exercise 3

Now we will look at two sets of RAD sequencing data. For this exercise we will use the command line version of FastQC.

Firstly 12 RAD sequencing samples saved in the compressed folder CAN.tgz. Uncompress the file using tar:

Hint for running tar
#Check that you are in the qcLab directory
cd ~/qcLab 
tar -xzvf ~/workshop_materials/quality_control/CAN.tgz
Note that using this command will put the extracted in the directory where you are currently standing in a directory called CAN

Analyse each of the samples using FastQC with the command line. This shows how to run it for the first file:

Hint for running fastqc command line
fastqc CAN/can152.fq
Note that by default output of the fastqc will be placed inside the CAN directory.

From the command line type for the first file:

firefox CAN/can152_fastqc.html 

HINT - Think about using the wildcard again.

How can you explain the pattern observed in the first six bases? Why might the beginning of the read be poorer quality?

Exercise 4

Now lets look at a paired-end Illumina RAD dataset. The forward read is in the file KE1_S49_R1_001.fastq.gz and the reverse read is in the KE1_S49_R2_001.fastq. Have a look at the fastq files and the naming of the ID:s in the forward and reverse reads using head and tail. Can you see which reads belong to the same DNA fragment?

Exercise 5
cutadapt --help 

You won’t need to unzip the file. Make sure that you specify an output file.

Hint for running cutadapt
cutadapt -a =ATCTCGTATGCCGTCTTCTGCTTG -o SRR026762_clean.fastq ~/workshop_materials/quality_control/SRR026762-sample.fastq.gz
Exercise 6

Let’s look at some Illumina MiSeq paired end sequencing data from an amplicon study - 1_TAAGGCGA-TAGATCGC_L001_R1_001.fastq and 1_TAAGGCGA-TAGATCGC_L001_R2_001.fastq

Why is read 2 poorer quality than read 1?

Run fastq-mcf to filter based on a q score of 35 and a minimum length of 80 bp. The adaptors can be found in the file adaptors.fasta. Make sure you specify an output file for both the forward and reverse reads with a different name than the raw data.

Hint for running fastq-mcf
fastq-mcf adaptors.fasta 1_TAAGGCGA-TAGATCGC_L001_R1_001.fastq -q 35 -l 80 -o 1_TAAGGCGA-TAGATCGC_L001_R1_001_clean.fastq

Run FastQC on the filtered reads - can you see any changes? Think about cases where such strict quality control might not be necessary.

How many reads were too short? How many reads were trimmed?

Exercise 7

Let look at what PacBio sequences look like. These long reads have a higher error rate than Illumina data and a different quality encoding.

However, not all PacBio sequences look like this. For example, lets run fastqc on another PacBio dataset.

Reproducibility 101

We have now run a few analyses with several datasets and programs. Have a look at the files you produced. Would you remember what all the different files are and recall how you produced them? Probably not, so to be kind to your future self (and colleagues!) finding a way to organise your work is highly recommended. Below there are a few suggestions that could be useful, but feel free to find your own path (no pun intended).

Lets do another exercise and this time we will run it with a recommended file structure to follow. It will be a bit harder to keep track of the structure and in which directory you are but for a real project it would be highly beneficial.

Another suggestion is to keep track of commands used by keeping a separate terminal with a text editor of your choice open so that you can document your work as you proceed. There are many ways of doing this (R markdown, jupyter notebook etc), but here I would suggest that you start documenting in a simple text format and save it under the name README in the directory you are working in.

Exercise 8

Before starting an analysis it would be a good idea to make your data read-only using: <pre> chmod 444 *fastq</pre>

OK, so lets make a directory structure worthy of your sequence reads! As already mentioned you can structure this any way you like, this is only a suggested file structure that I have found useful for keeping track of a large number of projects over the years.

mkdir RAD
cd RAD
mkdir Progs Analysis Data Docs Scripts
cd Data

In the course we will only need the Analysis and Data directories, but go ahead and make all of the directories anyway.

With the massive amount of sequencing, file sizes and hard disk space is constantly an issue. Therefore is is important to make sure that you do not unnecessarily duplicate large data files. One way to avoid this would be to make symbolic links instead of actually copying. For most practical purposes it does not make any difference in your analyses but you will keep things neat and at the same time avoid wasting disk space.

Here we will reuse the CAN RADSeq data.

Click to see how to make a symbolic link
 ln -s ~/qcLab/CAN/*.fq .
 # This will work because we extracted the CAN.tgz file previously. 
Run MultiQC
mkdir ~/qcLab/RAD/Analysis/multiqc
cd ~/qcLab/RAD/Analysis/multiqc
multiqc ../fastqc/

You have now completed this lab successfully!

Well done and good luck with working on your future projects!