Genomics Tutorial - Quality Control

Introduction

There are many sources of errors that can influence the quality of your sequencing run. (1) In this quality control section we will use our skill on the command-line interface to deal with the task of investigating the quality and cleaning sequencing data. (2)

Overview

The part of the work-flow we will work on in this section marked in red below:

qc_workflow

Learning Outcomes

After completing this section of the tutorial you should be able to:

  • Describe the steps involved in pre-processing/cleaning sequencing data.
  • Distinguish between a good and a bad sequencing run.
  • Compute, investigate and evaluate the quality of sequence data from a sequencing experiment.

Setup the environment

Follow these steps to set-up the conda environment for this section:

  1. Open a new terminal and load the workshops/workshops/genomics_workshop_qc module:
    $ module load workshops/workshops/genomics_workshop_qc
  2. Activate the conda environment:
    $ gen_qc_init

The Data

First, we are going to retrieve the data we are going to use in this section. Open a terminal window and issue the following commands:

#change into the main tutorial directory
$ cd ~/genomics_tutorial

# create a directory you work in
$ mkdir quality_control

#uncompress the data   
$ tar -xvzf data/qc/data.tar.gz -C quality_control/

# change into the directory
$ cd quality_control


The data is from a paired-end sequencing run data (The figure below shows the difference between Single-End (SE) and Paired-End (PE) sequencing) from an Illumina HiSeq (3). Thus, we have two files, one for each end of the read.

qc_pairedend

If you need to refresh how Illumina paired-end sequencing works have a look at the Illumina technology webpage and this video.

Attention

The data we are using is "almost" raw data as it came from the machine. This data has been post-processed in two ways already. All sequences that were identified as belonging to the PhiX genome have been removed. This process requires some skills we will learn in later sections. |illumina| adapters have been removed as well already! The process is explained below and we are going to run through the process anyway.

Investigate the Data

We can use basic terminal tools to investigate the data in ~/genomics_tutorial/quality_control/data directory:

#First change into the ~/genomics_tutorial/quality_control/data directory
cd data

#First extract one of the paired-end files for the ancestor
$ zcat anc_R1.fastq.gz > anc_R1.fastq

#Next look at the anc_R1.fastq
$ head anc_R1.fastq

The fastq File Format

The data we receive from the sequencing is in fastq format (4). See the format definition below:

FASTQ File Format

@title and optional description
sequence line(s)
+optional repeat of title line
quality line(s)

A useful tool to decode base qualities can be found here.

Calculating the Coverage

If we assume a genome size of ~4.6 Mb (Mb = 10^6 bp), we can calculate the coverage based on this formula:

C = LN / G

  • C: is the coverage
  • G: is the haploid genome length in bp
  • L: is the read length in bp (e.g. 2x150 paired-end = 300 bp)
  • N: is the number of reads sequenced

We have all the values except for the number of reads sequenced (N), which we can get from our sequence files.

To get the number of reads sequenced, we can exploit what we know about the fastq format above and utilize common Terminal tools:

expr $(cat anc_R1.fastq | wc -l) / 4

In the command above:

  • cat retrieves the content of the file
  • | redirects the output from cat to wc
  • wc counts the number of lines with the -l option
  • The process above produces the number of lines, thus replace $(...) with the number of lines
  • expr is used to evaluate and expression of dividing the number of lines by 4 (every read in a fastq file takes up 4 lines)

To do a sanity check on your data, perform the same command on the other pair-end file:

expr $(zcat anc_R2.fastq | wc -l) / 4

In both instances you should get 281748 reads.

Then to calculate the coverage:

C = (300 * 281748) / 4.8 * 10^6 = 17.6

Note

The amount of coverage you will need for your analysis will always depend on experiment type and the organism you are working with. To learn more about sequencing coverage and depth, refer to reference 5.

Note

Always prepare a list of potential analyses you would want to perform on your sequencing projects BEFORE performing the sequencing experiments.

The QC Process

There are a few steps one need to do when getting the raw sequencing data from the sequencing facility:

Remove PhiX Sequences

PhiX is a non-tailed bacteriophage with a single-stranded DNA and a genome with 5386 nucleotides. PhiX is used as a quality and calibration control for sequencing runs. PhiX is often added at a low known concentration, spiked in the same lane along with the sample or used as a separate lane. As the concentration of the genome is known, one can calibrate the instruments. Thus, PhiX genomic sequences need to be removed before processing your data further as this constitutes a deliberate contamination (6). The steps involve mapping all reads to the "known" PhiX genome, and removing all of those sequence reads from the data.

However, your sequencing provider might not have used PhiX, thus you need to read the protocol carefully, or just do this step in any case.

Attention

We are not going to do this step here, as the sequencing run we are using did not use PhiX. Please see the Mapping section on how to map reads against a reference genome.

Adapter Trimming

The process of sequencing DNA via Illumina technology requires the addition of some adapters to the sequences. These get sequenced as well and need to be removed as they are artificial and do not belong to the species we try to sequence. Generally speaking we have to deal with a trade-off between accuracy of adapter removal and speed of the process. Adapter trimming does take some time.

Also, we have generally two different approaches when trimming adapters:

  • We can use a tool that takes an adapter or list of adapters and removes these from each sequence read.
  • We can use a tool that predicts adapters and removes them from each sequence read.

For the first approach we need to know the adapter sequences that were used during the sequencing of our samples. Normally, you should ask your sequencing provider, who should be providing this information to you. Illumina itself provides a document that describes the adapters used for their different technologies. Also the FastQC tool, we will be using later on, provides a collection of contaminants and adapters.

However, often (sadly) this information is not readily available, e.g. when dealing with public data. Thus, the second approach can be employed, that is, using a tool that predicts adapters.

Quality Trimming

Here, we are going to use the second approach with a tool called |fastp| to trim adapters and do quality trimming. fastp has a few characteristics which make it a great tool, most importantly: it is pretty fast, provides good information after the run, and can do quality trimming as well, thus saving us to use another tool to do this.

Quality trimming of our sequencing reads will remove bad quality called bases from our reads, which is especially important when dealing with variant identification.

New Tool

fastp
An ultra-fast all-in-one FASTQ preprocessor (QC/adapters/trimming/filtering/splitting/merging...)

First we will trim the sequence reads of the ancestor:

#First make sure that you are currently in the ~/genomics_tutorial/quality_control/ directory

#Now create a directory to store the trimmed data   
$ mkdir trimmed

#Now execute fastp - remember to use \ at the end of each line to enter information on a new line
$ fastp --detect_adapter_for_pe          
        --overrepresentation_analysis      
        --correction                                     
        --cut_right                                      
        --thread 2                                       
        --html trimmed/anc.fastp.html     
        --json trimmed/anc.fastp.json      
         -i data/anc_R1.fastq.gz               
         -I data/anc_R2.fastq.gz               
         -o trimmed/anc_R1.fastq.gz       
         -O trimmed/anc_R2.fastq.gz

In the command above:

  • --detect_adapter_for_pe: Specifies that we are dealing with paired-end data.
  • --overrepresentation_analysis: Analyse the sequence collection for sequences that appear too often.
  • --correction: Will try to correct bases based on an overlap analysis of read1 and read2.
  • --cut_right: Will use quality trimming and scan the read from start to end in a window. If the quality in the window is below what is required, the window plus all sequence towards the end is discarded and the read is kept if its still long enough.
  • --thread: Specify how many concurrent threads the process can use.
  • --html and --json: We specify the location of some stat files.
  • -i data/anc_R1.fastq.gz : Specify the first input read file
  • -I data/anc_R2.fastq.gz: Specify the second read file (Remember that this is a paired-end sequencing experiment)
  • -o trimmed/anc_R1.fastq.gz : Specify the first output file (corresponding to the input file for -i)
  • -O trimmed/anc_R2.fastq.gz: Specify the second output file (corresponding to the input file for -I)

Now run fastp analysis on the two evolved samples:

#Remember to use \ at the end of each line to enter information on a new line

#Execute fastp for the first evolved sample
fastp --detect_adapter_for_pe 
        --overrepresentation_analysis 
        --correction 
        --cut_right 
        --html trimmed/evol1.fastp.html 
        --json trimmed/evol1.fastp.json 
        --thread 2 
         -i data/evol1_R1.fastq.gz 
         -I data/evol1_R2.fastq.gz 
         -o trimmed/evol1_R1.fastq.gz 
         -O trimmed/evol1_R2.fastq.gz
#Remember to use \ at the end of each line to enter information on a new line

#Execute fastp for the second evolved sample
fastp --detect_adapter_for_pe 
        --overrepresentation_analysis 
        --correction 
        --cut_right 
        --html trimmed/evol2.fastp.html 
        --json trimmed/evol2.fastp.json 
        --thread 2 
         -i data/evol2_R1.fastq.gz 
         -I data/evol2_R2.fastq.gz 
         -o trimmed/evol2_R1.fastq.gz 
         -O trimmed/evol2_R2.fastq.gz

Quality Assessment of Sequencing Reads

New Tool

FastQC
A high throughput sequence QC analysis tool.

With FastQC we can quickly (and visually) retrieve information from our sequence files about the quality of the reads within them. It produces and HTML report that can be viewed in a web-browser.

Before we run and investigate the quality of the reads in our sample data, look at the example reports provided by the authors of FastQC for: * Good Ilumina Data * Bad Ilumina Data * Adapter dimer contaminated run

Also see this video for a step-by-step explanation of all the elements/pages of a FastQC report.

Now that we know more about what FastQC, we can use it on our trimmed sequences.

FastQC is executed as follows:

$ fastqc -o RESULT-DIR INPUT-FILE.fq(.gz) ...

Where:

  • -o is the directory where the output files will be written to
  • INPUT-FILE is the input sequence file(s)

To perform a FastQC analysis on the trimmed sequences, use the following code:

#First make sure you are currently in the ~/genomics_tutorial/quality_control/ directory

#Create the output directory "trimmed-fastqc"
$ mkdir trimmed-fastqc

#Now execute FastQC on all the sequences in ~/genomics_tutorial/quality_control/trimmed
$ fastqc -o trimmed-fastqc trimmed/*.fastq.gz

There should now be reports for all the sequence files in the output directory.

Combining FastQC and fastp Information

So far we have accumulated allot of information on sequences from multiple samples, using multiple tools. Thus, the next step is to aggregate all this accumulated information and to decide if the data is of sufficient quality to perform any further downstream analyses on it.

Fortunately, there is an application that can assist with this task: MultiQC

New Tool

MultiQC
An excellent tool that can aggregate information from multiple bioinformatics analyses and multiple samples into one report.

To use MultiQC on our data, simply provide the output directories generated by FastQC and fastp to the application:

#First make sure that you are in the ~/genomics_tutorial/quality_control directory

#Execute MultiQC
$ multiqc trimmed-fastqc trimmed

There should now be a generated html file in the ~/genomics_tutorial/quality_control directory. Open the html file using a web-browser and inspect the contents. Do you think that we can continue with our downstream analyses?

References

  1. Robasky, K., Lewis, N. E., & Church, G. M. (2014). The role of replicates for error mitigation in next-generation sequencing. Nature Reviews Genetics, 15(1), 56-62.

  2. Kircher, M., Heyn, P., & Kelso, J. (2011). Addressing challenges in the production and analysis of illumina sequencing data. BMC genomics, 12(1), 1-14.

  3. Glenn, T. C. (2011). Field guide to next‐generation DNA sequencers. Molecular ecology resources, 11(5), 759-769.

  4. Cock, P. J., Fields, C. J., Goto, N., Heuer, M. L., & Rice, P. M. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic acids research, 38(6), 1767-1771.

  5. Sims, D., Sudbery, I., Ilott, N. E., Heger, A., & Ponting, C. P. (2014). Sequencing depth and coverage: key considerations in genomic analyses. Nature Reviews Genetics, 15(2), 121-132.

  6. Mukherjee, S., Huntemann, M., Ivanova, N., Kyrpides, N. C., & Pati, A. (2015). Large-scale contamination of microbial isolate genomes by Illumina PhiX control. Standards in genomic sciences, 10(1), 1-4.