Genomics Tutorial - Genome Assembly

Introduction

In the previous section we confirmed that our sequencing data was of sufficient quality to continue with downstream analyses.

The next step in the process is to assemble the sequencing reads into their respective genomes.

Overview

The part of the workflow we will work on in this section is marked in red below:

[Figure: assembly_workflow]

Learning outcomes

After studying this section of the tutorial you should be able to:

  1. Compute and interpret a whole genome assembly.
  2. Judge the quality of a genome assembly.

Set up the environment

Follow these steps to set up the conda environment for this section:

  1. Open a new terminal and load the workshops/workshops/genomics_workshop_assembly module:
    $ module load workshops/workshops/genomics_workshop_assembly
  2. Activate the conda environment:
    $ gen_ass_init
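
Before moving on, it can help to confirm that the tools used in this section are actually available in the activated environment. The following is a minimal sketch; check_tools is a hypothetical helper written for illustration and is not provided by the workshop module.

```shell
# Hypothetical helper: report any of the named programs that are not on PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        # command -v succeeds only if the shell can find the program
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "not found: $tool" >&2
            missing=1
        fi
    done
    # Return 0 when everything was found, 1 otherwise
    return "$missing"
}
```

Calling `check_tools spades.py quast` after activating the environment should print nothing if both programs used later in this section can be found.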

The Data

In the previous section we created the quality_control directory and populated it. The structure is as follows:

quality_control/
├── data
│   ├── anc_R1.fastq
│   ├── anc_R1.fastq.gz
│   ├── anc_R2.fastq.gz
│   ├── evol1_R1.fastq.gz
│   ├── evol1_R2.fastq.gz
│   ├── evol2_R1.fastq.gz
│   └── evol2_R2.fastq.gz
├── multiqc_data
│   ├── multiqc_citations.txt
│   ├── multiqc_data.json
│   ├── multiqc_fastp.txt
│   ├── multiqc_fastqc.txt
│   ├── multiqc_general_stats.txt
│   ├── multiqc.log
│   └── multiqc_sources.txt
├── multiqc_report.html
├── trimmed
│   ├── anc.fastp.html
│   ├── anc.fastp.json
│   ├── anc_R1.fastq.gz
│   ├── anc_R2.fastq.gz
│   ├── evol1.fastp.html
│   ├── evol1.fastp.json
│   ├── evol1_R1.fastq.gz
│   ├── evol1_R2.fastq.gz
│   ├── evol2.fastp.html
│   ├── evol2.fastp.json
│   ├── evol2_R1.fastq.gz
│   └── evol2_R2.fastq.gz
└── trimmed-fastqc
    ├── anc_R1_fastqc.html
    ├── anc_R1_fastqc.zip
    ├── anc_R2_fastqc.html
    ├── anc_R2_fastqc.zip
    ├── evol1_R1_fastqc.html
    ├── evol1_R1_fastqc.zip
    ├── evol1_R2_fastqc.html
    ├── evol1_R2_fastqc.zip
    ├── evol2_R1_fastqc.html
    ├── evol2_R1_fastqc.zip
    ├── evol2_R2_fastqc.html
    └── evol2_R2_fastqc.zip

4 directories, 39 files

Creating a genome assembly

We want to create a genome assembly for our ancestor. To do this, we are going to take the quality-trimmed forward and reverse reads from our Quality Control process and use the program SPAdes to build a genome assembly.

New Tool

SPAdes
SPAdes (St. Petersburg genome assembler) is an assembly toolkit containing various assembly pipelines. It works with Illumina or IonTorrent reads and can produce hybrid assemblies using PacBio, Oxford Nanopore and Sanger reads.

An evaluation of assembly software found SPAdes to be a good choice for fungal genomes (1). It is also simple to install and use.

SPAdes usage

Before performing the assemblies we need to set up our environment and learn more about the parameters that we need to pass to SPAdes:


#First change into the root folder for the tutorial
$ cd ~/genomics_tutorial

#Next, create an output directory for the assemblies
$ mkdir assembly

#Get some information on the parameters that spades uses 
$ spades.py -h

Generally, paired-end data is submitted in the following way:

$ spades.py -o result-directory -1 read1.fastq.gz -2 read2.fastq.gz 
Where:

  • -o - The output directory where the final assembly will be stored
  • -1 - The input file containing the forward paired-end reads
  • -2 - The input file containing the reverse paired-end reads
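
Because an assembly can run for a long time, it may be worth checking that both read files exist before starting. The sketch below wraps the general paired-end call shown above; run_spades_paired is a hypothetical function name chosen for illustration, and the only SPAdes options it uses are -o, -1 and -2.

```shell
# Hypothetical wrapper: refuse to start if either read file is missing or empty.
# Arguments: <output directory> <forward reads> <reverse reads>
run_spades_paired() {
    for reads in "$2" "$3"; do
        # -s is true only for an existing file with a size greater than zero
        if [ ! -s "$reads" ]; then
            echo "Missing or empty input file: $reads" >&2
            return 1
        fi
    done
    spades.py -o "$1" -1 "$2" -2 "$3"
}
```

For example, `run_spades_paired result-directory read1.fastq.gz read2.fastq.gz` would run the same command as above, but stop early with a message if one of the files cannot be found.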

To explore the functionality of SPAdes and how important the quality control steps are to downstream analyses, we will create assemblies of the original and trimmed data of the ancestor.

First, create a genome assembly with the original ancestor data:

#First make sure that you are in the ~/genomics_tutorial directory

#Run SPAdes on the ancestor's original reads
$ spades.py -o assembly/spades-original --careful -1 quality_control/data/anc_R1.fastq.gz -2 quality_control/data/anc_R2.fastq.gz

The --careful option tries to reduce the number of mismatches and short indels.

Attention

Always refer to the manual to ensure that you are using the correct/recommended parameters for the type of sequencing data available.

Next, create a genome assembly with the trimmed ancestor data:

#First make sure that you are in the ~/genomics_tutorial directory

#Run SPAdes on the trimmed ancestor reads
$ spades.py -o assembly/spades-150 --careful -1 quality_control/trimmed/anc_R1.fastq.gz -2 quality_control/trimmed/anc_R2.fastq.gz
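
Once both runs have finished, a quick sanity check is to count how many sequences each assembly contains. This is a minimal sketch that assumes each SPAdes output directory contains a scaffolds.fasta file; count_seqs is a hypothetical helper that simply counts FASTA header lines.

```shell
# Hypothetical helper: count sequences in a FASTA file by counting header lines.
count_seqs() {
    grep -c '^>' "$1"
}

# Run from ~/genomics_tutorial; report on both assemblies created above.
for asm in assembly/spades-original assembly/spades-150; do
    if [ -s "$asm/scaffolds.fasta" ]; then
        echo "$asm: $(count_seqs "$asm/scaffolds.fasta") scaffolds"
    else
        echo "$asm: scaffolds.fasta not found (has SPAdes finished?)"
    fi
done
```

A lower count is not automatically better, so treat these numbers only as a first look before the more thorough assessment.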

Assembly Quality Assessment

After running SPAdes we now have two assemblies and we need to figure out which assembly we will use for further analyses. To do this we need to assess the quality of each assembly and determine which assembly has the highest quality.

Assembly quality can be assessed using various metrics calculated from each assembly, including:

  • N50: the contig length such that all contigs of that length or longer together cover at least 50% of the assembly length
  • NG50: like N50, but the 50% threshold is taken from the length of the reference genome rather than the assembly length
  • NA50 and NGA50: like N50 and NG50, but computed from aligned blocks instead of contigs
  • misassemblies: misassembled and unaligned contigs or contig bases
  • genes and operons covered
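
To make the N50 definition concrete, the following sketch computes it directly from a FASTA file with awk and sort. It is only an illustration of the definition, not how QUAST computes its report, and n50 is a hypothetical function name.

```shell
# Hypothetical helper: print the N50 of the sequences in a FASTA file.
n50() {
    # Step 1: print the length of every sequence, one per line
    awk '/^>/ { if (len) print len; len = 0; next }
         { len += length($0) }
         END { if (len) print len }' "$1" |
    # Step 2: sort the lengths from longest to shortest
    sort -rn |
    # Step 3: add up lengths until at least half of the total is covered
    awk '{ lens[NR] = $1; total += $1 }
         END {
             for (i = 1; i <= NR; i++) {
                 running += lens[i]
                 if (running * 2 >= total) { print lens[i]; exit }
             }
         }'
}
```

For example, sequences of lengths 10, 8, 4 and 2 have a total length of 24. The longest sequence alone covers 10 bases, which is less than half, while the two longest together cover 18, so the N50 is 8.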

To compute these assembly statistics, we can use the program QUAST (2).

New Tool

QUAST
An easy-to-use set of tools to compare and evaluate genome assemblies.

To compare our two assemblies, we need to provide QUAST with the scaffolds.fasta files produced by SPAdes:

#First make sure that you are in the ~/genomics_tutorial directory

$ quast -o assembly/quast assembly/spades-original/scaffolds.fasta assembly/spades-150/scaffolds.fasta
Open report.html in the assembly/quast directory in a web browser and inspect the report. How does the quality of the assemblies compare and which one would you pick to move forward with?
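
If a web browser is not convenient, the same numbers can be read in the terminal. The sketch below assumes that QUAST has also written a tab-separated report.tsv next to report.html, with the metric name in the first column and one column per assembly; quast_metric is a hypothetical helper, so check the layout of your own report before relying on it.

```shell
# Hypothetical helper: print the row for one metric from a QUAST report.tsv.
# Arguments: <metric name> <path to report.tsv>
quast_metric() {
    # Keep only the line whose first tab-separated field equals the metric name
    awk -F'\t' -v metric="$1" '$1 == metric' "$2"
}
```

For example, `quast_metric N50 assembly/quast/report.tsv` would print the N50 row with one value for each of the two assemblies.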

Further Reading

For further information on genome assemblies and how to evaluate their quality, see the following additional resources:

Background on Genome Assemblies

  • How to apply de Bruijn graphs to genome assembly (3).
  • Sequence assembly demystified (4).

Genome Assembly Software

  • GAGE: A critical evaluation of genome assemblies and assembly algorithms (5).
  • Assessment of de novo assemblers for draft genomes: a case study with fungal genomes (1).
  • Bandage: interactive visualization of de novo genome assemblies (6).

References

  1. Abbas, M. M., Malluhi, Q. M., & Balakrishnan, P. (2014). Assessment of de novo assemblers for draft genomes: a case study with fungal genomes. BMC genomics, 15(9), 1-12.

  2. Gurevich, A., Saveliev, V., Vyahhi, N., & Tesler, G. (2013). QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), 1072-1075.

  3. Compeau, P. E., Pevzner, P. A., & Tesler, G. (2011). How to apply de Bruijn graphs to genome assembly. Nature biotechnology, 29(11), 987-991.

  4. Nagarajan, N., & Pop, M. (2013). Sequence assembly demystified. Nature Reviews Genetics, 14(3), 157-167.

  5. Salzberg, S. L., Phillippy, A. M., Zimin, A., Puiu, D., Magoc, T., Koren, S., ... & Yorke, J. A. (2012). GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome research, 22(3), 557-567.

  6. Wick, R. R., Schultz, M. B., Zobel, J., & Holt, K. E. (2015). Bandage: interactive visualization of de novo genome assemblies. Bioinformatics, 31(20), 3350-3352.