Genomics Tutorial - Genome Annotation

Introduction

In this section you will predict genes and assess your assembly using Prokka.

Overview

The part of the workflow we will work on in this section is marked in red below:

[Figure: annotation workflow]

Learning outcomes

After completing this section of the tutorial you should be able to:

Use bioinformatics tools to perform gene prediction
Use genome-viewing software to graphically explore genome annotations and NGS data overlays

Set up the environment

Follow these steps to set up the conda environment for this section:

Open a new terminal and load the workshops/workshops/genomics_workshop_annot module:
```
$ module load workshops/workshops/genomics_workshop_annot
```
Activate the conda environment:
```
$ gen_annot_init
```
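
Before continuing, it can help to confirm that the tools used in this section are actually visible in the activated environment. The following is a minimal sketch, not part of the workshop setup itself: `check_tools` is a hypothetical helper name, and the assumption is that the environment exposes the commands `prokka` and `igv` on the `PATH`.

```shell
# Report whether each named command is available on the PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (assumes these two commands are what the environment provides):
# check_tools prokka igv
```

If a tool is reported as missing, re-run the module load and activation steps above in the same terminal.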

The Data

Let's look at our directory structure in ~/genomics_tutorial so far:

genomics_tutorial
├── assembly
│   ├── quast
│   │   ├── basic_stats
│   │   └── icarus_viewers
│   ├── spades-150
│   │   ├── corrected
│   │   │   └── configs
│   │   ├── K21
│   │   │   ├── configs
│   │   │   └── simplified_contigs
│   │   ├── K33
│   │   │   ├── configs
│   │   │   └── simplified_contigs
│   │   ├── K55
│   │   │   ├── configs
│   │   │   └── simplified_contigs
│   │   ├── K77
│   │   │   ├── configs
│   │   │   └── path_extend
│   │   ├── misc
│   │   ├── mismatch_corrector
│   │   │   ├── contigs
│   │   │   │   └── configs
│   │   │   └── scaffolds
│   │   │       └── configs
│   │   ├── pipeline_state
│   │   └── tmp
│   └── spades-original
│       ├── corrected
│       │   └── configs
│       ├── K21
│       │   ├── configs
│       │   └── simplified_contigs
│       ├── K33
│       │   ├── configs
│       │   └── simplified_contigs
│       ├── K55
│       │   ├── configs
│       │   └── simplified_contigs
│       ├── K77
│       │   ├── configs
│       │   └── path_extend
│       ├── misc
│       ├── mismatch_corrector
│       │   ├── contigs
│       │   │   └── configs
│       │   └── scaffolds
│       │       └── configs
│       ├── pipeline_state
│       └── tmp
├── data
├── kraken
├── krona
│   └── taxonomy
├── mappings
│   ├── evol1.sorted.dedup_stats
│   │   ├── css
│   │   ├── images_qualimapReport
│   │   └── raw_data_qualimapReport
│   └── ref_genome
├── quality_control
│   ├── data
│   ├── multiqc_data
│   ├── trimmed
│   └── trimmed-fastqc
└── variants
    └── plots

67 directories

Annotation with Prokka

We will attempt to annotate our assembled genome using Prokka.

New Tool

Prokka
A software tool that rapidly annotates prokaryotic genomes

To perform an annotation on our assembled genome, execute the following command:

#Execute Prokka
$ prokka --kingdom Bacteria --genus Escherichia --species coli --outdir annotation assembly/spades-150/scaffolds.fasta

Your results will be in the annotation directory with the prefix PROKKA.
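
A quick way to get a feel for the annotation is to count how many features of each type Prokka reported. The sketch below assumes the GFF file in the annotation directory follows the usual GFF3 layout: nine tab-separated columns with the feature type in column 3, followed by a `##FASTA` marker after which the contig sequences are appended. `count_gff_features` is a hypothetical helper name, and the file name in the example call is a placeholder for whatever file Prokka produced in your run.

```shell
# Count annotated features by type (column 3) in a GFF3 file.
# Lines starting with '#' are comments; reading stops at the '##FASTA'
# marker, after which the file contains sequence data, not features.
count_gff_features() {
    awk -F '\t' '
        /^##FASTA/ { exit }
        /^#/       { next }
        NF == 9    { count[$3]++ }
        END        { for (t in count) print t "\t" count[t] }
    ' "$1" | sort
}

# Example call (placeholder file name, adjust to your own output):
# count_gff_features annotation/PROKKA_*.gff
```

The output is one line per feature type, for example the number of CDS and tRNA features, which gives a first sanity check on whether the annotation looks plausible for a bacterial genome.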

Interactive viewing

We will use the software Integrative Genomics Viewer (IGV) to view the assembly, the genome annotation, and the variants that you have called, all in one window.

New Tool

Integrative Genomics Viewer (IGV)
An easy-to-use interactive tool for the visual exploration of genomic data

Follow these steps to view the genomic data we have generated thus far:

Open IGV by running the igv command in the terminal. This will open a new window.
Navigate to Genomes > Load Genome From File. Load the genome assembly by selecting assembly/spades-150/scaffolds.fasta

Next, to load our variant calling data, we first need to extract the VCF files we compressed earlier:

#First change into the directory where the variant data is located
$ cd variants

#extract vcf file for evol1
$ gzip -dk evol1.freebayes.filtered.vcf.gz

#extract vcf file for evol2
$ gzip -dk evol2.freebayes.filtered.vcf.gz
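
To confirm that the extraction worked, the two steps can be wrapped in a small function that decompresses a file and reports how many variant records it contains. This is only a sketch: `extract_and_count` is a hypothetical helper name, and it uses `gzip -dc` with a redirect, which has the same effect as `gzip -dk` (the compressed file is kept) but overwrites an earlier extraction without prompting.

```shell
# Decompress a gzipped VCF next to the original, keeping the .gz file,
# and print the number of variant records (non-header lines).
extract_and_count() {
    gz_file=$1
    vcf_file=${gz_file%.gz}
    # -d decompress, -c write to standard output so the input is left untouched
    gzip -dc "$gz_file" > "$vcf_file"
    # Header lines in a VCF start with '#'; every other line is one record
    n=$(awk '!/^#/ { n++ } END { print n + 0 }' "$vcf_file")
    echo "$vcf_file: $n records"
}

# Example calls, run from inside the variants directory:
# extract_and_count evol1.freebayes.filtered.vcf.gz
# extract_and_count evol2.freebayes.filtered.vcf.gz
```

A count of zero records would suggest that the filtering step removed every variant, or that the wrong file was extracted.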

Next, load each of the extracted VCF files in turn by navigating to File > Load from File and selecting each file.
Finally, load the Prokka annotation by navigating to File > Load from File and selecting the GFF file inside the annotation directory.
You can now select different contigs and zoom in and out on the sequence.