Genomics Tutorial - Variant Calling

Introduction

In this section we will use our genome assembly based on the ancestor and call genetic variants in the evolved line. For more information on variant calling, see Reference 1.

Overview

The part of the work-flow we will work on in this section marked in red below:

vc_workflow

Learning outcomes

After completing this section of the tutorial you should be able to:

  • Use tools to call variants based on a reference genome.
  • Be able to describe what influences the calling of variants.

The Data

Lets look at our directory structure in ~/genomics_tutorial so far:

genomics_tutorial
├── assembly
│   ├── quast
│   │   ├── basic_stats
│   │   └── icarus_viewers
│   ├── spades-150
│   │   ├── corrected
│   │   │   └── configs
│   │   ├── K21
│   │   │   ├── configs
│   │   │   └── simplified_contigs
│   │   ├── K33
│   │   │   ├── configs
│   │   │   └── simplified_contigs
│   │   ├── K55
│   │   │   ├── configs
│   │   │   └── simplified_contigs
│   │   ├── K77
│   │   │   ├── configs
│   │   │   └── path_extend
│   │   ├── misc
│   │   ├── mismatch_corrector
│   │   │   ├── contigs
│   │   │   │   └── configs
│   │   │   └── scaffolds
│   │   │       └── configs
│   │   ├── pipeline_state
│   │   └── tmp
│   └── spades-original
│       ├── corrected
│       │   └── configs
│       ├── K21
│       │   ├── configs
│       │   └── simplified_contigs
│       ├── K33
│       │   ├── configs
│       │   └── simplified_contigs
│       ├── K55
│       │   ├── configs
│       │   └── simplified_contigs
│       ├── K77
│       │   ├── configs
│       │   └── path_extend
│       ├── misc
│       ├── mismatch_corrector
│       │   ├── contigs
│       │   │   └── configs
│       │   └── scaffolds
│       │       └── configs
│       ├── pipeline_state
│       └── tmp
├── data
├── kraken
├── krona
│   └── taxonomy
├── mappings
│   ├── evol1.sorted.dedup_stats
│   │   ├── css
│   │   ├── images_qualimapReport
│   │   └── raw_data_qualimapReport
│   └── ref_genome
└── quality_control
    ├── data
    ├── multiqc_data
    ├── trimmed
    └── trimmed-fastqc

65 directories

Preprocessing

We first need to make an index of our reference genome as this is required by the SNP caller. Given a scaffold/contig file in fasta-format, e.g. scaffolds.fasta which is located in the directory assembly/, use Samtools to do this:

#First *cd* back into ~/genomics_tutorial          
$ samtools faidx assembly/spades-150/scaffolds.fasta

In the next step we will further pre-process our mapping files and create a bam-index file (.bai) for the bam-file we want to work with:

#Create the bam index with Bamtools
$ bamtools index -in mappings/evol1.sorted.dedup.q20.bam

In the last step of preprocessing create a new directory for the variants:

#Create a directory to store the variant calling results
$ mkdir variants

Calling variants

We will call variants with freebayes.

New Tool

freebayes
a Bayesian genetic variant detector for variable elements that are smaller than the length of a short-read sequencing alignment.

To call variants with freebayes we need:

  • A reference genome scaffold file in fasta-format, e.g. scaffolds.fasta
  • An index in .fai format - We created this index in the pre-processing step with Samtools.
  • A mapping file (.bam file)
  • A mapping index in .bai format - We created this index in the pre-processing step with Bamtools.

Thus, we have all the input we need to do variant calling:

#Make sure you are in the ~/genomics_tutorial directory
$ cd ~/genomics_tutorial

# Now we call variants and pipe the results into a new file
$ freebayes -p 1 -f assembly/scaffolds.fasta mappings/evol1.sorted.dedup.q20.bam > variants/evol1.freebayes.vcf

Where in the command above:

  • -p 1 specifies the ploidy level. E.Coli are haploid.

Post-processing

The vcf file format

Our freebayes run produced a vcf formatted file. As we explore our results we will also explore the vcf format simultaneously. To see how the file is structured, we can use Linux terminal tools to view its contents:

# first 10 lines, which are part of the header
$ cat variants/evol1.freebayes.vcf | head

This should give you the header of the file. For example:

##fileformat=VCFv4.2
##fileDate=20220607
##source=freeBayes v1.3.6
##reference=assembly/spades-150/scaffolds.fasta
##contig=<ID=NODE_1_length_115086_cov_5.759660,length=115086>
##contig=<ID=NODE_2_length_106179_cov_5.826921,length=106179>
##contig=<ID=NODE_3_length_102007_cov_6.351643,length=102007>
##contig=<ID=NODE_4_length_100915_cov_6.070430,length=100915>
##contig=<ID=NODE_5_length_99317_cov_6.509049,length=99317>
##contig=<ID=NODE_6_length_90900_cov_5.988142,length=90900>
Header lines are demarcated by ## so we can use this feature to strip out the header lines and only look at the top 4 entries:
# remove header lines and look at top 4 entries
$ cat variants/evol1.freebayes.vcf | grep -v '##' | head -4

This yields output similar to the example below:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  unknown
NODE_1_length_115086_cov_5.759660   81  .   A   T   0   .   AB=0;ABP=0;AC=0;AF=0;AN=1;AO=2;CIGAR=1X;DP=37;DPB=37;DPRA=0;EPP=7.35324;EPPR=13.4954;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=166.958;PAIRED=1;PAIREDR=0.971429;PAO=0;PQA=0;PQR=0;PRO=0;QA=49;QR=887;RO=35;RPL=0;RPP=7.35324;RPPR=10.5174;RPR=2;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=21;SRP=6.05036;SRR=14;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0:37:35,2:35:887:2:49:0,-75.4037
NODE_1_length_115086_cov_5.759660   83  .   A   T   6.17772e-15 .   AB=0;ABP=0;AC=0;AF=0;AN=1;AO=2;CIGAR=1X;DP=37;DPB=37;DPRA=0;EPP=7.35324;EPPR=6.05036;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=232.777;PAIRED=1;PAIREDR=0.971429;PAO=0;PQA=0;PQR=0;PRO=0;QA=40;QR=1195;RO=35;RPL=1;RPP=3.0103;RPPR=10.5174;RPR=1;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=22;SRP=8.03571;SRR=13;TYPE=snp  GT:DP:AD:RO:QR:AO:QA:GL 0:37:35,2:35:1195:2:40:0,-104.042
NODE_1_length_115086_cov_5.759660   412 .   G   T   0   .   AB=0;ABP=0;AC=0;AF=0;AN=1;AO=2;CIGAR=1X;DP=33;DPB=33;DPRA=0;EPP=3.0103;EPPR=4.76149;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=216.196;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=16;QR=1078;RO=31;RPL=2;RPP=7.35324;RPPR=6.44263;RPR=0;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=14;SRP=3.64073;SRR=17;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0:33:31,2:31:1078:2:16:0,-95.797          

The fields in a vcf-file are described as follows:

Column Field Name Description
1 CHROM Chromosome name
2 POS 1-based position. For an indel, this is the position preceding the indel.
3 ID Variant identifier. Usually the dbSNP rsID.
4 REF Reference sequence at POS involved in the variant. For a SNP, it is a single base.
5 ALT Comma delimited list of alternative seuqence(s).
6 QUAL Phred-scaled probability of all samples being homozygous reference.
7 FILTER Semicolon delimited list of filters that the variant fails to pass.
8 INFO Semicolon delimited list of variant information.
9 FORMAT Colon delimited list of the format of individual genotypes in the following fields.
10+ Sample(s) Individual genotype information defined by FORMAT.

Statistics

The next step in exploring our vcf data is to generated some statistics that can tell us more about the quality of the called variants

First, to prepare our vcf-file for querying we need to index it with tabix (which is part of Samtools):

#We first need to compress file 
$ bgzip variants/evol1.freebayes.vcf

#Now we can index the file
$ tabix -p vcf variants/evol1.freebayes.vcf.gz
Where in the command above:

  • -p vcfspecifies that the format of the input file is vcf

We can generate some statistics using rtg vcfstats

New Tool

RTG Tools
A software package that can be used to read, evaluate and manipulate vcf formatted files

Run rtg vcfstats as follows:

#Remember that we are using the compressed version of our vcf file
$ rtg vcfstats variants/evol1.freebayes.vcf.gz

The output should look similar to the example below:

Location                     : variants/evol1.freebayes.vcf.gz
Failed Filters               : 0
Passed Filters               : 35221
SNPs                         : 101
MNPs                         : 6
Insertions                   : 5
Deletions                    : 4
Indels                       : 1
Same as reference            : 35104
Total Haploid                : 117
Haploid SNPs                 : 101
Haploid MNPs                 : 6
Haploid Insertions           : 5
Haploid Deletions            : 4
Haploid Indels               : 1
SNP Transitions/Transversions: 0.60 (38/63)
Insertion/Deletion ratio     : 1.25 (5/4)
Indel/SNP+MNP ratio          : 0.09 (10/107)

The output above gives a nice summary of our variant calling data, but we can use another tool that is part of samtools, bcftools, to get more detailed statistics of our data.

Run bcftools as follow:

$ bcftools stats -F assembly/spades-150/scaffolds.fasta -s - variants/evol1.freebayes.vcf.gz > variants/evol1.freebayes.vcf.gz.stats

Where in the command above:

  • -s - is a list of samples for sample stats. - is used to include all samples
  • -F FILE is the faidx indexed reference sequence file (which is used to determine INDEL context)

To visualize the results from bcftools we do the following:

#First create a directory to store the plots in
$ mkdir variants/plots

#Next, generate the plots
$ plot-vcfstats -p variants/plots/variants/evol1.freebayes.vcf.gz.stats
Where in the command above: * -p is the output files prefix. Add a slash at the end to create a new directory.

Check the variants/plots directory for the generated output. A summary of all the data is generated and packaged in a pdf file.

An example of some of the plots generated can be seen below:

vcfstats

Variant filtration

Variant filtration is still a developing field and thus there is no consensus the best method to filter variants (2). For the purposes of this tutorial, we will only perform some simple filtration procedures here.

We can start by filtering based on quality by selecting only variants with a quality score > 30. To do this we can once again use RTG tools:

# use rtg vcfffilter
$ rtg vcffilter -q 30 -i variants/evol1.freebayes.vcf.gz -o variants/evol1.freebayes.q30.vcf.gz

Where in the command above

  • -i FILE specifies the input file
  • -o FILEspecifies the output file
  • -q FLOATspecifies the minimal allowed quality in output.

Next we can get some quick statistics on the filtered variants as done before:

# look at stats for filtered
$ rtg vcfstats variants/evol1.freebayes.q30.vcf.gz

An example of the output can be seen below:

Location                     : variants/evol1.freebayes.q30.vcf.gz
Failed Filters               : 0
Passed Filters               : 105
SNPs                         : 90
MNPs                         : 5
Insertions                   : 5
Deletions                    : 4
Indels                       : 1
Same as reference            : 0
Total Haploid                : 105
Haploid SNPs                 : 90
Haploid MNPs                 : 5
Haploid Insertions           : 5
Haploid Deletions            : 4
Haploid Indels               : 1
SNP Transitions/Transversions: 0.64 (35/55)
Insertion/Deletion ratio     : 1.25 (5/4)
Indel/SNP+MNP ratio          : 0.11 (10/95)
Comparing the example above with the previous output from vcfstats you can see that we have filtered out some SNPs.

freebayes adds some extra information to the vcf files it creates and this allows for some more detailed filtering.

Note

This strategy will NOT work on calls done with e.g. samtools / bcftools mpileup called variants.

Here we filter based on some recommendations from the developer of freebayes:

$ zcat variants/evol1.freebayes.vcf.gz | vcffilter -f "QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1" | bgzip > variants/evol1.freebayes.filtered.vcf.gz
Where in the commands above:

  • QUAL > 1: removes really bad sites
  • QUAL / AO > 10: additional contribution of each observation should be 10 log units (~ Q10 per read)
  • SAF > 0 & SAR > 0: reads on both strands
  • RPR > 1 & RPL > 1: at least two reads “balanced” to each side of the site

Now we can look at the statistics after filtering:

#Index the newly filtered vcf file
$ tabix -p vcf variants/evol1.freebayes.filtered.vcf.gz

#Obtain statistics for the filtered vcf file
$ rtg vcfstats variants/evol1.freebayes.filtered.vcf.gz
The output is now as follows:
Location                     : variants/evol1.freebayes.filtered.vcf.gz
Failed Filters               : 0
Passed Filters               : 59
SNPs                         : 48
MNPs                         : 1
Insertions                   : 5
Deletions                    : 4
Indels                       : 1
Same as reference            : 0
Total Haploid                : 59
Haploid SNPs                 : 48
Haploid MNPs                 : 1
Haploid Insertions           : 5
Haploid Deletions            : 4
Haploid Indels               : 1
SNP Transitions/Transversions: 0.50 (16/32)
Insertion/Deletion ratio     : 1.25 (5/4)
Indel/SNP+MNP ratio          : 0.20 (10/49)
Our SNPs are now significantly lower, as well as the SNP Transitions/Transversions ratio.

For the purposes of the tutorial the filtering strategy above was sufficient. However, several more elaborate strategies have been explored and should be investigated if you plan on doing variant calling in future. Some of these additional strategies can be found here

To summarize the process of variant calling in this tutorial we can apply the same process flow to the second evolved strain:

# Perform the index mapping
$ bamtools index -in mappings/evol2.sorted.dedup.q20.bam

#Calling the variants
$ freebayes -p 1 -f assembly/scaffolds.fasta mappings/evol2.sorted.dedup.q20.bam > variants/evol2.freebayes.vcf

#Compress the result files
$ bgzip variants/evol2.freebayes.vcf

# Index the compressed file
$ tabix -p vcf variants/evol2.freebayes.vcf.gz

#Get pre-filtered statistics
rtg vcfstats variants/evol2.freebayes.vcf.gz

# Perform filtering
$ zcat variants/evol2.freebayes.vcf.gz | vcffilter -f "QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1" | bgzip > variants/evol2.freebayes.filtered.vcf.gz

#Index the filtered dataset
$ tabix -p vcf variants/evol2.freebayes.filtered.vcf.gz

#Get post-filtered statistics
$rtg vcfstats variants/evol2.freebayes.filtered.vcf.gz

References

  1. Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12(6), 443-451.
  2. Olson, N. D., Lund, S. P., Colman, R. E., Foster, J. T., Sahl, J. W., Schupp, J. M., ... & Zook, J. M. (2015). Best practices for evaluating single nucleotide variant calling methods for microbial genomics. Frontiers in genetics, 6, 235.