Genomics Tutorial - Variant Calling
Introduction
In this section we will use our genome assembly based on the ancestor and call genetic variants in the evolved line. For more information on variant calling, see Reference 1.
Overview
The part of the work-flow we will work on in this section marked in red below:
Learning outcomes
After completing this section of the tutorial you should be able to:
- Use tools to call variants based on a reference genome.
- Be able to describe what influences the calling of variants.
The Data
Lets look at our directory structure in ~/genomics_tutorial so far:
genomics_tutorial
├── assembly
│ ├── quast
│ │ ├── basic_stats
│ │ └── icarus_viewers
│ ├── spades-150
│ │ ├── corrected
│ │ │ └── configs
│ │ ├── K21
│ │ │ ├── configs
│ │ │ └── simplified_contigs
│ │ ├── K33
│ │ │ ├── configs
│ │ │ └── simplified_contigs
│ │ ├── K55
│ │ │ ├── configs
│ │ │ └── simplified_contigs
│ │ ├── K77
│ │ │ ├── configs
│ │ │ └── path_extend
│ │ ├── misc
│ │ ├── mismatch_corrector
│ │ │ ├── contigs
│ │ │ │ └── configs
│ │ │ └── scaffolds
│ │ │ └── configs
│ │ ├── pipeline_state
│ │ └── tmp
│ └── spades-original
│ ├── corrected
│ │ └── configs
│ ├── K21
│ │ ├── configs
│ │ └── simplified_contigs
│ ├── K33
│ │ ├── configs
│ │ └── simplified_contigs
│ ├── K55
│ │ ├── configs
│ │ └── simplified_contigs
│ ├── K77
│ │ ├── configs
│ │ └── path_extend
│ ├── misc
│ ├── mismatch_corrector
│ │ ├── contigs
│ │ │ └── configs
│ │ └── scaffolds
│ │ └── configs
│ ├── pipeline_state
│ └── tmp
├── data
├── kraken
├── krona
│ └── taxonomy
├── mappings
│ ├── evol1.sorted.dedup_stats
│ │ ├── css
│ │ ├── images_qualimapReport
│ │ └── raw_data_qualimapReport
│ └── ref_genome
└── quality_control
├── data
├── multiqc_data
├── trimmed
└── trimmed-fastqc
65 directories
Preprocessing
We first need to make an index of our reference genome as this is required by the SNP caller.
Given a scaffold/contig file in fasta-format, e.g. scaffolds.fasta
which is located in the directory assembly/
, use Samtools to do this:
#First *cd* back into ~/genomics_tutorial
$ samtools faidx assembly/spades-150/scaffolds.fasta
In the next step we will further pre-process our mapping files and create a bam-index file (.bai
) for the bam-file we want to work with:
#Create the bam index with Bamtools
$ bamtools index -in mappings/evol1.sorted.dedup.q20.bam
In the last step of preprocessing create a new directory for the variants:
#Create a directory to store the variant calling results
$ mkdir variants
Calling variants
We will call variants with freebayes.
New Tool
freebayes
a Bayesian genetic variant detector for variable elements that are smaller than the length of a short-read sequencing alignment.
To call variants with freebayes we need:
- A reference genome scaffold file in fasta-format, e.g.
scaffolds.fasta
- An index in
.fai
format - We created this index in the pre-processing step with Samtools. - A mapping file (
.bam
file) - A mapping index in
.bai
format - We created this index in the pre-processing step with Bamtools.
Thus, we have all the input we need to do variant calling:
#Make sure you are in the ~/genomics_tutorial directory
$ cd ~/genomics_tutorial
# Now we call variants and pipe the results into a new file
$ freebayes -p 1 -f assembly/scaffolds.fasta mappings/evol1.sorted.dedup.q20.bam > variants/evol1.freebayes.vcf
Where in the command above:
-p 1
specifies the ploidy level. E.Coli are haploid.
Post-processing
The vcf file format
Our freebayes run produced a vcf
formatted file. As we explore our results we will also explore the vcf
format simultaneously. To see how the file is structured, we can use Linux terminal tools to view its contents:
# first 10 lines, which are part of the header
$ cat variants/evol1.freebayes.vcf | head
This should give you the header of the file. For example:
##fileformat=VCFv4.2
##fileDate=20220607
##source=freeBayes v1.3.6
##reference=assembly/spades-150/scaffolds.fasta
##contig=<ID=NODE_1_length_115086_cov_5.759660,length=115086>
##contig=<ID=NODE_2_length_106179_cov_5.826921,length=106179>
##contig=<ID=NODE_3_length_102007_cov_6.351643,length=102007>
##contig=<ID=NODE_4_length_100915_cov_6.070430,length=100915>
##contig=<ID=NODE_5_length_99317_cov_6.509049,length=99317>
##contig=<ID=NODE_6_length_90900_cov_5.988142,length=90900>
Header lines are demarcated by ##
so we can use this feature to strip out the header lines and only look at the top 4 entries:
# remove header lines and look at top 4 entries
$ cat variants/evol1.freebayes.vcf | grep -v '##' | head -4
This yields output similar to the example below:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown
NODE_1_length_115086_cov_5.759660 81 . A T 0 . AB=0;ABP=0;AC=0;AF=0;AN=1;AO=2;CIGAR=1X;DP=37;DPB=37;DPRA=0;EPP=7.35324;EPPR=13.4954;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=166.958;PAIRED=1;PAIREDR=0.971429;PAO=0;PQA=0;PQR=0;PRO=0;QA=49;QR=887;RO=35;RPL=0;RPP=7.35324;RPPR=10.5174;RPR=2;RUN=1;SAF=2;SAP=7.35324;SAR=0;SRF=21;SRP=6.05036;SRR=14;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0:37:35,2:35:887:2:49:0,-75.4037
NODE_1_length_115086_cov_5.759660 83 . A T 6.17772e-15 . AB=0;ABP=0;AC=0;AF=0;AN=1;AO=2;CIGAR=1X;DP=37;DPB=37;DPRA=0;EPP=7.35324;EPPR=6.05036;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=232.777;PAIRED=1;PAIREDR=0.971429;PAO=0;PQA=0;PQR=0;PRO=0;QA=40;QR=1195;RO=35;RPL=1;RPP=3.0103;RPPR=10.5174;RPR=1;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=22;SRP=8.03571;SRR=13;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0:37:35,2:35:1195:2:40:0,-104.042
NODE_1_length_115086_cov_5.759660 412 . G T 0 . AB=0;ABP=0;AC=0;AF=0;AN=1;AO=2;CIGAR=1X;DP=33;DPB=33;DPRA=0;EPP=3.0103;EPPR=4.76149;GTI=0;LEN=1;MEANALT=1;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=216.196;PAIRED=1;PAIREDR=1;PAO=0;PQA=0;PQR=0;PRO=0;QA=16;QR=1078;RO=31;RPL=2;RPP=7.35324;RPPR=6.44263;RPR=0;RUN=1;SAF=1;SAP=3.0103;SAR=1;SRF=14;SRP=3.64073;SRR=17;TYPE=snp GT:DP:AD:RO:QR:AO:QA:GL 0:33:31,2:31:1078:2:16:0,-95.797
The fields in a vcf-file are described as follows:
Column | Field Name | Description |
---|---|---|
1 | CHROM | Chromosome name |
2 | POS | 1-based position. For an indel, this is the position preceding the indel. |
3 | ID | Variant identifier. Usually the dbSNP rsID. |
4 | REF | Reference sequence at POS involved in the variant. For a SNP, it is a single base. |
5 | ALT | Comma delimited list of alternative seuqence(s). |
6 | QUAL | Phred-scaled probability of all samples being homozygous reference. |
7 | FILTER | Semicolon delimited list of filters that the variant fails to pass. |
8 | INFO | Semicolon delimited list of variant information. |
9 | FORMAT | Colon delimited list of the format of individual genotypes in the following fields. |
10+ | Sample(s) | Individual genotype information defined by FORMAT. |
Statistics
The next step in exploring our vcf
data is to generated some statistics that can tell us more about the quality of the called variants
First, to prepare our vcf-file for querying we need to index it with tabix
(which is part of Samtools):
#We first need to compress file
$ bgzip variants/evol1.freebayes.vcf
#Now we can index the file
$ tabix -p vcf variants/evol1.freebayes.vcf.gz
Where in the command above:
-p vcf
specifies that the format of the input file isvcf
We can generate some statistics using rtg vcfstats
New Tool
RTG Tools
A software package that can be used to read, evaluate and manipulate vcf
formatted files
Run rtg vcfstats
as follows:
#Remember that we are using the compressed version of our vcf file
$ rtg vcfstats variants/evol1.freebayes.vcf.gz
The output should look similar to the example below:
Location : variants/evol1.freebayes.vcf.gz
Failed Filters : 0
Passed Filters : 35221
SNPs : 101
MNPs : 6
Insertions : 5
Deletions : 4
Indels : 1
Same as reference : 35104
Total Haploid : 117
Haploid SNPs : 101
Haploid MNPs : 6
Haploid Insertions : 5
Haploid Deletions : 4
Haploid Indels : 1
SNP Transitions/Transversions: 0.60 (38/63)
Insertion/Deletion ratio : 1.25 (5/4)
Indel/SNP+MNP ratio : 0.09 (10/107)
The output above gives a nice summary of our variant calling data, but we can use another tool that is part of samtools, bcftools
, to
get more detailed statistics of our data.
Run bcftools
as follow:
$ bcftools stats -F assembly/spades-150/scaffolds.fasta -s - variants/evol1.freebayes.vcf.gz > variants/evol1.freebayes.vcf.gz.stats
Where in the command above:
-s -
is a list of samples for sample stats.-
is used to include all samples-F FILE
is thefaidx
indexed reference sequence file (which is used to determine INDEL context)
To visualize the results from bcftools
we do the following:
#First create a directory to store the plots in
$ mkdir variants/plots
#Next, generate the plots
$ plot-vcfstats -p variants/plots/variants/evol1.freebayes.vcf.gz.stats
Where in the command above:
* -p
is the output files prefix. Add a slash at the end to create a new directory.
Check the variants/plots
directory for the generated output. A summary of all the data is generated and packaged in a pdf file.
An example of some of the plots generated can be seen below:
Variant filtration
Variant filtration is still a developing field and thus there is no consensus the best method to filter variants (2). For the purposes of this tutorial, we will only perform some simple filtration procedures here.
We can start by filtering based on quality by selecting only variants with a quality score > 30. To do this we can once again use RTG tools
:
# use rtg vcfffilter
$ rtg vcffilter -q 30 -i variants/evol1.freebayes.vcf.gz -o variants/evol1.freebayes.q30.vcf.gz
Where in the command above
-i FILE
specifies the input file-o FILE
specifies the output file-q FLOAT
specifies the minimal allowed quality in output.
Next we can get some quick statistics on the filtered variants as done before:
# look at stats for filtered
$ rtg vcfstats variants/evol1.freebayes.q30.vcf.gz
An example of the output can be seen below:
Location : variants/evol1.freebayes.q30.vcf.gz
Failed Filters : 0
Passed Filters : 105
SNPs : 90
MNPs : 5
Insertions : 5
Deletions : 4
Indels : 1
Same as reference : 0
Total Haploid : 105
Haploid SNPs : 90
Haploid MNPs : 5
Haploid Insertions : 5
Haploid Deletions : 4
Haploid Indels : 1
SNP Transitions/Transversions: 0.64 (35/55)
Insertion/Deletion ratio : 1.25 (5/4)
Indel/SNP+MNP ratio : 0.11 (10/95)
Comparing the example above with the previous output from vcfstats
you can see that we have filtered out some SNPs.
freebayes
adds some extra information to the vcf
files it creates and this allows for some more detailed filtering.
Note
This strategy will NOT work on calls done with e.g. samtools
/ bcftools
mpileup called variants.
Here we filter based on some recommendations from the developer of freebayes
:
$ zcat variants/evol1.freebayes.vcf.gz | vcffilter -f "QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1" | bgzip > variants/evol1.freebayes.filtered.vcf.gz
Where in the commands above:
QUAL > 1
: removes really bad sitesQUAL / AO > 10
: additional contribution of each observation should be 10 log units (~ Q10 per read)SAF > 0 & SAR > 0
: reads on both strandsRPR > 1 & RPL > 1
: at least two reads “balanced” to each side of the site
Now we can look at the statistics after filtering:
#Index the newly filtered vcf file
$ tabix -p vcf variants/evol1.freebayes.filtered.vcf.gz
#Obtain statistics for the filtered vcf file
$ rtg vcfstats variants/evol1.freebayes.filtered.vcf.gz
The output is now as follows:
Location : variants/evol1.freebayes.filtered.vcf.gz
Failed Filters : 0
Passed Filters : 59
SNPs : 48
MNPs : 1
Insertions : 5
Deletions : 4
Indels : 1
Same as reference : 0
Total Haploid : 59
Haploid SNPs : 48
Haploid MNPs : 1
Haploid Insertions : 5
Haploid Deletions : 4
Haploid Indels : 1
SNP Transitions/Transversions: 0.50 (16/32)
Insertion/Deletion ratio : 1.25 (5/4)
Indel/SNP+MNP ratio : 0.20 (10/49)
Our SNPs are now significantly lower, as well as the SNP Transitions/Transversions ratio.
For the purposes of the tutorial the filtering strategy above was sufficient. However, several more elaborate strategies have been explored and should be investigated if you plan on doing variant calling in future. Some of these additional strategies can be found here
To summarize the process of variant calling in this tutorial we can apply the same process flow to the second evolved strain:
# Perform the index mapping
$ bamtools index -in mappings/evol2.sorted.dedup.q20.bam
#Calling the variants
$ freebayes -p 1 -f assembly/scaffolds.fasta mappings/evol2.sorted.dedup.q20.bam > variants/evol2.freebayes.vcf
#Compress the result files
$ bgzip variants/evol2.freebayes.vcf
# Index the compressed file
$ tabix -p vcf variants/evol2.freebayes.vcf.gz
#Get pre-filtered statistics
rtg vcfstats variants/evol2.freebayes.vcf.gz
# Perform filtering
$ zcat variants/evol2.freebayes.vcf.gz | vcffilter -f "QUAL > 1 & QUAL / AO > 10 & SAF > 0 & SAR > 0 & RPR > 1 & RPL > 1" | bgzip > variants/evol2.freebayes.filtered.vcf.gz
#Index the filtered dataset
$ tabix -p vcf variants/evol2.freebayes.filtered.vcf.gz
#Get post-filtered statistics
$rtg vcfstats variants/evol2.freebayes.filtered.vcf.gz
References
- Nielsen, R., Paul, J. S., Albrechtsen, A., & Song, Y. S. (2011). Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics, 12(6), 443-451.
- Olson, N. D., Lund, S. P., Colman, R. E., Foster, J. T., Sahl, J. W., Schupp, J. M., ... & Zook, J. M. (2015). Best practices for evaluating single nucleotide variant calling methods for microbial genomics. Frontiers in genetics, 6, 235.