Genomics Tutorial - Read Mapping

Introduction

In the previous section we created a genome assembly of the ancestral sequences.

The next step in the process is to map the evolved sequences to this ancestral reference assembly.

Overview

The part of the work-flow we will work on in this section marked in red below:

mapping_workflow

Learning outcomes

After studying this section of the tutorial you should be able to:

  • Explain the process of sequence read mapping.
  • Use bioinformatics tools to map sequencing reads to a reference genome.
  • Filter mapped reads based on quality.

Setup the environment

Follow these steps to set-up the conda environment for this section:

  1. Open a new terminal and load the workshops/workshops/genomics_workshop_mapping module:
    $ module load workshops/workshops/genomics_workshop_mapping
  2. Activate the conda environment:
    $ gen_map_init

The Data

Before you begin with this section of the tutorial, your genomics_tutorial directory should have the following structure:

genomics_tutorial
├── assembly
│   ├── quast
│   │   ├── basic_stats
│   │   └── icarus_viewers
│   ├── spades-150
│   │   ├── corrected
│   │   │   └── configs
│   │   ├── K21
│   │   │   ├── configs
│   │   │   └── simplified_contigs
│   │   ├── K33
│   │   │   ├── configs
│   │   │   └── simplified_contigs
│   │   ├── K55
│   │   │   ├── configs
│   │   │   └── simplified_contigs
│   │   ├── K77
│   │   │   ├── configs
│   │   │   └── path_extend
│   │   ├── misc
│   │   ├── mismatch_corrector
│   │   │   ├── contigs
│   │   │   │   └── configs
│   │   │   └── scaffolds
│   │   │       └── configs
│   │   ├── pipeline_state
│   │   └── tmp
│   └── spades-original
│       ├── corrected
│       │   └── configs
│       ├── K21
│       │   ├── configs
│       │   └── simplified_contigs
│       ├── K33
│       │   ├── configs
│       │   └── simplified_contigs
│       ├── K55
│       │   ├── configs
│       │   └── simplified_contigs
│       ├── K77
│       │   ├── configs
│       │   └── path_extend
│       ├── misc
│       ├── mismatch_corrector
│       │   ├── contigs
│       │   │   └── configs
│       │   └── scaffolds
│       │       └── configs
│       ├── pipeline_state
│       └── tmp
├── data
└── quality_control
    ├── data
    ├── multiqc_data
    ├── trimmed
    └── trimmed-fastqc

56 directories

Mapping sequence reads to a reference genome

We want to map the sequencing reads to the ancestral reference genome. We are going to use the quality trimmed forward and backward DNA sequences of the evolved line and use a program called BWA to map the reads (1).

New Tool

BWA
A short read aligner, that can take a reference genome and map single- or paired-end sequence data to it

To obtain information on how to run BWA tools, do the following:

#general help
$ bwa

#Help with indexing
$ bwa index

# Help with mem
$ bwa mem

Creating a reference index for mapping

BWA requires an index of the reference genome and can either create one or reuse an index that was created previously.

Note

The indexing step may be time-consuming. Remember that once created, the index can be reused.

To create a BWA index for our ancestral genome assembly, we will pass the scaffolds.fasta file from out assembly to bwa index:

#First change into the root folder for the tutorial
$ cd ~/genomics_tutorial

#Make a directory for the mappings together with a sub-directory for our reference genome assembly
$ mkdir -p mappings/ref_genome

#Now copy the genome assembly done in the previous section to our newly created ref_genome directory
$ cp assembly/spades-150/scaffolds.fasta mappings/ref_genome

#Create the bwa index
$ bwa index assembly/spades-150/scaffolds.fasta

#Do an ls on the mappings/ref_genome directory
$ ls mappings/ref_genome
After performing the ls in the last step, you will see that BWA have created many files for our reference genome assembly. Together, these files are the BWA index that can now be reused in the mapping process.

Mapping reads in a paired-end manner

Now that we have created our index, it is time to map the trimmed sequencing reads of our two evolved lines to the ancestral genome.

$ bwa mem mappings/ref_genome/scaffolds.fasta quality_control/trimmed/evol1_R1.fastq.gz quality_control/trimmed/evol1_R2.fastq.gz > mappings/evol1.sam

$ bwa mem mappings/ref_genome/scaffolds.fasta quality_control/trimmed/evol2_R1.fastq.gz quality_control/trimmed/evol2_R2.fastq.gz > mappings/evol2.sam

The commands above created two .sam files. This file format is a common format for read mapping tools, including BWA. Lets dive deeper into the format and see what it looks like.

The SAM File Format

A quick overview of the sam-format can be found here.

However, briefly:

SAM File Format

Any number of header lines starting with @ symbol
An alignment section/line with 11 mandatory fields in a tab delimited format

The columns of an alignment line in the mapping file are described in the table below:

Column Field Description
1 QNAME Query (pair) NAME
2 FLAG bitwise FLAG
3 RNAME Reference sequence NAME
4 POS 1-based leftmost POSition/coordinate of clipped sequence
5 MAPQ MAPping Quality (Phred-scaled)
6 CIAGR extended CIGAR string
7 MRNM Mate Reference sequence NaMe (‘=’ if same as RNAME)
8 MPOS 1-based Mate POSition
9 ISIZE Inferred insert SIZE
10 SEQ query SEQuence on the same strand as the reference
11 QUAL query QUALity (ASCII-33 gives the Phred base quality)
12 OPT variable OPTional fields in the format TAG\:VTYPE\:VALUE

One line of a mapped read can be seen here (you will need to scroll to the right):


    M02810:197:000000000-AV55U:1:1101:10000:11540   83      NODE_1_length_1419525_cov_15.3898       607378  60      151M    =       607100  -429    TATGGTATCACTTATGGTATCACTTATGGCTATCACTAATGGCTATCACTTATGGTATCACTTATGACTATCAGACGTTATTACTATCAGACGATAACTATCAGACTTTATTACTATCACTTTCATATTACCCACTATCATCCCTTCTTTA FHGHHHHHGGGHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHGHHHHHGHHHHHHHHGDHHHHHHHHGHHHHGHHHGHHHHHHFHHHHGHHHHIHHHHHHHHHHHHHHHHHHHGHHHHHGHGHHHHHHHHEGGGGGGGGGFBCFFFFCCCCC NM:i:0  MD:Z:151        AS:i:151        XS:i:0

The line above essentially defines the read and the position within the reference genome, where the read mapped and a quality of the mapping.

Mapping post-processing

After performing sequence mapping, some post processing is required, which is what we will demonstrate in this section.

Fix mates and compress

Because aligners can sometimes leave unusual SAM flag information on SAM records, it is helpful when working with many tools to first clean up read pairing information and flags with Samtools.

New Tool

Samtools
A tool for reading/writing/editing/indexing/viewing files in the SAM/BAM/CRAM formats

We are also going to produce compressed bam output for efficient storing of and access to the mapped reads.

Note

samtools fixmate expects name-sorted input files, which we can achieve with samtools sort -n.

Thus, we will pipe (|) the output from samtools sort -n to samtools fixmate:

    $ samtools sort -n -O sam mappings/evol1.sam | samtools fixmate -m -O bam - mappings/evol1.fixmate.bam

    $ samtools sort -n -O sam mappings/evol2.sam | samtools fixmate -m -O bam - mappings/evol2.fixmate.bam
Where in the commands above:

  • -m: Adds the ms (mate score) tags. These are used by markdup (below) to select the best reads to keep.
  • -O bam: Specifies that we want compressed bam output from fixmate

Attention

The step of sam to bam-file conversion might take a few minutes to finish, depending on how big your mapping file is.

Note

Once we have the bam-file, we can also delete the original sam-file as it requires too much space and we can always recreate it from the bam-file. For this tutorial we will not do this.

Sorting

Next, we need to use samtools again to sort the bam-file into coordinate order:

# convert to bam file and sort
$ samtools sort -O bam -o mappings/evol1.sorted.bam mappings/evol1.fixmate.bam

$ samtools sort -O bam -o mappings/evol2.sorted.bam mappings/evol2.fixmate.bam

Where in the commands above: * -o: specifies the name of the output file. * -O bam: specifies that the output will be bam-format

Remove duplicates

In the final step of post-processing, we remove duplicate reads. The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction. It should be noted that this step is not always recommended. It depends on the research question. In SNP calling it is a good idea to remove duplicates, as the statistics used in the tools that call SNPs subsequently expect this (most tools anyway). However, for other research questions that use mapping, you might not want to remove duplicates, e.g. RNA-seq.

To remove the duplicates, do the following:

#Remove duplicates
$ samtools markdup -r -S mappings/evol1.sorted.bam mappings/evol1.sorted.dedup.bam

$ samtools markdup -r -S mappings/evol2.sorted.bam mappings/evol2.sorted.dedup.bam  

Mapping statistics

Stats with QualiMap

For an in depth analysis of the mapping results, one can use qualimap ( 2).

New Tool

Qualimap
This tool examines sequencing alignment data in SAM/BAM files according to the features of the mapped reads and provides an overall view of the data that helps to the detect biases in the sequencing and/or mapping of the data and eases decision-making for further analysis.

To run Qualimap, do the following:

#Run Qualimap
$ qualimap bamqc -bam mappings/evol1.sorted.dedup.bam
$ qualimap bamqc -bam mappings/evol2.sorted.dedup.bam

# Once finished open the result pages with a web browser
$ firefox mappings/evol1.sorted.dedup_stats/qualimapReport.html
$ firefox mappings/evol2.sorted.dedup_stats/qualimapReport.html

This will create a report in the mapping folder. See this webpage to get help on the sections in the report.

Sub-selecting reads

It is important to remember that the mapping commands we used above, without additional parameters to sub-select specific alignments (e.g. for Bowtie there are options like --no-mixed, which suppresses unpaired alignments for paired reads or --no-discordant, which suppresses discordant alignments for paired reads, etc.), are going to output all reads, including unmapped reads, multi-mapping reads, unpaired reads, discordant read pairs, etc. in one file. We can sub-select from the output reads we want to analyse further using samtools.

Concordant reads

We can select read-pair that have been mapped in a correct manner (same chromosome/contig, correct orientation to each other, distance between reads is reasonable).

Attention

We show the command here, but we are not going to use it.

$ samtools view -h -b -f 3 mappings/evol1.sorted.dedup.bam > mappings/evol1.sorted.dedup.concordant.bam
Where in the command above:

  • -b: Output will be bam-format
  • -f 3: Only extract correctly paired reads. -f extracts alignments with the specified SAM flag

Quality-based sub-selection

In this section we want to sub-select reads based on the quality of the mapping. It seems a reasonable idea to only keep good mapping reads. As the SAM-format contains at column 5 the MAPQ value, which we established earlier is the "MAPping Quality" in Phred-scaled, this seems easily achieved. The formula to calculate the MAPQ value is: MAPQ=-10*log10(p), where p is the probability that the read is mapped wrongly. However, there is a problem! While the MAPQ information would be very helpful indeed, the way that various tools implement this value differs. A good overview can be found here. Bottom-line is that we need to be aware that different tools use this value in different ways and the it is good to know the information that is encoded in the value. Once you dig deeper into the mechanics of the MAPQ implementation it becomes clear that this is not an easy topic. If you want to know more about the MAPQ topic, please follow the link above.

For the sake of going forward, we will sub-select reads with at least medium quality as defined by bowtie:

$ samtools view -h -b -q 20 mappings/evol1.sorted.dedup.bam > mappings/evol1.sorted.dedup.q20.bam
$ samtools view -h -b -q 20 mappings/evol2.sorted.dedup.bam > mappings/evol2.sorted.dedup.q20.bam

Where in the commands above:

  • -h: Include the sam header
  • -q 20: Only extract reads with mapping quality >= 20

Hint

I will repeat here a recommendation given at the source link above, as it is a good one: If you unsure what MAPQ scoring scheme is being used in your own data then you can plot out the MAPQ distribution in a BAM file using programs like the mentioned Qualimap or similar programs. This will at least show you the range and frequency with which different MAPQ values appear and may help identify a suitable threshold you may want to use.

Unmapped reads

We could decide to use a program like Kraken to classify all unmapped sequence reads and identify the species they are coming from and test for contamination (See the Taxonomic Investigation section).

Lets see how we can get the unmapped portion of the reads from the bam-file:

# Extract the unmapped reads
$ samtools view -b -f 4 mappings/evol1.sorted.dedup.bam > mappings/evol1.sorted.unmapped.bam
$ samtools view -b -f 4 mappings/evol2.sorted.dedup.bam > mappings/evol2.sorted.unmapped.bam

# Count the unmapped reads
$ samtools view -c mappings/evol1.sorted.unmapped.bam
$ samtools view -c mappings/evol2.sorted.unmapped.bam

Where in the commands above:

  • -b: indicates that the output is BAM.
  • -f INT: only include reads with this SAM flag set. You can also use the command samtools flags to get an overview of the flags.
  • -c: count the reads

Next, we can extract the fastq sequences of the unmapped reads for read1 and read2.

Extract the unmapped reads as fastq sequences
$ samtools fastq -1 mappings/evol1.sorted.unmapped.R1.fastq.gz -2 mappings/evol1.sorted.unmapped.R2.fastq.gz mappings/evol1.sorted.unmapped.bam
$ samtools fastq -1 mappings/evol2.sorted.unmapped.R1.fastq.gz -2 mappings/evol2.sorted.unmapped.R2.fastq.gz mappings/evol2.sorted.unmapped.bam     

These unmapped sequences can now be further investigated in our pipeline at a later stage.

References

  1. Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. bioinformatics, 25(14), 1754-1760.

  2. Okonechnikov, K., Conesa, A., & García-Alcalde, F. (2016). Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics, 32(2), 292-294.