Genomics Tutorial - Taxonomic Investigation

Introduction

At this stage in the analysis, we hope that all the sequences obtained are from the organisms that we are trying to study, i.e. that we performed our extraction and amplification correctly and didn't have contamination. To make sure, we can do a taxonomic investigation of our sequences.

We will use the tool Kraken to assign taxonomic classifications to our sequence reads. Let us see if we can id some sequences from other species.

Overview

The part of the work-flow we will work on in this section marked in red below:

taxo_workflow

Learning outcomes

After completing this section of the tutorial you should be able to:

  • To taxonomically define sequences in an assembly
  • To visualize the taxonomic layout of a sequencing experiment

The Data


Lets see how our directory structure looks so far:

genomics_tutorial
├── assembly
│   ├── quast
│   │   ├── basic_stats
│   │   └── icarus_viewers
│   ├── spades-150
│   │   ├── corrected
│   │   │   └── configs
│   │   ├── K21
│   │   │   ├── configs
│   │   │   └── simplified_contigs
│   │   ├── K33
│   │   │   ├── configs
│   │   │   └── simplified_contigs
│   │   ├── K55
│   │   │   ├── configs
│   │   │   └── simplified_contigs
│   │   ├── K77
│   │   │   ├── configs
│   │   │   └── path_extend
│   │   ├── misc
│   │   ├── mismatch_corrector
│   │   │   ├── contigs
│   │   │   │   └── configs
│   │   │   └── scaffolds
│   │   │       └── configs
│   │   ├── pipeline_state
│   │   └── tmp
│   └── spades-original
│       ├── corrected
│       │   └── configs
│       ├── K21
│       │   ├── configs
│       │   └── simplified_contigs
│       ├── K33
│       │   ├── configs
│       │   └── simplified_contigs
│       ├── K55
│       │   ├── configs
│       │   └── simplified_contigs
│       ├── K77
│       │   ├── configs
│       │   └── path_extend
│       ├── misc
│       ├── mismatch_corrector
│       │   ├── contigs
│       │   │   └── configs
│       │   └── scaffolds
│       │       └── configs
│       ├── pipeline_state
│       └── tmp
├── data
├── mappings
│   ├── evol1.sorted.dedup_stats
│   │   ├── css
│   │   ├── images_qualimapReport
│   │   └── raw_data_qualimapReport
│   └── ref_genome
└── quality_control
    ├── data
    ├── multiqc_data
    ├── trimmed
    └── trimmed-fastqc

Kraken2

New Tool

Kraken2
A tool that uses k-mers to assign a taxonomic labels in form of NCBI Taxonomy to the sequence (if possible).

The taxonomic label is assigned based on similar k-mer content of the sequence in question to the k-mer content of reference genome sequence. The result is a classification of the sequence in question to the most likely taxonomic label. If the k-mer content is not similar to any genomic sequence in the database used, it will not assign any taxonomic label.

Now we create a directory where we are going to do the analysis and we will change into that directory too.

# First make sure that you are in the root genomics directory
$ cd ~/genomics_tutorial/

#Create an analysis directory for kraken
$ mkdir kraken

#Change into the kraken directory
$ cd kraken

Kraken2 needs a database that can be used to assign the taxonomic labels to sequences. In this tutorial we will use the minikraken database.

In this tutorial, the location of the database will be referred to in commands as the environmental variable $KRAKEN_DB

Note

The "minikraken2" database was created from bacteria, viral and archaea sequences. What are the implications for us when we are trying to classify our sequences?

Now we can attempt to investigate the sequences we got back from the sequencing provider for other species as the one it should contain.

We call the Kraken2 tool and specify the database and fasta-file with the sequences it should use. The general command structure looks like this:

$ kraken2 --use-names --threads 4 --db PATH_TO_DB_DIR --report example.report.txt example.fa > example.kraken

However, we may have fastq-files, so we need to use --fastq-input which tells Kraken2 that it is dealing with fastq-formated files. The --gzip-compressed flag specifies that the input-files are compressed.

In addition, we are dealing with paired-end data, which we can tell Kraken2 with the switch --paired. Here, we are investigating one of the unmapped paired-end read files of the evolved line.

#Create Kraken2 classifications
$ kraken2 --use-names --threads 4 --db $KRAKEN_DB --fastq-input --report evol1 --gzip-compressed --paired ../mappings/evol1.sorted.unmapped.R1.fastq.gz ../mappings/evol1.sorted.unmapped.R2.fastq.gz > evol1.kraken

$ kraken2 --use-names --threads 4 --db $KRAKEN_DB --fastq-input --report evol1 --gzip-compressed --paired ../mappings/evol1.sorted.unmapped.R1.fastq.gz ../mappings/evol1.sorted.unmapped.R2.fastq.gz > evol1.kraken

This classification may take a while, depending on how many sequences we are going to classify. The resulting content of the file evol1.kraken looks similar to the following example:

C       7001326F:121:CBVVLANXX:1:1105:2240:12640        816     251     816:9 171549:5 816:5 171549:3 2:2 816:5 171549:4 816:34 171549:8 816:4 171549:2 816:10 A:35 816:10 171549:2 816:4 171549:8 816:34 171549:4 816:5 2:2 171549:3 816:5 171549:5 816:9
C       7001326F:121:CBVVLANXX:1:1105:3487:12536        1339337 202     1339337:67 A:35 1339337:66
U       7001326F:121:CBVVLANXX:1:1105:5188:12504        0       251     0:91 A:35 0:91
U       7001326F:121:CBVVLANXX:1:1105:11030:12689       0       251     0:91 A:35 0:91
U       7001326F:121:CBVVLANXX:1:1105:7157:12806        0       206     0:69 A:35 0:68
C       7001326F:121:CBVVLANXX:1:1105:10112:12880       1339337 192     1339337:62 A:35 1339337:61

Each sequence classified by |kraken| results in a single line of output. Output lines contain five tab-delimited fields; from left to right, they are:

  1. C/U: one letter code indicating that the sequence was either classified or unclassified.
  2. The sequence ID, obtained from the FASTA/FASTQ header.
  3. The taxonomy ID |kraken| used to label the sequence; this is 0 if the sequence is unclassified and otherwise should be the |ncbitax| identifier.
  4. The length of the sequence in bp.
  5. A space-delimited list indicating the lowest common ancestor (in the taxonomic tree) mapping of each k-mer in the sequence.
    For example, 562:13 561:4 A:31 0:1 562:3 would indicate that:
    • the first 13 k-mers mapped to taxonomy ID #562
    • the next 4 k-mers mapped to taxonomy ID #561
    • the next 31 k-mers contained an ambiguous nucleotide
    • the next k-mer was not in the database
    • the last 3 k-mers mapped to taxonomy ID #562

Note

For more information on how to use Kraken2, see the manual and Reference 1

Investigate Taxa

We can use the webpage NCBI TaxIdentifier to quickly get the names to the taxonomy identifier. However, this is impractical as we are dealing potentially with many sequences. Fortunately Kraken2 has some scripts that help us understand our results better.

Because we used the switch --report FILE, we also have a sample-wide report of all taxa found. This is much better to get an overview what was found.

The first few lines of an example report are shown below.

83.56  514312  514312  U       0       unclassified
16.44  101180  0       R       1       root
16.44  101180  0       R1      131567    cellular organisms
16.44  101180  2775    D       2           Bacteria
13.99  86114   1       D1      1783270       FCB group
13.99  86112   0       D2      68336           Bacteroidetes/Chlorobi group
13.99  86103   8       P       976               Bacteroidetes
13.94  85798   2       C       200643              Bacteroidia
13.94  85789   19      O       171549                Bacteroidales
13.87  85392   0       F       815                     Bacteroidaceae
13.87  85392   170     G       816                       Bacteroides
13.71  84366   32233   S       817                         Bacteroides fragilis

The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:

  1. Percentage of reads covered by the clade rooted at this taxon
  2. Number of reads covered by the clade rooted at this taxon
  3. Number of reads assigned directly to this taxon
  4. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply "-".
  5. NCBI Taxonomy ID
  6. The indented scientific name

Note

If you want to compare the taxa content of different samples to another, one can create a report whose structure is always the same for all samples, disregarding which taxa are found (obviously the percentages and numbers will be different). We can create such a report using the option --report-zero-counts which will print out all taxa (instead of only those found). We then sort the taxa according to taxa-ids (column 5), e.g. sort -n -k5.

The report is not ordered according to taxa ids and contains all taxa in the database, even if they have not been found in our sample and are thus zero. The columns are the same as in the former report, however, we have more rows and they are now differently sorted, according to the NCBI Taxonomy id.

Bracken

New Tool

Bracken
Bracken stands for Bayesian Re-estimation of Abundance with Kraken, and is a statistical method that computes the abundance of species in DNA sequences from a metagenomics sample

Braken uses the taxonomy labels assigned by Kraken2 (see above) to estimate the number of reads originating from each species present in a sample. Bracken classifies reads to the best matching location in the taxonomic tree, but does not estimate abundances of species. Combined with the Kraken classifier, Bracken will produce more accurate species- and genus-level abundance estimates than Kraken2 alone.

The use of Bracken subsequent to Kraken2 is optional but might improve on the Kraken2 results alone.

The general structure of the Bracken command looks like this:

$ bracken -d PATH_TO_DB_DIR -i kraken2.report -o bracken.species.txt -l S
  • -l S: denotes the level we want to look at. S stands for species but other levels are available.
  • -d PATH_TO_DB_DIR: specifies the path to the |kraken| database that should be used.

To apply Bracken to our previous Kraken2 results, do the following:

#Run Bracken
$ bracken -d $KRAKEN_DB -i evol1 -l S -o evol1.bracken

$ bracken -d $KRAKEN_DB -i evol2 -l S -o evol2.bracken

The species-focused result-table (evol1.bracken) looks similar to this:

name    taxonomy_id     taxonomy_lvl    kraken_assigned_reads   added_reads     new_est_reads   fraction_total_reads
Streptococcus sp. oral taxon 431        712633  S       2       0       2       0.00001
Neorhizobium sp. NCHU2750       1825976 S       0       0       0       0.00000
Pseudomonas sp. MT-1    150396  S       0       0       0       0.00000
Ahniella affigens       2021234 S       1       0       1       0.00000
Sinorhizobium sp. CCBAU 05631   794846  S       0       0       0       0.00000
Cohnella sp. 18JY8-7    2480923 S       1       0       1       0.00000
Bacillus velezensis     492670  S       4       4       8       0.00002
Actinoplanes missouriensis      1866    S       2       8       10      0.00002

The important column is the sixth column, new_est_reads, which gives the newly estimated reads.

For more information on Bracken, see Reference 2

Visualisation

We will use the Krona tools to create a nice interactive visualisation of the taxa content of our sample.

New Tool

Krona
A tool that creates multi-layered pie charts of hierarchical data that allows for the interactive exploration of the data within modern web browsers.

The figure below shows an example (albeit an artificial one) snapshot of the visualisation Krona provides (3):

krona_example

To create a Krona page for our data, we need to follow these steps:

Preparation

We must first prepare our data directory for the Krona pre-requisite data:

# we create a directory in our home where the krona database will live
$ mkdir -p ~/genomics_tutorial/krona/taxonomy

# now we make a symbolic link to that directory
$ ln -s ~/genomics_tutorial/krona/taxonomy ~/miniconda3/envs/ngs/opt/krona/taxonomy

Building the taxonomy database

Next we need to build a taxonomy database that Krona will use for visualization.

To build the database, execute the following Krona script:

$ ktUpdateTaxonomy.sh ~/genomics_tutorial/krona/taxonomy

Visualising the taxonomy

Now, we can use the the command ktImportTaxonomy to create the html web-pages for our samples:


# Go to kraken2 results
$ cd kraken

#Make an interactive plot for the first evolved sample
ktImportTaxonomy -m 2 -t 5 evol1 -o evol1.kraken.html

#Make an interactive plot for the second evolved sample
ktImportTaxonomy -m 2 -t 5 evol2 -o evol2.kraken.html
Where in the commands above:

  • -t is the column number in the input file which contains the taxonomic ID
  • -m is the column number in the input file which contains the score (or percentage of the total number of hits)

Attention

In general it is always good practice to know the format of output from the tools you are using. For example, knowing which columns in our Kraken2 report file contained the score and taxonomic classifiers allowed us to feed this information to Krona to produce a meaningful plot. Remember, some tools are designed to take generic input (which makes them more flexible). Krona is a good example of this type of tool.

References

  1. Wood, D. E., & Salzberg, S. L. (2014). Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15(3), 1-12.

  2. Lu, J., Breitwieser, F. P., Thielen, P., & Salzberg, S. L. (2017). Bracken: estimating species abundance in metagenomics data. PeerJ Computer Science, 3, e104.

  3. Ondov, B. D., Bergman, N. H., & Phillippy, A. M. (2011). Interactive metagenomic visualization in a Web browser. BMC bioinformatics, 12(1), 1-10.