Genomics Tutorial - Taxonomic Investigation
Introduction
At this stage in the analysis, we hope that all the sequences obtained are from the organisms that we are trying to study, i.e. that we performed our extraction and amplification correctly and didn't have contamination. To make sure, we can do a taxonomic investigation of our sequences.
We will use the tool Kraken to assign taxonomic classifications to our sequence reads. Let us see if we can id some sequences from other species.
Overview
The part of the work-flow we will work on in this section marked in red below:
Learning outcomes
After completing this section of the tutorial you should be able to:
- To taxonomically define sequences in an assembly
- To visualize the taxonomic layout of a sequencing experiment
The Data
Lets see how our directory structure looks so far:
genomics_tutorial
├── assembly
│ ├── quast
│ │ ├── basic_stats
│ │ └── icarus_viewers
│ ├── spades-150
│ │ ├── corrected
│ │ │ └── configs
│ │ ├── K21
│ │ │ ├── configs
│ │ │ └── simplified_contigs
│ │ ├── K33
│ │ │ ├── configs
│ │ │ └── simplified_contigs
│ │ ├── K55
│ │ │ ├── configs
│ │ │ └── simplified_contigs
│ │ ├── K77
│ │ │ ├── configs
│ │ │ └── path_extend
│ │ ├── misc
│ │ ├── mismatch_corrector
│ │ │ ├── contigs
│ │ │ │ └── configs
│ │ │ └── scaffolds
│ │ │ └── configs
│ │ ├── pipeline_state
│ │ └── tmp
│ └── spades-original
│ ├── corrected
│ │ └── configs
│ ├── K21
│ │ ├── configs
│ │ └── simplified_contigs
│ ├── K33
│ │ ├── configs
│ │ └── simplified_contigs
│ ├── K55
│ │ ├── configs
│ │ └── simplified_contigs
│ ├── K77
│ │ ├── configs
│ │ └── path_extend
│ ├── misc
│ ├── mismatch_corrector
│ │ ├── contigs
│ │ │ └── configs
│ │ └── scaffolds
│ │ └── configs
│ ├── pipeline_state
│ └── tmp
├── data
├── mappings
│ ├── evol1.sorted.dedup_stats
│ │ ├── css
│ │ ├── images_qualimapReport
│ │ └── raw_data_qualimapReport
│ └── ref_genome
└── quality_control
├── data
├── multiqc_data
├── trimmed
└── trimmed-fastqc
Kraken2
New Tool
Kraken2
A tool that uses k-mers to assign a taxonomic labels in form of NCBI Taxonomy to the sequence (if possible).
The taxonomic label is assigned based on similar k-mer content of the sequence in question to the k-mer content of reference genome sequence. The result is a classification of the sequence in question to the most likely taxonomic label. If the k-mer content is not similar to any genomic sequence in the database used, it will not assign any taxonomic label.
Now we create a directory where we are going to do the analysis and we will change into that directory too.
# First make sure that you are in the root genomics directory
$ cd ~/genomics_tutorial/
#Create an analysis directory for kraken
$ mkdir kraken
#Change into the kraken directory
$ cd kraken
Kraken2 needs a database that can be used to assign the taxonomic labels to sequences. In this tutorial we will use the minikraken database.
In this tutorial, the location of the database will be referred to in commands as the environmental variable $KRAKEN_DB
Note
The "minikraken2" database was created from bacteria, viral and archaea sequences. What are the implications for us when we are trying to classify our sequences?
Now we can attempt to investigate the sequences we got back from the sequencing provider for other species as the one it should contain.
We call the Kraken2 tool and specify the database and fasta-file with the sequences it should use. The general command structure looks like this:
$ kraken2 --use-names --threads 4 --db PATH_TO_DB_DIR --report example.report.txt example.fa > example.kraken
However, we may have fastq-files, so we need to use --fastq-input
which tells Kraken2 that it is dealing with fastq-formated files.
The --gzip-compressed
flag specifies that the input-files are compressed.
In addition, we are dealing with paired-end data, which we can tell Kraken2 with the switch --paired
.
Here, we are investigating one of the unmapped paired-end read files of the evolved line.
#Create Kraken2 classifications
$ kraken2 --use-names --threads 4 --db $KRAKEN_DB --fastq-input --report evol1 --gzip-compressed --paired ../mappings/evol1.sorted.unmapped.R1.fastq.gz ../mappings/evol1.sorted.unmapped.R2.fastq.gz > evol1.kraken
$ kraken2 --use-names --threads 4 --db $KRAKEN_DB --fastq-input --report evol1 --gzip-compressed --paired ../mappings/evol1.sorted.unmapped.R1.fastq.gz ../mappings/evol1.sorted.unmapped.R2.fastq.gz > evol1.kraken
This classification may take a while, depending on how many sequences we are going to classify.
The resulting content of the file evol1.kraken
looks similar to the following example:
C 7001326F:121:CBVVLANXX:1:1105:2240:12640 816 251 816:9 171549:5 816:5 171549:3 2:2 816:5 171549:4 816:34 171549:8 816:4 171549:2 816:10 A:35 816:10 171549:2 816:4 171549:8 816:34 171549:4 816:5 2:2 171549:3 816:5 171549:5 816:9
C 7001326F:121:CBVVLANXX:1:1105:3487:12536 1339337 202 1339337:67 A:35 1339337:66
U 7001326F:121:CBVVLANXX:1:1105:5188:12504 0 251 0:91 A:35 0:91
U 7001326F:121:CBVVLANXX:1:1105:11030:12689 0 251 0:91 A:35 0:91
U 7001326F:121:CBVVLANXX:1:1105:7157:12806 0 206 0:69 A:35 0:68
C 7001326F:121:CBVVLANXX:1:1105:10112:12880 1339337 192 1339337:62 A:35 1339337:61
Each sequence classified by |kraken| results in a single line of output. Output lines contain five tab-delimited fields; from left to right, they are:
C/U
: one letter code indicating that the sequence was either classified or unclassified.- The sequence ID, obtained from the FASTA/FASTQ header.
- The taxonomy ID |kraken| used to label the sequence; this is 0 if the sequence is unclassified and otherwise should be the |ncbitax| identifier.
- The length of the sequence in bp.
- A space-delimited list indicating the lowest common ancestor (in the
taxonomic tree) mapping of each k-mer in the sequence.
For example,562:13 561:4 A:31 0:1 562:3
would indicate that:- the first 13 k-mers mapped to taxonomy ID #562
- the next 4 k-mers mapped to taxonomy ID #561
- the next 31 k-mers contained an ambiguous nucleotide
- the next k-mer was not in the database
- the last 3 k-mers mapped to taxonomy ID #562
Investigate Taxa
We can use the webpage NCBI TaxIdentifier to quickly get the names to the taxonomy identifier. However, this is impractical as we are dealing potentially with many sequences. Fortunately Kraken2 has some scripts that help us understand our results better.
Because we used the switch --report FILE
, we also have a sample-wide report of all taxa found.
This is much better to get an overview what was found.
The first few lines of an example report are shown below.
83.56 514312 514312 U 0 unclassified
16.44 101180 0 R 1 root
16.44 101180 0 R1 131567 cellular organisms
16.44 101180 2775 D 2 Bacteria
13.99 86114 1 D1 1783270 FCB group
13.99 86112 0 D2 68336 Bacteroidetes/Chlorobi group
13.99 86103 8 P 976 Bacteroidetes
13.94 85798 2 C 200643 Bacteroidia
13.94 85789 19 O 171549 Bacteroidales
13.87 85392 0 F 815 Bacteroidaceae
13.87 85392 170 G 816 Bacteroides
13.71 84366 32233 S 817 Bacteroides fragilis
The output of kraken-report is tab-delimited, with one line per taxon. The fields of the output, from left-to-right, are as follows:
- Percentage of reads covered by the clade rooted at this taxon
- Number of reads covered by the clade rooted at this taxon
- Number of reads assigned directly to this taxon
- A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply "-".
- NCBI Taxonomy ID
- The indented scientific name
Note
If you want to compare the taxa content of different samples to another, one can create a report whose structure is always the same for all samples, disregarding which taxa are found (obviously the percentages and numbers will be different).
We can create such a report using the option --report-zero-counts
which will print out all taxa (instead of only those found).
We then sort the taxa according to taxa-ids (column 5), e.g. sort -n -k5
.
The report is not ordered according to taxa ids and contains all taxa in the database, even if they have not been found in our sample and are thus zero. The columns are the same as in the former report, however, we have more rows and they are now differently sorted, according to the NCBI Taxonomy id.
Bracken
New Tool
Bracken
Bracken stands for Bayesian Re-estimation of Abundance with Kraken, and is a statistical method that computes the abundance of species in DNA sequences from a metagenomics sample
Braken uses the taxonomy labels assigned by Kraken2 (see above) to estimate the number of reads originating from each species present in a sample. Bracken classifies reads to the best matching location in the taxonomic tree, but does not estimate abundances of species. Combined with the Kraken classifier, Bracken will produce more accurate species- and genus-level abundance estimates than Kraken2 alone.
The use of Bracken subsequent to Kraken2 is optional but might improve on the Kraken2 results alone.
The general structure of the Bracken command looks like this:
$ bracken -d PATH_TO_DB_DIR -i kraken2.report -o bracken.species.txt -l S
-l S
: denotes the level we want to look at.S
stands for species but other levels are available.-d PATH_TO_DB_DIR
: specifies the path to the |kraken| database that should be used.
To apply Bracken to our previous Kraken2 results, do the following:
#Run Bracken
$ bracken -d $KRAKEN_DB -i evol1 -l S -o evol1.bracken
$ bracken -d $KRAKEN_DB -i evol2 -l S -o evol2.bracken
The species-focused result-table (evol1.bracken
) looks similar to this:
name taxonomy_id taxonomy_lvl kraken_assigned_reads added_reads new_est_reads fraction_total_reads
Streptococcus sp. oral taxon 431 712633 S 2 0 2 0.00001
Neorhizobium sp. NCHU2750 1825976 S 0 0 0 0.00000
Pseudomonas sp. MT-1 150396 S 0 0 0 0.00000
Ahniella affigens 2021234 S 1 0 1 0.00000
Sinorhizobium sp. CCBAU 05631 794846 S 0 0 0 0.00000
Cohnella sp. 18JY8-7 2480923 S 1 0 1 0.00000
Bacillus velezensis 492670 S 4 4 8 0.00002
Actinoplanes missouriensis 1866 S 2 8 10 0.00002
The important column is the sixth column, new_est_reads
, which gives the newly estimated reads.
For more information on Bracken, see Reference 2
Visualisation
We will use the Krona tools to create a nice interactive visualisation of the taxa content of our sample.
New Tool
Krona
A tool that creates multi-layered pie charts of hierarchical data that allows for the interactive exploration of the data within modern web browsers.
The figure below shows an example (albeit an artificial one) snapshot of the visualisation Krona provides (3):
To create a Krona page for our data, we need to follow these steps:
Preparation
We must first prepare our data directory for the Krona pre-requisite data:
# we create a directory in our home where the krona database will live
$ mkdir -p ~/genomics_tutorial/krona/taxonomy
# now we make a symbolic link to that directory
$ ln -s ~/genomics_tutorial/krona/taxonomy ~/miniconda3/envs/ngs/opt/krona/taxonomy
Building the taxonomy database
Next we need to build a taxonomy database that Krona will use for visualization.
To build the database, execute the following Krona script:
$ ktUpdateTaxonomy.sh ~/genomics_tutorial/krona/taxonomy
Visualising the taxonomy
Now, we can use the the command ktImportTaxonomy
to create the html web-pages for our samples:
# Go to kraken2 results
$ cd kraken
#Make an interactive plot for the first evolved sample
ktImportTaxonomy -m 2 -t 5 evol1 -o evol1.kraken.html
#Make an interactive plot for the second evolved sample
ktImportTaxonomy -m 2 -t 5 evol2 -o evol2.kraken.html
Where in the commands above:
-t
is the column number in the input file which contains the taxonomic ID-m
is the column number in the input file which contains the score (or percentage of the total number of hits)
Attention
In general it is always good practice to know the format of output from the tools you are using. For example, knowing which columns in our Kraken2 report file contained the score and taxonomic classifiers allowed us to feed this information to Krona to produce a meaningful plot. Remember, some tools are designed to take generic input (which makes them more flexible). Krona is a good example of this type of tool.