BLAST+ Tutorial - Creating Custom BLAST Databases

Introduction

As discussed earlier, the BLAST search tools are separate from the databases they search. The database tools included with BLAST+ allow anyone to convert a collection of FASTA sequences into a searchable BLAST database. Bear in mind, however, that database size influences the E-value and other search statistics, so read up on this effect before comparing results obtained from different databases.
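
To make the database-size effect concrete, here is a minimal sketch based on the textbook relation E = m x n x 2^(-S'), where m is the query length, n is the total length of the database and S' is the bit score of an alignment. All numbers and variable names below are made up for illustration. BLAST itself works with length-adjusted (effective) values, so the E-values it reports will differ somewhat from this simplified calculation.

```shell
# Hypothetical values chosen for illustration only.
query_len=500        # m: query length in residues
bit_score=40         # S': bit score of one alignment
small_db=1000000     # n: total residues in a small database
large_db=100000000   # n: total residues in a database 100 times larger

# E = m * n * 2^(-S'), written with exp/log so it works in any POSIX awk.
evalue() {
    awk -v m="$1" -v n="$2" -v s="$3" \
        'BEGIN { printf "%.2e\n", m * n * exp(-s * log(2)) }'
}

e_small=$(evalue "$query_len" "$small_db" "$bit_score")
e_large=$(evalue "$query_len" "$large_db" "$bit_score")
echo "E-value against the small database: $e_small"
echo "E-value against the large database: $e_large"
```

With everything else held fixed, the same bit score yields an E-value 100 times larger against the database that is 100 times larger, which is why a hit that looks significant in a small custom database may not look significant in a large one.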

In this tutorial we will create a BLAST database from the Virulence Factors of Pathogenic Bacteria (VFPB) database, which is distributed as a collection of FASTA sequences.

The Virulence Factors of Pathogenic Bacteria Database

A good practical example of a custom database is the Virulence Factors of Pathogenic Bacteria database. This database contains sequences of bacterial gene products that cause (or are likely to cause) disease. Virulence factors include bacterial toxins and proteins that mediate bacterial attachment or otherwise help a bacterium establish disease within a host. The database consists of two datasets. The core dataset includes only genes that have been experimentally verified to be virulence factors. The full dataset covers all genes that have been either verified or predicted to be virulence factors, and is therefore larger than the core dataset. We will use the full dataset in this tutorial.

Creating a BLAST database from the VFPB

Perform the following steps to create a VFPB BLAST database:

  1. cd into ~/blast_tutorial

  2. The VFPB sequences are distributed as a gzip-compressed (.gz) file, so we first need to extract the contents:

    $ gunzip VFDB_setB_nt.fas.gz
    

    This will create a VFDB_setB_nt.fas file inside the blast_tutorial directory and delete the gz file. Use more to look at the contents of the file to confirm that the file contains multiple FASTA sequences.

  3. Create a new directory inside blast_tutorial called vfdb_setb_nt and move the fas file into it. (Note that the name of the directory is important: the commands in the following steps refer to it by this name.)

  4. Next we will use the makeblastdb tool to create a BLAST database from the multiple FASTA sequence file. However, first look at all the options provided by the tool:

    $ makeblastdb -h
    
  5. Create the BLAST database inside the vfdb_setb_nt directory:

    $ makeblastdb -in vfdb_setb_nt/VFDB_setB_nt.fas \
                   -parse_seqids \
                   -title "VFDB Full nucleotide dataset" \
                   -dbtype nucl \
                   -out vfdb_setb_nt/vfdb_setb_nt
    

    Where the options used are defined as follows:

    Option         Value                                   Description/Comments
    -------------  --------------------------------------  --------------------
    -in            Path to the input sequence file         The default format for input is a FASTA formatted file
    -parse_seqids  No value needed                         With this option enabled, the header ids of FASTA records will be parsed and used in the database
    -title         Any string enclosed in quotation marks  Title for the database. Make sure that the title for the database is informative
    -dbtype        nucl or prot                            Type of the sequences in the database. Use nucl for nucleic acid sequences and prot for amino acid sequences
    -out           Output path for the new database        The final element in the path will be the database name

  6. Use ls to inspect the ~/blast_tutorial/vfdb_setb_nt/ directory. It should now contain several additional files with the prefix vfdb_setb_nt. The prefix is the database name (as given to the -out option).

    Practically, a BLAST database is a collection of files in a directory that share a common prefix, so the database can be referred to by that prefix alone (usually given to the -db option in BLAST+ tools; see Step 7 for an example).

  7. To retrieve information about the database (or any BLAST database), enter the following:

    $ blastdbcmd -db vfdb_setb_nt/vfdb_setb_nt -info
    

    Note that the path to the database must be specified together with its name (i.e. the prefix).
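
As noted in Step 6, a BLAST database is simply a set of files that share a prefix. The sketch below illustrates this idea with a small helper that lists every file belonging to a given prefix. The helper name list_db_files, the prefix mydb and the scratch directory are hypothetical, and the demonstration uses empty placeholder files named after the usual nucleotide database extensions (.nhr, .nin, .nsq) rather than a real database. Depending on the BLAST+ version and the options used, makeblastdb may write additional files.

```shell
# Hypothetical helper (not part of BLAST+): print every file that shares
# the given path-plus-prefix, one per line.
list_db_files() {
    prefix=$1
    for f in "$prefix".*; do
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Demonstration on empty placeholder files in a scratch directory.
# A database built by makeblastdb would contain real index and sequence data.
demo_dir=$(mktemp -d)
touch "$demo_dir/mydb.nhr" "$demo_dir/mydb.nin" "$demo_dir/mydb.nsq"
touch "$demo_dir/other.txt"    # unrelated file, not part of the database

n_files=$(list_db_files "$demo_dir/mydb" | wc -l | tr -d ' ')
echo "Files sharing the prefix mydb: $n_files"
```

Against the tutorial database, the equivalent call would be list_db_files vfdb_setb_nt/vfdb_setb_nt, which takes the same path-plus-prefix that is passed to the -db option.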

Congratulations! You now have a database that can be queried using the BLAST+ search tools.
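
The steps above can also be rehearsed on a throwaway file before running them on the real data. The following is a minimal dry-run sketch under stated assumptions: the scratch directory, the name toy_nt and the two toy sequences are made up for illustration and stand in for ~/blast_tutorial and the VFPB download. The makeblastdb and blastdbcmd calls use only the options shown in this tutorial and run only if BLAST+ is found on the PATH.

```shell
#!/bin/sh
set -e

# Scratch directory standing in for ~/blast_tutorial (Step 1).
workdir=$(mktemp -d)
cd "$workdir"

# Toy two-record FASTA file standing in for the real compressed download.
printf '>seq1 toy record\nACGTACGTACGT\n>seq2 toy record\nGGCCTTAAGGCC\n' > toy_nt.fas
gzip toy_nt.fas

# Step 2: extract the archive, then count FASTA records by their '>' headers.
gunzip toy_nt.fas.gz
n_records=$(grep -c '^>' toy_nt.fas)
echo "FASTA records found: $n_records"

# Step 3: give the database its own directory and move the FASTA file into it.
mkdir -p toy_nt
mv toy_nt.fas toy_nt/

# Steps 5 and 7: build and inspect the database, if BLAST+ is installed.
if command -v makeblastdb >/dev/null 2>&1; then
    makeblastdb -in toy_nt/toy_nt.fas \
                -parse_seqids \
                -title "Toy nucleotide dataset" \
                -dbtype nucl \
                -out toy_nt/toy_nt
    blastdbcmd -db toy_nt/toy_nt -info
else
    echo "makeblastdb not found on PATH; skipping the database build" >&2
fi
```

Counting header lines with grep is a quick alternative to paging through the file with more when all you need is confirmation that the file holds multiple FASTA sequences.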