The bioinformatics databases mentioned in the introduction have lists of bioinformatics tools and web sites. In addition, the bioinformatics portal biodirectory.com 24 (was bioinformatics.org) has a huge listing of bioinformatics tools and application programming interfaces (APIs). In this section I will outline several of those tools and APIs.
The Basic Local Alignment Search Tool (BLAST) searches for similarities in nucleotide and amino acid sequences. Altschul, Gish, et al. 8 and Altschul, Madden, et al. 33 discuss the statistical basis and algorithms for BLAST. The NCBI BLAST page The Statistics of Sequence Similarity Scores 34 also gives an overview. There are a number of different flavors of BLAST, including BLASTN for nucleotide sequences and BLASTP for protein sequences.
In this article I will only discuss BLASTN and BLASTP. Web user interfaces of these are hosted on the NCBI site 34. They accept queries that are compared against the NCBI BLAST database.
Using BLAST at NCBI can be fairly intimidating at first, with its large blank text areas, but there is a lot of documentation on the use of BLAST, including context-sensitive help. The screens themselves do not explain their use, but the documentation contains examples. The NCBI site seems fantastic for bioinformatics experts doing public research. However, there may be a number of reasons for creating your own BLAST web interface.
Executables of BLAST software can be downloaded from the NCBI site 34 for many common operating systems, including Windows and Linux. One easy way to start is with netblast, which allows you to run queries against public sequence databases rather than setting up and maintaining your own. The NCBI also makes its own web interface available for download so that you can run it on your own system.
For many of the reasons above, you may want to run BLAST against your own database. To do this, first download the standalone BLAST executable from the NCBI BLAST download page. Decompress it and add an ncbi.ini (Windows) or .ncbirc (UNIX and Linux) file to your system path. Then download one of the BLAST databases from the database page. It is recommended to start with a small genome such as E. coli. For a FASTA format genome, decompress the file and then format the database with the command
>bin/formatdb -i ecoli.nt -p F
where ecoli.nt is the name of the database file. Here I have put the database file in the top level directory where BLAST was installed. To test the installation with a nucleotide search, run this command
>bin/blastall -p blastn -d ecoli.nt -i test_blast.txt -o test_blast.out
where the input file test_blast.txt is a part of the E. coli genome and test_blast.out is the output file. The human genome BLAST database is 1.7 GB compressed, so you will need plenty of disk space on your server to run against it.
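The two commands above can also be scripted. The following is a minimal Python sketch, assuming the legacy formatdb and blastall executables sit in a bin directory under the BLAST installation as in the commands above. The function names and the blast_dir parameter are my own illustrations and are not part of the NCBI toolkit.

```python
import subprocess
from pathlib import Path


def formatdb_command(blast_dir, fasta_file, is_protein=False):
    """Build the argument list for formatting a FASTA file as a BLAST database."""
    return [
        str(Path(blast_dir) / "bin" / "formatdb"),
        "-i", fasta_file,
        "-p", "T" if is_protein else "F",  # F = nucleotide, T = protein
    ]


def blastall_command(blast_dir, program, database, query_file, output_file):
    """Build the argument list for a blastall search."""
    return [
        str(Path(blast_dir) / "bin" / "blastall"),
        "-p", program,      # for example blastn or blastp
        "-d", database,     # database name given to formatdb
        "-i", query_file,   # file containing the query sequence
        "-o", output_file,  # file that receives the report
    ]


def run(command, cwd=None):
    """Run a command and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, cwd=cwd, check=True)


# Example use, run from the directory where BLAST was installed:
#   run(formatdb_command(".", "ecoli.nt"))
#   run(blastall_command(".", "blastn", "ecoli.nt",
#                        "test_blast.txt", "test_blast.out"))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems if a file name contains spaces.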
As an example, let's try running BLASTN on the NCBI site by searching for the mRNA sequence for the gene SCN3A sodium channel, voltage-gated, type III, alpha protein [Homo sapiens] discussed in the background section. The input string is ATGGCACAGGCACTGTTGGTACCCCCAGGACCTGAAAGCTTCCGCCTTTTTACTAGA.
Copy and paste the input string into the Search text area on the BLAST web page and click the BLAST! button. When the next page comes up, click the FORMAT! button. The output lists a number of human genes with the best scores. The next best scoring genes are from Pan troglodytes (chimpanzee). Since this is a gene for a protein used in the human brain, the match with chimpanzee DNA is very interesting.
The same thing can be done with the netblast program. After downloading and installing it, run the command line
> blastcl3 -p blastn -i partial_scn3a_input.txt -o partial_scn3a_output.txt
In this command line blastcl3 is the executable program. The -p option specifies BLAST Nucleotide (blastn). The -i option specifies the input file, partial_scn3a_input.txt. The -o option specifies the file that the output should be written to, partial_scn3a_output.txt. The results are exactly the same as from the NCBI BLAST page.
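Preparing the input file and the command line can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the check for the letters A, C, G and T is a deliberately strict assumption of mine (BLAST itself also accepts ambiguity codes such as N).

```python
from pathlib import Path

# The partial SCN3A mRNA sequence used in the example above.
QUERY = "ATGGCACAGGCACTGTTGGTACCCCCAGGACCTGAAAGCTTCCGCCTTTTTACTAGA"


def is_nucleotide_sequence(sequence):
    """Return True if the sequence is non-empty and uses only A, C, G and T."""
    return bool(sequence) and set(sequence.upper()) <= set("ACGT")


def write_query(path, sequence):
    """Write the query to a file after a basic sanity check."""
    if not is_nucleotide_sequence(sequence):
        raise ValueError("query contains characters other than A, C, G, T")
    Path(path).write_text(sequence + "\n")


def blastcl3_command(program, input_file, output_file):
    """Build the netblast argument list with the options described above."""
    return ["blastcl3", "-p", program, "-i", input_file, "-o", output_file]


# Example use:
#   write_query("partial_scn3a_input.txt", QUERY)
#   command = blastcl3_command("blastn", "partial_scn3a_input.txt",
#                              "partial_scn3a_output.txt")
#   then pass command to subprocess.run(command, check=True)
```

Checking the query before calling netblast catches copy and paste mistakes, such as stray digits or spaces, before a request is sent to the NCBI servers.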
Creating a web interface for BLAST is most easily done by using a CGI program, as discussed in the Perl and CGI section of this tutorial. The CGI input parameters become input parameters to the BLAST program. Instead of being sent to a file, the output may be redirected from standard output to the browser.
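The idea can be sketched as below. Note that this sketch uses Python rather than the Perl used elsewhere in this tutorial, and it is not a finished implementation. The parameter names program and sequence, the default database ecoli.nt and the function names are hypothetical. It also assumes that the legacy blastall program reads the query from standard input and writes the report to standard output when the -i and -o options are omitted.

```python
import subprocess
import sys
from urllib.parse import parse_qs

# Only these programs may be requested from the web form.
ALLOWED_PROGRAMS = {"blastn", "blastp"}


def parse_request(query_string):
    """Extract and validate the CGI parameters from a query string."""
    params = parse_qs(query_string)
    program = params.get("program", ["blastn"])[0]
    sequence = params.get("sequence", [""])[0].strip()
    if program not in ALLOWED_PROGRAMS:
        raise ValueError("unsupported program")
    if not sequence.isalpha():
        raise ValueError("sequence must contain letters only")
    return program, sequence


def blast_command(program, database):
    """Build the blastall argument list without -i and -o (assumption:
    blastall then uses standard input and standard output)."""
    return ["bin/blastall", "-p", program, "-d", database]


def handle_request(environ, out=sys.stdout, database="ecoli.nt"):
    """Write a plain text CGI response containing the BLAST report."""
    out.write("Content-Type: text/plain\r\n\r\n")
    try:
        program, sequence = parse_request(environ.get("QUERY_STRING", ""))
    except ValueError as err:
        out.write(f"Error: {err}\n")
        return
    result = subprocess.run(
        blast_command(program, database),
        input=sequence + "\n",
        capture_output=True,
        text=True,
    )
    out.write(result.stdout)


# In a CGI script the entry point would be:
#   import os
#   handle_request(os.environ)
```

Validating the parameters against a fixed list and passing arguments as a list, rather than building a shell string, keeps user input from being interpreted as shell commands.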
BLAST Protein, or BLASTP, can be used to locate similar amino acid sequences across different proteins. The NCBI Conserved Domain Database (CDD) is a protein annotation database which consists of a collection of well-annotated multiple sequence alignment models for domains that are the same across multiple proteins. CDD has been constructed by using BLASTP and related tools to locate these conserved amino acid sequences.
Let's try some BLASTing with the protein translation of the SCN3A gene discussed above. The text
>gi|9801814|emb|CAC03583.1| SCN3A [Homo sapiens]
TKNVEYTFTGIYTFESLIKILARGFCLEDFTFLRDPWNWLDFSVIVMAYVTEFVSLGNVSALRTFRVLRA
LKTISVIPGLKTIVGALIQSVKKLSDVMILTVFCLSVFALIGLQLFMGNLR
is a FASTA version of a subunit of the protein for SCN3A. Try copying and pasting the text into the search box at the NCBI BLAST web page. As before, click the BLAST! button and then the FORMAT! button. This time the results come back with a predicted sodium channel for chimpanzees at the top.
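The FASTA text above has a simple structure: a header line starting with > followed by one or more lines of sequence. A minimal Python sketch of reading it is shown below. The parse_fasta function is my own illustration, not part of any BLAST distribution, and it handles only this basic layout.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are joined into one string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


SCN3A_FASTA = """>gi|9801814|emb|CAC03583.1| SCN3A [Homo sapiens]
TKNVEYTFTGIYTFESLIKILARGFCLEDFTFLRDPWNWLDFSVIVMAYVTEFVSLGNVSALRTFRVLRA
LKTISVIPGLKTIVGALIQSVKKLSDVMILTVFCLSVFALIGLQLFMGNLR
"""

header, sequence = parse_fasta(SCN3A_FASTA)[0]
print(header.split("|")[1])  # the GI number field of the header
print(len(sequence))         # number of amino acids in the fragment
```

The header fields are separated by the | character, so the database identifiers can be picked out with a simple split.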
To do the equivalent thing with netblast, enter the command line
>blastcl3 -p blastp -i scn3a_subunit_protein.txt -o scn3a_subunit_protein_output.txt
In this case the -p option specifies BLASTP for protein. The input file scn3a_subunit_protein.txt contains the text above and the output file is scn3a_subunit_protein_output.txt. The results are the same as the NCBI web page.
Please send ideas and opinions by email at alexamies@gmail.com.