Approaches to Web Development for Bioinformatics


Working with Data from Public Databases


Data may be downloaded one piece at a time by browsing the web user interfaces provided by the various data repositories, such as NCBI, or downloaded by FTP.

FASTA Format

DNA and RNA data can be downloaded from the web user interfaces in GenBank, including Map Viewer, or by Entrez query.  One of the supported formats is FASTA, which is used by GenBank and widely used throughout the industry.  A FASTA file looks like this:


FASTA

>gi|19923380|ref|NM_006922.2| Homo sapiens sodium channel, voltage-gated, type III, alpha (SCN3A), mRNA
AGCGAAGCGGAGGCATAAGCAGAGAGGATTCTGGAAAGGTCTCTTTGTTTTCTTATCCACAGAGAAAGAA
AGAAAAAAAATTGTAACTAATTTGTAAACCTCTGTGGTCAAAAAAAAAAAAAAAAAAAAAGCTGAACAGC
...

The FASTA format is described in detail in the NCBI Handbook [2].  The first line is the definition line, which begins with '>' and includes a comment; the remaining lines are the nucleotide sequence.  The letters 'gi' stand for GenInfo Identifier, which here is 19923380.  The field NM_006922.2 is the accession number with its version suffix; the 'ref' tag in front of it marks it as a RefSeq accession.  The rest of the line is the comment, which includes the name of the gene, SCN3A.  An example of a FASTA file for the nucleotide sequence discussed above is SCN3A.txt, which the author downloaded from the GenBank database using Map Viewer.
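The fields of the definition line described above can be pulled apart with a single regular expression. The following is a minimal sketch, assuming the line follows the gi|number|database tag|accession| comment layout shown in the example; parse_defline and its field names are hypothetical and are not part of any library. Definition lines in other layouts will simply not match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a definition line of the form
#   >gi|<gi number>|<database tag>|<accession.version>| <comment>
# into its fields. Returns an empty list if the line does not match.
# parse_defline is a hypothetical helper name, used here for illustration.
sub parse_defline {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^>gi\|(\d+)\|(\w+)\|([^|]+)\|\s*(.*)$/) {
        return (gi => $1, db => $2, accession => $3, comment => $4);
    }
    return;
}

# The definition line from the example above
my $defline = '>gi|19923380|ref|NM_006922.2| Homo sapiens sodium channel, '
            . 'voltage-gated, type III, alpha (SCN3A), mRNA';
my %field = parse_defline($defline);
print "GI number: $field{gi}\n";
print "Accession: $field{accession}\n";
print "Comment:   $field{comment}\n";
```

Returning a hash keeps the call site readable; a caller that only needs the accession can ignore the other keys.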

Here is a Perl program that reads a FASTA file and prints the definition line.


Perl

#!/usr/bin/perl -w
# An example Perl program to demonstrate reading a gene file in FASTA format and printing the comment
# The first and only argument on the command line is the name of the file
while (<>) {
    if (/^>/) {
        $_ =~ s/^>//g;
        print($_);
    }
}

This program uses a number of Perl shortcuts.  The diamond operator <> treats the command line arguments as file names and iterates over the lines in those files, placing each line in the implicit $_ variable.  The regular expression /^>/ matches any line beginning with the character '>'.  The substitution s/^>//g strips the '>' from the start of the line.  Running the script on the file above gives this output:


Program output

>read_fasta.pl SCN3A_fasta.txt
gi|19923380|ref|NM_006922.2| Homo sapiens sodium channel, voltage-gated, type III, alpha (SCN3A), mRNA

where read_fasta.pl is the name I gave the script.  Here is a complementary script that skips the definition line and converts the remainder of the file to lower case:


Perl

#!/usr/bin/perl -w
# An example Perl program to demonstrate reading a gene file in FASTA format,
# ignoring the comment, and converting the rest of the file to lower case.
# The first and only argument on the command line is the name of the file.
while (<>) {
    if (not /^>/) {
        # Print the sequence line in lower case
        print(lc($_));
    }
}

The function lc() converts a string to lower case and uc() (not used here) converts to upper case.
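The shortcuts used in the two scripts above (the diamond operator and the implicit $_ variable) can also be written out explicitly, which may be easier to follow when coming from another language. The following is a rough sketch, assuming a file that holds a single FASTA record; read_fasta is a hypothetical helper name, and joining the sequence lines into one string is a choice made for this example rather than something the scripts above do.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a single-record FASTA file without the <> and $_ shortcuts.
# Returns the definition line (without the leading '>') and the
# sequence joined into one lower-case string.
# read_fasta is a hypothetical helper name, used here for illustration.
sub read_fasta {
    my ($filename) = @_;
    open(my $fh, '<', $filename) or die "Cannot open $filename: $!";
    my ($defline, $sequence) = ('', '');
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>/) {
            # Copy the line, then strip the '>' from the copy
            ($defline = $line) =~ s/^>//;
        } else {
            $sequence .= lc($line);
        }
    }
    close($fh);
    return ($defline, $sequence);
}

# Usage: the file name is the first command line argument
if (@ARGV) {
    my ($defline, $sequence) = read_fasta($ARGV[0]);
    print "$defline\n";
    print length($sequence), " characters of sequence\n";
}
```

The explicit form is longer, but it names the file handle and the line variable and reports a file that cannot be opened.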



Please send ideas and opinions by email at alexamies@gmail.com.

© 2006-2007 Alex Amies