Data may be downloaded a piece at a time by browsing the web user interfaces of the various data repositories, such as NCBI, or retrieved by FTP.
DNA and RNA data can be downloaded from the GenBank web user interfaces, including Map Viewer and Entrez queries. One of the supported formats is FASTA, which is used by GenBank and widely used throughout the industry. A FASTA file begins with a definition line, marked by a leading '>' character, followed by lines of sequence data.
The FASTA format is described in detail in the NCBI Handbook [2]. The first line is the definition line, which includes a comment, and the rest is a nucleotide sequence. In the definition line of the example used here, the letters 'gi' stand for GenBank identifier, whose value is 19923380, and the field NM_006922.2 is the GenBank accession number. The rest of the line is the comment, which includes the name of the gene, SCN3A. An example FASTA file for this nucleotide sequence is SCN3A.txt, which the author downloaded from the GenBank database using Map Viewer.
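The definition line fields described above can be pulled apart with a single regular expression. The following is a minimal sketch, not code from the original article. It assumes the fields are separated by '|' characters in the order gi, identifier, database tag, accession. The sub name parse_defline and the database tag 'ref' are assumptions, and the comment text in the sample line is a placeholder rather than the real GenBank record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a GenBank-style FASTA definition line into its fields.
# Assumed layout: >gi|<number>|<db tag>|<accession>| <comment>
# Returns (gi, accession, comment), or nothing if the line does not match.
sub parse_defline {
    my ($line) = @_;
    if ($line =~ /^>gi\|(\d+)\|\w+\|([^|]+)\|\s*(.*)$/) {
        return ($1, $2, $3);
    }
    return;
}

# Illustrative definition line: the identifiers come from the text,
# but the comment wording is a placeholder, not the real record.
my $defline = '>gi|19923380|ref|NM_006922.2| placeholder comment (SCN3A)';
my ($gi, $accession, $comment) = parse_defline($defline);
print "gi:        $gi\n";
print "accession: $accession\n";
print "comment:   $comment\n";
```

Definition lines from other sources may use a different field layout, so the pattern would need adjusting for them.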
Here is a Perl program that reads a FASTA file and prints the definition line.
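The original listing is not available here, so the following is a minimal sketch reconstructed from the description below. It uses the same shortcuts the text names (the diamond operator, the implicit $_ variable, /^>/ and s/^>//g). The sub name print_deflines and the usage message are assumptions added for illustration.

```perl
#!/usr/bin/perl
# read_fasta.pl (sketch): print the definition lines of FASTA files.
use strict;
use warnings;

# Print every definition line found in the files named in @ARGV,
# with the leading '>' removed.
sub print_deflines {
    while (<>) {            # each line of each file is placed in $_
        next unless /^>/;   # keep only lines that begin with '>'
        s/^>//g;            # strip the leading '>'
        print;              # print $_, which still ends with a newline
    }
}

# Run only when file names are given on the command line.
if (@ARGV) {
    print_deflines();
} else {
    warn "usage: perl read_fasta.pl FILE...\n";
}
```

It would be run as `perl read_fasta.pl SCN3A.txt`.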
This program uses a number of Perl shortcuts. The diamond operator <> treats the command line arguments as file names and iterates over the lines in those files, placing each line in the implicit $_ variable. The regular expression /^>/ matches any line beginning with the character '>'. The substitution s/^>//g strips the '>' from the start of the line; the g modifier is harmless here but not required, because ^ can match only at the start of the line. Running the script, which I have named read_fasta.pl, on the file above prints the definition line without the leading '>'.
A complementary script skips the comment (the definition line) and converts the remainder of the file to lower case. The function lc() converts a string to lower case; uc() (not used here) converts to upper case.
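The complementary script can be sketched as follows. This is a minimal reconstruction, not the author's listing. It assumes that "the comment" means every line beginning with '>', and the sub name print_lowercase_sequence and the script name in the usage message are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Print the sequence lines of the files named in @ARGV in lower case,
# skipping definition lines (those that begin with '>').
sub print_lowercase_sequence {
    while (<>) {
        next if /^>/;     # skip the definition line
        print lc($_);     # lc() returns a lower-case copy of the line
    }
}

# Run only when file names are given on the command line.
if (@ARGV) {
    print_lowercase_sequence();
} else {
    warn "usage: perl lower_fasta.pl FILE...\n";
}
```

Replacing lc($_) with uc($_) would give the upper-case variant.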
Please send ideas and opinions by email at alexamies@gmail.com.