Approaches to Web Development for Bioinformatics
NCBI Databases
Data from the Entrez Gene database can be downloaded in bulk using
FTP. It includes sequences from the international sequence
collaboration, Swiss-Prot, and RefSeq.
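As a rough sketch of the bulk download, the script below fetches one file by anonymous FTP with the Net::FTP module that ships with Perl. The host name, the directory /gene/DATA, the file name gene_info.gz, the placeholder email address, and the helper name fetch_file are all assumptions made for this example; check the NCBI FTP site for the current layout before relying on them.

```perl
use strict;
use warnings;
use Net::FTP;

# Assumed location of the bulk file; verify against the NCBI FTP site.
my $host        = "ftp.ncbi.nlm.nih.gov";
my $remote_dir  = "/gene/DATA";
my $remote_file = "gene_info.gz";

# Download one file by anonymous FTP and return the local file name.
sub fetch_file {
    my ($server, $dir, $file) = @_;
    my $ftp = Net::FTP->new($server, Passive => 1)
        or die "Cannot connect to $server: $@\n";
    $ftp->login("anonymous", 'anonymous@example.org')
        or die "Login failed: ", $ftp->message;
    $ftp->cwd($dir)
        or die "Cannot change to $dir: ", $ftp->message;
    $ftp->binary;    # the file is compressed, so avoid ASCII mode
    my $local = $ftp->get($file)
        or die "Download failed: ", $ftp->message;
    $ftp->quit;
    return $local;
}

# Only contact the server when explicitly asked to with --fetch.
if (@ARGV && $ARGV[0] eq "--fetch") {
    print "Saved ", fetch_file($host, $remote_dir, $remote_file), "\n";
}
```

Run with the --fetch argument, the script saves the compressed file in the current directory; it must be uncompressed before the scripts on this page can read it.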
Gene Information
Each line of the file gene_info contains the following tab-delimited fields:
- NCBI Taxonomy unique identifier
- the unique identifier for the gene
- the default symbol for the gene
- the LocusTag value
- a bar-delimited set of unofficial symbols for the gene
- a bar-delimited set of identifiers in other databases
- the chromosome on which this gene is placed
- the map location for this gene
- a descriptive name for this gene
- the type assigned to the gene
- other nomenclature information
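To illustrate how these fields line up with the columns of a record, the sketch below splits one line into a hash. The key names (tax_id, gene_id, and so on) and the helper parse_gene_line are made up for this example, only the first ten columns are named, and the remaining nomenclature columns are ignored. The sample record is assembled from the SCN3A entry used elsewhere on this page.

```perl
use strict;
use warnings;

# Column names in file order; these labels are illustrative, not official.
my @names = qw(tax_id gene_id symbol locus_tag synonyms db_xrefs
               chromosome map_location description gene_type);

# Turn one tab-delimited line into a hash reference keyed by column name.
sub parse_gene_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line;
    my %gene;
    @gene{@names} = @values[0 .. $#names];
    # Bar-delimited columns become array references; "-" means no value.
    for my $key (qw(synonyms db_xrefs)) {
        my $raw = defined $gene{$key} ? $gene{$key} : "-";
        $gene{$key} = $raw eq "-" ? [] : [ split /\|/, $raw ];
    }
    return \%gene;
}

my $example = join "\t", 9606, 6328, "SCN3A", "-", "NAC3",
    "HGNC:10590|MIM:182391|HPRD:01671", 2, "2q24",
    "sodium channel, voltage-gated, type III, alpha", "protein-coding";
my $gene = parse_gene_line($example);
print "$gene->{symbol} (Gene ID $gene->{gene_id}) has ",
    scalar @{ $gene->{db_xrefs} }, " cross-references.\n";
```

Keeping the bar-delimited columns as arrays makes it straightforward to look up a single synonym or database identifier later.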
The file is large: the version from 5/11/2006 was 171 MB
uncompressed. It contains gene data for all organisms for which
genetic data have been submitted to the NCBI. A small section of the
file is here.
To extract the human genetic data from this file, the following Perl script may be used.
Perl
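use strict;
use warnings;

# Copy every line whose first tab-delimited field is exactly 9606,
# the NCBI Taxonomy ID for Homo sapiens.
my $i = 0;
while (<>) {
    if (/^9606\t/) {
        print;
        $i++;
    }
}
# Report the count on STDERR so it stays out of the redirected output.
print STDERR "Lines written $i.\n";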
The program works by copying any line starting with the taxonomic id
for Homo sapiens (9606). The output should be redirected to a
file, as in
Program output
>extract_human.pl gene_info > human_gene_info.txt
For the file downloaded on 5/11/2006 there were 39,256 human entries and
the output file size was over 5 MB, easily small enough to open in a
text editor such as emacs.
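The same approach extends to other organisms. The sketch below takes the taxonomy ID as its first argument and compares it with the first tab-delimited field exactly, so that an ID such as 9606 cannot match a longer ID that merely begins with the same digits. The script name extract_taxon.pl and the helper has_tax_id are hypothetical names chosen for this example.

```perl
use strict;
use warnings;

# Return true when the first tab-delimited field of $line equals $tax_id.
sub has_tax_id {
    my ($line, $tax_id) = @_;
    my ($first) = split /\t/, $line, 2;
    return defined $first && $first eq $tax_id;
}

# Usage: extract_taxon.pl 10090 gene_info > mouse_gene_info.txt
if (@ARGV) {
    my $tax_id = shift @ARGV;
    my $count  = 0;
    while (my $line = <>) {
        next unless has_tax_id($line, $tax_id);
        print $line;
        $count++;
    }
    # Keep the count out of the redirected output file.
    print STDERR "Lines written $count.\n";
}
```

The usage line extracts the records for the mouse, Mus musculus, whose NCBI Taxonomy ID is 10090.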
The genes in the file are named by registration with the Human Genome
Organization (HUGO) Gene Nomenclature Committee [26], which has a
similar download file.
The Perl program below takes a file name and a regular expression from
the command line and prints the individual fields of each line in the
gene_info file that matches the expression.
Perl
use strict;
use warnings;

# Usage: query_gene_info.pl <file> <regular expression>
die "Usage: $0 <file> <regular expression>\n" unless @ARGV == 2;
my ($file, $exp) = @ARGV;
print "Searching for $exp in file $file.\n";
open(my $gene_file, "<", $file) or die "Cannot open $file: $!\n";
while (<$gene_file>) {
    if (/$exp/) {
        print;    # echo the whole matching line
        my @fields = split /\t/;
        print "Fields: \n",
            "\tTaxonomic ID: $fields[0]\n",
            "\tGene ID: $fields[1]\n",
            "\tGene Symbol: $fields[2]\n",
            "\tLocus Tag: $fields[3]\n",
            "\tSynonyms: $fields[4]\n",
            "\tdbXrefs: $fields[5]\n",
            "\tchromosome: $fields[6]\n",
            "\tmapLocation: $fields[7]\n",
            "\tdescription: $fields[8]\n";
    }
}
close($gene_file);
When run on the file extracted from gene_info with the target
expression SCN3A, the result is
Program output
>query_gene_info.pl human_gene_info.txt SCN3A
Searching for SCN3A in file human_gene_info.txt.
9606 6328 SCN3A - NAC3 HGNC:10590|MIM:182391|HPRD:01671 2 2q24 sodium
channel, voltage-gated, type III, alpha protein-coding
SCN3A sodium channel, voltage-gated, type III, alpha O
Fields:
    Taxonomic ID: 9606
    Gene ID: 6328
    Gene Symbol: SCN3A
    Locus Tag: -
    Synonyms: NAC3
    dbXrefs: HGNC:10590|MIM:182391|HPRD:01671
    chromosome: 2
    mapLocation: 2q24
    description: sodium channel, voltage-gated, type
III, alpha protein-coding SCN3A sodium channel, voltage-gated, type
III, alpha
where query_gene_info.pl is the name of the script.
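Because the regular expression is tested against the whole line, a search for SCN3A also finds any record that mentions the text in another column, such as the synonyms or the description. A possible variation, sketched below, compares the search term with the gene symbol column only. The helper name lines_with_symbol is hypothetical, the sample records are shortened to five columns, and the second record is made up purely to show a line that a whole-line match would also return.

```perl
use strict;
use warnings;

# Return the lines whose symbol column (the third field) equals $symbol.
sub lines_with_symbol {
    my ($symbol, @lines) = @_;
    return grep {
        my @f = split /\t/;
        defined $f[2] && $f[2] eq $symbol;
    } @lines;
}

my @sample = (
    "9606\t6328\tSCN3A\t-\tNAC3\n",
    "9606\t0\tEXAMPLE1\t-\tSCN3A-like\n",    # made-up record for illustration
);

# Prints only the first sample line.
print lines_with_symbol("SCN3A", @sample);
```

When a whole-line search is still wanted but the term should be treated as literal text rather than as a pattern, it can be written as /\Q$exp\E/, which quotes any regular expression metacharacters in $exp.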
The web page for searching the gene information database on this
site - Search Human Gene
Information Database - was constructed based on these
principles.
Please send ideas and opinions by email at alexamies@gmail.com.
© 2006-2007 Alex Amies