Approaches to Web Development for Bioinformatics


NCBI Databases

Data from the Entrez Gene database can be downloaded in bulk using FTP. It includes sequences from the international sequence collaboration, Swiss-Prot, and RefSeq.
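The bulk download can be scripted as well as done by hand. The following is a minimal sketch using the core Net::FTP module. The host, directory and file name are assumptions based on the usual layout of the NCBI FTP site and should be checked before use; the sub name fetch_gene_info and the anonymous login address are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Net::FTP;

# Assumed location of the compressed gene_info file; verify on the NCBI site.
my $HOST = 'ftp.ncbi.nlm.nih.gov';
my $DIR  = '/gene/DATA';
my $FILE = 'gene_info.gz';

# Download the compressed file to $local_path and return that path.
sub fetch_gene_info {
    my ($local_path) = @_;
    my $ftp = Net::FTP->new($HOST, Passive => 1)
        or die "Cannot connect to $HOST: $@\n";
    $ftp->login('anonymous', 'anonymous@example.org')
        or die "Login failed: ", $ftp->message;
    $ftp->cwd($DIR)
        or die "Cannot change to $DIR: ", $ftp->message;
    $ftp->binary;    # the file is gzip-compressed, so avoid ASCII mode
    $ftp->get($FILE, $local_path)
        or die "Download failed: ", $ftp->message;
    $ftp->quit;
    return $local_path;
}

# Download only when a target path is given on the command line.
if (@ARGV) {
    my $saved = fetch_gene_info($ARGV[0]);
    print "Saved $FILE to $saved\n";
}
```

The downloaded file must be uncompressed (for example with gunzip) before it is passed to the scripts in the rest of this section.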

Gene Information

The file gene_info contains gene data for all organisms for which genetic data has been submitted to the NCBI. The file is large: the copy from 5/11/2006 was 171 MB uncompressed. To extract the human gene data from this file, the following Perl script may be used.


Perl

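#!/usr/bin/perl
# A Perl script to extract human gene data from the gene_info file
# param 1 file to extract data from
use strict;
use warnings;

my $i = 0;
while (my $line = <>) {
    # 9606 is the taxonomic id for Homo sapiens; the trailing tab stops
    # longer ids that merely begin with 9606 from matching
    if ($line =~ /^9606\t/) {
        print $line;
        $i++;
    }
}
# Report the count on STDERR so it is not written into the redirected output
print STDERR "Lines written $i.\n";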

The program works by copying any line starting with the taxonomic id for Homo sapiens (9606).  The output should be redirected to a file, as in


Program output

>extract_human.pl gene_info > output.txt

For the file downloaded on 5/11/2006 there were 39,256 human entries, and the output file was over 5 MB, easily small enough to open in a text editor such as emacs.

The genes in the file are named by registration with the Human Genome Organization (HUGO) Gene Nomenclature Committee [26], which has a similar download file.

The Perl program below takes a file name and a regular expression from the command line. For each line of the file that matches the expression, it prints the line and then the individual fields parsed from it.


Perl

#!/usr/bin/perl
# A Perl script to query human gene_info data file
# param 0 is the name of the file to search
# param 1 is the expression to search for
use strict;
use warnings;

die "Usage: $0 file expression\n" unless @ARGV == 2;
my ($file, $exp) = @ARGV;
print "Searching for $exp in file $file.\n";

# open file for reading and iterate over lines
open my $gene_file, "<", $file or die "Cannot open $file: $!\n";
while (my $line = <$gene_file>) {
    if ($line =~ /$exp/) {
        print $line;

        # Parse individual fields (the file is tab-delimited)
        chomp $line;
        my @fields = split /\t/, $line;
        print
            "Fields: \n",
            "\tTaxonomic ID: $fields[0]\n",
            "\tGene ID: $fields[1]\n",
            "\tGene Symbol: $fields[2]\n",
            "\tLocus Tag: $fields[3]\n",
            "\tSynonyms: $fields[4]\n",
            "\tdbXrefs: $fields[5]\n",
            "\tchromosome: $fields[6]\n",
            "\tmapLocation: $fields[7]\n",
            "\tdescription: $fields[8]\n";
    }
}
close $gene_file;

When run on the file extracted from gene_info with the target expression SCN3A the result is


Program output

>query_gene_info.pl human_gene_info.txt SCN3A
Searching for SCN3A in file human_gene_info.txt.
9606 6328 SCN3A - NAC3 HGNC:10590|MIM:182391|HPRD:01671 2 2q24 sodium channel, voltage-gated, type III, alpha protein-coding SCN3A sodium channel, voltage-gated, type III, alpha O
Fields:
Taxonomic ID: 9606
Gene ID: 6328
Gene Symbol: SCN3A
Locus Tag: -
Synonyms: NAC3
dbXrefs: HGNC:10590|MIM:182391|HPRD:01671
chromosome: 2
mapLocation: 2q24
description: sodium channel, voltage-gated, type III, alpha protein-coding SCN3A sodium channel, voltage-gated, type III, alpha

where query_gene_info.pl is the name of the script.
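The positional $fields[N] lookups in the script can be made easier to read by mapping each column to a name. The following is a minimal sketch of that idea, not part of the original script. The column names and the sub name parse_gene_line are invented for illustration and follow the labels the script prints. The sample line is rebuilt from the first nine fields of the SCN3A record shown above, and the sketch assumes that "-" marks an empty field, as in the Locus Tag column of that record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative names for the first nine gene_info columns, in file order.
my @COLUMNS = qw(tax_id gene_id symbol locus_tag synonyms
                 db_xrefs chromosome map_location description);

# Turn one tab-delimited gene_info line into a hash reference.
sub parse_gene_line {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line;
    my %gene;
    @gene{@COLUMNS} = @fields[0 .. $#COLUMNS];    # hash slice: name => value

    # dbXrefs look like "HGNC:10590|MIM:182391"; split them into db => id.
    my %xrefs;
    if (defined $gene{db_xrefs} && $gene{db_xrefs} ne '-') {
        for my $xref (split /\|/, $gene{db_xrefs}) {
            my ($db, $id) = split /:/, $xref, 2;
            $xrefs{$db} = $id;
        }
    }
    $gene{db_xrefs} = \%xrefs;
    return \%gene;
}

# Sample line rebuilt from the SCN3A record shown above.
my $sample = join "\t", '9606', '6328', 'SCN3A', '-', 'NAC3',
    'HGNC:10590|MIM:182391|HPRD:01671', '2', '2q24',
    'sodium channel, voltage-gated, type III, alpha';
my $gene = parse_gene_line($sample);
print "$gene->{symbol} (Gene ID $gene->{gene_id}) maps to $gene->{map_location}\n";
print "HGNC id: $gene->{db_xrefs}{HGNC}\n";
```

Keeping the cross-references in their own hash means a caller can look up a single database identifier, such as the HGNC or MIM number, without parsing the dbXrefs string again.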

The web page for searching the gene information database on this site - Search Human Gene Information Database - was constructed based on these principles.
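The code behind that page is not shown here. As a rough sketch of how the same line-by-line search might feed a web page, the example below turns matching records into HTML table rows. Everything in it is an assumption made for illustration: the sub names escape_html and search_rows, the choice of columns, and the decision to compare the gene symbol exactly instead of applying a regular expression supplied by the user, which is a common precaution for input that arrives from a web form.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are special in HTML text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Return one HTML table row per line whose gene symbol equals $symbol.
# $fh is any readable filehandle over tab-delimited gene_info lines.
sub search_rows {
    my ($fh, $symbol) = @_;
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /\t/, $line;
        next unless defined $fields[2] && $fields[2] eq $symbol;
        # Gene ID, symbol, map location and description
        my @cells = map { '<td>' . escape_html($_ // '') . '</td>' }
            @fields[1, 2, 7, 8];
        push @rows, '<tr>' . join('', @cells) . '</tr>';
    }
    return @rows;
}

# Demonstration on an in-memory file holding the SCN3A record shown above.
my $data = join("\t", '9606', '6328', 'SCN3A', '-', 'NAC3',
    'HGNC:10590|MIM:182391|HPRD:01671', '2', '2q24',
    'sodium channel, voltage-gated, type III, alpha') . "\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print "$_\n" for search_rows($fh, 'SCN3A');
close $fh;
```

In a real page the filehandle would be opened on the extracted human gene file, and the rows would be wrapped in a table element by whatever generates the rest of the HTML.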



Please send ideas and opinions by email at alexamies@gmail.com.

© 2006-2007 Alex Amies