Approaches to Web Development for Bioinformatics


Reference Sequence

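The NCBI file gene2refseq contains cross-references from each gene ID to the corresponding RNA, protein, and genomic sequence records at NCBI. Each line is tab-delimited. The fields, in order, are the taxonomic ID, gene ID, status, RNA nucleotide accession.version, RNA nucleotide gi, protein accession.version, protein gi, genomic nucleotide accession.version, genomic nucleotide gi, start position on the genomic accession, end position on the genomic accession, orientation, and name of the assembly.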

The human records can be extracted from gene2refseq using the same script as above for gene_info. The record for a given gene ID can then be found and parsed with the following Perl script, which is similar to the script above:


Perl

#!/usr/bin/perl
# A Perl script to query the human gene2refseq data file for a specific gene ID.
# param 0 is the name of the file to search
# param 1 is the gene ID to search for
use strict;
use warnings;

die "Usage: $0 <file> <gene ID>\n" unless @ARGV == 2;
my ($file, $gene_id) = @ARGV;
print "Searching for $gene_id in file $file.\n";

# Match the human taxonomic ID (9606) and the gene ID as the first two
# whole fields, so that gene ID 3064 does not also match 30641.
my $exp = qr/^9606\t\Q$gene_id\E\t/;

# open file for reading and iterate over lines
open my $fh, "<", $file or die "Cannot open $file: $!\n";
while (my $line = <$fh>) {
    next unless $line =~ $exp;

    print $line;

    # Parse individual fields
    chomp $line;
    my @fields = split /\t/, $line;
    print
        "Fields: \n",
        "\tTaxonomic ID: $fields[0]\n",
        "\tGene ID: $fields[1]\n",
        "\tStatus: $fields[2]\n",
        "\tRNA nucleotide accession.version: $fields[3]\n",
        "\tRNA nucleotide gi: $fields[4]\n",
        "\tProtein accession.version: $fields[5]\n",
        "\tThe gi for a protein accession: $fields[6]\n",
        "\tGenomic nucleotide accession.version: $fields[7]\n",
        "\tGenomic nucleotide gi: $fields[8]\n",
        "\tStart position on the genomic accession: $fields[9]\n",
        "\tEnd position on the genomic accession: $fields[10]\n",
        "\tOrientation: $fields[11]\n",
        "\tName of the assembly: $fields[12]\n";
}
close $fh;

Some sample output from this script is:


Program output

>query_ref2seq.pl human_gene2refseq.txt 3064
Searching for 3064 in file human_gene2refseq.txt.
9606    3064    REVIEWED        NM_002111.6     90903230        NP_002102.4     90903231        AC_000047.1     89161206        2987657 3157210 +       Alternate assembly (based on Celera assembly)
Fields:
    Taxonomic ID: 9606
    Gene ID: 3064
    Status: REVIEWED
    RNA nucleotide accession.version: NM_002111.6
    RNA nucleotide gi: 90903230
    Protein accession.version: NP_002102.4
    The gi for a protein accession: 90903231
    Genomic nucleotide accession.version: AC_000047.1
    Genomic nucleotide gi: 89161206
    Start position on the genomic accession: 2987657
    End position on the genomic accession: 3157210
    Orientation: +
    Name of the assembly: Alternate assembly (based on Celera assembly)
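
The script above addresses each column by position ($fields[0] through $fields[12]), which can become hard to follow in longer programs. As a minimal sketch, the same thirteen columns could instead be loaded into a hash keyed by name. The sub name parse_refseq_line and the hash keys are hypothetical names chosen for this illustration, and the column order is assumed to be the one printed by the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Column names in file order, following the labels printed by the script above.
# These key names are hypothetical; they are not defined by NCBI.
my @FIELD_NAMES = qw(
    tax_id gene_id status
    rna_accession rna_gi
    protein_accession protein_gi
    genomic_accession genomic_gi
    start end orientation assembly
);

# Split one tab-delimited line into a hash reference keyed by field name.
# Columns missing from a short line are left undefined.
sub parse_refseq_line {
    my ($line) = @_;
    chomp $line;
    my @values = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    my %record;
    @record{@FIELD_NAMES} = @values[0 .. $#FIELD_NAMES];
    return \%record;
}

# The sample record from the program output above, rebuilt with tabs.
my $sample = join "\t",
    9606, 3064, 'REVIEWED', 'NM_002111.6', 90903230,
    'NP_002102.4', 90903231, 'AC_000047.1', 89161206,
    2987657, 3157210, '+',
    'Alternate assembly (based on Celera assembly)';

my $rec = parse_refseq_line("$sample\n");
print "$rec->{gene_id}: $rec->{rna_accession} -> $rec->{protein_accession}\n";
```

Run as is, the sketch prints the gene ID followed by the RNA and protein accessions of the sample record. Named keys make later code, such as building a link from $rec->{protein_accession}, easier to read than numeric indexes.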

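
The step that extracts the human records before querying is described in this section but not shown. A rough sketch of that step follows, assuming the full file is tab-delimited with the taxonomic ID in the first column, as in the sample output. The sub names is_human_record and extract_human are hypothetical, and the demonstration rows are made-up placeholders rather than real NCBI records.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Taxonomic ID for human, as used in the query script above.
my $HUMAN_TAX_ID = 9606;

# Return 1 if the first tab-delimited column is the human taxonomic ID, else 0.
sub is_human_record {
    my ($line) = @_;
    return $line =~ /^\Q$HUMAN_TAX_ID\E\t/ ? 1 : 0;
}

# Copy human records from one filehandle to another and return how many
# lines were copied.
sub extract_human {
    my ($in, $out) = @_;
    my $count = 0;
    while (my $line = <$in>) {
        next unless is_human_record($line);
        print {$out} $line;
        $count++;
    }
    return $count;
}

# Demonstration on in-memory data (made-up rows, not real records).
my $input  = "9606\t1\tA\n10090\t2\tB\n9606\t3\tC\n";
my $output = '';
open my $in,  '<', \$input  or die "Cannot read input: $!";
open my $out, '>', \$output or die "Cannot write output: $!";
my $kept = extract_human($in, $out);
close $out;
close $in;
print "Kept $kept of 3 records\n";
```

For real use, the two in-memory filehandles would be replaced by handles opened on the full gene2refseq file and on an output file such as human_gene2refseq.txt. Anchoring the match at the start of the line keeps other taxonomic IDs that merely contain 9606 from being selected.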


Please send ideas and opinions by email at alexamies@gmail.com.

© 2006-2007 Alex Amies