Approaches to Web Development for Bioinformatics
Reference Sequence
The NCBI file gene2refseq contains references from the gene ID to other
databases at NCBI. The fields are:
- NCBI Taxonomy unique identifier
- The unique identifier for the gene
- The status of the reference sequence
- RNA nucleotide accession and version (separated by a period)
- RNA nucleotide gi
- Protein accession and version (separated by a period)
- The gi for a protein accession
- Genomic nucleotide accession and version (separated by a period)
- Genomic nucleotide gi
- Start position on the genomic accession
- End position on the genomic accession
- Orientation of the gene feature on the genomic accession
- Name of the assembly
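As a minimal sketch of how the thirteen fields above map onto a tab-separated record, the following Perl fragment reads one line into a hash keyed by field name. The key names (tax_id, gene_id, and so on) and the helper parse_gene2refseq_line are illustrative choices made here, not names defined by NCBI or elsewhere in this tutorial. The sketch assumes the columns appear in exactly the order listed.

```perl
use strict;
use warnings;

# Illustrative key names, in the column order listed above (an assumption,
# not something read from the file itself).
my @FIELD_NAMES = qw(
    tax_id gene_id status
    rna_accession rna_gi
    protein_accession protein_gi
    genomic_accession genomic_gi
    start end orientation assembly
);

# Hypothetical helper: turn one tab-separated line into a hash reference.
sub parse_gene2refseq_line {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty fields instead of dropping them.
    my @values = split /\t/, $line, -1;
    my %record;
    @record{@FIELD_NAMES} = @values[0 .. $#FIELD_NAMES];
    return \%record;
}

# Usage with a record for human gene 3064.
my $sample = join "\t", qw(9606 3064 REVIEWED NM_002111.6 90903230
    NP_002102.4 90903231 AC_000047.1 89161206 2987657 3157210 +),
    "Alternate assembly (based on Celera assembly)";
my $record = parse_gene2refseq_line("$sample\n");
print "$record->{gene_id}: $record->{rna_accession} on $record->{genomic_accession}\n";
```

Naming the fields once keeps later code readable: $record->{protein_gi} is easier to check against the field list than a bare column index.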
The human gene information can be extracted using the same script as
above for gene_info. This information can be parsed from the
gene2refseq file with the following Perl script, which is similar to the
script above:
Perl
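use strict;
use warnings;

# Usage: query_ref2seq.pl <gene2refseq file> <gene ID>
die "Usage: $0 <gene2refseq file> <gene ID>\n" unless @ARGV == 2;
my ($file, $gene_id) = @ARGV;
print "Searching for $gene_id in file $file.\n";

# Match human records (taxonomy ID 9606) whose gene ID column is exactly
# $gene_id. Anchoring the pattern stops 3064 from also matching 30641.
my $exp = qr/^9606\t\Q$gene_id\E\t/;

open my $gene2refseq, "<", $file or die "Cannot open $file: $!\n";
while (<$gene2refseq>) {
    if (/$exp/) {
        print;    # the raw record
        chomp;    # keep the newline out of the last field
        my @fields = split /\t/;
        print
            "Fields: \n",
            "\tTaxonomic ID: $fields[0]\n",
            "\tGene ID: $fields[1]\n",
            "\tStatus: $fields[2]\n",
            "\tRNA nucleotide accession.version: $fields[3]\n",
            "\tRNA nucleotide gi: $fields[4]\n",
            "\tProtein accession.version: $fields[5]\n",
            "\tThe gi for a protein accession: $fields[6]\n",
            "\tGenomic nucleotide accession.version: $fields[7]\n",
            "\tGenomic nucleotide gi: $fields[8]\n",
            "\tStart position on the genomic accession: $fields[9]\n",
            "\tEnd position on the genomic accession: $fields[10]\n",
            "\tOrientation: $fields[11]\n",
            "\tName of the assembly: $fields[12]\n";
    }
}
close $gene2refseq;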
Some sample output from this script is:
Program output
>query_ref2seq.pl human_gene2refseq.txt 3064
Searching for 3064 in file human_gene2refseq.txt.
9606 3064 REVIEWED NM_002111.6 90903230 NP_002102.4 90903231 AC_000047.1 89161206 2987657 3157210 + Alternate assembly (based on Celera assembly)
Fields:
Taxonomic ID: 9606
Gene ID: 3064
Status: REVIEWED
RNA nucleotide accession.version: NM_002111.6
RNA nucleotide gi: 90903230
Protein accession.version: NP_002102.4
The gi for a protein accession: 90903231
Genomic nucleotide accession.version: AC_000047.1
Genomic nucleotide gi: 89161206
Start position on the genomic accession: 2987657
End position on the genomic accession: 3157210
Orientation: +
Name of the assembly: Alternate assembly (based on Celera assembly)
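The accession fields in the output above combine an accession and a version, separated by a period (for example NM_002111.6). The following is a small sketch of separating the two parts. The helper name split_accession is hypothetical, and the code assumes the version is the run of digits after the last period.

```perl
use strict;
use warnings;

# Hypothetical helper: split "NM_002111.6" into ("NM_002111", "6").
# If the value has no ".digits" suffix, it is returned unchanged with an
# undefined version.
sub split_accession {
    my ($accession_version) = @_;
    if ($accession_version =~ /^(.+)\.(\d+)$/) {
        return ($1, $2);
    }
    return ($accession_version, undef);
}

# The three accession.version values from the sample record.
for my $value (qw(NM_002111.6 NP_002102.4 AC_000047.1)) {
    my ($accession, $version) = split_accession($value);
    print "$value: accession $accession, version $version\n";
}
```

Keeping the version separate makes it possible to compare records by accession alone when a sequence has been revised.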
Please send ideas and opinions by email at alexamies@gmail.com.
© 2006-2007 Alex Amies