Approaches to Web Development for Bioinformatics


Human Genome

The human genome is distributed in a number of different files and formats. The messenger RNA (mRNA) records are given in a file called rna.gbk.gz, a gzip-compressed file in GenBank format. Each record describes one locus and includes reference identifiers, PubMed references, the mRNA, a comment, and the gene product (usually a protein). Here are a few sample lines:


GenBank mRNA file fragment

LOCUS       NM_004239               6452 bp    mRNA    linear   PRI 16-OCT-2005
DEFINITION  Homo sapiens thyroid hormone receptor interactor 11 (TRIP11), mRNA.
...
COMMENT     PROVISIONAL REFSEQ: This record has not yet been subject to final
            NCBI review. The reference sequence was derived from Y12490.1.
           
            Summary: TRIP11 was first identified through its ability to
            interact functionally with thyroid hormone receptor-beta (THRB; MIM
            190160). It has also been found in association with the Golgi
            apparatus and microtubules.[supplied by OMIM].
FEATURES             Location/Qualifiers
...
                    /db_xref="GeneID:9321"
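
Because each kind of line in the fragment above starts with a fixed keyword or qualifier, single lines can be picked apart with short regular expressions. The following is a minimal sketch, not the site's own code: the helper names parse_locus and parse_gene_id are hypothetical, and the patterns assume the layout shown in the fragment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the accession and sequence length from a LOCUS line.
# Returns an empty list if the line is not a LOCUS line.
# (parse_locus is a hypothetical helper name.)
sub parse_locus {
    my ($line) = @_;
    if ($line =~ /^LOCUS\s+(\S+)\s+(\d+)\s+bp/) {
        return ($1, $2);
    }
    return;
}

# Extract the numeric gene ID from a /db_xref qualifier line.
# Returns undef if the line has no GeneID cross-reference.
# (parse_gene_id is a hypothetical helper name.)
sub parse_gene_id {
    my ($line) = @_;
    return $line =~ /\/db_xref="GeneID:(\d+)"/ ? $1 : undef;
}

# Two lines copied from the fragment above
my $locus_line = 'LOCUS       NM_004239               6452 bp    mRNA    linear   PRI 16-OCT-2005';
my $xref_line  = '                    /db_xref="GeneID:9321"';

my ($accession, $length) = parse_locus($locus_line);
my $gene_id = parse_gene_id($xref_line);
print "$gene_id\t$accession\t$length\n";
```

Run on the two sample lines, this prints the gene ID 9321, the accession NM_004239, and the length 6452, separated by tabs.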

A program that scans the uncompressed file, rna.gbk, and writes the gene ID, locus accession reference, and comment for each record to another file, one tab-separated line per record, is given below.


Perl

#!/usr/bin/perl
# A Perl script to extract comments on gene loci from the file rna.gbk
use strict;
use warnings;

# The number of loci read
my $i = 0;

# The locus accession reference
my $accession;

# The text for the comment
my $comment;

# The RNA file must be in GenBank form and be called rna.gbk
open my $rna_gbk, "<", "rna.gbk" or die "Cannot open rna.gbk: $!";

# The output file
open my $comments_rna, ">", "comments_rna.gbk"
    or die "Cannot open comments_rna.gbk: $!";

# Iterate over each line in the input file
while (<$rna_gbk>) {

    # Look for the start of a new record in a line beginning with 'LOCUS'
    if (/^LOCUS\s+([A-Z0-9_]+)/) {
        $accession = $1;
        # Reset so a record without a comment does not inherit the previous one
        $comment = "";
        $i++;

    # Look for a comment
    } elsif (/^COMMENT\s+(.*)/) {
        $comment = $1;
        # Concatenate subsequent (indented) comment lines
        while (<$rna_gbk>) {
            if (/^\s+(.*)/) {
                $comment .= " $1";
            } else {
                last;
            }
        }
        # Stop if the file ended inside the comment
        last unless defined $_;
    }

    # Look for the NCBI gene ID.  This is found within the features block.
    # After a comment, $_ holds the line that ended it, so it is tested here too.
    if (/^FEATURES/) {
        while (<$rna_gbk>) {
            # '//' ends a record; give up if no gene ID was found in it
            last if m{^//};
            if (/GeneID:([0-9]+)/) {
                print $comments_rna "$1\t$accession\t$comment\n";
                last;
            }
        }
    }
}
print "$i loci read.\n";

close $rna_gbk;
close $comments_rna;
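
The script above expects an uncompressed rna.gbk, while the file is supplied as rna.gbk.gz. One option is to decompress it first; another is to read the compressed file directly. The following is a minimal sketch of the second option, assuming the core module IO::Uncompress::Gunzip is available. The helper name open_genbank is hypothetical and is not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Open a GenBank file for reading, decompressing on the fly when the
# name ends in .gz. (open_genbank is a hypothetical helper name.)
sub open_genbank {
    my ($path) = @_;
    if ($path =~ /\.gz$/) {
        my $gz = IO::Uncompress::Gunzip->new($path, MultiStream => 1)
            or die "Cannot gunzip $path: $GunzipError";
        return $gz;
    }
    open my $fh, "<", $path or die "Cannot open $path: $!";
    return $fh;
}

# Count the LOCUS lines in the file named on the command line, if any
if (@ARGV) {
    my $in = open_genbank($ARGV[0]);
    my $count = 0;
    while (my $line = <$in>) {
        $count++ if $line =~ /^LOCUS\s/;
    }
    close $in;
    print "$count loci found in $ARGV[0]\n";
}
```

The returned handle can be read line by line in the same way as an ordinary filehandle, so the main loop of the extraction script would need no other change.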

The comment can then be searched and extracted more easily with a simple script, such as the following:


Perl

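#!/usr/bin/perl
# A Perl script to grep comments from the file comments_rna.gbk, extracted from the file rna.gbk
use strict;
use warnings;

open my $comments_rna, "<", "comments_rna.gbk"
    or die "Cannot open comments_rna.gbk: $!";

# gene ID 9321 is an example; the trailing tab stops it matching IDs such as 93210
my @line = grep {/^9321\t/} <$comments_rna>;
close $comments_rna;

# Print the comment field of the first matching line, if there is one
if (@line) {
    my @fields = split /\t/, $line[0];
    print $fields[2];
}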
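
The grep approach rereads the whole file for every lookup. When many gene IDs need to be looked up in one run, the file could be read once into a hash instead. The following is a minimal sketch under that assumption. The helper name load_comments is hypothetical, and the in-memory sample line is built from the fragment shown earlier rather than from real output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read tab-separated "geneID<TAB>accession<TAB>comment" lines from a
# filehandle into a hash keyed by gene ID.
# (load_comments is a hypothetical helper name.)
sub load_comments {
    my ($fh) = @_;
    my %comment_for;
    while (my $line = <$fh>) {
        chomp $line;
        my ($gene_id, $accession, $comment) = split /\t/, $line, 3;
        next unless defined $comment;    # skip malformed lines
        # Keep the first entry per gene ID, as the grep script does
        $comment_for{$gene_id} //= {
            accession => $accession,
            comment   => $comment,
        };
    }
    return \%comment_for;
}

# Demonstrate with an in-memory file holding one line of the expected shape
my $sample = "9321\tNM_004239\tPROVISIONAL REFSEQ: This record has not yet been subject to final NCBI review.\n";
open my $fh, "<", \$sample or die "Cannot open in-memory file: $!";
my $comments = load_comments($fh);
close $fh;

print "$comments->{9321}{accession}\n";
```

With the real file, the in-memory handle would be replaced by one opened on comments_rna.gbk. Because the hash is keyed on the whole gene ID, a lookup for 9321 cannot pick up a longer ID that merely starts with the same digits.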

This is used in conjunction with the scripts above to produce the detailed comments for the human gene search page on this site.



Please send ideas and opinions by email at alexamies@gmail.com.

© 2006-2007 Alex Amies