Approaches to Web Development for Bioinformatics

Previous  Contents  Next
References

Codon Bias Example

In the Biology Background section I discussed translation of codons to amino acids.  The translation is many-to-one in the sense that different codons translate to the same amino acid.  Therefore, if given a nucleotide sequence (without introns) the protein primary structure (amino acid sequence) is uniquely determined.  On the other hand, for a given protein there are many possible nucleotide sequences.  However, not all nucleotide sequences are equally likely.

Codon bias is a term that describes which codon to amino acid translations are favored.  This varies by organism.  Here is a program that computes the codon bias for a nucleotide sequences in a FASTA file.


#!/bin/perl -w
# Program to compute codon bias

use MedicalComputing::CodonBias;
use Bio::SeqIO;

# Check command line arguments
if ($#ARGV != 1) {
die "usage: codon_bias.pl file_name starting_position\n";
}

# File to read (FASTA format)
my $inputFile = $ARGV[0];

# Starting position of coding region
my $startingPos = $ARGV[1];

# Check that file exists
if (! -e $inputFile) {
die "Cannot find file $inputFile.\n";
}

# Check that file can be read by user
if (! -r $inputFile) {
die "You are not authorized to read file $inputFile.\n";
}

# Create a Bio::SeqIO object from the FASTA file
my $inseq = Bio::SeqIO->new(-file => "<$inputFile", -format => "FASTA");
my $seqObj = $inseq->next_seq(); if (! $seqObj) {
die "No sequence object.\n";
}

# Check that the starting position is reasonable or exit
if ($startingPos > $seqObj->length) {
die "Starting position $startingPos is greater than the length of DNA ", $seqObj->length, ".\n";
}

# Get a substring starting with the coding region
my $dna = $seqObj->subseq($startingPos, $seqObj->length-1);

# Create a new CodonBias object
my $codonBias = MedicalComputing::CodonBias->new($dna);

# Scan through the sequence and tally the translations
$codonBias->tally();
print "The codon counts are:\n\tCodon\tCount\t\Percent\n";

# An array that holds the names of the amino acids translated.
my @aminoAcids = $codonBias->aminoAcids();

# Iterate over the amino acids to display them on the command line with totals
my $totalCount = 0;
my $numCodons = 0;
for my $aminoAcid (@aminoAcids) {
my $totalPercent = 0.0;
print $aminoAcid, "\n";

# Iterate over the codons translating to each amino acid
for my $codon ($codonBias->codonsForAminoAcid($aminoAcid)) {
if ($codonBias->codonCount($codon)) {
print "\t", $codon, "\t", $codonBias->codonCount($codon), "\t";
my $codonPercent = $codonBias->codonProportion($aminoAcid, $codon) * 100;
$totalPercent += $codonPercent;
$totalCount += $codonBias->codonCount($codon);

# Format the percentage bias with 2 digits before and after the decimal point
printf "%2.2f\n", $codonPercent;
} else {
print "\t$codon\tNot found\n";
}
$numCodons++;
}

# Print totals for each amino acid
$totalCount += $codonBias->totalCondonCount($aminoAcid);
print "\t", "Total:", "\t", $codonBias->totalCondonCount($aminoAcid),
"\t", $totalPercent, "\n";
}

# Check totals
print "Total codon count for all amino acids: $totalCount, ",
($#aminoAcids + 1), " amino acids, $numCodons unique codons.";

The program gets the sequence from the FASTA file and computes the frequency of each codon translation.  Most of the work is done by the MedicalComputing::CodonBias and Bio::SeqIO modules.   The MedicalComputing::CodonBias module should be placed in a directory called MedicalComputing in a file called CodonBias.pm in your Perl library path.

The program uses a few Perl language features not discussed before. The die function terminates the program, giving the operating system an error in the process.  The -e $inputFile if condition tests for the existence of the file specified on the command line.  The -r $inputFile if condition checks to see whether the user is allowed to read the file.

The program can be run from the command line shown here. The input file is HD.txt (the gene for Huntington disease) and the coding region starting position is 146.  The first nucleotide is position 1.


>codon_bias.pl HD.txt 146
The codon counts are:
    Codon  Count  Percent
Asparagine
    AAT    35     38.89
    AAC    55     61.11
    Total: 90     100
Glycine
    GGT    17     12.23
    GGC    39     28.06
    GGA    44     31.65
    GGG    39     28.06
    Total: 139    100
Proline
...

The full output listing is in codon_bias_hd.txt.

Since most of the analytic code is externalized in the modules and most of the user interface code is in the main script the modules can be reused in a web context.  Here is a CGI script that uploads a FASTA format file and reports the total codon biases.


#!/usr/bin/perl
# A CGI script to compute codon bias from an uploaded a FASTA file

# Omit this line or replace myhome with the location of your Perl libraries
use lib qw(/myhome/perllib);

use IO::String;
use CGI;
use Bio::SeqIO;
use MedicalComputing::CodonBias;

my $q = new CGI;

print
    $q->header,
    $q->start_html('Codon Bias Computation'),
    $q->h1('Codon Bias Computation');

# Display the form
if (! $q->param('uploaded_file')) {

    print $q->start_multipart_form,
          "Upload a FASTA file to compute codon bias", $q->br,
          $q->filefield(-name=>'uploaded_file',
                        -default=>'starting value ignored',
                        -size=>50,
                        -maxlength=>120),$q->p,
          "Starting position", $q->br,
          $q->textfield(-name=>'startingPos',
                        -default=>'1',
                        -size=>4,
                        -maxlength=>4),
          $q->p,
          $q->submit('Submit'),
          $q->end_form;

# Process the form
} else {

    # upload the file
    my $filename = $q->upload('uploaded_file');

    # Get starting position
    my $startingPos = 1;
    if ($q->param('startingPos')) {
        $startingPos = $q->param('startingPos');
    }

    # Get file upload information
    my $type = $q->uploadInfo($filename)->{'Content-Type'};
    print "File $filename, starting position $startingPos", $q->p, "\n";

    # Only process text files
    if ($type eq 'text/plain') {

        # Read the text file into the string object $data (max 10 000 lines)
        my $data;
        my $i = 0;
        while (defined($line = <$filename>) && ($i < 10000)) {
            $data .= $line;
            $i++;
        }

        # With an IO::String the data can be read into Bio::SeqIO
        my $stringfh = new IO::String($data);

        # Use Bio::SeqIO to parse the string
        my $seqIO = Bio::SeqIO->new(-fh => $stringfh, -format => "FASTA");
        my $seqObj = $seqIO->next_seq();
        if (! $seqObj) {
            print "No sequence information.", $q->p;
        } else {

            # Get a substring starting with the coding region
            my $dna = $seqObj->subseq($startingPos, $seqObj->length-1);

            # Create a new CodonBias object
            my $codonBias = MedicalComputing::CodonBias->new($dna);

            # Scan through the sequence and tally the translations
            $codonBias->tally();
            print "<table border='1'>\n",
                  "<caption>Codon Percentages</caption>\n",
                  "<tr><th>Amino Acid</th><th>Codon</th><th>Percent</th></tr>\n";

            # An array that holds the names of the amino acids translated.
            my @aminoAcids = $codonBias->aminoAcids();

            # Iterate over the amino acids to display the proportions
            for my $aminoAcid (@aminoAcids) {
                my @codonsForAminoAcid = $codonBias->codonsForAminoAcid($aminoAcid);
                my $rowSpan = $#codonsForAminoAcid + 1;
                print "<tr><td rowspan='$rowSpan'>$aminoAcid</td>\n";

                # Iterate over the codons translating to each amino acid
                my $codonIndex = 0;
                for my $codon (@codonsForAminoAcid) {
                    if ($codonIndex) {
                        print "<tr>";
                    }
                    $codonIndex++;
                    print "<td>$codon</td><td>";
                    if ($codonBias->codonCount($codon)) {
                        my $codonPercent = $codonBias->codonProportion($aminoAcid, $codon) * 100;

                        # Format the percentage bias with 2 digits before and after the decimal point
                        printf "%2.2f", $codonPercent;
                    } else {
                        print "Not found";
                    }
                    print "</td></tr>\n";
                }
            }
            print "</table>", $q->p, "\n";
        }
  
    } else {
        # Give a warning message for other types of files than text files
        print "Only text files permitted", $q->p;
    }

    # Go back to the start page
    print $q->a({-href=>'codon_bias_web.pl'}, "Start again"), $q->p;

}
print $q->end_html;

A working version of the page is at codon_bias_web.  The main difference in usage of the Bio::SeqIO module is that instead of loading the sequence directly from the FASTA file it is uploaded, read into a string, and then used to construct an IO::String object.  The IO::String object can then be used to construct a Bio::SeqIO object.  Another version of the web page with additional frills is at codon_bias_mc.


Previous  Contents  Next
References


Contributed Comments and NotesAdd a comment.

There are no user comments.

Google

Please send ideas and opinions by email at alexamies@gmail.com.

© 2006-2007 Alex Amies