Approaches to Web Development for Bioinformatics


Perl Basics

Let's look at a couple of examples. Here is the classic 'Hello World' in Perl:


#!/usr/bin/perl -w
# An example Perl program
print "Hello World!\n";

On UNIX, save the file as hello.pl and make it executable with the command chmod +x hello.pl. The program can then be run from the directory containing it with the command


> ./hello.pl

If you have Linux, it is probably as easy as that, because Perl is pre-installed with most Linux distributions and the first line of the Perl program tells the Linux shell how to invoke the Perl interpreter. On Windows, you need to download a Perl interpreter. The O'Reilly Perl site [14] has links to downloads for Windows and UNIX. On Windows I used the ActiveState ActivePerl interpreter, which is freely available but not open source. On Windows, you can run the program by typing hello.pl at the command prompt if the .pl extension is registered to invoke the Perl interpreter. Otherwise, use


> perl hello.pl

Here is a program to validate that all the characters in a given string are valid DNA symbols. The program uses a Perl regular expression to ensure that the DNA characters input from the command line all belong to the set {'a', 't', 'c', 'g', 'A', 'T', 'C', 'G'}.


#!/usr/bin/perl -w
# Validates a DNA string to ensure that there are no invalid characters
# The DNA string is the first command line parameter
my $dna = $ARGV[0];

# Validate using a regular expression
# The i modifier makes the match case-insensitive
if ($dna =~ m/([^atcg])/i) {
    print "Invalid DNA symbol found: $1 \n";
}

The my keyword declares a variable that is local to the enclosing block or file. Its use is optional unless the strict pragma is enabled, but it improves readability. The built-in array @ARGV holds the command line arguments, so $ARGV[0] accesses the first one. The program can be tested from the command line using the example input


> perl validate_dna.pl atcgNSTOP

This gives the output


Invalid DNA symbol found: N

For security reasons, validation of input data is a very important consideration when creating web user interfaces.
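
The program above reports only the first invalid character, whereas code that handles web input usually needs a yes or no answer for the whole string. The following is a minimal sketch of that idea and is not part of the original program. The subroutine name is_valid_dna is hypothetical. It anchors the pattern at both ends so that every character must be a DNA symbol, and it treats a missing or empty string as invalid.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the whole string consists of
# DNA symbols (in either case) and 0 otherwise
sub is_valid_dna {
    my $dna = shift;
    return 0 unless defined $dna;
    # \A and \z anchor the match to the start and end of the string
    return $dna =~ m/\A[atcg]+\z/i ? 1 : 0;
}

for my $input ("atcg", "ATCGatcg", "atcgNSTOP", "") {
    my $result = is_valid_dna($input) ? "valid" : "invalid";
    print "'$input' is $result\n";
}
```

Returning a value rather than printing inside the subroutine lets the caller decide how to respond, for example by rejecting a form submission.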

Here is an example that demonstrates the foreach language construct to iterate over a list of codons:


#!/usr/bin/perl -w
# Demonstrates a foreach loop

# Uses the quoted words qw shortcut
my @codonList = qw(uuu uuc uua uug);

foreach my $codon (@codonList) {
    print "Codon: $codon\n";
}

The program uses the quoted words (qw) shortcut to create the array @codonList. Notice the use of the @ symbol when referring to the array. The foreach keyword iterates over the array @codonList, placing each element in turn in the variable $codon. Executing the script leads to the output


Codon: uuu
Codon: uuc
Codon: uua
Codon: uug
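
Loops of this kind can be nested. As a sketch that goes slightly beyond the example above, the following script builds every three-base RNA codon from the four bases. The variable names @bases and @allCodons are chosen here for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The four RNA bases
my @bases = qw(u c a g);

# Build every three-base combination with nested foreach loops
my @allCodons;
foreach my $first (@bases) {
    foreach my $second (@bases) {
        foreach my $third (@bases) {
            push @allCodons, "$first$second$third";
        }
    }
}

# An array evaluated in scalar context gives its number of elements
print "Number of codons: ", scalar(@allCodons), "\n";

# An array slice interpolates into a string separated by spaces
print "First four: @allCodons[0..3]\n";
```

With the bases in this order, the first four codons generated are the same four used in the example above.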

Here is an example that transcribes a DNA sequence to RNA using a regular expression.


#!/usr/bin/perl -w
# Transcribes a DNA string into an RNA string

# String to transcribe
my $dna = "atggcacaggcactgttggtacccccaggacctgaaagcttccgcctttttactaga";

# Substitute all t's with u's
$dna =~ s/t/u/g;

print "RNA: $dna \n";

This demonstrates the use of the binding operator =~, which applies the substitution to the variable on its left-hand side and modifies that variable in place. The 's' in the expression stands for 'substitute', as in the UNIX sed command, and the trailing 'g' makes the substitution global so that every t is replaced rather than only the first. Running this script gives the following output:


> perl dna_to_rna.pl
RNA: auggcacaggcacuguugguacccccaggaccugaaagcuuccgccuuuuuacuaga
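
For a single-character replacement like this one, Perl's transliteration operator tr is a common alternative to a substitution. The sketch below is one possible variation and is not part of the original script. The subroutine name transcribe is hypothetical. It copies the string before changing it so that the original DNA is preserved, and it also handles upper-case input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the RNA transcript of a DNA string
# without modifying the caller's variable
sub transcribe {
    my $dna = shift;
    # Copy the DNA into $rna, then replace t with u and T with U
    (my $rna = $dna) =~ tr/tT/uU/;
    return $rna;
}

my $dna = "atggcaCAGGCACTGttg";
my $rna = transcribe($dna);

print "DNA: $dna\n";
print "RNA: $rna\n";
```

Because tr maps characters one-for-one, it needs no g modifier; it always processes the whole string.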

Here is an example program that performs the RNA to amino acid translation discussed in the Biology Background section. It demonstrates the use of a hash table (%rna_to_amino_acid).


#!/usr/bin/perl -w
# Translates a codon into an amino acid

# Hash table for codon to amino acid translation
my %rna_to_amino_acid = (
    uuu => "F", uuc => "F", uua => "L", uug => "L",
    ucu => "S", ucc => "S", uca => "S", ucg => "S",
    uau => "Y", uac => "Y", uaa => "--STOP--", uag => "--STOP--",
    ugu => "C", ugc => "C", uga => "--STOP--", ugg => "W",
    cuu => "L", cuc => "L", cua => "L", cug => "L",
    ccu => "P", ccc => "P", cca => "P", ccg => "P",
    cau => "H", cac => "H", caa => "Q", cag => "Q",
    cgu => "R", cgc => "R", cga => "R", cgg => "R",
    auu => "I", auc => "I", aua => "I", aug => "M",
    acu => "T", acc => "T", aca => "T", acg => "T",
    aau => "N", aac => "N", aaa => "K", aag => "K",
    agu => "S", agc => "S", aga => "R", agg => "R",
    guu => "V", guc => "V", gua => "V", gug => "V",
    gcu => "A", gcc => "A", gca => "A", gcg => "A",
    gau => "D", gac => "D", gaa => "E", gag => "E",
    ggu => "G", ggc => "G", gga => "G", ggg => "G"
);

# RNA string to translate
my $rna = "auggcacaggcacuguugguacccccaggaccugaaagcuuccgccuuuuuacuaga";

# Amino acid string
my $amino_acid;

# Iterate over RNA string translating codon to amino acid
my $i = 0;
while ($i < length $rna) {
    $amino_acid .= $rna_to_amino_acid{substr($rna, $i, 3)};
    $i += 3;
}

print "Amino acid string $amino_acid \n";

The curly brackets {} indicate that the inside is to be used as a key to look up the value in the given hash table.  The output is the same as the example output from the Biology Background section above:


> perl rna_translation.pl
Amino acid string MAQALLVPPGPESFRLFTR
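
The program above assumes that every codon is present in the hash table and that the sequence contains no stop codon. The following sketch shows one way those cases might be handled; it is not part of the original program. The subroutine name translate and the '?' placeholder are assumptions made for illustration, and the table is deliberately cut down to a few entries to keep the listing short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A deliberately small table for illustration only; a real program
# would use the full 64-entry table shown above
my %rna_to_amino_acid = (
    aug => "M", gca => "A", cag => "Q", cug => "L",
    uaa => "--STOP--", uag => "--STOP--", uga => "--STOP--",
);

# Hypothetical helper: translates until a stop codon is reached, using
# '?' for any codon that is missing from the table
sub translate {
    my $rna = lc(shift);
    my $protein = "";
    # A /g match in list context returns every complete three-letter
    # group; an incomplete group at the end is ignored
    foreach my $codon ($rna =~ m/(...)/g) {
        my $amino_acid = exists $rna_to_amino_acid{$codon}
            ? $rna_to_amino_acid{$codon}
            : "?";
        last if $amino_acid eq "--STOP--";
        $protein .= $amino_acid;
    }
    return $protein;
}

print translate("auggcacagcuguaagca"), "\n";
print translate("augnnngcaug"), "\n";
```

The exists function tests whether a key is present in the hash without creating it, which avoids appending an undefined value when the input contains an unexpected codon.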

Perl makes the concept of references explicit. A reference is created with the backslash operator '\'. For example, in the following script \@symbols is a reference, or pointer, to the array @symbols. Braces (or curly brackets {}) dereference a reference so that @{$inputSymbols} is the array again. The array is passed into the subroutine by reference, as opposed to by value, which would involve making a copy.


#!/usr/bin/perl -w
# An example Perl program to demonstrate references

# Create an array of input symbols
my @symbols = qw(a b c);

# Make call to validate the input symbols are valid DNA symbols
validateSymbols(\@symbols);

# Subroutine to validate that all the input symbols are one of a c g t
# Only parameter is a reference to an array of input symbols to be checked
sub validateSymbols {
    my $inputSymbols = shift;
    my @dna = qw(a c g t);

    # Braces not strictly necessary (can use @$inputSymbols)
    for my $item (@{$inputSymbols}) {
        unless (grep $item eq $_, @dna) {
            print "Input symbol '$item' is not a valid DNA symbol.\n";
        }
    }
}

This example also introduces Perl subroutines, which are the equivalent of functions in other languages, such as C. The keyword sub is used to define a subroutine. The value of the last expression evaluated in a subroutine is its return value unless an explicit return statement is used, although the return value was not used in the example above. The subroutine parameters are passed in the array @_, which is accessed implicitly by the shift function above. The shift function returns the first member of the array argument and deletes it from (shifts it off) the array. Running the program gives the following output:


> perl validate_sub.pl
Input symbol 'b' is not a valid DNA symbol.
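
The subroutine above prints its findings and returns nothing useful. As a sketch of how a return value and a returned reference might be used together, the following variation collects the invalid symbols and hands them back to the caller. It is not part of the original example, and the subroutine name findInvalidSymbols is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: takes a reference to an array of symbols and
# returns a reference to a new array holding the invalid ones
sub findInvalidSymbols {
    my $inputSymbols = shift;

    # A hash used as a lookup set of the valid DNA symbols
    my %isDna = map { $_ => 1 } qw(a c g t);

    my @invalid = grep { !$isDna{$_} } @{$inputSymbols};
    return \@invalid;
}

my @symbols = qw(a b c x);
my $invalid = findInvalidSymbols(\@symbols);

# @{...} dereferences the whole array; the arrow operator
# dereferences a single element
print "Number of invalid symbols: ", scalar(@{$invalid}), "\n";
print "First invalid symbol: $invalid->[0]\n";
```

Returning a reference avoids copying the result list, in the same way that passing \@symbols avoids copying the input.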




Please send ideas and opinions by email at alexamies@gmail.com.

© 2006-2007 Alex Amies