Approaches to Web Development for Bioinformatics


Biology Background

This section reviews the basic biology background for some of the software tools discussed in this paper.  It will be more useful to readers coming from a software background than to those from a biology or medical background.  A fuller biology background is given in Purves et al., Life: The Science of Biology [9], in Griffiths et al., Introduction to Genetic Analysis [28], and in Gibas and Jambeck, Developing Bioinformatics Skills [10].

Deoxyribonucleic acid (DNA) is composed of nucleotides, each carrying one of the four nucleic acid bases adenine, guanine, cytosine, and thymine.  To produce proteins, all living organisms transcribe DNA to ribonucleic acid (RNA), which is then translated to protein.  The sequence of nucleotides and its translation form the basis for much theory and ongoing research in biology.

DNA Transcription to RNA

In all living organisms DNA is transcribed to RNA.  There are several different kinds of RNA.  Most genes code for messenger RNA (mRNA), which is then translated to protein.  Some genes code for ribosomal RNA (rRNA), which forms a large part of the ribosomes.  A small number of genes code for transfer RNA (tRNA), which assists in the translation process itself.  Prokaryotes (e.g. bacteria) do not have a nucleus, so mRNA is translated to protein near the transcription site.  In eukaryotes (including humans), however, transcription takes place in the nucleus and the product, mRNA, is transported to the cytoplasm for protein synthesis.
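For readers from a software background, transcription can be pictured as a simple string operation.  The following is a minimal sketch in Python; the function names are illustrative only and do not come from any tool discussed in this paper.  It assumes the input contains only the letters A, C, G, and T.

```python
# Complement table from DNA bases to RNA bases (A pairs with U in RNA).
DNA_TO_RNA_COMPLEMENT = str.maketrans("ACGT", "UGCA")


def transcribe_coding_strand(dna):
    """mRNA has the same sequence as the coding strand, with U in place of T."""
    return dna.upper().replace("T", "U")


def transcribe_template_strand(template_3_to_5):
    """Build mRNA (5' to 3') from a template strand written 3' to 5'.

    Each template base is replaced by its complementary RNA base.
    """
    return template_3_to_5.upper().translate(DNA_TO_RNA_COMPLEMENT)


print(transcribe_coding_strand("atggcacag"))    # AUGGCACAG
print(transcribe_template_strand("TACCGTGTC"))  # AUGGCACAG
```

Both calls describe the same piece of double-stranded DNA, once by its coding strand and once by its template strand, so they yield the same mRNA.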

In eukaryotes the DNA within genes contains non-coding sequences called introns.  The coding sequences between them are called exons.  The introns are spliced out of the RNA transcript and the exons are concatenated to form the mRNA exported from the nucleus.  Alternative splicing can take place, making it possible to form different proteins from the same gene.
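Splicing can be sketched as joining slices of a string.  The example below is a toy illustration in Python: the sequence and the exon coordinates are invented for the example and do not correspond to any real gene.

```python
def splice(pre_mrna, exons):
    """Concatenate exon slices given as (start, end) pairs, 0-based, end-exclusive."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# Hypothetical transcript: three exons separated by two introns.
pre_mrna = "AUGGCA" + "GUAAGUUUUCAG" + "CAGGCA" + "GUGAGUCCCAG" + "CUGUAA"
exons = [(0, 6), (18, 24), (35, 41)]

# Normal splicing keeps all three exons.
print(splice(pre_mrna, exons))                  # AUGGCACAGGCACUGUAA

# Alternative splicing: skipping the second exon gives a different mRNA.
print(splice(pre_mrna, [exons[0], exons[2]]))   # AUGGCACUGUAA
```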

mRNA Translation and Protein Synthesis

Proteins are synthesized from mRNA in molecular factories called ribosomes.

Since there are four nucleic acid bases but 20 amino acids, it takes more than one base to specify an amino acid.  In fact, nucleic acid bases are grouped into sets of three called codons, each of which translates into a single amino acid.  There are 64 (4^3) possible codons, and their translations to amino acids are given in the table below.
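The counting argument can be checked in a few lines of Python.  This is only a sketch of the arithmetic; the variable names are arbitrary.

```python
from itertools import product

BASES = "UCAG"

# Two bases give only 4^2 = 16 combinations, too few for 20 amino acids.
print(len(BASES) ** 2)   # 16

# Three bases give 4^3 = 64 combinations, more than enough.
codons = ["".join(triple) for triple in product(BASES, repeat=3)]
print(len(codons))       # 64
print(codons[:4])        # ['UUU', 'UUC', 'UUA', 'UUG']
```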

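Translation of Nucleic Acid Bases from Messenger RNA (mRNA) into Amino Acids to Build Proteins
(A = adenine, G = guanine, C = cytosine, U = uracil, or T = thymine in DNA)

First     Base in Second Position of Codon                                                                             Third
Position  U (T)                           C                       A                           G                        Position

U (T)     UUU Phenylalanine (Phe, F)      UCU Serine (Ser, S)     UAU Tyrosine (Tyr, Y)       UGU Cysteine (Cys, C)    U
          UUC Phenylalanine (Phe, F)      UCC Serine (Ser, S)     UAC Tyrosine (Tyr, Y)       UGC Cysteine (Cys, C)    C
          UUA Leucine (Leu, L)            UCA Serine (Ser, S)     UAA Stop                    UGA Stop                 A
          UUG Leucine (Leu, L)            UCG Serine (Ser, S)     UAG Stop                    UGG Tryptophan (Trp, W)  G

C         CUU Leucine (Leu, L)            CCU Proline (Pro, P)    CAU Histidine (His, H)      CGU Arginine (Arg, R)    U
          CUC Leucine (Leu, L)            CCC Proline (Pro, P)    CAC Histidine (His, H)      CGC Arginine (Arg, R)    C
          CUA Leucine (Leu, L)            CCA Proline (Pro, P)    CAA Glutamine (Gln, Q)      CGA Arginine (Arg, R)    A
          CUG Leucine (Leu, L)            CCG Proline (Pro, P)    CAG Glutamine (Gln, Q)      CGG Arginine (Arg, R)    G

A         AUU Isoleucine (Ile, I)         ACU Threonine (Thr, T)  AAU Asparagine (Asn, N)     AGU Serine (Ser, S)      U
          AUC Isoleucine (Ile, I)         ACC Threonine (Thr, T)  AAC Asparagine (Asn, N)     AGC Serine (Ser, S)      C
          AUA Isoleucine (Ile, I)         ACA Threonine (Thr, T)  AAA Lysine (Lys, K)         AGA Arginine (Arg, R)    A
          AUG Methionine (Met, M); Start  ACG Threonine (Thr, T)  AAG Lysine (Lys, K)         AGG Arginine (Arg, R)    G

G         GUU Valine (Val, V)             GCU Alanine (Ala, A)    GAU Aspartic Acid (Asp, D)  GGU Glycine (Gly, G)     U
          GUC Valine (Val, V)             GCC Alanine (Ala, A)    GAC Aspartic Acid (Asp, D)  GGC Glycine (Gly, G)     C
          GUA Valine (Val, V)             GCA Alanine (Ala, A)    GAA Glutamic Acid (Glu, E)  GGA Glycine (Gly, G)     A
          GUG Valine (Val, V)             GCG Alanine (Ala, A)    GAG Glutamic Acid (Glu, E)  GGG Glycine (Gly, G)     G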

Uracil in RNA replaces thymine (T) in DNA.

This genetic code is common to nearly all organisms.  Mitochondrial genomes, however, use slightly different codes.
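In software the standard genetic code is usually stored as a lookup table keyed by codon.  The sketch below builds one in Python from a 64-letter string of one-letter amino acid codes, listed in the same U, C, A, G order as the table above, with * marking a stop codon.  The names CODON_TABLE and AMINO_ACIDS are illustrative; a mitochondrial code would need a different amino acid string.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# One-letter amino acid codes for UUU, UUC, UUA, UUG, UCU, ... GGG; '*' = Stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

print(CODON_TABLE["AUG"])  # M (methionine, also the start codon)
print(sorted(c for c, aa in CODON_TABLE.items() if aa == "*"))  # ['UAA', 'UAG', 'UGA']

# The code is redundant: several codons map to the same amino acid.
codons_per_amino_acid = Counter(CODON_TABLE.values())
print(codons_per_amino_acid["L"])  # 6 codons for leucine
print(codons_per_amino_acid["W"])  # 1 codon for tryptophan
```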

In most prokaryotes and in all eukaryotes, the first amino acid synthesized is methionine.  Hence, in the table above methionine is listed in the same cell as the start codon.  Not all occurrences of AUG represent the start of a coding sequence, however.  In the ribosome an initiation complex assembles and scans the mRNA for an AUG codon that is in the proper sequence context.

Transfer RNA (tRNA) molecules recognize codons in mRNA and translate them to amino acids.  Special proteins called release factors recognize stop codons and terminate translation.

Basic translation takes the nucleotide sequence and translates it to an amino acid sequence using the table above.  As an example, consider the gene SCN3A (sodium channel, voltage-gated, type III, alpha protein [Homo sapiens]) in the GenBank genome database [2] MapViewer.  The gene is located on chromosome 2.  The first few codons starting at position 471 in the messenger RNA (mRNA) nucleotide sequence NM_006922, and their amino acid translation, are

Codon       atg gca cag gca ctg ttg gta ccc cca gga cct gaa agc ttc cgc ctt ttt act aga
Amino Acid   M   A   Q   A   L   L   V   P   P   G   P   E   S   F   R   L   F   T   R

The sequence given by GenBank is the same as that derived from the table above.  There are a couple of points to note:

  1. Even though GenBank is giving the mRNA sequence, it lists thymine (T) instead of uracil (U) as if it were DNA.
  2. I haven't yet explained why we started at position 471.  Translation begins at a start codon, which codes for methionine (Met, M), and the nucleotides before it are not translated into amino acids.  Raw DNA, especially in humans, contains many non-coding regions that are spliced out or left untranslated.  One of the fundamental problems is therefore deciding where to start translating.

This can be checked with the Swiss Institute of Bioinformatics (SIB) ExPASy Translate Tool [4] by pasting the nucleotide sequence atggcacaggcactgttggtacccccaggacctgaaagcttccgcctttttactaga into the text area of the web form.
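The same check can be made locally.  The following is a minimal Python sketch of table-driven translation, not the method used by the ExPASy tool.  It uses T rather than U because GenBank lists the mRNA that way; the function name translate and the choice of X for unrecognized codons are assumptions made for the example.

```python
from itertools import product

BASES = "TCAG"  # GenBank writes mRNA with T instead of U
# One-letter amino acid codes for TTT, TTC, TTA, TTG, TCT, ... GGG; '*' = Stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence):
    """Translate from the first base, codon by codon, stopping at a stop codon."""
    sequence = sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(sequence) - 2, 3):
        amino_acid = CODON_TABLE.get(sequence[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


fragment = "atggcacaggcactgttggtacccccaggacctgaaagcttccgcctttttactaga"
print(translate(fragment))  # MAQALLVPPGPESFRLFTR
```

The output matches the amino acid sequence listed above for the first 19 codons of NM_006922 starting at position 471.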

To do this kind of translation you need to decide what to look for and where to start.  What to look for is the amino acid sequence of one of the large number of proteins in the organism being studied.  How to decide which protein to look for is beyond the scope of the present document.  Let's assume that someone has told us what amino acid sequence to look for.

To start a sequence we need to begin with a start codon.  However, a quick look over the SCN3A gene using the browser's find function for the codon atg yields many matches.  Also, we need to consider the reverse direction (3′-5′ versus 5′-3′).  We also don't know which nucleotide triples form the codons.  Is it atg gca ..., tgg ca..., or ggc a...?  This gives the three possible frames: Frame 1, Frame 2, and Frame 3.  There are three more possible frames in the opposite direction, giving a total of six.  Translation tools either try all frames or let users specify in advance which frame to use.
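A six-frame translation can be sketched in Python as follows.  This is an illustration under simple assumptions rather than the algorithm of any particular tool: the frame labels (+1 to +3 and -1 to -3), the function names, and the use of X for unrecognized codons are all choices made for the example.  Stop codons are shown as * instead of ending the translation, so that every frame can be inspected in full.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acid codes for TTT, TTC, TTA, TTG, TCT, ... GGG; '*' = Stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate_frame(dna, offset):
    """Translate every complete codon starting at the given offset (0, 1, or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Translate three frames on the given strand and three on the opposite strand."""
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate_frame(sequence, offset)
    return frames


fragment = "atggcacaggcactgttggtacccccaggacctgaaagcttccgcctttttactaga"
for name, protein in six_frame_translation(fragment).items():
    print(name, protein)
```

For this fragment, frame +1 reproduces the translation MAQALLVPPGPESFRLFTR given earlier; the other five frames read the same nucleotides as different codons.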

See the page Genome Resources on this site for a list of popular and user suggested genome links for additional information on this topic.


© 2006-2007 Alex Amies