Article details
Title: String Comparison in Terms of Statistical Evaluation Applied on Biological Sequences
Author(s): Alina Bogan-Marta; Nicolae Robu; Mirela Pater
Keywords: n-grams, entropy, cross-entropy, protein similarity, exploratory data analysis
Abstract:
Protein sequences from all organisms can be treated as texts written in a universal language whose alphabet consists of 20 distinct symbols, the amino acids. The mapping of a protein sequence to its structure, functional dynamics, and biological role then becomes analogous to the mapping of words to their semantic meaning in natural languages. This analogy can be exploited by applying statistical language modeling and text classification techniques to advance the understanding of biological sequences. Here, a new general strategy for measuring similarity between proteins is introduced. Our approach has its roots in computational linguistics and the related techniques for quantifying and comparing content in strings of characters. We experimented with different implementations, with the ultimate goal of developing practical, computationally efficient algorithms. The experimental analysis provides evidence for the usefulness and potential of the new approach and motivates the further development of linguistics-related tools as a means of deciphering biological sequences.