Complexity of DNA and Protein Sequences: Observation of Real Data
From Santa Fe Institute Events Wiki
Global IP Fellows Meeting
HAO Bailin
Head, T-Life Research Center, Fudan University, Shanghai, China
According to a theorem in C. Shannon's seminal 1948 paper,
the set of all symbolic sequences of length N over a finite alphabet
may be roughly divided into two subsets: a huge typical set and a tiny
set of atypical sequences. Biological sequences, as the result of billions
of years of evolution, must belong to this tiny set. An effective way of studying
the atypical subset is to look at real data. I will report some
observations on real DNA and protein data, including the "avoidance signature"
of bacterial genomes, taxon-specific repeats in these genomes, fine
structure in the number distribution of K-strings in randomized genomes,
and the near-uniqueness of reconstructing protein sequences from their
constituent K-peptides. These observations may sometimes lead to
interesting pieces of biology-inspired mathematics, including
combinatorics, graph theory, and formal language theory.
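
As a rough illustration of the kind of K-string bookkeeping mentioned above, the following minimal Python sketch counts overlapping K-strings in a sequence, lists the K-strings that never occur, and tabulates how many K-strings occur exactly n times. It is not the speaker's method: the function names, the four-letter DNA alphabet, and the toy sequence in the note below are assumptions made only for this example, and the sequence is assumed to contain only letters from the given alphabet.

```python
from collections import Counter
from itertools import product


def count_k_strings(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def missing_k_strings(sequence, k, alphabet="ACGT"):
    """Return, in sorted order, the k-strings over the alphabet that never occur."""
    counts = count_k_strings(sequence, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return sorted(s for s in candidates if s not in counts)


def number_distribution(sequence, k, alphabet="ACGT"):
    """Map an occurrence count n to the number of k-strings seen exactly n times.

    The entry for n = 0 is the number of k-strings over the alphabet
    that are absent from the sequence.
    """
    counts = count_k_strings(sequence, k)
    distribution = Counter(counts.values())
    distribution[0] = len(missing_k_strings(sequence, k, alphabet))
    return dict(sorted(distribution.items()))
```

For the hypothetical toy sequence "ACGTACGTAA" with K = 2, number_distribution returns {0: 11, 1: 1, 2: 4}: eleven of the sixteen possible 2-strings are absent, one occurs once, and four occur twice. Real genomes would call for much larger K and a more memory-conscious representation.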