Complexity of DNA and Protein Sequences: Observation of Real Data

Global IP Fellows Meeting

HAO Bailin
Head, T-Life Research Center, Fudan University, Shanghai, China
According to a theorem in C. Shannon's seminal 1948 paper, the set of all symbolic sequences of length N over a finite alphabet may be roughly divided into two subsets: a huge typical set and a tiny set of atypical sequences. Biological sequences, as the result of billions of years of evolution, must belong to this tiny set. An effective way of studying the atypical subset is to look at real data. I will report some observations on real DNA and protein data, including the "avoidance signature" of bacterial genomes, taxon-specific repeats in these genomes, fine structure in the number distribution of K-strings in randomized genomes, and the almost-unique reconstruction of protein sequences from their constituent K-peptides. These observations sometimes lead to interesting pieces of biology-inspired mathematics, including combinatorics, graph theory, and formal language theory.
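As a rough illustration of what counting K-strings involves, the sketch below tallies every overlapping K-string in a DNA sequence and lists the K-strings that never occur. It is not the speaker's method: the function names, the toy sequence, and the reading of "avoidance" as simply "K-strings absent from the sequence" are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def count_k_strings(seq, k):
    """Count every overlapping K-string (substring of length k) in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def absent_k_strings(seq, k, alphabet="ACGT"):
    """Return, in sorted order, the K-strings over alphabet that never occur in seq."""
    counts = count_k_strings(seq, k)
    all_k_strings = ("".join(p) for p in product(alphabet, repeat=k))
    return sorted(s for s in all_k_strings if s not in counts)


# Hypothetical toy sequence, far shorter than any real genome.
toy = "ACGTACGTTGCAACGT"
counts = count_k_strings(toy, 2)
missing = absent_k_strings(toy, 2)

print(counts.most_common(3))
print(missing)
```

On a real genome one would read the sequence from a file and use a larger K; absent strings only become informative once the sequence is long enough that every K-string would be expected to appear by chance.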

From Santa Fe Institute Events Wiki