Catherine A. Macken and Kevin P. Murphy
Theoretical Division, Los Alamos National Laboratory, Los Alamos, NM 87545.
Hidden Markov Models (HMMs) have received considerable attention in the recent literature as models underlying recognition of patterns in DNA and RNA, gene finding, speech recognition and other applications. With few exceptions, the approach adopted has required prespecification of the topology of the HMM. Since HMMs have typically been used in structured contexts, such as on a family of protein sequences to discover motifs, or to predict the two-dimensional structure of RNA, there is often a natural topology that can be specified. The analysis then requires inference of the independent probabilities of transition among the hidden states and of symbol emission, that is, of Pr{Xt+l|Xt} and Pr{st|Xt}, where t is the current time and Xt and st are the state and symbol (respectively) at time t.
Our interest is in generalizing the HMM approach so that we can investigate the usefulness of this model-class in discovering de novo underlying patterns in a long sequence of uncharacterized DNA. Our approach has two major facets:
(i) we generalize the HMM to a non-deterministic probabilistic finite automaton (NPFA) in which transitions among the hidden states are defined by the probabilities: Pr{Xt+l = i,st|Xt} > 0 . (Non-determinism allows for more than one possible transition out of state Xt upon reading the symbol st.) (ii) we infer the topology as well as the parameters of the NPFA from the data.
We are testing our methods using a simple stochastic finite automaton to generate a data stream. We have preliminary results for exon and intron data. Our hope is that we will discover structure in sequences that leads to insights into their fundamental biological properties.
Work supported under contract W-7405-ENG-36.