Gary Churchill and Betty Lazareva
Biometrics Unit, Cornell University, Ithaca, NY 14853
We present a model for random errors that occur in DNA sequence data. The model is defined in terms of three parameters, one for each possible error type: substitution, insertion, and deletion. We describe a Gibbs sampling algorithm that simultaneously estimates the error rate parameters and restores the DNA sequence. Parameter estimates are summarized as a posterior density. The restored DNA sequence can be summarized as a modal sequence or as a posterior credible region, which takes the form of a cylinder set in sequence space. The methods are applied to a set of DNA sequence fragments from a human gene. Possible generalizations of the model and the algorithm are discussed in light of these results.
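To make the three-parameter error model concrete, the following is a minimal simulation sketch, not the authors' formulation. It assumes that errors act independently at each position, that a substituted base is drawn uniformly from the three other bases, and that an inserted base is drawn uniformly from all four. The function name `corrupt` and the parameter names `sub_rate`, `ins_rate`, and `del_rate` are hypothetical, as are the example rates.

```python
import random

BASES = "ACGT"

def corrupt(sequence, sub_rate, ins_rate, del_rate, rng):
    """Apply independent per-position errors to a true sequence.

    Illustrative assumption only: the abstract does not specify how the
    three error types combine, so this is one plausible reading.
    """
    observed = []
    for base in sequence:
        # Insertion: an extra random base appears before this position.
        if rng.random() < ins_rate:
            observed.append(rng.choice(BASES))
        u = rng.random()
        if u < del_rate:
            # Deletion: the true base is not observed at all.
            continue
        if u < del_rate + sub_rate:
            # Substitution: observe one of the three other bases.
            observed.append(rng.choice([b for b in BASES if b != base]))
        else:
            # No error: the true base is observed unchanged.
            observed.append(base)
    return "".join(observed)

# Hypothetical usage: generate one noisy fragment from a short true sequence.
rng = random.Random(0)
truth = "ACGTACGTAC"
fragment = corrupt(truth, sub_rate=0.02, ins_rate=0.01, del_rate=0.01, rng=rng)
```

A sampler of the kind described would work in the opposite direction, inferring the three rates and the underlying sequence from a set of such noisy fragments.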