David C. Torney, Clive C. Whittaker, and Guochun Xie
Center for Human Genome Studies, MSK710, Los Alamos National Laboratory, Los Alamos, NM 87545
Improved detection of human coding sequences is highly desirable for many reasons, including SAmple SEquencing (see Chi et al. and Ricke et al., this meeting). We developed a novel approach which, in principle, makes full use of all differences between coding and noncoding sequence data for classification. Likelihoods of sequence data lie at the root of this approach. We first converted the DNA sequences to binary sequences, encoding as follows: A=00, C=01, G=10, T=11. Let the parity of a binary sequence be the number of ones modulo 2. For clarity, begin with two datasets of example sequences, coding and noncoding, with n (binary) letters in each sequence.[1] For each subsequence (not necessarily of consecutive letters), count the number of times its parity is even in each dataset. Although there are 2^n subsequences, it is natural to focus on those subsequences with the smallest number of letters. In fact, the distribution of the differences of the average parity of a subsequence between the two datasets narrowed as the number of letters increased. Our motivation was completeness, and it should be noted that our approach relies upon none of the mainstays of other techniques, such as subsequence frequencies or periodicities.[1,2]
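The encoding and the parity count described above can be illustrated with a minimal sketch. The function names (`to_binary`, `parity`, `even_parity_fraction`) and the representation of a subsequence as a tuple of 0-based positions are assumptions made for illustration; they are not taken from the original work.

```python
# Two-bit encoding stated in the text: A=00, C=01, G=10, T=11.
ENCODING = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}


def to_binary(dna):
    """Convert a DNA string to a list of binary letters, two per base."""
    bits = []
    for base in dna.upper():
        bits.extend(ENCODING[base])
    return bits


def parity(bits, positions):
    """Parity (number of ones modulo 2) of the letters at `positions`.

    `positions` is a hypothetical representation of a subsequence:
    a tuple of 0-based indices, not necessarily consecutive.
    """
    return sum(bits[p] for p in positions) % 2


def even_parity_fraction(dataset, positions):
    """Fraction of sequences in `dataset` whose subsequence has even parity."""
    even = sum(1 for dna in dataset if parity(to_binary(dna), positions) == 0)
    return even / len(dataset)
```

With this encoding, the two letters of a single base, for example positions (0, 1), have odd parity exactly when the base is C or G, so that subsequence alone already measures C + G content.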
To establish feasibility, we considered only those subsequences of up to six (binary) letters, and we required the first and last letters to be within 60 letters of one another. Neither the phase nor the strand of the coding sequences was known in the "training" dataset;[1] to partially mitigate the latter we appended the reverse-complementary sequences. Lacking the phase, we averaged the parities of subsequences which were translates (modulo 2) of one another. To classify "test" sequences, we retained only a small number of subsequences: essentially those with the largest magnitude of the difference in the average parities for the "training" coding and noncoding data. Retained subsequences frequently had some pairs of letters corresponding to individual bases, but a three-base periodicity was uncommon. The subsequence with two adjacent letters, corresponding to an individual base, had the largest magnitude of the difference in average parities between coding and noncoding sequences, reflecting the larger C + G content of coding sequences.
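The selection step can be sketched as follows, under several loudly stated assumptions: "translates (modulo 2)" is read here as shifts by whole bases (an even number of binary letters); reverse complements are appended to both training sets; and the names `patterns`, `mean_parity`, and `select_patterns`, together with the `keep` parameter, are hypothetical. The abstract does not specify these details, so this is one possible reading rather than the authors' implementation.

```python
from itertools import combinations

ENCODING = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_binary(dna):
    """Two binary letters per base: A=00, C=01, G=10, T=11."""
    return [bit for base in dna.upper() for bit in ENCODING[base]]


def reverse_complement(dna):
    return "".join(COMPLEMENT[b] for b in reversed(dna.upper()))


def patterns(max_letters, max_span):
    """Offset tuples starting at 0, with at most `max_letters` letters
    and a last offset of at most `max_span`."""
    for size in range(1, max_letters + 1):
        for rest in combinations(range(1, max_span + 1), size - 1):
            yield (0,) + rest


def mean_parity(dataset, pattern):
    """Average parity of `pattern` over all sequences and over all
    whole-base translates (even shifts) that fit in each sequence."""
    total = count = 0
    for dna in dataset:
        bits = to_binary(dna)
        for shift in range(0, len(bits) - pattern[-1], 2):
            total += sum(bits[shift + o] for o in pattern) % 2
            count += 1
    return total / count if count else 0.0


def select_patterns(coding, noncoding, max_letters, max_span, keep):
    """Return the `keep` patterns with the largest |difference| in
    average parity between the two training sets."""
    # Assumption: reverse complements are appended to both training sets.
    coding = coding + [reverse_complement(s) for s in coding]
    noncoding = noncoding + [reverse_complement(s) for s in noncoding]
    scored = []
    for pat in patterns(max_letters, max_span):
        diff = mean_parity(coding, pat) - mean_parity(noncoding, pat)
        scored.append((pat, diff))
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:keep]
```

The settings quoted in the text (up to six letters, span of 60) generate several million patterns in this exhaustive form, so the sketch is only practical for small `max_letters` and `max_span`.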
Finally, for "test" sequences, we added the retained subsequences' parities, each multiplied by the difference of the average parities in the training datasets. A threshold was used to classify: for sums above the threshold the classification was noncoding. The threshold was chosen to make the false-prediction rates in the two test datasets equal. After analyzing approximately 72,000 54-base training and test sequences, the false-prediction rate we obtained was 27.5%, whereas 29.5% was the smallest previously found for the same dataset using an individual feature: hexamer frequencies.[1] Because of the restrictions in our preliminary feasibility studies, substantial improvements are likely. Our approach might also contribute to modular prediction software, such as GRAIL.
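The scoring and thresholding just described might look like the sketch below. It assumes the parities of the retained subsequences have already been computed for each test sequence, and that each weight is the training difference taken as noncoding minus coding, so that larger sums point toward noncoding; the abstract does not state the sign convention. The names `score`, `error_rates`, and `equal_error_threshold` are hypothetical.

```python
def score(parities, weights):
    """Weighted sum of the retained subsequences' parities."""
    return sum(p * w for p, w in zip(parities, weights))


def error_rates(threshold, coding_scores, noncoding_scores):
    """False-prediction rates when sums above `threshold` are called noncoding."""
    false_noncoding = sum(s > threshold for s in coding_scores) / len(coding_scores)
    false_coding = sum(s <= threshold for s in noncoding_scores) / len(noncoding_scores)
    return false_noncoding, false_coding


def equal_error_threshold(coding_scores, noncoding_scores):
    """Observed score whose two false-prediction rates are closest to equal."""
    candidates = sorted(set(coding_scores) | set(noncoding_scores))

    def gap(threshold):
        a, b = error_rates(threshold, coding_scores, noncoding_scores)
        return abs(a - b)

    return min(candidates, key=gap)
```

Restricting candidate thresholds to observed scores keeps the search simple; on finite data the two rates may only be approximately equal.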
* This work was funded by the U.S. D.O.E. under contract W-7405-ENG-36.
1 J. W. Fickett and C.-S. Tung, Nucleic Acids Res., 20, 6441-6450 (1992).
2 A. Thomas and M. H. Skolnick, IMA Journal of Mathematics Applied in Medicine and Biology, 11, 149-160 (1994).