Introduction to the Workshop
URLs Provided by Attendees
- Ethical, Legal, and Social Issues
The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.
Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.
Combinatorial Methods for Gene Recognition
Mikhail Gelfand [1,2], Pavel Pevzner , Michael Roytberg 
Institute of Protein Research, Russian Academy of Sciences
Department of Computer Science and Engineering, The Pennsylvania State University
Institute of Mathematical Biology, Russian Academy of Sciences
One of the main problems in prediction of complete genes in large-scale DNA sequencing projects is the combinatorial explosion of the number of potential exon-intron structures. The standard dynamic programming cannot be applied here, because there are no a priori reasons to assume the linearity of the structure-scoring functional [Gelfand, 1994]. We have suggested an algorithm that constructs the Pareto-optimal set of exon structures which is guaranteed to contain the best structure for any scoring functional and contains no unnecessary structures [Gelfand and Roytberg, 1993].
The pilot version of the GREAT (Gene Recognition and Exon Assembly Tool) software, which uses the simplest site recognition and codon usage objective functions, has been tested on an independent set of genes and benchmarked against GRAIL-2. The results were comparable: GREAT demonstrated higher sensitivity but smaller specificity than GRAIL-2 [Gelfand et al., 1994]. From this perspective it might be useful to combine the strengths of both approaches in a hybrid system.
Another issue we address is related to on-line database search for gene recognition. Existence of introns makes it necessary to use gapped matches, leading to new combinatorial problems [Pevzner and Waterman, 1994]. On the other hand, combination of exon assembly and similarity search approaches leads to the problem of finding the best local similarity between a set of candidate exon-intron structures and a sequence from the database. An efficient algorithm for this problem, related to approximate matching of regular expressions, has been constructed [Rubinov et al., 1994].
Further development will be directed towards combining of on-line database search and exon assembly, development of new coding potentials accounting to genome inhomogeneity, improvement of the structure scoring schemes, backtrack assignment of prediction certainty to individual exons, and creation of programs for analysis of large sequence fragments containing multiple genes.
The work is supported by DOE under the grant DE-FG02-94ER61919.
Gelfand M.S., Roytberg M.A. (1993) A dynamic programming algorithm for prediction of the exon-intron structure. BioSystems 30, 173-182.
Gelfand M.S., Podolsky L.I., Astakhova T.V., Roytberg M.A. (1994) Recognition of genes in human DNA sequences. Technical Report CSE-94-059, The Pennsylvania State University.
Gelfand M.S. (1994) Prediction of function in DNA sequence analysis. Technical Report CSE-94-060, The Pennsylvania State University.
Pevzner P.A., Waterman M.S. (1994) Multiple filtration and approximate pattern matching. Algorithmica (to appear).
Rubinov A.R., Gelfand M.S., Pevzner P.A. (1994) Exon assembly and fast database search (in preparation).