Spliced Alignment: a New Approach to Gene Recognition

Mikhail S. Gelfand,[1] Andrey A. Mironov,[2] Pavel A. Pevzner[3]

Gene recognition is one of the most important problems in computational molecular biology. Previous attempts to solve this problem were based on statistics and artificial intelligence and, surprisingly enough, applications of theoretical computer science methods for gene recognition were almost unexplored. Recent advances in large-scale cDNA sequencing open a way towards a new combinatorial approach to gene recognition. This paper describes a spliced alignment algorithm and a software tool which explores all possible exon assemblies in polynomial time and finds the multi-exon structure with the best fit to a related protein. Unlike other existing methods, the algorithm successfully performs exons assemblies even in the case of short exons or exons with unusual codon usage; we also report correct assemblies for genes with more than 10 exons provided a homologous protein is already known. On a test sample of human genes with known mammalian relatives the average overlap between the predicted and the actual genes was 98%, which is remarkably well as compared to other existing methods. At that, the algorithm absolutely correctly reconstructed 87% of genes. The rare discrepancies between the predicted and real exon-intron structures were restricted either to extremely short initial or terminal exons (less than 5 amino acids) or proved to be results of alternative splicing and errors in database feature tables. Moreover, the algorithm performs reasonably well with non-vertebrate and even prokaryote targets. The dependence of the performance on the evolutionary distance between the analyzed gene and the target protein was estimated using simulated data. The main result of the paper is that a relatively simple algorithm based on combinatorial common sense and some biological intuition can outperform many approaches developed for gene recognition in the last fifteen years.

[1] Institute of Protein Research, Russian Academy of Sciences, Puschino, Moscow region, 142292, Russia. misha@imb.imb.free.net

[2] Laboratory of Mathematical Methods, National Center for Biotechnology, NIIGENETIKA, Moscow, 113545, Russia. mir@vnigen.msk.su

[3] Departments of Mathematics and Computer Science, University of Southern California, Los Angeles, CA 90089-1113.

pevzner@cse.psu.edu This work is supported by DOE grant DE-FG02-94ER61919.


Abstracts scanned from text submitted for January 1996 DOE Human Genome Program Contractor-Grantee Workshop.

Return to Table of Contents