DOE Human Genome Program Contractor-Grantee Workshop IV

Santa Fe, New Mexico, November 13-17, 1994

The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.

Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.

A Syntactic Pattern Recognition System for DNA Sequences

David Searls
Department of Genetics, University of Pennsylvania School of Medicine, Room 475 CRB, 422 Curie Blvd., Philadelphia PA 19104-6145.

Formal language theory views languages as sets of strings over some alphabet, and specifies potentially infinite languages with concise sets of rules called grammars. Grammars are an exceptionally well-studied methodology, familiar to all computer scientists, for the description of complex, higher-order structures embodied in strings of symbols. Moreover, they can be given as input to general-purpose programs called parsers capable of determining whether a given string satisfies the rules of the grammar. Parser technology is also extensively developed, and has been applied as well to the problem of searching for complex patterns specified by grammars in large amounts of data, in a technique known as syntactic pattern recognition.
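As an illustration of the grammar-and-parser idea described above, the following is a minimal sketch in Python, not the GenLang system itself. It assumes a small hypothetical context-free grammar for a DNA stem-loop (the names GRAMMAR, STEM, LOOP, and N are invented for this example), a generic top-down recognizer that decides whether a string satisfies the grammar, and a scan function that searches a longer sequence for substrings the grammar accepts. The recognizer assumes a grammar without left recursion or empty productions.

```python
from functools import lru_cache

# Hypothetical toy grammar: a stem of complementary base pairs around a
# loop of 3 or 4 arbitrary bases. Keys are nonterminals; any symbol that
# is not a key is treated as a terminal character.
GRAMMAR = {
    "STEM": [("a", "STEM", "t"), ("t", "STEM", "a"),
             ("c", "STEM", "g"), ("g", "STEM", "c"),
             ("LOOP",)],
    "LOOP": [("N", "N", "N"), ("N", "N", "N", "N")],
    "N": [("a",), ("c",), ("g",), ("t",)],
}


def make_recognizer(grammar, text):
    """Return derives(symbol, i, j): can `symbol` generate text[i:j]?"""

    @lru_cache(maxsize=None)
    def derives(symbol, i, j):
        # A nonterminal derives the span if any of its productions does.
        return any(matches(prod, i, j) for prod in grammar[symbol])

    @lru_cache(maxsize=None)
    def matches(symbols, i, j):
        # Can this sequence of grammar symbols generate text[i:j]?
        if not symbols:
            return i == j
        head, rest = symbols[0], symbols[1:]
        if head not in grammar:  # terminal: must match one character
            return i < j and text[i] == head and matches(rest, i + 1, j)
        # Nonterminal: try every split point between head and the rest.
        return any(derives(head, i, k) and matches(rest, k, j)
                   for k in range(i, j + 1))

    return derives


def accepts(grammar, start, text):
    """True if the whole string satisfies the grammar."""
    return make_recognizer(grammar, text)(start, 0, len(text))


def scan(grammar, start, text, min_len, max_len):
    """Return (begin, end) spans of substrings accepted by the grammar."""
    derives = make_recognizer(grammar, text)
    hits = []
    for i in range(len(text)):
        for j in range(i + min_len, min(len(text), i + max_len) + 1):
            if derives(start, i, j):
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(accepts(GRAMMAR, "STEM", "gcaaagc"))
    print(scan(GRAMMAR, "STEM", "ttgcaaagctt", 7, 7))
```

Because the grammar is passed in as data, the same recognizer can be reused for other patterns by changing only the rules, which is the separation between grammar and parser that the paragraph above describes.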

We have studied DNA sequences from the perspectives of both formal language theory and practical pattern recognition using linguistic tools. On the formal side, we have presented a number of results concerning the mathematical linguistic "complexity" of the language of DNA (e.g., its position on the Chomsky hierarchy of languages) and the relationship between syntactic structure and secondary structure. We have also defined and characterized a novel grammar formalism, called String Variable Grammar, that is particularly well suited to the representational needs of DNA. The practical side entails the development and use of a syntactic pattern recognition system for DNA sequences, called GenLang, that takes advantage of structural and/or hierarchical aspects of a domain by using rule-based methods to describe and discriminate such structures. The GenLang system has been used successfully to specify and search for tRNA genes, group I introns, and, most recently, protein-encoding genes, achieving results comparable to those of other, procedural systems.
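The link between syntactic structure and secondary structure can be made concrete with a small Python sketch. It is an assumption-laden illustration only: it is neither the String Variable Grammar formalism nor GenLang, and the function names and parameters (find_stem_loops, stem_len, min_loop, max_loop) are hypothetical. The idea shown is that a variable is bound to a substring and must reappear later as its reverse complement, which is the dependency that lets a strand fold into a stem-loop.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq):
    """Return the reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_stem_loops(text, stem_len, min_loop, max_loop):
    """Find substrings of the form X, loop, reverse_complement(X).

    Returns tuples (start, stem, loop, end), where text[start:end] is the
    whole stem-loop and stem is the string bound to the variable X.
    """
    hits = []
    for i in range(len(text) - stem_len + 1):
        stem = text[i:i + stem_len]          # bind the variable X
        partner = reverse_complement(stem)   # what X must pair with
        for loop_len in range(min_loop, max_loop + 1):
            k = i + stem_len + loop_len      # where the partner must start
            if text[k:k + stem_len] == partner:
                hits.append((i, stem, text[i + stem_len:k], k + stem_len))
    return hits


if __name__ == "__main__":
    for hit in find_stem_loops("ttgcaaagctt", 2, 3, 4):
        print(hit)
```

The stem length is fixed here to keep the sketch short; a grammar-based description would allow stems of any length without changing the search code.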

This work was funded by the DOE Genome Program (DE-FG02-92ER61371).

[1] Searls, D.B. (1988) "Representing Genetic Information with Formal Grammars" Proceedings of the Seventh National Conference of the American Association for Artificial Intelligence, AAAI/Morgan Kaufmann, pp. 386-391.
[2] Searls, D.B. (1989) "Investigating the Linguistics of DNA with Definite Clause Grammars" In Logic Programming: Proceedings of the North American Conference (E. Lusk and R. Overbeek, eds.), MIT Press, pp. 189-208.
[3] Searls, D.B. and Noordewier, M.O. (1991) "Pattern-Matching Search of DNA Sequences Using Logic Grammars" Proceedings of the Seventh Annual Conference on Artificial Intelligence Applications, IEEE Computer Society, pp. 3-10.
[4] Searls, D.B. (1992) "The Linguistics of DNA" American Scientist 80: 579-591.
[5] Searls, D.B. (1993) "The Computational Linguistics of Biological Sequences" In Artificial Intelligence and Molecular Biology (L. Hunter, ed.), AAAI Press, chapter 2, pp. 47-120.
[6] Searls, D.B. and Dong, S. (1993) "A Syntactic Pattern Recognition System for DNA Sequences" In Proceedings of the Second International Conference on Bioinformatics, Supercomputing, and Complex Genome Analysis (H.A. Lim, J. Fickett, C.R. Cantor, and R.J. Robbins, eds.). World Scientific Publishing Co., pp. 89-101.
[7] Dong, S. and Searls, D.B. "Gene Structure Prediction by Linguistic Methods" Genomics, in press.

Last modified: Wednesday, October 29, 2003

Base URL: www.ornl.gov/hgmis

Office of Science Site sponsored by the U.S. Department of Energy Office of Science, Office of Biological and Environmental Research, Human Genome Program