DOE Human Genome Program Contractor-Grantee Workshop IV

Santa Fe, New Mexico, November 13-17, 1994


The electronic form of this document may be cited in the following style:
Human Genome Program, U.S. Department of Energy, DOE Human Genome Program Contractor-Grantee Workshop IV, 1994.

Abstracts scanned from text submitted for November 1994 DOE Human Genome Program Contractor-Grantee Workshop. Inaccuracies have not been corrected.

A Syntactic Pattern Recognition System for DNA Sequences

David Searls
Department of Genetics, University of Pennsylvania School of Medicine, Room 475 CRB, 422 Curie Blvd., Philadelphia PA 19104-6145.

Formal language theory views languages as sets of strings over some alphabet, and specifies potentially infinite languages with concise sets of rules called grammars. Grammars are an exceptionally well-studied methodology, familiar to all computer scientists, for the description of complex, higher-order structures embodied in strings of symbols. Moreover, they can be given as input to general-purpose programs called parsers capable of determining whether a given string satisfies the rules of the grammar. Parser technology is also extensively developed, and has been applied as well to the problem of searching for complex patterns specified by grammars in large amounts of data, in a technique known as syntactic pattern recognition.
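As a minimal sketch of the idea that a grammar can be handed as data to a general-purpose program, the Python below defines a small context-free grammar for a DNA hairpin (a stem of complementary bases around a short loop) and a naive recognizer that accepts any grammar in the same form. The grammar, the names `make_recognizer`, `HAIRPIN`, and `is_hairpin`, and the loop length of 3 to 4 bases are all illustrative assumptions, not taken from the abstract. The recognizer only answers yes or no, and it assumes a grammar with no empty productions and no cycles of single-symbol rules.

```python
from functools import lru_cache

def make_recognizer(grammar, start):
    """Build a yes/no recognizer from a grammar given as data.

    grammar maps each nonterminal to a list of productions; each production
    is a tuple of symbols. Any symbol that is not a key is a terminal string.
    Assumes no empty productions and no cycles of unit productions.
    """
    def recognizes(s):
        @lru_cache(maxsize=None)
        def derives(symbol, i, j):
            # Does `symbol` derive exactly the substring s[i:j]?
            if symbol not in grammar:
                return s[i:j] == symbol
            return any(matches(rhs, i, j) for rhs in grammar[symbol])

        @lru_cache(maxsize=None)
        def matches(rhs, i, j):
            # Does the symbol sequence `rhs` derive exactly s[i:j]?
            if not rhs:
                return i == j
            head, rest = rhs[0], rhs[1:]
            if not rest:
                return derives(head, i, j)
            # Every symbol must cover at least one character.
            return any(
                derives(head, i, k) and matches(rest, k, j)
                for k in range(i + 1, j - len(rest) + 1)
            )

        return derives(start, 0, len(s))
    return recognizes

# Productions that wrap a stem in one complementary base pair.
PAIRED = [(x, "Stem", y) for x, y in ("at", "ta", "cg", "gc")]

# Hypothetical hairpin grammar: at least one base pair around a 3-4 base loop.
HAIRPIN = {
    "Hairpin": PAIRED,
    "Stem": PAIRED + [("Loop",)],
    "Loop": [("N", "N", "N"), ("N", "N", "N", "N")],
    "N": [("a",), ("c",), ("g",), ("t",)],
}

is_hairpin = make_recognizer(HAIRPIN, "Hairpin")
print(is_hairpin("gggaaaaccc"))  # stem ggg/ccc around loop aaaa
print(is_hairpin("gggaaaaggg"))  # ends are not complementary
```

Because the grammar is ordinary data, the same `make_recognizer` function can be reused with a different rule set without changing the recognizer itself, which is the separation between grammar and parser that the paragraph above describes.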

We have studied DNA sequences from the perspectives of both formal language theory and practical pattern recognition tasks using linguistic tools. On the formal side, we have presented a number of results concerning the mathematical linguistic "complexity" of the language of DNA (e.g., its position on the Chomsky hierarchy of languages) and the relationship between syntactic structure and secondary structure. We have also defined and characterized a novel grammar formalism, called String Variable Grammar, that is particularly well-suited to the representational needs of DNA. The practical side entails the development and use of a syntactic pattern recognition system for DNA sequences, called GenLang, that takes advantage of structural and/or hierarchical aspects of a domain by using rule-based methods to describe and discriminate such structures. The GenLang system has been used successfully to specify and search for tRNA genes, group I introns, and, most recently, protein-encoding genes, achieving results comparable to those of other, procedural systems.
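To give a rough feel for pattern search with a variable that stands for a substring, the Python sketch below scans a sequence for a unit X followed, after a bounded gap, by either X again (a direct repeat) or the reverse complement of X (an inverted repeat). This is only an illustration under stated assumptions: it is not GenLang, it does not use the formal definition of String Variable Grammar, and the function names, parameters, and fixed unit length are all hypothetical.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def reverse_complement(s):
    """Reverse complement of a lowercase DNA string; unknown bases become 'n'."""
    return "".join(COMPLEMENT.get(base, "n") for base in reversed(s))

def find_repeats(seq, unit_len, min_gap, max_gap, inverted=False):
    """Return (start, end, unit) for each match of the pattern X gap X,
    or X gap ~X when inverted=True, where ~X is the reverse complement.

    X is "bound" to the substring at each start position and must then
    reappear after a gap of min_gap..max_gap bases.
    """
    seq = seq.lower()
    hits = []
    for start in range(len(seq) - 2 * unit_len - min_gap + 1):
        unit = seq[start:start + unit_len]           # bind the variable X
        target = reverse_complement(unit) if inverted else unit
        for gap in range(min_gap, max_gap + 1):
            second = start + unit_len + gap          # where X must recur
            if second + unit_len > len(seq):
                break
            if seq[second:second + unit_len] == target:
                hits.append((start, second + unit_len, unit))
    return hits

# Hypothetical inputs: "acg", a 2-base gap, then "acg" or its reverse complement "cgt".
print(find_repeats("acgttacg", unit_len=3, min_gap=2, max_gap=2))
print(find_repeats("acgttcgt", unit_len=3, min_gap=2, max_gap=2, inverted=True))
```

The direct-repeat case is the interesting one from the formal side: requiring an arbitrary substring to recur in the same orientation is a copy-style dependency that context-free rules alone cannot express, whereas the inverted case corresponds to the nested pairing seen in stem-loop structures.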

This work was funded by the DOE Genome Program (DE-FG02-92ER61371).

[1] Searls, D.B. (1988) "Representing Genetic Information with Formal Grammars" Proceedings of the Seventh National Conference of the American Association for Artificial Intelligence, AAAI/Morgan Kaufmann, pp. 386-391.
[2] Searls, D.B. (1989) "Investigating the Linguistics of DNA with Definite Clause Grammars" In Logic Programming: Proceedings of the North American Conference (E. Lusk and R. Overbeek, eds.), MIT Press, pp. 189-208.
[3] Searls, D.B. and Noordewier, M.O. (1991) "Pattern-Matching Search of DNA Sequences Using Logic Grammars" Proceedings of the Seventh Annual Conference on Artificial Intelligence Applications, IEEE Computer Society, pp. 3-10.
[4] Searls, D.B. (1992) "The Linguistics of DNA" American Scientist 80: 579-591.
[5] Searls, D.B. (1993) "The Computational Linguistics of Biological Sequences" In Artificial Intelligence and Molecular Biology (L. Hunter, ed.), AAAI Press, chapter 2, pp. 47-120.
[6] Searls, D.B. and Dong, S. (1993) "A Syntactic Pattern Recognition System for DNA Sequences" In Proceedings of the Second International Conference on Bioinformatics, Supercomputing, and Complex Genome Analysis (H.A. Lim, J. Fickett, C.R. Cantor, and R.J. Robbins, eds.). World Scientific Publishing Co., pp. 89-101.
[7] Dong, S. and Searls, D.B. "Gene Structure Prediction by Linguistic Methods" Genomics, in press.


Last modified: Wednesday, October 29, 2003
