Languages, Automata, Interfaces, and Macromolecules*

David B. Searls

SmithKline Beecham Pharmaceuticals, 709 Swedeland Road, PO Box 1539, King of Prussia, PA 19406

Viewed as strings of symbols, biological macromolecules can be modelled as elements of formal languages. Generative grammars have been useful in molecular biology for purposes of syntactic pattern recognition, for example in the author's work on the GenLang pattern matching system, which is able to describe and detect patterns that are probably beyond the capability of a regular expression specification. More recently, grammars have been used to capture intramolecular interactions or long-distance dependencies between residues, such as those arising in folded structures. In the work of Haussler and colleagues, for example, stochastic context-free grammars have been used as a framework for "learning" folded RNA structures such as tRNAs, capturing both primary sequence information and secondary structural covariation. Such advances make the study of the formal status of the language of biological macromolecules highly relevant, and in particular the finding that DNA is beyond context-free has already created challenges in algorithm design.

Moreover, to date such methods have not been able to capture relationships between strings in a collection, such as those that arise via intermolecular interactions, or evolutionary relationships implicit in alignments. Recently we have attempted to remedy this by showing (1) how formal grammars can be extended to describe interacting collections of molecules, such as hybridization products and, potentially, multimeric or physiological protein interactions, and (2) how simple automata can be used to model evolutionary relationships in such a way that complex model-based alignment algorithms can be automatically generated by means of visual programming. These results allow for a useful generalization of the language-theoretic methods now applied to single molecules.

In addition, we describe a new software package for genome application development, called bioTk. This is a domain-specific widget set, implemented in Tcl/Tk, for the rapid prototyping of sophisticated graphical user interfaces involving objects such as chromosomes, maps, and sequences.

*Supported by a grant from the Department of Energy, 92ER61371.

D.B. Searls, "String Variable Grammar: A Logic Grammar Formalism for DNA Sequences", Journal of Logic Programming 24 (1,2):73-102 (1995).

D.B. Searls, "Formal Grammars for Intermolecular Structure", First International Symposium on Intelligence in Neural and Biological Systems, 30-37 (1995).

D.B. Searls and K.P. Murphy, "Automata-Theoretic Models of Mutationa and Alignment", Third International Conference on Intelligent Systems for Molecular Biology, 341-349 ( 1995).

D.B. Searls, "bioTk: Componentry for Genome Informatics Graphical User Interfaces", Gene 163 (2):GC1-16 (1995).


Abstracts scanned from text submitted for January 1996 DOE Human Genome Program Contractor-Grantee Workshop.

Return to Table of Contents