W. Fan, P. Buneman, S.B. Davidson and C. Overton
Dept. of Computer and Information Science & Dept. of Genetics, University of Pennsylvania, Philadelphia, PA 19104 Email: {peter, susan, coverton, wenfei} @cis.upenn.edu
Much data of interest to biologists exists as text in flat files rather than in database management systems (DBMSs). The reasons for this are numerous, ranging from economic considerations (the expense of a general purpose DBMS), to the numerous special purpose and often quite sophisticated applications that have been built around data stored in these formats (e.g. ACE), to the ubiquity of the format as an exchange format (e.g. ASN.1). Owners of such flat file data sources are therefore frequently reluctant to migrate their data to DBMSs and give up the flat file format. On the other hand, DBMSs offer many desirable features that are not found in general text systems, such as indexing, integrity checking concurrency control and general purpose query languages against which optimizations can be performed. To take advantage of DBMS features while maintaining data in its original format, one solution is to map the file format into a database format and provide a database view for the data in files. In this way, the data can be queried and updated using database languages, and integrity constraints on the data can be maintained using database facilities.
We have therefore developed a framework that, using an extension of Definite Clause Grammars (DCG), translates data stored in text files structured by a syntactic grammar into a database, and converts data from the database back to the files. Our framework is more general than existing mapping mechanisms between files and databases[2,3] in the sense that it allows the grammars specifying files to be beyond context free grammars (CFGs) - in fact, DCG grammars have Turing machine power. Our approach also facilitates general integrity constraints checking on the elements recognized while parsing files. Such features are important since files often cannot be described by CFGs and they encode many integrity constraints. As an example, GenBank cannot be described by a CFG. Furthermore, many constraints are encoded in GenBank accession numbers: they function as keys for entries, and must therefore be unique; they also function as references to other entries when used in non-key fields within an entries (foreign keys). The family of constraints that can be encoded using DCGs is quite general, including but not limited to those commonly found in database systems.
The use of DCGs within our framework also promise new optimization techniques. When mapping files to databases, the size of databases created is a major efficiency concern; ideally, it should be possible to generate the database image for only the data in which the users are interested. Using DCGs, this can be done by describing conditions under which the parser can ignore portions of the input parse, only generating a database image for the data satisfying the conditions. The data which does not satisfy the conditions can simply be skipped in a do-not- care manner. For instance, a GenBank file can consist of several hundred thousand entries. The user can choose to parse N entries only, or to create database image for only the entries satisfying certain conditions. This conditional (partial) parsing technique improves parsing performance and enables users to check constraints without generating database objects.
DCGs can be directly implemented in Prolog, do not require building up special parsers, and thus can be rapidly prototyped. We have also been able to develop a simple technique for "reverse" transformations. That is, given a DCG mapping from files to databases which satisfies certain constraints, another DCG which encodes data from the databases back to the files can be produced which is guaranteed to respect the original grammar.
[1] This research was supported by a grant from the Director, Health Effects and Life Science Research Division, Office of Health and Environmental Research of the U.S. Department of Energy under contract DOE DE-FG02-94-ER-61923 Sub 1.
[2] S. Abiteboul et al., "Querying and updating the file", Proceedings of VLDB'93, 73-84 (1993).
[3] G.H. Gonnet and F. W. Tompa, "Mind your grammar: a new approach to modeling text", Proceedings of VLDB '87, 339-346, Brighton (1987).