David S. Greenberg, Cynthia A. Phillips, and David Wilson[1]
Sandia National Laboratories, Mail Stop 1110, P.O. Box 5800, Albuquerque, NM 87185-1110 and Massachusetts Institute of Technology[1]
There have been many studies which investigate algorithms for reconstructing physical maps from hybridization data between clone libraries and sets of genomic probes [6, 8, 4, 3, 1]. All of the various formulations of the problem of finding a complete physical map have been shown to be NP-complete and thus a variety of heuristic and/or approximate solutions have been proposed. Interestingly, the analyses of these approaches (whether by theoretical bounds, evaluation against known maps, or evaluation against simulated data) have given mixed results. Some studies imply that excellent maps are possible while others show that good maps are unlikely.
In [5] an attempt was made to formalize the model of the problem so that everyone could agree on the conditions involved. In this work we use the formalization to examine the expected quality of the maps. We start by looking at the simple case in which the hybridization data is error free - that is, each clone represents a single contiguous section of genome and all hybridizations are faithfully recorded.
Surprisingly to us, there is still are large amount of inherent ambiguity in maps created from such data. As a measure of ambiguity we asked what percentage of unique probes (STSs for example) which are actually adjacent on the genome could have been non-adjacent on a genome for which the hybridization data remained unchanged. We dubbed these adjacencies, weak, since no algorithm can be sure they are adjacent based only on the hybridization data.
We found that three factors effect the amount of weakness in the result. the coverage of the clones, the length of the clones, and the statistical manner in which the clones and probes were chosen. The effects of coverage have been previously reported in studies such as [7, 2]. Our new results are that clone length has a non-linear effect on the results. Once a critical length (between 3 and 5 depending on the manner in which clones are chosen) is reached the amount of weakness reaches an asymptote. On the other hand, very small expected clone lengths result in very high weakness. We evaluated several models of clone and probe generation. The method of clone generation varied between constant size. size bounded in a range, and Poisson distributed sizes. While expected length was the most important factor, allowing the sizes to be Poisson distributed increased the expected weakness somewhat. On the other hand. the method of probe generation was varied from uniform gaps to Poisson distributed gaps. The use of Poisson distributed gaps led to significantly greater weakness.
We conclude that the reason for the varied results of analyses of algorithms for physical mapping is that the method in which clones and probes are generated varies greatly. We suggest that more careful analysis in which the ambiguity of the result is compared against our theoretical result will yield more reliable judgements about algorithms.
*This work was supported by the U.S. Department of Energy and was performed at Sandia National Laboratories for the U.S. Department of Energy under contract DE-AC04-94AL85000.
References
[1] F. Alizadeh, R. Karp, L. Newberg, and D. K. Weiser. Physical mapping of chromosomes: a combinatorial problem in molecular biology. In Proceedings of the 4th Annual ACM-SIAM SODA, pages 371-381, 1993.
[2] R. Arratia, E. S. Lander, S. Tavare, and M. S. Waterman. Genomic mapping by anchoring random clones: A mathematical analysis. Genomics, 11:806-827, 1991.
[3] A. Cuticchia, J. Arnold, and W. Timberlake. The use of simulated annealing in chromosome reconstruction experiments based on binary scoring. Genetics, 132:591-601,1992
[4] A. Cuticchia. J. Arnold, and W. Timberlake. ODS: ordering DNA sequences algorithm based on simulated annealing. CABIOS, 9(2):215-219, 1993.
[5] D. Greenberg and S. Istrail. Physical mapping by STS hybridization: Algorithmic strategies and the challenge of software evaluation. JCB, 2(2):219-274, 1995.
[6] A. Grigoriev, R. Mott, and H. Lehrach. An algorithm to detect chimeric clones and random noise in genomic mapping. Genomics, 22:282-486, 1994.
[7] E. S. Lander and M. S. Waterman. Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics. 2(3):231-329, 1988.
[8] R. Mott, et al. Algorithms and software tools for ordering clone libraries: application to the mapping of the genome schizosaccharomyces pombe. Nucleic Acid Research, 21(8):1965-1974, 1993.