DOE Human Genome Program Contractor-Grantee
144. JASON Study on Data Mining and the Human Genome
G. Joyce, H. Abarbanel, C. Callan, W. Dally, F. Dyson, T. Hwa, S. Koonin, H. Levine, O. Rothaus, R. Schwitters, C. Stubbs, and P. Weinberger
JASON Program Office, McLean, VA
The JASON organization conducted a DOE-sponsored study on bioinformatics and the Human Genome Project. The study sought to explore the problems facing bioinformatics and to identify information technologies that could help overcome them. Although the current influx of data greatly exceeds what biologists have experienced in the past, other scientific disciplines and the commercial sector have been handling much larger datasets for many years. Powerful data-mining techniques developed in other fields could, with appropriate modification, be applied to the biological sciences.
Clearly, there is a need for more bioinformaticists, as well as for computer scientists and engineers willing to become involved in bioinformatics research. An ample talent pool already exists from which to recruit them. The DOE can facilitate cross-fertilization between biologists and the non-biological data-mining community by sponsoring joint workshops, offering research fellowships to computer scientists interested in biological applications, providing access to the unclassified resources of the Accelerated Strategic Computing Initiative, and taking advantage of the commercial sector's willingness to make data-mining tools freely available to the academic community.
Greater emphasis must be placed on closing the loop between algorithmic analysis and experimental validation. This will require close cooperation between computer scientists and biologists. The DOE should support the development of experimental methods for validating bioinformatics algorithms and the establishment of statistical tests that can be used to assess the robustness of these algorithms. The DOE should take responsibility for ensuring the provenance of the primary data from the major sequencing centers and making that data freely available in a generic database format with minimal annotation.