Hopper: A Prototype for Data Flow in Large-Scale DNA Sequencing

Todd M. Smith[1], Chris Abajian, Leroy Hood

Department of Molecular Biotechnology, University of Washington, Seattle, WA 98195

With the advent of fluorescence-based DNA sequencing, it has become possible to consider the analysis of entire genomes as a first step in the biological study of an organism. A primary challenge in these projects is the scale at which the data handling must be done. For example, at least 10^7 sequencing tracts will have to be obtained to completely determine the nucleotide sequence of the human genome. Hence, large-scale sequencing facilities will benefit from systematically tracking template DNA information (purification methods, reaction conditions, and electrophoresis conditions). A lack of software tools that support automated sample entry, however, is a major hindrance to recording these parameters. For example, experimental information can be added to the comment field in the ABI sample sheet, and subsequently to the chromatogram file, but this information must be entered by hand. We have overcome this problem with a sample sheet generator, written in Tcl/Tk (a scripting language with a graphical toolkit), that uses the ABI file format. It is used to facilitate data flow in our production operation.
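As an illustration of recording template information in a single comment field, the following Python sketch packs key/value pairs into one string and recovers them again. The "key=value;key=value" layout, the function names, and the example keys are assumptions made for this sketch; they are not the ABI sample sheet format and not the format used by the generator described above.

```python
# Hypothetical sketch: the "key=value;key=value" layout is an assumption,
# not the ABI sample sheet format or the authors' actual encoding.

def build_comment(metadata):
    """Pack a dict of template information into one comment string."""
    parts = []
    for key, value in metadata.items():
        key_text = str(key)
        value_text = str(value)
        # Refuse delimiters inside keys or values so parsing stays unambiguous.
        for text in (key_text, value_text):
            if ";" in text or "=" in text:
                raise ValueError("';' and '=' are reserved: %r" % text)
        parts.append(key_text + "=" + value_text)
    return ";".join(parts)


def parse_comment(comment):
    """Recover the dict from a string produced by build_comment."""
    if not comment:
        return {}
    return dict(item.split("=", 1) for item in comment.split(";"))


if __name__ == "__main__":
    # Placeholder keys and values, for illustration only.
    info = {"purification": "method_A", "reaction": "kit_1", "gel": "run_7"}
    comment = build_comment(info)
    print(comment)
    print(parse_comment(comment) == info)
```

Generating the string programmatically, rather than typing it for each sample, is what lets the same information follow the sample into downstream files without manual re-entry.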

The UNIX file system has been used to prototype the automated flow of data from the ABI sequencer to a data repository. Data transfers between an Apple Macintosh (the collection device for the ABI sequencer) and a UNIX workstation are accomplished by FTP (File Transfer Protocol) using a Macintosh program, Fetch. Once transferred, the data are automatically processed by a central Perl program, Hopper. Hopper runs a series of programs that provide first-level analyses of data quality (read length estimate, fraction of indeterminate bases, and number of contaminating and repetitive sequences) and generates simple reports describing the results. Hopper also automates DNA sequence data assembly using the PHRED[2] basecalling and PHRAP[2] assembly programs. Using the combination of PHRED and PHRAP, cosmids from shotgun sequencing projects (containing up to 40% Alu repetitive DNA) have been successfully assembled without manual intervention, as have BAC-derived contigs over 100 kb in length.
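Two of the first-level quality checks named above, the fraction of indeterminate bases and a read length estimate, can be sketched in a few lines. This Python sketch is not Hopper's Perl code. The function names, the 20-base window, and the threshold of two indeterminate bases per window are all assumptions chosen for illustration.

```python
# Hypothetical sketch of two first-level quality metrics; the window size and
# threshold are illustrative assumptions, not Hopper's actual parameters.

def fraction_indeterminate(seq):
    """Fraction of bases that are not an unambiguous A, C, G, or T."""
    if not seq:
        return 0.0
    called = sum(1 for base in seq.upper() if base in "ACGT")
    return 1.0 - called / len(seq)


def usable_read_length(seq, window=20, max_n=2):
    """Crude read length estimate.

    Slide a window along the read; at the first window holding more than
    max_n indeterminate bases, return the position of the first
    indeterminate base in that window. A read with no such window is
    considered usable over its full length.
    """
    s = seq.upper()
    for start in range(0, max(len(s) - window + 1, 1)):
        chunk = s[start:start + window]
        bad = [i for i, base in enumerate(chunk) if base not in "ACGT"]
        if len(bad) > max_n:
            return start + bad[0]
    return len(s)


def report(name, seq):
    """One tab-separated report line per read."""
    return "%s\tlength=%d\tusable=%d\tfrac_N=%.3f" % (
        name, len(seq), usable_read_length(seq), fraction_indeterminate(seq))


if __name__ == "__main__":
    read = "ACGT" * 10 + "N" * 20  # synthetic read, not real data
    print(report("example_read", read))
```

A windowed estimate tolerates an isolated indeterminate base early in a read while still cutting the read where base calls degrade, which is the usual failure pattern toward the end of a sequencing tract.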

Supported by a grant from the Director, Office of Energy Research, Office of Health and Environmental Research of the U.S. Department of Energy under contract

[1] DOE Human Genome Distinguished Postdoctoral Fellow.

[2] Phil Green, unpublished work.


Abstracts scanned from text submitted for January 1996 DOE Human Genome Program Contractor-Grantee Workshop.
