abstracts from the
Stephanie L. Chissoe and the Genome Sequencing Center
Large-scale genomic sequencing is ongoing at the Washington University Genome Sequencing Center, focusing on human and C. elegans DNA. The C. elegans project is in its final year with a total of 71 Mb of finished sequence (in collaboration with The Sanger Centre, Hinxton, UK). The human project has been maturing since funding began last year, and we now have 15 Mb of finished human genomic sequence. The goals for each project are similar: to achieve contiguous, base-perfect sequence. For C. elegans, a clone-based sequence-ready map has provided the majority of sequencing substrates, including cosmids and YACs. A minimal tiling path of cosmids was chosen, although recently we have selected YACs for direct sequencing. Additionally, direct probing of a C. elegans fosmid library has provided bacterial clones for regions previously spanned only by YACs, facilitating sequence contiguity. We have incorporated an up-front, sequence-ready mapping paradigm for human genomic DNA. Chromosome 7 STS information is being translated into sequence-ready bacterial contigs. As the sequencing progresses, we are actively working to close gaps in the bacterial clone map by re-screening the large insert libraries with probes based on end-sequence data, additional markers, or overlapping YACs. Candidate clones are chosen for sequencing after evaluating the fingerprint data to verify clone fidelity and overlap through a region.
Sequencing is performed using a mixed shotgun strategy, which includes initial sequencing of random subclones followed by directed approaches for gap closure and ambiguity resolution. The quality of the assembly and of the sequence editing is monitored by reassembly and comparison to the finished sequence. Additionally, the restriction digest fragment sizes are compared to those predicted from the finished sequence. During analysis and annotation of the finished sequence, potential exons are identified by similarity to EST data and known protein sequences, and by gene prediction programs.
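The restriction digest check described above can be illustrated with a minimal sketch. It assumes a linear sequence, a complete EcoRI digest (G^AATTC), and a 5% gel sizing tolerance; the function names and the tolerance are illustrative assumptions, not the Center's actual software or thresholds.

```python
def digest_fragment_sizes(sequence, site="GAATTC", cut_offset=1):
    """Return sorted fragment lengths from a complete digest of a linear sequence.

    cut_offset=1 models EcoRI, which cuts after the first base of its site.
    """
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return sorted(b - a for a, b in zip(boundaries, boundaries[1:]))


def fragments_agree(predicted, observed, tolerance=0.05):
    """True if both digests have the same fragment count and each sorted pair
    of sizes differs by no more than the (assumed) relative tolerance."""
    if len(predicted) != len(observed):
        return False
    pairs = zip(sorted(predicted), sorted(observed))
    return all(abs(p - o) <= tolerance * p for p, o in pairs)
```

A disagreement in fragment count or size would flag a possible misassembly or a rearranged clone for closer inspection.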
An overview of the projects will be presented, addressing our strategies, progress, and methods for quality control at each stage of the process. Additionally, recent software tools and technology developments (see abstract by E. Mardis) which have increased our throughput of high quality data will be included.
Glen A. Evans, Maria Athanasiou, Lisa Hahner, Sherri Osborne-Lawrence, Terry
Franklin, Nina Federova, Cynthia English, Shelly Hinson-Cooper, Joel Dunn, James McFarland,
Juan Davie, Travis Ward, Paul Card, Parul Patel, Margaret Gordon, Jackie Newton, Danny
Valenzuela, Jeff Schageman, Jeff Harris, Garrett Gotway, Mathenna Syed, Ken Kufer, Peter
Schilling, Vanetta Gee, Mujeeb Basit, Stafford Brignac, Odell Grant, Ron Burmeister, Kevin
O'Brien, Skip Garner, and Roger Schultz
In order to complete the DNA sequence of the human genome, collaboration of a group of high throughput sequencing centers will be necessary. The UTSW Genome Sequencing Center has as its goals: 1) development of 100 Mb/year sequencing capacity, 2) complete sequencing of human chromosomes 11 and 15 and 3) developing exportable technology and robotics to support high throughput genomic sequencing. Aspects of this project include high throughput mapping to generate sequence-ready PAC and BAC template sets, pilot projects to sequence megabase-sized contigs on chromosome 11p15.5, 11p14, 11q23, 15q26 and 15q11, and technology development projects in robotics and informatics. The strategy for integrated map construction, sequencing and data annotation involves assembly of a precise sequence-ready map using STSs, ESTs, regionally mapped genes and other probes screened against a 10X total human BAC library which meets ethical and legal standards of informed consent and anonymity. BAC clones are isolated by array hybridization with radiolabeled pooled oligonucleotides, their identity is confirmed by PCR with specific markers and by FISH, and the clones are fingerprinted. BAC/PAC end-sequencing is widely utilized for gap filling, contig establishment and sequence annotation. BAC clones are sequenced by M13 shotgun sequencing and fragments assembled into initial contigs using PHRED and PHRAP. Gap filling and quality improvement to an estimated sequence accuracy of 99.99% are accomplished by oligonucleotide-directed sequencing from BAC and PAC templates. Sequence processing, assembly, quality control and annotation are implemented through informatics tools developed in the center. The effort is augmented by the robotics and instrumentation developed in the center including MERMADE high throughput oligonucleotide synthesizers, the Sequencing Support System, a 3 m Sagian rail robot developed for automated sequencing, and a high capacity DNA sequencer under development.
The center has produced 4.2 Mb of DNA sequence. Analysis and annotation of 750 kb of contiguous sequence for the 11p15.5 region have been completed, with an anticipated 2 Mb of continuous sequence by the end of 1997.
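As a small worked example, the 99.99% accuracy target can be expressed on the Phred quality scale, where Q = -10 log10(P) and P is the per-base error probability. The abstract does not state how the center estimates accuracy, so the sketch below only shows the standard conversion; the function names are illustrative.

```python
import math


def accuracy_to_phred(accuracy):
    """Convert a per-base accuracy (e.g. 0.9999) to a Phred-scaled quality."""
    error_probability = 1.0 - accuracy
    return -10.0 * math.log10(error_probability)


def phred_to_error_probability(q):
    """Convert a Phred-scaled quality back to a per-base error probability."""
    return 10 ** (-q / 10.0)
```

Under this conversion, 99.99% accuracy corresponds to a quality of about 40, that is, roughly one error per 10,000 bases.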
Jane Lamerdin, Aaron Adamson, Karolyn
Burkhart-Schultz, Linda Danganan, Jeff Garnes,
Ami Kyle, Melissa Ramirez, Stephanie Stilwagen,
Glenda Quan, Pat Poundstone, Robert Bruce, Evan
Skowronski, Arthur Kobayashi, David Ow,
Anthony V. Carrano, and Paula McCready
Chromosome 19 is the most GC-rich of the human chromosomes as determined by flow cytometry. It is predicted to be very gene rich, with an estimated 2000 genes contained within ~60 Mb of euchromatin. A high resolution physical map of human chromosome 19, constructed largely in bacterial-based clones, serves as a resource for targeted genomic sequencing in regions of high biological interest. One interesting feature of chromosome 19 is the high density of clustered gene families such as zinc finger genes (ZNFs), olfactory receptors (OLFRs) and cytochrome P-450s (CYPs). In order to understand their evolution and subsequent functional diversification, several of these clusters are current sequencing targets. We are also interested in genes involved in DNA repair, and have performed genomic analyses of 6 such loci, many in both human and mouse. To date we have completed over 2 Mb of genomic sequence from chromosome 19 and other human and rodent targets, using a shotgun strategy. Our largest completed sequence contig is ~1 Mb and is located in 19q13.1, flanked by the genetic markers D19S208 and COX7A1. Preliminary analysis of this contig indicates a relative gene density of ~1.7 per 40 kb, and an average Alu density of 1.1 Alu/kb (~31%), which is comparable to other previously sequenced regions of chromosome 19. Of the 43 putative genes identified, two appear to be pseudogenes, 7 encode putative cell surface proteins (e.g. glycoproteins), and 14 are completely novel. Progress towards completion of an >800 kb contig in 19p12 and regions containing clustered OLFR and ZNF gene family members will also be presented.
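The density figures quoted for the 19q13.1 contig can be checked with a short calculation. The sketch assumes the contig is exactly 1,000 kb (the abstract says ~1 Mb) and that a typical Alu element is about 282 bp long; both values are assumptions made here for illustration.

```python
def genes_per_window(gene_count, region_kb, window_kb=40):
    """Average number of genes per window of window_kb across the region."""
    return gene_count / region_kb * window_kb


def repeat_fraction(copies_per_kb, copy_length_bp):
    """Fraction of sequence occupied by a repeat of the given density and length."""
    return copies_per_kb * copy_length_bp / 1000.0
```

With 43 putative genes in 1,000 kb this gives about 1.7 genes per 40 kb, and 1.1 Alu/kb at the assumed length gives about 31% Alu content, in line with the reported values.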
This work was performed by Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, Contract No. W-7405-Eng-48.
Mark Shannon, Linda Ashworth, Laurie Gordon,
Anne Olsen, and Lisa Stubbs
Genetic and physical mapping studies indicate that hundreds, if not thousands, of zinc finger (ZNF)-containing genes populate the human genome, and that many of these genes, including those with Kruppel-associated box (KRAB) motifs, are arranged in familial clusters. In a previous study, we identified a KRAB-containing ZNF gene family located near the XRCC1 gene in human chromosome 19q13.2 (H19q13.2). Preliminary characterization of this family indicated that the genes are arranged in a 'head-to-tail' tandem array with an average intergenic spacing of 20-30kb. Current estimates based upon physical mapping data suggest that this family is comprised of at least 15 members, which span a minimum of 600kb in 19q13.2. Restriction mapping studies, in combination with Southern blot analyses using ZNF consensus and KRAB sequence probes, have allowed a preliminary determination of this family's content and organization. In depth studies of the largest cosmid contig from the region have demonstrated that nine genes are present within 339kb, encompassing the proximal one-half to two-thirds of the gene family. A number of methods, including Southern blot analysis, PCR analysis and DNA sequencing, have been used to localize expressed sequences within an EcoRI restriction map (RMAP) of this region. Several sequences present in the Genbank database (ZNF45, ZNF155, and HZF4 as well as ESTs 429289, 20113, and 28165) have been localized to specific sites within the RMAP. The region also contains three previously unknown gene sequences. Sequence analysis of cDNA clones for eight of the genes indicates that while the KRAB A domains of the sibling genes are highly similar in structure, other portions, including the KRAB B and ZNF domains, are highly divergent in sequence. 
These observations suggest that this gene family may have evolved to encode a collection of transcription factors that bind to different target DNA sequences as a result of divergent ZNF arrays, while participating in common transcription control complexes due to their highly similar KRAB A domains. Interestingly, the genes are expressed widely in adult tissues and are co-expressed in many sites. However, tissue-specific variations in levels of transcript between the genes are also evident. Such overlapping, but not identical, expression patterns are consistent with the idea that, like coding sequences, the duplicated cis-acting regulatory regions of the sibling genes have diverged over evolutionary time as they acquired new biological functions. Comparative mapping studies have demonstrated that a homologous Xrcc1-linked gene family is present in the mouse genome. Preliminary studies indicate that three members of the murine gene cluster are orthologous to genes located in H19q13.2. Future studies will address whether the historically homologous mouse and human genes are subject to the same upstream regulation and whether they regulate the same downstream genes.
This work was supported by the U.S. Department of Energy under contract Nos. DE-AC05-96ORO224674 and W-7405-ENG-48.
D. C. Bruce, D. O. Ricke, J. L. Longmire, P. S.
White, J. M. Buckingham, L. A. Chasteen, D. L.
Robinson, M. D. Jones, A. C. Munk, J. D. Cohn,
A. L. Williams, M. O. Mundt, L. L. Deaven, and
N. A. Doggett
We have recently begun full genomic sequencing of a 3.0 Mb cosmid/P1 contig of the human chromosome region in 16p13.3 extending from the polycystic kidney disease 1 (PKD1) locus to the CREB binding protein (CREBBP) locus [responsible for Rubinstein-Taybi Syndrome and implicated in acute myeloid leukemias associated with translocations t(8;16)(p11;p13.3) and t(11;16)(q23;p13.3)]. This contig encompasses the recently cloned Familial Mediterranean Fever gene, and the syntenic breakpoint between mouse chromosomes 16 and 17. The average overlap between clones in the contig is 25%. Sample sequencing of this region has revealed that it is gene rich and G+C rich (>50% G+C), with the gene density approaching one gene/10 kb in some stretches. These observations are consistent with the cytogenetic designation of 16p13.3 as a G+C rich 'T' band (Dutrillaux, 1973; Holmquist, 1992). Our strategy for sequencing involves nebulization to randomly break DNA, size selection of 3 kb fragments, double adapter cloning into bluescript KS+ plasmid, and sequencing of both ends to 5 X random sequencing coverage. Assembly of sequence contigs is constrained by the inherent relationship of the end sequences being approximately 3 kb apart. Closure is achieved by a combination of primer walking, longer reads, and alternate chemistry reactions. To date we have sequenced to approximately 2 X coverage for the first Mb of this contig and approximately 1 X coverage for the remaining 2 Mb. Supported by the US DOE, OBER under contract W-7405-ENG-36.
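The paired-end constraint used during assembly can be sketched as a simple consistency check. This is an illustration only: the ReadPlacement record, the 25% tolerance, and the use of read start coordinates to approximate insert size are assumptions, not the assembly software used in this project.

```python
from dataclasses import dataclass


@dataclass
class ReadPlacement:
    clone: str     # plasmid subclone the end read came from
    start: int     # leftmost contig coordinate of the read
    forward: bool  # True if the read aligns to the forward strand


def check_mate_pairs(placements, expected=3000, tolerance=0.25):
    """Return the clones whose two end reads violate the ~3 kb constraint.

    A consistent pair has the left read on the forward strand, the right
    read on the reverse strand, and a separation close to `expected`.
    """
    by_clone = {}
    for placement in placements:
        by_clone.setdefault(placement.clone, []).append(placement)

    violations = set()
    for clone, reads in by_clone.items():
        if len(reads) != 2:
            continue  # only clones with both ends placed can be checked
        left, right = sorted(reads, key=lambda r: r.start)
        separation = right.start - left.start
        ok_orientation = left.forward and not right.forward
        ok_distance = abs(separation - expected) <= tolerance * expected
        if not (ok_orientation and ok_distance):
            violations.add(clone)
    return violations
```

Clones reported by such a check would point to candidate misjoins, or to pairs that bridge two separate contigs.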
Robert K. Moyzis, Han-Chang Chi, and Deborah
The Human Genome Project is undergoing a rapid transition from an emphasis on generating physical maps to the large-scale finished sequencing of human DNA. Current technology will allow a large fraction of human DNA to be sequenced in the next 5-10 years by highly automated, high-throughput sequencing "factories". A significant fraction of the human genome, however, will be difficult to sequence to completion by such "factory" approaches. These are regions that: 1) contain a high percentage of repetitive DNA sequences, 2) contain internal tandem duplications, including multigene families, and/or 3) are unstable in all current sequencing vectors. This would be irrelevant if such regions were rare, or contained little of intrinsic informational value. Such is not the case. The first five years of the Human Genome Project mapping efforts have indicated that such regions represent approximately 20% of human DNA. This includes such critical regions as centromeres and telomeres, as well as a greater abundance of low-copy repeats and multigene families than previously anticipated. Producing quality DNA sequence of these regions, which faithfully represent genomic DNA, will be a continuing challenge.
We propose that a large-scale, yet distributed "boutique" approach to mapping and sequencing such regions is warranted, where individual laboratories specialize in genomic regions they have special expertise in investigating. Such efforts would complement and integrate with the few truly large-scale sequencing centers that are likely to evolve during the next few years, such as the proposed DOE Joint Genome Institute Sequencing Center. Our initial "boutique" target is telomeric regions, which exhibit high levels of repetitive DNA composition, cloning instability, and population heterogeneity. Numerous investigations have implicated genes near telomeres as likely targets for alterations during aging and cancer progression. Through the efforts of a number of laboratories, most notably Dr. Harold Riethman's at the Wistar Institute, nearly all human telomeres have now been cloned by functional complementation in yeast. My laboratory has finished the 0.23 Mb 7q telomere sequence (Chi et al., this meeting), the first RARE cleavage confirmed human telomere region to be sequenced directly up to the terminal (TTAGGG)n repeat. An important QC/QA aspect of this project was the extensive confirmation of the sequence against genomic DNA by PCR-sequencing (White et al., this meeting). Greater than 3 Mb of confirmed telomeres are now available for sequencing, with another 6-12 Mb currently being confirmed. These represent 43 of the 46 unique human chromosome ends. The finished and annotated sequence of these clones will "cap" the world-wide genome sequencing effort, and identify numerous important genes and polymorphic markers.
Han-Chang Chi1,2, Elizabeth H. Saunders1, Judy
M. Buckingham1, Darrell O. Ricke1, A. Christine
Munk1, Rebecca Lobb1, Samantha Y.-J. Ueng1,
Mark O. Mundt1, P. Scott White1, Owatha L.
Tatum1, and Robert K. Moyzis1,2
225,432 bp of DNA immediately internal to the (TTAGGG)n telomere repeat of human chromosome 7q was sequenced. The telomeric end of chromosome 7q, unlike most human chromosomes, contains only a small amount (<1.5 kb in total) of subtelomeric repetitive DNA. The lack of subtelomeric repeats, therefore, suggested that this region would likely contain subtelomeric genes whose expression would be affected by telomere alterations accompanying aging and cancer progression (Moyzis et al., this meeting). Nine overlapping cosmids and two PCR products obtained from the 7q telomere YAC clone HTY146 (yRM2000) were sequenced using a Sample Sequencing (SASE)-parallel primer walking strategy. Sequence validation was performed on PCR amplified human genomic DNA from at least 6 individuals including the cell line used to construct HTY146. Primer pairs (spaced 300-900 bp) were randomly picked from 25 sites that are almost evenly distributed along the entire 7q terminal region. The QC/QA results confirmed that the cloned YAC/E. coli sequences were a faithful representation of genomic DNA, containing less than one error in 10,000 bases (White et al., this meeting). Computer analysis uncovered numerous open reading frames, expressed sequence tags (ESTs), and potential exons dispersed along the entire 226 kb region, as well as 19 variable number of tandem repeats (VNTRs) and 20 microsatellite repeats. Approximately 192 kb internal to the (TTAGGG)n terminal repeat, the first and second exons of the human vasoactive intestinal peptide receptor 2 (VIPR2) gene were located. This gene is involved in a diverse set of physiological functions including smooth muscle relaxation, electrolyte secretion, and vasodilation. SASE-parallel primer walking is efficient for finished sequencing and gene localization of a long range genomic target, especially a "difficult" region like telomeres.
P. Scott White, Han Chi, Larry L. Deaven,
Darrell O. Ricke, Elizabeth Saunders, Owatha L.
Tatum, and Robert K. Moyzis
The 7q telomere region was sequenced to examine whether this region contains expressed genes close to the telomere or serves as a buffer area between the telomere and expressed genes in the event of telomere shortening. A CpG island next to a few exons of the vasoactive intestinal polypeptide receptor 2 precursor gene was discovered (the remaining exons are presumed to be centromeric of the sequenced region). This 225,432 bp sequence contains multiple EST clusters and evidence for additional candidate genes as well as a large array of repetitive sequences and one type I phosphatidylinositol-4-phosphate 5-kinase pseudogene. Interestingly, a second candidate gene exists in the form of 7 close regions of homology with chromosome 4 cosmid L191F1. Evidence for a third candidate gene can be pieced together from EST clones with homologies to both the 5' and 3' ends. These paired EST end sequences link together multiple candidate exons in what may be an alternatively spliced gene. An update of the sequence analysis of this region will be presented.
As part of the sequence verification for the 7q region, we have PCR amplified and sequenced ~300 bp from each of 19 sites along this region, from at least 6 individuals representing 4 diverse ethnic groups. In addition, we sequenced the genomic DNA used to build the 7q YAC library, and the supporting YAC clone, HTY-146. For each site, PCR primers were selected in regions free of known repeats. All PCR products were sequenced and compared to detect errors and polymorphisms. Sequence comparisons among individuals give us an idea of the amount of natural variation, as well as validate the clone sequence to the genomic DNA from which the clone originated. Out of 5722 nucleotides resequenced (2.5 % of total) there was one heterozygous site found in the individual from which the library was constructed, and 9 single nucleotide polymorphisms (SNPs) among the other individuals. All other sequences were identical between the clone sequence and that of the individual from which the library was made. These results are in agreement with previous estimates that natural variation at the primary sequence level is at least 1 in 1000, which needs to be considered when performing sequence validation aimed at detecting a one in 10,000 error frequency.
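The figures in this verification study can be reproduced with a brief calculation. The sketch takes the counts directly from the abstracts (5,722 resequenced nucleotides, a 225,432 bp region, 9 SNPs, and a target error frequency of 1 in 10,000); the function names are illustrative.

```python
def fraction_resequenced(resequenced_bp, total_bp):
    """Share of the finished region that was independently resequenced."""
    return resequenced_bp / total_bp


def bases_per_variant(resequenced_bp, variant_count):
    """Average number of resequenced bases per observed variant."""
    return resequenced_bp / variant_count


def expected_errors(resequenced_bp, error_frequency):
    """Number of clone errors expected in the sample at a given error rate."""
    return resequenced_bp * error_frequency
```

With these inputs the sample covers about 2.5% of the region, the SNPs occur at roughly 1 per 636 bases, and only about 0.6 clone errors would be expected at a 1 in 10,000 error rate. At that rate natural polymorphisms would outnumber true clone errors in the sample, which is why the two must be distinguished during validation.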
Shawn Iadonato, Jun Yu, Gane K.-S. Wong,
Charles Magness, Phil Green, and Maynard Olson
We are implementing an approach to large-scale sequencing of human genomic DNA that emphasizes high-quality at a reasonable cost. Quality targets are: 1) a single-base-pair error rate of better than 0.01%, 2) no gaps, in either the shotgun sequence or the physical map, across mega-base sized regions, and 3) validation of the sequenced large-insert clones to an average resolution of 200-bp. This is accomplished by the systematic introduction of objective quality measures throughout the data production process, from the physical mapping through to the shotgun sequencing.
Detailed sequence-ready restriction-maps are produced by the multiple-complete-digest (MCD) method. These maps have proven to be very accurate, with average sizing errors better than 1%. They allow us to both validate the large-insert clones to a 200-bp average resolution, and to confirm the correctness of the subsequent sequence assemblies. The sequence production is distinguished by an emphasis on long read-lengths instead of maximum machine utilization. Data and quality analysis are performed with the Phred/Phrap/Consed system and average Phrap-alignable read lengths are 735-bp. Overlapping large-insert clones are finished independently and the sequence overlaps are used to estimate the single-base-pair error rate, which has been better than 1 in 100,000-bp. We will present data from a CONTIGUOUS 2-Mbp region on human chromosome-7 (near 7q31.3). Evidence will be presented to support all of the above quality assertions.
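The overlap-based error estimate can be sketched as follows. The sketch assumes the two independently finished overlap sequences have already been aligned to equal length with no insertions or deletions, which is a simplification; the function name is hypothetical and this is not the Phred/Phrap/Consed implementation.

```python
def overlap_discrepancy_rate(seq_a, seq_b):
    """Fraction of positions at which two aligned overlap sequences differ.

    Each discrepancy means at least one of the two finished sequences is
    wrong at that position, so the per-sequence error rate is at most the
    discrepancy rate (and about half of it if errors are independent).
    """
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal aligned length")
    mismatches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a != b)
    return mismatches / len(seq_a)
```

For example, a single mismatch across 100,000 overlapping bases would give a discrepancy rate of 1 in 100,000.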
Chris Martin and Mohan Narla
The Genome Centers at DOE's Los Alamos, Berkeley, and Livermore National Laboratories have merged into the Joint Genome Institute (JGI). This reorganization is a significant undertaking that is aimed at achieving economies of scale in our genome efforts while also leveraging the expertise available at the three sites. In the area of genomic sequencing, the large majority of the JGI's effort will be moving to a new facility, located in Walnut Creek, California, in early to mid 1998. This will provide a custom-designed space sufficient for the scale-up of the JGI's production sequencing effort in a factory-like setting, termed the PSF (Production Sequencing Facility). Until this time, the production sequencing efforts at the three laboratories will be scaled up in place, while also working towards the goal of finalizing a uniform production process for use within the PSF.
The sequencing approach at Berkeley has recently been altered by an increased emphasis on the up-front shotgun phase of the process, which utilizes double-end plasmid subclone sequencing. Additionally, we have adopted the use of additional bacterial strains for all of our subclone libraries that seem to alleviate under-representation of genomic regions due to cloning biases. These changes are helping us to significantly reduce the time required for the complete sequence of a given physical mapping clone to be determined. Data will be presented on the current status of a set of 16 human and mouse P1s, PACs and BACs that are now in progress using this new sequencing process.
Kelly A. Frazer, Gabriela G. Cretz, Christopher
H. Martin, Jan-Fang Cheng, and Edward M.
Human chromosome 5q31 was chosen by the JGI for large-scale sequencing because it harbors a family of interleukin genes which are important regulators of the immune response. Interleukin-3 (IL-3), IL-4, IL-5, IL-13 and granulocyte-macrophage colony-stimulating factor (GM-CSF) are clustered within 600 kilobases of each other on human 5q31. Previously, we have computationally and biologically analyzed the available genomic sequence data in the 1 Mb region of human 5q31 containing the interleukin family cluster and identified 16 new genes, as well as 7 previously known genes. We also performed comparative mapping studies and demonstrated that 13 of the human genes in the 5q31 region are located in the syntenic region of mouse chromosome 11.
Numerous experimental studies have indicated that noncoding regulatory sequences controlling gene expression are evolutionarily conserved in mice and humans. Comparative analysis of human and mouse orthologous sequences is a powerful tool for identifying these conserved noncoding regulatory elements. To facilitate the annotation of coding regions and permit noncoding conserved regulatory elements to be readily identified in the human 5q31 sequence data, the JGI has chosen to sequence the 1 Mb region containing the interleukin growth factor cluster on mouse chromosome 11.
To date, we have isolated and begun sequencing a 150 kb mouse BAC which contains 5 complete genes: IL-4 and IL-13 - genes with coordinated expression in T-Cells; septin - a ubiquitously expressed gene with complex alternative splicing; and cyclin-like and KIF3 - genes which are both predominantly expressed in the brain and are physically near each other. Importantly, the expression patterns of these five genes suggest that human-mouse conserved noncoding regulatory elements are likely to be in this 150 kb region and identified by comparative sequence analysis.
Ana V. Perez-Castro, Julie Wilson, Cleo Naranjo,
and Michael R. Altherr
We have sequenced the human and mouse genomic segments encoding the fibroblast growth factor receptor 3 gene (FGFR3). FGFR3 is a member of the receptor tyrosine kinase superfamily and a developmentally regulated transmembrane protein. Mutations in FGFR3 contribute to a number of significant human maladies. The human and mouse genes exhibit significant similarity in terms of their structural domains and genomic organization. In both species, the gene consists of 19 exons and 18 introns spanning greater than 15 kbp of sequence. The coding regions across the entire gene are 84% and 92% identical at the nucleic acid and amino acid levels respectively. The alternatively spliced exon 8 is 95% similar at the nucleic acid level and 93% similar at the amino acid level. While the sequence similarity of the introns is (on average) less than 50%, the sizes of the individual introns are very similar. The 5' flanking regions of the gene in both species show a similarly high degree of conservation. In general, the analysis of a 400 bp segment preceding the initiator ATG shows only a 60% similarity. However, several common transcriptional regulatory sequences are present in both. Consensus binding sites for Sp1, AP2, Krox24, and IgHC.4 are located in this region. It is also worth noting that the positions of these sites and the spacing between them are conserved in these species. These data suggest that mouse comparative genomic sequencing can be used to identify and annotate significant functional domains in the human genome.
Support: DOE Contract W-7405-ENG-36 and
LANL LDRD funds.
Darrell O. Ricke, Norman A. Doggett and Michael R. Altherr
The terminal short arm band of human chromosome 16 (16p13.3) is syntenic with three different mouse chromosomes (11, 16, and 17). At least 8 loci define the syntenic overlap between human 16 and mouse 17. Less than 1 Mbp from the proximal boundary of this segment is another cluster of eight loci that comprise 21 centimorgans of mouse chromosome 16. Our human genomic sequencing targets include this 1 Mbp segment of chromosome 16 where the syntenic breakpoint between mouse chromosomes 16 and 17 occurs. In addition, this segment overlaps at least one unidentified locus of medical significance, CATM, and contains the recently identified gene for familial Mediterranean fever. The CATM locus is a gene that, when defective, causes congenital cataract and microphthalmia. It is clear from previous comparative sequencing studies that both structural and regulatory regions of genes can be identified by similarity comparisons of human and mouse genes. In fact, our preliminary sequencing efforts of this region of 16p13.3 have already identified 33 mouse EST clusters with greater than 80% similarity to the human sequence. These ESTs are being used to develop primer pairs to identify mouse BAC clones. These mouse BAC clones will be subcloned and sequenced at low redundancy to identify and annotate potentially interesting regions of the human genome. This effort will be closely coordinated with higher redundancy efforts at LLNL and LBNL in an effort to optimize the depth of sequence coverage required to identify conserved sequence segments and presumably functionally important domains.
Support: DOE Contract W-7405-ENG-36 and LANL LDRD funds.
Owen White, Mark D. Adams, Rebecca Clayton,
Hans-Peter Klenk, Anthony R. Kerlavage, Karen E. Nelson, J. Craig Venter
The Institute for Genomic Research (TIGR) has completed the genomic sequences for the DOE-funded projects of the eubacterium Mycoplasma genitalium and two archaeal genomes, Methanococcus jannaschii and Archaeoglobus fulgidus. The eubacterial genomes Deinococcus radiodurans and Thermotoga maritima are now in closure, while a fifth genome project, Shewanella putrefaciens, is currently in the random sequencing phase at TIGR. These projects result in over 11.5 Mb of finished sequence that contain approximately 10,500 genes. Data relating to each of these bacterial projects are located in our local database, disk-based files used for searching, an external web server, the public sequence archives, and other web servers (e.g. Argonne's biochemical pathways for Haemophilus influenzae). One challenge of the distributed architecture of bacterial information is to effectively propagate changes from one data location to another, preferably in an automated manner. We will report on a system under development that encodes the system-wide data dependencies necessary for updating information in a large scale sequencing facility. We have also begun implementation of enhanced high-throughput data analysis that adds intergenomic and intragenomic gene families, hydrophobicity plots, block and motif searching, and MEDLINE and web-derived information to our previous database searching methodology. We will report on our enhanced annotation techniques and will discuss them in the context of the recently completed genomes.
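One way to think about propagating changes between data locations is as a dependency graph walked in topological order. The sketch below is only an illustration of that idea: the location names and their dependencies are hypothetical, and it makes no claim about how the TIGR system under development is actually built. It uses the standard-library graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical data locations; each maps to the locations it is derived from.
DEPENDS_ON = {
    "local_database": set(),
    "search_files": {"local_database"},
    "web_server": {"local_database"},
    "public_archive": {"local_database"},
    "external_pathway_server": {"public_archive"},
}


def update_order(changed, depends_on=DEPENDS_ON):
    """Return the downstream locations to refresh after `changed` is edited.

    Locations are visited in topological order, so every location is
    refreshed only after all of the locations it is derived from.
    """
    order = TopologicalSorter(depends_on).static_order()
    stale = {changed}
    to_refresh = []
    for location in order:
        if location != changed and depends_on[location] & stale:
            stale.add(location)
            to_refresh.append(location)
    return to_refresh
```

Calling update_order("public_archive") in this toy graph returns only the external pathway server, whereas a change to the local database marks every other location for refresh.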
Robert B.Weiss1, Mark Stump1, Joshua
Cindy Hamil1, Frank Robb2 and Diane Dunn1
The ability to sequence moderate-size plasmid inserts (10-15 kb) using mapped transposons is being tested on both microbial and human sub-clone libraries. The transposon serves as the priming site for initiating a set of bi-directional di-deoxy ladders within the plasmid insert. The priming site locations are mapped by automated Southern analysis using a restriction digest that releases the insert from the vector, while cutting in the center of the transposon. The inserts propagate on a vector designed to maintain the plasmid at a few copies per cell, while enabling efficient DNA purification after temperature-induced runaway plasmid replication. Sub-clone libraries are built in a multiplex family of this vector backbone, where each vector attaches unique sequence tags to the insert for use in mapping. Automated Hybridization and Imaging Instruments (AHII) are used to sequentially probe for the mapping tags. These instruments are fully automated devices for detecting enzyme-linked fluorescence from DNA hybrids on nylon membranes, and are used in the mapping and sequencing phase of the project.
We are in the process of using the transposon mapping and sequencing technique to complete the 2.1 Mb genomic sequence of Pyrococcus furiosus (DSM 3638), a hyperthermophilic heterotrophic member of the Archaea. This organism, isolated from hot marine sediments, grows vigorously at temperatures near 100°C. Its lifestyle is anaerobic and it derives energy by fermenting peptides and carbohydrates to organic acids and CO2. P. furiosus plasmid and cosmid libraries were built in a family of 21 multiplex vectors. These vectors provide multiplex tags for both the end sequencing and transposon mapping phases of the process. End sequences were determined on 2875 plasmid inserts and 400 cosmid inserts. The cosmid inserts provide a global scaffold of the genome, while the plasmid inserts are used in the primary process of transposon mapping and sequencing. Overlaps between plasmid inserts derive from the matching of end sequence onto transposon sequence contigs, and contigs are grown from re-feeding the mapping process with inserts that extend or bridge the sequence contigs.
Common growth and DNA prep formats feed both the mapping and sequencing process. Minimal spanning sets of clones are predicted from the mapping phase and fed to the sequencing process, where cycle sequencing is performed on double-stranded templates. A single hybridization instrument is capable of imaging 1728 lanes of mapping data in each eight hour cycle; with a 20-fold deep multiplex set, these instruments acquire mapping data on 34,560 clones over the course of a single, unattended run. The use of this technique on the Pyrococcus genome is demonstrating its robustness and cost-effectiveness. The use of moderate-size plasmid insert libraries, which provide a reduced clone feed and finishing information, results in a complementary approach to small-size M13 insert strategies. This project is an alpha test of methods, instrumentation and software under development at the Utah Center for Human Genome Research.
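The instrument throughput quoted above follows from a simple product, shown here as a short sketch. The run duration assumes one eight hour probing cycle per multiplex tag, which is an inference from the sequential probing described earlier and is not stated explicitly in the abstract.

```python
def clones_per_run(lanes_per_cycle, multiplex_depth):
    """Clones mapped in one unattended run: one lane per clone per tag."""
    return lanes_per_cycle * multiplex_depth


def run_hours(hours_per_cycle, multiplex_depth):
    """Run duration, assuming one probing cycle per multiplex tag."""
    return hours_per_cycle * multiplex_depth
```

With 1728 lanes and a 20-fold multiplex set this gives 34,560 clones per run, matching the stated figure; under the one-cycle-per-tag assumption such a run would last about 160 hours.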
This work is funded by DOE grant DE-FG03-94ER-61950 (R.B. Weiss, P.I.).
Sorel Fitz-Gibbon1, Ung-Jin Kim2, Heidi Ladner1, Yongweo Cao2, Gony H. Kim2, Barbara Perry2, Enrique Colayco2, Ronald V. Swanson3, Terry Gaasterland4, Jeffrey H. Miller1, and Melvin I. Simon2
Pyrobaculum aerophilum is a hyperthermophilic archaeon, discovered in a boiling marine water hole at Maronti Beach, Italy, that is capable of growth at 104°C. This microorganism can grow aerobically, unlike most of its thermophilic relatives, making it amenable to a variety of experimental manipulations and a potential model organism for studying archaeal microbiology and thermophilism. Sequencing the entire genome of this organism provides a wealth of information on the evolutionary and phylogenetic relationships between archaea and other organisms, as well as on the basis of the thermophilic nature of this organism. We have constructed a physical map that covers the estimated 2.3 megabase pair genome using a 10X fosmid library. The map currently consists of 96 overlapping fosmid clones. We have completed sequencing the entire genome using a random shotgun approach supplemented by oligonucleotide primer-directed sequencing. A total of 16,098 random sequences corresponding to approximately 3.5X genomic coverage were obtained by sequencing from both ends of 2-3 kbp genomic DNA fragments cloned into pUC18/19 vectors using vector-specific primers. These fragments were assembled into several hundred contigs using the Phrap program developed by Dr. Phil Green at the University of Washington, Seattle. Gaps and regions of low-quality base calls have been resolved using primers specifically designed to extend the problem regions. We have successfully closed all remaining gaps after a total of 2,300 directed sequencing reactions and reassembly. Our current full-length genomic sequence still suffers from low data quality: the sequence is only approximately 99% accurate. This is mainly due to the low redundancy (3.5-fold) in random sequencing. We plan to perform 2,000-3,000 more directed sequencing reactions to polish the sequence to 99.99% accuracy.
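As a rough check on the figures above, the sketch below derives the average read length implied by the stated coverage and the expected number of erroneous bases at the current and target accuracies. It assumes the usual definition of coverage (total bases read divided by genome size) and treats accuracy as a uniform per-base rate; the function names are illustrative.

```python
# Back-of-the-envelope coverage and accuracy arithmetic.
# Assumptions: coverage = (reads * mean read length) / genome size,
# and accuracy is a uniform per-base probability of a correct call.

GENOME_SIZE_BP = 2_300_000   # estimated genome size from the text
NUM_READS = 16_098           # random shotgun reads
COVERAGE = 3.5               # stated fold coverage


def implied_read_length(genome_bp: int, reads: int, coverage: float) -> float:
    """Mean usable read length (bp) implied by the stated coverage."""
    return genome_bp * coverage / reads


def expected_errors(genome_bp: int, accuracy: float) -> float:
    """Expected number of incorrect bases at a given per-base accuracy."""
    return genome_bp * (1.0 - accuracy)


if __name__ == "__main__":
    print(round(implied_read_length(GENOME_SIZE_BP, NUM_READS, COVERAGE)))
    print(round(expected_errors(GENOME_SIZE_BP, 0.99)))
    print(round(expected_errors(GENOME_SIZE_BP, 0.9999)))
```

Under these assumptions the reads average roughly 500 bp, and polishing from 99% to 99.99% accuracy corresponds to reducing the expected errors from about 23,000 bases to about 230.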
We are currently running MAGPIE, a Web-based system for sequence annotation, on our UltraSparc server (date.tree.caltech.edu) and plan to analyze and completely annotate the genome.
D.R. Smith, T. Aldredge, R. Bashirzadeh, H. Bochner, M. Boivin, S. Bross, D. Bush, A. Caron, A. Caruso, G. Church*, R. Cook, C.J. Daniels#, C. Deloughery, L. Doucette-Stamm, J. Dubois, J. Egan, D. Ellston, J. Ezedi, T. Ho, K. Holtham, P. Joseph, M. LaPlante, H-M. Lee, D. Blakely, R. Cook, R. Gibson, K. Gilbert, A. Goyal, J. Guerin, D. Harrison, J. Hitti, L. Hoang, N. Jiwani, P. Keagle, J. Kozlovsky, W. Lumm, J. Mao, P. Mank, A. Majeski, S. McDougall, J. Nölling, D. Patwell, J. Phillips, S. Pietrokovski@, B. Pothier, S. Prabhakar, D. Qiu, J.N. Reeve#, P. Rice, P. Richterich, M. Rossetti, M. Rubenfield, M. Sachdeva, H. Safer, G. Shimer, P. Snell, R. Spadafora, L. Spitzer, H-U. Thomann, R. Vicaire, Y. Wang, L. Wong, K. Weinstock, J. Wierzbowski, Q. Xu, L. Zhang
This project is applying automated sequencing technology and bioinformatics tools to the analysis of microbial genomes with potential applications in energy production and bioremediation. Efforts have focused on two genomes in particular, those of Methanobacterium thermoautotrophicum strain delta H and Clostridium acetobutylicum ATCC 824.
Methanobacterium thermoautotrophicum is a thermophilic archaeon that grows at temperatures from 40-70°C, and was isolated in 1971 from sewage sludge. The complete 1,751,377 bp sequence of the genome of M. thermoautotrophicum was determined by a whole genome shotgun sequencing approach. Analysis of the sequence predicted 1,855 polypeptide-encoding ORFs, 807 (44%) of which could be classified according to function. The putative gene products were compared with sequences from Methanococcus jannaschii, as well as eucaryal, bacterial and archaeal specific databases. These analyses indicated that most ORFs are most similar to sequences described previously in other Archaea, but that there has been extensive divergence between the two sequenced methanogen genomes. Most gene products predicted to be involved in cofactor and small molecule biosyntheses, intermediary metabolism, transport, nitrogen fixation, regulatory functions and interactions with the environment are more similar to bacterial than eucaryal sequences, whereas the converse is true for most proteins predicted to be involved in DNA metabolism, transcription, and translation. There are 24 polypeptides that could form two-component sensor kinase-response regulator systems, homologs of bacterial DnaK and DnaJ, homologs of eucaryal DNA replication initiation Cdc6 proteins, an X-family repair-type DNA polymerase and an unusual archaeal B-type DNA polymerase. There are 39 tRNA genes, two rRNA gene clusters, one intein-containing gene, several repeated regions, two large clusters of short repetitive elements, and numerous other interesting features.
The Clostridia are a diverse group of gram-positive, rod-shaped, spore-forming anaerobes that include several toxin-producing pathogens and a large number of terrestrial species. The latter have been used extensively for industrial solvent production (acetone, butanol and ethanol) by fermentation of starches and sugars. C. acetobutylicum strain ATCC 824 has a 4.1 Mb, AT-rich genome and is one of the best-studied solventogenic clostridia. The shotgun sequencing phase has been completed, with 4.9 Mb of multiplex and 21.3 Mb of ABI raw sequence reads (6.3-fold total redundancy) that produced 551 contigs spanning 4,030,725 bases when assembled using PHRAP with quality scores. A total of 4018 putative polypeptide-encoding ORFs were identified and searched against public databases to provide preliminary annotation. The finishing phase of the project is currently underway, utilizing a quality-based finishing paradigm and a set of integrated bioinformatics tools. The data are available at http://www.cric.com.
John J. Dunn, Laura-Li Butler-Loffredo, Ting Chen, William C. Crockett, Jan Kieleczawa, Jeremy Medalle, Sean McCorkle, Keith H. Thompson, Jeanne R. Wysocki, Shiping Zhang and F. William Studier
The ~900 kbp linear chromosome of Borrelia burgdorferi, the bacterium that causes Lyme disease, is being sequenced to spur the development of an integrated system for accurate, high-throughput, low-cost genome sequencing with minimal human involvement. We are testing a modified whole-genome shotgun approach, using random 1st-end and directed 2nd-end sequencing on a clone set with good physical coverage. A network of linked clones is generated at relatively low sequence redundancy, and primer walking on linking clones is used to close the gaps and complete the sequence of both strands. This strategy can achieve highly accurate sequence at an overall redundancy of 4-fold or even less.
Two commercial fluorescent sequencers have been used in this development phase, with the intention of scaling up capacity with a capillary sequencing system currently under development. Data management and analysis capabilities tailored to this sequencing strategy have been developed and implemented, including software for managing the sequence process, selecting 2nd-ends for sequencing, assembling the sequence, and selecting walking primers. The routine biochemical protocols needed for Mbp-scale sequencing have also been implemented. Primers for walking are generated from a library of all 4096 hexamers, by ligating hexamers on hexamer templates to form 12-mers for cycle sequencing.
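The size of the hexamer library follows from the four-letter DNA alphabet (4^6 = 4096). The sketch below enumerates such a library and shows how any 12-mer walking primer decomposes into two library members. It illustrates only the string bookkeeping, not the ligation chemistry or the actual primer-selection software; all names are hypothetical.

```python
# Enumerating a complete hexamer library and decomposing a 12-mer
# walking primer into two hexamers. Bookkeeping sketch only; the
# ligation step itself is not modeled.
from itertools import product

BASES = "ACGT"


def all_kmers(k: int) -> list[str]:
    """Every possible DNA string of length k, in lexicographic order."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def split_into_hexamers(primer: str) -> tuple[str, str]:
    """Split a 12-mer primer into its 5' and 3' hexamer halves."""
    if len(primer) != 12 or set(primer) - set(BASES):
        raise ValueError("primer must be 12 bases drawn from A, C, G, T")
    return primer[:6], primer[6:]


HEXAMER_LIBRARY = set(all_kmers(6))

if __name__ == "__main__":
    print(len(HEXAMER_LIBRARY))
    left, right = split_into_hexamers("ACGTACGTTTGA")
    print(left, right, left in HEXAMER_LIBRARY and right in HEXAMER_LIBRARY)
```

Because every 12-mer splits into two hexamers, a fixed library of 4096 oligonucleotides can in principle supply any of the 4^12 possible 12-mer primers without custom synthesis.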
The Borrelia sequence is almost finished. As of August 21, approximately 5400 end sequences and 2700 primer-walking sequences had been obtained on approximately 2700 plasmid clones (average insert length of 2.1 kbp) and 106 fosmid clones (average insert length of 35 kbp), representing about 4x sequence coverage and 10x physical coverage of the linear chromosome. The sequence redundancy was higher than necessary because of extra sequences obtained during protocol development. A single contig of approximately 900 kbp aligns with the published restriction map and contains all but an estimated few kbp at each end. This sequence is posted on our web site at www.bio.bnl.gov and will be updated periodically as the sequence of both strands is completed. As of August 21, about 92% had been sequenced on both strands and approximately 200 single-strand gaps remained to be filled. Our sequence agrees well with the sequence of the chromosome of the same Borrelia strain determined independently at The Institute for Genomic Research (TIGR).
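The coverage figures quoted above can be checked from the clone counts and insert lengths. The sketch below computes physical coverage as total cloned insert length divided by chromosome length, and the mean read length implied by the stated 4x sequence coverage. The read length is an inference from the quoted numbers, not a value reported in the text, and the names are illustrative.

```python
# Physical and sequence coverage arithmetic for the Borrelia clone set.
# Physical coverage = total insert length / chromosome length.
# The implied read length is inferred from the stated 4x coverage.

GENOME_KBP = 900.0
CLONE_SETS = [          # (number of clones, average insert length in kbp)
    (2700, 2.1),        # plasmid clones
    (106, 35.0),        # fosmid clones
]
NUM_READS = 5400 + 2700  # end sequences plus primer-walking sequences


def physical_coverage(clone_sets, genome_kbp: float) -> float:
    """Fold coverage of the chromosome by cloned inserts."""
    total_insert_kbp = sum(n * insert for n, insert in clone_sets)
    return total_insert_kbp / genome_kbp


def implied_read_length_bp(n_reads: int, sequence_coverage: float,
                           genome_kbp: float) -> float:
    """Mean read length (bp) needed to reach the stated sequence coverage."""
    return sequence_coverage * genome_kbp * 1000 / n_reads


if __name__ == "__main__":
    print(round(physical_coverage(CLONE_SETS, GENOME_KBP), 1))
    print(round(implied_read_length_bp(NUM_READS, 4.0, GENOME_KBP)))
```

The inserts give a physical coverage of about 10.4x, consistent with the quoted 10x, and 4x sequence coverage from 8100 reads implies a mean read length of roughly 440 bp.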
Software and database support for this sequencing strategy continues to be refined, with the aim of removing as much human decision making as possible. Automation of sample selection and data entry will greatly reduce sources of human error, which cause more problems in primer walking than in shotgun sequencing. A newly developed single-copy amplifiable vector in which nested deletions are easily produced should allow the use of clone libraries of longer average fragment length, improve sequencing efficiency, and help in resolving repeated sequences.
R. T. Okinaka, K. G. Cloud, O. A. Hampton, K. K. Hill, P. Keim, S. Kumano, D. Manter, J. J. Renouard, D. O. Ricke, and P. J. Jackson
Virulent strains of Bacillus anthracis contain two large plasmids, pXO1 and pXO2. These plasmids are known to carry the toxin genes (lethal factor, edema factor, and protective antigen) and the three capsule genes that enhance the virulence of the organism (Cap A, B and C). The vast majority of the genes on pXO1 and pXO2, however, have not yet been characterized. The two plasmids, with 185 kb and 85 kb of DNA in pXO1 and pXO2, respectively, together contain sufficient sequence information to code for an additional 200-250 microbial-sized open reading frames (ORFs). As an initial step to identify and characterize potential ORFs, we have begun to sequence, assemble and analyze the entire DNA sequence of each plasmid. The pXO1 and pXO2 genomes are being analyzed by random cloning of sheared or restriction-digested 2-4 kb fragments followed by high-throughput DNA sequence analysis on automated Applied Biosystems sequencing machines. We will report on our progress and sequence analysis of the assembled shotgun sequence contigs for both pXO1 and pXO2.
This work is sponsored by the U.S. Department of Energy.
Michael J. Daly and Kenneth W. Minton
The genome of Deinococcus radiodurans R1 is currently being sequenced under DOE/OBER sponsorship at The Institute for Genomic Research (TIGR). The $44,406 provided to us by DOE was used to support the early stages of this successful sequencing effort (http://www.tigr.org/tigr-scripts/CMR2/GenomePage3.spl?database=gdr). A series of total genomic DNA preparations from strain R1 was prepared for TIGR that met the high standards needed to generate a random D. radiodurans gene library. We also provided TIGR with the results of our high-resolution studies on determining the size of the D. radiodurans chromosome (3.1 Mbp) and its plasmid pS16 (46 kbp). The size of the chromosome was determined by generating a NotI restriction map by pulsed field gel electrophoresis (PFGE); the chromosome contains 12 NotI sites and pS16 contains 1 NotI site. As a result of genetic disruption studies we were able to resolve all PFGE bands clearly by eliminating doublets, thereby correcting a previously published, erroneous chromosome size estimate that was based on 11 PFGE chromosomal fragments; there is a 420 kbp chromosomal NotI doublet. Finally, by cloning out the NotI sites plus flanking sequence, and using them as probes, we were able to align the NotI fragments into contigs using PFGE Southern blot analysis. These contigs are providing TIGR a rough framework on which to align the DNA sequences.
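The effect of an unresolved doublet on a size estimate can be illustrated with a toy calculation. The sketch below assumes a circular molecule, for which n restriction sites yield n fragments, and shows how two co-migrating fragments counted as a single band lead to an underestimate. The site positions and molecule length are invented for illustration and are not the D. radiodurans NotI map.

```python
# Toy illustration of how a PFGE doublet causes a size underestimate.
# Assumption: the molecule is circular, so n cut sites give n fragments.
# Site positions and lengths below are hypothetical, not real map data.


def circular_fragments(site_positions: list[int], length: int) -> list[int]:
    """Fragment sizes from cutting a circular molecule at the given sites."""
    sites = sorted(site_positions)
    fragments = [b - a for a, b in zip(sites, sites[1:])]
    fragments.append(length - sites[-1] + sites[0])  # wrap-around fragment
    return fragments


def apparent_size_from_bands(fragments: list[int]) -> int:
    """Size estimate when equal-sized fragments are counted as one band."""
    return sum(set(fragments))


if __name__ == "__main__":
    frags = circular_fragments([50, 150, 400, 650], length=1000)
    print(frags)                            # four sites, four fragments
    print(sum(frags))                       # true size
    print(apparent_size_from_bands(frags))  # doublet counted once
```

In this toy case the two 250-unit fragments run as one band, so summing the visible bands gives 750 rather than the true 1000; resolving the doublet restores the missing fragment to the total.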
This work was supported by DOE grant #DE-FG02-96ER62231.
P. Hu, J. Elliott, P. Mc Cready, E. Skowronski, A. Adamson and E. Garcia
In Y. pestis the majority of virulence factors required for the expression of bubonic plague are encoded by the 9.6-kb pesticin (pPCP1), 70-kb calcium dependence (pCD1) and 100-kb murine toxin (pMT1) plasmids. We have obtained the entire nucleotide sequence of the 9.6-kb and 100-kb plasmids by employing a combination of in vitro transposon-assisted and random shotgun approaches. The 70-kb plasmid is nearly completed. Analysis of the sequences obtained confirms the presence of, and has enabled the precise localization of, several known virulence genes in these plasmids. BLAST searches against the databases indicate the presence in these plasmids of a large number of newly identified virulence-related genes with high homology to genes from both closely and distantly related bacterial species such as Shigella, Salmonella, and Mycobacterium. A number of insertion sequence elements of the IS100, IS125 and IS200 homology groups have been identified and localized in these plasmids. Other findings of interest include the identification of pilus assembly protein genes, origins of replication, and partition and immunity regions. Results of the work completed and analyses of the sequence data obtained will be presented.
Work performed under the auspices of the US DOE by Lawrence Livermore National Laboratory under contract No. W-7405-ENG-48.