Genome Sequencing Abstracts
DOE Human Genome Program
1. Uncovering the Riches of Human
Chromosome 19 Through Genomic Sequencing
Jane E. Lamerdin, Karolyn Burkhart-Schultz,
Linda Danganan, Laurie Gordon, Stephanie Stilwagen, Glenda Quan, Hoan Phan,
Nelson Velasco, Andre Arellano, Brent Kronmiller, Long Do, Astrid Terry,
Warren Regala, Vijay Viswanathan, Jennifer Dias, Amy Brower, Tim Andriese,
Pat Poundstone, Julie Avila, Jackie Coefield, Susan Lucas, Tina Attix,
Stephenie Liu, Robert Bruce, Evan Skowronski, Rick Colyaco, Arthur Kobayashi,
David Ow, Matt Nolan, Anthony V. Carrano, Anne S. Olsen, and Paula McCready
Genomic sequencing of human chromosome 19 is well underway. Roughly 20% of the euchromatin of chromosome 19 is now available as finished genomic sequence in GenBank, with completion of most of the chromosome anticipated by 2001. Utilizing a high resolution physical map constructed largely in bacterial-based clones, we have seeded our current sequencing queue with many large (*1Mb) contigs from well-mapped regions, with representative contigs from almost every cytogenetic band on chromosome 19. As of Oct 30, we have finished over 11 Mb of genomic sequence, with roughly 10.5 Mb submitted to GenBank. Preliminary analyses of our data lend credence to the expectation that this GC-rich chromosome will be an excellent target for gene discovery through genomic sequencing. In this regard, several GC-rich regions (average GC content in excess of 58%) that have been sequenced on chromosome 19 exhibit a high gene density (on average, 1 gene per 20-25 kb) relative to the rest of the genome, and encode genes with compact genomic structure. Other regions with a slightly lower GC content (average GC= 50%) possess fewer genes which span larger genomic distances, e.g. the ryanodine receptor (RYR) region in 19q13.1. One interesting feature of chromosome 19 is the large number of clustered gene families distributed throughout the length of the euchromatin. These include the pregnancy-specific glycoprotein family (PSG), multiple zinc finger families (ZNF), olfactory receptors (OLFR) and cytochrome P-450s (CYP). In order to understand their evolution and subsequent functional diversification, several of these clusters are current sequencing targets. Not surprisingly, the ages of these clusters differ significantly, with the PSG family having duplicated fairly recently in evolutionary time, while the OLFR and ZNF clusters appear much older, with many of their members possessing orthologs in mice and rats. 
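The GC-content figures quoted above are base-composition fractions, and the gene densities are simple ratios of region length to gene count. The following is a minimal Python sketch of how such a GC fraction could be computed for a whole sequence and for fixed windows along it. The function names and the toy sequence are illustrative assumptions, not part of any pipeline described in this abstract.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among unambiguous A/C/G/T bases (0.0 if none)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def windowed_gc(seq: str, window: int, step: int):
    """Yield (start, gc_fraction) for successive windows along the sequence."""
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, gc_content(seq[start:start + window])


# Made-up 20-base example: a GC-rich half followed by an AT-rich half.
toy = "GCGCGCATGC" + "ATATATGCAT"
overall = gc_content(toy)
per_window = list(windowed_gc(toy, window=10, step=10))
```

A windowed profile of this kind is one way to see the contrast the abstract draws between GC-rich, gene-dense stretches and regions of lower GC content.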
One common feature of the genomic structure of these disparate families is the prevalence of specific repeat families, which may have contributed to the evolution and expansion of these regions. We are undertaking a more detailed comparison of the genomic content of these gene family regions on chromosome 19, as well as their orthologous counterparts in mouse. These comparisons will no doubt expand our recognition of the fluidity of the mammalian genome.
This work was performed by Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, Contract No. W-7405-Eng-48.
2. Genomic Sequencing of 3 Mb of Human Chromosome 16p13.3 Containing 4 Disease Genes
M.O. Mundt, D.O. Ricke, D.C. Bruce, A.C.
Munk, D.L. Robinson, M.D. Jones, J.M. Buckingham, L.A. Chasteen, E.H. Saunders,
L.S. Thompson, L.A. Goodwin, A.L. Williams, J.L. Longmire, P.S. White,
L.L. Deaven, and N.A. Doggett
We have nearly completed genomic sequencing of a 3.0 Mb cosmid/P1 contig of the human chromosome region in 16p13.3 extending from the tuberous sclerosis disease (TSC2) locus to the CREB binding protein (CREBBP) locus [responsible for Rubinstein-Taybi Syndrome and implicated in acute myeloid leukemias associated with translocations t(8;16)(p11;p13.3) and t(11;16)(q23;p13.3)]. This contig also encompasses the polycystic kidney disease 1 (PKD1) gene, the familial Mediterranean fever gene (MEFV), and the syntenic breakpoint between mouse chromosomes 16 and 17. The average overlap between clones in the contig is about 25%. Our earlier sample sequencing (SASE) of this region had revealed that it is gene rich and G+C rich (>50% G+C), with the gene density approaching one gene/10 kb in some stretches. These observations are consistent with the cytogenetic designation of 16p13.3 as a G+C rich "T" band (Dutrillaux, 1973; Holmquist, 1992). Our strategy for sequencing involved nebulization to randomly break DNA, size selection of 3 kb fragments, double adapter cloning into Bluescript KS+ plasmid, and sequencing of both ends to 6X random sequencing coverage. Sequencing reactions were predominantly BigDye terminators (ABI). Assembly of sequence contigs was assisted by the inherent relationship of the end sequences being approximately 3 kb apart. Closure and finishing were achieved by a combination of primer walking, longer reads, and alternate chemistry reactions. Sequence analysis and annotation are semi-automated with use of the SCAN program (developed by Ricke). We have achieved 100% closure of all 58 clones which we have attempted to sequence from this region. Three gaps remain, but clones have now been found which span these. One of these "gaps" in the cosmid contig map is in the same region as a breakpoint cluster in the CREB binding protein gene, which occurs in leukemias. This region was, however, stably maintained in BACs. The maximum G+C content found in a finished clone is 57%.
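The 6X coverage target above becomes a read count once a read length is assumed. The sketch below works through that arithmetic for double-end plasmid sequencing. The 40 kb clone size and 500-base read length are assumptions chosen for illustration, not values from the abstract, and the uncovered-fraction estimate is the idealized Lander-Waterman approximation for uniformly random reads.

```python
import math


def reads_for_coverage(target_len_bp: int, coverage: float, read_len_bp: int) -> int:
    """Reads needed so that total read bases equal coverage x target length."""
    return math.ceil(coverage * target_len_bp / read_len_bp)


def plasmids_for_coverage(target_len_bp: int, coverage: float, read_len_bp: int) -> int:
    """Each plasmid subclone contributes two reads, one from each insert end."""
    return math.ceil(reads_for_coverage(target_len_bp, coverage, read_len_bp) / 2)


def expected_unsequenced_fraction(coverage: float) -> float:
    """Idealized Lander-Waterman estimate: fraction of bases hit by no read."""
    return math.exp(-coverage)


# Assumed example: a 40 kb cosmid, 6X coverage, 500-base reads.
reads = reads_for_coverage(40_000, 6, 500)
plasmids = plasmids_for_coverage(40_000, 6, 500)
uncovered = expected_unsequenced_fraction(6)
```

Because both ends of each roughly 3 kb insert are read, every plasmid also yields a pair of reads at a known approximate spacing, which is the relationship the abstract describes as assisting contig assembly.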
Alu content has also been high, with up to 30 Alu's in a finished cosmid.
Supported by the US DOE, OBER under contract W-7405-ENG-36.
3. Sequencing Human Chromosome 14 and the Mouse Major Histocompatibility Locus: A Progress Report
Lee Rowen, Anup Madan, Shizhen Qin, Lee Hood, and the Multimegabase Sequencing Group
Department of Molecular Biotechnology, University of Washington, Seattle, Washington
To date, we have sequenced over 1.1 megabases of the mouse major histocompatibility locus and over 600 kb of chromosome 14. Our target region on chromosome 14 is 14q24.3-ter. Based on a preliminary analysis of the mouse MHC sequences, and a comparison with the human sequence counterpart, we have drawn the following conclusions:
1) Evolutionarily conserved genes are interspersed with genes with no identifiable homologues in other species, suggesting that genes with both specialized and generalized (housekeeping) functions co-exist in the MHC.
2) The MHC class III region is the most gene-dense. In the human sequence, 17% of 263 kb contains the coding region of 20 genes (average of 1 gene per 13.2 kb). The average intergenic distance is 2.7 kb.
3) Expansion of gene family membership has occurred through the duplication of long repeats.
4) Gene content and order in human and mouse MHC are similar, although variation in the extent of gene duplication occurs both within and between species.
5) Conserved blocks between human and mouse correspond to the most gene-dense regions in each species.
6) Isochore boundaries, based on GC content and genome-wide interspersed repeats, can be identified in the class II-III regions in both species.
4. Physical Mapping and Sequencing of Human Chromosome 16p12.1-11.2
Hyung Lyun Kang, Yicheng Cao, So
Hee Dho1, Diana Bocskai, Mei Wang, Xuequn Xu, Jun-Ryul Huh1,
Byeong-Jae Lee1, Francis Kalush2, Judith G. Tesmer3,
Eunpyo Moon4, Norman A. Doggett3, Mark D. Adams2,
Melvin I. Simon, and Ung-Jin Kim
The first goal of the Human Genome Project is to determine the nucleotide sequences of the entire human genome. We have been mapping and sequencing the 6 Mbp region near the 16pCEN on the short arm of human chromosome 16 (16p12.1-11.2) jointly with The Institute for Genomic Research (TIGR) and Los Alamos National Laboratory (LANL). As shown by the complete sequences from the BACs derived from this region, the target region has many small and large peri-centromeric repeats. It has been theorized that due to these repeats, many of which consist of large numbers of short tandem repeats, near-centromeric regions are difficult to clone and map. In fact, most genomic libraries tend to have fewer clones covering the centromeric and telomeric regions. Our target region is relatively sparsely covered by STS markers. To provide large, contiguous stretches of BACs from the target region for high throughput shotgun sequencing at TIGR and JGI, Caltech has been developing BAC contigs using the STS and other ordered markers obtained from the YAC-STS map that was previously constructed by LANL. The 12X coverage human BAC libraries constructed at Caltech (A, B, and C) were screened by a combination of STS-PCR screening on pooled libraries and hybridization-based screening using probes that include cDNA inserts, BAC end clones, genomic DNA fragments and BAC inserts. Initially, a total of 46 STSs were screened against the libraries. More recently, Caltech has constructed a 7X coverage library D from approved human DNA samples, which has been screened by hybridization using the probes derived from STS-PCR products, BAC clone inserts (for BAC-to-BAC hybridization), and gel-purified YAC DNA (YAC-to-BAC hybridization). Thus far over 1,000 putative BACs from the target region have been identified.
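The library coverages quoted above (12X, 7X) can be related to the chance that any given locus is present in a library. The sketch below uses the standard Clarke-Carbon relationship, which assumes every part of the genome is cloned with equal probability. The abstract itself notes that this assumption fails near centromeres and telomeres, so these numbers are an idealized upper bound for the target region. The 150 kb insert size and 3 Gb genome size in the example are assumptions, not values from the abstract.

```python
import math


def prob_locus_represented(coverage: float) -> float:
    """Probability that a given locus is in at least one clone, P = 1 - e^-c,
    under the idealized assumption of uniformly random cloning."""
    return 1.0 - math.exp(-coverage)


def clones_needed(prob: float, insert_bp: float, genome_bp: float) -> int:
    """Clarke-Carbon: N = ln(1 - P) / ln(1 - insert/genome)."""
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - insert_bp / genome_bp))


# Coverages mentioned in the abstract (12X and 7X libraries).
p_12x = prob_locus_represented(12)
p_7x = prob_locus_represented(7)

# Assumed example: 150 kb BAC inserts, 3 Gb genome, 99% chance per locus.
n_clones = clones_needed(0.99, 150_000, 3_000_000_000)
```

Under this model even a 7X library would miss very few loci, so a persistent shortage of clones in the pericentromeric region points to cloning bias more than to insufficient library depth.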
The clones are being built into overlapping contigs based on analyses that include STS contents, BAC-to-BAC hybridization data, insert size, restriction fingerprint analysis, BAC end sequencing and BAC end sequence match with completely sequenced BACs, and FISH mapping on some selected BACs. Over 30 BACs from this region corresponding to approximately 4 Mbp in length have been sequenced at TIGR. To close the remaining gaps, we are currently designing new STS markers and OVERGO probes based on the BAC end sequence data along with the Alu-PCR products from the YAC clones covering the gaps. We also plan on screening a new 4X coverage EcoRI BAC library. For the description, protocols, and data related to our projects, please visit our Web site at http://www.tree.caltech.edu.
5. Human Telomere Mapping and Sequencing
Han-Chang Chi1, Deborah
L. Grady1, Harold C. Riethman2, and Robert K. Moyzis1
The Human Genome Project has undergone a dramatic shift this year to the goal of obtaining a 'framework' sequence of human DNA in just a few years. Such a framework sequence will catalyze gene discovery and functional analysis, and allow finished sequencing to be focused on regions of the highest biomedical priority. A significant fraction (20%) of human DNA contains a high percentage of repetitive sequences, is unstable in most cloning vectors, and exhibits extensive polymorphisms both between individuals and populations. Producing quality maps and sequence in such regions, which faithfully represent human genomic DNA, will be a continuing challenge. One such region is represented by human telomeres. Following the discovery and cloning of the human telomere repeat (TTAGGG)n by our laboratory ten years ago, numerous investigations have implicated this sequence or genes near telomeres as likely targets for alterations during cellular aging and cancer progression. Nearly all human telomeres have now been cloned as yeast artificial chromosomes by functional complementation. During the last year, our laboratory finished the 0.23 Mb 7q telomere sequence (GenBank accession AF027390), the first RARE (RecA-Assisted Restriction Endonuclease) cleavage confirmed telomere region to be sequenced directly up to the terminal (TTAGGG)n repeat. Nine overlapping cosmids and two PCR products obtained from the 7q telomere YAC clone HTY146 (yRM2000) were sequenced using a Sample Sequencing (SASE)-parallel primer walking strategy. In total, 18% of this telomeric sequence required extensive PCR and non-standard sequencing methods to finish. Confirmation of the sequence against human genomic DNA was conducted by PCR-sequencing, using primer sets picked every 20 kb. The submitted sequence is a faithful representation of human DNA, containing less than one error in 10,000 bases.
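A sequence that runs "directly up to the terminal (TTAGGG)n repeat" ends in back-to-back copies of that hexamer. As a rough illustration, the sketch below counts perfect tandem copies of a repeat unit at the 3' end of a sequence string. The function name and test strings are hypothetical, and real telomeric sequence contains variant and out-of-phase repeats that this simple exact-match approach would not count.

```python
def terminal_repeat_count(seq: str, unit: str = "TTAGGG") -> int:
    """Count complete, perfect copies of `unit` occurring back-to-back
    at the 3' end of `seq` (exact matching only)."""
    seq = seq.upper()
    unit = unit.upper()
    count = 0
    end = len(seq)
    # Walk backwards one repeat unit at a time while the unit still matches.
    while end >= len(unit) and seq[end - len(unit):end] == unit:
        count += 1
        end -= len(unit)
    return count


# Made-up example: four unrelated bases followed by three telomere repeats.
example = "ACGT" + "TTAGGG" * 3
n_repeats = terminal_repeat_count(example)
```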
Computer and experimental analysis uncovered numerous open reading frames, expressed sequence tags (ESTs), and potential exons dispersed along the entire 226 kb region, as well as 6 single nucleotide polymorphisms (SNPs), 19 variable number of tandem repeats (VNTRs) and 20 microsatellite repeats. The first and second exons for the human vasoactive intestinal peptide receptor 2 (VIPR2) gene were localized approximately 191 kb internal to the (TTAGGG)n terminal repeat. This neuropeptide system is involved in a diverse set of physiological functions including smooth muscle relaxation, electrolyte secretion, and vasodilation. Primer pairs picked to amplify the regions of 7q containing VNTRs uncovered extensive polymorphisms in the limited numbers of individuals examined to date. We are nearing completion of mapping and sequencing two additional telomeres, 9q and 11q, chosen because these regions contain a limited amount of subtelomeric repeats. In addition, SASE analysis is being initiated on 14 additional telomeres that have been confirmed by RARE cleavage (1q, 2p, 2q, 6q, 7p, 8p, 8q, 12q, 13q, 14q, 17p, 18p, 18q, and 21q) in order to prioritize our next targets for finished genomic sequencing.
6. A Comparison of Sequence Gap Closure Strategies
Glenda G. Quan, Karolyn Burkhart-Schultz,
Timothy Andriese, Andre Arellano, Long Do, Arthur Kobayashi, Brent Kronmiller,
Madison Macht, Matt Nolan, David Ow, Hoan Phan, Melissa Ramirez, Warren
Regala, Christina Sanders, Stephanie Stilwagen, Astrid Terry, Nelson Velasco,
Vijay Viswanathan, Anthony V. Carrano, and Jane E. Lamerdin
The goal of finish sequencing is to obtain high-quality, contiguous sequence of cosmid and BAC clone inserts. A major component of finish sequencing is gap closure. In order for the sequence to be contiguous, gaps in the initial sequence data, obtained from random shotgun sequencing, must be closed. At the Joint Genome Center at Lawrence Livermore National Laboratory, we employ three main strategies for sequence gap closure: transposon "bombing", shatter library production, and custom primer walking. We currently use an in vitro transposon insertion strategy involving the random insertion of a yeast transposable element into a gap-spanning, circular plasmid. Using primers designed off both ends of the transposable element, new sequence can be obtained directing away from the insertion point. Transposon "bombing" allows us to identify new sequencing start points within the gap itself, and gives us the advantage of sequencing with two primers. In the shatter method, a double-stranded, linear fragment containing the gap sequence (e.g. a PCR product or restriction fragment) is sonicated into fragments of 300-500 bp in length. These short fragments are then sub-cloned into an M13 phage vector and sequenced using conventional ET-forward primers. These shatter libraries are particularly well-suited to regions of significant secondary structure which are recalcitrant to conventional sequencing chemistries, where the smaller inserts may contain only a portion of the hairpin in the original gap-spanning clone. Additionally, the data generated by these clones are very high in quality and can be assembled as 'mini' shotgun projects in those instances of very difficult assembly problems, such as long tandem repeats. Our third strategy utilizes the automated primer picking program in the sequence editor Consed for primer walking on existing clones that span a gap. 
The main advantage of primer walking is that it allows closure of small gaps with a minimum number of sequencing reads. We have used various combinations of these three strategies to increase our output of finished sequence by over 500% in the last fiscal year. Analyses are underway to evaluate the efficiency and cost of these three strategies in order to better tailor automated finishing protocols needed to achieve the ambitious sequencing ramps required to complete the JGI's portion of the human genome.
This work was performed by Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, Contract No. W-7405-Eng-48.
Matt P. Nolan, Jane E. Lamerdin,
Glenda G. Quan, and Anthony V. Carrano
Our modified shotgun sequencing effort has three phases. In the random phase we sequence a fixed number of plates resulting in 80%-95% of the cosmid bases meeting our quality-based, double-stranded, finish criteria (QbDsFc). During pre-finishing we resequence clones attempting in one round of forwards and reverses to meet the QbDsFc for 95% of the bases and close most gaps. During directed closure we close any remaining gaps and complete double-stranding. To reduce finishing costs and speed time to completion for our cosmid and BAC clone projects we created software to automate selection of finishing reads. We describe our SaF (Swedish and Finnish) software tools developed to 1) facilitate the specification of clones for resequencing and to 2) quantify the state of project contigs with respect to our QbDsFc. We describe improvements to the SaF tools that helped us meet our ten-fold increase in sequence produced in the past year. In our production sequencing we use the SaF tools to fully automate clone selection in the pre-finishing phase and we require finishers to address each region identified during directed closure. For a project assemblage our SaF tools identify bases not meeting the QbDsFc, then conglomerate these problem bases into problem regions using parameterized filtering and clustering algorithms. They produce reports listing each problem region and a contig summary. In prefinishing we are attempting to identify candidate clones for the creation of shatter libraries. Some simple improvements to our algorithm have helped target potential false joins resulting in fewer contigs coming out of the prefinishing stage. We are targeting more reverses at internal problem areas with higher error rate. Also, with a greater emphasis on sequencing BAC clones, we are hoping to more strongly target regions of adjacent ALUs as they are often the cause of gaps and false joins. 
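The step described above, in which bases failing the finish criteria are gathered into problem regions by filtering and clustering, can be sketched as follows. This is not the SaF implementation, whose algorithms are not given in the abstract. It is a minimal illustration that considers only a per-base quality threshold (not double-stranding), and all function names, parameters, and values are hypothetical.

```python
from typing import List, Sequence, Tuple


def problem_positions(quals: Sequence[int], min_quality: int) -> List[int]:
    """Indices of consensus bases whose quality falls below the threshold."""
    return [i for i, q in enumerate(quals) if q < min_quality]


def cluster_regions(positions: Sequence[int], max_gap: int,
                    min_bases: int = 1) -> List[Tuple[int, int]]:
    """Merge problem bases lying within `max_gap` of each other into
    (start, end) regions, dropping regions with fewer than `min_bases`."""
    regions: List[Tuple[int, int]] = []
    start = prev = None
    count = 0
    for pos in sorted(positions):
        if prev is not None and pos - prev <= max_gap:
            prev = pos          # extend the current region
            count += 1
        else:
            if prev is not None and count >= min_bases:
                regions.append((start, prev))
            start = prev = pos  # open a new region
            count = 1
    if prev is not None and count >= min_bases:
        regions.append((start, prev))
    return regions


# Made-up consensus qualities for a 12-base contig.
quals = [40, 40, 10, 12, 40, 15, 40, 40, 40, 40, 40, 5]
bad = problem_positions(quals, min_quality=20)
regions = cluster_regions(bad, max_gap=2)
filtered = cluster_regions(bad, max_gap=2, min_bases=2)
```

Here `max_gap` and `min_bases` stand in for the "parameterized" part of the clustering: loosening them yields fewer, larger regions to report, and tightening them yields more, smaller ones.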
Additionally, for the BACs we are trying to incorporate restriction enzyme map data to verify sections of properly aligned sequence order for the purposes of orienting contigs and identifying potential false joins. New SaF tool features increase their usability in the directed closure phase. We have incorporated a feedback loop that identifies resequenced clones so that they are not ordered redundantly, so that we can run the automated clone selection in multiple passes, and so that we know when certain strategies have been played out. We describe our attempts to more tightly integrate the SaF tools with Consed. A greater emphasis is being placed on increasing the cost effectiveness of clone selection. For instance, we identify short clones so that we do not suggest sequencing their opposite ends.
Work performed under the auspices of the US DOE by Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48.
8. Process Description of a 5 Mb / Year Finish Sequencing Operation Using 100% Plasmid Double End Sequencing
David C. Bruce, Leslie A. Chasteen,
Donna L. Robinson, Myrona D. Jones, Jennifer Bryant, Nancy C. Brown, Beverly
Parson-Quintana, Darrell O. Ricke, Mark O. Mundt, P. Scott White, Norman
Doggett, and Larry L. Deaven
Beginning in Oct. 1997, the Center for Human Genome Studies (CHGS) at Los Alamos National Laboratory (LANL), as part of the Joint Genome Institute (JGI, http://www.jgi.doe.gov/), committed to finishing 2.84 Mb of human sequence in the first year. A steady-state finish rate of 5 Mb / year is projected by the end of the first year. An exhaustive description of the JGI sequence quality standards and sequencing targets is available at http://www.jgi.doe.gov/Docs/JGI_Seq_Quality.html. Prior to Oct 1997, the CHGS had submitted 224 kb of finished sequence to GenBank using only Applied Biosystems software, minimal sample tracking, and little automation. As of Oct 4, 1998, the CHGS had finished 3.2 Mb total and 2.8 Mb unique sequence and reached a finished sequence output of 0.6 Mb / month using phred/phrap/consed finishing tools, select automation, and a sample tracking system in conjunction with a fully redesigned process. The sequencing strategy is 6X plasmid end sequencing using dye terminator chemistry in production, quality gap closure using alternate chemistry / gel conditions in pre-finish, and final gap closure using a combination of primer walking, transposon bombing and small insert libraries in finish. Implementation of this aggressive ramp will be presented, including sequencing strategy design, process analysis, personnel reorganization, automation, informatics, and quality / cost control, with emphasis on the production phase of sequencing.
9. Automation of Finishing at JGI-LLNL
Stephanie Stilwagen, Matt Nolan,
Andre Arellano, Karolyn Burkhart-Schultz, Long Do, Arthur Kobayashi, Brent
Kronmiller, Madison Macht, David Ow, Hoan Phan, Glenda Quan, Melissa Ramirez,
Warren Regala, Christina Sanders, Astrid Terry, Stephan Trong, Nelson Velasco,
Vijay Viswanathan, Anthony Carrano, and Jane Lamerdin
Lawrence Livermore National Laboratory has generated 10.5 Mb of highly accurate, finished genomic sequence of selected large insert clones (e.g. cosmid, BAC, P1) from chromosome 19, of which 8.6 Mb has been completed within the last fiscal year. We were able to achieve this eight-fold increase with the introduction of a suite of finishing tools and web-based computer interfaces which are directly linked to automated robotic workstations. The LLNL sequencing strategy utilizes a 'shotgun' approach to generate our initial sequence data. The next stage of the process is pre-finishing, which involves the selection of clones for re-sequencing for initial gap closure and ambiguity resolution. We have automated pre-finishing by utilizing the LLNL-developed software tool Swedish to select clones for re-sequencing with either the dye primer or dye terminator chemistry. After one round of pre-finishing, a project moves to the finishing phase and is assigned to a finisher. A finisher makes use of multiple software tools in an iterative manner to obtain contiguous sequence that meets our standards for double-strand coverage and sequence quality. Finishing involves closing the remaining gaps, resolving ambiguities, and validating the assembly. Automation of the finishing process makes use of Consed, Swedish, Finnish, web interfaces, and robotic workstations to increase efficiency and throughput. While these tools have had a significant impact on our productivity, additional tools and automation are still necessary to decrease the amount of human intervention required for finishing to meet the challenge of completing the Human Genome by 2003.
This work was performed by Lawrence Livermore National Laboratory under the auspices of the U.S. Department of Energy, Contract No. W-7405-Eng-48.
10. Sequence Validation and Quality Assessment at the Joint Genome Institute
M. Bussod, N. Doggett, J. Fawcett,
D. Ricke, K. Watson, O. Tatum, P.S. White, and M. Mundt
The Joint Genome Institute (JGI) is committed to producing high quality finished sequence data with fewer than 1 error in 10,000 bases. To ensure that we meet these strict criteria the JGI subscribes to a quality control process which requires that greater than 95% of all finished bases have Phrap scores greater than 40 and at least 95% of all bases are covered in reads from both strands (or 2 chemistries). In addition to these quality control criteria, the JGI has implemented post-sequencing Validation and Quality Assessment processes, which occur in 2 phases within the Joint Genome Institute. Sequence Validation occurs at each sequencing site prior to the submission of a sequence and involves comparing the final assembled sequence to 3 independent high-resolution restriction fingerprints. This pre-submission Sequence Validation process ensures that the finished sequence has been assembled correctly. The Quality Assessment process is a post-submission assessment of the sequence produced by the JGI. LANL has the responsibility for performing this Quality Assessment process for all of the sequence produced by the Joint Genome Institute. During the summer of 1998 our group successfully completed one round of sequence quality assessment sponsored by the NIH of 600 kb of finished sequence from 3 NIH centers, and we have recently begun a second round of this assessment involving greater than 1.2 Mb of finished sequence from three centers including the Sanger Institute. Our strategy for the Quality Assessment process is to identify the poorest quality regions within each finished clone and target these for verification. Software tools are being developed to evaluate the quality of clone sequencing projects based on Phred and Phrap scores. In addition, the techniques used and software modules written can be applied to the task of choosing optimal targets for resequencing.
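The score-based criteria above can be made concrete with the standard Phred-scale relationship, in which a quality value Q corresponds to an error probability of 10^(-Q/10). The sketch below is an illustration in Python, not the JGI's Java tools: it estimates the expected number of errors from per-base quality values, checks the two score-based thresholds quoted above, and ranks fixed-size windows by expected errors to suggest the poorest regions. Function names and default parameters are assumptions, and the double-strand coverage criterion is not modeled.

```python
from typing import List, Sequence, Tuple


def error_probability(q: float) -> float:
    """Phred-scaled quality q corresponds to error probability 10^(-q/10)."""
    return 10 ** (-q / 10)


def expected_errors(quals: Sequence[float]) -> float:
    """Sum of per-base error probabilities = expected number of wrong bases."""
    return sum(error_probability(q) for q in quals)


def meets_quality_criteria(quals: Sequence[float], min_score: float = 40,
                           min_fraction: float = 0.95,
                           max_error_rate: float = 1e-4) -> bool:
    """True if more than `min_fraction` of bases score above `min_score`
    and the expected error rate is below `max_error_rate` per base."""
    high = sum(1 for q in quals if q > min_score)
    return (high / len(quals) > min_fraction
            and expected_errors(quals) / len(quals) < max_error_rate)


def poorest_windows(quals: Sequence[float], window: int,
                    top_n: int) -> List[Tuple[int, float]]:
    """Return the `top_n` windows (start, expected_errors), worst first."""
    scored = [(start, expected_errors(quals[start:start + window]))
              for start in range(0, len(quals) - window + 1)]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Made-up example: 96 high-quality bases followed by 4 poor ones.
quals = [50] * 96 + [20] * 4
passes = meets_quality_criteria(quals)
worst = poorest_windows(quals, window=4, top_n=1)
```

In the made-up example, 96% of bases exceed a score of 40, yet the four low-quality bases dominate the expected error count, which is why ranking windows by expected errors is a natural way to pick verification targets.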
Base calling and structural assembly errors can be identified by using PCR, for example, and sequencing if necessary. Determination of sequence error probability is based on the Phrap values of the consensus bases where each base is given a P-value, the probability of the base being incorrect, depending on its quality. If the data is given in the form of a histogram, the calculation of the probability values for each clone project is dependent on the proportion of bases within each quality range. We used this technique to find good candidates for our JGI validation effort without requiring the full set of quality values. However, if the Phrap value of every base is available, a more accurate prediction of error rate is possible. In this case, sliding windows of consecutive bases can also be evaluated to detect regions with higher error rates and design targets for resequencing. In either case, correction factors can also be applied to the error calculations to account for the supposed conservative nature of the Phrap scoring system. The approaches described above are among those being compiled into a set of Java tools whose uses extend beyond just validation. Finishing requirements often mirror the needs of a quality assessment project. Right now, we use a similar version of a Java filter around a Primer3-based program to select oligonucleotide sequences for both finishing primer walks and validation PCR primers. In the NIH QA exercise, our success rate for getting PCR products was about 85%, even though we targeted more difficult regions to sequence. We recently received over 130,000 BAC end sequences from two centers to evaluate the DOE-funded BAC end sequencing effort. Studying these should be a new, exciting challenge with great potential benefit to the sequencing community.
Supported by USDOE under contract W-7405-ENG-36.
11. LANL Finishing Team Accomplishments in FY98
J. Buckingham, L. Goodwin, C. Munk,
L. Saunders, S. Thompson, S. Ueng, D. Ricke, and M. Mundt
In response to the Joint Genome Institute's goal of finishing 20 MB of high quality sequence, the Finishing Team was formed at Los Alamos National Laboratory's Center for Human Genome Studies. The task for this diverse group of biologists, computer scientists and mathematicians was to design an efficient process to quickly close clone projects and bring the sequence quality up to a high standard. The tools to do this job were largely untested and disorganized, and many new protocols and strategies had to be formulated to address problems, even as the "conditions of contest" changed over the past year. Timely feedback to reduce unnecessary work was also an important factor to our success. Initially, LANL's sequencing capacity was directed at the double-ended plasmid SASE approach, using TAQ and later TAQ-fs. Two major improvements were switches to BigDye terminator reactions for production and ET dye primer chemistry on ABI 373's for finishing. The boost in quality from these two process changes was quite evident using both our own base caller and Phred. Phred, Phrap, and Consed had not previously been used at LANL, so several technicians went through UNIX training to become specialists at interacting with these programs. Auxiliary Java programs were also designed to analyze Phrap assembly structure and to suggest finishing reactions consisting of dye primer redos and primer walks. A paper trail system was converted to an automatic submission system for our oligo synthesizers. Following the lead of the production crew, we now use halfTERM (Genpak,Ltd.) with BigDye for our primer walks. In addition, we are adding DMSO to the formulation to improve the reactions. We have also used shatter libraries successfully and transposons not so successfully to close final difficult gaps that exist due to, for example, high GC content. Part of our automation schemes included the use of Hydras and multichannel pipettors for setting up finishing reactions. 
We are now streamlining our approach to efficiently address the issues involved with "draft" sequencing. We have defined a prefinishing step based on our "Strand Gap" report that will also help evaluate cost functions to feed back to our production team to determine the level of shotgun required. Research plans include trying halfTERM with dye primer reactions and working with the Mermade oligo synthesizer that should be delivered in the next few months. We are also investigating the potential benefits of programming robots to select templates for reaction set ups and weighing these against potential disadvantages such as reduced quality. We will present relevant statistics to demonstrate the quality of our finishing reactions and their utility in alignments to our final consensus. This work contributed to the completion of 2.8 Mb of sequence in FY98.
Supported by US DOE under contract W-7405-ENG-36.
12. JGI-LANL Sequencing Cost Reduction and Quality Improvement: R&D Results
Owatha L. "Tootie" Tatum and P. Scott White
With recent dramatic increases in JGI's sequencing effort, the need to improve efficiency and reduce costs while maintaining high quality standards is of utmost importance. To this end, JGI and other large-scale genome sequencing facilities have recognized that an active research and development team is vital to their success. LANL sequencing R&D goals include improvements in sequencing reactions, in the form of modifications of existing systems, and investigation into and development of new sequencing technologies and automation systems. As part of the sequencing effort for the JGI, LANL has placed sequencing research and development as an important priority. In efforts to reduce costs, several modifications of existing chemistries have been examined, resulting in striking reductions in cost with actual improvements in read length and sequence quality. The protocols resulting from these R&D efforts have been implemented in the LANL production sequencing and finishing efforts with great success. Sequence obtained from difficult templates (e.g. BAC DNA) has been improved dramatically as a result of chemistry R&D as well. While improvements in chemistry have had the most immediate impact on cost, LANL has also focused on quality control and automation issues to further streamline the sequencing process. Commercially available automation equipment has been implemented into the production process line with a considerable saving of technician hands-on time. In addition to time/cost savings, high throughput automated systems have also been implemented to improve quality control early in the sequencing process. All aspects of sequencing R&D conducted by Los Alamos to date will contribute to the work at the Production Sequencing Facility and may be of interest to other large-scale sequencing facilities as well.
13. Concatenation cDNA Sequencing and Analysis of 500 Human Brain cDNA Clones
Wei Yu, John Bouck, James H. Gorrell, Donna
M. Muzny, and Richard A. Gibbs
Using a shotgun-based strategy entitled Concatenation cDNA Sequencing (CCS), we have completed sequencing of 503 randomly selected cDNA clones with a total length of 807 kb from a Homo sapiens brain cDNA library (1NIB). All sequence data have been annotated and submitted to GenBank. The statistics from completed projects have shown that CCS is as efficient as sequencing a single large DNA fragment: the number of reads/kb ranges from 13 to 21 with an average of 16.8, and the number of primers/kb ranges from 0.62 to 1.8 with an average of 1.02. Computer analysis was performed to search for similarity against the public databases. Of the 471 clone sequences used for DNA similarity searches, 255 (54%) did not match any sequence in the non-redundant database. The remaining 216 matched previously defined sequences or known genes from human or other organisms. Of the 471 clone sequences, 230 clones (48.9%) possess putative complete or incomplete open reading frames with a minimal length of 100 amino acids. When all 471 cDNA sequences were compared to the protein sequences in the database, 255 could not be assigned definitely to any known protein. Of the remaining 216 clones, 145 displayed similarities to previously deposited protein sequences, providing a consistent search result between the nucleic acid and amino acid data from each clone. There were 71 clones that failed to reveal any protein match despite their corresponding DNA similarity matches with database entries. To determine the amount of unique information that our cDNA clone sequences were adding to the database, we examined the distribution of 243 clones which have been incorporated into the UniGene database maintained by the NCBI. When the 243 cDNA sequences were compared to the representative sequences from the UniGene database, we found that 10 cDNA sequences contained weak matches to a representative clone but were not included in UniGene clusters.
Of the 233 clusters that were matched, nearly all contained multiple sequences. However, when the same 233 clone sequences were compared to the mRNA/gene sequences in each cluster, 143 clusters (61%) contained only a single mRNA/gene sequence, namely our cDNA sequence. The majority of the cDNA clones were found in small clusters with only a few other mRNAs or ESTs.
14. Cosmid Finishing and Full Insert cDNA Sequencing Using Differential Extension with Nucleotide Subsets (DENS)
D. Zevin-Sonkin1,2, H. Hovhanissyan1, A. Ghochikyan1, L. Lvovsky1, A. Liberzon1, M.C. Raja1,3, E. Ben-Asher2, G. Glusman2, D. Lancet2, and L.E. Ulanovsky1,3
Differential Extension with Nucleotide Subsets (DENS) is essentially primer walking without primer synthesis (Raja et al., 1997, NAR 25, pp. 800-805). DENS works by converting a short primer (selected from a presynthesized library of 8-mers with 2 degenerate bases each) into a long one on the template at the intended site only. DENS starts with a limited initial extension of the primer (at 20 °C) in the presence of only 2 out of the 4 possible dNTPs. The primer is extended by 5 bases or more at the intended priming site, which is deliberately selected, as is the two-dNTP set, to maximize the extension length. The subsequent termination (sequencing) reaction at 60 °C then accepts the primer extended at the intended site, but not at alternative sites where the initial extension (if any) is generally much shorter. We use DENS for cosmid finishing and have tested it for full insert cDNA sequencing. The templates for cosmid finishing by DENS (7-8 overlapping fragments, ~5 kb each) were PCR amplified from the cosmid. The PCR products were made single-stranded using Lambda Exonuclease (Exo-PCR). If one of the two primers in the PCR is phosphorylated, the Exo digestion leaves the opposite strand single-stranded. The 8-mer primers for DENS sequencing were selected using our dedicated software. The DENS approach resulted in approximately a three-fold reduction in the cost and time of finishing compared to the strategy used before: additional shotguns combined with custom synthesized primer walking on the whole cosmid and/or PCR fragments. DENS primer walking seems to be tailor-made for full length cDNA sequencing, as the absence of the primer synthesis step facilitates closed-loop automation of primer walking with the benefit of unattended operation. In a pilot experiment we used DENS and "Exo-PCR" for sequencing both strands of four cDNA clones containing inserts of 1.9, 2.3, 3.8 and 4.9 kb. The success rate of the DENS sequencing reactions was 72%, yielding 27,864 base-calls.
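The two-dNTP initial extension described above can be illustrated with a minimal sketch. It is not the authors' dedicated primer-selection software; the function names and example sequences are hypothetical, and the model assumes an idealized polymerase that adds bases until it needs a dNTP that is absent from the reaction.

```python
from itertools import combinations


def extension_length(downstream: str, dntps: set) -> int:
    """Count bases added to the primer before a missing dNTP is required.

    `downstream` is the sequence to be synthesized immediately 3' of the
    primer (written 5'->3'); `dntps` is the set of nucleotides supplied.
    """
    n = 0
    for base in downstream.upper():
        if base not in dntps:
            break  # polymerase stalls: the required dNTP is not in the mix
        n += 1
    return n


def best_dntp_pair(downstream: str):
    """Return the two-dNTP set giving the longest extension, and its length."""
    best = max(
        combinations("ACGT", 2),
        key=lambda pair: extension_length(downstream, set(pair)),
    )
    return best, extension_length(downstream, set(best))


# Hypothetical intended site: extends 7 bases with dATP + dGTP.
print(best_dntp_pair("GAGGAAGTCC"))
# Hypothetical alternative site: the same dNTP pair extends only 2 bases.
print(extension_length("GATTACA", {"A", "G"}))
```

In this toy case the intended site yields a 7-base extension while the alternative site stalls after 2 bases, which is the kind of length difference the higher-temperature termination reaction is said to exploit.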
The median PHRED quality value was 40, corresponding to an error probability of approximately one per 10,000. The plotted distribution showed that base-calls with PHRED values less than 20 occurred only 1% of the time.
14a. Complete sequence analysis of 918 human cDNA clones harboring long and nearly full-length inserts
Nobuo Nomura, Takahiro Nagase, Ken-ichi Ishikawa, Reiko Kikuno, Mikita Suyama, Nobuyuki Miyajima, Ayako Tanaka, Hirokazu Kotani, and Osamu Ohara
One of the goals of the Kazusa human cDNA project is to accumulate and exploit information on coding sequences of unidentified human cDNA clones harboring long and nearly full-length inserts. We have so far determined the entire sequences of 918 clones (KIAA0001-KIAA0918) with an average size of 5.0 kb. Among the clones, 268 were obtained from the human immature myeloid cell line KG-1 and 650 from human brain. All the KG-1 and 25 brain cDNA clones were selected under the criterion that the clones carry inserts with at least 90% of the length of the corresponding transcripts. As another novel approach, 588 brain clones were selected based upon their ability to produce proteins in vitro with a molecular weight larger than 50 kDa. Since approximately 50% of the cDNA clones isolated by either method were found to retain an in-frame termination codon upstream of the first ATG codon, it was speculated that more than half of the clones analyzed harbored a complete ORF. We concluded that clones with a complete ORF can be selected efficiently by either of the procedures. It turned out that 750 out of 918 clones encoded proteins larger than 50 kDa. By computer analysis, we successfully assigned two thirds of the 750 clones to functional categories such as genes for cell signalling/communication (198 clones, 26.3%), cell structure/motility (106 clones, 14.2%), nucleic acid managing (112 clones, 15%), protein managing (30 clones, 4%), metabolism (17 clones, 2.3%), and cell division (9 clones, 1.2%). Database searches also revealed that most of the ESTs currently registered fell within 2 kb of the 3'-end of our cDNA sequences. When the expression profiles of the cDNA clones were examined among over a dozen human tissues, approximately 80% of the clones from KG-1 and 20% of the clones from brain were expressed ubiquitously. The chromosomal locations of these clones were also determined.
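The criterion mentioned above, an in-frame termination codon upstream of the first ATG, can be checked with a short sketch. This is an illustrative simplification rather than the project's actual analysis pipeline: the function name is hypothetical, the input is assumed to be the sense strand of the cDNA, and the first ATG found in the sequence is taken as the candidate start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def has_upstream_inframe_stop(cdna: str) -> bool:
    """Return True if a stop codon lies upstream of, and in frame with,
    the first ATG in `cdna` (sense strand, 5'->3')."""
    seq = cdna.upper()
    atg = seq.find("ATG")
    if atg < 0:
        return False  # no candidate start codon at all
    # Step back from the ATG one codon at a time, staying in its frame.
    for pos in range(atg - 3, -1, -3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return True
    return False


# Hypothetical 5' ends: in the first, TAG sits two codons upstream in frame;
# in the second, the same TAG is shifted out of frame by one base.
print(has_upstream_inframe_stop("CCTAGCCCATGGCC"))  # True
print(has_upstream_inframe_stop("CCTAGCCATGGCC"))   # False
```

An upstream in-frame stop suggests that translation cannot have started further 5' in that frame, which is why it is used as evidence that the clone contains the complete ORF.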