Shotgun sequencing


In genetics, shotgun sequencing, also known as shotgun cloning, is a method used for sequencing long DNA strands. It is named by analogy with the rapidly expanding, quasi-random firing pattern of a shotgun. The technique was developed in the 1970s by two-time Nobel laureate Frederick Sanger.

Since the chain termination method of DNA sequencing can only be used for fairly short strands (100 to 1000 base pairs), longer sequences must be subdivided into smaller fragments and subsequently reassembled to give the overall sequence. Two principal methods are used for this: chromosome walking, which progresses through the entire strand piece by piece, and shotgun sequencing, which uses random fragments and is faster but more complex.

In shotgun sequencing [1] [2], DNA is broken up randomly into numerous small segments, which are sequenced using the chain termination method to obtain reads. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence [1].
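The fragmentation step can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the function name `shotgun_reads`, its parameters, and the toy target string are hypothetical, every fragment is treated as a perfectly sequenced read, and real fragmentation is a physical or enzymatic process rather than a string operation. Each round cuts the target at different random breakpoints, which is what makes reads from different rounds overlap.

```python
import random


def shotgun_reads(target, n_rounds, min_len, max_len, seed=0):
    """Cut `target` into consecutive random-length pieces, once per round.

    Hypothetical helper for illustration only. The last piece of a round
    may be shorter than `min_len` because it is whatever remains.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    reads = []
    for _ in range(n_rounds):
        pos = 0
        while pos < len(target):
            size = rng.randint(min_len, max_len)
            reads.append(target[pos:pos + size])
            pos += size
    return reads


if __name__ == "__main__":
    toy_target = "AGCATGCTGCAGTCATGCTTAGGCTA"  # illustrative sequence
    for read in shotgun_reads(toy_target, n_rounds=2, min_len=5, max_len=12):
        print(read)
```

Because each round partitions the whole target, two rounds give roughly twofold coverage of every base, with breakpoints that generally differ between rounds.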

Shotgun sequencing was one of the precursor technologies that made full genome sequencing possible.



For example, consider the following two rounds of shotgun reads:

In this extremely simplified example, none of the reads covers the full length of the original sequence, but the four reads can be assembled into the original sequence by using the overlap of their ends to align and order them. In reality, this process involves enormous amounts of data that are rife with ambiguities and sequencing errors. Assembly of complex genomes is further complicated by the great abundance of repetitive sequence, which means that similar short reads could come from completely different parts of the genome.
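The idea of ordering reads by the overlap of their ends can be sketched as a greedy merge: repeatedly join the two reads with the longest suffix-to-prefix overlap. This is a toy heuristic, not the method used by production assemblers, and it assumes error-free reads with no repeats. The function names, the `min_len` threshold, and the four reads (two hypothetical rounds of two reads each) are illustrative assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def drop_contained(reads):
    """Remove duplicates and reads that lie entirely inside another read."""
    unique = list(dict.fromkeys(reads))
    return [r for r in unique if not any(r != s and r in s for s in unique)]


def greedy_assemble(reads, min_len=3):
    """Merge reads greedily by best overlap; returns the remaining contigs."""
    contigs = drop_contained(reads)
    while len(contigs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_n:]
        rest = [r for k, r in enumerate(contigs) if k not in (best_i, best_j)]
        contigs = drop_contained(rest + [merged])
    return contigs


if __name__ == "__main__":
    # Illustrative reads only: two rounds of fragmentation of a toy sequence.
    round_1 = ["AGCATGCTGCAGTCATGCT", "TAGGCTA"]
    round_2 = ["AGCATG", "CTGCAGTCATGCTTAGGCTA"]
    print(greedy_assemble(round_1 + round_2))
```

With repeats or sequencing errors, the longest overlap is not necessarily the correct one, which is why a greedy merge like this breaks down on real genomes.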

Many overlapping reads for each segment of the original DNA are necessary to overcome these difficulties and accurately assemble the sequence. For example, to complete the Human Genome Project, most of the human genome was sequenced at 12X or greater coverage; that is, each base in the final sequence was present, on average, in 12 reads. Even so, current methods have failed to isolate or assemble reliable sequence for approximately 1% of the (euchromatic) human genome.
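Coverage as defined above is the total number of sequenced bases divided by the length of the target. The sketch below computes it and, under an idealized Poisson assumption (reads placed uniformly at random, which real projects only approximate), estimates the fraction of bases that no read covers. The function names and the numbers in the usage example are hypothetical and are not figures from the Human Genome Project.

```python
import math


def mean_coverage(n_reads, read_len, genome_len):
    """Average number of reads covering each base: (N * L) / G."""
    return n_reads * read_len / genome_len


def expected_uncovered_fraction(coverage):
    """Idealized Poisson estimate of the fraction of bases with zero reads.

    Assumes reads land uniformly at random; cloning bias and repeats
    make real gaps larger than this estimate.
    """
    return math.exp(-coverage)


if __name__ == "__main__":
    # Hypothetical project: 72 million reads of 500 bases on a 3 Gb genome.
    c = mean_coverage(72_000_000, 500, 3_000_000_000)
    print(f"coverage: {c:.1f}X")
    print(f"expected uncovered fraction: {expected_uncovered_fraction(c):.2e}")
```

The idealized estimate at high coverage is far below the roughly 1% that remains unresolved in practice, which reflects the repeats and other difficulties described above rather than insufficient read depth.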

Whole genome shotgun sequencing

Whole genome shotgun sequencing for small (4000 to 7000 base pair) genomes was already in use in 1979 [1]. Broader application benefited from pairwise end sequencing, known colloquially as double-barrel shotgun sequencing. As sequencing projects began to take on longer and more complicated DNAs, multiple groups began to realize that useful information could be obtained by sequencing both ends of a fragment of DNA. Although sequencing both ends of the same fragment and keeping track of the paired data was more cumbersome than sequencing a single end of two distinct fragments, the knowledge that the two sequences were oriented in opposite directions and were about the length of a fragment apart from each other was valuable in reconstructing the sequence of the original target fragment.

The first published description of the use of paired ends was in 1990 [3] as part of the sequencing of the human HGPRT locus, although the use of paired ends was limited to closing gaps after the application of a traditional shotgun sequencing approach. The first theoretical description of a pure pairwise end sequencing strategy, assuming fragments of constant length, was in 1991 [4]. At the time, there was community consensus that the optimal fragment length for pairwise end sequencing would be three times the sequence read length. In 1995 Roach et al. [5] introduced the innovation of using fragments of varying sizes, and demonstrated that a pure pairwise end sequencing strategy would be possible on large targets. The strategy was subsequently adopted by The Institute for Genomic Research (TIGR) to sequence the genome of the bacterium Haemophilus influenzae in 1995 [6], then by Celera Genomics to sequence the Drosophila melanogaster (fruit fly) genome in 2000 [7], and subsequently the human genome.
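The constraint that paired ends provide (opposite orientation, about one fragment length apart) can be expressed as a simple consistency check on a proposed assembly. The sketch below is illustrative only: the `Placement` type, the `mates_consistent` function, and the tolerance parameter are hypothetical and do not describe the interface of any particular assembler.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    """Where a read sits on an assembled contig (hypothetical structure)."""
    start: int      # leftmost 0-based position of the read
    length: int     # read length in bases
    forward: bool   # True if the read aligns to the forward strand


def mates_consistent(a, b, fragment_len, tolerance):
    """Check that two paired-end reads agree with the expected geometry.

    The reads must lie on opposite strands, and the span from the start of
    the forward read to the end of the reverse read must be within
    `tolerance` of the expected fragment length.
    """
    if a.forward == b.forward:
        return False
    fwd, rev = (a, b) if a.forward else (b, a)
    span = (rev.start + rev.length) - fwd.start
    return abs(span - fragment_len) <= tolerance


if __name__ == "__main__":
    left = Placement(start=100, length=500, forward=True)
    right = Placement(start=1600, length=500, forward=False)
    print(mates_consistent(left, right, fragment_len=2000, tolerance=200))
```

A proposed layout in which many pairs fail this check is likely misassembled, for example because two copies of a repeat were collapsed into one.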

