Mouse Genetics: Concepts & Applications (Full Table of Contents)

Copyright ©1995 Lee M. Silver

8. Genetic Markers

8.1 Genotypic and phenotypic variation

8.2 Restriction Fragment Length Polymorphisms (RFLPs)

8.2.1 Molecular basis for RFLPs

8.2.2 Choice of restriction enzymes to use for RFLP detection

8.2.3 Minisatellites: Variable Number Tandem Repeat (VNTR) loci

8.2.4 Dispersed multi-locus analysis with cross-hybridizing probes

8.2.5 Restriction landmark genomic scanning

8.3 Polymorphisms detected by PCR

8.3.1 Restriction site polymorphisms

8.3.2 Detection of allelic changes defined by single basepairs

8.3.3 Single Strand Conformation Polymorphism (SSCP)

8.3.4 Random Amplification of Polymorphic DNA (RAPD)

8.3.5 Interspersed repetitive sequence (IRS) PCR

8.3.6 Microsatellites: Simple Sequence Length Polymorphisms

8.4 Region-specific panels of DNA markers

8.4.1 Chromosome microdissection

8.4.2 Chromosome sorting by FACS

8.4.3 Somatic cell hybrid lines as a source of fractionated material

8.4.4 Miscellaneous approaches

8.1 Genotypic and phenotypic variation

Linkage analysis can only be performed on loci that are polymorphic with two or more distinguishable alleles. Naturally-occurring polymorphic loci with clear single-gene effects are rarely observed in wild animal populations. In the laboratory, however, it is possible to identify and breed animals with mutations at many different loci (see chapter 6). Over the last 90 years, thousands of independent mouse mutations have been characterized in various laboratories. In fact, as discussed previously, a primary reason for the initial choice of the mouse as an experimental genetic system was the collection of rare genetic variants present in the hands of the fancy mouse breeders (see chapter 1). But, even this variation is restricted in its scope and usefulness for geneticists. This is because of the severe limitation in the number of phenotypic markers that can be incorporated into any one cross. Although over 50 independent loci have been characterized with effects on coat color (Silvers, 1979), it is impossible to follow more than a handful at any one time since mutant alleles at any one locus will act to obscure the expression of mutant alleles at other loci. With mutant alleles that affect viability in some way, the problem of sorting out overlapping phenotypes becomes even more severe.

Pre-recombinant-DNA geneticists were able to circumvent these problems by performing large numbers of different crosses, each of which tested overlapping subsets of phenotypic markers. Thus, as illustrated in figure 8.1, the loci A, B, and C were mapped in one cross; C, D, and E were mapped in a second cross, and A, D, and F were mapped in a third cross.

If linkage was observed in each of these individual crosses among the three loci mapped therein as shown, it was then possible to develop a linkage map that encompassed all six loci even though, for example, the B locus was never mapped directly relative to D, E or F; and A was never mapped relative to E. By extension, it is possible to combine data obtained in hundreds of crosses to map hundreds of phenotypically-defined loci to form linkage maps that extend across all nineteen mouse autosomes and the X chromosome.
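The bookkeeping behind this example can be sketched in a few lines of Python. This is only an illustration of the three crosses named above; the function names are hypothetical, and the sketch lists which locus pairs were scored together, not map distances.

```python
from itertools import combinations

# Loci scored in each of the three crosses described in the text.
crosses = [{"A", "B", "C"}, {"C", "D", "E"}, {"A", "D", "F"}]

def directly_tested_pairs(crosses):
    """Return every pair of loci that segregated together in at least one cross."""
    pairs = set()
    for loci in crosses:
        pairs.update(combinations(sorted(loci), 2))
    return pairs

def untested_pairs(crosses):
    """Return the locus pairs that were never scored in the same cross."""
    all_loci = sorted(set().union(*crosses))
    every_pair = set(combinations(all_loci, 2))
    return sorted(every_pair - directly_tested_pairs(crosses))

print(len(directly_tested_pairs(crosses)))  # pairs with direct linkage data
print(untested_pairs(crosses))              # pairs placed only by inference
```

Of the fifteen possible pairs among six loci, nine are tested directly and six (including B with D, E, or F, and A with E) are placed on the map only through shared loci such as C, A, and D.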

Mapping in the pre-recombinant DNA era was tedious and was generally performed by investigators dedicated to this task alone. However, with the results of the first generation of cloning and sequencing studies, the scientific community became aware of the existence of a hidden level of enormous genetic variation that occurs naturally in all mammalian populations (Botstein et al., 1980). The frequency of DNA variation that exists between chromosome homologs from two unrelated individuals of the same species (including mice and humans) appears to be on the order of one nucleotide substitution or small length change in every 200 to 500 bp. Since the mammalian genome has a size of 3 x 10^9 bp, this frequency implies a total number of genetic differences between any two unrelated individuals on the order of six million per haploid genome set. In a comparison of individuals from separate species, such as M. musculus and M. spretus, the level of variation will be even higher.
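The arithmetic behind this estimate is simple enough to check directly. The sketch below uses only the genome size and the 200 to 500 bp range quoted above; the function name is hypothetical.

```python
GENOME_BP = 3e9  # haploid mammalian genome size quoted in the text

def expected_differences(genome_bp, bp_per_difference):
    """Expected number of sequence differences between two haploid genomes."""
    return genome_bp / bp_per_difference

for spacing in (200, 500):
    n = expected_differences(GENOME_BP, spacing)
    print(f"one difference per {spacing} bp -> {n:,.0f} differences")
```

The figure of six million corresponds to the conservative end of the range (one difference per 500 bp); at one difference per 200 bp the same calculation gives fifteen million.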

In the pre-recombinant DNA era, alleles could only be distinguished in terms of an altered phenotype; thus only genes could have alleles, and the demonstration of a genetic locus was dependent on the expression of alternative phenotypes. Today, every variant nucleotide in the genome is a potential locus. To say that DNA variation provides a larger reservoir for use in genetic studies than phenotypic variation is a vast understatement. Furthermore, and of most importance, there is essentially no limit to the number of these loci that can be mapped simultaneously within a single cross.

All simple forms of DNA variation fall into three classes: (1) base-pair substitutions, (2) short regions of deletion or tandem duplication, and (3) insertions or translocations. Examples from each of these classes can be detected as restriction fragment length polymorphisms (RFLPs) and/or by PCR-based protocols. These major tools for DNA allele detection will be discussed separately in the following two sections of this chapter.

8.2 Restriction Fragment Length Polymorphisms (RFLPs)

8.2.1 The molecular basis for RFLPs

A Restriction Fragment Length Polymorphism (RFLP) is defined by the existence of alternative alleles associated with restriction fragments that differ in size from each other. RFLPs are visualized by digesting DNA from different individuals with a restriction enzyme, followed by gel electrophoresis to separate fragments according to size, then blotting and hybridization to a labeled probe that identifies the locus under investigation. A RFLP is demonstrated whenever the Southern blot pattern obtained with one individual is different from the one obtained with another individual. An illustration of a result of this type is shown in figure 8.2. In this example, DNA samples from five individual mice were digested with the same enzyme, electrophoresed side-by-side on a gel, and probed with the same clone of a single-copy DNA sequence. The five patterns detected are all different from each other and are representative of five different genotypes. The simplest interpretation of data of this type is that the first and last samples shown in the figure are homozygous for a different restriction fragment while the middle samples are all heterozygous with different combinations of alleles. This simple interpretation can be tested and confirmed (or rejected) by simply breeding each of the animals to mates with different genotypes at this "RFLP locus" so that segregation of the two restriction fragment alleles can be demonstrated from putative heterozygotes and uniform transmission of the same restriction fragment allele can be demonstrated from putative homozygotes.

RFLPs were the predominant form of DNA variation used for linkage analysis until the advent of PCR. Even now, in the PCR age, RFLPs provide a convenient means for turning an uncharacterized DNA clone into a reagent for the detection of a genetic marker. The main advantage of RFLP analysis over PCR-based protocols is that no prior sequence information, nor oligonucleotide synthesis, is required. Furthermore, in some cases, it may not be feasible to develop a PCR protocol to detect a particular form of allelic variation. Nevertheless, if and when a PCR assay for typing a particular locus is developed, it will almost certainly be preferable over RFLP analysis for the reasons to be described in section 8.3.

The detection of a RFLP, in-and-of-itself, does not provide information as to the mechanism by which it was created. Although the different-sized restriction fragments shown in figure 8.2 can be followed readily in a genetic cross, one cannot tell, from these data alone, how they differ from each other at the molecular level. In fact, RFLPs can be generated by all of the mechanisms through which DNA variation can occur. The simplest RFLPs are those caused by single base-pair substitutions. However, RFLPs can also be generated by the insertion of genetic material, such as transposable elements, or by tandem duplications, deletions, translocations, or other rearrangements.

Several different mechanisms of RFLP generation are illustrated in Fig. 8.3. In this set of hypothetical examples, the first chromosome represents the ancestral state, and chromosomes 2, 3, and 4 represent different mutations from this state. In this example, DNA has been digested with the enzyme Taq I (with a recognition site of TCGA), fractionated and probed with a clone that recognizes the region shown in the figure. The Southern blot results that would be obtained with animals that carry representative pairs of these chromosomes are shown in figure 8.2. The length of the restriction fragment that will be observed with each chromosome type is indicated by the boxed-in region. Chromosome 1 has TaqI sites that flank the probed region at a distance of 4 kb from each other. In chromosome 2, the right-flank TaqI site has been mutated (the base substitution is marked with a *); a previously more distal TaqI site now becomes the new flanking site, leading to the production of a 5 kb restriction fragment. In chromosome 3, a mutation has occurred with an opposite effect, causing the creation of a TaqI site where none existed before; this new TaqI site becomes the left-flank site, leading to the production of a smaller restriction fragment. Finally, in chromosome 4, an insertion has occurred within the region between the two flanking TaqI sites, leading to an actual increase in the length of the region between these same two sites. More complicated scenarios can be built upon these simple examples with restriction sites created or removed from within the probed region itself, or with new restriction sites brought-in with inserted DNA elements. A final class of RFLPs is commonly generated through the expansion and contraction of families of tandemly repeated DNA elements as illustrated in figure 8.4. Loci having this type of organization are referred to as minisatellites, or VNTRs, and will be discussed separately in section 8.2.3.
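The three mechanisms can be mimicked with a toy in-silico digest. The sketch below is a scaled-down illustration, not a reconstruction of figure 8.3: the sequences are invented, 100 bp of toy sequence stands in for 1 kb, and the sizes chosen for chromosomes 3 and 4 are assumptions (the text gives sizes only for chromosomes 1 and 2). All function names are hypothetical.

```python
TAQ_I = "TCGA"

def probed_fragment_length(seq, probe_pos, site=TAQ_I):
    """Length of the restriction fragment that contains probe_pos, or None."""
    cuts = [i for i in range(len(seq) - len(site) + 1) if seq[i:i + len(site)] == site]
    left = max((c for c in cuts if c <= probe_pos), default=None)
    right = min((c for c in cuts if c > probe_pos), default=None)
    if left is None or right is None:
        return None
    return right - left

def substitute(seq, pos, base):
    """Return seq with a single base substitution at pos."""
    return seq[:pos] + base + seq[pos + 1:]

# Ancestral chromosome: Taq I sites at 100, 500 and 600, plus a near-site (TCAA) at 200.
chrom1 = "A" * 100 + "TCGA" + "A" * 96 + "TCAA" + "A" * 296 + "TCGA" + "A" * 96 + "TCGA" + "A" * 100
chrom2 = substitute(chrom1, 501, "T")             # destroys the right-flank site
chrom3 = substitute(chrom1, 202, "G")             # creates a new left-flank site
chrom4 = chrom1[:400] + "A" * 150 + chrom1[400:]  # insertion between the flanking sites

PROBE = 300  # position inside the probed region
for name, chrom in [("1", chrom1), ("2", chrom2), ("3", chrom3), ("4", chrom4)]:
    print(name, probed_fragment_length(chrom, PROBE))
```

With this layout the ancestral fragment is 400 units (standing for 4 kb), loss of the right-flank site extends it to 500, the new left-flank site shortens it to 300, and the insertion lengthens it to 550.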

Attempts to identify RFLPs between different inbred strains of mice often meet with limited success even after testing with large numbers of enzymes. In one study, RFLPs were identified at only 30% of the single copy loci tested with 22 different restriction enzymes (Knight and Dyson, 1990). Furthermore, when RFLPs are identified, they are almost always di-allelic binary systems — the insertion, deletion, or restriction site change is either present or absent. Unfortunately, di-allelic loci can only be mapped in crosses where the two parental chromosomes carry the two alternative alleles. Thus, even if a RFLP is identified between two inbred strains of mice, there is no guarantee that another pair of strains will also happen to carry alternative alleles. As a consequence, only a subset of the RFLP markers developed for analysis of one cross between traditional mouse strains will be of use for mapping in a cross between any other pair of inbred strains.

A major leap in mouse genetics came with the observation of an extremely high rate of RFLP detection between the common M. musculus-based inbred strains and the independent species M. spretus. As described in section 2.3, these species can breed under laboratory conditions to produce interspecific F1 hybrids. Although F1 males are sterile, the F1 females are fertile and they can be backcrossed to either parent to obtain offspring which can be analyzed for linkage (Bonhomme et al., 1978; Avner et al., 1988). More recently, the observation of an increased rate of RFLPs has been extended to comparisons between the inbred strains and wild-derived samples of M. m. castaneus which are more closely related to each other than either is to M. spretus (Figure 2.2). The pros and cons of performing genetic studies with interspecific or intersubspecific crosses are discussed in section 9.4.2.1.

8.2.2 Choice of restriction enzymes to use for RFLP detection

With so many restriction enzymes available, how does one decide which ones are the best to use in the search for RFLPs? Obviously, cost is an important consideration. Another consideration is whether the enzyme is optimally active with genomic DNA obtained from animal tissues. However, a critical consideration is the rate at which RFLPs can be detected based on the enzyme that is chosen.

A systematic study of RFLP detection between B6 and M. spretus DNA subsequent to digestion with one of ten different enzymes has been reported by LeRoy et al. (1992). One hundred and ten anonymous DNA sequences of less than 4 kb in length were used as probes. The highest rate of RFLP detection (63%) was observed with DNA digested with Taq I. The second highest rate (56%) was observed with Msp I. In decreasing order of effectiveness were the enzymes Bam HI (50%), Xba I (47%), Pst I (44%), Bgl II (41%), Hind III (39%), Pvu II (38%), Rsa I (38%), and Eco RI (33%). It is ironic that of the ten enzymes tested, the one most commonly used in molecular biological research, Eco RI, was the worst at detecting polymorphisms.

A theoretical explanation for the observation that TaqI and MspI are more likely than other enzymes to detect RFLPs can be found in the dinucleotide CpG which is at the center of both recognition sites. This dinucleotide is unusual in two respects. First, it is present in mammalian genomes at a frequency one-fifth of that expected from base composition alone. Second, when it is present, the cytosine within the dinucleotide is usually methylated. As it turns out, the latter fact explains the former because methylated cytosine has a propensity to undergo spontaneous deamination to form thymidine. This complete transition is not recognized as abnormal by the repair machinery present in mammalian cells, and thus methylated-CpG dinucleotides serve as one-way hotspots for mutation (Barker et al., 1984). As a consequence, the CpG dinucleotide is relatively rare, and when it does occur in a methylated form, it is more likely to mutate than any other dinucleotide. Even an unmethylated CpG can undergo a spontaneous mutation from cytosine to uracil; however, this abnormal nucleotide is more likely to be recognized and repaired. Nevertheless, in those few cases where repair does not occur, the uracil will basepair with an adenosine in the following round of DNA replication, leading to the same substitution as found with methylated CpGs.

Thus, TaqI and MspI are the most useful enzymes for the identification of RFLPs. Both enzymes recognize 4-bp sites: TaqI recognizes TCGA and MspI recognizes CCGG. If nucleotides were randomly distributed across the genome, TaqI and MspI sites would be distributed at average distances of 270 bp and 514 bp respectively. However, as a consequence of the paucity of CpG dinucleotides, these two restriction enzyme sites are actually found much less frequently in mammalian DNA. Empirical data indicate restriction fragment size distributions that average 2.9 and 3.5 kb for TaqI and MspI respectively (Barker et al., 1984).
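The expected spacings of 270 bp and 514 bp can be reproduced with a short calculation. The sketch below assumes independent bases and a G+C content of 42%; that value is an assumption not stated in the text, chosen because it reproduces the quoted figures. The function name is hypothetical.

```python
def expected_site_spacing(site, gc_content):
    """Mean distance (bp) between occurrences of site if bases were independent."""
    freq = {
        "G": gc_content / 2, "C": gc_content / 2,
        "A": (1 - gc_content) / 2, "T": (1 - gc_content) / 2,
    }
    probability = 1.0
    for base in site:
        probability *= freq[base]
    return 1.0 / probability

GC = 0.42  # assumed G+C fraction, not given in the text
for enzyme, site in [("TaqI", "TCGA"), ("MspI", "CCGG")]:
    print(enzyme, site, round(expected_site_spacing(site, GC)))
```

With equal base frequencies (G+C of 50%) both sites would be expected every 256 bp; the difference between the two enzymes comes entirely from the A+T-rich composition assumed here. The empirical averages of 2.9 and 3.5 kb are roughly ten and seven times larger than these expectations, reflecting the depletion of CpG.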

In practice, the enzyme TaqI is the better choice of the two for use in RFLP analysis. It is relatively cheap and it works well with animal DNA samples that other enzymes refuse to cut (presumably aided by the high temperature at which the digestion is carried out). MspI is somewhat more sensitive to contaminants within animal tissue DNA samples, but is a good second choice. When the results obtained with Taq I and Msp I were combined, the Guénet group detected RFLPs at 74% of the loci tested for variation between spretus and musculus (LeRoy et al., 1992). When the results obtained with Xba I were added in, 79% of the loci were polymorphic. When the results obtained with the remaining 7 enzymes were included, RFLPs were detected at 83% of the loci. The take-home lesson from this study is that it is most cost-effective to search for RFLPs on standard 1% agarose gels with just three enzymes: Taq I, Msp I, and Xba I. If the search is unsuccessful at this point, it would appear that the locus under analysis is not highly polymorphic at the DNA level, and in those cases where the locus is just "one more marker," it is probably not worth pursuing further. On the other hand, if the locus is of importance in-and-of-itself, it makes sense to pursue more sensitive, PCR-based avenues of polymorphism detection such as SSCP (section 8.3.3) or linked microsatellites (section 8.3.6).
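The diminishing returns from additional enzymes can be made explicit with the percentages quoted from LeRoy et al. (1992). The sketch below compares the observed combined rate for Taq I and Msp I with what would be expected if the two enzymes detected polymorphisms independently; the independence model is an assumption introduced here for comparison, not a claim made in the study.

```python
def union_if_independent(rates):
    """Fraction of loci detected by at least one enzyme, assuming independence."""
    missed = 1.0
    for rate in rates:
        missed *= (1.0 - rate)
    return 1.0 - missed

taq, msp = 0.63, 0.56          # single-enzyme detection rates from the text
observed_union = 0.74          # Taq I and Msp I combined, from the text

expected_union = union_if_independent([taq, msp])
detected_by_both = taq + msp - observed_union   # inclusion-exclusion

print(f"expected if independent: {expected_union:.1%}")
print(f"observed:                {observed_union:.1%}")
print(f"loci polymorphic with both enzymes: {detected_by_both:.0%}")

# Cumulative detection as enzymes are added (values from the text).
cumulative = [("Taq I", 0.63), ("+ Msp I", 0.74), ("+ Xba I", 0.79), ("+ 7 others", 0.83)]
previous = 0.0
for label, rate in cumulative:
    print(f"{label:<12} {rate:.0%} (gain {rate - previous:.0%})")
    previous = rate
```

The observed 74% falls short of the roughly 84% expected under independence, which is consistent with the idea that some loci are simply more polymorphic than others, so the two enzymes tend to succeed at the same loci.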

8.2.3 Minisatellites: Variable Number Tandem Repeat (VNTR) loci

In contrast to traditional RFLPs caused by basepair changes in restriction sites, a special class of RFLP loci present in all mammalian genomes is highly polymorphic with very large numbers of alleles. These ‘hypervariable’ loci were first exploited in a general way by Jeffreys and his colleagues for genetic mapping in humans (Jeffreys et al., 1985). Hypervariable RFLP loci of this special class are known by a number of different names including Variable Number Tandem Repeat (VNTR) loci and minisatellites, which is the more commonly used term today. Minisatellites are composed of unit sequences that range from 10 to 40 bp in length and are tandemly repeated from tens to thousands of times. Although various functions have been suggested for minisatellite loci as a class, none of these has withstood the test of further analysis (Jarman and Wells, 1989; Harding et al., 1992). Rather, it appears most likely that minisatellite loci (like the microsatellite loci described in section 8.3.6) evolve in a neutral manner through expansion and contraction caused by unequal crossing over between out-of-register repeat units as diagrammed in figure 8.4 (Harding et al., 1992). As shown in the figure, recombination events of this type will yield reciprocal products that both represent new alleles with a change in the number of repeat units.
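The repeat-count bookkeeping of unequal crossing over can be sketched as follows. This is a deliberately simplified model that assumes breakpoints fall exactly between repeat units; the function and parameter names are hypothetical.

```python
def unequal_crossover(repeats_a, repeats_b, break_a, break_b):
    """Repeat counts of the two reciprocal recombinant alleles.

    break_a and break_b are the numbers of repeat units lying to the left of
    the breakpoint on homolog A and homolog B. Equal values mean the arrays
    were paired in register.
    """
    if not (0 <= break_a <= repeats_a and 0 <= break_b <= repeats_b):
        raise ValueError("breakpoint outside the repeat array")
    product_1 = break_a + (repeats_b - break_b)  # left of A joined to right of B
    product_2 = break_b + (repeats_a - break_a)  # left of B joined to right of A
    return product_1, product_2

# Two 20-repeat alleles paired three units out of register.
print(unequal_crossover(20, 20, 12, 9))
```

In this model the total number of repeat units is conserved, so one product gains exactly as many units as the other loses; an in-register exchange between identical alleles leaves both unchanged.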

The frequency with which new alleles are created at minisatellite loci, on the order of 10^-3 per locus per gamete, is much greater than the classical mutation rate of 10^-5 to 10^-6 (Jeffreys et al., 1988). This leads to a much higher level of polymorphism between unrelated individuals within a population. At the same time, one change in a thousand gametes is low enough so as to not interfere with the ability to follow minisatellite alleles in classical breeding studies.

Length polymorphisms at minisatellite loci are most simply detected by digestion of genomic DNA samples with a restriction enzyme that does not cut within the minisatellite itself but does cut within closely flanking sequences. As with all other RFLP analyses, the restriction digests are fractionated by gel electrophoresis, blotted and hybridized to probes derived from the polymorphic locus. However, unlike traditional point mutation RFLPs, minisatellite polymorphisms are caused by, and reflect, changes in the actual size of the locus itself.

The best restriction enzymes to use for minisatellite analysis are those with 4-bp recognition sites such as Hae III, HinfI or Sau3A; it is likely that one of these enzymes will not cut within the relatively short minisatellite unit sequence, but will cut within several hundred basepairs of flanking sequence on both sides. Standard 1% agarose gels with maximal separation in the 1 to 4 kb range are usually best for the resolution of minisatellite bands; however, conditions can be optimized for each minisatellite system under analysis.

There is nothing special about the unit sequence present within minisatellites, which are defined only by their repeated nature and their repeat unit size. Thus, it is not possible to develop a general protocol for identifying all minisatellite sequences within the genome, and there is no way of knowing how many loci of this type are actually present. However, significant homology (indicative of evolutionary relatedness) often exists among unlinked minisatellite loci that are scattered throughout the genome. Homologies that allow cross-hybridization define minisatellite families that can have as few as two and as many as 50 members (Nakamura et al., 1987). It is often possible to take advantage of these cross-homologies to map ten or more minisatellite loci as independent RFLPs within single Southern blot hybridization patterns.

The simultaneous detection of 10 to 40 unlinked and highly polymorphic loci provides a whole genome "fingerprint" pattern which is very likely to show differences between any two unrelated individuals (Jeffreys et al., 1985). These DNA fingerprints provide a powerful tool in human forensic analysis in the absence of any knowledge as to the map location of any of the individual loci that are being detected (Armour and Jeffreys, 1992). DNA fingerprinting per se is of much less use in the analysis of laboratory animals, who do not bring paternity suits or stand trial for rape or murder. However, fingerprinting can allow field biologists to follow individual animals in wild populations subjected to repeated capture and release sampling. It can also be used to monitor the integrity of inbred strains of mice and for the characterization and comparison of different breeds of domesticated animals that have commercial importance.

New minisatellite families are uncovered by chance, by cross-hybridization with probes defined in other species (Jeffreys et al., 1987), or by the use of "synthetic tandem repeats" of arbitrary 14-20-mer oligonucleotides (Mariat and Vergnaud, 1992). The first analysis of minisatellites in the mouse was performed with the use of several human minisatellite sequences as probes (Jeffreys et al., 1987). The results obtained in the analysis of a set of recombinant inbred strains (described at length in chapter 9) demonstrated the expected high level of polymorphism as well as a high level of stability over time, both of which are critical properties for a useful mapping tool. Julier and his colleagues have performed more detailed mapping studies with a larger panel of human minisatellite probes (Julier et al., 1990) and, in collaboration with Mariat and colleagues, they have also performed minisatellite mapping with the use of arbitrary oligonucleotides of 14 to 16 bases in length (Mariat et al., 1993). With the 29 human-derived minisatellite probes tested, these authors found that 48% gave well-resolved complex fingerprint patterns upon hybridization to the mouse genome. With a set of 24 arbitrary oligonucleotides that were pre-selected for detection of minisatellites in humans, 23 were found to detect polymorphic loci in the mouse as well.

In an initial analysis with just 11 of the human minisatellite probes, a total of 115 to 234 restriction fragment differences were detected in pairwise comparisons among a series of seven M. musculus-derived inbred strains. The smallest number of polymorphic loci was observed in a comparison of C3H/He and DBA/2J; the largest number was observed between SJL/J and 129/Sv. Approximately twice as many polymorphisms were observed in pairwise comparisons between M. musculus-derived strains and a M. spretus inbred line.

The 11 characterized probes were used to follow the segregation of minisatellite alleles in a higher resolution analysis of the BXD set of RI strains as described in chapter 9 (Julier et al., 1990). The 346 polymorphic bands followed in this study sorted into 166 independent loci, approximately half of which were represented by a single restriction fragment, with the remaining represented by two or more fragments. As expected, in several cases, new fragments were detected in particular RI strains that were not present in either of the parental inbred strains from which they were generated, attesting to the rapid rate at which minisatellite loci mutate to new alleles.

Mapping with multi-locus minisatellite probes is most effective for whole genome studies rather than for single-chromosome analyses. Thus, like the two-dimensional RFLP and RAPD technologies described below, minisatellite mapping is actually of greatest use for the initial development of whole genome "framework maps" of relatively uncharacterized species, of which the mouse is not one.

8.2.4 Dispersed multi-locus analysis with cross-hybridizing probes

Minisatellite families are just one example of dispersed, cross-hybridizing loci that can be mapped simultaneously by Southern blot analysis. Another class of this type includes those gene families that have multiple members dispersed to unlinked chromosomal locations. In general, protein-encoding genes will be much less polymorphic than minisatellite loci; thus, simultaneous mapping of multiple members of gene families through RFLP analysis is best accomplished with interspecific backcrosses of the spretus × domesticus type. In one such study, probes for just two gene families (ornithine decarboxylase and triose phosphate isomerase) were combined with a probe for the highly polymorphic mouse mammary tumor virus (MMTV) elements (described below) in traditional Southern blot studies to detect and map a total of 28 loci to 16 of the 19 mouse autosomes (Siracusa et al., 1991).

A third broad class of cross-hybridizing loci is represented by the endogenous retroviral and retroviral-like elements which have been dispersed to random positions throughout the genome. A number of different families and subfamilies of this class have been identified. The best characterized of these (with average copy number per haploid genome in parentheses) are MMTVs (4-12), ecotropic MuLVs (0-10), non-ecotropic MuLVs (40-60), VL30s (~200), and IAPs (~2000). In all of these cases, polymorphisms are a consequence of the recent integration of proviral elements so that particular elements are present in the genomes of some strains but not others; thus, each polymorphism is represented by a binary plus/minus system.

Both the MMTVs and ecotropic MuLVs are present at copy numbers that are suitable for mapping by standard agarose gel electrophoresis. By combining data from various crosses, Jenkins and colleagues (1982) mapped a total of 18 ecotropic MuLV integration sites that were named Emv-1 through Emv-18. In similar studies, 26 MMTV integration sites have been mapped among various inbred strains (Kozak et al., 1987); these have been named Mtv-1 through Mtv-26.

The non-ecotropic MuLV elements are present at a copy number which is somewhat too high for complete resolution of all elements on standard agarose gels. To overcome this problem, and to obtain maximal mapping information, it is possible to take advantage of the subfamily structure of this class of elements. Oligonucleotides that recognize different subsets of 10 to 30 loci per genome have been used as Southern blot probes with excellent resolving power (Frankel et al., 1990; Frankel et al., 1992). In general, 30 to 50% of the non-ecotropic viral elements are shared in any one pairwise comparison of inbred strains. By combining data from different sets of recombinant inbred lines, Frankel and colleagues were able to map over 100 non-ecotropic integration sites; these have been named with the prefixes Polytropic murine virus (Pmv-), Modified polytropic murine virus (Mpmv-), or Xenotropic murine virus (Xmv-) according to the particular oligonucleotide that cross-hybridized to each element. With the MMTV and various MuLV families, it is still possible to use the same probes to map even more integration sites through the examination of strains that were not previously studied.

The retroviral-like families IAP (Lueders and Kuff, 1977) and VL30 (Courtney et al., 1982; Keshet and Itin, 1982) are present in approximately 2000 and 200 copies respectively per haploid genome. These families and others of the same class contain a large potential reservoir of useful genetic markers. However, their copy number is much too high to allow the resolution of individual family members in traditional Southern blot studies with restriction digested DNA samples. It is typically difficult to resolve more than 20 bands in a traditional one-dimensional hybridization pattern. Furthermore, as the copy number of cross-hybridizing bands increases, the resolution of individual bands actually decreases as more and more merge into each other to eventually form a continuous smear.

In theory, this problem could be alleviated in two different ways. The first approach would be the same as that used for the non-ecotropic loci, which is to reduce the complexity of the Southern blot pattern with the use of oligonucleotide probes that detect small subsets of the whole family. The validity of this approach has been demonstrated for the IAP family of elements (Meitz and Kuff, 1992).

A second, very different approach is based on increasing resolving power, rather than decreasing complexity, by fractionating genomic DNA in two sequential dimensions. This can be accomplished as illustrated in figure 8.5. DNA samples are first subjected to digestion with a restriction enzyme that cuts relatively infrequently (Step 1 in the figure) followed by fractionation on an agarose gel (Step 2). At the completion of electrophoresis, each sample-containing gel lane is excised and incubated directly with a second restriction enzyme that cuts more frequently (Step 3). Finally, the gel slice itself is used as the sample for a second round of electrophoresis in a direction perpendicular to the first round (Step 4). At the completion of this second dimension run, the gel is blotted, hybridized to the high copy-number probe, and autoradiographed.

Separation of DNA fragments in two dimensions, rather than one, should theoretically provide a ‘power of two’ increase in resolution, from approximately 20 bands to 400. In fact, over 130 restriction fragment "spots" have been observed in individual two-dimensional patterns obtained with probes for the IAP and VL30 families (Sheppard et al., 1991; Sheppard and Silver, 1993). In general, each spot represents a single retroviral-like locus of the type defined by the probe used for hybridization. The X coordinate of the spot measures the distance between flanking restriction sites produced in the first digestion. The Y coordinate provides a measure of the distance between the two closest restriction sites (of either type) that flank the locus after the double digestion.

The main advantage of a two-dimensional mapping approach is that large numbers of loci from each animal can be mapped simultaneously. There are two main disadvantages to the general use of this approach for analyzing a large cross. First, only one animal can be analyzed within each gel. Second, from start-to-finish, each gel run can take five days to complete, and there is very little tolerance for mistakes of any kind throughout the protocol.

8.2.5 Restriction Landmark Genomic Scanning (RLGS)

A significant variation on two-dimensional RFLP analysis has been developed by Hayashizaki and his colleagues (Hatada et al., 1991). With this novel protocol, restriction sites are scanned directly without the intervention of probes for specific loci. This can be accomplished through the direct end-labeling of a class of restriction sites that are generated by a rare-cutting enzyme followed by additional rounds of restriction digestion and gel separation. Briefly, the first restriction digestion is carried out with a rare-cutting enzyme like Not I which has an 8-bp recognition site that is present, on average, only once per megabase in the mouse genome (a first component of Step 1 in figure 8.5). Digestion with Not I will produce a total of only ~3,000 fragments from each haploid genome. Labeling of Not I sites is accomplished by filling in the single-strand restriction site overhangs with radioactive nucleotides. Subsequently, the Not I fragments are reduced in size by digestion with a second enzyme having a 6-bp recognition site that produces fragments with an average size of 4-6 kb (a second component of Step 1). Although the total number of restriction fragments per genome is increased 200-fold by this second digestion, only those fragments that have an original Not I site at one end will be labeled. This total mixture is fractionated by agarose gel electrophoresis (Step 2) and then digested in situ with a third enzyme that has a 4-bp recognition site and thus cuts very frequently in the genome (Step 3). The average size of restriction fragments has now been reduced to several hundred basepairs. The gel strip containing each sample is now placed on top of a polyacrylamide gel and a second orthogonal dimension of electrophoresis is carried out (Step 4).

This RFLP protocol differs from all those described previously in that the rare restriction sites are visualized directly without the use of probes that light up particular loci or locus families. Thus, the complete set of fragments that flank both sides of every Not I site in the genome of an individual will be displayed in the pattern that is obtained. The X coordinate of each labeled spot will be a measure of the distance between the first labeled Not I restriction site and the nearest neighbor second restriction site. The Y coordinate of each spot will be a measure of the distance between the first labeled restriction site and the nearest neighbor third restriction site. Polymorphisms can arise from changes that affect any of the three restriction sites that define each spot.
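The fragment numbers quoted for the RLGS digestion series follow from simple arithmetic. The sketch below is only illustrative: it assumes a haploid genome of 3 × 10^9 bp (the size implied by ~3,000 Not I fragments at one site per megabase) and a 5 kb average for the second-enzyme fragments (the middle of the 4-6 kb range). The function and parameter names are hypothetical.

```python
def rlgs_fragment_counts(genome_bp=3_000_000_000,
                         rare_site_spacing_bp=1_000_000,
                         second_fragment_bp=5_000):
    """Rough fragment counts for the first two RLGS digestions."""
    # Rare cutter (Not I): about one site per megabase (assumed genome size).
    rare_fragments = genome_bp // rare_site_spacing_bp
    # Second enzyme (6-bp site): fragments averaging ~5 kb (assumed midpoint).
    second_fragments = genome_bp // second_fragment_bp
    fold_increase = second_fragments // rare_fragments
    # Only fragments carrying a filled-in Not I end are labeled, and each
    # Not I site contributes one labeled end on either side.
    labeled_fragments = 2 * rare_fragments
    return rare_fragments, second_fragments, fold_increase, labeled_fragments


rare, second, fold, labeled = rlgs_fragment_counts()
print(f"Not I fragments: {rare}")
print(f"fragments after second digestion: {second} ({fold}-fold more)")
print(f"labeled fragments (potential spots): {labeled}")
```

Under these assumptions the second digestion raises the fragment count 200-fold, while the number of labeled fragments stays in the low thousands because labeling is confined to the Not I ends.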

Since the rare restriction sites themselves are labeled, blotting and hybridization steps are eliminated and autoradiographs are obtained by direct exposure of gels to film. The elimination of two lengthy steps significantly reduces the overall time required to process each sample. In addition, without blotting and hybridization, spots are much sharper and better delineated from each other. Resolution is also improved with the use of a polyacrylamide, rather than agarose, medium in the second dimension of separation. Hayashizaki, Hatada and colleagues have reported the detection of several thousand spots on two-dimensional gels derived from individual mice (Hatada et al., 1991). Analysis of the BXD set of RI strains with this protocol has allowed the mapping of 473 polymorphic loci.

The advantages in resolution notwithstanding, the Restriction Landmark Genomic Scanning protocol is still technically demanding and it still allows the processing of only one sample per gel. Like other multiplex whole genome scanning methods, it is actually of greatest utility for the initial development of whole genome maps of relatively uncharacterized species.

8.3 Polymorphisms detected by PCR

Without a doubt, the polymerase chain reaction (PCR) represents the single most important technique in the field of molecular biology today. What PCR accomplishes in technical terms can be described very simply — it allows the rapid and unlimited amplification of specific nucleic acid sequences that may be present at very low concentrations in very complex mixtures. Within less than a decade after its initial development, it has become a critical tool for all practicing molecular biologists, and it has served to bring molecular biology into the practice of many other fields in the biomedical sciences and beyond. The reasons are several-fold. First, PCR provides the ultimate in sensitivity — single DNA molecules can be detected and analyzed for sequence content (Li et al., 1988; Arnheim et al., 1991). Second, it provides the ultimate in resolution — all polymorphisms, from single base changes to large rearrangements, can be distinguished by an appropriate PCR-based assay. Third, it is extremely rapid — for many applications, it is possible to go from crude tissue samples to results within the confines of a single workday. Finally, the technique is an agent of democracy — once the sequences of the pair of oligonucleotides that define a particular PCR reaction are published, anyone anywhere with the funds to buy the oligonucleotides can reproduce the same reaction on samples of his or her choosing; this stands in contrast to RFLP analyses in which investigators are often dependent upon the generosity of others to provide clones to be used as probes. Numerous books and thousands of journal articles have been published on the principles and applications of the technique (Erlich, 1989; Innis et al., 1990 are two early examples).

Although the applications of PCR are as varied as the laboratories in which the technique is practiced, this section will focus entirely on six general applications that are relevant to the detection and typing of genetic variation in the mouse. Four of these applications are based on the PCR amplification of particular loci that have been previously characterized at the sequence level. In these cases, primer pairs must be chosen to be as specific as possible for the locus in question in order to avoid artifactual PCR products. Computer programs are available to assist in primer design (Lowe et al., 1990; Dietrich et al., 1992), but manual inspection is usually adequate. One must be careful to avoid self-complementarity within any one primer and the presence of complementary sequences between the two primers. Also, potential primers should be screened with use of a sequence comparison program to avoid homology with the highly repeated elements B1, B2, and L1 (see section 5.4). The primer length should be at least 20 bases, the G:C content should be at least 50%, and the melting temperature should be at least 60°C.
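The manual inspection described above can be partly mechanized. The following is a minimal sketch, not a replacement for the published primer design programs. It assumes uppercase A/C/G/T input, estimates melting temperature with the common rule of thumb of 2°C per A/T and 4°C per G/C (an assumption, not a method given in this chapter), and treats any 6-base complementary stretch as a warning sign (an arbitrary threshold). Screening against B1, B2, and L1 repeats is not attempted. All function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(primer):
    return (primer.count("G") + primer.count("C")) / len(primer)


def rule_of_thumb_tm(primer):
    """Approximate Tm in degrees C: 2 per A/T plus 4 per G/C (assumed rule)."""
    gc = primer.count("G") + primer.count("C")
    return 2 * (len(primer) - gc) + 4 * gc


def can_pair(a, b, run=6):
    """True if some `run`-base stretch of `a` is complementary to part of `b`."""
    rc_b = reverse_complement(b)
    return any(a[i:i + run] in rc_b for i in range(len(a) - run + 1))


def check_primer_pair(forward, reverse, run=6):
    """Return a list of rule violations (empty if the pair passes)."""
    problems = []
    for name, primer in (("forward", forward), ("reverse", reverse)):
        if len(primer) < 20:
            problems.append(f"{name}: shorter than 20 bases")
        if gc_fraction(primer) < 0.5:
            problems.append(f"{name}: G:C content below 50%")
        if rule_of_thumb_tm(primer) < 60:
            problems.append(f"{name}: estimated Tm below 60 C")
        if can_pair(primer, primer, run):
            problems.append(f"{name}: self-complementary stretch")
    if can_pair(forward, reverse, run):
        problems.append("primers are complementary to each other")
    return problems


# Hypothetical sequences, chosen only to exercise the checks.
for problem in check_primer_pair("ACGTACGTACGT", "GAGGGAAGGAGGGAGAAGGG"):
    print(problem)
```

A pair that returns an empty list has only passed these coarse filters; it must still be tested at the bench for specific amplification.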

Even when all of these conditions are adhered to, it is still possible to find that a particular pair of primers will not work properly to amplify a specific locus into a reproducible product that can be clearly distinguished from artifactual background bands. There are a variety of approaches that one can take to eliminate such problems (Erlich, 1989; Innis et al., 1990), but if all else fails, one should replace one or both primers with alternatives derived from other nearby flanking sequences that also fit the rules listed above.

8.3.1 Restriction site polymorphisms

8.3.1.1 Overview

Rapid, highly efficient PCR-based assays can be designed to detect all RFLPs — previously-defined by Southern blot analysis — as long as the nature of the RFLP is understood and sequence information flanking the actual polymorphic site is available. A pair of PCR primers that flank this site can then be synthesized according to the rules just described and tested for their ability to amplify a specific product that can be readily identified as an ethidium bromide-stained band by gel electrophoresis.

With the simplest and most common type of RFLP illustrated in figure 8.6, the polymorphism results from a single nucleotide difference that provides a recognition site for a restriction enzyme in one allelic form and not the other. A polymorphism of this type can be rapidly detected by (1) amplifying the region around the polymorphic site from each sample, (2) subjecting the amplified material to the appropriate restriction enzyme for a brief period of digestion, and (3) distinguishing the undigested PCR product from the smaller digested fragments by gel electrophoresis. By choosing primers that are relatively equidistant to and sufficiently far from the polymorphic site, one can easily resolve allelic forms on agarose or polyacrylamide gels as illustrated in figure 8.6.
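The band patterns expected from this three-step assay can be worked out ahead of time from the amplicon sequence. The sketch below is illustrative only: the recognition sequence GAATTC and the artificial amplicons are hypothetical examples, and the cut is placed at the start of the recognition site for simplicity (real enzymes cut at enzyme-specific positions within or near the site).

```python
def digest(amplicon, site):
    """Fragment lengths after cutting `amplicon` at every occurrence of `site`."""
    cut_points = []
    pos = amplicon.find(site)
    while pos != -1:
        cut_points.append(pos)
        pos = amplicon.find(site, pos + 1)
    bounds = [0] + cut_points + [len(amplicon)]
    return [end - start for start, end in zip(bounds, bounds[1:]) if end > start]


def expected_bands(allele_amplicons, site):
    """Distinct band sizes expected on the gel for one animal's alleles."""
    bands = set()
    for amplicon in allele_amplicons:
        bands.update(digest(amplicon, site))
    return sorted(bands, reverse=True)


SITE = "GAATTC"  # hypothetical recognition sequence
cut_allele = "A" * 120 + "GAATTC" + "C" * 80    # site present
uncut_allele = "A" * 120 + "GAGTTC" + "C" * 80  # single base change removes it

print("cut homozygote:  ", expected_bands([cut_allele, cut_allele], SITE))
print("uncut homozygote:", expected_bands([uncut_allele, uncut_allele], SITE))
print("heterozygote:    ", expected_bands([cut_allele, uncut_allele], SITE))
```

Placing the polymorphic site off-center, as in this example, keeps the two digestion products distinguishable from each other as well as from the undigested product.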

This PCR-based protocol provides results much more rapidly and is much easier to carry out than the Southern blot alternative which requires blotting, probe labeling, hybridization, and autoradiography. Since the major expense involved with PCR is in the initial sequencing of the locus and the synthesis of PCR primers, it is also less costly in all cases where one expects to type large numbers of samples for the particular locus in question.

Even RFLPs caused by more complex mutational events can be analyzed by PCR. Figure 8.7 illustrates the logic behind devising PCR strategies for detecting deletions, insertions, inversions, and translocations. The only requirement is a knowledge of the sequences that surround the breakpoints associated with each particular genetic event.

8.3.1.2 3’-untranslated regions as a mapping resource

An important resource for the identification of new restriction site polymorphisms that can be typed by PCR is the 3’-untranslated (3’-UT) regions of transcripts. These regions are not under the same selective constraints as coding sequences and are frequently just as polymorphic as random non-transcribed genomic regions. However, 3’-UT regions are direct markers for the 3’ ends of genes. They are usually not interrupted by introns and are often sufficiently divergent between different members of a gene family to allow locus-specific analysis.

Nearly all cDNA libraries are constructed from cDNA molecules that have been initiated by priming from the poly(A) tail present at the 3’ end of the mRNA. For all clones obtained from these libraries, it is straightforward to obtain sequence information for a few hundred basepairs of 3’-UT region directly adjacent to the poly(A) tail. This sequence information can be used to design a pair of PCR primers that can be used, in turn, to amplify and sequence the same region from a different strain or species of mice such as M. spretus or M. m. castaneus. In a comparison of 2,312 bp present in 3’-UT regions derived from 22 random mouse cDNA clones, Takahashi and Ko (1993) found an overall polymorphism rate of one change in every 92 bp. These single base changes translated into restriction site polymorphisms within 9 of the 22 clones analyzed. With primers already in-hand, these newly identified polymorphisms provide PCR markers for the direct mapping of corresponding genes that are indistinguishable in their coding regions.

8.3.2 Detection of allelic changes defined by single basepairs

8.3.2.1 Hybridization and single basepair changes

Although PCR detection of RFLPs is an improvement over Southern blot detection, the real advantage of PCR lies within its nearly-universal ability to discriminate alleles differing by single base changes even when they do not create or destroy any known restriction site. In fact, most random basepair changes will be of the non-RFLP type, and before the advent of PCR, there was no efficient means by which these alleles could be easily followed in large numbers of samples. It was this limitation that led originally to the development of the PCR protocol (Saiki et al., 1986).

The inability to detect single base changes on Southern blots was a consequence of both theoretical limitations inherent in the process of hybridization as well as practical limitations in the sensitivity of nucleic acid probes and the elimination of background noise. With Southern blot analysis, the sensitivity at which target sequences can be detected within a defined sample is directly proportional to the length of the probe. For example, a 1 kb probe will hybridize to ten times the amount of target sequence as a 100 bp probe (having the same specific activity), and this will lead to a signal which is ten times stronger. It is for this reason that it is always best to use the longest probes possible for traditional Southern blot studies as well as for other protocols such as in situ hybridization. Signal strength is important not simply to reduce the amount of time required for autoradiographic exposure, but also to allow detection over the background "noise" inherent in any hybridization experiment. If conditions are at all less than optimal, the signal-to-noise ratio will drop below 1.0 as the probe size is reduced below 100 to 200 bp.

The only forces holding the two strands of a DNA double helix together are the double or triple hydrogen bonds that exist within each basepair. Individual hydrogen bonds are very weak, and it is only when they are added together in large numbers that the double helix has sufficient stability to avoid being split apart by normal thermal fluctuations. Thus, for DNA molecules having a size in the range from a few basepairs up until a critical value of ~50 bp, the length itself plays a critical role in the determination of whether the helix will remain intact or fall apart. However, once this upper boundary is crossed, length is no longer a factor in thermal stability. In effect, there is only a small window — ~10 to 40 bp — over which it is possible to obtain differential hybridization of probe to target based on differences in hybrid length.

But how could length make a difference in allele detection when both the target and probe lengths are held constant? The answer is that the effective length of a hybrid is determined by the longest stretch of DNA that does not contain any mismatches. Thus, when a probe of 21 bases in length hybridizes to a target that differs at a single base directly in the middle of the sequence, the effective length of the hybrids that are formed is only 10 bp. Since a 10 bp hybrid is significantly less stable than a 21 bp hybrid, it becomes easy to devise hybridization conditions (essentially by choosing the right temperature) such that the perfect hybrid will remain intact while the imperfect hybrid will not. In contrast, the thermal stability of a 50 bp hybrid is not sufficiently different from the thermal stability of a 100 bp hybrid (of equivalent sequence composition) to allow detection by differential hybridization.
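The notion of effective hybrid length can be made concrete with a few lines of code. This is a minimal sketch under a simplifying assumption: the probe and the target are written in the same sense and aligned base for base, so a mismatch in the hybrid appears as a position where the two strings differ. The 21-base sequence is hypothetical.

```python
def effective_hybrid_length(probe, target):
    """Longest mismatch-free stretch between an aligned probe and target."""
    if len(probe) != len(target):
        raise ValueError("probe and target must be aligned and of equal length")
    longest = run = 0
    for p, t in zip(probe, target):
        run = run + 1 if p == t else 0  # a mismatch resets the current stretch
        longest = max(longest, run)
    return longest


probe = "ACGTTGCAAGCTTGACCTGAT"            # hypothetical 21-mer
variant = probe[:10] + "T" + probe[11:]    # single base change at the center

print(effective_hybrid_length(probe, probe))    # perfect match
print(effective_hybrid_length(probe, variant))  # central mismatch
```

With the mismatch at the exact center of a 21-mer, the longest perfect stretch on either side is 10 bases, which is why centered variants give the largest difference in stability.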

Thus, in 1985, the detection of single base differences through differential Southern blot hybridization was not possible because of two counteracting problems. First, it was only with very short probes — oligonucleotides of less than 50 bases — that single base changes provided a large enough difference to be readily detected. However, it was only with much longer probes — of several hundred bases or greater — that signal strength and signal-to-noise ratio were sufficient to allow specific detection of the target sequence in any allelic form within the high complexity mouse genome. How could one break this impasse?

The answer, of course, was to focus on the target sequences rather than the probe or hybridization conditions. PCR provided a means to increase the absolute amount of target sequence, as well as the target-to-non-target ratio, by virtually-unlimited orders of magnitude. This, in turn, results in a proportional increase in potential signal strength which, in turn, allows one to use short oligonucleotides for hybridization and which, in turn, allows for the detection of single base differences in a simple plus/minus assay.

8.3.2.2 Allele-specific oligonucleotides (ASOs)

Once alternative alleles have been sequenced and a single basepair change between the two has been identified, it becomes possible to design a PCR protocol that allows one to follow their segregation (Farr, 1991). First, PCR primers are identified that allow specific amplification of a region that encompasses the variant nucleotide site (for this application, the length of the product is not critical and can be anywhere from 150 to 400 bp in length). Next, two Allele-Specific Oligonucleotides (ASOs) are produced in which the variant nucleotide is as close to the center as possible considering other factors described at the front of section 8.3. The ideal ASO length is 19 to 21 bases — short enough to allow differential hybridization based on a single base change and long enough to provide a high probability of locus specificity. The two ASOs are used with defined samples to determine a temperature at which positive hybridization is obtained with target DNA containing the correct allele but not with target DNA containing only the alternative allele.
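Picking out the two ASO sequences from a sequenced region is a mechanical step. The sketch below assumes the region is given as a string in the sense of the desired oligonucleotides and that an odd length (19 or 21) is used so the variant can sit exactly at the center. The function name and example sequence are hypothetical, and the other primer design considerations from the start of section 8.3 are not checked here.

```python
def design_asos(region, variant_index, alleles, length=21):
    """Return one ASO per allele with the variant base at the center."""
    if length % 2 == 0:
        raise ValueError("use an odd length so the variant can be centered")
    half = length // 2
    start, end = variant_index - half, variant_index + half + 1
    if start < 0 or end > len(region):
        raise ValueError("not enough flanking sequence for this ASO length")
    left = region[start:variant_index]
    right = region[variant_index + 1:end]
    return {allele: left + allele + right for allele in alleles}


region = "ACGT" * 10  # hypothetical amplified region
for allele, oligo in design_asos(region, 20, ("A", "G")).items():
    print(allele, oligo)
```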

Typically, a sample of genomic DNA is subjected to PCR amplification with the locus-specific primers, aliquots of the amplified material are spotted into two "dots," and each is probed with labeled forms of each of the two ASOs. Hybridization at one dot but not the other is indicative of a homozygote for that allele, while hybridization at both dots is indicative of a heterozygote that carries both alleles.

The power of this protocol for allele detection is its simplicity. The elimination of gel running both saves time and allows for easy automation. With large amounts of target sequence, it becomes possible to use non-radioactive labeling protocols that are safer and allow for long-term storage of labeled probes (Helmuth, 1990; Levenson and Chang, 1990). However, there are pitfalls that are important to keep in mind. First, some variants may be refractory to reproducible PCR analysis because of problems inherent in the sequence that surrounds the site of the base change. Second, plus/minus assays of any kind are subject to the problem of false negatives. (One can always insert a gel running step prior to hybridization to be certain that amplified material is actually present in the aliquot under analysis.) Finally, in the analysis of mice derived from anything other than a defined cross, there is always the risk that a third novel allele will exist that cannot be detected by either of the two ASOs developed for the analysis. An animal heterozygous for such a novel allele (along with one of the two known alleles) could be falsely characterized as a homozygote for the one known allele present since the protocol is not quantitative. Nevertheless, even with these pitfalls, the PCR/ASO protocol remains a useful tool for genetic analysis.
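The scoring logic of the two-dot assay is simple enough to state as code. The sketch below is illustrative: the function name and the allele labels "a" and "b" are hypothetical, and a sample with no signal at either dot is returned as uncalled rather than guessed, in keeping with the false-negative caveat above.

```python
def call_genotype(signal_aso1, signal_aso2, allele1="a", allele2="b"):
    """Translate plus/minus hybridization at two dots into a genotype call."""
    if signal_aso1 and signal_aso2:
        return f"{allele1}/{allele2}"   # both ASOs hybridize: heterozygote
    if signal_aso1:
        return f"{allele1}/{allele1}"   # may hide an undetected third allele
    if signal_aso2:
        return f"{allele2}/{allele2}"   # may hide an undetected third allele
    return None                         # no signal: amplification may have failed


for dots in [(True, False), (False, True), (True, True), (False, False)]:
    print(dots, "->", call_genotype(*dots))
```

Because the assay is not quantitative, the two homozygote calls are only as reliable as the assumption that no third allele is segregating.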

8.3.2.3 The Oligonucleotide Ligation Assay (OLA)

An alternative protocol for the detection of well-defined alleles that differ by single base changes has been developed by Hood and his colleagues (Landegren et al., 1988; Landegren et al., 1990). This method, called the Oligonucleotide Ligation Assay (OLA) or ligase-mediated gene detection, is predicated on the requirement for proper base-pairing at the 3’-end of one oligonucleotide as well as the 5’-end of an adjacent oligonucleotide before ligase can work to form a covalent phosphodiester bond. The conceptual framework behind the protocol is illustrated in figure 8.8. First, the potential target sequence is amplified by PCR. Next, hybridization is carried out simultaneously with two oligonucleotides complementary to sequences that are adjacent to each other and directly flank the variant nucleotide. The variant base itself will be either complementary or non-complementary to the 3’-most base of the first, allele-specific oligonucleotide. This ASO is modified ahead of time with an attached biotin moiety but is not otherwise labeled. The second oligonucleotide, which is labeled radioactively or non-isotopically, extends across an adjacent sequence that is common to both alleles under analysis. Ligase is also present in the reaction, and if both oligonucleotides are perfectly matched with the target sequence, the ligase will create a covalent bond between them. If a mismatch occurs at the junction site, the two oligonucleotides will not become ligated. Biotinylated material can be easily and absolutely separated from non-biotinylated material with the use of a streptavidin matrix, and the resulting sample can then be tested for the presence of the label associated with the second oligonucleotide.

There are two main advantages to using the ligase-mediated detection protocol as a substitute for the PCR/hybridization protocol described in the previous section. First, the chance of a false positive arising from the OLA protocol is essentially zero. Second, the OLA protocol is highly amenable to automation. However, as in all plus/minus assays, proper controls are critical to rule out the possibility of false negatives.
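The logic of the ligation assay can be sketched as a toy model. Several simplifications are assumed here: the locus is written in the same sense as the two oligonucleotides, so that a perfect match is simple string identity; ligation is taken to occur only when both oligonucleotides match perfectly; and a positive signal is equated with ligation, since only then does the label travel with the biotinylated ASO onto the streptavidin matrix. All names and sequences are hypothetical.

```python
def ola_signal(locus, junction, aso, common):
    """True if the biotinylated `aso` would be ligated to the labeled `common`.

    `junction` is the index of the variant base in `locus`; it pairs with the
    3'-most base of the ASO, and `common` begins at the next position.
    """
    start = junction - len(aso) + 1
    end = junction + 1 + len(common)
    if start < 0 or end > len(locus):
        raise ValueError("oligonucleotides extend beyond the amplified locus")
    aso_matches = locus[start:junction + 1] == aso
    common_matches = locus[junction + 1:end] == common
    return aso_matches and common_matches


upstream, downstream = "GGCTAAGCTTGACCTGAT", "GGATCCATGCAAGTTCAG"
locus_c = upstream + "C" + downstream      # one allele
locus_t = upstream + "T" + downstream      # the alternative allele
junction = len(upstream)
aso_c, aso_t = upstream[-14:] + "C", upstream[-14:] + "T"
common = downstream[:15]

for name, locus in (("C allele", locus_c), ("T allele", locus_t)):
    print(name, ola_signal(locus, junction, aso_c, common),
          ola_signal(locus, junction, aso_t, common))
```

Each sample is tested in two separate reactions, one per ASO, so a heterozygote would give a signal in both.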

8.3.2.4 The Ligase Chain Reaction (LCR)

By combining OLA together with the exponential amplification strategy of PCR, a new technique has been developed that is referred to as the Ligase Chain Reaction or LCR (Barany, 1991; Weiss, 1991). Like OLA, detection of nucleotide differences with the LCR protocol is based upon a requirement for perfect pairing at the two sites that flank the break between two oligonucleotides in order for ligase to form a phosphodiester bond between them as illustrated in figure 8.8. The difference is that in the LCR protocol, four oligonucleotides are used corresponding to the regions that flank the polymorphic site on both strands of the target DNA molecule. Therein lies the mechanism of amplification. If the target sequence provides a match, both sets of flanking oligonucleotides will become ligated. After denaturation, each of the newly-created double-length oligonucleotides can now act as a template for a new set of oligonucleotides to basepair and ligate. Thus, LCR proceeds by rounds of annealing in the presence of a heat-stable ligase followed by denaturation and then annealing again. The only difference in the thermocycler pattern from that used for PCR is the elimination of the elongation step. At the end of the process, the products of LCR can be detected easily in the same manner used for OLA as shown in figure 8.8. In contrast, detection of polymorphic sites with PCR requires a follow-up protocol — either hybridization to an allele-specific oligonucleotide or restriction digestion followed by gel electrophoresis.

The essential difference between LCR and the original OLA protocol is sensitivity: LCR requires far less starting material since the product is amplified during the protocol. The advantages of the LCR protocol over PCR are several. First, like OLA, LCR will not produce false positives. Second, because the product of LCR is directly assayable without further detection schemes, the process is much more amenable to automation and is much more likely to be quantitative. The disadvantage of LCR is that it can only be used to detect single base substitutions that have been previously characterized by sequence analysis.

8.3.3 Single Strand Conformation Polymorphism (SSCP)

8.3.3.1 Historical background

There are many circumstances where it is most useful to be able to follow genes — in contrast to anonymous sequences — directly within experimental crosses. A number of different approaches have already been described, but they all have limitations. Gene-associated RFLPs are detected between different species — M. spretus and traditional inbred M. musculus strains, for example — at a reasonable frequency, but are much more difficult to find among the inbred M. musculus strains themselves. To use any of the approaches dependent on Allele-Specific Oligonucleotides, it is first necessary to sequence the locus in question from different strains of mice, identify basepair variants, synthesize the ASOs as well as other locus-specific primer(s), test their specificity and optimize the reaction conditions for each locus. And at this point, one still only has a protocol for distinguishing two allelic states.

A geneticist’s perfect protocol for detection and analysis of polymorphisms at any locus would satisfy the following criteria. First, it would allow the detection of any and all basepair variants in a DNA region as multiple alleles. Second, it would not require prior sequence information from each allele. Third, it would not require the synthesis of ASOs. Fourth, the assay protocol itself would require no special equipment or special skills above and beyond that found in a standard molecular biology laboratory. Finally, the assay would be rapid and the results would be readily reproducible.

All of these criteria have been satisfied, to a good degree, with a simple protocol that takes advantage of the fact that even single nucleotide changes can alter the three dimensional equilibrium conformation that single strands will assume at low temperatures (Orita et al., 1989a). If a sample of DNA is denatured at high temperature and then quickly placed onto ice, re-formation of DNA hybrids will be inhibited. Instead, each single strand will collapse onto itself in what is often called a random coil. In fact, it is now clear that each single strand will assume a most-favored conformation based on the lowest energy state. Presumably, the most favored state is one in which a large number of bases can form hydrogen bonds with each other. Even a single nucleotide change could conceivably disrupt the previous most favored state and promote a different one, which, if different enough, would run with an altered mobility on a gel. Different allelic states of a locus that are detected with this protocol are called Single-Strand Conformation Polymorphisms or SSCPs (Beier et al., 1992; Beier, 1993).

8.3.3.2 Denaturing Gradient Gel Electrophoresis (DGGE)

The development of the SSCP protocol was an outgrowth of an earlier technique that allowed the detection of single base changes in genomic DNA upon electrophoresis through an increasing gradient of denaturant (Fischer and Lerman, 1983). This technique is called denaturing gradient gel electrophoresis or DGGE. Small changes in sequence have dramatic effects on the point in the denaturing gradient at which particular double-stranded genomic restriction fragments split into single strands. With the attachment of a "GC-clamp" (composed of a stretch of tightly bonding G:C basepairs), the DNA fragment could be held together with a double helix in the clamped region attached to the open single strands present in the melted region. This two-phase molecule would be very resistant to further migration in the gel and would essentially stop in its tracks. Two allelic forms of a genomic fragment that differed by even a single basepair would undergo this transition at different denaturant points and this would be observed as different migration distances in the denaturing gradient gel. In the original protocol, different allelic forms were detected directly within total genomic DNA upon Southern blotting and hybridization to a locus-specific probe. At the time DGGE was developed, there was no other means available for detecting basepair changes that did not alter restriction sites, and thus DGGE expanded the polymorphism horizon. Unfortunately, DGGE requires the use of custom-made equipment and is tedious to perform on a routine basis.

In recent years, the DGGE protocol has been modified for use in conjunction with PCR as a means for the initial detection of allelic variants among samples recovered from different individuals within a population (Sheffield et al., 1989). The differential migration of PCR products can be detected directly in gels with ethidium bromide staining and variant alleles can be excised from the gel for sequence analysis. The main advantage to the use of DGGE is that nearly all single base substitutions can be detected (Myers et al., 1985). Thus, it is ideal for the situation where one wants to search for rare variant alleles among individuals within a population without the need for sequencing through all of the wild-type alleles that will be present in most samples. Nevertheless, DGGE does not scale up easily and, thus, it is not the method of choice when another less-labor-intensive protocol can also be used for the detection of allelic variants. In many cases, the better protocol will be SSCP.

8.3.3.3 The SSCP protocol and its sensitivity

One of the main virtues of the new SSCP detection protocol developed by Orita et al. (1989a; 1989b) is its simplicity. In its basic form, PCR products are simply denatured at 94°C, cooled to ice temperature to prevent hybrid formation, and then electrophoresed on standard non-denaturing polyacrylamide gels. The two strands of the PCR product will usually run with very different mobilities, and base changes can act to further alter the mobility of each strand. Thus, what appears to be a single PCR product by standard analysis can split into four different bands on an SSCP gel if the original DNA sample was heterozygous for base changes that altered the mobility of both strands.

Various studies have analyzed pairs of PCR products known to differ by single base substitutions to obtain an estimate of the fraction that can be distinguished by the SSCP protocol. In one such study, 80% of 228 variant PCR products were distinguishable (Sheffield et al., 1993). In another study, the rate of detection was 80 to 90% (Michaud et al., 1992). However, when this last set of samples was analyzed under three different electrophoretic conditions, the detection rate was an astonishing 100%.

8.3.3.4 A powerful tool for the detection of polymorphisms among classical inbred strains

When SSCP analysis was performed on a set of PCR products amplified from either the 3’-untranslated or intronic regions of 30 random mouse genes, 43% showed polymorphism in a comparison of the classical inbred strains B6 and DBA, and 86% showed polymorphism between B6 and M. spretus (Beier et al., 1992). The rate at which SSCP polymorphisms are detected between B6 and DBA is much greater than that observed with RFLP analysis, and in the same ballpark as the polymorphism frequencies observed for microsatellites described in section 8.3.6.

There are a number of advantages to the SSCP approach over other systems for detecting polymorphisms: (1) SSCP has potential applicability to all unique sequences, both within genes and non-genic regions; (2) PCR primers can be designed directly from cDNA sequences; and (3) If one amplified region does not show polymorphism, one can always move up- or downstream to another. However, it is still likely to be the case that microsatellites — described in section 8.3.6 — will have larger numbers of different alleles and this makes them more useful in straightforward fingerprinting approaches. Together, SSCP and microsatellite analysis provide a powerful pair of PCR-based tools for classical linkage analysis with recombinant panels derived from both intra- and interspecific crosses.

8.3.4 Random Amplification of Polymorphic DNA (RAPD)

With all of the PCR protocols described so far, there is an absolute requirement for pre-existing sequence information to design the primers upon which specific amplification depends. In 1990, two groups demonstrated that single short random oligonucleotides of arbitrary sequence could be used to prime the amplification of genomic sequences in a reproducible and polymorphic fashion (Welsh and McClelland, 1990; Welsh et al., 1991; Williams et al., 1990). This protocol is called Random Amplification of Polymorphic DNA or RAPD. The principle behind the protocol is as follows. Short oligonucleotides of random sequence will, just by chance, be complementary to numerous sequences within the genome. If two complementary sequences are present on opposite strands of a genomic region in the correct orientation and within a close enough distance from each other, the DNA between them can become amplified by PCR. Each amplified fragment will be independent of all others and, by chance, will likely be of different length as well; if few enough bands are amplified, all will be resolvable from each other by gel electrophoresis. Different oligonucleotides will amplify completely different sets of loci.

RAPD polymorphisms result from the fact that a primer hybridization site in one genome that is altered at a single nucleotide in a second genome can lead to the elimination of a specific amplification product from that second genome as illustrated in figure 8.9. If, for example, the random primer being used has a length of 10 bases, then each PCR product will be defined by 20 bases (10 in the primer target at each end) that are all susceptible to polymorphic changes. The resulting polymorphism will be detected as a di-allelic +/- system.

If one starts with the assumption that complete complementarity between primer and target is required for efficient amplification, it becomes possible to derive a general equation to predict the approximate number, A, of amplified bands expected as PCR products from a genome of complexity C that is primed with a single oligomer of length N. For amplified fragments of 2 kb or smaller in size the equation is:

$$A = \frac{4000\,C}{4^{2N}} \qquad (8.1)$$

Let us use 2 × 10^9 as an estimate for the complexity of the single copy portion of the mouse genome (see section 5.1.2) and solve equation 8.1 for primer lengths that vary from eight to eleven. With N = 8, the predicted number of PCR products is about 1,863, far too many to resolve by any type of gel analysis. With N = 9, the equation predicts 116 PCR products, which is still too high a number. With N = 10, the prediction is 7.2 products, and with an 11-mer, the prediction is 0.45 products. Each additional base in the primer thus reduces the predicted number of products 16-fold. The use of random 10-mers would therefore be most appropriate for obtaining a maximal number of easily resolvable bands from the mouse genome.
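The dependence of the predicted band number on primer length is easy to tabulate. The sketch below assumes that equation 8.1 has the form A = 4000 × C / 4^(2N); this form is inferred from the values quoted for N = 9, 10, and 11 with C = 2 × 10^9, and the constant 4000 is therefore an assumption. The function name is hypothetical.

```python
def predicted_rapd_bands(complexity, primer_length, factor=4000):
    """Expected number of amplified bands of 2 kb or less (assumed form of 8.1)."""
    # Both primer targets must match perfectly, hence the exponent 2N.
    return factor * complexity / 4 ** (2 * primer_length)


MOUSE_SINGLE_COPY_COMPLEXITY = 2e9  # estimate used in the text

for n in range(8, 12):
    bands = predicted_rapd_bands(MOUSE_SINGLE_COPY_COMPLEXITY, n)
    print(f"N = {n:2d}: {bands:10.2f} predicted bands")
```

Because the exponent is 2N, each extra base in the primer divides the prediction by 16, which is why the workable range of primer lengths is so narrow.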

Optimizations of the complete RAPD protocol, from the parameters upon which primer sequences are chosen to the conditions used for PCR, have been published (Williams et al., 1990; Nadeau et al., 1992). It is actually possible to obtain multiple PCR products with primers longer than 10-mers when relaxed reaction conditions are used to allow amplification from mismatched target sequences. In fact, one group has suggested that 12 to 14-mer primers are optimal (Nadeau et al., 1992). It is also possible to increase the predicted number of products by a simple factor of three by including two unrelated random primers of the same length in each PCR reaction. However, in any case where the number of visible PCR products goes above 12 to 20, it would become necessary to use polyacrylamide gels, rather than agarose gels, in order to clearly resolve each band; thus, the trade-off for the detection of more loci is a more time-consuming analysis. In the end, the protocol that requires the least amount of time for typing per locus is the one which should be chosen. Since different laboratories often excel at different techniques, the optimal conditions for RAPD analysis should be determined independently in each laboratory.

A comprehensive RAPD analysis of the two most well-characterized inbred strains — B6 and DBA — has been performed with 481 independent 10-mers used singly in PCR reactions (Woodward et al., 1992). An average of 5.8 PCR products per reaction were observed, which is not very different from that predicted from equation 8.1. In a direct strain comparison, 95 reproducible differences were observed between B6 and DBA among the complete set of 2,900 discrete bands detected. Assuming that each polymorphism results from a single nucleotide change in one of the two primer targets and all such changes are detectable, it becomes possible to calculate the average sequence difference between these two strains at 1.6 changes per 1,000 nucleotides. This low level of polymorphism is not unexpected given the high degree of relatedness known to exist among all of the classical inbred strains (see section 2.3.4).
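The divergence estimate can be reproduced from the figures quoted above. This is a sketch under the paragraph's own assumptions (each polymorphism reflects a single nucleotide change in one of the two 10-base primer targets, and every such change is detected); the function and variable names are illustrative.

```python
def divergence_per_kb(polymorphic_bands, total_bands, primer_length):
    """Estimated nucleotide differences per 1,000 bases.

    Each band is defined by two primer targets, so it samples
    2 * primer_length nucleotides of genomic sequence.
    """
    bases_sampled = total_bands * 2 * primer_length
    return 1000.0 * polymorphic_bands / bases_sampled


if __name__ == "__main__":
    # Figures quoted in the text from Woodward et al. (1992).
    rate = divergence_per_kb(polymorphic_bands=95, total_bands=2900, primer_length=10)
    print(f"{rate:.1f} changes per 1,000 nucleotides")
    print(f"{95 / 481:.2f} polymorphisms per primer reaction")
```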

Using RAPD as a method for detecting polymorphisms between B6 and DBA would appear to be rather inefficient — on average, only one polymorphism was detected among every five primer reactions that were run. A second negative factor is that all RAPD polymorphisms are binary +/— systems. Thus, as discussed above for RFLPs, a polymorphism detected between one pair of strains may not translate into use for another pair of strains. Furthermore, it is not possible to distinguish animals that are heterozygous at any locus from those that are homozygous for the ‘+’ allele. Thus, on average, only half of the RAPD polymorphisms detected between two strains would be mappable among offspring from a backcross to one parent, and with an intercross mapping system, the RAPD approach is even more limited (see section 9.4.3).
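The backcross limitation follows directly from the dominant nature of the band. As a minimal sketch (genotypes are modeled as pairs of '+' and '-' alleles, and all names are illustrative), one can enumerate the two offspring classes of a backcross and ask whether they look different on a gel:

```python
def band_phenotype(genotype):
    """A RAPD band is visible if at least one '+' allele is present."""
    return "band" if "+" in genotype else "no band"


def backcross_is_informative(recurrent_parent):
    """Return True if the two backcross offspring classes can be told apart.

    The F1 is always '+/-'. Each offspring inherits '+' or '-' from the F1
    and one allele from the homozygous recurrent parent.
    """
    parent_allele = recurrent_parent[0]  # the recurrent parent is homozygous
    offspring = [("+", parent_allele), ("-", parent_allele)]
    phenotypes = {band_phenotype(genotype) for genotype in offspring}
    return len(phenotypes) == 2


if __name__ == "__main__":
    for parent in [("+", "+"), ("-", "-")]:
        label = "/".join(parent)
        print(f"backcross to {label}: informative = {backcross_is_informative(parent)}")
```

Only the backcross to the parent that lacks the band separates the offspring into two visible classes, which is why roughly half of the polymorphisms between two strains can be mapped in a backcross to a given parent.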

Nevertheless, there are many features that speak to the utility of the RAPD approach. Foremost among these is the relative speed and ease with which results can be obtained — there is no need for blotting or radioactive hybridization, and a complete analysis from start to finish can be performed within a single working day, unlike RFLP or minisatellite studies. Unlike all other PCR-based protocols, RAPD primers are not dependent on the results of costly cloning and sequencing studies, and once they are obtained, the main cost per sample is the DNA polymerase used for PCR. Thus, even in comparisons of inbred strains, the RAPD protocol may be more efficient in the long run relative to other techniques for generating random DNA markers. Additionally, cloning of RAPD fragments can be rapidly accomplished after the simple recovery of ethidium bromide-detected bands. Cloned RAPD loci will have an advantage over minisatellites and microsatellites in that RAPD loci need not, and most will not, contain repetitive sequences.

As is the case with traditional RFLP loci, the inter-species level of RAPD polymorphism is much greater than that observed among the traditional inbred strains. The data of Serikawa et al. (1992) indicate a five-fold increase in the number of polymorphic bands observed in comparisons between M. spretus and traditional laboratory strains; this increase parallels the known increase in genetic diversity. Thus, the RAPD technology will be even more efficient for marker development in crosses that incorporate one parent that is not derived from one of the traditional inbred strains.

Like minisatellite analysis, the RAPD protocol can provide genomic fingerprints that simultaneously scan loci dispersed throughout the genome. In an analysis of 32 representative inbred strains maintained at the Jackson Laboratory, Nadeau and colleagues (1992) defined 29 unique strain fingerprints with the use of only six primers. Thus, RAPD would appear to provide an efficient and easy means by which to monitor the genetic purity of inbred lines on an ongoing basis.

Finally, it should be mentioned that while the RAPD protocol is a useful, and important, addition to the arsenal of tools available for genetic analysis of the mouse, it is of vastly greater importance for genetic studies of other species, including both animals and plants, that are not well-characterized at the DNA level. For these other species, the RAPD technology can provide a unique method for the rapid development of genetic markers and maps even before DNA libraries and clones are available.

8.3.5 Interspersed Repetitive Sequence (IRS) PCR

The principle behind the RAPD approach is that oligonucleotides having essentially a random sequence will be present at random positions in the genome (of the mouse and every other species) at a frequency which can be pre-determined mathematically. Thus, by choosing oligonucleotides of an appropriate size and by running PCR amplification reactions under the appropriate conditions, one can control the number of independent genomic fragments that are amplified such that they can be optimally resolved by a chosen system of gel electrophoresis. Although useful for linkage analysis, the RAPD approach does not allow the discrimination of mouse sequences from those of other species and, thus, it cannot be used as a means for recovering mouse genomic fragments from cells that contain a defined portion of the mouse genome within the context of a heterologous genetic background.

An alternative approach that, like RAPD, also allows the simultaneous PCR amplification of multiple genomic fragments is based on the natural occurrence of highly repeated DNA elements that are dispersed throughout the genome. Three families of mouse repeat elements (B1, B2, and L1) are each present in approximately 100,000 copies (see section 5.4). Amplification of these elements in-and-of themselves with repeat-specific primers would not be useful, first, because their copy numbers are too great to be resolved by any available gel system and, second, because there would be no way of distinguishing most individual elements from each other since most would be of the same consensus size. However, if instead one used a repeat-specific primer that "faced-out" from the element, one would amplify only regions of DNA present between two elements that were sufficiently close to each other and in the correct orientation to allow the PCR reaction to proceed. The number of instances in which two elements would satisfy these conditions will be much lower than the total copy number since, on average, these elements will be spaced apart at distances of approximately 30 kb and, for all practical purposes, PCR amplification does not occur over distances greater than one-to-two kilobases. By working with (1) one or a combination of two or more primers together, that (2) hybridize to whole families or subsets of elements within a family, from (3) two ends of the same element or from different elements, one can adjust the number of PCR products that will be generated to obtain the maximal number that can be resolved by a chosen system of gel electrophoresis (Herman et al., 1992).

This general protocol is referred to as Interspersed Repetitive Sequence PCR (IRS-PCR). It was first developed for use with the highly repeated Alu family of elements in the human genome (Nelson et al., 1989), and was subsequently applied to the mouse genome (Cox et al., 1991). Cox and co-workers used individual primers representing each of the major classes of highly dispersed repetitive elements — B1, B2 and L1 — to amplify genomic fragments from inbred B6 mice (M. musculus) and M. spretus. Although the IRS-PCR patterns obtained upon agarose gel electrophoresis were extremely complex, it was possible to see clear evidence of species-specificity. To simplify the patterns, these workers blotted and sequentially hybridized the IRS-PCR products to simple sequence oligonucleotides (12-mers containing three tandem copies of a tetramer) present frequently in the genome but only, by chance, in a subset of the amplified inter-repeat regions. The simplified patterns obtained allowed the identification and mapping of thirteen new polymorphic loci.

Clearly, the utility of IRS-PCR as a general mapping tool is no greater than that of the RAPD technique or any other protocol that allows the random amplification of PCR fragments from around the genome. However, the real power of IRS-PCR is not in general mapping but in the identification and recovery of mouse-specific sequences from cell hybrids as discussed in section 8.4.

8.3.6 Microsatellites: Simple Sequence Length Polymorphisms

8.3.6.1 The magic bullet has arrived

Although the ability to identify and type simple basepair substitutions changed the face of genetics, it has not been a panacea. Finding RFLPs within a cloned region is often not easy; when they are found, their polymorphic content is often limited and di-allelic; finally, typing large numbers of RFLP loci by Southern blot analysis is relatively labor-intensive. Non-RFLP base changes can also be difficult to find, although this task has become easier with the development of the SSCP protocol. But most SSCPs still show a limited polymorphic content with just two distinguishable alleles. Minisatellites are much more polymorphic than loci defined by nucleotide substitutions, and minisatellite probes often allow one to simultaneously type multiple loci dispersed throughout the genome. However, minisatellite elements as a class have no unique sequence characteristics and are recognized only by the Southern blot patterns they produce when they are used as probes. Fewer than 1,000 minisatellite loci have been uncovered to date. Thus, in general, minisatellites cannot provide specific handles for typing newly cloned genes or genomic regions. Other methods of multi-locus analysis described previously suffer from the same limitations.

In the next chapter, it will be seen that one very important use of DNA markers is not to follow particular genes of interest in a segregation analysis, but rather to provide "anchors" that are spaced at uniform distances along each chromosome in the genome. Together, these anchor loci can be used to establish "framework maps" for new crosses which, in turn, can be used to rapidly map any new locus or mutation that is of real interest. If the number of anchors is sufficient, it will only take a single cross to provide a map position for the new locus. There is no need for anchor loci to represent actual genes. Their only purpose is to mark particular points along the DNA molecule in each of the chromosomes in a genome.

There are three criteria that define perfect anchor loci. First, they should be extremely polymorphic so that there is a good chance that any two chromosome homologs in a species will carry different alleles. Second, they should be easy to identify so that one can develop an appropriate set of anchors for the analysis of any complex species that a geneticist wishes to study. Finally, they should be easy to type rapidly in large numbers of individuals.

Now with the dawn of the 1990s has come what may indeed be the magic bullet that geneticists (who study the mouse as well as all other mammals) have been waiting for — a genomic element with unusually high polymorphic content, that is present at high density throughout all mammalian genomes examined, is easily uncovered and quickly typed: the microsatellite. A microsatellite — also known as a simple sequence repeat (or SSR) — is a genomic element that consists of a mono-, di-, tri- or tetrameric sequence repeated multiple times in a tandem array.

Unlike other families of dispersed cross-hybridizing elements (such as B1, B2, and L1), in which individual loci are derived by retrotransposition from common ancestral sequences (see section 5.4), individual microsatellite loci are almost certainly derived de novo, through the chance occurrence of short simple sequence repeats that provide a template for unequal crossover events (as illustrated in figure 8.4) that can lead to an increase in the number of repeats through stochastic processes. In general, microsatellite loci are not conserved across distant species lines, for example, from mice to humans, and it seems unlikely that these elements, which are practically devoid of information content, have any functionality either to the benefit of the host genome or in-and-of themselves. Microsatellites do not appear to be selfish elements (discussed in section 5.4). Rather, microsatellites, like minisatellites, are simply genomic quirks that result from errors in recombination or replication.

Microsatellites containing all nucleotide combinations have been identified. However, the class found most often in the mouse genome contains a (CA)n•(GT)n dimer, and is often referred to as a CA-repeat. The existence of CA-repeats, their presence at high copy number, and their dispersion throughout the genomes of a variety of higher eukaryotic species was first demonstrated a decade ago by several different laboratories (Miesfeld et al., 1981; Hamada et al., 1982; Jeang and Hayward, 1983). Although independent examples of CA-repeat polymorphisms surfaced over the following decade, it was not until 1989 that three groups working independently uncovered sufficient evidence to suggest that microsatellites as a class were intrinsically, extremely polymorphic (Pickford, 1989; Weber and May, 1989; Pedersen et al., 1993). Further systematic studies have confirmed the high level of polymorphism associated with many microsatellite loci in all higher eukaryotes that have been looked at.

8.3.6.2 Typing by PCR

Without PCR, most microsatellites would be useless as genetic markers. Allelic variation is based entirely on differences in the number of repeats present in a tandem array rather than specific basepair changes. Thus, the only way in which alleles can be distinguished is by measuring the total length of the microsatellite. This is most readily accomplished through PCR amplification of the microsatellite itself along with a small amount of defined flanking sequence on each side followed by gel electrophoresis to determine the relative size of the product as illustrated in figure 8.10.
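Because alleles differ only in repeat number, the expected size difference between two PCR products is simple arithmetic. A minimal sketch follows; the flank lengths and repeat counts are hypothetical values chosen for illustration, not data from the text.

```python
def pcr_product_length(left_flank, right_flank, repeat_unit, n_repeats):
    """Product length in bp: both flanks (primers included) plus the repeat array."""
    return left_flank + len(repeat_unit) * n_repeats + right_flank


if __name__ == "__main__":
    # Hypothetical locus: 40 bp and 45 bp of flanking sequence around a CA-repeat.
    allele_a = pcr_product_length(40, 45, "CA", 15)
    allele_b = pcr_product_length(40, 45, "CA", 18)
    print(allele_a, allele_b, allele_b - allele_a)
```

The difference between alleles depends only on the repeat unit length and the difference in repeat number, so the flanking sequence adds a constant to every allele. This is why short flanks make a given difference easier to see as a proportion of the total product length.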

Microsatellite loci can be identified in two ways: by searching through DNA sequence databases or by hybridization to libraries or clones with an appropriate oligonucleotide such as (CA)15. In the former case, flanking sequence information is obtained directly from the database. In the latter case, it is first necessary to sequence across the repeat region to derive flanking sequence information. A unique oligonucleotide on each side of the repeat is chosen for the production of a primer according to the criteria described at the beginning of section 8.3. It is best to choose two primers that are as close to the repeat sequence as possible, since the smaller the PCR product, the easier it is to detect any absolute difference in size.
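A database search of this kind amounts to scanning sequence text for uninterrupted runs of the repeat unit. The sketch below uses Python's standard `re` module; the function name and the default threshold of 10 repeats are illustrative assumptions rather than parameters of any published screen.

```python
import re


def find_ca_repeats(sequence, min_repeats=10):
    """Return (start, n_repeats, motif) for each perfect CA or GT run.

    A GT run on the strand given corresponds to a CA run on the
    complementary strand. A run read in the other frame, such as (AC)n,
    is reported with one repeat fewer because the match starts at the first CA.
    """
    pattern = re.compile(r"(?:CA){%d,}|(?:GT){%d,}" % (min_repeats, min_repeats))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        run = match.group(0)
        hits.append((match.start(), len(run) // 2, run[:2]))
    return hits


if __name__ == "__main__":
    # Toy sequence built for illustration only.
    toy = "TTT" + "CA" * 12 + "GGG" + "GT" * 10 + "AA"
    for start, n, motif in find_ca_repeats(toy):
        print(f"({motif}){n} at position {start}")
```

The flanking sequence on either side of each reported position is then the raw material for primer design.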

Variations in the length of PCR products can be detected by separation on NuSieve™ agarose (FMC Corp.) gels (Love et al., 1990; Cornall et al., 1991) or polyacrylamide gels (Weber and May, 1989; Love et al., 1990). Agarose gels are easier to handle, but polyacrylamide gels provide higher resolution. When alleles are difficult to resolve with native gels, it is often possible to improve the level of resolution by running denaturing gels. Bands are detected by ethidium bromide or silver staining of gels or by autoradiography of PCR products formed with labeled primers.

An even higher level of cost-efficiency can be achieved by combining two or more loci for simultaneous analysis through multiplex PCR. Samples can be combined before the PCR reaction — if the different primer pairs have been shown not to cause combinatorial artifacts — or after the PCR reaction but before the gel run. In all cases, the entire process is amenable to automation.

8.3.6.3 Classification and frequency of microsatellites

Microsatellites can be classified first according to the number of nucleotides in the repeat unit. Mononucleotide and dinucleotide repeat elements are quite common; with each subsequent increment in nucleotide length — from trinucleotide to tetranucleotide to pentanucleotide — the frequency of occurrence drops quickly. Perfect microsatellites are those that contain a single uninterrupted repeat element flanked on both sides by non-repeated sequences (Weber, 1990). A large proportion of microsatellite loci are imperfect with two or more runs of the same repeat unit interrupted by short stretches of other sequences. The polymorphic properties of imperfect microsatellites are determined by the longest stretch of perfect repeat within the locus. Not infrequently, microsatellites are of an imperfect and compound nature, with a mingling of two or more distinct runs of different repeat units.

The most common microsatellites in the mouse genome are members of the dinucleotide class. With complementarity and frame-shift symmetry, there are only four unrelated types of dinucleotide repeats that can be formed: (CA)n•(GT)n, (GA)n•(CT)n, (CG)n•(GC)n, and (TA)n•(AT)n. Of these four, two are not useful as microsatellite markers for different reasons: (CG)n•(GC)n is present only infrequently within all mammalian genomes (as discussed in sections 8.2.2 and 10.3.4.4), and long (TA)n•(AT)n stretches do not allow for stable hybrid formation at the temperature normally used for PCR strand elongation. Of the remaining two classes, CA-repeats are found most often in the mouse genome. Furthermore, although CA-repeats have been found in all eukaryotes examined, they are absent from prokaryotes. This fact greatly simplifies the task of screening for their presence in traditional E. coli-based libraries.
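The count of four dinucleotide repeat types can be verified by enumeration. In this sketch (all names are illustrative), two repeat units are treated as equivalent if one is a frame-shifted copy of the other or of its reverse complement, since both then describe the same double-stranded tandem array.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(unit):
    """Sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(unit))


def rotations(unit):
    """All frame-shifted versions of a repeat unit."""
    return {unit[i:] + unit[:i] for i in range(len(unit))}


def dinucleotide_classes():
    """Group the dinucleotide units that describe the same tandem array."""
    classes = set()
    for bases in product("ACGT", repeat=2):
        unit = "".join(bases)
        if unit[0] == unit[1]:
            continue  # AA, CC, GG, TT are mononucleotide repeats
        equivalent = rotations(unit) | rotations(reverse_complement(unit))
        classes.add(frozenset(equivalent))
    return classes


if __name__ == "__main__":
    for members in sorted(sorted(c) for c in dinucleotide_classes()):
        print(members)
```

The twelve non-homopolymer dinucleotides fall into two classes of four units and two self-complementary classes of two units, matching the four types listed above.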

Based on a quantitative dot blot analysis, Hamada et al. (1982) estimated the number of CA-repeat loci in the mouse genome at ~100,000, equivalent to an average of one locus every 30 kb. Another estimate of CA-repeat copy number was obtained by scanning 287 kb of mouse genomic sequences entered in GenBank for (CA)n where ‘n’ was 6 or greater (Stallings et al., 1991). This analysis found CA-repeats once every 18 kb on average. The difference between these two estimates can be accounted for entirely by sequences having only 6 to 9 repeats, which are too short to be detected by the hybridization-based dot blot analysis (Weber, 1990).

The second most frequent microsatellite class in the mouse genome is the GA-repeat which occurs at a frequency of approximately half that observed for CA-repeats (Cornall et al., 1991). GA-repeat loci are just as likely to be polymorphic as CA-repeat loci. Thus by screening for both simultaneously, one can increase the chances of finding a useful microsatellite by 50%.

The mononucleotide repeat poly(A•T) is found in the mouse genome at a frequency similar to, if not greater than, the CA-repeats. However, it is often contained within the highly dispersed B1, B2, and L1 repeats, which are themselves present in ~100,000 copies per haploid genome. Thus, random screens for poly(A•T) tracts will frequently land investigators in these more extensive repetitive regions where it will be difficult to derive locus-specific primer pairs for PCR analysis. Nevertheless, if one is aware of this pitfall, it becomes possible to use computer programs to assist one in this task, and it is often possible to type microsatellite-containing-B1 (or B2 or L1) elements (Aitman et al., 1991). Mononucleotide repeats, both within and apart from the more complex repeat elements, are just as likely to be polymorphic as dinucleotide repeats (Aitman et al., 1991). However, a second potential pitfall with long poly(A•T) tracts is that, as is the case with long (TA)n•(AT)n dinucleotide tracts, there is a reduced melting temperature which necessitates the use of PCR elongation steps under conditions of reduced specificity, leading to an increased incidence of artifactual products. As a consequence of these pitfalls, poly(A•T) tracts have been used much less frequently as a source of polymorphic microsatellite markers. The microsatellite poly(C•G) is not associated with either of these pitfalls, but it is much less frequently observed — by an order of magnitude — in the mouse genome (Aitman et al., 1991).

Tri- and tetranucleotide repeat unit microsatellites are also present in the genome, but at a frequency ten-fold below that of the dinucleotide (CA)n and (GA)n loci (Hearne et al., 1992). As such, they will be represented much less often in genomic libraries and individual clones. However, once uncovered, these higher-order microsatellites are much better to work with than the dinucleotide loci. The level of polymorphism observed with the tri- and tetranucleotide loci appears similar to that observed with CA- and GA-repeats, but alleles are much more readily resolved with 3-4 bp mobility shifts for each repeat unit difference. Furthermore, ladders of artifactual PCR products commonly seen with dinucleotide repeats do not appear as often or as intensely with higher-order repeat unit loci (Hearne et al., 1992).

8.3.6.4 Polymorphism levels and mutation rates

As is the case with minisatellite loci, the generation of new microsatellite alleles is not due to classical mechanisms of mutagenesis. Rather, the number of tandem repeats is altered as a consequence of mispairing, or slippage, during recombination or replication within the tandem repeat sequence. As illustrated in figure 8.4, events of this type will create new alleles by expanding or contracting the size of the locus. The frequency with which these events occur is a function of the number of repeats in the locus with a sigmoidal distribution. CA-microsatellites with 10 or fewer repeat units are unlikely to show polymorphism; with 11 to 14 repeat units, there is an intermediate, and climbing, probability of detecting polymorphism; with 15 repeat units or more, there is a maximal probability of detecting polymorphism (Weber, 1990; Dietrich et al., 1992). Thus, to maximize the probability of detecting polymorphism, one should focus analyses on CA-repeat loci having n ≥ 15. Hybridization screens can be set up to accomplish this task by probing blots with a (CA)15 oligonucleotide under high stringency conditions of 65°C with 0.1× SSC (Dietrich et al., 1992).
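When candidate loci come from a sequence search rather than a hybridization screen, the same repeat-number thresholds can be applied as a simple filter. The sketch below encodes only the three qualitative categories given above; the function name and the example loci are hypothetical.

```python
def polymorphism_expectation(n_repeats):
    """Qualitative expectation for a CA-repeat locus, using the thresholds in the text."""
    if n_repeats <= 10:
        return "unlikely"
    if n_repeats <= 14:
        return "intermediate"
    return "maximal"


if __name__ == "__main__":
    # Hypothetical candidate loci: name -> number of uninterrupted CA repeats.
    candidates = {"locus_1": 8, "locus_2": 12, "locus_3": 21}
    for name, n in candidates.items():
        print(name, n, polymorphism_expectation(n))
```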

A large number of laboratories have now reported the results of investigations into the frequencies at which microsatellite polymorphisms are detected in comparisons of two or more inbred strains or mouse species. The actual results would be expected to vary depending on the method used to recover microsatellites (because this will determine the lower boundary for repeat number) and the method used to type the PCR products (because agarose gels are less resolving than polyacrylamide gels). In an analysis of over 300 CA-repeat microsatellites that are predominantly of the n ≥ 15 class, an average polymorphism rate of approximately 50% was observed in pairwise comparisons among nine classical M. musculus inbred strains; the lowest level of polymorphism observed was 35% between DBA/2J and C3H/HeJ, and the highest was 57% between B6 and LP/J (Dietrich et al., 1992). Not unexpectedly, even higher levels of polymorphism were observed in pairwise comparisons between classical inbred strains and other Mus species or subspecies. The rate of polymorphism between B6 and M. m. castaneus was 77%, and between B6 and M. spretus, it was ~90% (Love et al., 1990; Dietrich et al., 1992). For a small but significant number of loci, the primers designed to amplify an inbred strain locus failed to amplify an allelic product from the M. spretus genome (Love et al., 1990); this is almost certainly due to an interspecific polymorphism in a target sequence recognized by one of the flanking primers.

A number of investigators have attempted to measure the rate at which new microsatellite alleles are created. This can be readily accomplished in the mouse where the relationships among a large number of different inbred strains have been well-documented and it is possible to count the generations that separate various strains from each other (Bailey, 1978). The results of these studies indicate that the rate of mutation is highly variable — over at least an order of magnitude. This variability could be a consequence of genomic position effects, but the mechanism of allele generation must be clarified before one can say for sure. In the most comprehensive analysis to date, Dietrich and colleagues (1992) analyzed the average rate of mutation at 300 loci within the BXD set of recombinant inbred strains. The average mutation rate was calculated at one in 22,000 per locus per generation, which is 5- to 50-fold greater than that normally attributed to mutagenesis at classical loci. This average microsatellite "mutation" rate is high enough to allow the generation of a large amount of polymorphism among individuals within a species, but low enough to allow one to accurately follow the segregation of two or more alleles from one generation to the next within a typical genetic cross.
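To see why a rate of one in 22,000 is low enough for reliable segregation analysis, one can estimate how many new-mutation events to expect in a typical cross. This is a rough sketch that assumes the average rate applies uniformly and independently to every locus and every meiosis; the cross size in the example is hypothetical.

```python
MUTATION_RATE = 1 / 22000  # per locus per generation, the average quoted in the text


def expected_new_alleles(n_loci, n_meioses, rate=MUTATION_RATE):
    """Expected number of new-mutation events among all locus-meiosis pairs typed."""
    return n_loci * n_meioses * rate


if __name__ == "__main__":
    # Hypothetical cross: 300 loci typed on 100 backcross offspring,
    # counting one informative meiosis per animal.
    n_loci, n_animals = 300, 100
    events = expected_new_alleles(n_loci, n_animals)
    print(f"{events:.2f} expected events among {n_loci * n_animals} genotypes")
```

Under these assumptions only one or two of the 30,000 genotypes would be expected to carry a newly arisen allele, which is negligible for the purposes of linkage analysis.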

8.3.6.5 The awesome power of microsatellites

The high level of polymorphism associated with microsatellites (as a class) represents just one component of their rapid rise to become the "genetic tool of choice" for mappers working with all animal species. Their uniqueness and power also lies within the ease with which they can be uncovered, the ease with which they can be typed, and the ease with which they can be disseminated. To develop a panel of microsatellite loci for analysis of the mouse genome, Todd and his colleagues simply searched through the EMBL and GenBank databases for entries that contained (CA)10, (GA)10, or their complements (Love et al., 1990). To increase the size of this panel for higher resolution mapping analysis, genomic libraries constructed to contain short inserts were screened with CA-repeat probes, and positive clones were isolated and sequenced (Cornall et al., 1991; Dietrich et al., 1992).

Using a combined panel of 317 microsatellite loci, the Whitehead/MIT Genome Center developed a first-generation whole mouse genome linkage map with an average spacing of 4.3 cM (Dietrich et al., 1992). With the publication of the oligonucleotide sequences that define and allow the typing of each locus, the markers became available to everyone in a democratic fashion. As of January, 1994, the Whitehead group had defined and mapped over 3,000 microsatellite loci. Up-to-date mapping, strain distribution, and sequence information on all of these loci can be obtained electronically as described in appendix B. Furthermore, the commercial concern Research Genetics Inc. has made life even easier for the mouse genetics community by offering each primer pair in this panel at a greatly reduced cost relative to custom DNA synthesis.

Since microsatellite typing is PCR-based, and there is usually no need for blotting or probing, results can be obtained rapidly with a minimal expenditure of often-precious material and always precious man- and woman-hours. Dietrich and colleagues (1992) reported that two scientists can "genotype new crosses for the entire genome in a few weeks per cross" which represents an order of magnitude improvement over RFLP-based approaches.

Microsatellites can serve not only as tags for anonymous loci but for functional genes as well. Stallings and his colleagues (1991) found that 78% of the clones from a mouse cosmid library have CA-repeats. If one also searched for GA-repeats, the percentage of microsatellite-positive cosmid clones would be even greater. An even higher probability of identifying microsatellite loci, close to 100%, can be achieved with clones recovered from larger insert libraries constructed with Yeast Artificial Chromosomes (YACs) or special prokaryotic vectors (see section 10.3.3). Small fragments that contain the microsatellite can be subcloned and sequenced to identify a unique set of flanking primers for genetic analysis. Microsatellites can truly be viewed as universal genetic mapping reagents.

During the 1980s, the difficulties encountered in the search for RFLPs among the classical inbred strains led to the emergence of the interspecific cross (between an M. musculus-derived inbred strain and M. spretus), which became a critical tool for the development of the first high resolution DNA-locus-based maps of the mouse genome (Avner et al., 1988; Copeland and Jenkins, 1991 and section 9.3). Interspecific backcross panels still represent a powerful tool for mapping newly characterized DNA clones. However, with microsatellites, it is now possible to go back to classical crosses among M. musculus strains to map interesting phenotypic variants as discussed in section 9.4.

8.4 Region-specific panels of DNA markers

A large fraction of the gene mapping studies performed today have as an ultimate goal the cloning of a phenotypically-defined locus based on its chromosomal position. This process of positional cloning (discussed in detail in section 10.3) is still rather tedious, and it is usually dependent on two experimental tools that exist in the form of panels. The first panel consists of DNA samples obtained from the offspring of a cross set up to uncover recombination events between and among the phenotypically-defined locus and nearby marker loci. The types of crosses that can be used and the number of offspring to be analyzed are topics of the following chapter. In all cases, analysis of a large number of offspring is required to have a reasonable chance at identifying recombination breakpoints that are close to the locus of interest.

Identification of the recombination breakpoints that lie closest to the locus of interest is dependent on the availability of a sufficient number of region-specific polymorphic DNA markers. This is the second panel of tools. Ideally, one would like to have at-hand a set of markers, such as microsatellites, distributed at average distances of a few hundred kilobases apart. This would provide sufficient resolution for the mapping of recombination sites (section 7.2.3 and figure 7.5) as well as for the recovery of overlapping YAC clones (section 10.3.3).

Before 1994, most regions of the genome were not covered to this degree, and it was nearly always necessary for investigators to pursue special strategies to increase the size of the region-specific marker panel. However, as this section is being written, the average whole genome density of mapped microsatellite markers has reached one per megabase, and within a year’s time, it will be one per 500 kb. Furthermore, contigs of overlapping YAC clones have been developed for two complete human chromosome arms — 21q and the Y (Chumakov et al., 1992; Foote et al., 1992), and it is only a matter of time before additional human chromosomes and mouse chromosomes are added to this list. If an ordered, whole chromosome library is available, one can go directly to the clones that span the region of interest to derive polymorphic marker loci. This could be readily accomplished, for example, by screening for microsatellites within these clones.

Thus, what follows will soon be of historical interest only for mouse geneticists: approaches that investigators have used in the past to generate region-specific panels of DNA markers. These approaches have been included here for two reasons. First, they enable all readers to appreciate earlier work in this area of mouse molecular genetics. Second, they describe tools that may still be critical for geneticists working on organisms whose genomes are less well characterized than that of the mouse.

All rational approaches to region-specific cloning are based on fractionating the mouse genome such that only a single mouse chromosome or defined subchromosomal region is accessible prior to the recovery of clones that can be tested for use as DNA markers. Genome fractionation protocols fall into several classes with certain advantages and disadvantages. The major classes of genome fractionation methods are described in the following subsections.

8.4.1 Chromosome microdissection

The most direct means for genome fractionation is based on "microscopic dissection" (or microdissection as it is commonly called) of the region of interest from spreads of metaphase chromosomes on glass slides. This technique was first developed for the isolation of polytene chromosome bands from Drosophila salivary gland chromosomes (Scalenghe et al., 1981), and was later modified for use with mammalian chromosomes (Röhme et al., 1984). To aid in the identification of the correct chromosome, one can start with cells from mice in which the chromosome is marked karyotypically within the context of a single Robertsonian chromosome (Röhme et al., 1984; see section 5.2). Microdissection is an extremely tedious protocol that is difficult to master, and it is this difficulty that is its main drawback. However, the most skilled practitioners can circumscribe the region of dissection to a few chromosomal bands. This can represent a 100-fold enrichment from the whole genome, with almost no contamination from unlinked chromosomal regions. Although chromosome microdissection was developed prior to PCR, it is when the two techniques are combined that the power of this approach becomes apparent, with the potential for generating thousands of markers from a very well-defined subchromosomal interval (Ludecke et al., 1989; Bohlander et al., 1992). Detailed protocols for performing chromosome microdissection followed by cloning have been described in a monograph by Hagag and Viola (1993).

8.4.2 Chromosome sorting by FACS

A less tedious protocol for direct genomic fractionation is based on the utilization of a Fluorescence-Activated Cell Sorter (FACS) to separate the metaphase chromosome of interest away from all other chromosomes (Gray et al., 1990). The starting material for this protocol must come from a cell line in which this chromosome is physically distinguishable from all others. Sources of such chromosomes include cells from animals with an appropriate Robertsonian translocation (Bahary et al., 1992) or interspecific somatic cell hybrid lines that contain only the foreign chromosome or subregion of interest (section 10.2.3). The material obtained from a typical FACS sort is likely to be 50 to 70% pure, equivalent to an enrichment factor of about ten-fold, with the remaining material due to contaminants from other chromosomes. The resolution of the FACS chromosome fractionation protocol is clearly much less than that possible with microdissection, and this is its main drawback. The main advantage of this protocol is that a greater amount of material can be recovered and used directly to construct chromosome-specific large-insert genomic libraries (Bahary et al., 1992).
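The relationship between purity and enrichment quoted above can be checked with a short calculation. The sketch below assumes that fold enrichment is simply the purity of the sorted material divided by the fraction of the whole genome that the target chromosome represents. The 6% genome fraction is an illustrative assumption, not a figure taken from the text, and the function name is hypothetical.

```python
def fold_enrichment(purity, genome_fraction):
    """Fold enrichment of a target region in a fractionated sample.

    purity: fraction of the recovered DNA that derives from the target (0 to 1).
    genome_fraction: fraction of the whole genome occupied by the target (0 to 1).
    """
    if not 0 < genome_fraction <= 1:
        raise ValueError("genome_fraction must be in (0, 1]")
    if not 0 <= purity <= 1:
        raise ValueError("purity must be in [0, 1]")
    return purity / genome_fraction


# Assumption for illustration only: the sorted chromosome is about 6% of the genome.
ASSUMED_CHROMOSOME_FRACTION = 0.06

for purity in (0.5, 0.7):
    enrichment = fold_enrichment(purity, ASSUMED_CHROMOSOME_FRACTION)
    print(f"{purity:.0%} pure: about {enrichment:.1f}-fold enrichment")
```

Under this assumption, a sort that is 50 to 70% pure corresponds to roughly 8- to 12-fold enrichment, in line with the "about ten-fold" figure given above. A larger or smaller target chromosome would shift the result proportionally.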

8.4.3 Somatic cell hybrid lines as a source of fractionated material

A variety of somatic cell hybrid lines have been generated that contain only one or a few mouse chromosomes on the genetic background of a different species as described in section 10.2.2. The host genomes used most often to create somatic cell hybrid lines of use to mouse geneticists are Chinese hamster and human. The main advantage of well-characterized somatic cell hybrid lines is the ease with which they can be used, and the unlimited amount of high quality material that they can provide. The main disadvantage is that mouse genomic material is not alone, but mixed together with the whole genome of another species. Thus, to derive mouse-specific clones for use as markers, one must choose a protocol that allows the discrimination of mouse sequences from these other sequences, be they hamster or human. This can be accomplished by enlisting the highly repetitive element families B1, B2 and L1, which are unique to the mouse genome. The earlier approaches along this line were based on the construction of whole genome libraries from the cell line, followed by screening for mouse-containing clones with one or more repeat sequences (Kasahara et al., 1987). Of course, once such repeat element clones were obtained, it was imperative to subclone unique flanking sequences for use as DNA markers. More recently, the IRS-PCR technique described in section 8.3.5 has been used with great success in the rapid recovery of mouse-specific sequences from somatic cell hybrid lines (Simmler et al., 1991; Herman et al., 1992). With IRS-PCR, there is no need to first prepare a whole genome library.

An obvious limitation to the recovery of region-specific probes with IRS-PCR is that amplification will only occur between repetitive elements that are relatively close to each other and in the correct orientation. With the use of just the B2 primer, Herman and her colleagues (1991) were able to amplify approximately one PCR product for each megabase of mouse DNA present in somatic cell lines containing portions of the mouse X chromosome. With the use of other repeat element primers, alone or in combination, additional loci could be amplified (Simmler et al., 1991; Herman et al., 1992). PCR fragments can be readily excised from gels for cloning or for direct use as probes for linkage analysis.
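The spacing and orientation constraint described above can be illustrated with a toy model. The sketch below is only a simplified picture of single-primer IRS-PCR and not a published protocol. It assumes that a primer extends toward higher coordinates from a "+" copy of the repeat and toward lower coordinates from a "-" copy, so that a product requires a "+" copy on the left and a "-" copy on the right within some maximum amplifiable distance. The 3,000 bp limit, the function name, and the repeat positions are all hypothetical.

```python
from typing import List, Tuple

Repeat = Tuple[int, str]  # (coordinate in bp, strand: "+" or "-")


def amplifiable_pairs(repeats: List[Repeat],
                      max_product_bp: int = 3000) -> List[Tuple[int, int]]:
    """Return (left, right) coordinates of repeat pairs one primer could amplify.

    Model assumption: synthesis runs rightward from a "+" copy and leftward
    from a "-" copy, so a product needs a "+" copy on the left and a "-" copy
    on the right, no more than max_product_bp apart.
    """
    ordered = sorted(repeats)
    pairs = []
    for i, (left_pos, left_strand) in enumerate(ordered):
        if left_strand != "+":
            continue
        for right_pos, right_strand in ordered[i + 1:]:
            if right_pos - left_pos > max_product_bp:
                break  # input is sorted, so every later copy is farther away
            if right_strand == "-":
                pairs.append((left_pos, right_pos))
    return pairs


# Hypothetical repeat positions, for illustration only.
example = [(1_000, "+"), (2_500, "-"), (4_000, "+"), (9_000, "-"), (9_500, "-")]
print(amplifiable_pairs(example))
```

In this toy example only the first two copies qualify. The remaining copies are either in the wrong relative orientation or too far apart, which is why a single repeat primer recovers only a fraction of the loci in a region and why additional primers, alone or in combination, recover more.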

8.4.4 Miscellaneous approaches

Under special circumstances, other approaches can be considered for obtaining an enrichment of sequences from particular subregions of the genome. For example, if the region is contained within a defined NotI restriction fragment (or one derived from another infrequent cutter) that is sufficiently larger than the one megabase average, it would be possible to excise the portion of a pulsed field gel that contained this fragment followed by amplification (IRS-PCR or random sequence) and cloning. This procedure could provide as much as a 10-fold enrichment for sequences within a multiple-megabase region (Michiels et al., 1987).

In another approach, Hardies and colleagues (Rikke et al., 1991; Rikke and Hardies, 1991; Herman et al., 1992) have taken advantage of the concerted evolution of L1 sequences that occurs within a species to develop specific oligonucleotides that recognize L1 subfamilies that are relatively unique to the genomes of either M. spretus or M. musculus (see section 5.4.2). These oligonucleotides can be used to probe whole genome libraries made from animals congenic for a chromosomal region of interest from one species within the genetic background of the other species. This protocol has been validated in another laboratory (Himmelbauer and Silver, 1993) and could serve to provide a small number of new markers from those limited cases where the appropriate congenic lines have been constructed. In genetic terms, congenic strains are far superior to somatic cell hybrids because the region of interest can be more greatly circumscribed. As indicated in figure 3.6, after ten generations of backcrossing, the differential region will have an average length of 20 cM, and after 20 generations of backcrossing, the average differential length will be reduced to 10 cM.
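The two figures quoted above are consistent with a simple rule of thumb in which the average differential length falls as 200/N cM after N backcross generations. The sketch below uses that rule as an assumption chosen because it reproduces the quoted values. It is not read directly from figure 3.6, it should be reasonable only once N is fairly large, and the function name is hypothetical.

```python
def expected_differential_length_cm(generations: int) -> float:
    """Approximate average differential segment length (cM) after N backcrosses.

    Uses the rule of thumb 200 / N, assumed here because it reproduces the
    values quoted in the text (20 cM at N = 10 and 10 cM at N = 20).
    """
    if generations < 1:
        raise ValueError("generations must be at least 1")
    return 200.0 / generations


for n in (10, 20):
    print(f"N = {n}: about {expected_differential_length_cm(n):.0f} cM")
```

Under this approximation, halving the differential region requires doubling the number of backcross generations, which is why narrowing a congenic interval much below 10 cM by backcrossing alone is slow.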

Finally, in theory, one should be able to enrich for a region deleted in one genome, but not another, by subtractive hybridization. This approach has been tried in various formats that are all dependent on the use of a large excess of DNA from the deleted genome to drive hybrid formation with sequences that are also present in the non-deleted or "tester genome" (Kunkel et al., 1985). If the driver sequences are tagged in some way, they can be removed from the completed reaction mixture along with the tester sequences to which they hybridized. "Target sequences" unique to the tester genome — in other words, those that have been deleted from the driver genome — will all be left behind in the solution ready for analysis or cloning.

In practice, this approach has never worked as well as one would like because the high complexity of the mammalian genome prevents the hybridization reaction from going to completion. Even when subtractive steps are reiterated, the target sequences have only been enriched by a factor of 100 to 1000 at the very most. Thus, in its original form, this approach has lost favor. More recently, Wigler and his colleagues have built upon the subtractive hybridization approach to develop a PCR-based technique that is much more sensitive and highly resolving (Lisitsyn et al., 1993). This new technique, called Representational Difference Analysis (RDA), can be used to purify to completion sequences that are deleted from one genome but not from another that is otherwise identical. In theory, this same technique could also be used, in a manner analogous to that described for the L1 sequences above, for the identification and cloning of new RFLPs that are present in the differential DNA segment that distinguishes two members of a congenic pair.