Mouse Genetics: Concepts & Applications

Copyright ©1995 Lee M. Silver

9. Classical Linkage Analysis and Mapping Panels

9.1 Demonstration of linkage in the mouse

9.1.1 Mapping new DNA loci with established mapping panels

9.1.2 Anchoring centromeres and telomeres onto the map

9.1.3 Statistical treatment of linkage data

9.2 Recombinant inbred strains

9.2.1 Overview

9.2.2 Using RI strains to determine linkage

9.2.3 Using RI strains to determine map order

9.2.4 Using RI strains to determine map distances

9.2.5 Using RI strains to dissect complex genetic traits

9.3 Interspecific mapping panels

9.3.1 Overview

9.3.2 A comparison: RI strains versus the interspecific cross

9.3.3 Access to established interspecific mapping panels

9.3.4 Is the newly mapped gene a candidate for a previously-characterized mutant locus?

9.4 Starting from scratch with a new mapping project

9.4.1 Overview

9.4.2 Choosing strains

9.4.3 Choosing a breeding scheme

9.4.4 The first stage: mapping to a subchromosomal interval

9.4.5 The second stage: high resolution mapping

9.5 Quantitative traits and polygenic analysis

9.5.1 Introduction

9.5.2 A choice of breeding strategy and estimation of locus number

9.5.3 Choices involved in setting up crosses

9.5.4 An optimal strategy for mapping polygenic loci

 

9.1 Demonstration of linkage in the mouse

9.1.1 Mapping new DNA loci with established mapping panels

When a new mouse locus has been defined at the DNA level, it can be mapped by three different approaches: somatic cell hybrid analysis, in situ hybridization, or formal linkage analysis. The first of these approaches is not applicable generally to the mouse because single chromosome hybrids have not been gathered together in a systematic way for the whole mouse genome. However, even in those cases where such hybrids exist, this type of analysis provides only a chromosomal assignment. The second approach — in situ hybridization — is more highly resolving than somatic cell hybrid analysis, but this protocol requires special expertise and the resolution is still less than that obtained routinely with linkage analysis. Both of these non-sexual mapping protocols have two advantages over all forms of linkage analysis. First, they do not require any prior knowledge of map positions for other loci. Second, they allow the mapping of non-polymorphic loci. Thus, in the early days of mouse molecular genetics, before many DNA markers had been placed onto the map, and before new methods for uncovering polymorphisms had been developed, both of these protocols served useful functions in the arsenal of general mapping tools.

Today, the method of choice for mapping a new locus defined at the DNA level will always be formal linkage analysis. There are two interrelated reasons for this. First, a whole genome mouse linkage map of very high density has been developed with thousands of polymorphic DNA markers already in place and new ones being added each month (Copeland et al., 1993). The second reason lies within the existence of various mouse "mapping panels" that have been established by a number of investigators at different institutions around the world.

A mapping panel is a set of DNA samples obtained from animals that carry random recombinant chromosomes produced within the context of a specific breeding scheme. The most widely used mouse mapping panels are of two specific types. One consists of representative DNA samples derived from each strain of a recombinant inbred (RI) set or group of RI sets. The approach to mapping with RI strains will be detailed in section 9.2. The second type of widely-used mapping panel contains samples derived from the offspring of an interspecific backcross between the two species, M. musculus and M. spretus. This approach will be discussed in section 9.3. It is also possible to design mapping panels that are based on an intercross between two F1 hybrid parents obtained in an interspecific or intersubspecific outcross between two different inbred strains (Dietrich et al., 1992).

The power of mapping panels lies within the database of information that is already available for a large number of previously-typed loci in members of the same defined cohort of animals. The most useful panels have been typed for at least two hundred independent DNA markers and, in fact, the most well-established panels have been typed for many more. In classical genetic terminology, this can be viewed as a multi-hundred point cross that provides linkage maps across the complete spans of all chromosomes in the genome.

Thus, the mapping of a new locus can be accomplished simply by genotyping each of the samples in the same cohort (or a subset thereof) for just the new locus of interest. It is never necessary to type more than one hundred animals in the initial analysis and, as discussed in sections 9.2 and 9.4, with a well-characterized panel, one can usually obtain a map position with the typing of 50 or fewer animals. A single investigator can easily carry out such an analysis in less than a week’s time with the use of either a PCR analysis or Southern blotting. The results obtained are entered into the database containing all prior mapping information on the panel and a computational algorithm is used to determine the location of the new locus within the already-established linkage map. Essentially, this is accomplished by searching for concordant segregation between alleles at the new locus and those at one or more loci that have been previously typed on the same panel. With a well-established mapping panel, a first-order map position will always be obtained. At this point, strategies for further analysis will depend on the goals of the investigator. A discussion of the two most important classes of mapping panels — recombinant inbred strains and the interspecific backcross — will be presented in sections 9.2 and 9.3 of this chapter.
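The search for concordant segregation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marker names, the ten-animal panel, and the helper names `discordance` and `rank_markers` are all hypothetical, and each genotype string simply records which parental allele ("B" or "S") each backcross animal inherited from its F1 parent. Real mapping software must also deal with missing data and marker order.

```python
# Hypothetical backcross panel: one character per animal, giving the allele
# ("B" or "S") transmitted by the F1 parent at each previously typed marker.
panel = {
    "MarkerA": "BBSSBSBBSS",
    "MarkerB": "BSSSBSBBBB",
    "MarkerC": "SBSBBBSSBS",
}

# Typing results for the new locus on the same ten animals (also invented).
new_locus = "BBSSBSBBSB"


def discordance(pattern_a, pattern_b):
    """Count animals in which the two loci carry alleles of different origin."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("both loci must be typed on the same animals")
    return sum(a != b for a, b in zip(pattern_a, pattern_b))


def rank_markers(panel, new_locus):
    """Return (marker, discordant animals, discordant fraction), best first."""
    results = []
    for name, pattern in panel.items():
        count = discordance(pattern, new_locus)
        results.append((name, count, count / len(new_locus)))
    return sorted(results, key=lambda item: item[1])


for name, count, fraction in rank_markers(panel, new_locus):
    print(name, count, fraction)
```

In this invented data set the new locus shows the fewest discordant animals with MarkerA, which would make that marker the first candidate for linkage. Whether such a result is statistically meaningful is the subject of section 9.1.3.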

9.1.2 Anchoring centromeres and telomeres onto the map

As discussed in section 5.2, all twenty-one chromosomes in the standard mouse karyotype (19 autosomes and the X and Y) are extremely acrocentric. Even with very high resolution light microscopy of extended prophase chromosomes, the centromere appears to lie at one end of each chromosome. Although there must be a segment of DNA containing at least a telomeric sequence that precedes the centromere, no unique sequence loci have ever been localized to this hypothetical segment. Thus, for all intents and purposes, one can view the genetic map of each chromosome as beginning with a centromere and ending with a telomere.

In the absence of centromere and telomere mapping information, a linkage map will be unanchored. As a result, the length of genetic material that lies beyond the furthermost marker at each end of the map will not be known. However, since both centromeres and telomeres are composed of repeated simple sequences that are shared among all chromosomes, their direct mapping requires special approaches.

9.1.2.1 Telomere mapping

All mammalian telomeres are composed of thousands of tandem copies of the same basic repeat unit TTAGGG (Moyzis et al., 1988; Elliott and Yen, 1991). Early sequence comparisons indicated that while the basic repeat unit was highly conserved, occasional nucleotide changes could arise anywhere within the large telomeric sequence present at the end of any chromosome. Elliott and Yen (1991) realized that one particular nucleotide change, from a G to a C in the sixth position of this repeat unit, would create a DdeI restriction site (CTNAG) that overlapped two adjacent repeats: [TTAGGC][TTAGGG]. In the absence of such a change, the enzyme DdeI would not cut anywhere inside a particular telomeric region, which would remain intact within a restriction fragment of 20 kb or more in size. In contrast, one or more substitutions of the type described would allow DdeI to reduce a telomeric region into smaller restriction fragments that could be detected by probing a Southern blot with a labeled oligonucleotide (called TELO) consisting of five tandem copies of the consensus telomere hexamer (Elliott and Yen, 1991). To date, strain-specific telomeric DdeI RFLPs have allowed the inclusion of telomeres from six mouse chromosomes as segregating markers in linkage studies (Eicher and Shown, 1993; Ceci et al., 1994). More recently, another repeat sequence has been identified with a subtelomeric position in all mouse chromosomes (Broccoli et al., 1992). In the future, it may be possible to develop analogous strategies for mapping telomeres with this subtelomeric repeat as well.
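The way a single G-to-C substitution creates a DdeI site can be checked with a small Python sketch. The sequences below are short illustrative stand-ins for a telomere (four repeats rather than thousands), and the function name `ddei_sites` is hypothetical. Only the recognition sequence CTNAG and the repeat unit TTAGGG come from the text.

```python
import re

# DdeI recognition sequence CTNAG, where N is any nucleotide.
DDEI_SITE = re.compile(r"CT[ACGT]AG")


def ddei_sites(sequence):
    """Return the 0-based start position of every DdeI site in the sequence."""
    return [match.start() for match in DDEI_SITE.finditer(sequence)]


# Four consensus repeats: the hexamer contains no C, so no site can occur.
consensus = "TTAGGG" * 4

# Same stretch with a G-to-C change at the sixth position of the second repeat.
variant = "TTAGGG" + "TTAGGC" + "TTAGGG" * 2

print(ddei_sites(consensus))  # []
print(ddei_sites(variant))    # [11]
```

The site found in the variant spans the last base of the altered repeat and the first four bases of the next one, which is the overlap described above.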

9.1.2.2 Centromere mapping with Robertsonian chromosomes

Unfortunately, the satellite sequences present within all mouse centromeres are not amenable to the same type of mapping strategy just described. The problem is that each centromere contains about eight megabases of satellite sequences (section 5.3.4), which is about 400 times larger than a telomere. Consequently, base substitutions away from the consensus satellite sequence will be much more numerous; this will lead to whole genome Southern blot patterns, with any restriction enzyme, that are unresolvable smears.

So, how does one go about placing centromeres onto a linkage map? One approach is to mark the centromeres of individual homologs with a Robertsonian fusion (see section 5.2). If a test animal is heterozygous for a particular Robertsonian chromosome, the segregation of the fused centromere can be followed in each offspring through karyotypic analysis. If the Robertsonian chromosome carries distinguishable alleles at linked loci, the recombination distance between the centromere and these linked loci can be determined by DNA marker typing. Unfortunately, this approach is complicated by the finding that local recombination is suppressed in animals heterozygous for many Robertsonian chromosomes due to minor structural differences that interfere with meiotic pairing (Davisson and Akeson, 1993). Thus, the distance between the centromere and the nearest genetic locus is likely to be underestimated by this method.

9.1.2.3 Centromere mapping through secondary oocytes

A second approach to determining distances between centromeres and linked markers is based on the genetic analysis of large numbers of individual "secondary oocytes" which are the products of the first meiotic division. As shown in figure 9.1, sister chromatids remain together in the same nucleus after the first meiotic division. Thus, in the absence of crossing over, the secondary oocyte will receive one complete parental homolog or the other, and would appear "homozygous" for all markers upon genetic analysis. However, if crossing over does occur, the oocyte will receive both parental alleles at all loci on the telomeric side of the crossover event. Thus, all telomeric-side loci that were heterozygous in the parent will also appear heterozygous in the oocyte, but all centromeric-side loci will remain homozygous. The fraction of individual oocytes that are heterozygous for a particular genetic marker will be twice the linkage distance that separates that marker from the centromere since only half of the haploid gametes generated from a double allele oocyte will actually carry the recombinant chromatid.
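As a worked illustration of this relationship, the short Python sketch below converts a count of heterozygous oocytes into a centromere-marker distance. The function name and the counts are hypothetical, and the calculation assumes, as the paragraph above does, that a heterozygous oocyte reflects a single crossover between the centromere and the marker.

```python
def centromere_distance_cm(heterozygous_oocytes, total_oocytes):
    """Estimate the centromere-marker distance in centimorgans.

    The heterozygous fraction is twice the recombination fraction, so the
    distance is half that fraction, expressed as a percentage.
    """
    if total_oocytes <= 0:
        raise ValueError("at least one oocyte must be typed")
    if not 0 <= heterozygous_oocytes <= total_oocytes:
        raise ValueError("heterozygous count must lie between 0 and the total")
    return 100 * heterozygous_oocytes / (2 * total_oocytes)


# Invented example: 30 of 200 secondary oocytes are heterozygous at the marker.
print(centromere_distance_cm(30, 200))  # 7.5
```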

How does one go about determining the individual genotypes of large numbers of secondary oocytes? There are two basic protocols. The first to be developed was based on the clonal amplification of secondary oocytes within the form of ovarian teratomas (Eicher, 1978). Ovarian teratomas result from the parthenogenetic development of secondary oocytes into disorganized tumors that contain many different cell types. The inbred LT/Sv strain of mice undergoes spontaneous ovarian teratoma formation at a very high rate. This inbred strain in-and-of-itself is not useful for oocyte-based linkage analysis since it is homozygous at all loci, but it is possible to construct congenic animals that are heterozygous for particular marker loci within an overall LT/Sv genetic background. In the cases reported, these congenic animals retain the high rate of teratoma formation associated with the parental LT/Sv strain (Eppig and Eicher, 1983; Artzt et al., 1987; Eppig and Eicher, 1988). This approach is tedious in that a different congenic line has to be developed to map centromeres on each chromosome, but there is every reason to believe that the results obtained are an accurate measure of centromere-marker linkage distances in female mice.

An alternative protocol for genotyping oocytes is based on DNA amplification (by PCR) rather than cellular amplification. The main advantage to this approach is that genotyping can be performed on oocytes derived from any heterozygous female (Cui et al., 1992). Thus, in theory, this approach could be used to position the centromere relative to any marker on any chromosome. However, in practice, PCR amplification from single cells is difficult, and there is a high potential for artifactual results — such as amplification from one DNA molecule but not its homolog.

9.1.2.4 Centromere mapping by in situ hybridization or Southern blots

A third approach to positioning centromeres on linkage maps is based on direct cytological analysis. This approach is possible because of the divergence in centromeric satellite DNA sequences that has occurred since the separation of M. musculus and M. spretus from a common ancestor ~3 million years ago (see section 5.3 and figure 2.2). In particular, the major satellite sequence in M. musculus is composed of a 234 bp repeat unit that is present in 700,000 copies distributed among all the centromeres. This same 234 bp repeat unit is only present in 25,000 copies spread among the centromeres in M. spretus (Matsuda and Chapman, 1991). The 28-fold differential in copy number can be exploited with the technique of in situ hybridization to readily distinguish the segregation of M. musculus centromeres from M. spretus centromeres in the offspring of an interspecific backcross. This approach has now been used to anchor all of the mouse chromosomes at their centromeric ends (Ceci et al., 1994). The only caveat to mention is the possibility that interspecific hybrids have a distorted recombination frequency in the vicinity of their centromeres.

A final possibility, that has yet to be validated, is the mapping of centromeres as RFLPs observed on Southern blots in the same manner as described for telomeres in section 9.1.2.1. This approach may be possible with the use of a newly described repeat sequence that appears to be present in reasonable copy numbers adjacent to the centromeres of nearly every mouse chromosome (Broccoli et al., 1992).

9.1.3 Statistical treatment of linkage data

9.1.3.1 Testing the null hypothesis

Let us assume that two inbred strains of mice (B6 and C3H for example) carry distinguishable alleles (symbolized by b and c respectively) at each of two fictitious loci Xy1 and Gh3 as shown in figure 9.2. An F1 hybrid between B6 and C3H will be heterozygous at each locus with a genotype of Xy1c/Xy1b, Gh3c/Gh3b. If these two loci are linked on a single chromosome, the F1 hybrid will have one homolog with the Xy1c and Gh3c alleles, and the other homolog with the Xy1b and Gh3b alleles. By definition, linkage means that the F1 hybrid will produce a greater number of gametes carrying a parental set of alleles (either Xy1b Gh3b or Xy1c Gh3c) than a recombinant set of alleles (either Xy1b Gh3c or Xy1c Gh3b). As discussed at length in section 7.2, the actual distance that separates the two loci will determine the strength of their linkage.

If one could determine the haploid genotype (or haplotype) of each sperm produced by a C3H x B6 hybrid male, one would know for sure whether the two loci in question are linked. But with the typing of a finite number of progeny in an experimental cross, the answer is often not as clear. Let us say that 100 offspring from the F1 hybrid have been typed to test for linkage between Xy1 and Gh3 with the result that 62 carry parental allele combinations and 38 carry non-parental allele combinations. Do these data provide evidence in favor of the hypothesis: "Xy1 and Gh3 are linked"?

Unfortunately, there is a problem with a general hypothesis that states "genes A and B are linked" in that there is no precise prediction of what to expect in terms of data from a breeding experiment. This is because linkage can be very tight so that recombination would be expected rarely, or linkage can be rather loose so that recombination would be expected frequently. Of course, the strength of linkage, if indeed the genes under analysis are linked, is unknown at the outset of the experiment. In contrast, there is a precise prediction of what to expect from the so-called "null hypothesis" of no linkage between genes A and B. The prediction of this null hypothesis is that alleles at different genes will assort independently leading to a 50 : 50 ratio of gametes with parental or recombinant combinations of alleles.

Thus, whenever geneticists wish to determine whether their data provide evidence for linkage (of any degree), what they actually do is ask the following question: are these data significantly different from what one would expect if the two loci were not linked? With this well-defined null hypothesis, it becomes possible to apply a statistical test to determine whether the data actually observed are significantly different from the expected outcome for no linkage. In the example above, with the analysis of 100 offspring, the null hypothesis would lead to a prediction of 50 animals with a parental allele combination and 50 animals with a recombinant allele combination in comparison with the observed results of 62 and 38 respectively. Are these two sets of numbers significantly different from each other? If the answer is yes, this would suggest that the null hypothesis is false and that the two loci are indeed linked. On the other hand, if the observed data are not significantly different from those expected from the null hypothesis, the question of linkage will remain unresolved — the two loci may be unlinked, but it may also be possible that the loci are linked and there are simply not enough data to detect it.

9.1.3.2 A comparison of mouse linkage data and human pedigree data

Before launching into a discussion of the statistical treatment of linkage data, it is important to illuminate a critical difference between linkage analysis in the mouse and in humans. In nearly all cases of linkage analysis in the mouse, the parental combinations of alleles (the so-called phase of linkage) will be known with absolute certainty. In the example above, if we assume that the two loci in question are linked, we know that the Xy1c and Gh3c alleles will be present on one homolog and the Xy1b and Gh3b alleles will be present on the other homolog in the F1 hybrid, as illustrated in figure 9.2. With this information, we can tell immediately upon typing whether an offspring carries a parental or recombinant combination of alleles.

More often than not, the phase of linkage is not known with certainty in the analysis of human pedigrees. As a consequence, human geneticists are forced to employ more sophisticated statistical tools that evaluate results in light of the probabilities associated with each possible phase relationship for each parent in a pedigree (Elston and Stewart, 1971). These maximum likelihood estimation (MLE) analyses are always performed by computer, and they lead to the determination of LOD score graphs which show the likelihood of linkage between two loci over a range of map distances (Morton, 1955). With most human pedigrees, it is impossible to count the actual number of recombination events that have occurred between two loci, and, as a consequence, it is impossible to determine even a most likely genetic distance separating two loci without the use of a computer. In contrast, all recombination events can be clearly detected in two of the three most common types of mouse breeding protocols (the backcross and RI strains), and with the intercross, all but a small percentage of recombination events can also be distinguished unambiguously (see figure 9.4). With backcross and RI data in particular, linkage distance estimates can be easily determined by hand or with a simple calculator, and confidence limits around these estimates can be extrapolated from sets of tables (such as those in appendix D).

9.1.3.3 The χ² test for backcross data

The standard method for evaluating whether non-Mendelian recombination results are statistically significant is the ‘method of χ².’ Upon calculating a value for χ², one can use a look-up table to determine the likelihood that an observed set of data represents a chance deviation from the values predicted by a particular hypothesis. This determination can lead one to reject or accept the hypothesis that is being tested.

In its most general form, the χ² statistic is defined as follows:

$$\chi^2 = \sum_{i=1}^{n} \frac{(\mathrm{obs}_i - \mathrm{exp}_i)^2}{\mathrm{exp}_i} \qquad (9.1)$$

where there are n potential outcome classes, each of which is associated with an observed number ($\mathrm{obs}_i$) that is experimentally determined and an expected number ($\mathrm{exp}_i$) that is calculated from the hypothesis being tested. It is obvious from a quick examination of equation 9.1 that as the differences between observed and expected values become larger, the calculated value of χ² will also become larger. Thus the χ² value is inversely related to the goodness of fit between the experimental results and the null hypothesis being tested, with a χ² value of zero indicating a perfect fit. As the value of χ² grows larger and larger, the likelihood that the experimental data can be explained by the null hypothesis becomes smaller and smaller.

Consider the case of a backcross with the (B6 X C3H) F1 hybrid described above to analyze the possibility of linkage between the fictitious loci Xy1 and Gh3. In terms of these two loci, the F1 hybrid can produce four types of meiotic products which will engender four experimental outcome classes (figure 9.2). If one makes the a priori assumption that the two parental classes represent different manifestations of the same outcome of no recombination, and the two other classes represent, for all practical purposes, reciprocal products of the same recombination event, then the data can be reduced in complexity to a set of just two outcomes: parental or recombinant. In this case, the χ² statistic becomes:

$$\chi^2 = \frac{(\mathrm{obs}_r - \mathrm{exp}_r)^2}{\mathrm{exp}_r} + \frac{(\mathrm{obs}_p - \mathrm{exp}_p)^2}{\mathrm{exp}_p} \qquad (9.2)$$

where the r subscript indicates recombinant and the p subscript indicates parental.

Whenever the test is used to analyze data obtained from a backcross and the null hypothesis is one of no linkage, a further simplification of equation 9.2 can be accomplished. In this case, the expected values for parental and recombinant classes will both be equivalent to half the total number (N) of offspring typed (which is the sum of the two observed values). Furthermore, the two observed values will both differ from the expected value by the same absolute number, and the square of each difference will yield the same positive value. Thus, the two terms in equation 9.2 can be combined to form:

$$\chi^2 = \frac{2\,(\mathrm{obs}_r - \mathrm{exp})^2}{\mathrm{exp}} \qquad (9.3)$$

Equation 9.3 can be simplified even further by substituting each appearance of $\mathrm{exp}$ with the equivalent expression $(\mathrm{obs}_p + \mathrm{obs}_r)/2$. The form of the equation that is so derived contains only the two experimentally obtained values as variables:

$$\chi^2 = \frac{(\mathrm{obs}_p - \mathrm{obs}_r)^2}{\mathrm{obs}_p + \mathrm{obs}_r} \qquad (9.4)$$

In plain English, equation 9.4 can be read as "square the difference, divide by the sum", and this simple calculation can often be performed through mental calculations alone.

Now we can return to our example from above with 62 parental and 38 recombinant allele combinations and use equation 9.4 to determine the appropriate χ² value. The difference between the numbers in the two observation classes (62 − 38 = 24) is squared to yield 576, and this value is divided by the total size of the sampled population (100) to yield 5.76.
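The same arithmetic can be reproduced with a minimal Python sketch of the "square the difference, divide by the sum" rule. The function names are hypothetical. The P value helper is an addition not found in the text: it uses the standard closed form for the upper tail of a χ² distribution with one degree of freedom, rather than a look-up table.

```python
import math


def chi_square_backcross(parental, recombinant):
    """Chi-square for backcross data against the null hypothesis of no linkage."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("no offspring were typed")
    # Square the difference, divide by the sum.
    return (parental - recombinant) ** 2 / total


def p_value_one_df(chi_square):
    """Upper-tail probability of a chi-square value with one degree of freedom."""
    return math.erfc(math.sqrt(chi_square / 2))


chi_square = chi_square_backcross(62, 38)
print(chi_square)                            # 5.76
print(round(p_value_one_df(chi_square), 3))  # 0.016
```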

One more piece of information is required before it is possible to translate a χ² value into a measurement of significance: the number of "degrees of freedom" (df) associated with the particular experimental design. The "degrees of freedom" is always one less than the total number of potential outcome classes (df = n − 1). The rationale for this definition is that it is always possible to determine the number of events that have occurred in any one class by subtracting the sum of the events in all other classes from the total size of the sample set. In the backcross example under discussion, we have defined two potential outcome classes: recombinant and parental. Knowing the number in either class, along with the total sample size, provides the number in the other class. Thus, the number of degrees of freedom in this case is one.

With a χ² value and the number of degrees of freedom in hand, one can proceed to a χ² probability look-up table such as the one presented in Table 9.2. This table shows the χ² values that are associated with different "P values". A P value is a measure of the probability with which a particular data set, or one even more extreme, would have occurred just by chance if the null hypothesis were indeed true. To obtain a P value for the data set in the example under discussion, we would look across the row associated with one degree of freedom to find the largest χ² value that is still less than the one obtained experimentally. In this case, this procedure yields the χ² value of 3.8. Looking up the column from this χ² value, we obtain a P value of 0.05. We have now reached the final goal of our statistical test for significance.

In this hypothetical example, our statistical analysis indicates that the data obtained would be expected to occur with a frequency of less than 5% if the two loci were not linked. But, is this result significant enough to prove linkage? To answer this question, it is very important to understand exactly what it is that the χ² test and its associated P value do and what they do not do. The outcome of a χ² test cannot prove linkage or the absence thereof. It just provides one with a quantitative measure of significance. What is a significant result? Traditionally, scientists have chosen a P value of 0.05 as an arbitrary cutoff. But with this choice, one will conclude falsely that linkage exists in one of every twenty experiments conducted on loci that are, in fact, not linked! As discussed below in section 9.1.3.6, the interpretation of a χ² value in modern genetic experiments that look simultaneously for linkage between a test locus and large numbers of genetic markers is subject to further restrictions that result from the application of Bayes’ theorem.

9.1.3.4 The χ² test for intercross data

It is instructive to consider the application of the χ² test to a breeding protocol with more than two potential outcome classes. The most relevant example of this type is the intercross between two F1 hybrid animals that are identically heterozygous at two loci with a genotype of A/a, B/b. Figure 9.4 illustrates the different types of F2 offspring genotypes that are possible in the form of a Punnett Square. In the absence of linkage between A and B, one would expect each of the sixteen squares shown to be represented in equal proportions among the F2 progeny. If one compares the actual genotypes present in each square, one finds that there is some redundancy with only nine different genotypes in total. These are as follows (with their relative occurrences in the Punnett Square shown in parentheses): A/A, B/B (1); a/a, b/b (1); A/a, B/b (4); A/A, B/b (2); A/a, B/B (2); A/a, b/b (2); a/a, B/b (2); A/A, b/b (1); and a/a, B/B (1). But, as in the case of the backcross, these classes are not really independent of each other. In particular, two classes (A/A, B/B and a/a, b/b) result from the transmission of parental allele combinations from both F1 parents (zero recombinants in figure 9.4); four classes (A/A, B/b; A/a, b/b; a/a, B/b; and A/a, B/B) result from a single recombination event in one parent or the other; two classes (A/A, b/b and a/a, B/B) result from two recombination events; and the final class (A/a, B/b) is ambiguous and could result from either no recombination in either parent or recombination events in both parents. Thus, the nine genotypic outcomes from the double heterozygous intercross can be combined into four truly independent genotypic classes.

By adding up the number of squares included within any one class and dividing by the total (16), one obtains the fraction of offspring expected in each class under the null hypothesis: zero recombinants, 1/8; single recombinants, 1/2; double recombinants, 1/8; ambiguous (zero or two recombinants), 1/4. As discussed above, four outcome classes yield three degrees of freedom.

With this information, it becomes possible to set up a χ² test to evaluate the evidence for linkage between two segregating loci typed among the progeny of an (F1 X F1) intercross. The special χ² equation for the intercross takes the following form:

$$\chi^2 = \frac{(\mathrm{obs}_0 - N/8)^2}{N/8} + \frac{(\mathrm{obs}_1 - N/2)^2}{N/2} + \frac{(\mathrm{obs}_2 - N/8)^2}{N/8} + \frac{(\mathrm{obs}_a - N/4)^2}{N/4} \qquad (9.5)$$

where the subscript in each observational class indicates the number of recombination events ($\mathrm{obs}_a$ is the ambiguous class) and N is the total number of F2 progeny typed. A comparison of the experimentally determined χ² value with the critical values shown in the df = 3 row of Table 9.2 will allow a determination of a corresponding P value.
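A minimal Python sketch of the intercross test follows. It builds the four expected values from the null-hypothesis fractions given above (1/8, 1/2, 1/8, and 1/4) and sums the usual (observed − expected)²/expected terms. The class labels, function names, and the 80-animal data set are hypothetical. The P value helper is an addition: it uses the standard closed form for the upper tail of a χ² distribution with three degrees of freedom in place of a look-up table.

```python
import math

# Expected fraction of F2 progeny in each class when the two loci are unlinked.
EXPECTED_FRACTIONS = {
    "zero": 1 / 8,       # parental combinations from both F1 parents
    "single": 1 / 2,     # one recombination event in one parent or the other
    "double": 1 / 8,     # recombination events in both parents
    "ambiguous": 1 / 4,  # A/a, B/b: zero or two recombination events
}


def chi_square_intercross(observed):
    """Chi-square for F2 intercross counts against the hypothesis of no linkage."""
    if set(observed) != set(EXPECTED_FRACTIONS):
        raise ValueError("observed counts must cover exactly the four classes")
    total = sum(observed.values())
    if total == 0:
        raise ValueError("no progeny were typed")
    return sum(
        (observed[name] - total * fraction) ** 2 / (total * fraction)
        for name, fraction in EXPECTED_FRACTIONS.items()
    )


def p_value_three_df(chi_square):
    """Upper-tail probability of a chi-square value with three degrees of freedom."""
    half = chi_square / 2
    return math.erfc(math.sqrt(half)) + math.sqrt(2 * chi_square / math.pi) * math.exp(-half)


# Invented counts for 80 F2 animals (expected under no linkage: 10, 40, 10, 20).
counts = {"zero": 22, "single": 30, "double": 2, "ambiguous": 26}
chi_square = chi_square_intercross(counts)
print(round(chi_square, 2))                 # 25.1
print(p_value_three_df(chi_square) < 0.001)  # True
```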

9.1.3.5 Limitations and corrections in the use of the χ² test

The χ² test does have some limitations in usage. First, it cannot be applied to very small data sets, which are defined as those in which 20% or more of the outcome classes have expected values that are less than five (Cochran, 1954). With this rule, it is possible to set minimum sample sizes required for the analysis of backcross data at 10 (so that each expected value of N/2 is at least five), and F1 X F1 intercross data at 40 (so that the smallest expected value, N/8, is at least five). In actuality, a backcross or RI data set must include at least thirteen samples to show significance in the case of no recombinants (based on the Bayesian correction described below). Furthermore, when the sample size is below 40 (in cases of one degree of freedom only), a more accurate P value is obtained if one includes the Yates correction for small numbers. This is accomplished by subtracting 0.5 from the absolute difference between observed and expected values in the numerator of equation 9.3 before that difference is squared.
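The Yates correction can be illustrated with a short Python sketch for two-class backcross data, where the expected value of each class is half the sample size and the two classes contribute equal terms. The function name and the 20-animal example are hypothetical. Clamping the corrected difference at zero is a common convention assumed here, not something specified in the text.

```python
def chi_square_yates(parental, recombinant):
    """Yates-corrected chi-square for backcross data (one degree of freedom)."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("no offspring were typed")
    expected = total / 2
    # Subtract 0.5 from the absolute difference before squaring; the clamp at
    # zero (an assumption) prevents over-correcting a near-perfect fit.
    corrected = max(abs(recombinant - expected) - 0.5, 0.0)
    # Both classes deviate by the same amount, hence the factor of two.
    return 2 * corrected ** 2 / expected


# Invented example: 15 parental and 5 recombinant animals out of 20.
print((15 - 5) ** 2 / 20)       # 5.0  (uncorrected)
print(chi_square_yates(15, 5))  # 4.05 (with the Yates correction)
```

With this small sample, the correction lowers the statistic from 5.0 to 4.05, giving a more conservative P value.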

A final point is that χ² analysis provides a general statistical test for significance that can be used with many different experimental designs and with null hypotheses other than the complete absence of linkage. As long as a null hypothesis can be proposed that leads to a predicted set of values for a defined set of data classes, then one can readily determine the goodness of fit between the null hypothesis and the data that are actually collected.

9.1.3.6 A Bayesian correction for whole genome linkage testing

If one has reason to believe from other results that two loci are just as likely to be linked as not, then the P value obtained with the χ² test can be used directly as an estimate of the probability with which the null hypothesis is likely to be true, and subtracting the P value from the integer one provides a direct estimate of the probability of linkage. However, when a previously unmapped locus is being tested for linkage to a large number of markers across the genome, there is usually no a priori reason to expect linkage between the new locus and any one particular marker locus. If we assume a particular experimental design such that linkage is detectable out to 25 cM on both sides of an unmapped locus and a total genome length of 1500 cM, then the fraction of the genome in linkage with the novel locus will be (25 + 25)/1500 ≈ 0.033. In other words, out of 100 markers distributed randomly across the genome, one would expect only 3.3 to actually be in linkage with any particular test locus. But, if one accepts a P value of 0.05 as providing evidence for linkage, then 5% of the unlinked 97 loci (an additional ~5 loci) will be falsely considered linked according to this statistical test. As a consequence, the expected number of false positives (five) is larger than the expected number of truly linked loci (3.3). Thus, of the 8.3 positive markers expected, only 3.3 would be linked, and this means that a P value of 0.05 has only provided a probability of linkage of 40%. This situation is clearly unacceptable.

The logical approach just discussed is referred to as Bayesian analysis after the statistician who first suggested that prior information on the likelihood of outcomes be included in calculations of probabilities. One can generalize from the example given to obtain a Bayesian equation for converting any P value obtained by χ² analysis of recombination data into an actual estimate of the probability of linkage:

$$\text{Probability of linkage} = \frac{f_{swept}}{f_{swept} + P\,(1 - f_{swept})} \qquad (9.6)$$

where P is the P value obtained by χ² analysis and $f_{swept}$ is the fraction of the genome over which linkage can be detected based on the power of the genetic approach used. Solutions to equation 9.6 for some critical P values and genomic distances are given in Table 9.1. Of interest are the P values required to provide evidence for linkage with 95% probability. So long as the experimental design allows detection of linkage out to 15 cM, one can use a cutoff P value of 0.001 as evidence for linkage between any two loci. In accepting linkage at P < 0.001, one is actually setting a limit for accepting less than one false positive result for every 20 true positive results. Later in this chapter, the Bayesian approach is used to calculate cutoff values for the demonstration of linkage with 95% probability in the case of RI strain data (figure 9.5) and backcross data (figure 9.13).
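The arithmetic of the worked example (3.3 truly linked markers among 8.3 expected positives) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every marker inside the swept interval is assumed to be detected, as in the example, the genome length defaults to the 1500 cM used in the text, and the function name and parameters are hypothetical rather than taken from any published program.

```python
def probability_of_linkage(p_value, swept_radius_cM, genome_length_cM=1500.0):
    """Posterior probability that a 'significant' marker is truly linked.

    Assumption: all markers within the swept interval score as positive,
    so expected true positives are proportional to f_swept, and expected
    false positives are proportional to p_value * (1 - f_swept).
    """
    # Fraction of the genome swept on both sides of the test locus.
    f_swept = 2.0 * swept_radius_cM / genome_length_cM
    true_positives = f_swept
    false_positives = p_value * (1.0 - f_swept)
    return true_positives / (true_positives + false_positives)


# Worked example from the text: P = 0.05, linkage detectable out to 25 cM.
print(round(probability_of_linkage(0.05, 25), 3))

# Stricter cutoff discussed in the text: P = 0.001, swept radius 15 cM.
print(round(probability_of_linkage(0.001, 15), 3))
```

With these inputs the first call gives a value close to the 40% quoted above, and the second exceeds 95%, in line with the suggested cutoff of P < 0.001.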

9.2 Recombinant inbred strains

9.2.1 Overview

9.2.1.1 Conceptualization and history of RI strains

Although a handful of isolated recombinant inbred (RI) mouse strains were developed during the 1940s and 1950s, the importance of these genetic constructs was not appreciated by any scientist prior to Dr. Donald Bailey who first conceptualized their potential utility for linkage analysis while working at the National Institutes of Health in 1959 (Bailey, 1971; Bailey, 1981). Bailey realized that established sets of RI strains could provide an efficient alternative to the lengthy process of classical breeding analysis that was required at that time to map any newly uncovered locus. Bailey moved with his first set of eight RI strains — the original CXB set — to the Jackson Laboratory in 1967. In 1969, Dr. Benjamin Taylor, also at the Jackson Laboratory, took up the cause of the RI approach and set about developing, from several different pairs of progenitor parents, large sets of RI strains that are now standards for the mouse genetics community including BXD, BXH, AKXL and AKXD (Taylor, 1978 and Table 9.2). These standard RI sets, and others listed in Table 9.3 under JAX availability, can be purchased directly — in the form of either animals or DNA — from the Jackson Laboratory. Today, recombinant inbred strains represent a critical tool in the arsenal used to map mouse genes and the RI approach has made its way into the study of other experimental organisms as well such as the plant Arabidopsis (Lister and Dean, 1993).

9.2.1.2 Construction and naming of RI strains

The construction of a set of RI strains is quite simple in theory and is illustrated in figure 9.3. One begins with an outcross between two well-established highly inbred strains of mice, such as B6 and DBA in the example shown. These are considered the progenitor strains. The F1 progeny from this cross are all identical, and thus in genetic terms, they are all interchangeable. F1 hybrid animals are bred to each other to produce a large set of F2 animals. At this generation, siblings and cousins are no longer identical because of the segregation of B6 and DBA alleles from the heterozygous F1 parents. As illustrated in figure 9.3 for a single pair of homologs, each of the F2 animals will have a unique genotype with some loci homozygous for the B6 allele, some homozygous for the DBA allele, and some heterozygous with both alleles. At this stage, pairs of F2 animals are chosen at random to serve as the founders for new inbred strains of mice. The offspring from each F2 founder pair are maintained separately from all other offspring, and just two are chosen randomly for brother-sister mating to produce the next generation. The same process is repeated at each subsequent generation until at least 20 sequential rounds of strict brother-sister matings have been completed and a new inbred strain with special properties is established.

Each of the new inbred strains produced according to this breeding scheme is called a "recombinant inbred" or "RI" strain. Different RI strains produced from the same pair of progenitor strains are considered members of the same "RI set." Each RI strain is named with an uninterrupted string of uppercase letters and numbers that can be broken down into the following components: (1) a shortened version of the maternal progenitor strain name, (2) the uppercase letter ‘X,’ (3) a shortened version of the paternal progenitor strain name, and (4) a hyphen and a number to distinguish each strain from all other strains in the same RI set. The first three components of an RI strain name are also used alone as the name for the corresponding RI set.

Table 9.3 lists the 17 RI sets that have been used to a significant degree by members of the mouse genetics community (R.W. Elliott, personal communication). The most readily available of these are the ones that can be purchased directly from the Jackson Laboratory (JAX) in the form of live animals or DNA — fourteen RI sets with a total of 187 strains. The best-known and most well-characterized RI set is BXD with 26 extant strains formed from an initial cross between B6 females and DBA/2J males. Individual strains within this set are called BXD-1, BXD-2, BXD-5 and so on. (The BXD-3, BXD-4 and other missing strains are now extinct and their numbers have been retired.) By August of 1993, 818 loci had been typed on the BXD set. Unfortunately, as discussed earlier in this book (sections 2.3.4 and 3.2.2), the two progenitors of the BXD set — B6 and DBA — are related to each other through the common group of founder animals used to develop most of the classical inbred strains at the beginning of the twentieth century. As a consequence, a significant number of loci will show no variation between the two progenitors even with highly sensitive PCR protocols that detect SSCPs and microsatellites (see section 8.3).

More recently, two progenitor mouse strains — A/J and C57BL/6J — known to differ in susceptibility to over 30 infectious or chronic diseases were used to establish a new pair of RI sets — AXB and BXA — based on reciprocal crosses (Marshall et al., 1992). Genetic surveys suggest that these progenitors are much less related to each other than B6 is to DBA, and with 31 genetically-validated extant strains available from the Jackson Laboratory, this paired-set is the largest and likely to be the most useful for both linkage analysis and the study of complex traits that differ between the two progenitors.

At any time, it is possible to add to the numbers of strains in a particular RI set by starting from scratch with a new cross between the original progenitor strains and following it through twenty generations of inbreeding. In this manner, J. Hilgers at the Netherlands Cancer Institute in Amsterdam and L. Mobraaten at the Jackson Laboratory have increased the size of the original CXB RI set with five and six new strains respectively. However, as discussed later in this section, the results obtained with different RI sets can be readily combined to provide increased mapping resolution and sensitivity. Thus, with over one hundred and fifty well-characterized RI strains already available, and with the fact that RI strain development takes at least four years, there does not seem to be any real need to further increase the number in any particular set.

9.2.1.3 The special properties of RI strains

Like all inbred strains, RI strains are fixed to homozygosity at essentially all loci. Thus, the genomic constitution of each strain can be maintained indefinitely by continued brother-sister matings, and the strain can be expanded into as many animals as required at any time. However, unlike the classical inbred strains, the genotype of an RI strain is greatly circumscribed. First, there are only two choices for the allele that can be present at each locus: thus, for every locus in every strain of the BXD RI set, only the B6 or the DBA allele can be present. Second, because there is only a limited number of opportunities for recombination to occur between the two sets of progenitor chromosomes before homozygosity sets in, complete homogenization of the genome cannot take place. An illustration of this principle can be seen in figure 9.3. In this hypothetical situation, one can see genomic regions that are already frozen at the outset of the F2 intercross in two of the three incipient RI strains. The two fixed regions are circled and (in this illustrative example) are both homozygous for B6 genomic material in the two F2 parents that will act as founders for the new BXD-102 and BXD-103 strains. At each subsequent generation, additional regions will become frozen in either a B6 or a DBA state. After 20 generations of inbreeding, each RI strain will be represented by a group of animals that will all carry the same genomic tapestry with random patches from each of the two progenitors, as illustrated in the final row of figure 9.3.

It is this homozygous patchwork genomic structure that is the key to the power of the RI strains. This is because the boundaries that separate all of these genomic patches represent individual recombination events that are also permanently frozen. Every RI strain has different recombination sites distributed randomly throughout its genome. Thus, a set of RI strains can be used in essentially the same manner as the offspring from a mapping cross to obtain information on linkage and map distances. Like other mapping panels, it is only necessary to type each RI genotype once for a particular locus, and the information obtained from typing different loci is cumulative. The major difference, of course, is that the particular genotype present within each RI strain can be propagated indefinitely whereas different offspring from a mapping cross will all have unique genotypes and finite life spans. Thus, there is no limit to the number of loci that can be typed within an RI strain. Furthermore, in the case of complex phenotypic variation, one can actually sample the same genotype multiple times to demonstrate instances of incomplete penetrance or variable expressivity as discussed in section 9.2.5.

9.2.2 Using RI strains to determine linkage

9.2.2.1 Strain distribution patterns

The major use of RI strains by mouse geneticists today is as a tool to determine linkage and map positions for newly derived DNA clones. As the first step in this process, an investigator should survey the progenitor strains for all the most useful RI sets to determine which of these sets can be typed for the presence of alternative alleles. The strains to be surveyed should include AKR/J, DBA/2J, C57L/J, A/J, C57BL/6J, C3H/HeJ, BALB/cJ, NZB/BlNJ, SM/J and SWR/J (see Table 9.3). If one is trying to map a newly cloned gene, it is possible to start a polymorphism search based on the detection of RFLPs among DNA samples digested with one of several different enzymes (section 8.2). However, as discussed at length in section 8.3, one is much more likely to uncover polymorphisms with a PCR-based protocol like SSCP.

Once alternative alleles at a locus have been distinguished for the progenitors of any RI set, one can proceed to type all of the strains in that set. The information obtained from such a single locus analysis will represent the Strain Distribution Pattern, or SDP, for that particular locus. An isolated SDP in and of itself is usually not informative. One would expect approximately half of the strains typed to carry each of the progenitor alleles at random. But, with a new SDP in hand, it becomes possible to search for linkage with each of the other loci previously typed in the same RI set or group of sets. This is most easily accomplished with a computer program such as Map Manager that compares the new SDP to each previously-determined SDP in the database, one-by-one, and applies a statistical test for evidence against or in favor of linkage (Manly, 1993 and see appendix B for further information).

9.2.2.2 Concordance and discordance

The result of each pairwise comparison of SDPs is expressed in terms of the degree, or level, of concordance and discordance. When a particular RI strain has alleles from the same progenitor at two defined loci, the loci are considered to be "concordant" within that strain. When the alleles at the two loci come from the two different progenitors, they are considered to be "discordant". The probability of discordance is a function of the linkage distance that separates the two loci under analysis. This is easy to understand in terms of the likelihood that two loci will be retained, by chance, within the same genomic patch as illustrated in figure 9.3. At one extreme, unlinked loci — which can’t possibly lie in the same genomic patch — will be just as likely to have alleles from the same progenitor, by chance, as from the two different progenitors. Thus, one would predict that ~50% of the strains within an RI set will show concordance for a pair of unlinked loci. At the other extreme, loci that are very closely linked will always be in the same genomic patch, which is equivalent to saying that 100% of RI strains will show concordance (or 0% will show discordance). Between these two extremes will be loci that are linked, but less closely so. As the distance between two loci increases, the probability of discordance will increase in a calculable manner from 0% up to 50%.

Whenever one accumulates data on multiple RI strains, it is useful to express this information in terms of concordance and discordance rather than in terms of actual genotypes. The terms N and i are used, respectively, to denote the total number of strains typed and the number of discordant strains observed. The fraction $\hat{R} = i/N$ is used to denote the observed level of discordance. With the use of these terms, one can combine the data obtained from all sets of RI strains that show variation between progenitors for both loci under analysis. This can be accomplished even if different allele pairs are present in different RI sets. For example, at the H-2K locus, B6 has a b allele, DBA has a d allele, and AKR/J and A/J both have a k allele; even so, one can still combine H-2K data obtained from the BXD, AKXD, and AXB/BXA RI sets. If these three RI sets also show progenitor variation at another locus thought to be linked to H-2K, then one can count up the number of strains discordant between these two loci (i) and divide by the total number of strains (N = 82) to obtain the observed discordant fraction $\hat{R}$. When strains from multiple RI sets are combined for analysis in this manner, they will be referred to as an "RI group." Increasing the size of the RI group can have dramatic effects on the sensitivity and resolution with which it is possible to determine linkage and map distances, as described in the following subsections and depicted in figure 9.5 and Tables D1 and D2 in the appendix.
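The bookkeeping behind N, i, and the observed discordant fraction can be illustrated with a short Python sketch. The strain genotypes below are invented for illustration only, and the function name is hypothetical. Each SDP is stored as a mapping from strain name to the progenitor of origin, so that strains from different RI sets can be pooled into one RI group even when the allele names differ between sets.

```python
def discordance(sdp_a, sdp_b):
    """Compare two SDPs and return (N, i, observed discordant fraction).

    Each SDP maps a strain name to a label for the progenitor whose allele
    the strain carries. Only strains typed at both loci are counted.
    """
    shared = [strain for strain in sdp_a if strain in sdp_b]
    n = len(shared)
    # A strain is discordant when the two loci derive from different progenitors.
    i = sum(1 for strain in shared if sdp_a[strain] != sdp_b[strain])
    fraction = i / n if n else float("nan")
    return n, i, fraction


# Hypothetical data pooled from two RI sets (BXD and AKXD).
locus_1 = {"BXD-1": "B6", "BXD-2": "DBA", "BXD-5": "B6",
           "AKXD-1": "AKR", "AKXD-2": "DBA"}
locus_2 = {"BXD-1": "B6", "BXD-2": "B6", "BXD-5": "B6",
           "AKXD-1": "AKR", "AKXD-2": "DBA", "AKXD-3": "DBA"}

print(discordance(locus_1, locus_2))  # (5, 1, 0.2)
```

Strain AKXD-3 is ignored because it was typed at only one of the two loci; only BXD-2 is discordant among the five strains typed at both.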

9.2.2.3 Demonstrating linkage: statistical approaches

What degree of concordance between two SDPs is required to demonstrate linkage? The problem of distinguishing a chance fluctuation above the 50% concordance expected with unlinked loci from a significant departure indicative of linkage was discussed in more general terms earlier in this chapter. Suffice it to say here that prior to 1986, researchers did not fully appreciate the more stringent requirements imposed by the Bayesian statistical approach and were misled by concordance values that passed traditional tests for significance. Mathematical formulations aimed at rectifying this situation were begun by J. Silver (1986) and were supplemented by Neumann (1990; 1991). In his 1990 paper, Neumann published tables with a complete set of maximum discordance values (i) that are allowed for a demonstration of linkage at various levels of significance with data obtained from RI groups up to N = 100 in size. Data from these tables have been extracted into figure 9.5 here.

With the maximum allowable discordant values that have been determined, it is possible to estimate the maximum distance over which linkage between two loci is likely to be demonstrated at a sufficient level of significance with an RI group of a particular size. This is a measure of the "swept radius," which is a concept first developed by Carter and Falconer (1951). The swept radius has been defined as the length of a chromosome interval on either side of a marker locus within which linkage can be detected with a certain level of significance. Although the swept radius was originally defined in terms of map distance, it can be readily converted into a measure of recombination fraction (with the use of an appropriate mapping function as described in section 7.2.2.3), which is more useful for direct analysis of raw data. In this way, the swept radius can be viewed as a boundary value for the recombination fraction. If the observed rate of recombination between two loci is less than the swept radius, linkage is demonstrated at a level of significance equal to or greater than the cutoff value chosen. If the observed rate of recombination is greater than the swept radius, linkage cannot be demonstrated with the available data.

The maximum discordance values allowed for each value of N can be translated into linkage distances (through the use of the Haldane-Waddington equation described in the next section) that describe swept radii at which linkage can be detected with a significance level of 95% or 99%. With just 20 RI strains, one will only be likely to detect linkage with marker loci that are within two centimorgans on either side of the test locus. The swept radius increases steadily as the size of the RI group climbs to 40 strains, where it becomes possible to detect linkage to markers that are within seven-to-eight centimorgans of the test locus. However, even with an RI group of 100 strains, the swept radius is only 13-15 cM. In general, the distance swept by each marker locus in an RI group is only 40% to 45% of the distance swept by each locus in a linkage analysis performed with an equivalent number of backcross offspring (see figure 9.13). This disadvantage is offset by the easy availability of the major RI sets and the ever-accumulating number of marker loci for which SDPs have been determined as discussed further below.

9.2.2.4 Demonstrating linkage: a practical strategy

From the preceding discussion, it should be clear that the chances of success in using RI data to demonstrate linkage for a test locus increase dramatically with both the number of strains analyzed and the number of evenly-distributed SDPs that are already present in the database. As of 1993, several RI sets had been typed at over 200 loci (Table 9.3), which, if randomly distributed, would fall on the linkage map at average distances of 7 cM or less from each other. The BXD set alone has been typed for over 800 loci. Even though these marker loci are not randomly distributed, their overlapping ‘swept diameters’ of coverage are sufficient to map most new loci that are typed among all members of this set. Furthermore, RI mapping panels become ever-more efficient at detecting linkage as each new SDP is added to the database. At some point in the near future, it is likely that SDPs will be determined for all 26 BXD strains at marker loci distributed across the genome at a maximum inter-locus distance of five centimorgans. At this point, every new test locus of interest will have to lie within 2.5 cM of a previously-typed marker locus, and thus, by simply typing the 26 BXD strains, one will be able to determine a map position with essentially 100% probability.

Until the scenario just described is reached, it is best to maximize one’s chances of demonstrating linkage by generating an SDP for the test locus over the maximal number of RI strains possible. Once data have been obtained and entered into a computer database, the first attempt to demonstrate linkage should be pursued at the highest stringency possible (equivalent to a Bayesian probability level of 99% in the Manly (1993) Map Manager program). This will minimize the chance of picking up false linkages; if a positive result is obtained from such an analysis, one can confidently move on to the next task of determining a map position for the locus relative to the linked marker loci, as described in the next subsection. However, if this analysis fails to detect linkage, the stringency can be reduced in small steps of significance level in subsequent runs. A positive result obtained at a lower level of significance should be considered tentative and must be confirmed (or rejected) by incorporating more RI strains into the pairwise comparison between the test locus and the putatively linked marker locus, or through an independent approach such as somatic cell hybrid analysis (section 10.2.3), in situ hybridization (10.2.2), or a backcross or intercross mapping panel (sections 9.3 and 9.4). Independent confirmation of just the chromosomal assignment alone, for example, can serve to increase dramatically the significance level for an RI-based determination of linkage to a particular locus on that chromosome (Neumann, 1990). Furthermore, the failure to detect linkage to many markers distributed over large regions of the genome can provide evidence for the exclusion of the test locus from these other regions. This information, in turn, can be used to increase significance levels in cases where the direct evidence in favor of linkage is somewhat weak (Neumann, 1990).

In some cases, further evidence for, or against, linkage can also be obtained by comparing the SDP of the test locus to the SDPs associated with the two marker loci that putatively flank it on either side (Neumann, 1991). If the positioning of the test locus between the flanking loci is correct, one would expect to see mostly single recombination events among the three loci within individual RI strains. If the association of the test locus with this genomic region is false, one would expect to see a substantial number of double recombination events that separate the test locus from both markers. This type of analysis is most easily visualized in the form of a data matrix, discussed just below and illustrated in figure 9.6. It is important to point out, however, that interference does not operate in the formation of RI strain genotypes since the crossover events that produced each patchwork genome occurred over multiple generations. Thus, true double recombination events over short distances are not strictly forbidden; they’re just less likely.

9.2.3 Using RI strains to determine map order

Once linkage has been demonstrated among three or more loci, one can move on to the next step of determining their order along the chromosome relative to each other. This can be accomplished computationally with a program like Map Manager (Manly, 1993), but it is also possible to carry out this analysis without a dedicated computer program. The first task is to set up an adjustable 2x2 data matrix of the kind illustrated in figure 9.6. This sample data matrix contains a subset of the actual data obtained with Chr 17 loci typed on the 26 RI strains present in the BXD set. Each row represents an independently-determined SDP for the locus indicated at the left. Each column represents the complete genotype determined for an individual RI strain over the ~32 cM region that encompasses the loci shown. It is customary to order loci in the data matrix with the centromeric end of the chromosome at the top and the telomeric end at the bottom, when this information is known from other results.

A new data matrix can be initiated for any set of SDPs that have been shown to form a linkage group. It is also possible to expand an established data matrix of marker loci with the inclusion of new test locus SDPs. At the outset, SDPs can be placed into the data matrix according to a first best guess of their genetic order relative to each other. If data are entered into a computer spreadsheet, one should manually shift the order of SDP-containing rows until an arrangement is found that minimizes the total number of recombination events within the whole data set. In addition, instances of triple recombination events over short distances in individual strains should be eliminated if possible. In this manner, it is often possible to arrive at an undisputed order for an extended series of loci as is the case, for example, with the eight loci from D17Leh119 to D17Leh12 and the eight loci from D17Leh173 to D17Leh23 shown in figure 9.6. The computer program Map Manager will carry out this process automatically and will also allow manual adjustment of order when this is desired.
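The row-shuffling procedure described above, in which the order that minimizes the total number of recombination events is sought, can be sketched in Python for a handful of loci. This is a brute-force illustration only, not the algorithm used by Map Manager: the locus names and SDPs are invented, the function names are hypothetical, and the search over all permutations is practical only for small linkage groups. It also does not apply the additional criterion of avoiding triple recombinants over short distances.

```python
from itertools import permutations


def total_crossovers(order, sdps):
    """Count allele switches between adjacent loci, summed over all strains."""
    n_strains = len(sdps[order[0]])
    return sum(
        sdps[upper][k] != sdps[lower][k]
        for upper, lower in zip(order, order[1:])
        for k in range(n_strains)
    )


def best_order(sdps):
    """Return the locus order with the fewest total crossovers (brute force)."""
    return min(permutations(sdps), key=lambda order: total_crossovers(order, sdps))


# Hypothetical SDPs for four loci typed on eight RI strains.
# Each string position is one strain; B and D denote the progenitor of origin.
sdps = {
    "C": "BDDBBBDD",
    "A": "BDBDBDBD",
    "D": "BDDBDBDB",
    "B": "BDBBBDDD",
}

order = best_order(sdps)
print(order, total_crossovers(order, sdps))
```

Because a linkage group read from either end gives the same crossover count, the result is one of two mirror-image orders; orientation relative to the centromere has to come from other data, as noted in the text.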

There will sometimes be cases where two or more different orders appear equally likely. For example, in figure 9.6, one could remove the Glo1 and Pim1 loci from the position shown and place them instead between H2M2 and D17Leh173. Both placements require an unsightly triple crossover: in the position shown in the figure, this occurs in strain BXD-12; in the new genetic position, this would occur in BXD-27. Both placements also require an equal number of recombination events so it is not possible to choose one order over the other based on the BXD data alone. In this case, other mapping data have been used to confirm the map order shown in the figure.

In general, determining map order with accuracy is increasingly more difficult with linked SDPs that are increasingly more discordant relative to each other. Increased discordance is indicative of increased inter-locus distances which make it more likely that multiple recombination events will have occurred along individual chromosomes; these will complicate the analysis. But, as is the case with all aspects of RI analysis, the more strains that are typed, the more accurate order determinations can become. For example, if SDP data obtained from the AKXL RI set had been considered in conjunction with BXD data, they would have provided unambiguous evidence in favor of the correct placement of Glo1 and Pim1 proximal to Crya1, as shown in figure 9.6.

With the accumulation of large numbers of RI SDPs and a comprehensive two-by-two analysis for linkage, it becomes possible to build linkage maps that span many loci distributed over large chromosomal regions as illustrated in figure 9.6. By looking down any column, one can clearly see the genomic patches that derive from each of the two progenitor strains. At one extreme, four strains — BXD-2, BXD-12, BXD-18 and BXD-25 — appear to have fixed three separate crossover sites in their genomes leading to the presence of four alternating B6 and DBA genomic patches. At the other extreme, 12 strains — including BXD-1, BXD-8, and ten others — appear to have inherited this chromosomal region intact, without recombination, from either the B6 or DBA progenitor. The remaining ten strains have fixed either one or two crossover sites leading to two or three genomic patches respectively.

Through the visualization of RI data in this way, one can pick up suspicious results that may be due to experimental error. For example, the presence of a B6 allele at the Upg-1 locus in BXD-22 requires a double crossover that encompasses only this locus and no others. One would be advised to go back and re-type this locus on a new sample from this RI strain.

By looking horizontally across the data matrix and comparing pairs of SDPs, one can visualize the degree of concordance that exists between nearby loci. At three separate junctures (D17Leh12/Glo-1, Tpx-1/Iapls1-3, and Ckb-rs2/Hprt-rs1), the BXD RI data alone are not sufficient to demonstrate linkage according to the limits shown in figure 9.5; in all these cases, data from other mapping experiments provided evidence of linkage between two or more loci present on opposite sides of these junctures. There are also numerous examples of loci that have the same SDP; as described in the next section, loci with shared SDPs across only 26 typed strains can actually be quite distant from each other in terms of map distance (see figure 9.8).

The data matrix for this portion of Chr 17 also illustrates how the power of the RI approach to linkage determination increases with the number of loci that are typed. For example, the SDPs for D17Leh119 and Hba-ps4 are distinguished by four discordant strains (BXD-2, BXD-18, BXD-21, and BXD-25). Thus, according to figure 9.5, these data alone do not provide sufficient evidence for linkage between these two loci at the 95% significance level. However, when the Plg SDP was added to the database, it provided the required evidence for linkage of D17Leh119 to Hba-ps4 through their common linkage — with 99% significance — to the newly typed locus.

9.2.4 Using RI strains to determine map distances

9.2.4.1 From discordance to linkage distance

As alluded to earlier in this section, when two loci are known to be linked, the level of discordance that is observed between their SDPs can be equated with a mean estimate of the distance that separates them. In fact, this distance estimate can be made even in those cases where the RI data alone are not sufficient to provide evidence for linkage. Thus, RI data are useful for estimating recombination distances between loci that have been linked by non-breeding methods such as physical mapping or cytogenetic analyses. However, it is not sufficient to simply determine the fraction of strains that show discordance and use this directly as an estimate of the recombination fraction. The problem is that during the generation of each RI strain, an average chromosomal region will have multiple opportunities to recombine as it passes through several generations in a heterozygous state (see figure 9.3).

Interestingly, long before the conceptualization of RI strains, Haldane and Waddington (1931) derived a mathematical solution to this problem in the context of determining the probability with which a recombinant genotype would become fixed after successive generations of inbreeding. This solution was formulated in the following equation where r is the probability of recombination in any one gamete and R is the fraction of RI strains that are predicted to be discordant:

$$R = \frac{4r}{1 + 6r} \qquad (9.7)$$

Equation 9.7 illustrates two points. First, the expected fraction of discordant strains is dependent on only a single variable — the probability of recombination between the two loci under analysis. In turn, since interference in any one generation is nearly 100% over the distances analyzed by RI analysis (see section 7.2), the probability of recombination can be converted directly into a centimorgan linkage distance (d) with an r value of 0.01 defined as equivalent to one centimorgan. Thus:

$$R = \frac{4d}{100 + 6d} \qquad (9.8)$$

Second, for values of r that are smaller than 0.01, equation 9.7 can be approximated by the simpler $R \approx 4r$. Thus, a one centimorgan distance becomes amplified into a predicted discordance frequency of ~4%. As Taylor (1978) pointed out, this four-fold amplification can be interpreted to mean that during RI strain development, a locus will be transmitted, on average, through four heterozygous animals (with four chances for a recombination event in its vicinity) before it is fixed to homozygosity.
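The relationship between the per-gamete recombination probability and the fraction of discordant RI strains can be checked with a small Monte Carlo simulation of brother-sister inbreeding at two linked loci. The sketch below is a simplified model under stated assumptions: sexes are ignored, recombination is the same in both parents, each line is inbred until both loci are fixed rather than for exactly 20 generations, and the expected value is taken from the Haldane-Waddington relation R = 4r/(1 + 6r). All names are hypothetical.

```python
import random


def simulate_ri_discordance(r, n_strains=2000, seed=0):
    """Fraction of simulated RI strains fixed for a recombinant two-locus haplotype."""
    rng = random.Random(seed)

    def gamete(parent):
        # Start on one homolog at random; switch homologs with probability r.
        k = rng.randrange(2)
        first_locus = parent[k][0]
        if rng.random() < r:
            k = 1 - k
        return (first_locus, parent[k][1])

    def child(mother, father):
        return (gamete(mother), gamete(father))

    # F1 animals carry one B6-derived and one DBA-derived homolog.
    f1 = (("B", "B"), ("D", "D"))
    discordant = 0
    for _ in range(n_strains):
        # Two F2 siblings found the line; then strict sib mating follows.
        pair = (child(f1, f1), child(f1, f1))
        while len({hap for animal in pair for hap in animal}) > 1:
            pair = (child(*pair), child(*pair))
        fixed = pair[0][0]
        discordant += fixed[0] != fixed[1]
    return discordant / n_strains


r = 0.05
print("simulated:", simulate_ri_discordance(r, seed=1))
print("expected: ", 4 * r / (1 + 6 * r))
```

With a few thousand simulated strains the two printed values should agree to within sampling error, and for small r the simulated fraction approaches the four-fold amplification described above.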

The amplification of the linkage map serves to enhance the usefulness of the RI approach in the analysis of closely linked loci. For example, in a group of 100 RI strains, recombination sites will be distributed at average distances of 0.25 cM, which is four times more highly resolving than that possible with an equivalent number of backcross animals. However, this same amplification has the negative consequence of limiting the usefulness of the RI approach in studying loci that are more distantly linked to each other. For example, at a distance of 25 cM ($r = 0.25$), the predicted discordance level for RI strains would be 40% (R = 0.4), a value which is perilously close to the 50% expected with unlinked loci. As a consequence, the per-locus swept radius obtained with RI strains will be much less than that obtainable with an equal number of backcross offspring.

In most cases of RI analysis, an investigator wants to go from a discordance fraction to an estimate of linkage distance. The experimentally-determined discordance fraction $\hat{R}$ provides an estimate of the true value R, which is based on the actual probability of recombination r. By substituting $\hat{R}$ for R in equation 9.7, one can obtain a corresponding recombination fraction estimate $\hat{r}$. This can be accomplished more easily if equation 9.7 is inverted to yield r as a function of R or, through direct substitution, $\hat{r}$ as a function of $\hat{R}$:

$$\hat{r} = \frac{\hat{R}}{4 - 6\hat{R}} \tag{9.9}$$

Another useful formulation of this same equation allows one to obtain the estimated recombination fraction as a function of the sample size, N, and the number of discordant strains, i:

$$\hat{r} = \frac{i}{4N - 6i} \tag{9.10}$$

Finally, a corresponding linkage distance estimate in centimorgans ($\hat{d}$) can be derived by multiplying the value obtained in equation 9.10 by 100.
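As an alternative to reading values off a graph, the conversion from i and N to a distance estimate can be sketched in a few lines of Python. The sketch assumes that equation 9.10 has the form i/(4N − 6i), which follows from inverting R = 4r/(1 + 6r) with R estimated as i/N; the function names are illustrative, and the refusal to return an estimate above 50% discordance is a choice made here rather than a rule stated in the text.

```python
def ri_recombination_estimate(i, n):
    """Estimated recombination fraction from i discordant strains among
    n typed RI strains, using r = i / (4n - 6i)."""
    if n <= 0 or not 0 <= i <= n:
        raise ValueError("need n > 0 and 0 <= i <= n")
    if 2 * i > n:
        # More than 50% discordance gives no evidence for linkage.
        raise ValueError("discordance above 50%: no evidence for linkage")
    return i / (4.0 * n - 6.0 * i)


def ri_distance_cm(i, n):
    """Linkage distance estimate in centimorgans (100 times the estimate of r)."""
    return 100.0 * ri_recombination_estimate(i, n)


# Estimates for 0, 1, 2, and 4 discordances among 26 RI strains.
for i in (0, 1, 2, 4):
    print(i, round(ri_distance_cm(i, 26), 1))
```

For 26 strains, one, two, and four discordances give estimates of about 1.0, 2.2, and 5.0 cM respectively.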

The graph in figure 9.7 provides a rapid means for determining a linkage distance estimate from values for i and N that are commonly obtained in RI analyses. Just place a ruler over the graph so that it crosses the experimental value for N along the top and bottom axes, then observe the point at which the ruler crosses the curve associated with the experimentally-determined value for i. From this point, look across to the Y axis to read off the linkage distance in centimorgans.

9.2.4.2 How accurate is the linkage distance estimate?

Once a value for linkage distance has been obtained from RI data, the next question an investigator will ask is: how accurate is this value? The answer to this question will be critical to investigators who want to use RI data to evaluate possible relationships between the cloned gene they have just mapped and other previously-mapped loci that are defined strictly in terms of a mutant phenotype. (A detailed discussion of the general strategy used to evaluate such relationships has been left to section 9.3.4 and is illustrated in figure 9.10.)

The accuracy of an experimentally-determined value such as linkage distance can be quantitated in terms of a "confidence interval" which is defined by lower and upper boundaries — called confidence limits. To calculate a confidence interval from a body of data, one must first choose the "confidence coefficient", or level of confidence, that one wishes to attain. The confidence coefficient represents the probability with which the associated interval is likely to contain the value of the true recombination fraction or linkage distance. Each confidence coefficient will produce a different confidence interval.

In general, two particular confidence coefficients are used most often for the evaluation of data obtained from sampling experiments aimed at estimating an absolute real value (like linkage distance). The first is based on the standard error of the mean (s.e.) for a normal (bell-shaped) probability distribution around a mean value $\bar{x}$. When the lower limit of the interval around a normal distribution is set at $\bar{x} - \mathrm{s.e.}$, and the upper limit is set at $\bar{x} + \mathrm{s.e.}$, the corresponding confidence coefficient is always equal to 68%. It is standard practice to display this confidence interval within a single term: $\bar{x} \pm \mathrm{s.e.}$ Twice as often as not, the 68% confidence interval will contain the true value being estimated; this interval provides an investigator with an overall sense of the accuracy of the mean estimate predicted from experimental data.

As discussed at length in appendix D and by other authors (Silver, 1985), data obtained from linkage studies based on small sample sizes and low levels of discordance are not well-approximated by "normal" probability distributions (see figures D.1 and D.2). Unfortunately, the standard deviation does not provide an accurate measure of lower and upper confidence limits in the context of non-normal probability distributions (Moore and McCabe, 1989, p.41). In its place, I have used more appropriate estimates for lower and upper confidence limits associated with an actual 68% confidence interval.

The second confidence interval of critical importance for interpreting experimental data is one that encompasses the range of values likely to contain the actual recombination fraction with a probability of 95%. The more stringent 95% confidence interval is used often to define the generally accepted limits beyond which the actual value of r is unlikely to lie. A discussion of the statistical approach used to determine confidence limits for both RI strain and backcross data is presented in appendix D along with tables of minimum and maximum values for 68% and 95% intervals. By interpolating between the numbers presented in either Table D.1 or Table D.2, one can derive confidence limits for pairs of i and N values generated in an analysis of 20 to 100 RI strains.

9.2.4.3 What is the meaning of 100% concordance?

One special case of RI strain results deserves particular attention — when complete concordance is observed between the SDP patterns obtained for two different loci. In this special case, the Haldane-Waddington formulation (equations 9.8 and 9.9) leads to a linkage distance estimate of zero centimorgans. However, if the two loci under analysis were independently-derived and known not to be identical, an estimate of zero distance clearly makes no sense. Based on intuitive grounds alone, one would expect there to be a strong likelihood that the two loci are actually separated from each other by a significant distance, especially when the total number of RI strains typed is small.

An accurate estimate of the median expected recombination fraction (leading directly to an estimate of the median expected linkage distance) that separates two completely concordant loci can be obtained by determining the midline of the area under the associated probability density function as discussed in appendix D and illustrated in figure D.1. The results of this calculation for the range of 15 to 100 RI strains are presented graphically in figure 9.8 along with upper limits for 68% and 95% confidence intervals.

As an example, consider the implications of these statistical formulations in the case of complete concordance between two SDP patterns for the set of 26 BXD RI strains. The median estimate of linkage distance between two concordant loci is 0.66 cM (or 1.3 megabases based on a conversion of one centimorgan to 2.0 Mb). The 68% confidence interval extends from 0.2 cM to 1.76 cM (or 400 kb to 3.5 megabases), and the 95% confidence interval extends from 0.02 cM to 3.95 cM (or 40 kb to 7.9 megabases); these limits come from the computer program presented in appendix D. These numbers confirm the intuitive suspicion that two concordant loci typed in only a small sample set probably do not map on top of each other.

Although a small group of RI strains is not sufficient to demonstrate close linkage, this conclusion does become more appropriate as the number of concordant RI strains increases. With 100 completely concordant strains, the median estimate of linkage distance is reduced to 0.17 cM or 340 kb, the maximum limit on the 68% interval is reduced to 0.45 cM or 900 kb, and the maximum confidence limit for the 95% interval becomes 0.95 cM or 1.9 megabases.
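The order of magnitude of these figures can be reproduced with a short Python sketch. It rests on an assumption that is not taken from appendix D: a uniform prior on the discordance probability R, so that with zero discordances among N strains the cumulative probability is 1 − (1 − R)^(N+1). The resulting R is then converted to a recombination fraction with r = R/(4 − 6R). The function name is hypothetical, and this simplification is not a substitute for the program in appendix D. It matches the quoted medians and 95% upper limits closely, while its 68% upper limits come out slightly higher than the quoted ones.

```python
def concordant_quantile_cm(n_strains, q):
    """Linkage distance (cM) below which the true value lies with
    probability q, given complete concordance among n_strains RI strains.

    ASSUMPTION (not from appendix D): uniform prior on R, so the
    cumulative probability is 1 - (1 - R)**(n + 1). The truncation of R
    at 0.5 is ignored; its effect is negligible for 15 or more strains.
    """
    big_r = 1.0 - (1.0 - q) ** (1.0 / (n_strains + 1))
    r = big_r / (4.0 - 6.0 * big_r)  # invert R = 4r / (1 + 6r)
    return 100.0 * r


for n in (26, 100):
    median = concordant_quantile_cm(n, 0.5)
    upper95 = concordant_quantile_cm(n, 0.975)
    print(f"N = {n}: median {median:.2f} cM, 95% upper limit {upper95:.2f} cM")
```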

9.2.4.4 A comparison of RI and backcross predictions of linkage

One can gain a better perspective on the accuracy of RI-determined linkage distance estimates by comparing a map obtained with the BXD set of 26 strains to a more accurate map obtained with a set of 374 backcross animals for the same eleven loci distributed over a span of 19 to 31 cM as shown in figure 9.9. There are five instances here in which a single RI discordance has occurred to yield an inter-locus linkage distance estimate of 1.0 cM (Plg–D17Pri1, D17Pri1–Hba-ps4, Pim1–Crya1, Crya1–H2M2, and Tcte1–D17Mit10) with a 68% confidence interval extending from 0.7 to 3.4 cM, and a 95% interval extending from 0.2 to 6.6 cM. In three of these cases, the backcross estimate lies within the 68% confidence interval and in the other two cases, it lies within the 95% interval. There are two instances in which two RI discordances have occurred to yield an inter-locus linkage distance estimate of 2.2 cM (D17Leh119–Plg and H2M2–Tpx1) with a 68% confidence interval extending from 1.4 to 5.3 cM and a 95% interval extending from 0.6 to 9.6 cM. In both cases, the backcross estimate lies within the 68% interval. There are another two instances in which four RI discordances yield an inter-locus linkage distance of 5.0 cM (Hba-ps4–Pim1 and Tpx1–Tcte1) with a 68% confidence interval extending from 3.3 to 9.7 cM, and a 95% interval extending from 1.7 to 17 cM. In one case, the backcross estimate lies within the 68% interval, and in the other case, it lies within the 95% confidence interval. Finally, there is a single instance of complete RI concordance between D17Mit10 and D17Mit6. The backcross estimate of linkage distance between these two loci lies outside the 95% confidence interval (but within the 99% confidence interval according to results obtained from the computer program listed in appendix D).

In summary, backcross estimates lie within the 68% confidence interval of the RI estimates in six (60%) of the ten pairwise comparisons, and within the 95% confidence interval in nine (90%) of the ten. This level of accuracy is about as close as one can get to that predicted from statistical formulations. Furthermore, in the single case where the two mean estimates of linkage distance are significantly different from each other (0.66 cM versus 5.9 cM for D17Mit10–D17Mit6), their associated 95% confidence intervals (0.02 to 3.95 cM and 3.9 to 8.8 cM) do overlap (barely), and the data taken together would suggest that the actual recombination frequency is somewhere between the two extreme mean values.

9.2.5 Using RI strains to dissect complex genetic traits

9.2.5.1 Susceptibility, predisposition, penetrance, and expressivity

When Bailey first conceived of RI strains, it was with the notion that they would be useful for the analysis of the many forms of complex phenotypic variation that distinguish different inbred strains from each other. In the past, the use of RI strains for this purpose had been rather limited because of the absence of an overall framework of genetic markers on which phenotypic differences could be mapped. But at the current time, the availability of highly polymorphic, rapidly-typed DNA markers is allowing the construction of framework maps of marker loci that span essentially whole genomes in each of the major RI sets. These framework maps will finally open up RI strains to the use originally conceived of by Bailey.

There is an enormous reservoir of susceptibility differences to a variety of disease conditions, both chronic and infectious, among the classical inbred strains. For example, the A/J strain is relatively susceptible to various carcinogen-induced cancers (lung adenomas, sarcomas, and colorectal tumors), parasites (Giardia, Trypanosoma, and Plasmodium), bacteria (Listeria and Pseudomonas), viruses (Ectromelia and Herpes), and fungi (Candida albicans), as well as gall stones and teratogen-induced cleft palate; the B6 strain is relatively resistant to all of these conditions (Mu et al., 1993). On the other hand, B6 is relatively susceptible to other parasites, bacteria, viruses, and fungi as well as atherosclerosis, diabetes, and obesity; the A/J strain is relatively resistant to all of these conditions. In all, strain-specific differences in susceptibility to over 30 infectious or chronic diseases have been identified between A/J and B6, and the genetic basis for each can be approached with the use of the combined AXB/BXA set of RI strains. Differences in disease susceptibility exist among all of the traditional inbred strains, and the genes involved in many of these differences can be approached as well with the appropriate RI sets.

With most of the conditions described above, a particular genetic constitution only predisposes an individual to express a disease. This means that some individuals that carry the predisposing genotype will actually not express the disease. The fraction of genotypically identical individuals that express a particular trait defines the penetrance of that trait from that genotype. When a particular genotype guarantees the expression of a phenotype in 100% of the animals that carry it, the phenotype is considered to be completely penetrant. In all other cases, a phenotype is considered to be partially penetrant or incompletely penetrant. For example, a particular substrain of BALB/c mice is predisposed to a particular form of induced cancer known as a plasmacytoma. However, only 60% of these inbred animals actually get the cancer upon induction. Thus, the penetrance of induced-plasmacytoma in these mice is 60%.

The cousin of incomplete penetrance is variable expressivity. Variable expressivity describes the situation in which multiple individuals all express a particular trait, but in a quantitatively distinguishable manner. For example, a tumor may appear at a young age or an old age, and a birth defect such as cleft palate may be more or less severe. Variable expressivity can also be measured for traits that do not show an either/or type of wild-type/mutant variation. For example, there are many strain-specific differences in physiological parameters and behavior that are strictly quantitative. Thus, blood cholesterol levels may vary among different strains as will the average number of pups that a female has in a litter (see Table 4.1).

Both incomplete penetrance and variable expressivity can be caused by genetic as well as non-genetic factors. Inbred strains allow one to clearly distinguish genetic factors since measurements can be made of the mean level of expression or penetrance of a trait in one strain relative to another when populations of both are maintained under identical environmental conditions.

In cases where two strains differ quantitatively in penetrance levels and/or expressivity for a particular trait, it becomes difficult to design traditional breeding crosses that can uncover the loci involved. For example, if strain A shows 20% penetrance for a trait and strain B shows 80% penetrance for the same trait, then its expression in offspring from a cross between the two strains would not provide straightforward information as to which predisposing allele(s) is present. In contrast, each RI strain provides an unlimited number of animals with the same homozygous genotype. Thus, through the analysis of a sufficient number of animals, it becomes possible to quantitate the levels of penetrance and expressivity and associate distinct measurements of mean and standard deviation with each RI genotype. Furthermore, it is just as easy to map recessive traits as dominant traits since RI strains are completely homozygous.

RI strains are also useful in those cases where multiple animals must be sacrificed in order to make a single phenotypic determination. This will be true for certain biochemical assays (although in most cases today, micro-techniques allow analysis on tissues obtained from single animals) and for other assays that require a determination of multiple test points in which each point is a single animal. An example of the latter would be an LD50 determination for a particular toxic chemical.

If every RI strain in a set expresses a trait with essentially the same penetrance and expressivity as one of the two progenitor strains, and approximately half of the RI strains resemble one progenitor and half resemble the other, determining a map position for the responsible locus is no different than that described earlier in the case of DNA marker loci. Data of this type can be viewed as evidence in favor of a single major locus that is responsible for the difference in susceptibility, penetrance, or expressivity between the two progenitor strains. One can simply write out an SDP for the phenotype and then subject this SDP to concordance analysis with the SDPs obtained for all previously typed markers as described in section 9.2.2. Once linkage is demonstrated, gene order and map distances can be determined as described in sections 9.2.3 and 9.2.4.
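The comparison of a phenotype SDP with marker SDPs amounts to counting, strain by strain, how often the two patterns disagree. The following Python sketch illustrates only that counting step. The SDP strings, marker names, and the use of '-' for an untyped strain are hypothetical, and the sketch does not perform the statistical test for linkage described in section 9.2.2.

```python
def count_discordances(sdp_a, sdp_b):
    """Return (discordant, typed): the number of RI strains whose
    progenitor alleles differ between two SDPs, and the number of
    strains typed at both loci. Strains marked '-' are skipped."""
    if len(sdp_a) != len(sdp_b):
        raise ValueError("SDPs must cover the same strains in the same order")
    typed = [(a, b) for a, b in zip(sdp_a, sdp_b) if a != "-" and b != "-"]
    discordant = sum(1 for a, b in typed if a != b)
    return discordant, len(typed)


# Hypothetical SDPs for 12 RI strains: 'B' = B6-like, 'D' = DBA-like.
phenotype_sdp = "BDDBBDBDDBDB"
marker_sdps = {
    "marker_1": "BDDBBDBDDBDB",
    "marker_2": "BDDBDDBDDBDB",
    "marker_3": "DBDBBDDBBDDB",
}

# List markers from most concordant to least concordant.
for name, sdp in sorted(marker_sdps.items(),
                        key=lambda kv: count_discordances(phenotype_sdp, kv[1])):
    i, n = count_discordances(phenotype_sdp, sdp)
    print(f"{name}: {i} discordant of {n} strains")
```

Markers with few discordances are the ones to carry forward into the linkage test and the distance estimate.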

9.2.5.2 Further genetic complexity: polygenic traits

There are two forms of RI strain data that are indicative of a more complex basis of inheritance which may be impossible to resolve using only the RI approach. The first occurs when there is a significant departure from a balanced SDP in that the phenotype expressed by one progenitor strain is found in many more RI strains than the alternative phenotype. Data of this type would suggest that the expression of the rarer phenotype requires the simultaneous presence of two or more genes from the appropriate progenitor. One can calculate the probability of occurrence of a phenotype that requires the action of two or more unlinked loci through the law of the product as $(0.5)^n$ where n is the number of loci required. Thus, if two unlinked B6 loci are both required for susceptibility to a particular viral infection (relative to DBA), only $(0.5)^2$ = 25% of the BXD RI strains would be expected to show susceptibility. Unfortunately, for obvious reasons, unbalanced SDPs cannot be compared directly for linkage relationships with normal single-locus marker SDPs.
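The law of the product used here is simple enough to tabulate. The short Python sketch below shows how quickly the expected number of RI strains carrying the rarer phenotype falls as the number of required unlinked loci grows; the function name is illustrative, and the set size of 26 strains is used only as an example.

```python
def fraction_with_all_loci(n_loci):
    """Expected fraction of RI strains homozygous for the required
    progenitor allele at every one of n_loci unlinked loci: (0.5)**n."""
    return 0.5 ** n_loci


n_strains = 26  # example set size
for n_loci in (1, 2, 3, 4):
    frac = fraction_with_all_loci(n_loci)
    print(f"{n_loci} loci: {frac:.4f} of strains, "
          f"about {frac * n_strains:.1f} of {n_strains}")
```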

The second form of RI data indicative of genetic complexity is the occurrence of strains that show a level of penetrance or expressivity that is significantly different from both of the progenitors. Since every RI strain can be considered homozygous for one progenitor allele or the other at every locus, data of this type will also implicate the action of multiple genes. The simplest explanation for these results is that different combinations of alleles from the two progenitors cause the different levels of phenotypic expression. For example, with the involvement of two loci, X and Y, in the expression of a trait that distinguishes the strains A/J and B6, there will be four relevant genotypes among the AXB/BXA RI strains: $X^A Y^A$, $X^A Y^B$, $X^B Y^A$, and $X^B Y^B$. Two of these genotypic combinations are different from those found in either progenitor, and one or both could be responsible for a novel phenotypic expression.
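The genotype classes in this example can be enumerated mechanically, which becomes useful once more than two loci are considered. The following Python sketch lists every homozygous combination and flags those that match neither progenitor; the function name and the plain-text labels (such as XAYB for the A/J allele at X with the B6 allele at Y) are illustrative only.

```python
from itertools import product


def ri_genotype_classes(loci, progenitors=("A", "B")):
    """List (label, is_parental) for every homozygous multi-locus genotype
    possible among RI strains derived from two progenitors."""
    classes = []
    for alleles in product(progenitors, repeat=len(loci)):
        label = "".join(f"{locus}{allele}" for locus, allele in zip(loci, alleles))
        # A class is parental when all loci carry the same progenitor's allele.
        is_parental = len(set(alleles)) == 1
        classes.append((label, is_parental))
    return classes


for label, is_parental in ri_genotype_classes(("X", "Y")):
    print(label, "parental" if is_parental else "non-parental")
```

With n loci there are 2^n classes, of which all but two are non-parental.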

Many variations upon these examples are possible. Thus, every complex trait will have to be approached independently to formulate a reasonable hypothesis for inheritance. In some cases, RI strains may still provide an appropriate tool for genetic analysis, but in most cases, it will be necessary to move to a different form of analysis that may require the establishment of a new breeding cross as described in section 9.4.

9.3 Interspecific mapping panels

9.3.1 Overview

The "interspecific mapping" approach was conceived of by François Bonhomme (1979) working in Montpellier, France. Bonhomme had discovered that two clearly distinct mouse species — M. musculus and M. spretus — could be bred together in the laboratory to form fertile F1 female hybrids (Bonhomme et al., 1978). The two parents involved in the generation of these F1 animals are so evolutionarily divergent (figure 2.1) that polymorphisms in the form of RFLPs can be readily identified between them with the great majority of mouse DNA probes. Thus, by backcrossing F1 females to one parental strain, it becomes possible to follow the segregation and linkage of almost any group of cloned loci (Avner et al., 1988; Copeland and Jenkins, 1991).

For historical reasons, the M. musculus representative chosen for use in most interspecific crosses has been B6 (Bonhomme et al., 1979; Copeland and Jenkins, 1991; Nadeau et al., 1991), although there is no reason why other traditional inbred strains cannot be used instead (Hammer et al., 1989; Moseley and Seldin, 1989). The initial outcross is always set up between a B6 (or other traditional inbred strain) female and an inbred M. spretus male (written as [B6 X SPRET]); the outcross is carried out in this direction because of the greater fecundity associated with the traditional M. musculus inbred strains relative to M. spretus strains. In the subsequent generation, a backcross is performed between an F1 female (since F1 males are sterile) and either a B6 or M. spretus male; the standard written descriptions of these entire two generation protocols are: [(B6 X SPRET) X SPRET] and [(B6 X SPRET) X B6] respectively.

An "interspecific mapping panel" is typically composed of DNA samples obtained from one hundred to one thousand N2 offspring from this backcross. Aliquots of each sample are digested typically with one restriction enzyme at a time, electrophoresed on gels, and transferred to Southern blots which can be sequentially probed with radioactively-labeled DNA clones. As more and more loci are typed, and as segregation patterns are compared, linkage groups will begin to emerge. As the number of typed markers approaches several hundred, all will begin to coalesce into a series of only 20 linkage groups that each correspond to a single mouse chromosome. (A more detailed discussion of the actual numbers of loci and animals required for linkage determination will be presented in section 9.4.) Obviously, the correct assignment of linkage groups with their associated chromosomes depends upon the incorporation into the mapping panel analysis of previously-assigned anchor loci.

It should be emphasized that each member of an interspecific mapping panel typically survives only in the form of DNA. Thus, the power of these panels is limited to the analysis of cloned loci. To map loci defined solely by a variant phenotype, one would have to choose an alternative approach. In most cases, it will be necessary to set up a new cross from scratch as described in section 9.4.

By the end of 1990, over 600 cloned loci had been typed on the single interspecific mapping panel maintained by Jenkins and Copeland at the Frederick Cancer Research Center (Copeland and Jenkins, 1991). By the end of 1993, the interspecific mapping panels maintained by Jenkins and Copeland (Copeland et al., 1993), as well as those maintained by several other investigators, had all been typed for at least one thousand loci. With 1000 or more loci in a database, one can be virtually assured of finding a correct linkage relationship for any new test locus that is put through an analysis of the same mapping panel.

9.3.2 A comparison: RI strains versus the interspecific cross

Many investigators will want to obtain a high resolution map position for their newly characterized DNA clone without having to set up their own cross, and without having to invest a substantial amount of time, energy, and money. For all such investigators, typing an established mapping panel will be the method of choice. But, which mapping panel should be used? One possibility is to type one or more sets of RI strains, as discussed in the previous section. The second possibility is to type one of the well-established interspecific backcross mapping panels. Each approach has its advantages and disadvantages.

9.3.2.1 Genetic considerations

In terms of ease of polymorphism discovery, the interspecific approach provides a clear advantage over the RI approach. As discussed previously, it is often difficult to uncover RFLPs between the progenitors of RI strains. Furthermore, the identification of an RFLP between one set of RI progenitors is often not useful for the analysis of other RI sets. Thus, even when RFLPs have been uncovered, the total number of RI strains that can be analyzed is often quite limited; it can be as few as 26 and it is rarely more than 80. In contrast, the ease of RFLP identification between the progenitors of the interspecific cross was the main impetus to the initial use of this mapping approach. Furthermore, one need only identify a single type of polymorphism to type the entire interspecific panel.

With the newer PCR-based approaches to polymorphism identification discussed in section 8.3, it is now easier to identify differences between RI progenitor strains. Of course, with these same approaches, polymorphism identification between the interspecific progenitors is even easier still.

In terms of the resolution of the genetic map that is obtained, the interspecific approach has a number of advantages over the RI approach. First, the number of samples in several of the well-characterized interspecific panels ranges from over two hundred to as high as one thousand; a thousand samples provides an average map resolution of 0.1 cM. In contrast, the total number of well-characterized RI strains is less than 140. Second, interference acts to eliminate nearby double crossover events in the interspecific backcross, and thus gene order can be determined with very high levels of confidence for any linked loci. In contrast, crossing over in multiple generations during the creation of RI strains eliminates the effect of interference and this can sometimes cause ambiguity in the determination of gene order.

There is one potential problem that could act to reduce the resolution of the interspecific cross in certain genetic regions: the existence of small inversion polymorphisms that may have arisen during the divergence of M. spretus and M. musculus. An inversion will preclude the observation of recombination across all the loci that it encompasses and, in turn, this will prevent the mapping of all of these loci relative to each other. Only one such inversion polymorphism has been identified to date (Hammer et al., 1989); however, direct tests for the existence of others have not been performed. Inversions could only be demonstrated directly by creating an intraspecific linkage map for M. spretus by itself and comparing the gene order on this map to the gene order on an intraspecific M. musculus map. Although this comparison has not been performed for any chromosome other than 17, indirect evidence for several additional inversions has come from the finding of regions of apparent recombination suppression in an interspecific linkage map in comparison to an intersubspecific (castaneus-B6) linkage map (Copeland et al., 1993). Cryptic inversions could have serious consequences for those who would like to use interspecific linkage distances as a gauge for estimating the physical distance between two markers as a precursor to positional cloning as described in chapter 10.

9.3.2.2 Practical considerations

A unique advantage held by the established RI mapping panel sets is that individual RI samples are actually represented by strains of mice, and as such, they are immortal; RI samples from a mapping panel will never be "used-up." In contrast, the amount of DNA in each sample of every interspecific panel is finite. Even under the best conditions, the amount of DNA recovered from a single whole mouse will never be more than 10 mg, and in many cases, mapping panels were previously established with samples containing only one or two milligrams. In the days when all typing was carried out by Southern blot analysis of genomic DNA, it was typical to use five-to-ten microgram aliquots for each analysis. With a total per-sample size of one milligram of DNA, one could produce 100 to 200 Southern blots, each of which could be probed multiple times. Although this may sound like a large capacity, in reality, samples are spilled or transferred inefficiently, and blots become ruined. For panels that are analyzed primarily by the RFLP approach, samples will be "used-up" eventually and, as a consequence, the practical lifetime of such interspecific mapping panels is limited.

Today, of course, it is possible to develop a PCR protocol for typing in many cases and this allows one to use much smaller sample aliquots — on the order of nanograms. Thus, if the typing of a panel is restricted to PCR methods, one could conceivably analyze hundreds of thousands of loci on a single panel before it goes extinct.
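The difference in capacity between the two typing methods is a matter of simple division, sketched below in Python. The 10 ng PCR aliquot is a hypothetical figure chosen only to represent "on the order of nanograms"; the Southern blot aliquots of 5 to 10 micrograms and the one milligram sample size come from the text. Quantities are kept as whole nanograms to avoid floating-point rounding in the division.

```python
NG_PER_UG = 1_000
NG_PER_MG = 1_000_000


def assays_per_sample(total_dna_ng, aliquot_ng):
    """Number of whole aliquots obtainable from one DNA sample,
    ignoring losses from spills or inefficient transfers."""
    return total_dna_ng // aliquot_ng


sample_ng = 1 * NG_PER_MG  # one milligram per sample

print("Southern, 10 ug aliquots:", assays_per_sample(sample_ng, 10 * NG_PER_UG))
print("Southern,  5 ug aliquots:", assays_per_sample(sample_ng, 5 * NG_PER_UG))
# ASSUMPTION: 10 ng per PCR assay, a hypothetical nanogram-scale aliquot.
print("PCR,      10 ng aliquots:", assays_per_sample(sample_ng, 10))
```

These counts are upper bounds, since real panels lose material in handling.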

For many investigators, a second important advantage to the RI approach is that DNA samples or animals can be purchased, without constraints, from the Jackson Laboratory. The investigator can then perform the experimental analysis in his or her own lab, and, by comparing the new SDP obtained with those present in a public database, a map position for the new locus can be established. This entire analysis can be accomplished independently, without any need to contact, consult, or collaborate with other scientists.

In contrast, each well-characterized interspecific panel is maintained in the context of an ongoing research project by a particular scientist or laboratory. Thus, an investigator with a new clone must interact, at some level, with another scientist, in order to utilize their mapping panel and private database for the purpose of determining a new map position. Some investigators may see this interaction as an advantage. For example, in a number of cases, mapping panel "owners" are willing to carry out the experimental analysis in their own labs, thereby alleviating the workload of the independent investigator; such extensive interactions are normally treated as collaborations. Other investigators will wish to remain independent and will view such an extensive interaction as a disadvantage.

9.3.3 Access to established interspecific mapping panels

At the time of this writing, the laboratories listed below maintain well-characterized interspecific mapping panels. Different laboratories operate their mapping programs in very different ways: some send out DNA samples or filters from their panel while others perform all typing in-house. The reader should make direct contact with a particular laboratory to determine the specific protocol that is followed there. The reader should be cautioned, of course, that all of these programs depend on support from funding agencies, and a change in funding or personnel may have eliminated a particular program during the period between this writing and your reading.

In the United States, the best characterized interspecific mapping panels are maintained by N. Jenkins and N. Copeland at the Frederick Cancer Research Center in Frederick, Maryland (Copeland and Jenkins, 1991; Copeland et al., 1993), M. Seldin at Duke University Medical Center in Durham, North Carolina (Moseley and Seldin, 1989; Watson et al., 1992), and E. Birkenmeier at the Jackson Laboratory in Bar Harbor, Maine (Birkenmeier et al., 1994). In Europe, a very large interspecific mapping panel with 1,000 samples (the European Collaborative Interspecific Backcross or EUCIB) is maintained by the Human Genome Mapping Project (HGMP) Resource Centre (Watford Rd., Harrow Middx HA1 3UJ, England; FAX: 081-869-3807). EUCIB is under the joint supervision of S. Brown at St. Mary’s Hospital in England and Jean-Louis Guénet at the Pasteur Institute in Paris, France (Brown, 1993).

9.3.4 Is the newly mapped gene a candidate for a previously-characterized mutant locus?

The main reason that many investigators will want to map a newly cloned gene is to determine whether it is equivalent to a locus that has been previously mapped but is characterized only at the level of a mutant phenotype. Cloning the genes associated with interesting phenotypes in this round-about manner is usually a matter of luck and is referred to as the "candidate gene" approach. How does one begin to rule out or rule in possible identity to a phenotypically-defined locus? Unfortunately, the mapping panel used to localize the cloned gene will usually not provide simultaneous map information for any phenotypically-defined loci. Thus, one is forced to compare map positions derived from different crosses.

One should begin a search for potentially equivalent mutationally defined loci by scanning database lists of all loci that are thought to lie within 15 cM of the map position obtained for the new clone. The databases to search should include the chromosome map compiled by the appropriate mouse chromosome committee and published annually in a special issue of Mammalian Genome (Chromosome committee chairs, 1993) and GBASE, the electronic database maintained at the Jackson Laboratory and available as well within the Encyclopedia of the Mouse Genome (see appendix B). Descriptions of the phenotypes associated with loci picked up in this scan can be obtained from a compendium published in the latest edition of the Genetic Variants and Strains of the Laboratory Mouse (Green, 1989) or electronically from the continuously-updated on-line Mouse Locus Catalog (see appendix B). The expression pattern of the newly cloned gene and information concerning its protein product can often provide a means for evaluating the likelihood of an association with any particular mutant phenotype.

Once a particular locus has been identified for further consideration, one should begin a statistical evaluation of the likelihood of an equivalent map position with the newly cloned gene. To carry out this evaluation, it is important to look at the raw data that were used to place the locus on the map. Much of this information is compiled in GBASE. However, at times, it may be necessary to go back to the original reports that are cited. In this way, it will be possible to determine the actual marker locus, or loci, that were shown to be linked to the mutation, the nature of the cross that was used for analysis, and the number of recombinants observed. In many cases, mutant loci will have been mapped relative to "anchor loci" with well-established positions on contemporary chromosome consensus maps. If this is not the case, it may be necessary to backtrack through citations to uncover a multiple-step linkage relationship that does exist between the mutation and a well-established anchor.

Once a particular anchor locus has been identified with a direct linkage association to both the cloned gene and the mutant locus under consideration, the next task is to determine whether the confidence intervals associated with the map position of each show overlap. This can be accomplished with the use of the confidence limit tables presented in appendix D.

An illustration of such an analysis is presented in figure 9.10. In this hypothetical example, a newly cloned locus has been mapped relative to a common anchor locus with nine recombinants found in 94 backcross samples. This provides an estimated linkage distance of 9.6 cM. By consulting Table D.5, one can estimate lower and upper 95% confidence limits of 5.2 and 17 cM respectively. Next, one evaluates the linkage data associated with three mutant loci that have been identified as having the potential to be equivalent to the cloned gene. Mutation number one (Mut1) has been mapped relative to the same anchor locus in a backcross experiment, with 52 recombinants found among 250 samples for an estimated linkage distance of 21 cM. Extrapolation from the values given in Table D.6 provides lower and upper 95% confidence limits of 16 and 26 cM respectively. Mutation number two (Mut2) has also been mapped relative to the same anchor locus in a backcross, with 88 recombinants in 400 samples giving a linkage distance of 22 cM with lower and upper confidence limits of 18.2 and 26.3 cM (also from Table D.6). Finally, mutation number 3 (Mut3) has been mapped with a group of RI strains with one discordance observed in 40 strains giving an estimated linkage distance of 0.6 cM (from figure 9.7) and lower and upper confidence limits of 0.2 and 4.0 cM (from Table D.2).

The results of all four crosses are represented graphically in figure 9.10. The data make it very unlikely that the newly cloned gene is equivalent to loci defined by either mutation 2 or mutation 3 since none of these confidence intervals overlap. However, the 95% confidence intervals of the cloned gene and mutation 1 do overlap (even though absolute estimates of their map positions place them over ten centimorgans apart). If mutant-bearing animals are available, the potential equivalence between these two loci can be followed up with further experiments of several types. First, expression of the cloned gene can be examined in animals that carry the mutation. Second, the cloned locus itself can be examined within the mutant genome for the possible detection of easily visible alterations such as a deletion or gene-inactivating insertion. Finally, segregation of the mutant allele and the cloned gene can be followed directly in a breeding experiment (as described in the next section). It only takes one validated recombination event to rule out an equivalence between the two loci.
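
The comparison in this hypothetical example can be retraced with a short script. The sketch below is only an illustration: the function and variable names are invented here, the confidence limits are copied from the values quoted in the text rather than recomputed from the appendix D tables, and the RI conversion assumes the standard relationship for sib-mated RI strains, r = R/(4 - 6R), where R is the fraction of discordant strains.

```python
def backcross_distance_cM(recombinants, samples):
    """Estimated linkage distance (cM) from a backcross: percent recombinants."""
    return 100.0 * recombinants / samples


def ri_distance_cM(discordant, strains):
    """Estimated distance (cM) from RI strain data.

    Assumes the standard sib-mated RI relationship r = R / (4 - 6R),
    where R is the observed fraction of discordant strains.
    """
    R = discordant / strains
    return 100.0 * R / (4.0 - 6.0 * R)


def intervals_overlap(a, b):
    """True when two (lower, upper) confidence intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


# 95% confidence limits (cM from the common anchor), as quoted in the text.
cloned_gene_limits = (5.2, 17.0)
mutant_limits = {
    "Mut1": (16.0, 26.0),
    "Mut2": (18.2, 26.3),
    "Mut3": (0.2, 4.0),
}

# Mutant loci whose interval overlaps that of the cloned gene remain candidates.
candidates = [
    name
    for name, limits in mutant_limits.items()
    if intervals_overlap(cloned_gene_limits, limits)
]

print(round(backcross_distance_cM(9, 94), 1))   # cloned gene estimate
print(round(ri_distance_cM(1, 40), 2))          # Mut3 estimate from RI data
print(candidates)
```

Only the overlap test decides which loci stay on the candidate list; the point estimates are printed for comparison with the distances given above.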

9.4 Starting from scratch with a new mapping project

9.4.1 Overview

There are two types of experimental situations in which established mapping panels may not be sufficient to the needs of an independent investigator. In the first instance, an investigator may want to pursue the mapping of a large group of cloned loci to obtain, for example, a very high resolution map for an isolated genomic region. For extended mapping projects of this and other types, it becomes both cost-effective and time-effective to perform an "in-house" cross for the production of a panel of samples over which the investigator has complete control.

With a second class of experimental problems, an investigator will have no choice but to perform an "in-house" cross for analysis. This will be the case in all situations where the test locus is defined only in the context of a mutant phenotype. Often, the goal of such projects will be to clone the locus of interest through knowledge of its map position. To map a mutationally-defined locus, one will have to generate a special panel of samples in which segregation of the mutant and wild-type alleles can be followed phenotypically in animals prior to DNA preparation for marker locus typing. What follows in this section is a summary of the choices that confront an investigator in the development of a mapping project from scratch, and the process by which an investigator should proceed through the project from start to finish.

At the outset, the investigator must make decisions concerning the form of the breeding cross itself. In particular, which parental strains will be used and what type of breeding scheme will be followed? To map a mutationally-defined locus, one will obviously have to include one strain that carries the mutation. The second parental strain should be chosen based on the contrasting considerations of genetic distance (the more distant the strain, the greater the chance of uncovering polymorphisms at DNA marker loci) and ability to generate offspring in which segregation of the mutant allele can be observed. The choice of breeding scheme is limited typically to one of two different two-generation crosses: the outcross-backcross (F1 X P, where P represents one of the original parental strains) or the outcross-intercross (F1 X F1) illustrated in figures 9.11 and 9.12 respectively. If the purpose of the analysis is to map loci associated with a mutant phenotype, the nature of the phenotype may limit this choice further as discussed more fully in sections 9.4.2 and 9.4.3.

Once the strains and a breeding scheme have been chosen, one can begin to carry out the first generation cross. The number of mating pairs that should be set up need not be as large as one might think because of the expansion that will occur at the second generation. Backcrosses are usually performed with females that are F1 hybrids, and intercrosses, by definition, are always based on F1 hybrid females. As such, the second generation cross is likely to be highly productive with larger and more frequent litters than one obtains with inbred females (see section 4.1). Consider the goal of obtaining 1,000 offspring from either an outcross-intercross or an outcross-backcross. If one assumes that 90% of the second generation mating pairs will be productive with an average of four litters with eight pups in each, one would need to set up only 35 such matings. Working backwards, to generate the 35 F1 females and/or males required would entail only ten initial matings between the two parental strains with the assumption that 50% would be productive and these would each have three litters of five pups.
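
The arithmetic behind these numbers can be written out as a small planning sketch. The function name is hypothetical, and the first-generation step adds one assumption not stated in the text: that roughly half of the F1 pups are female, so that about twice as many pups as required females must be produced.

```python
import math


def matings_needed(target_offspring, productive_fraction,
                   litters_per_pair, pups_per_litter):
    """Number of mating pairs to set up to expect at least target_offspring."""
    expected_per_mating = productive_fraction * litters_per_pair * pups_per_litter
    return math.ceil(target_offspring / expected_per_mating)


# Second generation: 1,000 offspring, 90% productive pairs, 4 litters of 8 pups.
second_generation = matings_needed(1000, 0.9, 4, 8)

# First generation: one F1 female per second-generation mating.
# Assumption (not from the text): about half of all F1 pups are female.
f1_pups_needed = 2 * second_generation
first_generation = matings_needed(f1_pups_needed, 0.5, 3, 5)

print(second_generation, first_generation)
```

With the figures assumed in the text, this reproduces the 35 second-generation matings and the ten initial matings.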

An alternative backcross strategy that may sometimes be even more efficient is to set up F1 males with inbred females of one of the two parental strains in the second generation. This approach is only effective when the backcross parent to be used is a common inbred strain such as B6. In this situation, there is no limit to the number of females that can be purchased at a modest cost from various suppliers, and individual F1 males can be rotated among multiple cages of these females. Thus, what is sacrificed in terms of hybrid vigor is made up for in terms of absolute number of crosses. As few as ten males could be rotated every five days among cages with two females each for a total of 120 matings in a month. One should be aware, however, that an analysis of this type will be based entirely on recombination in the male germline which may or may not be beneficial to the investigator according to different experimental requirements as discussed at the end of section 9.4.4.2.
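
The rotation figure can be checked with a few lines. This is a minimal sketch with a hypothetical function name, and it assumes a 30-day month.

```python
def matings_per_period(males, females_per_cage, rotation_days, period_days=30):
    """Matings obtained when each male moves to a fresh cage every rotation_days."""
    rotations = period_days // rotation_days  # cages visited by each male
    return males * females_per_cage * rotations


# Ten F1 males, two females per cage, rotated every five days.
print(matings_per_period(10, 2, 5))
```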

When offspring from the second generation cross are born, one will need to analyze each for expression of the mutant phenotype. In some cases, it will be possible to use both the expression and non-expression of phenotypes as direct indicators of genotype. In other cases, it will only be possible to use phenotypic expression as an indicator of genotype in a subset of animals. This will be true for all phenotypes that are only partially penetrant as well as those that are only expressed in homozygous offspring from a second generation intercross. In both cases, the lack of phenotypic expression in any particular animal will preclude an unambiguous determination of its genotype. When it is only possible to incorporate a subset of offspring into the ultimate genetic analysis, it will obviously be necessary to generate more offspring at the outset to achieve the same level of genetic resolution. Once "phenotyping" is accomplished, animals can be converted into DNA for incorporation into the panel that will be used for analysis of marker segregation. Optimal strategies for determining map position are discussed in sections 9.4.4 and 9.4.5.

9.4.2 Choosing strains

9.4.2.1 For developing DNA marker maps

Upon commencing a new linkage study, an investigator will first have to decide upon the two parental mouse strains that will be used in the initial cross to generate F1 animals. This choice will be informed by the goal of the linkage study. If the goal is simply to develop a new panel for mapping loci defined as DNA markers, there will be no a priori limitation on the strains that can be chosen. The most important considerations will be the degree of polymorphism that exists between the two parental strains and the ease with which they, and their offspring, can be bred to produce a large panel of second generation animals for DNA typing.

As discussed earlier in this chapter and previous ones, the traditional inbred M. musculus strains show minimal levels of inter-strain polymorphism. It was for this reason that the initial two-generation mapping panels were all based on interspecific crosses between a M. musculus strain and a M. spretus strain as described in section 9.3. M. spretus is the most distant species from M. musculus that still allows the production of fertile F1 hybrids (see section 2.3.5). As such, the M. musculus X M. spretus cross will provide the highest level of polymorphism that is theoretically obtainable for the purpose of mapping.

It is certainly possible to replicate this interspecific cross with any one of a number of inbred M. musculus strains (such as B6, C3H, or DBA) and an inbred M. spretus strain (such as SPRET/Ei) that can be purchased from the Jackson Laboratory or another supplier. However, the interspecific cross is less than ideal for several reasons. First, breeding between M. musculus strains and M. spretus is generally poor with infrequent, small litters. As a consequence, one must begin with a larger number of initial mating pairs, and wait for a considerable length of time before obtaining a complete panel of second generation animals. Second, only F1 females are fertile. This rules out any possibility of setting up a second generation intercross. Finally, as discussed in section 9.3.2.1, M. musculus and M. spretus differ by at least one, and perhaps many more, small inversions that will act to eliminate recombination and, as a consequence, distort the true genetic map.

These limitations have led investigators to test the practicality of using an intersubspecific cross as an alternative that would still show sufficient levels of polymorphism but without any of the problems inherent in the interspecific cross. In particular, several laboratories have published mapping results based on crosses between the inbred strain B6, which is derived predominantly from the subspecies M. m. domesticus and the inbred strain CAST/Ei (distributed by the Jackson Laboratory) which is derived entirely from the M. m. castaneus subspecies (Dietrich et al., 1992; Himmelbauer and Silver, 1993).

The two subspecies M. m. domesticus and M. m. castaneus evolved apart from a common ancestor approximately one million years before present (see figure 2.2 and section 2.3.2). As a consequence, the level of polymorphism between the two is much greater than that observed among strains that are predominantly derived from just M. m. domesticus, but not quite as high as that observed between M. m. domesticus and M. spretus which evolved apart three million years ago. A sense of the relative levels of polymorphism that exist in various pairwise comparisons can be achieved by looking at the frequencies with which TaqI RFLPs are detected at random loci. In a comparison of two predominantly M. m. domesticus strains — B6 and DBA — 19% of the tested loci showed TaqI RFLPs; between B6 and the M. m. castaneus strain CAST/Ei, 39% of the tested loci showed RFLPs. Finally, between B6 and M. spretus, 63% of the tested loci showed RFLPs (LeRoy et al., 1992; Himmelbauer and Silver, 1993). Relative rates of polymorphism have also been surveyed at random microsatellite loci with the following results: among M. m. domesticus strains, the average rate of polymorphism is ~50%; between B6 and CAST/Ei, the polymorphism rate is 77%, and between B6 and M. spretus, the polymorphism rate is 88% (Love et al., 1990; Dietrich et al., 1992).

The bottom line from these various comparisons is that the level of polymorphism inherent in the B6 X CAST/Ei cross seems more than sufficient for generating high resolution linkage maps, especially with the use of highly polymorphic markers like microsatellites. Furthermore, the somewhat lower rate of polymorphism is more than compensated for by various advantages that this cross has over the interspecific cross with M. spretus. First, the two strains, B6 and CAST/Ei, breed easily in the laboratory with the production of large numbers of offspring. Second, both male and female F1 hybrids are fully fertile. Third, the single well-characterized interspecific inversion polymorphism does not exist between B6 and CAST/Ei (Himmelbauer and Silver, 1993), and it is likely that most other postulated interspecific inversion polymorphisms are absent as well (Copeland et al., 1993). Consequently, the linkage map that one obtains with this intersubspecific cross is much more likely to represent the map that would have been derived from a cross within the M. m. domesticus subspecies itself.

As indicated in figure 2.2, and as discussed in section 2.3.2, there are a number of other M. musculus subspecies that are just as divergent from M. m. domesticus as is M. m. castaneus. Inbred strains have been developed from the M. m. musculus subspecies, and at least two (Skive and CzechII) are available from the Jackson Laboratory. In addition, another set of inbred strains (MOLF/Ei) have been derived from the faux subspecies M. m. molossinus which is actually a natural mixture of M. m. musculus and M. m. castaneus (see figure 2.2 and section 2.3.3). It is likely that each of these inbred strains could be used in place of CAST/Ei with a similar level of polymorphism relative to M. m. domesticus, and with the same advantages described above. In fact, the availability of several unrelated wild-derived strains provides a means for overcoming the limitation to genetic resolution caused by recombination hotspots as described in section 7.2.3.3 (and illustrated in figure 7.5). This is because F1 hybrids between B6 and CAST/Ei, or MOLF/Ei or Skive are all likely to have different hotspots for recombination. Thus, by combining data from all three crosses, one will be able to "see" recombination sites that are spread out among perhaps three times as many possible locations.

9.4.2.2 For mapping a simple mutation

Another factor in strain choice comes into play when the goal of a breeding study is to map a locus defined solely by a mutant phenotype. In this case, it is obvious that one of the parental strains must carry the mutant allele to be mapped. Ideally, the mutation will be carried in an inbred, congenic, or coisogenic strain. In the second best situation, the mutation will be present in a genetic background that is a mixture of just two well-defined inbred strains. Finally, the most potentially difficult situation occurs when the mutation is present in a non-inbred, undefined genetic background.

In this last situation, it is advisable to use a single male as the sole representative of the mutant strain in matings to produce all F1 hybrids. The advantage to this approach is that the number of alleles contributed by the mutant "strain" at any one locus in all of the F1 animals will be limited to just two. If, on the other hand, one had begun with multiple males as representatives of the non-inbred mutant strain, the number of potential alleles at every locus in the panel would be twice the number of males used. The larger the number of alleles, the more complicated the analysis could become. By rotating a single male among a large set of cages containing females from the second strain, it will be possible to produce a sufficient number of F1 hybrids in a reasonable period of time.

In essentially all cases, the mice that carry the mutation will be derived from the traditional inbred strains which are themselves mostly derived from the M. m. domesticus subspecies. For all of the same reasons discussed in the previous subsection, the best choice of a second parental strain would be one that is inbred from a different M. musculus subspecies such as CAST/Ei, MOLF/Ei, or "Mus musculus" Skive or CzechII, which are all available from the Jackson Laboratory.

9.4.3 Choosing a breeding scheme

The second choice that an investigator will make upon beginning a new linkage study is between the two prescribed breeding schemes. With both schemes, illustrated in figures 9.11 and 9.12, the first mating will always be an outcross between the two parental strains chosen according to the strategies outlined above. But once F1 hybrid animals have been obtained, an investigator must decide whether to backcross them to one of the parental strains or intercross them with each other. There are advantages and disadvantages to each breeding scheme.

9.4.3.1 The backcross

The primary advantages of the backcross approach are all based on the fact that each offspring from the backcross can be viewed as representing an isolated meiotic event. The entire set of alleles contributed by the inbred parent (strain B in figure 9.11) is pre-determined. Thus, the only question to be resolved at each typed locus is whether the F1 parent has contributed the same parental allele (from strain B) or the allele from the other parent (strain A): in the first instance, typing would demonstrate the presence of only the strain B allele, and in the second instance, typing would demonstrate the presence of both the strain A and strain B alleles.

By looking at figure 9.15, one can visualize the actual meiotic products contributed by the F1 parent in the form of individual haplotypes. Every recombination event can be detected and the frequency of recombination between any two loci can be easily determined. The existence of strong interference over distances of twenty centimorgans or more can be used to advantage in the determination of gene order, since any order which requires nearby double crossover events in any haplotype is likely to be incorrect.
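
The logic of reading backcross haplotypes can be illustrated with a toy calculation. Everything below is a sketch under stated assumptions: the eight haplotypes are invented data for three unnamed loci, the function names are hypothetical, and the "fewest total crossovers" criterion is a simplification of the reasoning in the text (orders that demand nearby double crossovers are unlikely), not the procedure used by any particular mapping program. Trying every permutation is practical only for a handful of loci.

```python
from itertools import permutations


def f1_allele(detected_alleles):
    """Allele contributed by the F1 parent, given the alleles detected by typing.

    Only strain B detected -> F1 parent contributed B;
    both A and B detected  -> F1 parent contributed A.
    """
    return "A" if "A" in detected_alleles else "B"


def recombination_fraction(haplotypes, i, j):
    """Fraction of haplotypes in which loci i and j carry different parental alleles."""
    recombinants = sum(1 for h in haplotypes if h[i] != h[j])
    return recombinants / len(haplotypes)


def crossovers(haplotype, order):
    """Crossovers implied in one haplotype when loci are arranged in `order`."""
    return sum(1 for x, y in zip(order, order[1:]) if haplotype[x] != haplotype[y])


def best_order(haplotypes, n_loci):
    """Locus order that requires the fewest crossovers summed over all haplotypes."""
    def total(order):
        return sum(crossovers(h, order) for h in haplotypes)
    return min(permutations(range(n_loci)), key=total)


# Hypothetical data: one string per backcross animal, one character per locus
# (columns 0, 1, 2), giving the allele inherited from the F1 parent.
haplotypes = ["AAA", "BBB", "AAA", "BBB", "ABB", "ABA", "BAB", "BAA"]

print(f1_allele({"A", "B"}), f1_allele({"B"}))
print(recombination_fraction(haplotypes, 0, 2))
print(best_order(haplotypes, 3))
```

In this toy data set the order 0, 2, 1 explains every recombinant haplotype with a single crossover, whereas the other orders require double crossovers in some animals. An order and its reverse are equivalent; the function simply returns whichever it meets first.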

The analysis of backcross data is very straightforward, and when all loci are known to map on the same chromosome, it is possible to derive linkage relationships even in the absence of specialized computer programs. But, with the use of the Macintosh computer-based Map Manager program (described in appendix B), data presentation and analysis become even more transparent. The major disadvantage with the backcross is that it is not universally applicable to all genetic problems. In particular, it cannot be used to map loci defined only by recessive phenotypes that interfere with viability or absolute fecundity in both males and females.

9.4.3.2 The intercross

The intercross approach has two main advantages over the backcross. The first is that it can be used to map loci defined by recessive deleterious mutations since both heterozygous F1 parents will be normal, and homozygous F2 offspring can be recovered at any stage (postnatal or prenatal if necessary) for use in typing further markers. The second advantage is a consequence of the fact that informative meiotic events will occur in both parents. This will lead to essentially twice as much recombination information on a per-animal basis as compared to the backcross approach.

The main disadvantage with the intercross is also a consequence of informative meiotic events in both parents. The problem is that the data obtained are more complex, as illustrated in figure 9.4 and discussed in section 9.1.3.4, and more difficult to analyze because of the impossibility of determining which allele at each heterozygous F2 locus came from which parent. Thus, while each animal will, by definition, carry two separate haplotypes for each linkage group, the assignment of alleles to each haplotype can only be accomplished retrospectively or, in some circumstances, not at all. In addition, interference is no longer as powerful a tool for ordering loci, since nearby crossover sites can be brought together into individual F2 animals from the two parents. To generate de novo linkage maps from large scale intercross experiments, it is essential to use computer programs such as Mapmaker that carry out multilocus maximum likelihood analysis (Lander et al., 1987 and appendix B). However, when previously mapped codominant anchor loci are typed within an intercross, the more user friendly Map Manager program (version 2.5 and higher) can be used for data input and analysis.

9.4.3.3 Making a choice

In large-scale mapping experiments with many loci spread over one or more chromosomes, the backcross is usually the breeding scheme of choice. What is sacrificed in terms of mapping resolution is made up for in terms of ease of data handling and presentation. However, when an investigator is focusing on a small genomic region (on the order of a few centimorgans or less) for very high resolution mapping as a precursor to positional cloning, the intercross may be a better choice. At this level of analysis, the data will be much less complex with only a small fraction of animals expected to show mostly single recombination events in the interval of interest; the advantage gained by doubling the frequency of such events may be critical to efforts aimed at zeroing-in on the locus of interest.

Of course, as discussed above, in the case of recessive deleterious mutations, one may not have a choice but to use the intercross. Unfortunately, in situations where the mutation is strictly recessive, one will only be able to map the mutant locus with those 25% of F2 animals that express the mutant phenotype because the genotype of non-expressing animals cannot be determined (see figure 9.12). Since two meiotic events are scored in each F2 animal, the total amount of genetic information obtained will be approximately double that obtained from an equivalent number of backcross animals that can be typed. Nevertheless, this still comes out to only 50% of the information obtained from typing a complete backcross panel of the same size as the complete intercross panel. Consequently, if the trait under analysis is strictly recessive but does not seriously hinder viability or fecundity in homozygotes of at least one sex, it is more advantageous to use the backcross. In these situations, a backcross can be set up with a homozygous mutant parent, as illustrated in figure 9.11, and 100% of the offspring can be scored phenotypically for the contribution of either the mutant or wild-type allele from the F1 parent.
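
The comparison of information content can be made explicit. The sketch below uses a hypothetical function name and simply counts scorable meiotic events, which is the measure the argument above relies on.

```python
def informative_meioses(panel_size, scheme, fraction_typable=1.0):
    """Meiotic events that can be scored for the mutant locus in a panel."""
    meioses_per_animal = {"backcross": 1, "intercross": 2}[scheme]
    return panel_size * fraction_typable * meioses_per_animal


panel = 1000
# Strictly recessive mutation: only the 25% of F2 animals that express it count.
intercross = informative_meioses(panel, "intercross", fraction_typable=0.25)
# Backcross to a homozygous mutant parent: every animal can be scored.
backcross = informative_meioses(panel, "backcross")

print(intercross, backcross, intercross / backcross)
```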

9.4.4 The first stage: mapping to a subchromosomal interval

9.4.4.1 A stratified approach to high resolution mapping

An optimal strategy for high resolution linkage mapping is one that proceeds in stages with nested sets of both marker loci and animals. One can see the logic of this sequential approach by considering the numbers of markers and animals required to obtain a high resolution map in a single pass. For example, suppose one wanted to obtain a linkage map with both an average crossover resolution of 0.1 cM and an average marker density of one per centimorgan. In a one-pass approach, one would have to analyze 1,000 backcross animals for segregation at 1,500 marker loci (spanning 1,500 cM), which would require one and one-half million independent typings.
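
The one-pass workload follows directly from the two targets. This is a minimal sketch with hypothetical function names; it assumes that the average spacing between crossover sites in a backcross panel of N animals is 100/N cM.

```python
def animals_for_resolution(resolution_cM):
    """Backcross animals needed so crossovers fall every resolution_cM on average."""
    return round(100 / resolution_cM)


def markers_for_density(genome_length_cM, markers_per_cM):
    """Marker loci needed to reach the target density over the whole genome."""
    return round(genome_length_cM * markers_per_cM)


animals = animals_for_resolution(0.1)      # 0.1 cM crossover resolution
markers = markers_for_density(1500, 1)     # one marker per cM over 1,500 cM
print(animals, markers, animals * markers) # independent typings in one pass
```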

A much more efficient approach is to divide the protocol into two separate stages. The goal of the first stage should be to link the locus to a defined subchromosomal interval. This can be accomplished by typing a relatively small set of markers on a relatively small random subset of phenotypically-typed animals from within the larger panel. Once this first stage is completed, it becomes possible to proceed to the second stage which should focus on the construction of a high resolution map just in the vicinity of the locus of interest with a selected set of markers and a selected set of animal samples. The ultimate goal of this entire protocol is the identification of a handful of markers and recombinant animals that bracket a very small interval containing an interesting gene that can then be subjected to positional cloning as described in section 10.3.

9.4.4.2 How many animals and how many markers?: evaluation of the swept radius

The first step in the first stage of the protocol is to develop a framework map that is "anchored" by previously well-mapped loci spaced uniformly throughout the entire genome. To accomplish this task most efficiently, it is critical to calculate the minimum number of anchor loci required to develop this low resolution, but comprehensive, map. This calculation is based on the length of the swept radius that extends on either side of each marker. As discussed earlier in this chapter (section 9.2.2.3), the swept radius is a measure of the distance over which linkage can be detected between any marker and a test locus when both are typed in a set number of offspring generated with a defined breeding protocol. Although the swept radius was defined originally in terms of map distances (Carter and Falconer, 1951), it is much easier to work directly with recombination fractions, and in the following discussion, charts, and figures, I will use this alternative metric.

Two measures of the backcross swept radius, determined for sample sizes that range from 20 to 100 animals, are presented in figure 9.13. The first measure is based on the traditional view of a swept radius as a boundary that separates significant from non-significant rates of observed recombination. This "experimental swept radius" is shown as the solid curve in figure 9.13. The graph can be used to find out quickly whether any experimentally determined recombination fraction, or concordance value, meets the strictly defined Bayesian-corrected cutoff for demonstration of linkage at a probability of 95% or greater (see section 9.1.3.6).

Although the experimental swept radius provides a means to evaluate the significance of newly derived data, it is not useful as a means to establish the distances that should separate marker loci to be chosen for a framework map in a new cross. The problem is that marker loci that are actually separated by a map distance equivalent to the experimental swept radius will, by chance, recombine to a greater or lesser extent with equal probability in any particular experimental cross, and in those 50% of the crosses where a higher recombination fraction is observed, the data will not be sufficient to establish linkage at a 95% level of significance. Thus, a second, more conservative measure of swept radius is needed to determine the maximum actual recombination distance between two loci that will allow the demonstration of linkage at a probability of 95% with a frequency of 95%. I will call this parameter the "framework swept radius".

The "framework swept radius" can be evaluated as a recombination fraction associated with a 95% confidence interval having an upper confidence limit equivalent to the value of the experimental swept radius for a sample set of a particular size. In the discussion that follows, I will use this framework swept radius as a means for establishing the distances that should separate markers to be used in setting up a new framework map.

With a set number of backcross samples, one can use figure 9.13 to find the corresponding framework swept radius associated with each anchor locus. For example, with 52 samples, the framework swept radius is 15 cM, with 72 samples, it is 20 cM, and with 94 samples, it becomes 24 cM. It is clear that once a critical number of samples has been reached (45 to 50), further increases in number provide only a marginal increase in the distance that is swept. Figure 9.13 can also provide a first approximation of the framework swept radius associated with a panel of intercross samples. To a first approximation, each intercross sample is equivalent to two backcross samples. Thus, a swept radius of ~15 cM can be obtained with 26 intercross samples, and ~20 cM can be obtained with 36 intercross samples.

The framework swept radius can be used in conjunction with the lengths of each individual chromosome to determine the number of anchor loci required to provide complete coverage over the entire genome. Essentially, anchors can be chosen such that their "swept diameters" (twice the swept radius) cover directly adjacent regions that span the length of every chromosome as illustrated in figure 9.14. The first and last anchors on each chromosome must be placed within one swept radius of their respective ends, while the distance between adjacent anchors should be within two swept radii. The estimated lengths of all twenty mouse chromosomes are sorted into a set of ranges in Table 9.4. The number of anchors required per chromosome for a backcross analysis is calculated by dividing the chromosome length by the swept diameter defined with a sample set of a particular number (from the graph in figure 9.13) and rounding up to the nearest integer. As indicated in Table 9.4, with 52 backcross samples, it is possible to cover the entire mouse genome with 60 well-placed anchors. With 72 samples, the number of required anchors decreases to 46, and with 94 samples, it decreases to 43. It is clear that little is to be gained by including more than 72 samples in this initial analysis.
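
The per-chromosome calculation can be sketched as follows. The framework swept radii are the three values quoted above from figure 9.13; the 80 cM chromosome is a hypothetical example, since the lengths in Table 9.4 are not reproduced here, and the even-spacing rule in `anchor_positions` is just one placement that satisfies the two constraints stated in the text.

```python
import math

# Framework swept radius (cM) by number of backcross samples, as quoted in the text.
FRAMEWORK_RADIUS_CM = {52: 15, 72: 20, 94: 24}


def anchors_needed(chromosome_length_cM, swept_radius_cM):
    """Chromosome length divided by the swept diameter, rounded up."""
    return math.ceil(chromosome_length_cM / (2 * swept_radius_cM))


def anchor_positions(chromosome_length_cM, swept_radius_cM):
    """Evenly spaced target positions (cM) for the required anchors.

    End anchors fall within one swept radius of the chromosome ends and
    adjacent anchors are never more than two swept radii apart.
    """
    n = anchors_needed(chromosome_length_cM, swept_radius_cM)
    return [chromosome_length_cM * (2 * i + 1) / (2 * n) for i in range(n)]


hypothetical_length_cM = 80  # illustrative value only
for samples, radius in FRAMEWORK_RADIUS_CM.items():
    print(samples, anchors_needed(hypothetical_length_cM, radius),
          anchor_positions(hypothetical_length_cM, radius))
```

Summing `anchors_needed` over the actual chromosome lengths would give the genome-wide totals of the kind listed in Table 9.4.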

The minimalist approach just outlined to a comprehensive framework map has only become feasible as this chapter is being written. This feasibility is based on the availability of over 3,000 highly polymorphic microsatellite loci that span the genome with an average spacing of less than one centimorgan (Copeland et al., 1993). Primer pairs that define each of these loci are commercially available at a modest cost from Research Genetics Inc. in Huntsville, Alabama. By contacting the Genome Center at the Whitehead Institute, as described in appendix B, one can obtain chromosome-specific lists of microsatellites that are polymorphic between the particular parental strains that an investigator has used to generate his or her linkage panel. With this information, one can choose specific microsatellite loci that map to each of the general locations required to span each chromosome as illustrated in figure 9.14.

In the backcross linkage studies reported to date, the gender of the F1 hybrid used in the second generation cross has usually been female. In the case of the interspecific cross, there is no other choice since the F1 male is sterile. However, this is not a factor in the intraspecific or intersubspecific cross. Rather, F1 hybrid females are used for two other reasons. First, they have a much higher fecundity relative to inbred females, and second, they generally display higher frequencies of recombination (section 7.2.3.2) which, in turn, will produce a higher resolution map in the second stage of linkage analysis described in the next section. Interestingly, the lower recombination frequency associated with male mice is actually better suited to the first stage of mapping because it can act, in effect, to reduce the centimorgan length of each chromosome by 15% to 40%. Thus, with the use of male F1 hybrids in the backcross, one would, in theory, need fewer anchor loci to span the genome. Furthermore, as discussed in section 9.4.1, in backcrosses to a common inbred parent such as B6, the use of F1 males is likely to be much more efficient and provide many more N2 progeny more quickly than the reciprocal cross. Unfortunately, at the time of this writing, male-specific linkage maps have not been developed for the new libraries of microsatellite loci. Hence, at the current time, the spacing of microsatellites for this purpose would be a matter of guesswork.

9.4.4.3 Determining linkage

The first analysis of backcross data should be directed at simply determining the existence of linkage to the locus of interest. This is accomplished by comparing the pattern of allele segregation from the new locus with the patterns of allele segregation from each anchor locus. Essentially, the frequency of recombination between the new locus and each anchor locus is calculated, one at a time, to identify one or more anchors that show a significant departure from the independent assortment frequency of 50%. This task is performed most easily by entering the accumulated allele segregation data into an electronic file that is analyzed by a special computer program developed for this type of analysis. A number of such computer programs have been described (see appendix B). The most user-friendly of these is the Apple Macintosh-based Map Manager program developed by K. Manly (1993) and described in appendix B.

It is also possible to determine linkage, when a backcross set is not too large, without the use of a specialized computer program. This can be accomplished by entering the allele segregation information for each locus along a separate row or line in a spreadsheet or word processing file, where each column represents a separate animal (analogous to the RI strain data matrix illustrated in figure 9.6). Anchor loci should be placed in sequential rows according to their known order along each chromosome. The very first rows should be reserved for the new locus (or loci). The complete file will be a matrix of information with the number of rows equal to the number of anchor and new loci typed and the number of columns equal to the number of backcross animals analyzed. For the N = 52 backcross typed for one new locus in addition to a minimal number of anchors (from Table 9.4), this would be a 61 X 52 matrix of data.

Next, one would take the row representing a new locus and compare it row-by-row, either on the computer or on paper, for pattern similarities with each anchor locus allele distribution. Visual inspection alone will be sufficient to distinguish similar runs of alleles in two rows. The total recombination fraction between the new locus and any anchor locus identified in this way can be easily calculated; if the fraction of recombinants is greater than the experimental swept radius found in figure 9.13 (0.27 for N = 52), linkage can be rejected and one can move on to the next locus. Although this process is somewhat tedious, the time that it takes is minimal compared to the time involved in actually typing DNA markers in the first place. In contrast, with whole genome data obtained from an intercross, manual determination of linkage is extremely difficult. Instead, one should use one of the limited number of programs available for this type of analysis. The best known of these programs is Mapmaker, developed by Eric Lander and colleagues (Lander et al., 1987, and appendix B).
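For a backcross, the row-by-row comparison can also be written as a short script. The sketch below assumes that each locus is stored as a string with one character per animal ("A" for animals carrying the strain A allele, "B" for animals homozygous for strain B); the locus names and allele strings are made up for illustration, and the 0.27 cutoff is the N = 52 swept radius quoted above, which would not be appropriate for a panel as small as this toy example.

```python
def recombination_fraction(row_a, row_b):
    """Fraction of animals whose alleles differ between two loci."""
    if len(row_a) != len(row_b):
        raise ValueError("both rows must cover the same animals")
    recombinants = sum(1 for a, b in zip(row_a, row_b) if a != b)
    return recombinants / len(row_a)

def linked_anchors(new_locus_row, anchor_rows, swept_radius):
    """Return {anchor name: rf} for anchors not rejected by the swept radius."""
    results = {}
    for name, row in anchor_rows.items():
        rf = recombination_fraction(new_locus_row, row)
        if rf <= swept_radius:  # rf above the swept radius: reject linkage
            results[name] = rf
    return results

# Hypothetical typing data for 12 backcross animals.
new_locus = "ABBABAABBABA"
anchors = {
    "anchor_1": "ABBABAABBABB",  # differs from the new locus in 1 animal
    "anchor_2": "BABABBABAABA",  # differs from the new locus in 4 animals
}

candidates = linked_anchors(new_locus, anchors, swept_radius=0.27)
print(candidates)
```

Any anchor retained by this screen is only a candidate; the recombination fraction should still be judged against the significance criteria discussed in this section.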

Ideally, linkage analysis will identify at least one, and at most two, loci that are linked at a significance level of 95% to the new locus of interest. If there are two linked loci, they should be adjacent to each other within the framework map formed on the same chromosome. With results of this type, one can move on to the next task of determining the order of the new locus relative to the framework map as discussed below.

It is possible that the data will not be sufficient to demonstrate linkage with a significance of 95% to any of the anchor loci that were typed. It is critical at this point to confirm all DNA marker typings as well as phenotypic determinations for each animal. If there is still no evidence of linkage at the 95% significance level, one can attempt to uncover potential linkage relationships by reducing the required level of significance. This may allow the suggestion of linkage in the middle of a particular chromosomal interval between two anchors or near a chromosome end. If this approach fails, one should examine the recombination intervals that separate each anchor along each chromosome (with the haplotype method described in the next section) in order to pick out intervals that are larger than anticipated. One can re-type the same set of backcross animals for new anchors in regions suggested by any of these approaches. If this approach fails as well, one should consider the possibility that the new locus may map very close to a centromere or telomere; to test this possibility, it would be necessary to type more centromeric and telomeric anchors on each chromosome. Finally, one should consider the possibility that complex genetic interactions such as incomplete penetrance and/or polygenic effects may be acting to distort the one-to-one relationship between phenotype and genotype at any single locus (see section 9.5).

9.4.4.4 Pooling DNA samples for the initial identification of linked markers

In essentially all mapping experiments performed today, the vast majority, if not all, of the marker loci used are typed by DNA-based techniques. At the time of this writing, the most versatile, and most commonly used, genetic marker is the microsatellite (section 8.3.6). But other DNA markers that are useful in particular cases include those that can be assayed by the SSCP protocol (section 8.3.3) and RFLP analysis (section 8.2). The genotyping of all of these marker types within offspring from a mapping cross is based on the detection of "codominant" alleles recognized as different size bands after gel electrophoresis.

In the mapping approach just described in the previous section, each backcross animal is converted into a DNA sample that is typed independently for each marker locus that has been chosen to sweep the genome. The total number of PCR reactions (or restriction digests) required can be determined from Table 9.4 by multiplying the number of markers by the number of backcross animals. The smallest number is obtained with 52 animals typed for 60 markers, which comes out to 3,120 reactions (followed by an equivalent number of lanes on gels). Unless one has access to automated PCR and gel running equipment and unlimited funds for thermostable DNA polymerase, this approach could be prohibitive in cost.

A much more efficient approach can be used when the goal of a cross is to map the locus or loci responsible for a particular mutant phenotype or polymorphic trait that is segregating in either a backcross or an intercross. The only essential prerequisite is that the parents used in the first generation mating must be from an inbred or segregating inbred strain (see section 3.2.4).

The basic strategy is to reduce the number of PCR reactions (or restriction digests) and subsequent gel runs through the analysis of only one or two combined DNA samples that are obtained by pooling together equivalent amounts of high quality DNA from all second generation animals expressing the same phenotype (Michelmore et al., 1991; Asada et al., 1994). This pooled DNA strategy works for both the backcross protocol and the intercross protocol. It works for incompletely penetrant traits and for quantitative traits controlled by segregating alleles at more than one locus (see section 9.5.4.2). However, it requires the use of markers with segregating alleles that can be reproducibly distinguished and detected with equal levels of intensity. Thus, not all PCR-based markers will be suitable.

Let us consider the simple example of a backcross in which all N2 animals can be phenotypically distinguished at a single mutant locus as illustrated in figure 9.11. The first step of the analysis would be to classify each animal as +/m or m/m followed by the conversion of each individual into a high quality DNA sample. Then, equal amounts of DNA from each m/m sample would be combined into one pool, and equal amounts of DNA from each +/m sample would be combined into a second pool. A third control sample would be formed by combining equal amounts of DNA from the two parents of the cross: the F1 hybrid and strain B in figure 9.11. Finally, an aliquot from each of these three composite samples would be subjected to PCR amplification with primer pairs specific for one marker at a time (or restriction digestion), and the amplified (or digested) samples would be separated by gel electrophoresis and analyzed by ethidium bromide staining, or probing, or autoradiography.

The results expected for markers showing different linkage relationships to the mutant locus are illustrated in Table 9.5. For all markers that are not linked to the test locus, the allele patterns obtained with the three composite DNA samples should be indistinguishable with a ratio of 1 : 3 in the intensities of the strain A and strain B alleles. In contrast, when a marker is very closely linked to the mutant locus, the ratio of alleles in the two pooled samples will diverge significantly in opposite directions from the control sample: in the m/m sample, the strain A allele will be absent or very light, while in the +/m sample, the intensity of the strain A allele will climb to equality with the strain B allele (whose signal will decrease proportionally). For ease of analysis, it is best to run the control sample in-between the two pooled N2 samples.
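The expected band ratios follow directly from the structure of the backcross, and a small calculation makes the dependence on recombination distance explicit. The sketch below is not a reproduction of Table 9.5; it assumes that the wild-type (+) allele entered the cross from strain A, that the mutant (m) allele comes from strain B, and that every N2 animal carries one strain B chromosome from the backcross parent. The function name is invented for this example.

```python
def strain_a_fraction(rf, pool):
    """Expected fraction of strain A alleles at a marker in a pooled sample.

    rf   -- recombination fraction between marker and mutant locus (0 to 0.5)
    pool -- "m/m", "+/m", or "control" (F1 hybrid plus strain B parent)
    """
    if not 0.0 <= rf <= 0.5:
        raise ValueError("rf must lie between 0 and 0.5")
    if pool == "m/m":
        # The F1-derived chromosome carries m; its marker allele is strain A
        # only when a recombination event has separated marker and locus.
        f1_derived_a = rf
    elif pool == "+/m":
        f1_derived_a = 1.0 - rf
    elif pool == "control":
        f1_derived_a = 0.5
    else:
        raise ValueError("unknown pool: " + pool)
    # The other chromosome in every animal is from strain B.
    return f1_derived_a / 2.0

for rf in (0.0, 0.1, 0.2, 0.5):
    print(rf, strain_a_fraction(rf, "m/m"), strain_a_fraction(rf, "+/m"))
```

At rf = 0.5 both pools give a strain A fraction of 0.25, which is the 1 : 3 ratio of the control; at rf = 0 the strain A allele disappears from the m/m pool and reaches equality with the strain B allele in the +/m pool.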

The power of this strategy for linkage analysis derives from the huge reduction in the number of samples that must be typed for each marker. Instead of 40, 50, 60 or more, the number is reduced to just three. However, to get a sense of the overall savings in time and cost, it is important to consider two factors: (1) the number of individual N2 samples that must be included in each pool, and (2) the recombination distance over which a significant departure from the control sample can be observed.

Increasing the number of individual samples in each pool serves two purposes. First, random errors in the measurement of individual sample aliquots will tend to become evened out over a larger pool. Second, chance departures from the control ratio of alleles (i.e. false positives) will become much less frequent for unlinked markers (see figure 9.13). For both of these reasons, one should set a minimum pool size at 20 animals. There is no maximum to the pool size but there is nothing to be gained from pooling more than 50 samples together.

It is difficult to predict the level of concordance that must exist between the test locus and a marker before one can judge a result to be evidence of linkage. A certain level of non-genetic variation is likely from sample-to-sample, and thus, a positive result must be one with a signal ratio that goes significantly beyond this normal variation. Consequently, the swept radius for markers analyzed in pooled samples will almost certainly be less than that possible with individual animal analysis as well as different from one marker to another. From the numbers shown in Table 9.5, the detection of linkage out to a distance of ~20 cM, but not much farther, would appear possible. Thus, up to 50% more markers may be required to sweep the entire genome.

The pooled DNA approach is maximally resolving when the nature of the phenotype under analysis allows the investigator to obtain two pools representing samples from each of the parents in the backcross (the F1 and strain B in figure 9.11) or the two original strains used to generate the intercross (strain A and strain B in figure 9.12). In a situation of this type, each departure from the control ratio observed for a marker in one pool should be accompanied by a departure in the opposite direction for the other pool (see Table 9.5). This requirement for confirmation will act to reduce the frequency of false positive results. In many experimental situations, however, it will only be possible to develop a single pool of homozygous m/m samples for analysis. This will be the case for backcross studies of incompletely penetrant traits and for intercross studies of fully recessive phenotypes (figure 9.12). In such cases, it will be necessary to generate and phenotype a larger number of animals in order to identify the smaller subset of samples that can be included within the single pool that can be made available for comparison to the control.

Once markers potentially linked to the test locus have been identified by the DNA pooling approach, it is essential to go back with each "positive" marker and individually type each sample in the pool to obtain quantitative confirmation of linkage or to rule it out. But, even with the reduction in genetic resolution and the requirement for confirmatory analysis, the DNA pooling approach can reduce the number of samples to be analyzed by at least an order of magnitude with large savings in labor and cost. If linkage to a single marker has been confirmed through individual sample analysis, the investigator can re-type each of the samples with additional markers that lie within a 30 cM radius on either side to pursue the haplotype analysis described in the next section.

9.4.4.5 Determining gene order: generating a map

Once linkage has been demonstrated for a new locus, it is usually straightforward to determine its relative position on the chromosome framework map. For backcross data, this is accomplished by a method referred to as haplotype analysis. Haplotype analysis is performed on one linkage group at a time. For the mapping of any new locus, it is only necessary to carry out this approach for the chromosome to which the locus has been linked. The first task is to classify each backcross animal according to the alleles that it carries at the anchor loci typed just on the chromosome of interest. By definition, when two or more animals carry an identical set of alleles, they have the same "haplotype" on that chromosome. By comparing the data obtained for all members of the backcross panel, one can determine the total number of different haplotypes that are present.

As illustrated in figure 9.15, each distinct haplotype is represented by a column of boxes, with one box for each locus; each box is either filled-in to indicate one parental allele or left empty to indicate the other parental allele. Anchor loci are placed according to their order along the chromosome from most centromeric at the top to most telomeric at the bottom. The number of animals that carry each haplotype is indicated at the bottom of each column. Haplotypes are presented in order from left to right according to the number and location of recombination events. Parental haplotypes, which show no recombination, are indicated first. Haplotypes with single recombination events are presented next, followed by those with two recombination events, and more, if they exist. Vertical lines can be used to separate haplotype pairs defined by reciprocal allele combinations. The class of single recombination haplotypes is presented in order from left to right according to the position of the breakpoint from most centromeric to most telomeric. The haplotype diagram shown in figure 9.15 can be generated automatically (in printable form) from recorded data in the Manly (1993) Map Manager program.

The haplotype diagram can be used to generate a linkage map by adding up the total number of animals that are recombinant between adjacent loci. For example, the G, H, I, and K haplotypes show recombination between the hypothetical D51 and D33 loci shown in figure 9.15; these haplotypes are carried by 9, 10, 1, and 1 animals respectively. Thus, in total, 21 animals are recombinant between these loci for a calculated recombination fraction (rf) of 0.404. When a recombination fraction is larger than 0.25, one should use the Carter-Falconer mapping function (equation 7.3) to obtain a more accurate estimate of map distance in centimorgans. The calculated mFC value is 44 cM. Similarly, the recombination fractions that separate D81 from D12, and D12 from D51 are both found to be 0.269. With the Carter-Falconer equation, this recombination fraction value is adjusted slightly to a map distance of 27.3 cM.
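The tally for the D51 to D33 interval can be checked with a few lines of Python. The haplotype counts are the ones quoted above; the panel size of 52 is not stated explicitly in this paragraph but is implied by 21 recombinants giving a fraction of 0.404, and the figure of 14 recombinants for the other two intervals is likewise inferred from the quoted fraction of 0.269.

```python
# Animals carrying each haplotype that is recombinant between D51 and D33.
recombinant_counts = {"G": 9, "H": 10, "I": 1, "K": 1}
panel_size = 52  # inferred from 21 / 0.404, see the note above

def recombination_fraction(recombinant_animals, total_animals):
    """Recombinant animals divided by all animals typed."""
    return recombinant_animals / total_animals

recombinants_d51_d33 = sum(recombinant_counts.values())
rf_d51_d33 = recombination_fraction(recombinants_d51_d33, panel_size)
print(recombinants_d51_d33, round(rf_d51_d33, 3))

# Inferred, not stated in the text: 14 recombinants would give 0.269.
rf_other_intervals = recombination_fraction(14, panel_size)
print(round(rf_other_intervals, 3))
```

The conversion of these fractions into centimorgans with the Carter-Falconer function is left to equation 7.3 and is not reproduced here.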

With a framework haplotype diagram and map, it becomes possible to determine the location of a new locus under analysis. Consider the hypothetical example in figure 9.15 where linkage has already been demonstrated between a new locus and just one anchor locus, D51. In this case, the new locus could be in either one of two positions on the chromosome, proximal or distal to D51. To test these two locations, one can draw a second set of haplotype diagrams that include only those newly defined haplotypes showing recombination between the linked anchor D51 and the new locus. In this example, a subset of animals from the previously defined haplotype classes A, G, H, and I define four new haplotypes labeled A’, G’, H’ and I’ respectively as illustrated in figure 9.15. These haplotypes are drawn in two different ways with the new locus either proximal or distal to D51. The correct order can be determined by minimizing both the number of multiply-recombinant haplotypes and the total number of implied recombination events within the sample set. In the example shown, a distal location requires a total of eight crossover events that take place within four single recombinant chromosomes and two double recombinant chromosomes. Alternatively, a proximal location requires a total of twelve crossover events with no single recombinant chromosomes, one double, two triples and one quadruple. Data of this type clearly point to a distal location for the new locus. Although any real set of data will obviously give different results, the same logical progression will almost always provide a definitive map position. With the computer program Map Manager, this analysis can be accomplished automatically.
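The logic of choosing the order that implies the fewest crossovers can be illustrated with a toy data set. The three animals below are invented for this sketch and do not correspond to the haplotypes in figure 9.15; the locus order D81, D12, D51, D33 is taken from the intervals discussed above, and the label "new" stands for the locus being placed.

```python
def crossovers(genotype, order):
    """Count allele switches between adjacent loci for one animal."""
    alleles = [genotype[locus] for locus in order]
    return sum(1 for left, right in zip(alleles, alleles[1:]) if left != right)

def total_crossovers(animals, order):
    """Total implied crossover events across all animals for one locus order."""
    return sum(crossovers(genotype, order) for genotype in animals)

# Hypothetical animals, all recombinant between D51 and the new locus.
animals = [
    {"D81": "A", "D12": "A", "D51": "A", "D33": "B", "new": "B"},
    {"D81": "B", "D12": "B", "D51": "B", "D33": "A", "new": "A"},
    {"D81": "A", "D12": "A", "D51": "A", "D33": "A", "new": "B"},
]

proximal_order = ["D81", "D12", "new", "D51", "D33"]
distal_order = ["D81", "D12", "D51", "new", "D33"]

proximal_total = total_crossovers(animals, proximal_order)
distal_total = total_crossovers(animals, distal_order)
best = "distal" if distal_total < proximal_total else "proximal"
print(proximal_total, distal_total, best)
```

With these invented animals the distal placement implies fewer crossovers, which parallels the reasoning applied to the figure 9.15 example. A tie between the two totals would mean that the data cannot distinguish the two orders.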

With intercross data, whole chromosome haplotype analysis can be much less straightforward (as illustrated in figure 9.4). Consequently, gene order is usually determined computationally by the method of maximum likelihood analysis (Lander et al., 1987). Nevertheless, with the aid of a framework map, it is usually possible to break down F2 genotype information into pairs of most likely haplotypes for each animal (D’Eustachio and Clarke, 1993). At this point, a new locus could be mapped according to the same logic described above.

9.4.5 The second stage: high resolution mapping

The ultimate goal of the second stage of many mapping projects is to identify both DNA markers and recombination breakpoints that are tightly enough linked to a new locus of interest to provide the tools necessary to begin positional cloning. This second stage can be broken down optimally into a series of steps as follows:

Step 2.1: The first goal of this second stage should be to narrow down the map interval as much as possible using only the small panel of samples typed in stage 1. This can normally be accomplished by selecting and typing additional microsatellite markers spaced across the 20 cM region to which the locus of interest has been mapped. With an original panel of 54 backcross samples, for example, recombination breakpoints will be distributed at average distances of about two centimorgans. Thus, by typing additional markers, one should be able to reduce the size of the gene-containing interval from an original 25 to 40 cM down to 4 to 10 cM. The goal of this step is to identify the closest "limiting markers" on both sides of the locus of interest that do show recombination with it in order to establish an interval within which the locus must lie.

Step 2.2: The next step requires the breeding of a large number of animals that segregate the mutant allele. Ideally, the total number of animals bred should be at least 300 with a maximum of 1,000 (see section 7.2.3.3). But, this large set can be quickly reduced to the smaller set of samples that show recombination in the interval to which the gene has already been mapped. This can be accomplished by typing each animal for just the two "limiting markers" identified in step 2.1. If, for example, the locus-containing interval had previously been restricted to a ten centimorgan region bounded by these markers, this analysis would eliminate from further consideration approximately 90% of the total samples in the large cohort. If a PCR-based analysis is used to type the two markers, rapid methods for obtaining small quantities of partially purified DNA from members of the large cohort may be sufficient (Gendron-Maguire and Gridley, 1993).

Step 2.3: The smaller subset of animals selected in step 2.2 can now be typed with a larger set of markers previously localized to the genomic interval between the two limiting markers defined in step 2.1. At this point, it makes sense to test all segregating microsatellites that have been placed into one centimorgan bins extending from one limiting marker to the other as well as any other suitably located DNA markers (Copeland et al., 1993 and appendix B). Newly tested markers that show no recombination with either one limiting marker or the other (among all animals tested) are likely to map outside the defined interval. But, all new markers that show recombination in different samples with each of the previously defined limiting markers will almost certainly map between them. Haplotype analysis can be used once again to obtain a relative order for these newly mapped markers. If the initial interval defined in step 2.1 is ten centimorgans or less, double recombination events will be extremely unlikely, and with this underlying assumption, it should be possible to obtain an unambiguous order for all markers that show recombination with each other and/or the phenotypically-defined locus.

Step 2.4: As multiple new markers are mapped to the interval between the two previously defined limiting markers, it should become possible to reduce the size of the gene-containing interval even further than the one defined in step 2.1. As the size of the interval is reduced, the number of animal samples within the panel that need to be analyzed further can also be reduced to include only those that show recombination between the newly defined limiting markers. Additional markers should be typed until one reaches the ultimate goal of identifying limiting markers that each show only one (ideally) or a few recombination events on either side of the locus of interest along with one or more markers that show absolute concordance with the locus itself as illustrated in figure 10.1. If one exhausts the available sources of markers without coming close to this goal, it may be necessary to derive additional region-specific markers as discussed in section 8.4.

With the identification of one or more DNA markers that show no recombination with the locus of interest, figure 9.16 or 9.17 can be used to gain a sense of the distance that separates them. For example, with an initial cohort of 380 animals, the average distance that will separate marker and locus will be 0.2 cM and the 95% confidence interval will extend out to one centimorgan. With an initial cohort of 1,000 animals, the average distance will be less than 0.1 cM and the 95% confidence interval will extend out less than 0.4 cM. At this stage of analysis, one can move on to the task of generating a physical map that extends across the genomic region between the two closest limiting markers. Section 10.3.3.3 and figure 10.1 provide a comprehensive example of the approach described in this section.
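A rough feel for these confidence limits can be obtained without the figures. The sketch below is a simplified calculation and not necessarily the derivation behind figures 9.16 and 9.17: it assumes that each animal represents one independent meiosis, that no recombinants were observed, and that for such short distances the recombination fraction multiplied by 100 can be read as centimorgans. The upper limit is the recombination fraction at which a run of zero recombinants would occur with a probability of only 5%.

```python
def max_distance_cm(n_animals, confidence=0.95):
    """Upper limit (in cM) on marker-locus distance given zero recombinants.

    Solves (1 - rf) ** n_animals = 1 - confidence for rf.
    """
    if n_animals < 1:
        raise ValueError("at least one animal is required")
    alpha = 1.0 - confidence
    rf_upper = 1.0 - alpha ** (1.0 / n_animals)
    return 100.0 * rf_upper

for cohort in (380, 1000):
    print(cohort, round(max_distance_cm(cohort), 2))
```

Under these assumptions the limits come out somewhat under one centimorgan for 380 animals and under 0.4 cM for 1,000 animals, in line with the values quoted above.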

9.5 Quantitative traits and polygenic analysis

9.5.1 Introduction

Most of the phenotypic characteristics that distinguish different individuals within a natural population are not of the all-or-none variety associated with laboratory-bred mouse mutations like albino, non-agouti, brown, quaking, Kinky tail, and hundreds of others. On the contrary, easily visible human traits such as skin color, wavy hair, and height, as well as hidden traits such as blood pressure, musical talent, longevity, and many others each vary over a continuous range of phenotypes. These are "quantitative traits", so called because their expression in any single individual can only be described numerically based on the results of an appropriate form of measurement. Quantitative traits are also called continuous traits, and they stand in contrast to qualitative, or discontinuous, traits that are expressed in the form of distinct phenotypes chosen from a discrete set.

Continuous variation in the expression of a trait can be due to both genetic and non-genetic factors. Non-genetic factors can be either environmental (in the broadest definition of the term) or a matter of chance. In mice, it is relatively straightforward to separate genetic from non-genetic contributions through the analysis and comparison of animals within and between inbred strains. Variation in expression among individual members of an inbred strain must be caused by non-genetic factors. Furthermore, if one is convinced that all individuals are maintained under identical environmental conditions, then existing variation is likely to be the result of chance alone.

Geneticists are, obviously, most interested in the genetic contribution to a Quantitative Trait. A genetic contribution cannot be demonstrated by looking at individuals from a single inbred strain alone. Rather, a comparison of expression levels must be made on sets of animals from two different inbred strains. The statistical approach described in appendix D.2 can be used to determine formally whether two strains differ significantly in the expression of the quantitative trait. If a significant strain-specific difference is demonstrated, and all other variables have been controlled for, it becomes possible to attribute the observed difference in quantitative expression to allelic differences that distinguish the two strains.

In practice, a Quantitative Trait is most amenable to genetic analysis in mice and other experimental organisms with a pair of inbred strains that show non-overlapping distributions in measured levels of expression among at least 20 members of each group. Although a significant strain-specific difference can be demonstrated under much less stringent criteria (as described in appendix D.2), it becomes more and more difficult to ferret out the Quantitative Trait Loci (QTLs) involved as the possibility of phenotypic overlap increases.

The appearance of a Quantitative Trait usually signifies the involvement of multiple genetic loci, although this need not be the case. In particular, a single polymorphic locus with multiple, differentially expressed alleles can give rise to continuous variation within a natural population. There may also be some instances where the expression of a quantitative trait is controlled by a mutant allele at a single locus with a high degree of variable expressivity (Asada et al., 1994). However, if a single locus is responsible for the entire genetic contribution to a Quantitative Trait difference between two inbred strains, this would most likely become apparent in the second generation of either an outcross-backcross or outcross-intercross breeding protocol. In the first instance, half the N2 animals will be identical to the F1 parent, and the other half will be identical to the inbred backcross parent at the critical locus as illustrated in the top panel of figure 9.18. The result would be a discontinuous distribution of phenotypes that fall into two equally populated classes with separable distributions that parallel those found for each of the first generation parents. With the intercross protocol, F2 animals will be distributed among three classes (in a 1:2:1 ratio) that will parallel the phenotypic distributions found among one parental strain, the F1 hybrid, and the second parental strain.

If a significant number of second generation animals are found to express phenotypes intermediate to those found in the parental strains and F1 hybrid, it is most likely that multiple genetic differences between the progenitor strains are responsible as illustrated in the lower panels of figure 9.19. The term polygenic is used to describe traits that are controlled by multiple genes, each of which has a significant impact on expression. The term multifactorial is also used to describe such traits, but is more broadly defined to include those traits controlled by a combination of at least one genetic factor with one or more environmental factors.

Not all polygenic traits are quantitative traits. A second polygenic class consists of those traits associated with a discrete phenotype that requires particular alleles at multiple loci for its expression. Polygenic traits of this type can be classified and analyzed with breeding protocols that are the same as those used for quantitative traits. For example, suppose strain DBA shows hypersensitivity to loud noises with 100% penetrance while neither strain B6 nor F1 hybrid animals show any sensitivity. This result would indicate that hypersensitivity is recessive. Further analysis would proceed by backcrossing the F1 animals to the homozygous recessive DBA parent. If instead, the DBA trait was expressed in a dominant manner, the backcross would be made to the homozygous recessive B6 parent. In either case, backcross offspring would be typed for hypersensitivity. If 25% or less of the backcross animals expressed the trait while all of the others were normal, this would provide evidence for the requirement of at least two DBA genes to allow phenotypic expression.

It is important to mention that more complex scenarios are possible and likely to be the rule, rather than the exception. In particular, different members of the gene set involved in the expression of a trait may differ in their relative contribution to the trait; they may behave differently relative to their corresponding wild-type allele with some showing complete dominance or recessiveness and others showing varying degrees of partial or semi-dominance. In some instances, a discrete trait may become quantitative upon outcrossing, or it may exhibit a threshold effect where the probability of expression in N2 offspring increases with an increasing number of critical genes from the affected parental strain. The strategy described in the next section for the analysis of polygenic traits is a general one which should be applicable to all of these situations. However, it is almost always true that the greater the genetic complexity, the larger the number of animals that will have to be bred and analyzed to obtain the same degree of genetic resolution.

9.5.2 A choice of breeding strategy and estimation of locus number

Whenever viability and fecundity are not a problem, it is much more efficient to analyze complex genetic traits through a backcross rather than an intercross. This is because each backcross animal will have one of only two genotypes at each locus. In contrast, offspring from an intercross can have one of three genotypes at each locus, which can combine into many more combinations across a set of multiple unlinked loci. Consider the situation where four loci are involved. With the backcross, all offspring will have one of 2^4 = 16 different genotypes, whereas in the intercross, offspring can have one of 3^4 = 81 different genotypes. Furthermore, as described below, the most efficient means to analyze polygenic traits is to collect and genotype only those animals that express the most extreme forms of the phenotype, since these animals are most likely to have the genotype of one parent at all of the involved loci. If three genes are involved, ~12.5% of the N2 animals will have a genotype equivalent to the backcross parent and an equivalent proportion will be identical to the F1 parent. But in offspring from an intercross, only ~1.6% will be expected to have a genotype equivalent to that of each parental strain. Finally, as discussed in section 9.4.3, when marker data are finally obtained, their compilation and analysis is much easier for a backcross than an intercross.
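The counts quoted in this comparison can be checked with a few lines of arithmetic. The Python sketch below assumes unlinked loci with ordinary Mendelian segregation; the function names are illustrative and do not come from the text.

```python
from fractions import Fraction

def genotype_classes(n_loci, cross):
    """Number of distinct multilocus genotypes possible among the offspring."""
    per_locus = {"backcross": 2, "intercross": 3}[cross]
    return per_locus ** n_loci

def parental_fraction(n_loci, cross):
    """Expected fraction of offspring matching one given parent at every locus.

    Backcross: 1/2 per locus (either the backcross parent or the F1 parent).
    Intercross: 1/4 per locus (homozygous for one parental strain's allele).
    """
    per_locus = {"backcross": Fraction(1, 2), "intercross": Fraction(1, 4)}[cross]
    return per_locus ** n_loci

print(genotype_classes(4, "backcross"))            # 16
print(genotype_classes(4, "intercross"))           # 81
print(float(parental_fraction(3, "backcross")))    # 0.125
print(float(parental_fraction(3, "intercross")))   # 0.015625
```

With three loci, a fully parental genotype is therefore expected eight times more often in a backcross than in an intercross.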

Before embarking on a detailed mapping project, it is useful to derive an estimate of the number of segregating genes involved in the expression of the trait under analysis. In complex cases of inheritance, the derivation of such an estimate will not be possible. However, an estimate can be made in two simple situations. The first is that of a discrete phenotype whose expression shows an absolute requirement for alleles at multiple unlinked loci from the affected parent. With a sufficient number of backcross animals, an estimation of gene number in this situation is trivial because the expression of the variant phenotype is absolutely correlated with the presence of a parental strain genotype at all involved loci. The probability of this occurrence is (0.5)^n, where n is the total number of loci required for expression. Thus, if the observed proportion of affected animals is ~25%, this would imply the action of two required genes; at ~12.5%, the prediction would be three genes; at ~6.25%, the prediction would be four genes; and so on. With these numbers, it is easy to see that each additional locus will require a doubling in the total number of backcross animals that must be phenotyped to obtain the same number of affected animals for genotyping.
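The relationship between the affected proportion and the number of required loci can be inverted directly. The following Python sketch assumes the idealized case described above (unlinked loci, full penetrance, an absolute requirement at every locus); both function names are hypothetical.

```python
import math

def loci_from_affected_fraction(p_affected):
    """Solve p = (0.5)**n for n, i.e. n = log2(1 / p)."""
    if not 0 < p_affected <= 1:
        raise ValueError("p_affected must lie in (0, 1]")
    return math.log2(1 / p_affected)

def backcross_animals_needed(n_loci, affected_wanted):
    """Expected number of N2 animals to phenotype to recover the wanted
    number of affected animals, given that a fraction (0.5)**n is affected."""
    return affected_wanted * 2 ** n_loci

for p in (0.25, 0.125, 0.0625):
    print(p, loci_from_affected_fraction(p))

# Each extra locus doubles the phenotyping effort for the same yield.
print([backcross_animals_needed(n, 20) for n in (2, 3, 4)])
```

In practice an observed proportion will rarely be an exact power of one half, so the result is better read as a rough guide than as an integer count.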

In the case of quantitative traits, it is also possible to estimate gene number if one makes the simplifying assumption that all involved genes are unlinked and active in a strictly semidominant manner with an equivalent contribution to the phenotype. In this situation, one can use a modified form of a formula derived by Wright (1952) for an intercross analysis and known as "Wright’s polygene estimate":

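$$n = \frac{(m_{P2} - m_{F1})^2}{4\,(V_{N2} - V_{F1})} = \frac{(m_{P2} - m_{N2})^2}{V_{N2} - V_{F1}} \tag{9.10}$$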

where mP2, mN2 and mF1 are the mean values of expression of the backcross parent, the N2 population and the F1 hybrid respectively, and VN2 and VF1 are the computed variances for the N2 and F1 populations respectively. The two forms of the equation shown here are mathematically equivalent so long as the mean value of the N2 population is halfway between the means of the F1 and the backcross parent. One can see the logic behind this equation by considering the probability that a backcross animal will show an extreme phenotype associated with one of its parents. From figure 9.19, one can see the proportion of genotypes equivalent to either parent drop by a factor of two with each successive increase in locus number from one to two to three. As a consequence, the variance in the complete N2 generation (shown in the right panels of figure 9.19) will also drop as values tend to cluster more around the mean. As the N2 variance goes down, the denominator of equation 9.10 will decrease as well. It is important to realize that equation 9.10 will only provide a very rough, minimum estimate of locus number because it is unlikely that all of the assumptions that went into the use of the equation will hold true in a real life situation.
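As a rough illustration, the estimate can be computed from summary statistics. This Python sketch assumes the backcross form n = (mP2 - mF1)^2 / [4 (VN2 - VF1)], which follows from the stated assumptions (n unlinked, equal, strictly semidominant loci) and agrees with the definitions given above; the function names and the numbers in the example are hypothetical.

```python
def wright_backcross_estimate(m_p2, m_f1, v_n2, v_f1):
    """Locus-number estimate from the parental and F1 means.

    Assumes n unlinked loci of equal, strictly semidominant effect, so that
    the excess N2 variance (v_n2 - v_f1) is purely genetic.
    """
    genetic_variance = v_n2 - v_f1
    if genetic_variance <= 0:
        raise ValueError("N2 variance must exceed F1 variance")
    return (m_p2 - m_f1) ** 2 / (4 * genetic_variance)

def wright_backcross_estimate_from_n2(m_p2, m_n2, v_n2, v_f1):
    """Alternative form using the N2 mean; equal to the form above only when
    the N2 mean lies halfway between the F1 and P2 means."""
    genetic_variance = v_n2 - v_f1
    if genetic_variance <= 0:
        raise ValueError("N2 variance must exceed F1 variance")
    return (m_p2 - m_n2) ** 2 / genetic_variance

# Hypothetical summary statistics, for illustration only.
print(wright_backcross_estimate(m_p2=20.0, m_f1=10.0, v_n2=14.0, v_f1=4.0))
print(wright_backcross_estimate_from_n2(m_p2=20.0, m_n2=15.0, v_n2=14.0, v_f1=4.0))
```

Because real loci are rarely equal in effect, the value returned is best treated as a lower bound rather than as a count.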

9.5.3 Choices involved in setting up crosses

The first step in polygenic analysis is the same as the first step in mapping a single phenotypically defined locus: the choice of two parental strains (section 9.4.2). Unlike the situation with single locus studies, the two parental strains to be chosen for polygenic analysis must be inbred; if not, unexpected and uninterpretable genetic complications could arise. The most important consideration in the choice of parental strains is that they should show the greatest difference possible in the expression of the trait under analysis. Other considerations are the same as those discussed in section 9.4.2 with the caveat that an investigator may want to avoid interspecific, and perhaps intersubspecific, crosses because of the possibility that "abnormal" admixtures of alleles may not function together as they would in a normal offspring from either breeding group.

Upon choosing two inbred parental strains (called P1 and P2 in the following discussion), one should perform a cross to obtain F1 hybrid offspring. But before proceeding to a second generation cross, it is critical to determine the expression of the trait of interest in the F1 population. Figure 9.18 shows different examples of the potential results that might be obtained. If the pairs of alleles present at all "polygenetic" loci that distinguish P1 from P2 act in a strictly semidominant manner, the F1 population will show a mean level of expression halfway between the means of the two parental strains (example 3 in the figure). On the other hand, the F1 population may show a distribution that is indistinguishable from one parental strain or the other if there are strong dominant effects (example 1 in the figure). Finally, a likely result is the complex one with unequal allele strengths — but not strict dominance — that lead to a distribution differing from both parental strains, but with a mean value that is closer to one than the other (example 2 in the figure). In fact, the F1 distribution can have a mean value that lies anywhere along the continuum between the two parental means. However, in all cases, the standard deviation around this mean value should be similar to that found with the parental strains, since the F1 population is always genetically homogeneous.

If the mean expression of the F1 population lies essentially halfway between that found with the two parental strains, then the backcross can be performed with either parent. Other criteria, such as reproductive performance, should be the deciding factors (chapter 4 and Table 4.1). However, if the mean F1 expression is closer to one parent (such as P1 in examples 1 and 2 shown in figure 9.18), one should backcross F1 animals to the opposite parent (P2 in the example). As one can see from figure 9.18, this choice will serve to minimize the degree of phenotypic overlap between the two "parents of the backcross" and will allow a more accurate identification of N2 animals with genotypes that match one parent or the other as discussed below.

It has been customary in mouse genetic studies to perform a backcross between an F1 female and a male from the chosen parental strain. The main advantage to backcrossing in this direction is the higher fecundity of F1 females that results from "hybrid vigor". However, as discussed in sections 9.4.1 and 9.4.4.2, it may sometimes be more advantageous to cross the F1 male with an inbred female.

9.5.4 An optimal strategy for mapping polygenic loci

9.5.4.1 As the number of loci increases, interpreting results becomes more difficult

Once backcross (N2) progeny are obtained, they can be analyzed for expression of the trait of interest with the same protocol used to measure expression in the F1 and progenitor strain (P1 and P2) populations. When a sufficient number of N2 animals have been tested, the distribution of expression levels can be graphed out and compared to the distributions obtained with the F1 and P2 populations. The right hand side of figure 9.19 shows examples of the idealized distributions that one would obtain upon analysis of a trait whose expression is determined through the additive effects of semidominant alleles at one, two, or three loci that contribute equally to expression levels.

Consider the trivial case in which a trait that was thought to be polygenic is actually controlled primarily by a single locus, A, having two semidominant alleles A1 and A2. There are only two potential genotypes in the N2 population obtained from backcrossing to parent P2: A1/A2 and A2/A2. Thus, the complete distribution shown in the upper right hand panel of figure 9.19 can be broken down into the two separate distributions associated with each of these genotypes, as shown in the upper left hand panel. This analysis shows that the telltale sign of involvement of only a single major locus is a biphasic distribution with peaks similar to those of the parents and a paucity of animals in-between.

Next, consider the simplest case of polygenic inheritance with alleles at two major loci, A and B, that both have additive semidominant effects on expression. In this case, the number of relevant N2 genotypes doubles from two to four: A1/A2 B1/B2; A1/A2 B2/B2; A2/A2 B1/B2; and A2/A2 B2/B2. If one assumes that the two genotypes containing one heterozygous locus and one homozygous locus affect expression equally, the idealized distribution pattern shown in the middle right hand panel of figure 9.19 would be obtained. This idealized pattern can be broken down into the three subdistributions that correspond to the different genotypic classes as shown in the middle left hand panel; the intermediate subdistribution is twice as high as the side distributions because of the contribution of two genotypes rather than one.

In some experimental cases, when there is a sufficient distance between the mean values of expression of the two parents, it may be possible to actually obtain a triphasic distribution pattern with a shape and peak distribution similar to that shown in the middle right hand panel of figure 9.19. A result of this type would be a sign that only two major additive loci were involved in the expression of the trait.

In most experimental situations, the distribution patterns obtained for the expression of a complex trait of interest in an N2 population are unlikely to show significant evidence of multiple phases and multiple peaks. Rather, the most likely distribution will be an undifferentiated continuum that extends across the range between and beyond the mean values of expression observed for the F1 and P2 parental populations. There are several factors that are likely to contribute to this tendency toward a monophasic distribution. First, with each increment in the number of loci having an effect on expression, there will be a doubling in the number of different genotypes that are possible in the N2 population. With just three loci, the number of genotypes will be eight. If alleles at all three loci show additive semidominant effects, a distribution of the form shown in the bottom right hand panel of figure 9.19 will be obtained. This nearly monophasic distribution results from the combination of only four subdistributions that correspond to separate genotypic classes. As the number of genes involved grows beyond three, the possibility of seeing multiple distribution peaks that correspond to different genotypic classes is essentially nil.
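The relative sizes of the genotypic classes in these idealized cases follow a binomial pattern. The Python sketch below assumes unlinked loci with equal additive effects, so that N2 animals fall into classes defined only by how many loci are heterozygous; the function name is illustrative.

```python
from math import comb

def n2_class_weights(n_loci):
    """Expected fraction of N2 animals with k heterozygous loci, k = 0..n_loci.

    Class 0 matches the backcross (P2) parent at every locus; class n_loci
    matches the F1 parent at every locus.
    """
    total_genotypes = 2 ** n_loci
    return [comb(n_loci, k) / total_genotypes for k in range(n_loci + 1)]

for n in (1, 2, 3):
    print(n, n2_class_weights(n))
```

For two loci the middle class is twice the size of each side class, and for three loci the two fully parental classes together account for only one quarter of the animals, which is why distinct peaks become hard to resolve.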

9.5.4.2 Selective genotyping

For the purposes of genetic analysis, the most critical feature of polygenic, quantitative trait inheritance is the impossibility of correlating intermediate levels of phenotypic expression with particular genotypes at each of the segregating loci involved. This problem is clearly visible even in the idealized distributions of the three locus trait shown in the bottom panels of figure 9.19. In this simple example, an N2 phenotype halfway between the means of the F1 and P2 parents could be caused by heterozygosity at any one or two of the three loci involved; thus, this halfway phenotype is almost useless in terms of providing marker linkage information. However, there will always be one or two portions of each N2 distribution that will have a high level of predictability for genotype at linked markers — the tails at one or both ends.

An N2 animal that shows an extreme level of phenotypic expression that is, in fact, within the normal range observed for one parental strain (either the F1 or P2) is likely to have the genotype of that parent at all of the segregating loci that distinguish the F1 and P2 parents. This means that a set of animals with the same extreme phenotype at one end of the N2 distribution will be likely to show a significant level of concordance with the same parental genotype at all markers that are closely linked to any one of the segregating trait loci. For example, imagine that one has chosen a subset of 20 N2 animals that most resemble the P2 parental strain in the expression of a trait and each animal within this set is typed for DNA markers that span the genome. A marker that is closely linked to any one of the trait loci will appear homozygous for the P2 allele in a significant majority of the animals of this subset. If possible, a second subset of animals could be collected that most resemble the F1 population; markers linked to the trait loci would appear heterozygous for the P1 and P2 alleles in a significant majority of the animals of this subset.
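The enrichment of parental genotypes in a phenotypic tail can be illustrated with a small simulation. This Python sketch is a toy model under stated assumptions: three unlinked trait loci of equal additive effect, normally distributed environmental noise, and markers that coincide exactly with the trait loci (no recombination). All parameter values and names are hypothetical.

```python
import random

def tail_concordance(n_animals=400, n_loci=3, tail_fraction=0.10,
                     noise_sd=0.5, seed=1):
    """Fraction of animals in the P2-like tail that are homozygous for the
    P2 allele at each trait locus."""
    rng = random.Random(seed)
    animals = []
    for _ in range(n_animals):
        # 1 = heterozygous (F1-like), 0 = homozygous for the P2 allele
        genotype = [rng.randint(0, 1) for _ in range(n_loci)]
        # Each heterozygous locus adds one unit; noise blurs the classes.
        phenotype = sum(genotype) + rng.gauss(0, noise_sd)
        animals.append((phenotype, genotype))
    animals.sort(key=lambda animal: animal[0])
    tail_size = max(1, int(n_animals * tail_fraction))
    p2_like_tail = animals[:tail_size]
    return [
        sum(1 for _, g in p2_like_tail if g[locus] == 0) / tail_size
        for locus in range(n_loci)
    ]

print(tail_concordance())
```

In the unselected population each locus would be homozygous in about half of the animals, so concordance well above one half within the tail is the signal that selective sampling is meant to expose.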

The strategy just described, known as "selective genotyping", provides the most efficient means for mapping polygenic loci (Soller, 1991). Phenotypic analysis is performed on the complete set of backcross animals, which should typically number in the hundreds. This analysis allows the investigator to identify one or two smaller subsets of N2 animals with the greatest amount of genotypic information content. DNA typing is performed only on these smaller subsets, each of which can be pooled into a single composite sample as described in section 9.4.4.4.

How does one decide what proportion of N2 animals to include in each extreme phenotypic subset when a continuum of expression levels is observed for the whole population? The answer is not simple. If one is too stringent, there may be too few animals to type and the power of the linkage test will suffer accordingly (figure 9.13). But, if one is too lax, animals without a parental genotype at each critical locus will be included at a higher frequency (figure 9.19). This will cause the level of discordance with truly linked markers to increase beyond the actual recombination fraction to a point that may fall beyond the level of significance shown in figure 9.13. There will obviously be an optimal cutoff point, but it will be impossible to ascertain its position in advance without knowing how many segregating loci have a major effect on expression. As illustrated in figure 9.19, as the number of loci grows, so does the phenotypic overlap between each completely parental genotypic class (indicated with dark lines) and its adjacent mixed genotypic class (indicated with lighter lines).

In a first round of analysis without prior information, a reasonable fraction of backcross animals to include within each extreme subset would be 10% (Soller, 1991). Since it is important to have at least 20 individual samples within each composite sample for DNA pooling, this would entail the initial phenotypic analysis of at least 200 backcross animals. With a sample size that is this small, the swept radius is quite modest (see figure 9.13) and a large number of markers will be required to span the whole genome. If it is possible to pool together 30 or 40 samples, this will greatly improve the sweep of individual markers. Alternatively, if the DNA pooling method provides evidence of potential marker linkage, the results obtained upon analysis of individual samples in the two extreme classes (if there are two that can be formed) can be combined for greater statistical power.
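The cohort sizes implied by these figures are simple to tabulate. The Python sketch below takes the 10% tail and the pool sizes from the paragraph above; the function name is hypothetical, and the tail is passed as a whole-number percentage to keep the arithmetic exact.

```python
import math

def n2_animals_required(pool_size, tail_percent=10):
    """Minimum number of N2 animals to phenotype so that one extreme tail of
    the given percentage contains at least `pool_size` animals."""
    if not 0 < tail_percent <= 100:
        raise ValueError("tail_percent must lie in (0, 100]")
    return math.ceil(pool_size * 100 / tail_percent)

for pool in (20, 30, 40):
    print(pool, n2_animals_required(pool))
```

Pooling 30 or 40 samples at the same 10% cutoff therefore means phenotyping 300 or 400 backcross animals.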

The results obtained from the initial analysis of the 10% DNA pools will provide the investigator with a certain amount of information on the experimental direction that is best to follow. For example, if the initial analysis allows the identification of even one marker that shows 100% concordance within an extreme phenotypic class, it is likely that this class does not contain any animals with non-parental genotypes. Thus, it would be worthwhile to expand the extreme class to include a larger sample size to search more efficiently for markers linked to additional loci that affect trait expression. Furthermore, positive results with individual markers that fail to meet the most stringent requirements for significance could still be pursued through the typing of markers that are 10 to 20 cM removed and may be closer to a potential trait locus. If a trait locus is, indeed, present in the vicinity of the original marker, this strategy could yield closer markers that will show higher levels of concordance and significance. Finally, more advanced non-parametric statistical methods, such as the Mann-Whitney U test (available within most statistical software packages for desktop computers), can be used to extract additional information from the available data with a consequent increase in statistical power.
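One possible use of the Mann-Whitney U test here is to ask whether N2 animals that are heterozygous at a marker tend to have higher trait values than animals that are homozygous. The pure-Python sketch below computes only the U statistic, by direct pairwise comparison; the phenotype values are invented for illustration, and significance would still have to be read from tables or from a statistics package.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: the number of (a, b) pairs with a > b,
    counting ties as one half."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u

# Hypothetical trait values grouped by genotype at one marker.
heterozygous = [7.1, 6.4, 8.0, 5.9]
homozygous_p2 = [4.2, 5.1, 3.8, 6.0]

u_het = mann_whitney_u(heterozygous, homozygous_p2)
u_hom = mann_whitney_u(homozygous_p2, heterozygous)
print(u_het, u_hom)  # the two values sum to len(a) * len(b)
```

A U value close to zero or to the product of the two sample sizes indicates that one genotype class ranks consistently above the other. Because the test works on ranks, it makes no assumption that trait values are normally distributed.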