Mouse Genetics: Concepts & Applications (Full Table of Contents)

Copyright ©1995 Lee M. Silver

5. The Mouse Genome

5.1 Quantifying the genome

5.1.1 How large is the genome?

5.1.2 How complex is the genome?

5.1.3 What is the size of the mouse linkage map?

5.1.4 What proportion of the genome is functional?

5.1.5 How many genes are there?

5.2 Chromosomes

5.2.1 The standard karyotype

5.2.2 Robertsonian translocations

5.2.3 Reciprocal translocations

5.3 Genome evolution and gene families

5.3.1 Classification of genomic elements

5.3.2 Forces that shape the genome

5.3.3 Gene families and superfamilies.

5.3.4 Centromeres and satellite DNA

5.4 Repetitive "non-functional" DNA families

5.4.1 Endogenous retroviral element

5.4.2 The LINE-1 family

5.4.3 The major SINE families: B1 and B2

5.4.4 General comments on SINEs and LINEs

5.4.5 Genomic stutters: microsatellites, minisatellites, and macrosatellites

5.5 Genomic imprinting

5.5.1 Overview

5.5.2 Why is there imprinting?

5.5.3 The molecular basis for imprinting.

 

5.1 Quantifying the genome

Even before the discovery of the structure of DNA, it was clear that the fertilized mammalian egg could contain only a finite amount of genetic information, and that this information was all that was needed to define something as complicated as a whole mouse or human being. However, with the demonstration of the double helix and the unraveling of the relationships that exist between basepairs, codons, genes, and polypeptides, it became possible to determine just how finite the total sum of genetic information actually is. But the problem that still looms large is an understanding of the essential genetic information needed to make a mammal. Is it the total amount of DNA in a haploid set of chromosomes, just that portion of DNA that doesn’t include repeated sequence copies, transcription units, coding and regulatory regions, or only those genes required for viability? In some cases, it seems possible to distinguish among what is essential, what is nice to have but not essential, and that which serves no useful function at all. However, in many cases, the distinctions are still not yet clear. This section addresses the quantitation of the genome at various levels of analysis.

5.1.1 How large is the genome?

Quantitative DNA-specific staining can be achieved with the use of the Feulgen reagent. Through microphotometric measurements of the staining intensity in individual sperm nuclei, it is possible to determine the total amount of DNA present in the haploid mouse genome (Laird, 1971). These measurements indicated a total haploid genome content of three picograms, which translates into a molecular weight of 1.8 x 10^12 daltons.

The smallest unit of genetic information is the basepair (bp) which has a molecular weight of ~600 daltons. By dividing this number into the total haploid DNA mass, one arrives at an approximate value for the total information content in the haploid genome: three billion bp, which can also be written as three million kilobasepairs (kb) or 3,000 megabasepairs (mb). All eutherian mammals have genomes of essentially the same size.

It is instructive to consider the size of the mammalian genome in terms of the amount of computer-based memory that it would occupy. Each basepair can have one of only four values (G, C, A, or T) and is thus equivalent to two bits of binary code information (with potential values of 00, 01, 10, 11). Computer information is usually measured in terms of bytes that typically contain 8 bits. Thus, each byte can record the information present in four bp. A simple calculation indicates that a complete haploid genome could be encoded within 750 megabytes of computer storage space. Incredibly, small lightweight storage devices with such a capacity are now available for desktop computers. Of course, the computer capacity required to actually interpret this information will be many orders of magnitude larger.
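The arithmetic in this section can be retraced with a short sketch. The function names are illustrative only, and Avogadro's number is used to convert grams to daltons; the remaining constants (600 daltons per bp, two bits per bp, eight bits per byte) are the ones quoted above.

```python
AVOGADRO = 6.022e23      # daltons per gram
DALTONS_PER_BP = 600     # approximate mass of one basepair, as quoted in the text
BITS_PER_BP = 2          # four possible bases can be encoded in two bits
BITS_PER_BYTE = 8


def haploid_bp(picograms: float) -> float:
    """Convert a haploid DNA mass in picograms into a basepair count."""
    daltons = picograms * 1e-12 * AVOGADRO
    return daltons / DALTONS_PER_BP


def storage_megabytes(basepairs: float) -> float:
    """Storage needed at two bits per basepair, in units of 10^6 bytes."""
    return basepairs * BITS_PER_BP / BITS_PER_BYTE / 1e6


if __name__ == "__main__":
    print(f"{haploid_bp(3.0):.2e} bp in a 3 pg haploid genome")
    print(f"{storage_megabytes(3e9):.0f} megabytes for 3 billion bp")
```

With the rounded inputs used in the text, three picograms come out at roughly three billion bp, and three billion bp occupy 750 megabytes.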

5.1.2 How complex is the genome?

Another method for determining genome size relies upon the kinetics of DNA renaturation as an indication of the total content of different DNA sequences in a sample. When a solution of double stranded DNA is denatured into single strands which are then allowed to renature, the time required for renaturation is directly proportional to the complexity of the DNA in the solution, if all other parameters are held constant. Single-stranded and double-stranded molecules are easily distinguished by various physical, chemical, and enzymatic procedures.

Complexity is a measure of the information contained within the DNA. The maximal information possible in a solution of genomic DNA purified from one animal or tissue culture line is equivalent to the total number of basepairs present in the haploid genome. The information content of a DNA solution is independent of the actual amount or concentration of DNA present. DNA obtained from one million cells of a single animal or cell line contains no more information than the DNA present in one cell. Furthermore, if sequences within the haploid genome are duplicates of one another — repeated sequences — the complexity will drop accordingly.

The effect of complexity on the kinetics of renaturation can be understood by viewing the system through the eyes of a single strand of DNA, randomly diffusing through a solution, looking for its complementary partner. For example, imagine two DNA solutions, both 2 mg/ml in concentration, but one from a genome having a complexity of 3 x 10^9 bp, and the second from a genome having a complexity of only 3 x 10^8 bp. In the second solution, with the same quantity of DNA but tenfold less complexity, each segment of DNA sequence will be represented 10 times as often as any particular segment of DNA sequence in the first solution. Thus, a single strand will be able to find its partner 10 times more quickly in the second solution than in the first. The speed with which a DNA sample renatures can be expressed in the form of a Cot curve, which is a graphic representation of the fraction of a sample that has renatured (along the Y axis) as a function of the single-stranded DNA concentration at time zero (C0) multiplied by the time allowed for renaturation (t), shown on the X axis. The C0t value attained at the midpoint of renaturation, when half of the molecules have become double-stranded, is called C0t1/2 and is used as an indicator of the complexity of the sample being measured. Different C0t1/2 values can be compared directly to allow a determination of complexity in a new sample relative to a calibrated control.
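The comparison against a calibrated control can be sketched as follows. The sketch assumes ideal second-order renaturation kinetics, in which the fraction renatured depends only on the ratio of C0t to C0t1/2; that idealization, the function names, and the C0t1/2 values in the example are assumptions made for illustration and are not measurements from the text.

```python
def fraction_renatured(c0t: float, c0t_half: float) -> float:
    """Fraction of the sample that is double-stranded at a given C0t.

    Assumes ideal second-order kinetics, so that exactly half of the
    sample has renatured when c0t equals c0t_half.
    """
    return c0t / (c0t + c0t_half)


def relative_complexity(c0t_half_sample: float,
                        c0t_half_control: float,
                        control_complexity_bp: float) -> float:
    """Scale a known control complexity by the ratio of C0t1/2 values."""
    return control_complexity_bp * c0t_half_sample / c0t_half_control


if __name__ == "__main__":
    # Hypothetical C0t1/2 values: the sample takes ten times longer than
    # a control whose complexity is known to be 3 x 10^8 bp.
    print(fraction_renatured(300.0, 300.0))          # midpoint of the curve
    print(relative_complexity(3000.0, 300.0, 3e8))   # complexity in bp
```

A sample that needs a tenfold larger C0t to reach the midpoint is inferred to have tenfold greater complexity, which mirrors the two-solution example given above.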

Renaturation analysis of mouse DNA reveals an overall complexity of approximately 1.3-1.8 x 10^9 bp. This value is only 40-60% of the size of the complete haploid genome and it implies the existence of a large fraction of repeated sequences. In fact, a careful analysis of the renaturation curve indicates that 5% of the genome renatures almost one million times faster than the bulk of the DNA. This "low complexity class" of sequences represents the satellite DNA which is discussed in detail in section 5.3.4. After renaturation of the satellite DNA class comes a very broad class of repeated sequences (whose copy number varies from several hundred thousand to less than ten) which merges into the final bulk class of "unique" sequences. With the advent of DNA cloning and sequencing, the "repeated sequence" class of mouse DNA has been divided into a number of functionally and structurally distinct subclasses which are also discussed more fully in sections 5.3 and 5.4. It was originally assumed that nearly all of the protein-coding genes would be present in the final renaturation class of unique copy sequences. However, we now know that the situation is not that simple and that many genes are members of gene families that can have anywhere from two to fifty similar, but non-identical, cross-hybridizing members.

5.1.3 What is the size of the mouse linkage map?

The genome size of any sexually-reproducing diploid organism can actually be measured according to two semi-independent parameters. There is the physical size measured in numbers of basepairs, as just discussed, and there is recombinational size measured in terms of the cumulative linkage distances that span each chromosome (discussed fully in section 7.1). The size of the whole mouse linkage map can be arrived at by a number of different approaches. First, one can perform a statistical test on the frequency with which new loci are found to be linked to previously identified loci. In 1954, Carter used this test on then-available data for 43 loci to estimate the size of the complete mouse linkage map at 1620 ± 352 centimorgans or cM (Carter, 1954).

A second estimate is based on counting the number of chiasmata that appear in spreads of chromosomes prepared from germ cells undergoing meiosis and viewed under the microscope. A chiasma (the singular of chiasmata) represents the cytological manifestation of crossing over; it is seen as a visible connection between non-sister chromatids at each site where a crossover event has occurred between the maternally- and paternally-derived chromosomes of the animal that provided the sample. Chiasma formation occurs after the final round of DNA replication when each of the two homologs contains two identical sister chromatids — the genomic content of cells at this stage is represented by the notation "4N." Each crossover event involves only two of the four chromatids present. Thus, there is only a 50% chance that any one crossover event will be segregated to any one haploid (1N) gamete and so the total number of crossovers segregated into any one gamete genome will be approximately half the number of chiasmata present within 4N meiotic cells. Thus, one can derive an estimate of total linkage distance by multiplying the average number of chiasmata observed per meiotic cell by the expected inter-chiasmatic distance (100 cM) and dividing by 2. This analysis provided the basis for a whole mouse genome linkage size of 1954 cM (Slizynski, 1954).
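The chiasma-based calculation reduces to a single line of arithmetic, sketched here. The function names are illustrative, and the chiasma count in the example is back-calculated from the quoted map size of 1954 cM; it is not a count reported in the text.

```python
def linkage_size_from_chiasmata(mean_chiasmata_per_cell: float,
                                cm_between_chiasmata: float = 100.0) -> float:
    """Total map length in cM: chiasmata x inter-chiasma distance / 2.

    The division by two reflects the fact that each crossover involves
    only two of the four chromatids present in a 4N meiotic cell.
    """
    return mean_chiasmata_per_cell * cm_between_chiasmata / 2.0


def implied_chiasmata(map_size_cm: float,
                      cm_between_chiasmata: float = 100.0) -> float:
    """Invert the formula to obtain the chiasma count implied by a map size."""
    return map_size_cm * 2.0 / cm_between_chiasmata


if __name__ == "__main__":
    n = implied_chiasmata(1954.0)
    print(f"implied mean chiasmata per cell: {n:.2f}")
    print(f"map size recovered: {linkage_size_from_chiasmata(n):.0f} cM")
```

Put another way, each chiasma observed in a meiotic cell contributes 50 cM to the map of a single gamete.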

With the generation of high density whole genome linkage maps based on the segregation of hundreds of loci, it is now possible to determine map size directly from the distance spanned by the set of mapped loci. In two cases, this calculation was performed for data generated within the context of single crosses: the resulting map sizes were 1424 cM for a B6 X M. spretus intercross-backcross (Copeland and Jenkins, 1991) and 1447 cM for an F2 intercross between B6 and M. m. castaneus (Dietrich et al., 1992). Two other direct estimates of 1468 cM and 1476 cM are based on whole genome consensus maps formed by the incorporation of data from large numbers of different crosses that used overlapping sets of markers for mapping (Hillyard et al., 1992; Lyon and Kirby, 1992).

All of these estimates are remarkably consistent with each other and yield a simple average value of 1453 cM. This consistency is remarkable because, in isolated regions of the genome, linkage distances are highly strain-dependent with differences that vary by as much as a factor of two (see section 7.2.3). Nevertheless, the accumulated data suggest that the overall level of recombination is pre-determined in the Mus genus and maintained from one cross to another through compensatory changes so that suppression of recombination in one region will be offset by an increase in recombination in another region of the genome.

One can derive an average equivalence value between the two metrics of genome measurement described in this section — kilobases and centimorgans — of approximately 2,000 kb per cM. As mentioned above and discussed in section 7.2.3.3, the actual relationship between linkage distance and physical distance can vary greatly in different parts of the genome as well as in crosses between different strains of mice.
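The averaging and the kilobase-per-centimorgan conversion can be checked with a few lines. The four map sizes are the direct estimates quoted above and the genome size is the 3,000 mb figure from section 5.1.1; the variable names are illustrative.

```python
GENOME_KB = 3_000_000  # 3,000 mb haploid genome, expressed in kb

# Direct map-size estimates in cM, in the order they are quoted in the text:
# two single-cross maps followed by two consensus maps.
direct_map_sizes_cm = [1424, 1447, 1468, 1476]

mean_map_cm = sum(direct_map_sizes_cm) / len(direct_map_sizes_cm)
kb_per_cm = GENOME_KB / mean_map_cm

if __name__ == "__main__":
    print(f"mean map size: {mean_map_cm:.2f} cM")
    print(f"average equivalence: {kb_per_cm:.0f} kb per cM")
```

The mean of the four values comes to a little under 1454 cM, in line with the figure of roughly 1453 cM given in the text, and the resulting equivalence is slightly above 2,000 kb per cM. This is a genome-wide average only; as noted, local values can differ considerably.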

5.1.4 What proportion of the genome is functional?

Bacterial species are remarkably efficient at packing the most genetic information into the smallest possible space. In one analysis of a completely sequenced 100 kb region of the E. coli chromosome, it was found that 84% of the total DNA content was actually used to encode polypeptides (Daniels et al., 1992). Most of the remaining DNA is used for regulatory purposes, and only 2% was found to have no recognizable function.

In higher eukaryotes of all types, the situation has long been known to be quite different. The early finding that some primitive organisms had haploid genome sizes which were many-fold larger than that of mammals led to the realization that large portions of higher eukaryotic genomes might be "non-functional". However, to answer the question posed in the title to this section, one must first define what is meant by functional. Are entire transcription units considered functional even though, in most cases, 80% or more of the transcript will be spliced away before translation begins? Are both copies of a perfectly duplicated gene considered functional even though the organism could function just as well without one? What about the twilight class of pseudogenes which, in some cases, may be functional in some individuals but not others, and may serve as a reservoir for the emergence of new genetic elements in a future generation? Finally, comparative sequence analysis over long regions of the mouse and human genomes shows evolutionary conservation over stretches of sequence that do not have coding potential or any obvious function (Hood, 1992). However, sequences can only be conserved when selective forces act to maintain their integrity for the benefit of the organism. Thus, conservation implies functionality, even though we may be too ignorant at the present time to understand exactly what that functionality might be in this case.

Taking all of these caveats into consideration, and defining functional sequences as those with coding potential or with potential roles in gene regulation or chromatin structure, one can come up with a broad answer to the question posed in this section based on a synthesis of the data described in the next section. The fraction of the mouse genome that is functional is likely to lie somewhere between 5% and 10% of the total DNA present.

5.1.5 How many genes are there?

5.1.5.1 Gene density estimates

How many genes are in the genome? A truly accurate answer to this question will be a long time in coming. The complete sequence of the genome will almost certainly provide a means for uncovering most genes; however, an unknown percentage will probably still remain hidden from view. But, in the absence of a complete sequence, one is forced to make multiple assumptions in order to come up with just a broad estimate of the final number.

One approach to estimating gene number is to derive an average gene size and then determine how many average genes can fit into a 3,000 megabase space. Unfortunately, the sizes of the genes characterized to date do not form a nice discrete bell curve around some mean value. The first mammalian gene to be characterized, Hbb, encodes the beta-globin polypeptide; the Hbb gene has a length on the order of two kilobases. The alpha globin gene is even smaller with a length of less than one kilobase. At the opposite extreme is the mouse homolog of the human Duchenne’s muscular dystrophy gene (called mdx in the mouse); at 2,000 kb, the size of mdx is three orders of magnitude larger than Hbb. Nevertheless, a survey spanning all of the hundreds of mammalian genes characterized to date would seem to suggest that mdx is an extreme example, with most genes falling into the range of 10 to 80 kb, with a median size in the range of 20 kb. This estimate must be considered highly qualified since size could play a role in determining which genes have been cloned and characterized.

Interestingly, a similar estimate of median gene size is obtained by viewing complete cellular polypeptide patterns on two-dimensional gels where the highest density of proteins appears in the 50-70,000 Mr window for all cell types. A "typical" polypeptide in this range will have an amino acid length of ~600, encoded within 1800 nucleotides, that will "typically" be flanked by another 200 nucleotides of untranslated regions on the 3’ and 5’ ends of a two kilobase mRNA that has been "typically" spliced down from an original 20 kilobase transcript that included 18 kb of intronic sequences. If one assumes an average inter-gene distance of 10 kb — including gene regulatory regions and various non-essential repetitive elements to be discussed later — one obtains an average density of one gene per 30 kb. When this number is divided into the whole 3,000 mb genome, one derives a total gene number of 100,000.

Actual validation of a gene density in the range just estimated has been obtained for the major histocompatibility (MHC) region of the human and mouse genomes. With intensive searches for all transcribed sequences present within portions of the four megabase MHC region, a gene density of one per 20 kb has been found (Milner and Campbell, 1992). Direct extrapolation of this gene density to the whole genome would yield a total of 130,000 genes. However, such an extrapolation is probably not valid since the average gene size in the MHC appears to be significantly smaller than the average overall. Another problem is that a significant proportion of the "genes" in the MHC (and elsewhere as well) are non-functioning pseudogenes. If one assumes a pseudogene rate of 20%, the value of 130,000 is reduced down to near 100,000.

A serious problem with all estimates made from the extrapolation of "average" genes is that genomic regions containing smaller, more densely packed genes will contribute disproportionately to the total number. As an example only, if 1% of the genome was occupied by (mostly uncharacterized) short genes that were only 500 bp in length and were packed at a density of one per kilobase, this class alone would account for 30,000 genes to be added onto the previous 100,000 estimate.
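The density-based estimates in this section can be laid out as explicit arithmetic. The sketch below uses only figures quoted in the text (a 20 kb "typical" gene, a 10 kb inter-gene distance, the 130,000 extrapolation with a 20% pseudogene rate, and the hypothetical 1% class of short genes); the function and variable names are illustrative.

```python
GENOME_KB = 3_000_000  # 3,000 mb haploid genome, expressed in kb


def genes_from_density(kb_per_gene: float, genome_kb: float = GENOME_KB) -> float:
    """Number of genes that fit in the genome at one gene per kb_per_gene."""
    return genome_kb / kb_per_gene


# A "typical" 20 kb transcription unit plus 10 kb of inter-gene sequence.
typical_estimate = genes_from_density(20 + 10)

# The MHC-based extrapolation quoted in the text, less 20% pseudogenes.
mhc_adjusted = 130_000 * (1 - 0.20)

# Hypothetical class of short genes: 1% of the genome at one gene per kb.
short_gene_class = genes_from_density(1, genome_kb=0.01 * GENOME_KB)

if __name__ == "__main__":
    print(f"average-gene estimate:  {typical_estimate:,.0f}")
    print(f"MHC estimate, adjusted: {mhc_adjusted:,.0f}")
    print(f"short-gene class alone: {short_gene_class:,.0f}")
```

The last figure shows how sensitive the total is to the density assumption: a class of genes occupying only 1% of the genome adds nearly a third again to the 100,000 estimate.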

5.1.5.2 Number of transcript estimates

A very different approach to placing boundaries on the total gene number is to estimate the number of different transcripts produced in various cell types. Estimates of this type can be made, in a manner similar to that described for Cot studies, by analyzing the kinetics of mRNA-cDNA renaturation for a determination of the "complexity" of transcript populations in single cell types or tissues. This approach allowed Hastie and Bishop (1976) to estimate the presence of 12,000 different transcripts (representing the products of 12,000 different genes) in each of the three tissues analyzed: liver, kidney, and brain. However, as these investigators indicate, the brain in particular is a very complex tissue with millions of cells that are likely to have different patterns of gene expression. Genes expressed at low levels in a small percentage of cells will go undetected in a broad-tissue analysis of the type described. Thus, the actual level of gene expression in the brain, and other complex "tissues" like the developing embryo, could be much greater than the number derived experimentally. A revised complexity estimate of 20-30,000 has been suggested for the brain.

The only way in which estimates of transcript number could provide an estimate of total gene number is if every tissue in the body were analyzed at every developmental stage, and the number of cell- or stage-specific transcripts were determined apart from the number that were expressed elsewhere. A comprehensive analysis of this type is impossible even today, but some simple estimates can be made. For example, by analysis of cross-hybridization between sequences from different tissues, an overlap in expression of 75-85% has been estimated (Hastie and Bishop, 1976). This would suggest that perhaps 3,000 genes may be uniquely expressed in any one tissue relative to any other. However, when the data from many tissues are brought together, the actual number of tissue-specific transcripts is likely to be further reduced. On the other hand, it is also the case that some genes are likely to function in some tissue types only during brief periods of development.

Interpretation of the accumulated data provides a means only for estimating the minimum number of transcribed genes that could be present in the genome. By adding the brain estimate of ~25,000 to ~1,000 unique genes for each of 25 different tissue types, one arrives at a minimum estimate of 50,000.
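The reasoning behind the minimum estimate can be written out step by step. All inputs are the round numbers used in the text; the variable names are illustrative.

```python
TRANSCRIPTS_PER_TISSUE = 12_000      # renaturation-based estimate per tissue
OVERLAP_RANGE = (0.75, 0.85)         # estimated overlap between two tissues

# Transcripts unique to one tissue relative to one other tissue.
unique_high = TRANSCRIPTS_PER_TISSUE * (1 - OVERLAP_RANGE[0])
unique_low = TRANSCRIPTS_PER_TISSUE * (1 - OVERLAP_RANGE[1])

# Minimum total: the revised brain estimate plus a reduced allowance of
# ~1,000 unique genes for each of 25 tissue types.
BRAIN_ESTIMATE = 25_000
UNIQUE_PER_TISSUE = 1_000
TISSUE_TYPES = 25
minimum_gene_number = BRAIN_ESTIMATE + UNIQUE_PER_TISSUE * TISSUE_TYPES

if __name__ == "__main__":
    print(f"unique per tissue pair: {unique_low:,.0f} to {unique_high:,.0f}")
    print(f"minimum transcribed genes: {minimum_gene_number:,}")
```

The per-tissue allowance of 1,000 is lower than the pairwise figure of up to 3,000 because, as noted above, the number of truly tissue-specific transcripts falls as more tissues are compared.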

5.1.5.3 Vital function estimates

Another independent, and very old, method for estimating gene number is to first saturate a region of known length with mutations that cause homozygous lethality, then count the number of lethal complementation groups and extrapolate from this number to the whole genome. The assumption one makes with an approach of this type is that the vast majority of genes in the genome will be essential to the viability of the organism. If one eliminates the expression of a "vital" gene through mutagenesis, the outcome will be a clear homozygous phenotype of pre- or postnatal lethality.

It has long been clear that the early assumption of vitality associated with most genes is incorrect. Even genes thought to play critical roles in cell cycling and growth, such as p53, can be "knocked-out" and still allow the birth of normal-looking viable animals (Donehower et al., 1992). Whole genomic regions of 550 kb in length can be eliminated with a resulting phenotype not more severe than short ears and subtle changes in skeletal structures (Kingsley et al., 1992). Observations of this type can be interpreted in two ways. First, there is likely to be redundancy in various genetic pathways so that if the absence of one gene prevents one pathway from being followed, another series of unrelated genes may provide compensation through the use of an alternative pathway. Second, genes need not be vital to be maintained in the genome. If a gene provides even the slightest selective advantage to an animal, it will be maintained throughout evolution.

Although vital genes will represent only a subset of the total in the genome, it is still of interest to determine the size of this particular subset. In two saturation mutagenesis experiments in regions from different chromosomes, estimates of 5,000-10,000 vital genes were derived (Shedlovsky et al., 1988; Rinchik et al., 1990b). This range of values is likely to represent only 5-10% of the total functional units in the mouse genome. However, it is interesting that the number of vital genes in mice is not very different from the number of vital genes in Drosophila melanogaster, where the genome size is an order of magnitude smaller. This suggests that genes added on to the genome later in evolutionary time are less likely to be vital to the organism and more likely to help the organism in more subtle ways.

5.1.5.4 Overview

From the discussion presented in this section, it seems fair to say, with a high level of confidence, that the actual number of genes in the mammalian genome will be somewhere between 50,000 and 150,000. As of 1992, fewer than 10% of these genes have been characterized at any level from DNA to phenotype, and many fewer still are fully understood in terms of their effect on the organism and their interaction with other genes. Although the efforts to clone and sequence the entire human and mouse genomes will provide an entry point into many more genes, an understanding of the relationship between genotype and phenotype, in nearly every case, will still require much more work with the organism itself. Thus, the need to breed mice is likely to remain strong for many years to come.

5.2 Chromosomes

5.2.1 The standard karyotype

5.2.1.1 Chromosome number and banding patterns

All of the Mus musculus subspecies (domesticus, musculus, castaneus, and bactrianus) as well as the closely related species M. spretus, M. spicilegus and M. macedonicus have the same "standard karyotype" with 20 pairs of chromosomes, including 19 autosomal pairs and the X and Y sex chromosomes as shown in figure 5.1. The correct chromosome number was first established in 1928 (Painter, 1928). Surprisingly, all of the 19 autosomes as well as the X chromosome appear to be telocentric, with a centromere at one end and a telomere at the other. The biological explanation for this uniformity in chromosome morphology is entirely unknown; however, it makes the task of individual chromosome identification much more difficult than it is with human karyotypes. Nevertheless, trained individuals can distinguish chromosomes on the basis of reproducible banding patterns that are accentuated with the use of various staining protocols. The most common of these includes a mild trypsin treatment followed by staining with the dye Giemsa to produce dark Giemsa-stained bands (called G bands) that alternate with Giemsa-negative bands (called R bands, for reverse G bands). A variety of other staining protocols have been developed (called R, Q, and T banding) that are all based on the same principle of chromatin denaturation and/or mild enzymatic digestion followed by staining with a DNA-binding dye (Craig and Bickmore, 1993). In general, all of these different protocols produce the same pattern of bands and interbands observed with Giemsa staining, although in some cases, the dark and light regions are reversed.

The reproducibility of the alternating pattern of G and R bands observed with many different staining protocols implies an underlying difference in the structure of chromatin which, in turn, suggests an underlying heterogeneity in the long-range structure of the genome. In fact, numerous differences have been found in the DNA associated with the two types of bands. G band DNA condenses early, replicates late and is relatively A:T-rich; in contrast, R band DNA condenses late, replicates early and is relatively G:C-rich (Bickmore and Sumner, 1989). All housekeeping genes are located in R bands, while tissue-specific genes can be located in both G and R bands. Each band type is also associated with a different class of dispersed repetitive DNA elements: G bands contain LINE-1 elements whereas R bands contain SINE elements (see section 5.4 for a detailed discussion of these elements).

With all of these contrasting properties, it becomes an interesting problem to distinguish between cause and effect in the generation of the two major types of chromosomal domains. In other words, is there a particular DNA element that defines the G or R bands and somehow contributes to the preferential association — or disassociation — of all other DNA elements that contribute to the characteristics of the band type? Further research will be necessary to unravel this problem.

5.2.1.2 Idiograms and band names

As a mechanism for facilitating data presentation and for comparing results obtained by different investigators, the light and dark bands observed in a raw karyotype are usually converted into idiograms, which are black and white drawings of idealized chromosomes as shown in figure 5.2. Autosomes are numbered from 1 to 19, in descending order of length. Major bands (alternating dark and light regions) within each autosome are designated with a capital letter starting from A at the centromere, and ascending in alphabetical order. With an increase in resolution, most major bands can be resolved into a series of smaller bands, which are numbered sequentially from 1 starting at the proximal — or centromeric — side of the major band and ending at the distal — or telomeric — side. Finally, when increased resolution allows the visualization of multiple minor bands within a single previously-defined sub-band, these are designated with a number (in sequence from 1) demarcated with a decimal point. As an example of the use of this nomenclature, the designation 17E1.3 represents (in reverse order), the third minor band within the first sub-band within the fifth major band (all in order from the centromere) on the mouse chromosome ranked seventeenth in size (illustrated in Figure 7.1).
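The band nomenclature is regular enough to be parsed mechanically, as the following sketch illustrates. The `Band` type, the `parse_band` function, and the regular expression are hypothetical helpers written for this example; accepting X and Y as chromosome names is an assumption, since the rules above are stated for autosomes.

```python
import re
from typing import NamedTuple, Optional


class Band(NamedTuple):
    chromosome: str            # "1" to "19", or "X"/"Y" (assumed here)
    major_band: str            # capital letter, A nearest the centromere
    major_index: int           # 1 for A, 2 for B, and so on
    sub_band: Optional[int]    # numbered from the centromeric side
    minor_band: Optional[int]  # follows the decimal point


# Chromosome, major band letter, optional sub-band, optional ".minor".
_BAND_RE = re.compile(r"^(\d{1,2}|X|Y)([A-Z])(?:(\d+)(?:\.(\d+))?)?$")


def parse_band(designation: str) -> Band:
    """Split a designation such as '17E1.3' into its components."""
    match = _BAND_RE.match(designation)
    if match is None:
        raise ValueError(f"not a recognized band designation: {designation!r}")
    chrom, major, sub, minor = match.groups()
    return Band(
        chromosome=chrom,
        major_band=major,
        major_index=ord(major) - ord("A") + 1,
        sub_band=int(sub) if sub else None,
        minor_band=int(minor) if minor else None,
    )


if __name__ == "__main__":
    print(parse_band("17E1.3"))
```

For the worked example in the text, the parser reports chromosome 17, major band E (the fifth from the centromere), sub-band 1, and minor band 3.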

5.2.1.3 Chromosome length and DNA content

The amount of DNA present in each chromosome can be estimated by measuring its cytological length relative to the sum of the lengths of all 20 chromosomes and multiplying this fraction by the total genome length of 3,000 mb (Evans, 1989). From these measurements, one finds that the largest chromosome [1] has a DNA length of approximately 216 mb and the smallest chromosome [19] has a DNA length of 81 mb, with all others following in a near-continuum between these two values (see Table 9.4 for estimates of the centimorgan lengths of individual chromosomes).
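The estimate amounts to a simple proportion, sketched below with the total genome taken as 3,000 mb (three billion bp, as in section 5.1.1). The relative lengths in the example are back-calculated from the quoted DNA contents of chromosomes 1 and 19; they are not cytological measurements taken from the text, and the function names are illustrative.

```python
GENOME_MB = 3000.0  # total haploid genome, in megabases


def chromosome_mb(relative_length: float, genome_mb: float = GENOME_MB) -> float:
    """DNA content of a chromosome from its fraction of total karyotype length."""
    return relative_length * genome_mb


def relative_length(dna_mb: float, genome_mb: float = GENOME_MB) -> float:
    """Fraction of total karyotype length implied by a DNA content."""
    return dna_mb / genome_mb


if __name__ == "__main__":
    for name, mb in (("chromosome 1", 216.0), ("chromosome 19", 81.0)):
        frac = relative_length(mb)
        print(f"{name}: {frac:.1%} of total length -> {chromosome_mb(frac):.0f} mb")
```

On these figures the largest chromosome accounts for about 7% of the total length and the smallest for just under 3%.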

5.2.2 Robertsonian translocations

5.2.2.1 Presence in natural populations

Since 1967, there have been numerous reports of wild-caught house mice with karyotypes containing fewer than 20 sets of chromosomes. The first report described a karyotype with 13 sets of chromosomes (seven metacentrics and six telocentrics) in mice captured from the "Valle di Poschiavo" in southeastern Switzerland (Gropp et al., 1972). The assumption was made that animals with such a grossly different karyotype could not possibly be members of the M. musculus species, and as a consequence, these Swiss mice were classified as belonging to a separate species named Mus poschiavinus and informally referred to as the "tobacco mouse." In subsequent years, additional populations of animals from the alpine regions of both Switzerland and Italy were found with a variety of non-standard karyotypes having anywhere from one to nine metacentrics. Further studies of wild house mice by other investigators have led to the discovery of additional non-standard karyotypes in house mice from other regions of Europe as well as South America and Northern Africa (Adolph and Klein, 1981; Wallace, 1981; Searle, 1982).

When the "M. poschiavinus" animals and others with non-standard karyotypes were subjected to a variety of tests (both morphological and genetic) to determine their relatedness to M. m. domesticus, investigators were surprised to find that no characteristics, other than karyotype, distinguished these populations from each other. In particular, phylogenetic studies place the "M. poschiavinus" animals securely within the M. m. domesticus fold; thus the M. poschiavinus species name is inappropriate and should not be used.

How can animals within the same species (and even sub-species) have karyotypes that have diverged apart so radically in what has to be a very short period of evolutionary time? The first point to consider is that the karyotypes are actually not as different from each other as they might appear to be at first glance. When subjected to staining and banding analysis, each arm of every metacentric chromosome uncovered to date has been found to be identical to one of the chromosomes present in the standard M. musculus karyotype. Thus, it would appear that all of the non-standard karyotypes have arisen by simple fusion events, each of which resulted in the attachment of two standard mouse chromosomes at their centromeres. These centromeric fusions, also referred to as whole arm translocations, have been given the formal name of Robertsonian translocations, because W. R. B. Robertson was the first to identify such chromosomes in the grasshopper.

A three-part nomenclature is used to describe each individually isolated Robertsonian chromosome. First, the "Rb" symbol indicates a Robertsonian; second, the two chromosomes that have fused together are separated by a dot and listed within parentheses (with the lower numbered chromosome first); and third, the laboratory number and symbol are indicated. Thus, the twenty-third Robertsonian uncovered at the Institute for Pathology in Lubeck, Germany, that resulted from a fusion between chromosomes 10 and 15, would be designated Rb(10.15)23Lub.
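The three parts of a Robertsonian designation can be assembled and taken apart programmatically, as sketched here. The `Robertsonian` type, the two functions, and the regular expression are hypothetical helpers for illustration; the pattern assumes the laboratory symbol is a capitalized run of letters, as in the example above.

```python
import re
from typing import NamedTuple


class Robertsonian(NamedTuple):
    chrom_a: int    # lower-numbered chromosome arm
    chrom_b: int    # higher-numbered chromosome arm
    serial: int     # laboratory serial number
    lab_code: str   # laboratory symbol, e.g. "Lub"


_RB_RE = re.compile(r"^Rb\((\d{1,2})\.(\d{1,2})\)(\d+)([A-Z][A-Za-z]*)$")


def format_rb(chrom_a: int, chrom_b: int, serial: int, lab_code: str) -> str:
    """Build a designation, listing the lower-numbered chromosome first."""
    low, high = sorted((chrom_a, chrom_b))
    return f"Rb({low}.{high}){serial}{lab_code}"


def parse_rb(symbol: str) -> Robertsonian:
    """Split a designation such as 'Rb(10.15)23Lub' into its parts."""
    match = _RB_RE.match(symbol)
    if match is None:
        raise ValueError(f"not a Robertsonian designation: {symbol!r}")
    a, b, serial, lab = match.groups()
    return Robertsonian(int(a), int(b), int(serial), lab)


if __name__ == "__main__":
    print(format_rb(15, 10, 23, "Lub"))
    print(parse_rb("Rb(10.15)23Lub"))
```

Sorting the two chromosome numbers before formatting enforces the convention that the lower-numbered chromosome is listed first, whichever order the caller supplies.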

Why does the standard mouse karyotype contain no metacentric chromosomes, and at the same time, why do multiple metacentric chromosomes become fixed so rapidly in unrelated populations from isolated geographical regions? In the latter case, genetic drift alone does not appear to provide a satisfactory answer since metacentric fixation requires an intermediate stage during which animals must be karyotypically heterozygous; heterozygosity for one or more metacentric chromosomes will usually result in decreased fertility as a consequence of nondisjunction. Thus, spontaneously arising Robertsonians cannot be expected to survive in a population (let alone reach fixation) unless they engender a selective advantage to the animal within which they reside. Based on the limited, but scattered, occurrence of populations that contain Robertsonians, it would appear that they can only provide a selective advantage under certain environmental conditions, whereas under other conditions, mice are better served with only telocentric chromosomes. The mechanisms by which such selective pressures would operate on chromosome structure remain totally obscure at the present time.

5.2.2.2 Experimental applications

Robertsonian translocations are useful as genetic tools in two types of experimental applications. Like other translocations, they can catalyze nondisjunction events in heterozygous meiotic cells which — in this case — can lead to the genetic transmission of both homologs of the affected chromosome through individual gametes. This phenomenon will be discussed in the following section. In addition, Robertsonians are especially valuable as visible genetic markers in somatic cells. This usefulness is peculiar to the mouse where all chromosomes other than the Robertsonians will be telocentric. Under microscopic examination, Robertsonians can be easily distinguished as the only chromosomes with two arms.

The advantages of using Robertsonians as genetic markers can be best exploited within animals that contain a single pair of such homologs on a standard karyotypic background. Numerous strains carrying single Robertsonian pairs have been generated through selective breeding between wild and laboratory animals. Each of the standard mouse autosomes is available within the context of a Robertsonian in one or more strains of this type, which can be purchased from the Jackson Laboratory.

The most useful Robertsonians are those with two chromosome arms that differ significantly in length. For example, if one is interested in the analysis of chromosome 2, it would make sense to work with a strain that carries the Rb(2.18)6Rma fusion in which the longer arm of the metacentric (Chr 2) will be easily distinguished under the microscope from the shorter arm (Chr 18).

Robertsonians can be used as somatic markers for both analytical and preparative purposes. Analysis of a particular chromosome by in situ hybridization or other staining protocols is greatly aided by the ability to easily identify the relevant chromosome in all metaphase plates. Preparative microdissection for the purpose of generating subchromosome-specific DNA libraries (section 8.4) is less tedious and more rapidly accomplished by this easy identification (Röhme et al., 1984). Finally, Robertsonians can be more easily distinguished from all other mouse chromosomes by fluorescence-activated flow sorting (FACS) methods for chromosome identification and purification (Bahary et al., 1992).

5.2.3 Reciprocal translocations

5.2.3.1 Derivation and genetics

Although crossovers normally occur between homologous sequences present on sister chromatids, in rare instances, aberrant crossovers will occur between sequences that are non-allelic. When the two non-allelic sequences that partake in a crossover event of this type come from different chromatids in the same pair of homologs (as illustrated in figure 5.5), the result is a pair of reciprocal recombinant products that are "unequal" with one having a duplication and the other having a deletion of the material located between the two breakpoints. Intra-chromosomal unequal crossover events are discussed at length in section 5.3.2.

When crossing over occurs between sequences located on entirely different chromosomes, the result is even more dramatic. As shown in figure 5.3, inter-chromosomal crossing over results in the production of two reciprocal translocation chromosomes. Although inter-chromosomal crossover events occur even less often than intrachromosomal events, the former are much more readily detected (in most cases) for two reasons. First, reciprocal translocations cause the swapping of entire distal portions of two different chromosomes. Since the portions being swapped are usually not equal in size and are always associated with different banding patterns, each resultant translocation chromosome will usually look quite different from any normal chromosome. So long as the breakpoints are not exceedingly close to the centromeres or telomeres, these aberrant chromosomes will be easily recognized through karyotypic analysis. Second, reciprocal translocations usually cause a significant reduction in fertility as a consequence of the unusual pairing that must occur during synapsis and the production of a high frequency of unbalanced gametes through adjacent-1 segregation, which is discussed in more detail below and illustrated in figure 5.3. Unbalanced gametes derived from reciprocal translocation heterozygotes give rise to embryos that are partially trisomic or monosomic, and in some cases, these do not survive to birth.

Unlike Robertsonian fusions, reciprocal translocations are not found in wild populations of mice. They can arise spontaneously in laboratory animals and they are recovered at a higher frequency in offspring of males that have been subjected to chemical mutagenesis or irradiation treatment (discussed in section 6.1). A large number of translocations have been recovered to date (Searle, 1989) and strains homozygous for many can be purchased from the Jackson Laboratory.

As shown in figure 5.3, translocations will cause genetic linkage between chromosomal regions that assort independently in animals with normal karyotypes. Eva Eicher (1971) was the first to use this correlation between genetic linkage and karyotypic linkage to make a specific chromosomal assignment for a particular linkage group and by the end of the 1970s, all nineteen autosomal linkage groups and chromosomes had been paired together (Miller and Miller, 1975). Higher resolution studies that compared genetic and cytological breakpoint positions provided a means for further mapping of genes to particular chromosome bands; these data also provided a means for determining the centromeric and telomeric ends of each linkage map (Searle, 1989).

5.2.3.2 Chromosome segregation

The main contemporary use of reciprocal translocations is as a tool to generate animals that receive both homologs of a chromosomal region from a single parent. To understand the genetic basis for this outcome, you can follow the process of chromosome segregation during meiosis for the fictitious reciprocal translocation heterozygote shown in figure 5.3. In this example, mouse chromosomes 2 and 8 have exchanged material. During the anaphase I stage of the first meiotic division, the two homologs of every chromosome "disjoin" from each other and are pulled to opposite poles by spindles that attach to the centromeric regions. This disjunction of chromosomes is the physical basis for the genetically observed segregation of alleles according to Mendel’s first law. In mice with a normal karyotype, the segregation of any one pair of homologs will not affect the segregation of any other pair of homologs. Thus, individual homologs of different chromosomes that came into the animal together from one parent will go out into the offspring in an independent manner. This is the physical basis for Mendel’s second law of independent assortment.

In animals with a normal karyotype, chromosome disjunction will always lead to the production of gametes that are "balanced" with a complete haploid genome — no more, no less. However, the same is not true with animals heterozygous for a reciprocal translocation. As shown in figure 5.3, there are two equally-likely outcomes called "alternate segregation" and "adjacent-1 segregation." With the alternate segregation pathway, one gamete class will receive one Chr 2 homolog and one Chr 8 homolog {2,8}, just like all gametes produced by mice with a normal karyotype. The other gamete class will receive both translocated chromosomes called 2’ and 8’ in this example {2’,8’}; although the genetic material is rearranged, one complete haploid genome is present, and thus these gametes are considered to be "balanced". If a balanced gamete joins together with a normal gamete during fertilization, the resulting animal will be a balanced, reciprocal translocation heterozygote just like the original parent.

With the adjacent-1 segregation pathway, the two gamete classes are unbalanced in a reciprocal fashion. One will have a normal Chr 2 and a translocated Chr 8’ {2, 8’}; this gamete is deleted for sequences at the distal end of the normal Chr 8 (8d) and duplicated with both homolog copies of sequences from the distal end of Chr 2 (2d). The other {2’,8} will be deleted for distal Chr 2 sequences (2d) and duplicated for distal Chr 8 sequences (8d).
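The bookkeeping behind these four gamete classes can be written out explicitly. The sketch below is only an illustration of the fictitious Chr 2 / Chr 8 example: each chromosome is reduced to a proximal and a distal segment, the labels "2p" and "8p" for the proximal (centromeric) segments are assumptions introduced here, and the function names are hypothetical.

```python
from collections import Counter

# Each chromosome is modeled as (proximal segment, distal segment).
# 2d and 8d are the swapped distal ends; 2p and 8p are assumed labels
# for the centromere-bearing remainder of each chromosome.
CHROMOSOMES = {
    "2":  ("2p", "2d"),
    "8":  ("8p", "8d"),
    "2'": ("2p", "8d"),   # translocated: Chr 2 centromere, distal Chr 8
    "8'": ("8p", "2d"),   # translocated: Chr 8 centromere, distal Chr 2
}

# Gamete classes produced by each segregation pathway.
SEGREGATION = {
    "alternate":  [("2", "8"), ("2'", "8'")],
    "adjacent-1": [("2", "8'"), ("2'", "8")],
}

SEGMENTS = ("2p", "2d", "8p", "8d")


def describe(gamete):
    """Report which segments a gamete carries twice or not at all."""
    counts = Counter()
    for chromosome in gamete:
        counts.update(CHROMOSOMES[chromosome])
    return {
        "duplicated": [s for s in SEGMENTS if counts[s] > 1],
        "deleted": [s for s in SEGMENTS if counts[s] == 0],
    }


for pathway, gamete_classes in SEGREGATION.items():
    for gamete in gamete_classes:
        print(pathway, gamete, describe(gamete))
```

Both alternate classes come out with nothing duplicated or deleted, whereas the two adjacent-1 classes are unbalanced in opposite directions, as described above.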

5.2.3.3 Partial trisomies and uniparental disomies

The special consequences of chromosome segregation from reciprocal translocation heterozygotes have been exploited with two types of breeding protocols. In the first, translocation heterozygotes are bred to animals with a normal karyotype. Adjacent-1 segregation will give rise to animals that are partially trisomic (for the distal end of one translocated chromosome) and partially monosomic (for the distal end of the other). Thus, by choosing appropriate translocations, it becomes possible to construct animals that are deleted or duplicated for particular genes of interest. By breeding in mutations at these loci, it becomes possible to construct genotypes of the {+/+/m} and {+/m/m} variety (where + and m are wild-type and mutant alleles respectively) as a means toward a better understanding of gene dosage effects and dominance and recessive relationships (Agulnik et al., 1991; Ruvinsky et al., 1991).

In the second type of breeding protocol, animals heterozygous for the same pair of reciprocal translocations are mated to each other. The most interesting offspring to emerge from such unions are those formed through the fusion of complementary unbalanced gametes that represent the two different products of adjacent-1 segregation (figure 5.3). Although the resulting zygotes have fully balanced genomes (they are neither deleted nor duplicated for any sequences), they carry two subchromosomal regions in which both homologs came from only one or the other parent respectively. In other words, for one chromosomal region, these animals are maternally disomic and paternally nullisomic; for the other chromosomal region, the opposite holds true.
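To see why such zygotes are balanced and yet uniparentally disomic, it helps to track the parental origin of each chromosomal segment. The following sketch assumes the same fictitious exchange between Chr 2 and Chr 8 used in this section; the proximal labels "2p" and "8p" and the function name zygote_origins are hypothetical conveniences.

```python
from collections import defaultdict

# (proximal segment, distal segment) carried by each chromosome type.
CHROMOSOMES = {
    "2":  ("2p", "2d"),
    "8":  ("8p", "8d"),
    "2'": ("2p", "8d"),
    "8'": ("8p", "2d"),
}


def zygote_origins(maternal_gamete, paternal_gamete):
    """Map every segment to the parents that contributed its copies."""
    origins = defaultdict(list)
    for parent, gamete in (("maternal", maternal_gamete),
                           ("paternal", paternal_gamete)):
        for chromosome in gamete:
            for segment in CHROMOSOMES[chromosome]:
                origins[segment].append(parent)
    return dict(origins)


# Complementary adjacent-1 products: the egg carries {2, 8'} and the
# sperm carries {2', 8}.
zygote = zygote_origins(("2", "8'"), ("2'", "8"))
for segment in sorted(zygote):
    print(segment, zygote[segment])
```

Every segment is present in exactly two copies, but distal Chr 2 is entirely maternal and distal Chr 8 entirely paternal; swapping the two gametes between the parents reverses the pattern.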

Uniparental disomy can also be obtained, albeit with lower frequency, in the offspring of matings between animals heterozygous for the same Robertsonian translocation. Pairing between the Robertsonian and the homologous acrocentric chromosomes can lead to nondisjunction with gametes that contain either two copies or no copies of one homolog represented within the Robertsonian. Once again, the fusion of two complementary nondisjunction gametes will lead to zygotes with fully balanced genomes but with whole chromosome uniparental disomy. Both whole and partial chromosome disomy provide powerful genetic tools for the analysis of genomic imprinting which is discussed later in this chapter (Cattanach and Kirk, 1985).

5.3 Genome evolution and gene families

5.3.1 Classification of genomic elements

5.3.1.1 Functional and non-functional sequences

Sequences within the genome can be classified according to a number of criteria. The most important of these is functionality, and the largest class of functional DNA elements consists of coding sequences within transcription units. Transcription units usually contain exons and introns, and are usually associated with flanking regulatory regions that are necessary for proper expression. For the most part, transcription units correspond one-to-one with Mendelian genes, and they usually function on behalf of the organism within which they lie. However, mammalian genomes also contain transcribable elements that do not benefit the organism and whose sole function appears to be self-propagation. Such sequences are referred to as selfish DNA or selfish genes and will be described at length in section 5.4. Although these sequences may undergo transcription, they cannot be detected, in-and-of-themselves, in terms of traditional Mendelian phenotypes. The functional class of DNA elements also includes a number of specialized sequences that play roles in chromosome structure and transmission. The best characterized structural elements are associated with the centromeres and telomeres.

Most of the genome appears to consist of DNA sequences that are entirely non-functional. This non-functional class includes pseudogenes that derive from, and still share homology with, specific genes but are not themselves functional because they are not transcribed or not translated. However, for the most part, non-functional DNA is present in the context of long lengths of apparently random sequence, located between genes and within their introns, with origins that have long since become indecipherable as a consequence of constant "genetic drift."

5.3.1.2 Single copy and repeated sequences

Both functional and non-functional sequences can be distinguished by a second criterion — copy number. Sequences in a genome that do not share homology with any other sequences in the same genome are considered unique or single copy. This single copy class contains both functional and non-functional elements. Sequences that do share homology with one or more other genomic regions are considered to be repeated or multicopy.

All claims to the contrary aside, homology is a relative characteristic. At one extreme, two sequences can show 100% identity to each other at the nucleotide level. At the other extreme, homology may be recognized only through the use of computer algorithms that show a level of identity between two sequences that is unlikely to have occurred by chance. In the case of many gene families, individual members are not identical — in fact, they are likely to have evolved different functions — yet a probe from one will cross-hybridize with sequences from the others. Cross-hybridization provides a powerful tool for the identification of multi-copy DNA elements by simple Southern blot analysis and for their characterization by library screening and cloning.

Homologies among more distantly related functional sequences that do not show cross-hybridization can sometimes be uncovered through the use of the polymerase chain reaction (PCR). The rationale behind this approach — which has been used successfully with a number of different gene families — is that specific short regions of related gene sequences may be under more intense selective pressure to remain relatively unchanged due to functional constraints on the encoded peptide regions. These highly conserved regions may not be long enough to allow cross-hybridization, but the constrained peptide sequences that they encode can be used to devise two degenerate oligonucleotides for use as primers to identify additional members of the gene family through amplification from either genomic DNA or tissue-specific cDNA.
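One practical consideration in designing such primers is their degeneracy, that is, the number of distinct oligonucleotides needed to cover every possible coding sequence for the chosen peptide. The sketch below simply multiplies the number of codons available for each residue in the standard genetic code; the peptide used in the example is invented, and the function names are hypothetical.

```python
from math import prod

# Number of codons per amino acid in the standard genetic code.
CODON_COUNT = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}


def primer_degeneracy(peptide: str) -> int:
    """Count the distinct DNA sequences that could encode the peptide."""
    return prod(CODON_COUNT[residue] for residue in peptide)


def least_degenerate_window(peptide: str, length: int):
    """Return (start, window, degeneracy) for the best window of `length`."""
    if not 0 < length <= len(peptide):
        raise ValueError("window length must fit within the peptide")
    candidates = (
        (start, peptide[start:start + length])
        for start in range(len(peptide) - length + 1)
    )
    start, window = min(candidates, key=lambda c: primer_degeneracy(c[1]))
    return start, window, primer_degeneracy(window)


# Invented conserved region, for illustration only.
print(least_degenerate_window("LSRWMDKCEL", 6))
```

Windows rich in methionine and tryptophan (one codon each) yield the least degenerate primers, while leucine, serine, and arginine (six codons each) are best avoided.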

All sequences that are partially identical to each other — as recognized by hybridization, PCR, or sequence comparisons — are considered to be members of the same DNA element family. Families of functional elements are called gene families. Families of non-functional elements have been referred to simply as "repeat families" or "DNA element families". Multicopy DNA families — both functional and non-functional — can be further classified according to copy number, element size, and distribution within the genome. Related sequences can be found closely linked to each other in a cluster, they can be unlinked to each other and dispersed to different chromosomes, or they can have a combination of these two arrangements with multiple clusters dispersed to different sites.

From a distance, the genome appears to be a chaotic mixture of sequences from all of these classes thrown together without any structure or order, like craters, one overlapping the next, on the surface of the moon. However, on closer examination, it becomes possible to make sense of the genome, the relationship of different genomic elements to each other, and the mechanisms by which they have evolved as indicated for the hypothetical genomic region shown in figure 5.4.

5.3.2 Forces that shape the genome

5.3.2.1 Genomic complexity increases by gene duplication and selection for new function

Mice, humans, the lowly intestinal bacterium E. coli, and all other forms of life evolved from the same common ancestor that was alive on this planet a few billion years ago. We know this is the case from the universal use of the same molecule, DNA, for the storage of genetic information, and from the nearly universal genetic code. But E. coli has a genome size of 4.2 megabases (mb), while the mammalian genome is roughly 700-fold larger at ~3,000 mb. If one assumes that our common ancestor had a genome size that was no larger than that of the modern-day E. coli, the obvious question one can ask is where did all of our extra DNA come from?

The answer is that our genome grew in size and evolved through a repeated process of duplication and divergence. Duplication events can occur essentially at random throughout the genome and the size of the duplication unit can vary from as little as a few nucleotides to large subchromosomal sections that are tens, or even hundreds, of megabases in length. When the duplicated segment contains one or more genes, either the original or duplicated copy of each is set free to accumulate mutations without harm to the organism since the other good copy with an original function will still be present.

Duplicated regions, like all other genetic novelties, must originate in the genome of a single individual and their initial survival in at least some animals in each subsequent generation of a population is, most often, a simple matter of chance. This is because the addition of one extra copy of most genes — to the two already present in a diploid genome — is usually tolerated without significant harm to the individual animal. In the terminology of population genetics, most duplicated units are essentially neutral (in terms of genetic selection) and thus, they are subject to genetic drift, inherited by some offspring but not others derived from parents that carry the duplication unit. By chance, most neutral genetic elements will succumb to extinction within a matter of generations. But even when a duplicated region survives for a significant period of time, random mutations in what were once-functional genes will almost always lead to non-functionality. At this point, the gene becomes a pseudogene. Pseudogenes will be subject to continuous genetic drift with the accumulation of new mutations at a pace that is so predictable (~0.5% divergence per million years) as to be likened to a molecular clock. Eventually, nearly all pseudogene sequences will tend to drift past a boundary where it is no longer possible to identify the functional genes from which they derived. Continued drift will act to turn a once-functional sequence into a sequence of essentially-random DNA.
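The molecular clock figure quoted above lends itself to simple arithmetic. The sketch below uses only the ~0.5% per million years rate given in the text and assumes, for simplicity, that divergence accumulates linearly; this ignores multiple substitutions at the same site and so is reasonable only for modest levels of divergence. The function names are hypothetical.

```python
# Rate of pseudogene divergence quoted in the text (percent per million years).
RATE_PERCENT_PER_MYR = 0.5


def expected_divergence_percent(million_years: float) -> float:
    """Expected divergence after a given time, under the linear assumption."""
    return RATE_PERCENT_PER_MYR * million_years


def estimated_age_myr(divergence_percent: float) -> float:
    """Time needed to accumulate a given divergence, same assumption."""
    return divergence_percent / RATE_PERCENT_PER_MYR


print(expected_divergence_percent(10))   # percent divergence after 10 million years
print(estimated_age_myr(5.0))            # million years implied by 5% divergence
```

Under this assumption, a pseudogene that differs from its functional parent at 5% of its nucleotides would have begun drifting about 10 million years ago.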

Miraculously, every-so-often, the accumulation of a set of random mutations in a spare copy of a gene can lead to the emergence of a new functional unit — or gene — that provides benefit and, as a consequence, selective advantage to the organism in which it resides. Usually, the new gene has a function that is related to the original gene function. However, it is often the case that the new gene will have a novel expression pattern — spatially, temporally, or both — which must result from alterations in cis-regulatory sequences that occur along with codon changes. A new function can emerge directly from a previously-functional gene or even from a pseudogene. In the latter case, a gene can go through a period of non-functionality during which there may be multiple alterations before the gene comes back to life. Molecular events of this class can play a role in "punctuated evolution" where, according to the fossil or phylogenetic record, an organism or evolutionary line appears to have taken a "quantum leap" forward to a new phenotypic state.

5.3.2.2 Duplication by transposition

With duplication acting as such an important force in evolution, it is critical to understand the mechanisms by which it occurs. These fall into two broad categories: (1) transposition is responsible for the dispersion of related sequences; (2) unequal crossing over is responsible for the generation of gene clusters. Transposition refers to a process in which one region of the genome re-locates to a new chromosomal location. Transposition can occur either through the direct movement of original sequences from one site to another or through an RNA intermediate that leaves the original site intact. When the genomic region itself (rather than its proxy) has moved, the "duplication" of genetic material actually occurs in a subsequent generation after the transposed region has segregated into the same genome as the originally-positioned region from a non-deleted homolog. In theory, there is no upper limit to the size of a genomic region that can be duplicated in this way.

A much more common mode of transposition occurs by means of an intermediate RNA transcript that is reverse-transcribed into DNA and then inserted randomly into the genome. This process is referred to as retrotransposition. The size of the retrotransposition unit — called a retroposon — cannot be larger than the size of the intermediate RNA transcript. Retrotransposition has been exploited by various families of selfish genetic elements (described in section 5.4), some of which have been copied into 100,000 or more locations dispersed throughout the genome with a self-encoded reverse transcriptase. But, examples of functional, intronless retroposons — such as Pgk2 and Pdha2 — have also been identified (Boer et al., 1987; Fitzgerald et al., 1993). In such cases, functionality is absolutely dependent upon novel regulatory elements either present at the site of insertion or created by subsequent mutations in these sequences.

5.3.2.3 Duplication by unequal crossing over

The second broad class of duplication events results from unequal crossing over. Normal crossing over, or recombination, can occur between equivalent sequences on homologous chromatids present in a synaptonemal complex that forms during the pachytene stage of meiosis in both male and female mammals. Unequal crossing over, also referred to as illegitimate recombination, refers to crossover events that occur between non-equivalent sequences. Unequal crossing over can be initiated by the presence of related sequences (such as highly repeated retroposon-dispersed selfish elements) located nearby in the genome (figure 5.5). Although the event is unequal, in this case, it is still mediated by the homology that exists at the two non-equivalent sites.

So-called non-homologous unequal crossovers can also occur, although they are much rarer than homologous events. I say so-called because even these events may be dependent on at least a short stretch of sequence homology at the two sites at which the event is initiated. The initial duplication event that produces a two-gene cluster may be either homologous or non-homologous, but once two units of related sequence are present in tandem, further rounds of homologous unequal crossing over can be easily initiated between non-equivalent members of the pair as illustrated in figure 5.5. Thus, it is easy to see how clusters can expand to contain three, four, and many more copies of an original DNA sequence.

In all cases, unequal crossing over between homologs results in two reciprocal chromosomal products: one will have a duplication of the region located between the two sites and the other will have a deletion that covers the same exact region (figure 5.5). It is important to remember that, unlike retrotransposition, unequal crossing over operates on genomic regions without regard to functional boundaries. The size of the duplicated region can vary from a few basepairs to tens or even hundreds of kilobases and it can contain no genes, a portion of a gene, a few genes, or many.
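The reciprocal nature of the two products can be made concrete with a toy model. In the sketch below, which is an illustration only, each homolog is a list of elements, "R" stands for a dispersed repeat, "G" for a gene lying between two copies of the repeat, and the breakpoints are given as list positions; all of these names are hypothetical.

```python
def unequal_crossover(homolog_a, homolog_b, break_a, break_b):
    """Exchange the regions distal to break_a on one homolog and
    break_b on the other, returning the two reciprocal products."""
    product_1 = homolog_a[:break_a] + homolog_b[break_b:]
    product_2 = homolog_b[:break_b] + homolog_a[break_a:]
    return product_1, product_2


# A gene flanked by two copies of a repeat, with unique flanks x and y.
homolog = ["x", "R", "G", "R", "y"]

# The first repeat on one homolog pairs with the second repeat on the
# other, so the exchange falls after position 2 on one homolog and
# after position 4 on the other.
deleted, duplicated = unequal_crossover(homolog, homolog, 2, 4)
print(deleted)
print(duplicated)
```

One product has lost the gene and the other carries it twice in tandem. Because the duplicated product now contains additional repeat copies and a pair of related genes, further rounds of the same event can expand the cluster.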

5.3.2.4 Genetic exchange between related DNA elements

There are many examples in the genome where genetic information appears to flow from one DNA element to other related — but non-allelic — elements located nearby or even on different chromosomes. In some special cases, the flow of information is so extreme as to allow all members of a gene family to co-evolve with near-identity as discussed in section 5.3.3.3. In at least one case — that of the class I genes of the major histocompatibility complex (MHC or H2) — information flow is unidirectionally selected, going from a series of 25 to 38 non-functional pseudogenes into two or three functional genes (Geliebter and Nathenson, 1987). In this case, intergenic information transfer serves to increase dramatically the level of polymorphism that is present at the small number of functional gene members of this family.

Information flow between related DNA sequences occurs as a result of an alternative outcome from the same exact process that is responsible for unequal crossing over. This alternative outcome is known as intergenic gene conversion. Gene conversion was originally defined in yeast through the observation of altered ratios of segregation from individual loci that were followed in tetrad analyses. These observations were fully explained within the context of the Holliday model of DNA recombination which states that homologous DNA duplexes first exchange single strands that hybridize to their complements and migrate for hundreds or thousands of bases. Resolution of this "Holliday intermediate" can lead with equal frequency to crossing over between flanking markers or back to the status quo without crossing over. In the latter case, a short single strand stretch from the invading molecule will be left behind within the DNA that was invaded. If an invading strand carries nucleotides that differ at any site from the strand that was replaced, these will lead to the production of heteroduplexes with basepair mismatches. Mismatches can be repaired (in either direction) by specialized "repair enzymes" or they can remain as-is to produce non-identical daughter DNAs through the next round of replication.

By extrapolation, it is easy to see how the Holliday Model can be applied to the case of an unequal crossover intermediate which can be resolved in one of two directions with equal probability. With one resolution, unequal crossing over will result; with the alternative resolution, gene conversion can be initiated between non-allelic sequences. Remarkably, information transfer, presumably by means of gene conversion, can also occur across related DNA sequences that are distributed even to different chromosomes.
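The difference between the two resolutions is that gene conversion is non-reciprocal: a short tract of the donor replaces the corresponding tract of the acceptor, and neither sequence changes in length or position. A minimal sketch follows, assuming two already-aligned sequences of equal length; the sequences themselves and the function name gene_conversion are invented for illustration.

```python
def gene_conversion(acceptor: str, donor: str, start: int, end: int) -> str:
    """Return the acceptor with positions start..end-1 copied from the donor."""
    if len(acceptor) != len(donor):
        raise ValueError("this sketch assumes aligned sequences of equal length")
    if not 0 <= start <= end <= len(acceptor):
        raise ValueError("conversion tract lies outside the sequence")
    return acceptor[:start] + donor[start:end] + acceptor[end:]


functional_gene = "ACGTACGTAC"
pseudogene      = "AGGTTCGAAC"   # differs at positions 1, 4, and 7

converted = gene_conversion(functional_gene, pseudogene, 3, 6)
print(converted)
```

Only the difference that falls inside the conversion tract is transferred to the acceptor; the donor is left untouched and can serve again in later events.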

5.3.3 Gene families and superfamilies.

5.3.3.1 Origins and examples

Much of the functional DNA in the genome is organized within gene families and hierarchies of gene superfamilies. The superfamily term was coined to describe relationships of common ancestry that exist between and among two or more gene families, each of which contains more closely related members. As more and more genes are cloned, sequenced, and analyzed by computer, deeper and older relationships among superfamilies have unfolded. Complex relationships can be visualized within the context of branches upon branches in evolutionary trees. All of these superfamilies have evolved out of combinations of unequal crossover events that expanded the size of gene clusters and transposition events that acted to seed distant genomic regions with new genes or clusters.

A prototypical small-size gene superfamily is represented by the very well-studied globin genes illustrated in figure 5.6. All functional members of this superfamily play a role in oxygen transport. The superfamily has three main families (or branches) represented by the beta-like genes, the alpha-like genes and the single myoglobin gene. The duplication and divergence of these three main branches occurred early during the evolution of vertebrates and, as such, all three are a common feature of all mammals. The products encoded by genes within two of these branches — alpha-globin and beta-globin — come together (with heme cofactors) to form a tetramer which is the functional hemoglobin protein that acts to transport oxygen through the blood stream. The product encoded by the third branch of this superfamily — myoglobin — acts to transport oxygen in muscle tissue.

The beta-like branch of this gene superfamily has duplicated by multiple unequal crossing over events and diverged into four functional genes and three beta-like pseudogenes that are all present in a single cluster on mouse chromosome 7 as shown in figure 5.6 (Jahn et al., 1980). Each of the functional beta-like genes codes for a similar polypeptide which has been selected for optimal functionality at a specific stage of mouse development: one functions during early embryogenesis, one during a later stage of embryogenesis, and two in the adult. The alpha-like branch has also expanded by unequal crossing over into a cluster of three genes (one functional during embryogenesis and two functional in the adult) on mouse chromosome 11 (Leder et al., 1981). The two adult alpha genes are virtually identical at the DNA sequence level, which is indicative of a very recent duplication event (on the evolutionary time scale).

In addition to the primary alpha-like cluster are two isolated alpha-like genes (now non-functional) that have transposed to dispersed locations on chromosomes 15 and 17 (Leder et al., 1981). When pseudogenes are found as single copies in isolation from their parental families, they are called "orphons." Interestingly, one of the alpha globin orphons (Hba-ps3 on Chr 15) is intronless and would appear to have been derived through a retrotransposition event, whereas the other orphon (Hba-ps4 on Chr 17) contains introns and may have been derived by a direct DNA-mediated transposition. Finally, the single myoglobin gene on chromosome 15 does not have any close relatives either nearby or far away (Blanchetot et al., 1986; Drouet and Simon-Chazottes, 1993). Thus, the globin gene superfamily provides a view of the many different mechanisms that can be employed by the genome to evolve structural and functional complexity.

The Hox gene superfamily provides an alternative prototype for the expansion of gene number as illustrated in figure 5.7. In this case, the earliest duplication events (which pre-date the divergence of vertebrates and insects) led to a cluster of related genes that encoded DNA-binding proteins used to encode spatial information in the developing embryo. The original gene cluster has been duplicated en masse and dispersed to a total of four chromosomal locations (on Chrs 2, 6, 11, and 15) each of which contains nine to twelve genes (McGinnis and Krumlauf, 1992). Interestingly, because of the order in which the duplication events occurred — unequal crossing over to expand the cluster size first, transposition en masse second — an evolutionary tree would show that a single "gene family" within this superfamily is actually splayed out physically across all of the different gene clusters as shown in figure 5.7. Some gene additions and subtractions within individual clusters have occurred by unequal crossing over since the en masse duplication so that differences in gene number and type can be seen within a basic framework of homology among the different whole clusters.

A final example of a gene super-superfamily is the very large set of genes that contain immunoglobulin-like (Ig) domains and function as cell surface or soluble receptors involved in immune function or other aspects of cell-cell interaction. This set includes the immunoglobulin gene families themselves, the major histocompatibility genes (called H2 in mice), the T cell receptor genes and many more (Hood et al., 1985). There are dispersed genes and gene families, small clusters, large clusters, and clusters within clusters, tandem and interspersed. Dispersion has occurred with the transposition of single genes that later formed clusters and with the dispersion of whole clusters en masse. Furthermore, the original Ig-domain can occur as a single unit in some genes, but it has also been duplicated intragenically to produce gene products that contain two, three, or four domains linked together in a single polypeptide. The Ig-superfamily, which contains hundreds (perhaps thousands) of genes, illustrates the manner in which the initial emergence of a versatile genetic element can be exploited by the forces of genomic evolution with a consequential enormous growth in genomic and organismal complexity.

5.3.3.2 Does gene order or localization matter?

Does the chromosome on which a gene lies matter to its function? Is gene clustering significant to function or is it simply a remnant of the fact that duplicated genes are most often generated by unequal crossover events? One can gain insight toward the answers to these questions by comparing the positions of homologous genetic information in different species: specifically mice and humans. Whole genome comparisons quickly demonstrate that the question of conservation to a particular chromosome only makes sense in the context of the X and Y. This is because every autosome from one species contains significant stretches of homology with two or more autosomes in the other species. Thus, the question of autosome conservation is meaningless. The X and Y chromosomes are a different story for three interrelated reasons. First, as a pair, they play a special role in sex determination. Second, they are the only chromosomes that can appear in a hemizygous state in normal genomes. Third, the X chromosome alone is subject to stable inactivation in all normal female mammals. With few exceptions, X-linked and Y-linked genes have remained in the same linkage groups throughout mammalian evolution as originally proposed by Ohno (1967), although various intra-chromosomal rearrangements have occurred (Bishop, 1992; Brown et al., 1992; Foote et al., 1992).

The second question asked at the head of this section can be re-stated as follows: do fine-structure genetic maps have functional significance? The answer is that in at least some cases, the integrity of genes within a clustered family is clearly important to function. This was first illustrated in the case of the beta-globin gene family with its five members arranged in a 70 kb array (figure 5.6). Although beta-globin was used in the first transgenic experiments conducted in 1980 and many subsequent experiments, it was never possible for researchers to achieve full expression of the transgene at the same level as the endogenous gene. The problem was that all of the members of the endogenous gene family are dependent for expression on a locus control region (or LCR) that maps outside of the gene cluster and appears to play a role in "opening-up" the chromatin structure of the entire cluster in hematopoietic cells so that individual family members can then be regulated in different temporal modes (Talbot et al., 1989; Townes and Behringer, 1990). When transgene constructs are produced with the beta-globin LCR linked to the beta globin structural gene, full endogenous levels of expression can be obtained (Grosveld et al., 1987). In recent years, evidence has accumulated for the role of LCRs in the global control of other gene clusters as well.

There is not only a requirement for some genes to remain in their ancestral cluster, but in some cases, the precise order of genes is conserved as well. Actual gene order has been observed to play roles in two different patterns of expression. Transgenic experiments indicate that for the beta-globin cluster, the temporal sequence of expression appears to be directly encoded (to a certain extent) in the order in which the genes occur (Hanscombe et al., 1991). In the Hox gene clusters, the order of genes correlates with the pattern of spatial expression along the anterior-posterior axis of the developing embryo (McGinnis and Krumlauf, 1992 and figure 5.7).

There are also a few examples of genes and clusters that are unrelated by sequence, but which map together in a small chromosomal region and have a common arena of function. The best example of this phenomenon is the major histocompatibility complex which contains various gene families that diverged from a common immunoglobulin-like-domain ancestor but also unrelated genes that play a role in antigen presentation and other aspects of immune function. This conjunction of immune genes has been conserved in all mammalian species that have been examined. Is this significant? Farr and Goodfellow (1992) quote Sydney Brenner in likening gene mappers to astronomers boldly mapping the heavens and conclude that "Seeking meaning in gene order may be the equivalent of astrophysics — or it might be astrology". I think it is safe to bet that sometimes it will be one and sometimes it will be the other. The problem will be to distinguish between the two.

5.3.3.3 Tandem families of identical genetic elements

A limited number of multi-copy gene families have evolved under a very special form of selective pressure that requires all members of the gene family to maintain essentially the same sequence. In these cases, the purpose of high copy number is not to effect different variations on a common theme, but rather to supply the cell with a sufficient amount of an identical product within a short period of time. The set of gene families with identical elements includes those that produce RNA components of the cell’s machinery within ribosomes and as transfer RNA. It also includes the histone genes which must rapidly produce sufficient levels of protein to coat the new copy of the whole genome that is replicated during the S phase of every cell cycle.

Each of these gene families is contained within one or more clusters of tandem repeats of identical elements. In each case, there is strong selective pressure to maintain the same sequence across all members of the gene family because all are used to produce the same product. In other words, optimal functioning of the cell requires that the products from any one individual gene are directly interchangeable in structure and function with the products from all other individual members of the same family. How is this accomplished? The problem is that once sequences are duplicated, their natural tendency is to drift apart over time. How does the genome counteract this natural tendency?

When ribosomal RNA genes and other gene families in this class were first compared both between and within species, a remarkable picture emerged: between species, there was clear evidence of genetic drift with rates of change that appeared to follow the molecular clock hypothesis described earlier. However, within a species, all sequences were essentially equivalent. Thus, it is not simply the case that mutational changes in these gene families are suppressed. Rather, there appears to be an on-going process of "concerted evolution" which allows changes in single genetic elements to spread across a complete set of genes in a particular family. So the question posed previously can now be narrowed down further: how does concerted evolution occur?

Concerted evolution appears to occur through two different processes (Dover, 1982; Arnheim, 1983). The first is based on the expansion and contraction of gene family size through sequential rounds of unequal crossing over between homologous sequences. Selection acts to maintain the absolute size of the gene family within a small range around an optimal mean. As the gene family becomes too large, the shorter of the unequal crossover products will be selected; as the family becomes too small, the longer products will be selected. This cyclic process will cause a continuous oscillation around a mean in size. However, each contraction will result in the loss of divergent genes, whereas each expansion will result in the indirect "replacement" of these lost genes with identical copies of other genes in the family. With unequal crossovers occurring at random positions throughout the cluster and with selection acting in favor of the least divergence among family members, this process can act to slow-down dramatically the continuous process of genetic drift between family members.
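The homogenizing effect of repeated expansion and contraction can be illustrated with a toy simulation. The sketch below is not taken from the cited studies: the function names, the rule for choosing block size, the optimal family size of 20, and the tolerance of 5 are all assumptions chosen for illustration. Each round duplicates or deletes a contiguous block of gene copies, with selection forcing contraction when the family is too large and expansion when it is too small. Mutation is ignored, so copies can only be lost or duplicated and the number of distinct variants can only fall.

```python
import random

def unequal_crossover_round(family, optimal_size, tolerance, rng):
    """Return the selected product of one unequal crossover (toy model)."""
    n = len(family)
    block = rng.randint(1, max(1, n // 4))   # number of copies gained or lost
    start = rng.randint(0, n - block)        # random position within the cluster
    if n >= optimal_size + tolerance:
        expand = False                       # too large: shorter product is selected
    elif n <= optimal_size - tolerance:
        expand = True                        # too small: longer product is selected
    else:
        expand = rng.random() < 0.5          # within range: either product survives
    segment = family[start:start + block]
    if expand:
        # Duplicate the block in tandem, right after the original copy.
        return family[:start + block] + segment + family[start + block:]
    # Delete the block, losing whatever variants it carried.
    return family[:start] + family[start + block:]

def simulate(initial_family, rounds, optimal_size=20, tolerance=5, seed=0):
    """Apply repeated rounds of unequal crossing over to a tandem gene family."""
    rng = random.Random(seed)
    family = list(initial_family)
    for _ in range(rounds):
        family = unequal_crossover_round(family, optimal_size, tolerance, rng)
    return family

if __name__ == "__main__":
    start = [f"variant{i}" for i in range(20)]   # 20 fully diverged copies
    end = simulate(start, rounds=500)
    print(len(set(start)), "distinct variants at the start")
    print(len(set(end)), "distinct variants after 500 rounds")
```

In this simplified model, the family size oscillates around the optimum while the pool of surviving variants shrinks, which is the qualitative behavior described above. Real families also accumulate new mutations, so homogenization competes with ongoing drift.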

The second process responsible for concerted evolution is intergenic gene conversion between "non-allelic" family members. It is easy to see that different tandem elements of nearly identical sequence can take part in the formation of Holliday intermediates which can resolve into either unequal crossing over products or gene conversion between non-allelic sequences. Although the direction of information transfer from one gene copy to the next will be random in each case, selection will act upon this molecular process to ensure an increase in homogeneity among different gene family members. As discussed above, information transfer — presumably by means of gene conversion — can also occur across gene clusters that belong to the same family but are distributed to different chromosomes.

Thus, with unequal crossing over and intergenic gene conversion (which are actually two alternative outcomes of the same initial process) along with selection for homogeneity, all of the members of a gene family can be maintained with nearly the same DNA sequence. Nevertheless, concerted evolution will still lead to increasing divergence between whole gene families present in different species.

5.3.4 Centromeres and satellite DNA

In the early days of molecular biology, equilibrium sedimentation through CsCl gradients was used as a method to fractionate DNA according to buoyant density. Genomic DNA prepared from animal tissues according to standard protocols is naturally degraded by shear forces into fragments that are, on average, smaller than 100 kb. When a solution of genomic DNA fragments is subjected to high-speed centrifugation in CsCl, each fragment will move to a position of equivalent density in the CsCl concentration gradient that forms. DNA buoyant density is related to the molar ratio of G:C basepairs to A:T basepairs by a simple linear function. The greater the G:C content, the higher the density. When mouse DNA is subjected to CsCl fractionation, the bulk of the DNA (90%) is distributed within a narrow bell-shaped curve having an average density of 1.701 g/cm3 equivalent to a G:C content of 42%.

In addition to this "main band" of DNA, a second "satellite band" was observed with an average density of 1.690 g/cm3 equivalent to a G:C content of 31% (Kit, 1961). Approximately 5.5% of the total mouse genome is found within this band, and the DNA within this fraction was given the name "satellite DNA" (Davisson and Roderick, 1989). It was not until 1970 that Pardue and Gall used their newly invented technique of in situ hybridization to demonstrate the localization of satellite DNA sequences to the centromeres of all mouse chromosomes except the Y (Pardue and Gall, 1970). Centromeres are highly specialized structural elements that function to segregate eukaryotic chromosomes during mitosis and meiosis (Rattner, 1991).
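The linear relationship between buoyant density and base composition can be made explicit. The coefficients in the sketch below are not given in the text; they are the commonly cited empirical relation for double-stranded DNA in cesium chloride (density = 1.660 + 0.098 x G:C fraction), used here as an assumption. The function names are illustrative only.

```python
# Assumed empirical relation: density = INTERCEPT + SLOPE * (G:C fraction)
INTERCEPT = 1.660   # g/cm3, assumed density of DNA with 0% G:C
SLOPE = 0.098       # g/cm3 per unit G:C fraction (assumed)

def density_from_gc(gc_fraction):
    """Buoyant density (g/cm3) predicted for a given G:C fraction (0 to 1)."""
    return INTERCEPT + SLOPE * gc_fraction

def gc_from_density(density):
    """G:C fraction (0 to 1) implied by a measured buoyant density (g/cm3)."""
    return (density - INTERCEPT) / SLOPE

if __name__ == "__main__":
    for label, density in [("main band", 1.701), ("satellite band", 1.690)]:
        print(f"{label}: {density} g/cm3 -> {gc_from_density(density):.0%} G:C")
```

Under this assumed relation, the two band densities quoted above convert to the G:C contents given in the text.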

When DNA recovered from the satellite band was subjected to renaturation analysis, as described earlier in this chapter, the C0t1/2 value obtained indicated a complexity of only ~200 nucleotides. This result showed that the satellite DNA fraction was composed of a simple sequence that was repeated over and over again many, many times. Modern cloning and sequence analysis has demonstrated a basic repeating unit with a size of 234 bp (Hörz and Altenburger, 1981). One can calculate the copy number of this basic repeat unit by dividing the proportion of the genome devoted to satellite sequences (5.5% x 3 x 10^9 bp = 1.65 x 10^8 bp) by the repeat size (234 bp) to obtain approximately 700,000 copies. If these copies were distributed equally among all chromosomes, each centromere would contain approximately 35,000 copies having a total length of about eight megabases.
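The copy-number arithmetic can be checked in a few lines. This is only a worked restatement of the numbers in the text; the variable names are illustrative, and the divisor of 20 chromosomes (19 autosomes plus the X in a haploid set) is an assumption inferred from the per-centromere figure quoted above.

```python
GENOME_BP = 3e9              # haploid genome size used in the text
SATELLITE_FRACTION = 0.055   # 5.5% of the genome is major satellite DNA
REPEAT_BP = 234              # size of the basic repeat unit
N_CHROMOSOMES = 20           # assumption: 19 autosomes plus the X

satellite_bp = GENOME_BP * SATELLITE_FRACTION           # total satellite DNA
copies = satellite_bp / REPEAT_BP                       # genome-wide copy number
copies_per_centromere = copies / N_CHROMOSOMES          # if evenly distributed
mb_per_centromere = copies_per_centromere * REPEAT_BP / 1e6

if __name__ == "__main__":
    print(f"satellite DNA: {satellite_bp:.3g} bp")
    print(f"copies: {copies:,.0f}")
    print(f"copies per centromere: {copies_per_centromere:,.0f}")
    print(f"satellite DNA per centromere: {mb_per_centromere:.1f} Mb")
```

The unrounded values come out slightly above the rounded figures in the text, which is why those figures are best read as approximations.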

Although the original definition of "satellite" DNA was based on a density difference observed in CsCl gradients, the meaning of the term has expanded to describe all highly repeated simple sequences found in the centromeres of chromosomes from higher eukaryotes. In many species, satellite sequences do not have G:C contents that differ from that of the bulk DNA.

The M. musculus genome has a second family of satellite sequences present in only 50,000 to 100,000 copies (Davisson and Roderick, 1989). This "minor satellite" is also localized to the centromeres and appears to share a common ancestry with the major satellite. It is of interest that the relative proportion of the two satellites in M. spretus is the reverse of that found in M. musculus. The M. spretus genome has only 25,000 copies of the "major satellite" and 400,000 copies of the "minor satellite". This difference can be exploited to allow the determination of centimorgan distances between centromeres and linked loci in interspecific crosses as discussed in section 9.1.2 (Matsuda and Chapman, 1991).

The satellite sequences in the distant Mus species M. caroli, M. cervicolor, and M. cookii have diverged so far from the musculus sequences that cross-hybridization between the two is minimal. This qualitative difference can be exploited, once again, by in situ hybridization, to differentially mark cells from each species in interspecific chimeras (Rossant et al., 1983). A satellite DNA marker is useful for cell lineage studies because it is easy to detect by hybridization of tissue sections and it is present in all cells irrespective of gene activity or developmental state.

The term satellite has been incorporated as a suffix into a number of other terms (microsatellite, minisatellite, midisatellite, etc.) that are used to describe DNA sequences formed from basic units that have become amplified by multiple rounds of tandem duplication. Some of these sequence classes are described in section 5.4.5 and chapter 8.

5.4 Repetitive "non-functional" DNA families

In the preceding section, we examined several different classes of DNA families with members that carry out a variety of tasks necessary for the survival of the organism. This section surveys a final major class of DNA families whose members in-and-of-themselves do not function for the benefit of the animal in which they lie. This class can be subdivided further into individual families that are actively involved in their own dispersion — the so-called selfish genes — and those that consist of very simple sequences that appear to arise de novo at each genomic location. The selfish gene group can be further divided — somewhat arbitrarily — into subclasses based on copy number in the genome. Each of the resulting subclasses of repetitive DNA families will be discussed in the subsections that follow.

5.4.1 Endogenous retroviral elements

Retroviruses are RNA-containing viruses that can convert their RNA genome into circular DNA molecules through a viral-associated reverse transcriptase which becomes activated upon cell infection. The resultant DNA "provirus" can integrate itself into a relatively-random site in the host genome. The genetic information present in the retroviral genome is retained within the integrated provirus, and under certain conditions, the provirus can be activated to produce new RNA genomes along with the associated proteins (including reverse transcriptase) that can come together to form new virus particles that are ultimately released from the cell surface by exocytosis. However, in many cases, stably integrated retroviral elements appear not to be active.

Once it has become integrated into a chromosome, the provirus will become replicated with every round of host replication irrespective of whether the provirus itself is active or silent. Furthermore, proviruses that integrate into the germ line — through the sperm or egg genome — will segregate along with their host chromosome into the progeny of the host animal and into subsequent generations of animals as well. In certain hybrid mouse strains, new proviral integrations into the germ line can be observed to occur at abnormally high frequencies (Jenkins and Copeland, 1985).

All strains of mice as well as all other mammals have endogenous proviral elements. These elements can be classified and subclassified according to the type of retrovirus from which they derived (ecotropic, MMTV, xenotropic, and others). Ecotropic elements are generally present at 0 to 10 copies (Jenkins et al., 1982), MMTVs are present at 4 to 12 copies (Kozak et al., 1987), and non-ecotropic elements are present at 40 to 60 copies (Frankel et al., 1990). Loss and acquisition of new proviral sequences is an ongoing process and, as a consequence, the genomic distribution of these elements is highly polymorphic. Thus, these elements can be very useful as genetic markers as discussed in section 8.2.4.

In addition to the DNA families clearly related to known retroviruses, there are a number of additional families that are retroviral-like in structure but are not clearly related to any known virus strain in existence today. The Intracisternal A Particle (IAP) DNA family is defined by homology to RNA sequences that are actually present within non-functional retroviral-like particles found in the cytoplasm of some types of mouse cells. The IAP family is present in ~1000 copies (Lueders and Kuff, 1977), but very few of these copies can actually produce transcripts. Another retroviral-like DNA family is called VL30, which stands for Viral-Like 30S particles (Carter et al., 1986); there are approximately 200 copies of this element in the mouse genome (Courtney et al., 1982; Keshet and Itin, 1982). There is no reason to expect that additional retroviral-like families will not be uncovered through further genomic studies.

It is of evolutionary interest to ask the question: from where do retroviruses come? Retroviruses cannot propagate in the absence of cells, but cells can propagate in the absence of retroviruses. Thus, it seems extremely likely that retroviruses are derived from sequences that were originally present in the cell genome. The first retrovirus must have been able to free itself from the confines of the cell nucleus through an association with a small number of proteins that allowed it to coat, and thus protect, itself from the harsh extracellular environment. Of course, the protein most critical to the propagation of the retrovirus is the enzyme that allows it to reproduce — RNA-dependent DNA polymerase, commonly referred to as reverse transcriptase. But where did this enzyme come from? Reverse transcriptase catalyzes the production of single stranded complementary DNA molecules from an RNA template. This enzymatic activity does not appear to be required for any normal cellular process known in mammals! How could such an activity — without any apparent benefit to the host organism — arise de novo in a normal cell? One possible answer is that reverse transcriptase did not evolve for the benefit of the organism itself but, rather, for the benefit of selfish DNA elements within the genome that utilize the enzyme to propagate themselves within the confines of the genome as described in the next section.

5.4.2 The LINE-1 family

The mouse genome contains three independent families of dispersed repetitive DNA elements — called B1, B2, and LINE-1 (or L1) — that are each present at more than 80,000 chromosomal sites (Hasties, 1989). The general name coined for genomic elements of this type that disperse themselves through the genome by means of an RNA intermediate is retroposon. Of the three major retroposon families, it is only L1 that appears to be derived from a full-fledged selfish DNA sequence with a self-encoded reverse transcriptase. The mouse L1 DNA family is very old and homologous repetitive families have been found in a wide variety of organisms including protists and plants (Martin, 1991). Thus, LINE-related elements, or others of a similar nature, are likely to have been the source material that gave rise to retroviruses.

Full-length L1 elements have a length of 7 kb; however, the vast majority (>90%) of the ~100,000 L1 elements have truncated sequences which vary in length down to 500 bp (Martin, 1991). But, of the ~10,000 full-length L1 elements, only a few retain a completely functional reverse transcriptase gene which has not been inactivated by mutation. Thus, only a very small fraction of the L1 family members retain "transposition competence," and it is these that are responsible for dispersing new elements into the genome.

Dispersion to new positions in the germ line genome presumably begins with the transcription of competent L1 elements in spermatogenic or oogenic cells. The reverse transcriptase coding region on the L1 transcript is translated into enzyme that preferentially associates with and utilizes the transcript that it came from as a template to produce L1 cDNA sequences (Martin, 1991). For reasons that are unclear, it seems that the reverse transcriptase usually stops before a full-length copy is finished. These incomplete cDNA molecules are, nevertheless, capable of forming a second strand and integrating into the genome as truncated L1 elements that are forever dormant.

The L1 family appears to evolve by repeated episodic amplifications from one or a few progenitor elements, followed by the slow degradation of most new integrants — by genetic drift — into random sequence. Thus, at any point in time, a large fraction of the cross-hybridizing L1 elements in any one genome will be more similar to each other than to L1 elements in other species. In a sense, episodic amplification followed by general degradation is another mechanism of concerted evolution.

A large percentage of the mouse L1 elements share two EcoRI restriction sites located at a 1.3 kb distance from each other near the 3’ end of the full-length sequence. With its very high copy number, this 1.3 kb fragment is readily observed in (and, in fact, diagnostic of) EcoRI digests of total mouse genomic DNA that has been separated by agarose gel electrophoresis and subjected to staining by ethidium bromide. This high copy number EcoRI fragment was originally given other names, including MIF-1 and 1.3RI, before it was realized to be simply a portion of L1.

5.4.3 The major SINE families: B1 and B2

The two other major families of highly repetitive elements in the mouse — B1 and B2 — are both of the SINE type with relatively short repeat units of ~140 bp and ~190 bp in length respectively. The significance of this short repeat length is that it does not provide sufficient capacity for these elements to actually encode their own reverse transcriptase. Nevertheless, SINE elements are able to disperse themselves through the genome, just like LINE elements, by means of an RNA intermediate that undergoes reverse transcription. Clearly, SINEs are dependent on the availability of reverse transcriptase produced elsewhere, perhaps from L1 transcripts or endogenous retroviruses.

All SINE elements, in the mouse genome and elsewhere, appear to have evolved out of small cellular RNA species — most often tRNAs but also (in the case of mice and humans) the 7SL cytoplasmic RNA which is one of the components of the signal recognition particle (SRP) essential for protein translocation across the endoplasmic reticulum (Okada, 1991). Unlike the LINE families, however, SINE families present in the genomes of different organisms appear, for the most part, to have independent origins. The defining event in the evolution of a functional cellular RNA into an altered-function self-replicating SINE element is the accumulation of nucleotide changes in the 3’ region that lead to self-complementarity with the propensity to form hairpin loops. The open end of the hairpin loop can be recognized by reverse transcriptase as a primer for strand elongation. Since hairpin loop formation of this type is likely to be very rare among normal cellular RNAs, the SINE transcripts in a cell will be utilized preferentially as templates for the production of cDNA molecules that are able (somehow) to integrate into the genome at random sites. Like the L1 family, the B1 and B2 families appear to be evolving by episodic amplification followed by sequence degradation.

The B1 element is repeated ~150,000 times, and the B2 element is repeated ~90,000 times (Hasties, 1989); together these elements alone account for ~1.3% of the material in the Mus musculus genome. The B1 element is derived from a portion of the 7SL RNA gene whereas the B2 element appears to be derived (in a complicated fashion) from a tRNA-Lys gene. Human beings have just one family of SINE elements, referred to as the "Alu family," which is present in 500,000 copies and is also derived from 7SL RNA, although in an independent fashion from the mouse B1 element. Interestingly, the mouse genome does contain about 10,000 copies of a retroposon family that is closely related to the human Alu family (Hasties, 1989).
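The combined contribution of B1 and B2 to the genome follows directly from the copy numbers and repeat lengths already given. The short calculation below is a sketch that assumes every copy is full length and uses the 3 x 10^9 bp genome size from section 5.3.4; the variable names are illustrative.

```python
GENOME_BP = 3e9   # haploid genome size used elsewhere in this chapter

# family name -> (approximate copy number, approximate repeat length in bp)
SINE_FAMILIES = {
    "B1": (150_000, 140),
    "B2": (90_000, 190),
}

# Total bp contributed by each family, assuming every copy is full length.
family_bp = {name: n * length for name, (n, length) in SINE_FAMILIES.items()}
total_bp = sum(family_bp.values())
fraction = total_bp / GENOME_BP

if __name__ == "__main__":
    for name, bp in family_bp.items():
        print(f"{name}: {bp / 1e6:.1f} Mb")
    print(f"B1 + B2: {total_bp / 1e6:.1f} Mb = {fraction:.2%} of the genome")
```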

5.4.4 General comments on SINEs and LINEs

A number of other independent SINE families have been identified in the mouse genome, but none are present in more than 10,000 copies. One such family of 80 bp tRNA-derived elements called ID was originally found in the rat genome at a copy number of 200,000; in the mouse genome, there are only 10,000 ID copies. In addition, there are probably other minor SINE families in the mouse genome that have yet to be well-characterized.

The total mass of SINE and LINE elements probably accounts for less than 15% of the mouse genome. With the efficient means of self-dispersion that these elements employ, one can ask why they haven’t amplified themselves to even higher levels. The answer is almost certainly that if the amount of selfish DNA in a genome goes above a certain critical level, it will cause the host organism to be less fit and, thus, less likely to pass its selfish DNA load on to future generations. The existence of a critical ceiling means that the various SINE and LINE families are in direct competition for a limited amount of genomic real estate.

If one assumes that each of the major highly repetitive families — B1, B2 and L1 — is dispersed at random, it is a simple matter to calculate that, on average, a member of each family will be present in every 20 to 30 kb of DNA. In fact, if one screens a complete genomic library with 15 kb inserts in bacteriophage lambda with a probe for each family, 80% of the clones are found to contain B1 elements, 50% are found to contain B2 elements, and 20% are found to contain the central portion of the full-length L1 element (Hasties, 1989). However, if one analyzes individual clones for the presence of both SINE and LINE elements, there is a significant negative correlation. In other words, SINE and LINE elements appear to prefer different genomic domains.
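The average spacing quoted above is simply the genome size divided by the copy number of each family. The sketch below restates that calculation using the approximate copy numbers given earlier in this section (150,000 for B1, 90,000 for B2, 100,000 for L1) and the 3 x 10^9 bp genome size; the names are illustrative, and uniform random dispersion is the stated assumption.

```python
GENOME_BP = 3e9   # haploid genome size used elsewhere in this chapter

# Approximate copy numbers quoted in the text for each family.
COPY_NUMBER = {"B1": 150_000, "B2": 90_000, "L1": 100_000}

# Average spacing in kb, assuming elements are dispersed uniformly at random.
spacing_kb = {name: GENOME_BP / n / 1000 for name, n in COPY_NUMBER.items()}

if __name__ == "__main__":
    for name, kb in spacing_kb.items():
        print(f"{name}: one element per ~{kb:.0f} kb on average")
```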

To better understand the non-random distribution of the three major mouse repetitive families and to investigate possible correlations with chromosome structure, the karyotypic distribution of each family was investigated by fluorescence in situ hybridization or FISH (Boyle et al., 1990 and section 10.2). Incredibly, the distribution of the LINE-1 elements corresponds almost precisely with the distribution of the Giemsa stained dark G bands. In contrast, both SINE element families co-localize to the lightly stained chromosomal regions located between G bands (R bands). When the same type of experiment was performed on human karyotypes, the same result was obtained: the human SINE element Alu was found in R bands, whereas human LINE sequences co-localized with G bands (Korenberg and Rykowski, 1988). Since essentially all of the SINE and LINE elements integrated into the mouse and human genomes subsequent to their divergence from a common mammalian ancestor, the implication is that preferential integration into different chromosomal domains is a property of each element class.

The correlation between G and R bands and LINE and SINE distribution respectively is not perfect. Some chromosomal regions are observed to have an overabundance or underabundance of the associated sequences, and a small but significant fraction of elements are located outside the "correct" regions. One consistent exception to the general correlation is that centromeric heterochromatic regions, which normally stain brightly with Giemsa, do not have any detectable LINE elements in mice or humans. However, this exception can be easily understood in terms of the special structural role played by centromeric satellite DNA in chromosome segregation: any integration of a LINE element would disrupt this special DNA and its function, and this would be selected against evolutionarily.

As a final note, it should be mentioned that although the SINE and LINE elements have amplified themselves for selfish purposes, they have, in turn, had a profound impact on whole genome evolution. In particular, homologous elements located at nearby locations can, and will, act to catalyze unequal but "homologous" crossovers that result in the duplication of single copy genes located in-between and the initiation of gene cluster formation as illustrated in figure 5.5.

5.4.5 Genomic stutters: microsatellites, minisatellites, and macrosatellites

With large-scale sequencing and hybridization analyses of mammalian genomes came the frequent observation of tandem repeats of DNA sequences, without any apparent function, scattered throughout the genome. The repeating unit can be as short as two nucleotides (CACACACA etc.), or as long as 20 kb. The number of tandem repeats can also vary from as few as two to as many as several hundred. The mechanism by which tandem repeat loci originate may be different for loci having very short repeat units as compared to those with longer repeat units. Tandem repeats of short di- or trinucleotides can originate through random changes in non-functional sequences. In contrast, the initial duplication of larger repeat units is likely to be a consequence of unequal crossing over. Once two or more copies of a repeat unit (whether long or short) exist in tandem, unequal pairing followed by crossing over can lead to an increase in the number of repeat units in subsequent generations (see figure 8.4). Whether stochastic mechanisms alone can account for the rich variety of tandem repeat loci that exist in the genome or whether other selective forces are at play is not clear at the present time. In any case, tandem repeat loci continue to be highly susceptible to unequal crossovers and, as a result, they tend to be highly polymorphic in terms of overall locus size.

Tandem repeat loci are classified according to both the size of the individual repeat unit and the length of the whole repeat cluster. The smallest and simplest, with repeat units of one to four bases and locus sizes of less than 100 bp, are called microsatellites. The use of microsatellites as genetic markers has revolutionized the entire field of mammalian genetics (see section 8.3.6). Next come the minisatellites with repeat units of 10 to 40 bp and locus sizes that vary from several hundred base pairs to several kilobases (see section 8.2.3). Tandem repeat loci of other sizes do not appear to be as common, but a great variety are scattered throughout the genome. The term midisatellite has been proposed for loci containing 40 bp repeat units that extend over distances of 250 to 500 kb, and macrosatellites has been proposed as the term to describe loci with large repeat units of 3 to 20 kb present in clusters that extend over 800 kb (Giacalone et al., 1992). However, the use of arbitrary size boundaries to "define" these other types of loci is probably not meaningful since it appears that, in reality, no such boundaries exist in the potential for tandem repeat loci to form in the mouse and other mammalian genomes.

5.5 Genomic imprinting

5.5.1 Overview

From the birth of the field of genetics until a decade ago, it was generally assumed that the parental origin of a gene could have no effect on its function. In the vast majority of studies carried out during the last 90 years, this paradigm has appeared to hold true. However, with increasingly sophisticated genetic and embryological investigations in the mouse, important exceptions to this rule have been uncovered over the last decade. First, the results of nuclear transplantation experiments carried out with single-cell fertilized embryos have demonstrated an absolute requirement for both a maternally derived and a paternally derived pronucleus to allow full-term development (McGrath and Solter, 1983). Second, in animals that receive both homologs of certain chromosomes or subchromosomal regions from one parent and not the other (through the mating of translocation heterozygotes as described in section 5.2.3), dramatic effects on development can be observed, including enhanced or retarded growth and outright lethality (Cattanach and Kirk, 1985). Third, either of two deletions that cover a small region of mouse chromosome 17 can be transmitted normally from a father to his offspring, but these same deletions cause prenatal lethality when they are maternally transmitted (Johnson, 1974; Winking and Silver, 1984). Fourth, similar parent-of-origin effects have been observed on the phenotypes expressed by animals that carry a targeted knock-out allele at the Igf2 locus (DeChiara et al., 1991). Finally, molecular techniques have been used to directly demonstrate the expression of transcripts from one parental allele and not the other at the Igf2r locus (Barlow et al., 1991) and the H19 locus (Bartolomei et al., 1991).

The accumulated data indicate that a subset of mouse genes will function differently in normal embryos depending on whether they have been inherited through the male or the female gamete such that one allele will be expressed and the other will be silent. Genomic imprinting is the term that has been coined to describe this situation in which the phenotype expressed by a gene varies depending on its parental origin (Sapienza, 1989). Further experiments have demonstrated that, in general, the "imprint" is erased and re-generated during gametogenesis so that the function of an imprintable gene is fully determined by the sex of its progenitor alone, and not by earlier ancestors.

With the demonstration of genomic imprinting in the mouse, patterns of disease inheritance in humans have been investigated for the possibility of parent-of-origin-determined phenotypes in this mammal as well. To date, clear-cut parental effects have been uncovered in the transmission of the juvenile form of Huntington disease (Ridley et al., 1991), Prader-Willi and Angelman deletion syndromes (Nichols et al., 1989) and certain forms of juvenile familial carcinomas such as multifocal retinoblastoma, Wilms' tumor, embryonal rhabdomyosarcoma and Beckwith-Wiedemann syndrome (Ferguson-Smith et al., 1990; Henry et al., 1991; Sapienza, 1991).

5.5.2 Why is there imprinting?

The first explanation for the existence of imprinting was as a mechanism to prevent the full-term development of parthenogenetic embryos. This explanation was never satisfactory because it did not account for the intricate control of imprinting at multiple well-bounded loci. An alternative hypothesis put forward by Haig and his colleagues is based on a tug-of-war between the sexes (Haig and Graham, 1991; Moore and Haig, 1991). According to this hypothesis, it is in the interest of a male to attempt to recover more maternal resources for his developing offspring in relation to offspring in the same mother that were sired by other males. This can be accomplished with a paternal imprint that down-regulates the expression of genes that normally act to slow down the growth of embryos. As a consequence, embryos that are sired by these males will grow more rapidly than half-siblings sired by other males. Although overgrowth may be beneficial to these offspring, it exacts a heavy reproductive cost from the mother. Consequently, it is in the interest of the mother to counteract this increased level of growth. She can do this with an imprint that down-regulates the relevant growth factor genes themselves. The evolutionary endpoint of this tug-of-war is the current-day situation where genes that act to increase embryonic growth (such as Igf2) have inactivated maternal alleles, and genes that act to limit growth (such as Igf2r) have inactivated paternal alleles.

The only other currently viable hypothesis to explain imprinting is that it results from the accidental, ectopic use of machinery that has evolved for the really important imprinting associated with X chromosome inactivation. According to this hypothesis, autosomal imprinting is a red herring whose study is unlikely to provide information of significance to an understanding of developmental genetics. The major strike against this hypothesis is dealt by selectionists who would contend that genetic accidents of this magnitude just do not happen, and there must be something peculiar about mammals that has promoted the evolution of imprinting. In support of the selectionist view is the recent demonstration of mono-allelic expression of the H19 gene in humans (Zhang and Tycko, 1992). Conservation of imprinting during the evolution of both humans and mice from a common ancestor strongly suggests the existence of selective forces. Nevertheless, it is still possible that the Haig hypothesis is not entirely correct and that other reasons for imprinting lie hidden beneath the surface waiting to be uncovered.

5.5.3 The molecular basis for imprinting.

The question "how does it happen?" can be easily separated from the question "why does it happen?" However, here again, our understanding is still quite rudimentary. Figure 5.8 illustrates in a very general way the essential requirements of a paternal imprinting system. Both parents have one imprinted allele (derived from their fathers) and one active allele (derived from their mothers). During oogenesis, the imprint must be erased so that all eggs will contain equivalent alleles that can become activated in all offspring. In contrast, at the completion of spermatogenesis in the father, all sperm will contain alleles that are "marked" for imprinting. It is possible that the mark present on one of the father’s alleles is erased and both copies are marked de novo in all spermatogenic cells, or the one imprinted copy may retain its mark, with de novo marking applied only at the second copy. In either case, the new embryo will receive one "marked" gene from the father and one non-marked gene from the mother.
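The erase-and-reset logic of a paternal imprinting system can be sketched as a toy model. Everything here is hypothetical scaffolding for illustration (the `Allele` class, the function names, the allele labels); it encodes only the rule stated in the text, that every allele leaving a male is marked and every allele leaving a female is unmarked, and it says nothing about what the mark is at the molecular level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allele:
    name: str
    marked: bool  # True = carries the paternal mark and is silent


def make_gamete(allele, parent_sex):
    """Erase any existing mark and reset it according to the parent's sex.

    Sperm alleles always leave marked; egg alleles always leave unmarked,
    whatever state the allele was in within the parent.
    """
    if parent_sex not in ("male", "female"):
        raise ValueError("parent_sex must be 'male' or 'female'")
    return Allele(allele.name, marked=(parent_sex == "male"))


def expressed_alleles(maternal, paternal):
    """Return the names of the alleles that are active in an embryo."""
    return [a.name for a in (maternal, paternal) if not a.marked]


if __name__ == "__main__":
    # A grandfather transmits allele "A" to his daughter: it arrives marked.
    a_in_daughter = make_gamete(Allele("A", marked=False), "male")
    b_in_daughter = make_gamete(Allele("B", marked=False), "female")
    print(expressed_alleles(b_in_daughter, a_in_daughter))

    # The daughter passes "A" on through an egg: the mark is erased.
    a_in_grandchild = make_gamete(a_in_daughter, "female")
    c_in_grandchild = make_gamete(Allele("C", marked=False), "male")
    print(expressed_alleles(a_in_grandchild, c_in_grandchild))
```

In this sketch allele "A" is silent in the daughter and active in the grandchild, so its state depends only on the sex of the parent that most recently transmitted it, not on earlier ancestors.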

The "mark" may itself be replicated faithfully along with its homolog, and the "mark" may itself be responsible for the actual repression of gene activity. On the other hand, the "mark" may simply identify the paternal allele so that a separate imprinting machinery that acts to repress gene activity can be laid down within the developing embryo. If there is a separate imprinting machinery, either it or the "mark" could be replicated along with the paternal homolog to maintain the imprint through each cell division.

It is still the case in 1993 that the nature of both the mark and the imprinting machinery (if it exists as a separate entity) is entirely unknown. Both could presumably entail direct chemical modifications of DNA and/or specific protein components (that might lead to changes in the local chromatin configuration). In addition, the specific DNA sequences that must be recognized by the gametogenic marker are also unknown at this time.