Restriction fragment length polymorphisms (RFLPs) of the two genes encoding the type I collagen polypeptides are turning out to be valuable anthropogenetic markers [for a review see Pepe et al. (1990, 1994) and Kuivaniemi et al. (1991)] in addition to being potentially useful in prenatal diagnosis of dominant osteogenesis imperfecta (OI). Dominant OI affects about 1 in 10,000 individuals. OI is also called brittle bone disease (McKusick 1972), and the main clinical manifestations are long-bone fractures, presenile hearing loss, dentinogenesis imperfecta, and blue sclerae (Sillence et al. 1979). Most, if not all, OI is caused by mutations in one of the two genes (COL1A1 and COL1A2) that encode type I procollagen (Kuivaniemi et al. 1991; Byers 1990), the major collagen of the bone and most other connective tissues.
The population of Sardinia, the second largest island in the Mediterranean Sea, is known to display peculiar gene frequencies for most of the traditional markers with respect to not only white populations on the whole but also populations of other Italian regions (Modiano et al. 1986; Piazza et al. 1989). This is particularly well proved because the Sardinian population has been exhaustively studied for many markers. Of the 50 adequately studied markers, including blood group systems (ABO, MNS, RH, FY, LU, P), red cell enzyme polymorphisms (ACP1, PGM1, AK1, ADA, GLO1, ESD, PGP, GPT, UMPK), serological markers (GM and KM), and the HLA-A, -B, and -C histocompatibility systems, Sardinians are clearly outliers for 33 markers and show borderline values with respect to the Italian frequencies for an additional 7 markers [for a review concerning not only genetic but also other biological aspects of Sardinians as well as their history and demography, see Modiano et al. (1986); see also Piazza et al. (1989)]. It seemed justified, therefore, to extend the genetic studies of the COL1A2 gene to this population, not so much to further characterize it (a mere addition of just one new genetic system to the long list of those already studied in detail was not expected to improve substantially the high degree of knowledge already attained about Sardinians) but to look for peculiarities of this particular genetic system. In other words, Sardinians have been the tool rather than the object of the present study, which has been that of better defining the potential usefulness of the COL1A2 genetic system as an anthropogenetic marker.
Two approaches were adopted to study the COL1A1 and COL1A2 genes in Sardinians. The more traditional one was to determine the gene and haplotype frequencies for three already known COL1A2 RFLPs so that the Sardinian frequencies could be compared with those of other white (particularly Italian) populations. The second approach consisted of a search for new COL1A polymorphisms on 11 restriction sites (6 on the COL1A1 gene and 5 on the COL1A2 gene, each defined by a restriction enzyme and by a probe), which turned out to be monomorphic when tested on white populations.
We collected triplets (parents and child) instead of random individuals for two main reasons: (1) to improve the reliability of the haplotype frequency estimates, because the haplotypes were directly counted; and (2) to be able to test the inheritance of new RFLPs, if any were found.
The expected outcome of the present investigation was at the minimum to further characterize the Sardinian population and hopefully to discover one or more new DNA polymorphisms.
MATERIALS AND METHODS
Subjects. The sample is made up of 38 triplets (father, mother, and child), each yielding two random subjects. Both parents are Sardinians. Because a number of families were gathered for studies of thalassemia in the Pediatric Clinic of the University of Sassari (Northwestern Sardinia), we took advantage of the availability of such family material. It appears justified to consider these samples as random samples with respect to the collagen genes.
Procedures. DNA was extracted from whole blood according to standard procedures (Sambrook et al. 1989). For a detailed description of Southern blot analysis and the polymerase chain reaction (PCR) technique and its application to the analysis of the EcoRI, RsaI, and MspI RFLPs of the COL1A2 gene, see Pepe et al. (1990, 1994). The enzyme-probe systems explored and adopted when searching for new COL1A1 or COL1A2 RFLPs are listed in Table 1. (Table 1 omitted)
Estimates of the Gene Frequency for the Three RFLPs under Study. The gene frequencies for the RFLPs were determined by direct gene counting. In the two testable RFLPs the genotype distribution was compatible with that expected from Hardy-Weinberg equilibrium (Table 2). (Table 2 omitted)
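Direct gene counting and the Hardy-Weinberg check can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `hardy_weinberg_chi2` is introduced here, and the genotype counts in the call are hypothetical placeholders because Table 2 is not reproduced in this text.

```python
def hardy_weinberg_chi2(n_pp, n_pm, n_mm):
    """Gene counting plus a chi-square goodness-of-fit test for one two-allele site.

    n_pp, n_pm, n_mm: numbers of +/+, +/-, and -/- individuals.
    Returns (frequency of the '+' allele, chi-square statistic with 1 df).
    A monomorphic site (frequency 0 or 1) cannot be tested this way.
    """
    n = n_pp + n_pm + n_mm
    # Each +/+ individual carries two '+' alleles, each +/- individual one.
    p = (2 * n_pp + n_pm) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_pp, n_pm, n_mm)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, chi2


# HYPOTHETICAL counts for 76 unrelated parents; not the values of Table 2.
p_plus, chi2 = hardy_weinberg_chi2(2, 20, 54)
print(p_plus, chi2)
```

A statistic below 3.84 (the 5% critical value for one degree of freedom) is compatible with Hardy-Weinberg equilibrium.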
Estimates of COL1A2 Haplotype Frequencies (by Direct Counting). Consider the following family triplet: father's phenotype, E+/+, R+/-; mother's phenotype, E+/-, R+/-; child's phenotype, E+/+, R+/+. The father's haplotypes can be deduced directly from his phenotype; they are E+R+/E+R-. This is not the case for the mother, whose phenotype is compatible both with E+R+/E-R- and E+R-/E-R+. However, the child's phenotype allows one to unambiguously infer the mother's genotype: Because the child's genotype is E+R+/E+R+, the haplotype of the gamete the child received from the mother must have been E+R+; thus the genotype of the mother is E+R+/E-R- (it should be recalled that by definition no recombination occurs between sites of the same haplotype, as in the case of E and R so close to each other as to belong to the same gene). In this particular example all four haplotypes under examination (two from the father and two from the mother) could be unambiguously identified and counted. The few haplotypes that could not be identified even through the child (triplets where father, mother, and child are all double heterozygotes) were excluded. The same procedure leads to the identification and count of the eight haplotypes theoretically possible when considering the three sites simultaneously (see the three-marker set in Table 3). (Table 3 omitted)
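The reasoning in the triplet example can be sketched in code. This is a minimal illustration that assumes no recombination between the sites and no mutation; the helper names `phasings` and `resolve_trio` are introduced here and do not come from the original methods. Haplotypes are written as strings with the E allele first and the R allele second, so "++" stands for E+R+.

```python
from itertools import product


def phasings(genotype):
    """Return the set of haplotype pairs compatible with an unphased genotype.

    genotype: one two-character string per site, e.g. ["+-", "+-"].
    Each haplotype is a string with one allele per site; each pair is a
    sorted tuple so that the two orders of the same pair coincide.
    """
    pairs = set()
    for orient in product(*[(g, g[::-1]) for g in genotype]):
        h1 = "".join(o[0] for o in orient)
        h2 = "".join(o[1] for o in orient)
        pairs.add(tuple(sorted((h1, h2))))
    return pairs


def resolve_trio(father, mother, child):
    """Return every (father_pair, mother_pair) consistent with the child."""
    child_pairs = phasings(child)
    solutions = set()
    for f in phasings(father):
        for m in phasings(mother):
            # The child receives one intact haplotype from each parent.
            if any(tuple(sorted((hf, hm))) in child_pairs
                   for hf in f for hm in m):
                solutions.add((f, m))
    return solutions


# The triplet discussed in the text: sites are listed as [E, R].
example = resolve_trio(["++", "+-"], ["+-", "+-"], ["++", "++"])
# A triplet of three double heterozygotes, which cannot be resolved.
ambiguous = resolve_trio(["+-", "+-"], ["+-", "+-"], ["+-", "+-"])
print(example)
print(len(ambiguous))
```

A triplet is countable when exactly one solution remains; the all-double-heterozygote triplet leaves two, which is why such families were excluded.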
Absolute Value of Linkage Disequilibrium. For each pair of markers $D_{\mathrm{abs}}$ is the absolute difference between the observed and the expected relative frequency of any one of the four possible haplotypes. In fact, this difference is necessarily the same for all of them (once the frequency of the four alleles is set, the system has only one degree of freedom).
Let us consider, as an example, the E-M pair of markers. Table 4 shows that $D_{\mathrm{abs}} = 0.020$. This is the difference between the observed frequency of E+M+ (20/146 = 0.136986; see Table 3) and its expected frequency, namely, the equilibrium frequency. (Table 4 omitted) The expected frequency is simply the product of the relative frequencies of E+ and M+, the two alleles that the E+M+ haplotype is made of. The E+ frequency is 0.164384 [24/146, because (20 + 4) of the 146 haplotypes examined contain this allele]; the M+ frequency is 0.952055 (139/146). Their product is $0.164384 \times 0.952055 = 0.156503$. Therefore $D_{\mathrm{abs}} = |0.136986 - 0.156503| = 0.019517 \approx 0.020$.
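The arithmetic above can be checked with a short sketch. It uses the four E-M haplotype counts quoted in the text from Table 3 (E+M+ = 20, E+M- = 4, E-M+ = 119, E-M- = 3); the function name `deviations` is introduced here for illustration only.

```python
def deviations(counts):
    """Observed minus expected relative frequency for each two-site haplotype.

    counts maps haplotype labels such as "E+M-" to haplotype counts; the
    first two characters name the E allele and the last two the M allele.
    """
    n = sum(counts.values())
    dev = {}
    for hap, c in counts.items():
        e_allele, m_allele = hap[:2], hap[2:]
        f_e = sum(v for k, v in counts.items() if k[:2] == e_allele) / n
        f_m = sum(v for k, v in counts.items() if k[2:] == m_allele) / n
        # Under equilibrium the haplotype frequency is the product f_e * f_m.
        dev[hap] = c / n - f_e * f_m
    return dev


counts = {"E+M+": 20, "E+M-": 4, "E-M+": 119, "E-M-": 3}
dev = deviations(counts)
d_abs = abs(dev["E+M+"])
print(dev)
print(round(d_abs, 3))
```

The four deviations differ only in sign, which illustrates why the system has a single degree of freedom once the allele frequencies are fixed.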
Relative D Value. $D_{\mathrm{abs}}$ alone is not sufficient to describe adequately the degree of disequilibrium of a system. To do so, a further fundamental value is needed, the so-called $D_{\mathrm{rel}}$. $D_{\mathrm{rel}}$ is the ratio between the observed $D_{\mathrm{abs}}$ and the highest possible D ($D_{\max}$) in the same direction. In fact, in every two-marker system (A1 and A2; B1 and B2) there may be two $D_{\max}$ values: $D_{\max}$ type I, in which the least common of the four alleles (say, B2) is always associated with the less common of the two alleles of the other marker (say, A2); and $D_{\max}$ type II, which is the opposite case (B2 is always associated with A1). Obviously, $D_{\max}$ type I is larger than $D_{\max}$ type II to the same extent as the frequency of A2 is smaller than that of A1 [see also Thompson et al. (1988) and Pepe et al. (1990)].
Let us now go back to the E-M system of Table 3. The observed disequilibrium is a type I disequilibrium because the least common of its four alleles, M- (frequency = 0.047945), is preferentially associated with the less common of the two E alleles, E+ (frequency = 0.164384). The frequency of the E+M- haplotype is 0.027397 (4/146; see Table 3), whereas its expected frequency is $0.164384 \times 0.047945 = 0.007881$. The $D_{\max}$ type I would correspond to the constant rather than the preferential association of M- with E+ (namely, the frequency of E+M- coincides with that of M-, 0.047945). Thus the observed $D_{\mathrm{abs}}$ is $0.027397 - 0.007881 = 0.019516 \approx 0.020$ (see the two-marker set in Table 3); the maximum possible D in the same direction, $D_{\max}$ type I, is $0.047945 - 0.007881 = 0.040064 \approx 0.040$ (see Table 4); and $D_{\mathrm{rel}}$, namely, the ratio between the observed $D_{\mathrm{abs}}$ and $D_{\max}$ in the same direction ($D_{\max}$ type I), is $0.019516/0.040064 = 0.487$ (see Table 4).
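The same steps can be reproduced numerically. The sketch below follows the arithmetic of this paragraph for the E-M example only (it is not a general routine for both types of disequilibrium) and uses the haplotype counts quoted in the text from Table 3; the variable names are illustrative.

```python
# E-M haplotype counts quoted in the text (Table 3).
counts = {"E+M+": 20, "E+M-": 4, "E-M+": 119, "E-M-": 3}
n = sum(counts.values())

# Allele frequencies obtained by direct counting.
freq_e_plus = (counts["E+M+"] + counts["E+M-"]) / n
freq_m_minus = (counts["E+M-"] + counts["E-M-"]) / n

# Observed and equilibrium frequency of the E+M- haplotype.
observed = counts["E+M-"] / n
expected = freq_e_plus * freq_m_minus

# A positive deviation means M- is preferentially found with E+ (type I).
d_abs = observed - expected

# Type I maximum: every M- allele sits on an E+ haplotype, so the
# E+M- frequency would equal the M- frequency.
d_max_type1 = freq_m_minus - expected

d_rel = d_abs / d_max_type1
print(round(d_abs, 3), round(d_max_type1, 3), round(d_rel, 3))
```

Working from the counts rather than from rounded frequencies avoids small discrepancies in the last decimal place.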
Statistical Significance of the Observed Linkage Disequilibrium. Statistical significance of $D_{\mathrm{abs}}$ is tested with a chi-square test with one degree of freedom. In the two-marker system we are using as an example, the observed absolute frequencies of the four E-M haplotypes (Table 3) are E+M+ = 20; E+M- = 4; E-M+ = 119; E-M- = 3. Their expected frequencies are E+M+ = $(0.164384 \times 0.952055) \times 146 = 22.849$; E+M- = 1.151; E-M+ = 116.151; E-M- = 5.849. As discussed, the absolute difference between the expected and the observed frequency is the same for all four haplotypes (2.849 in the present case). Therefore the chi-square value is $\chi^2 = (2.849)^2/22.849 + (2.349)^2/1.151 + (2.849)^2/116.151 + (2.849)^2/5.849 = 6.61$ (the Yates correction was used for the second component of this sum), which corresponds to p
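The chi-square sum above can be recomputed from the counts. In this sketch the continuity correction is applied only to the E+M- component, mirroring the arithmetic in the text rather than a general recommendation; the function name `ld_chi_square` and its `yates_cells` parameter are introduced here for illustration. The tail probability uses the identity that, for one degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(x/2)).

```python
import math


def ld_chi_square(counts, yates_cells=()):
    """Chi-square statistic for a 2 x 2 table of two-site haplotype counts.

    counts maps labels such as "E+M-" to counts (first two characters: E
    allele, last two: M allele). Cells named in yates_cells have 0.5
    subtracted from |observed - expected| before squaring.
    """
    n = sum(counts.values())
    chi2 = 0.0
    for hap, obs in counts.items():
        row = sum(v for k, v in counts.items() if k[:2] == hap[:2])
        col = sum(v for k, v in counts.items() if k[2:] == hap[2:])
        exp = row * col / n
        diff = abs(obs - exp)
        if hap in yates_cells:
            diff -= 0.5
        chi2 += diff ** 2 / exp
    return chi2


counts = {"E+M+": 20, "E+M-": 4, "E-M+": 119, "E-M-": 3}
chi2 = ld_chi_square(counts, yates_cells={"E+M-"})
# Upper-tail probability of a chi-square variable with 1 degree of freedom.
p_value = math.erfc(math.sqrt(chi2 / 2))
print(round(chi2, 2), p_value)
```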
RESULTS
None of the restriction enzyme-probe systems used in the present survey (Table 1) revealed a new RFLP.
The phenotype and gene frequencies for the COL1A2 RFLPs of this study are presented in Table 2. All three are in Hardy-Weinberg equilibrium.
Tables 5 and 6 deal with the gene and the haplotype analyses of the data, respectively. (Tables 5 and 6 omitted)
It is convenient to deal with the different types of comparisons, namely, those concerning gene frequencies, haplotype frequencies, and linkage disequilibrium values, separately.
Gene Frequency Comparisons. The gene frequencies are summarized in Table 5. Clearly, Sardinians are different from both Italians and whites on the whole for EcoRI and from some white populations for MspI.
Haplotype Frequency Comparisons. The haplotype frequencies are presented in Table 6.
Sardinia shows an almost threefold and highly significant difference from Calabria for the E+R+ haplotype and a much smaller and barely significant difference for the E-R+ haplotype. Clear-cut and highly significant differences have been found between Sardinia and Calabria for three of the four E-M haplotypes. Also, for the R-M set of markers Sardinia is different from Calabria.
It is clear that these three two-marker sets are efficient in distinguishing Sardinia from Calabria, even though, because they are not independently assorted, their combined efficiency is somewhat less than the sum of the single efficiencies (at any rate, it is at least as large as that of the most efficient of the single markers and, as a rule, much larger than that). The extent of such a loss with respect to the optimal situation corresponding to a random assortment of the alleles of the RFLPs in the haplotypes varies depending on the degree of disequilibrium D.
The effect of the disequilibrium is even more evident when haplotypes are considered at the three-marker rather than at the two-marker level. In the three-marker case the risk of using redundant information is high, except when the sample sizes are large enough to allow comparisons of different populations for allele frequency distributions of one RFLP within each of the four haplotypes of the remaining two RFLPs. Clearly, this is not yet the case for the COL1A2 system; thus no comparisons have been made with the present data at the three-marker haplotype level.
As for linkage disequilibrium, on the whole the present data are substantially in agreement with data on the two Italian populations thus far examined for all three two-marker haplotypes. In fact, for the E-R set of markers a D of the same type as that found in Calabria and central Italy was observed in the present sample, although in this case it is practically nil. For the E-M marker set the concordance between northern Sardinians and the other two samples is even more evident because they all exhibit a similar and significant disequilibrium of the same type. Finally, for the R-M marker set all three samples are compatible with equilibrium. It would be interesting to know whether this apparent lack of the R-M disequilibrium that is evident in Northern Europeans (Borresen et al. 1988; Sykes et al. 1986) is an Italian feature or whether it also involves other Mediterranean populations.
As expected, the analysis of the data at the haplotype level makes the COL1A2 system a much more valuable anthropogenetic marker than if it were studied simply at the single-marker level. However, these comparisons were possible only with the Italian samples because the phenotype frequencies were available. This demonstrates the necessity to include the haplotype frequency estimates or the original phenotype frequencies.
The notion that the analysis of multimarker systems at the haplotype rather than at a single-marker level may result in a dramatic improvement in their anthropogenetic discriminative efficiency is old and well known. This is best illustrated by the RH system [Race and Sanger (1975 and previous editions); Mourant et al. 1976], where the R0 (Dce) haplotype is usually found at frequencies of 0.02-0.05 among whites and of 0.5 or even higher in African populations, whereas the frequencies of the individual D, c, and e alleles are much less different between the two groups. Another illustration is the GM system, where alleles of the A and B series are always found in the same haplotype among African populations and never in the same haplotype among white populations (Steinberg and Cook 1981). It appears paradoxical, then, that anthropogenetic studies at the haplotype level became less and less common, indeed completely out of fashion, particularly over the last 10-15 years, the very period when the discovery of a host of multimarker genetic systems (RFLPs of the same genes or gene clusters) was taking place. The few serological and protein multimarker systems known before the advent of the DNA era (RH, MNS, KEL, GM, PGM1, and a few others) have been adequately exploited and have provided many extremely relevant anthropogenetic data, whereas virtually all the DNA markers known still wait to be utilized at the haplotype level (with the notable exceptions of the mtDNA and Y-chromosome-specific RFLPs). The time has come to reverse this trend.
Acknowledgments. This work was supported by the Italian National Research Council [(CNR), Targeted Project "Prevention and Control Disease Factors"; Subproject (07) "Disease Factors in Mother-Infant Pathology"] under grant 92.00153.PF41 awarded to G. Pepe.
KEY WORDS: RFLP, HAPLOTYPES, SARDINIA, TYPE I COLLAGEN.
LITERATURE CITED
Borresen, A.L., P. Moller, and K. Berg. 1988. Linkage disequilibrium analyses and restriction mapping of four RFLPs at the pro alpha2(1) collagen locus: Lack of correlation between linkage disequilibrium and physical distance. Hum. Genet. 78:216-221.
Byers, P.H. 1990. Brittle bones, fragile molecules: Disorders of collagen gene structure and expression. Trends Genet. 6:293-300.
Chu, M.-L., W. de Wet, M. Bernard et al. 1984. Human pro alpha1(I) collagen gene structure reveals evolutionary conservation of a pattern of introns and exons. Nature 310:337-340.
Kuivaniemi, H., G. Tromp, and D.J. Prockop. 1991. Mutations in collagen genes: Causes of rare and some common diseases in humans. FASEB J. 5:2052-2060.
Kuivaniemi, H., G. Tromp, M.-L. Chu et al. 1988. Structure of a full length cDNA clone for the prepro-alpha2(I) chain of human type I procollagen. Biochem. J. 252:633-640.
McKusick, V.A. 1972. Heritable Disorders of Connective Tissue, 4th ed. St. Louis, MO: Mosby.
Modiano, G., L. Terrenato, R. Scozzari et al. 1986. Population genetics in Sardinia. Atti Accad. Naz. Lincei 4:257-330.
Mourant, A.E., A.C. Kopec, and K. Domaniewska-Sobczak. 1976. The Distribution of the Human Blood Groups and Other Polymorphisms. London, England: Oxford University Press.
Pepe, G., M. Muglia, C. Brancati et al. 1990. Studies of four restriction fragment length polymorphisms of the type I collagen genes in two Italian populations. Hum. Hered. 40:369-380.
Pepe, G., O. Rickards, C. Bue et al. 1994. EcoRI, RsaI, and MspI RFLPs of the COL1A2 gene (type I collagen) in the Cayapa, a native American population of Ecuador. Hum. Biol. 66(6) (in press).
Piazza, A., E. Olivetti, M. Barbanti et al. 1989. The distribution of some polymorphisms in Italy. Gene Geogr. 3:69-139.
Race, R.R., and R. Sanger. 1975. Blood Groups in Man, 6th ed. Oxford, England: Blackwell Scientific Publications.
Sambrook, J., E.F. Fritsch, and T. Maniatis. 1989. Molecular Cloning: A Laboratory Manual. New York: Cold Spring Harbor Laboratory Press.
Sillence, D.O., A. Senn, and D.M. Danks. 1979. Genetic heterogeneity in osteogenesis imperfecta. J. Med. Genet. 16:101-116.
Steinberg, A.G., and C.E. Cook. 1981. The Distribution of the Human Immunoglobulin Allotypes. Oxford Monographs on Medical Genetics. Oxford, England: Oxford University Press.
Sykes, B., D. Ogilvie, P. Wordsworth et al. 1986. Osteogenesis imperfecta is linked to both type I collagen structural genes. Lancet 2:69-72.
Tagarelli, A., L. Bastone, R. Cittadella et al. 1991. Glucose-6-phosphate dehydrogenase (G6PD) in southern Italy: A study on the population of the Cosenza province. Gene Geogr. 5(3):141-150.
Thompson, E.A., S. Deeb, D. Walker et al. 1988. The detection of linkage disequilibrium between closely linked marker RFLPs at the AI-CIII apolipoprotein genes. Am. J. Hum. Genet. 42:113-124.
Tsipouras, P., J.C. Myers, F. Ramirez et al. 1983. Restriction fragment length polymorphism associated with an autosomal dominant form of osteogenesis imperfecta. J. Clin. Invest. 72:1262-1267.
G. PEPE,(1) C. BUE,(2) P. ORLANDI,(1) F. ENA,(2) T. MELONI,(2) AND G. MODIANO(1)
1. Dipartimento di Biologia, Università di Roma "Tor Vergata," Via della Ricerca Scientifica 1, 00133 La Romanina, Rome, Italy.
2. Dipartimento di Pediatria "A. Filia," Università di Sassari, Sassari, Italy.
Address correspondence to Guglielmina Pepe.
Copyright Wayne State University Press Aug 1994