This file is "readme" for chirxc.exe and chihw.exe, described in Zaykin, D.V. and Pudovkin, A.I. 1993. Two programs to estimate significance of Chi-square values using pseudo-probability test. - J.Hered., 84(2), p. 152. The authors, would appreciate receiving any comments on the programs, algorithms and the theory. Our addresses: Alexander Pudovkin: Institute of Marine Biology, Vladivostok 690041, Russia. email: apud@biom.marine.su Dmitri Zaykin: North Carolina State Univ, Statistics Dept, Box 8203, Raleigh, NC 27695-8203. email: zaykin@statgen.ncsu.edu Both programs are written in Borland C++, ver. 3.1. To get short instructions on operation of the programs type in the name of one of the programs from the command line prompt and press "Enter". Monte-Carlo probabilities reported by the programs are just "p-values". That is, they are (frequentist cumulative) probabilities FOR the null hypothesis rather than probabilities OF the hypothesis. CHIRXC: The program performs homogeneity tests for RxC contingency tables and estimates the probability of the data under the null-hypothesis (homogeneity) using Monte Carlo simulation as suggested by Roff,D.A. and Bentzen,P. - Mol.Biol.Evol., 6(5):539-545, 1990. In short, the program does the following. It generates random RxC tables with the same marginal totals as in the original RxC table under analysis. For each generated table chi-square (X2) is computed and compared with chi-square for the original table (X2_or). The result of Monte Carlo simulation is ratio of numbers of generated tables, for which X2 > or = X2_or, to the total number of generated tables ("number of runs"). This ratio is regarded as the probability of the data under null hypothesis for the original RxC table. For 2x2 tables the program offers computation of Fisher's exact probability test as well. It is necessary to note that p obtained through the exact probability test by Fisher does not fully correspond to the p obtained by the pseudoprobability test. The pseudoprobability test is based on the values of chi-square for randomly produced 2x2 tables and p equals the ratio of the tables for which chi-square is larger than that for original table. But two tables (with equal marginal totals) may have identical chi-squares (therefore the identical results of pseudoprobability test) but their p-values obtained by the exact Fisher's test may be different. Hence the discrepancy of the tests. In the region when p approximately equals 1/2 the discrepancy may be substantial, but in the region of critical importance (p is ca 1/20) the both tests are concordant. For example, we have two 2x2-tables with identical marginal totals: 10 10 23 12 X2 = 1.30952 both tails probability: 0.2702 14 6 19 16 X2 = 1.30952 both tails probability: 0.3912 Chi-squares for both of them are equal; thus the pseudoprobability test will yield identical results. But the exact test gives different probabilities: 0.2702 and 0.3912. The program permits two input formats: 1) conventional RxC contingency table; a) input from the keyboard b) input from a file 2) table of allele frequencies and sample sizes (number of individuals); Input from a file only. To operate the program one must use the command: CHIRXC INPT OUTPT Here INPT and OUTPT mean the names of input and output files (the latter is optional). When the table of numbers you are analyzing is small, you may enter it directly from the keyboard. In this case the command should be: CHIRXC OUTPT /S The explanations for entering and editing the screen will be displayed during execution. If the table under analysis is 2x2 the program offers computation of Fisher's exact test. The prompts here are the following: 2x2 table: wish to perform exact test? (y/n) If the answer is "Yes", Fisher's test is calculated, if "No", the user is offered to perform chi-square calculations and pseudoprobability test. For RxC tables the program calculates the Chi-square value and asks if you wish to perform the Monte Carlo experiment. If "Yes", the program will ask how many runs you would like to perform (each run produces a simulated RxC table with marginal totals identical with those of the original RxC table under analysis). The default number of runs is 1000 and you should confirm it by answering "Yes". If you prefer a different number of runs, you must answer "No" and the program will ask you to indicate the number you wish. It will look like this: ................................................ file: "sample_1" X2=40.944702; chi-sq_05 (15)=24.996 wish to perform Monte Carlo experiment? (y/n) ...Y number of runs is 1000: (Y)es or (N)ot - select key ...N enter number of runs -> 2000 ................................................. You may interrupt simulation runs at any moment by pressing the ESC key. The main results obtained by the program are displayed on the screen. More complete results you can write into an output file. It may be handy to use the same file both for input and output. The program will append the input file; then after your input data will follow the results. The name of the output file may be the same for different RxC tables under analysis. In this case the program will append to it the results of analyses of different tables indicating the names of input files. To analyze several tables successively you should type in the command line the name of the executable file (CHIRXC), then names of all the input files you want to process; the last type the name of the output file. For example: CHIRXC ECOR1 HINF MVAL BME SAU OUTP The input files with data are "ECOR1", "HINF", "MVAL", "BME", and "SAU". Results of processing of all the input files will be in the last file, in our example it is "OUTP". If a file with this name existed before running the program, it will be appended. The formats of input files are given below: a) If the input data are numbers of different categorical classes (genotypes, phenotypes, morphs, etc.) in a conventional RxC table, first goes the letter 'n' (numbers), then number of lines (samples) and the number of columns (categorical classes); then follow the data. Example of input file EXAM_NUM with numbers: n 4 6 4 1 0 2 1 0 7 0 5 0 1 1 1 1 1 5 0 4 1 1 0 0 2 1 b) If the input data are allelic frequencies, then the letter is 'f' (frequencies), then follow the number of samples (lines) and number of alleles (columns). Then follow data for each sample: number of individuals (number of genomes - 2N - is computed within the program) and the allele frequencies. In the example below are allele frequencies in 4 samples; number of alleles in each sample is 9. In the first sample there are 71 individuals, in the 2nd - 79, etc. The program sums up the allele frequencies and if this sum differs from 1 it is displayed and the program asks if the user wishes to correct the input file: file: "vlad_bay" sum of frequencies in sample # 2 is 1.001000 wish you check the input file? (y/n) ...N sum of frequencies in sample # 5 is 0.899000 wish you check the input file? (y/n) ...Y ................................................. Example of input file EXAM_FRE with frequencies: f 4 7 71 .000 .007 .218 .021 .000 .733 .021 79 .006 .000 .266 .000 .000 .703 .025 89 .006 .006 .180 .017 .017 .752 .022 96 .000 .010 .193 .000 .063 .708 .026 Entries (numbers, frequencies, etc.) may be separated by space(s), tabs or carriage returns. Examples of output files: a) input data are numbers name of input file: "EXAM_NUM" row x col = 4 x 6 calculated on Fri Jul 31 00:52:28 1992 Row frequencies: 1 0.500 0.125 0.000 0.250 0.125 0.000 2 0.500 0.000 0.357 0.000 0.071 0.071 3 0.083 0.083 0.083 0.417 0.000 0.333 4 0.200 0.200 0.000 0.000 0.400 0.200 av.fr 0.333 0.077 0.154 0.179 0.103 0.154 Expected counts: 1 2 3 4 5 6 1 2.667 0.615 1.231 1.436 0.821 1.231 2 4.667 1.077 2.154 2.513 1.436 2.154 3 4.000 0.923 1.846 2.154 1.231 1.846 4 1.667 0.385 0.769 0.897 0.513 0.769 Cell contributions to the total X2: 1 2 3 4 5 6 sum 1 0.667 0.240 1.231 0.222 0.039 1.231 3.629 2 1.167 1.077 3.761 2.513 0.132 0.618 9.268 3 2.250 0.006 0.388 3.761 1.231 2.513 10.149 4 0.267 0.985 0.769 0.897 4.313 0.069 7.300 sum 4.350 2.308 6.149 7.393 5.715 4.431 totals 13 3 6 7 4 6 grand total=39 df = 15, X2 = 30.3461, chi-sq_05 = 24.996 Monte Carlo testing: 1000 runs made estimated probability of homogeneity (P) is 0.007 95% confidence interval for P: 0.00278247 - 0.0131108 b) input data are numbers from the keyboard The output is identical to shown in (a) except for inclusion of the input table c) input data are frequencies name of input file: "EXAM_FRE" row x col = 4 x 7 calculated on Fri Jul 31 01:10:41 1992 Expected counts: 1 2 3 4 5 6 7 1 0.424 0.848 30.096 1.272 3.179 102.79 3.391 2 0.472 0.943 33.487 1.415 3.537 114.37 3.773 3 0.531 1.063 37.725 1.594 3.985 128.85 4.251 4 0.573 1.146 40.693 1.719 4.299 138.99 4.585 Cell contributions to the total X2: 1 2 3 4 5 6 7 sum 1 0.424 0.027 0.027 2.349 3.179 0.014 0.045 6.066 2 0.592 0.943 2.164 1.415 3.537 0.099 0.014 8.765 3 0.413 0.004 0.869 1.240 0.244 0.206 0.015 2.990 4 0.573 0.636 0.335 1.719 13.799 0.064 0.038 17.164 sum 2.002 1.610 3.396 6.724 20.758 0.384 0.111 totals 2 4 142 6 15 485 16 grand total=335 av.fr 0.003 0.006 0.212 0.009 0.022 0.724 0.024 df = 18, X2 = 34.9846, chi-sq_05 = 28.869 Monte Carlo testing: 1000 runs made estimated probability of homogeneity (P) is 0.004 95% confidence interval for P: 0.00104276 - 0.00886206 There is no fixed array limits in the program. All the arrays are organized dynamically. COMMENTS ON THE RANDOMIZATION ALGORITHM (for CHIRXC program) Our randomization algorithm is somewhat different from those of Roff & Bentzen. Instead of searching for indices of the first, second, etc. row (as in their procedure) we enter each individual after every randomization of the RxC table into its proper cell of the new zeroed RxC table. It is done like this. Each individual (i.e. item in the original RxC table) is regarded as a pair of indices for the row and column. Thus, if in a cell of the original table (say 3d row and 5th column) there are 10 individuals, the program writes into two tied vectors 10 pairs of 3's and 5's. The same is done for all the cells of the RxC table. Then row or column indices are shuffled in one vector relative to the other, in which the order of the indices is fixed. Row and column indices are shuffled alternatively: one time - row indices, next time - column indices. Then, after the shuffling of tied vectors a new RxC table is being formed. For that end pairs of indices (from the tied vectors) are taken one pair at a time. If the pair is, say, 4 and 6, then contents of the cell with coordinates 4, 6 in the table being formed is increased by 1. This permits to fill in the table after each randomization without using cumbersome nested loops. The shuffling of indices of individuals (within randomization procedure) is done without sorting procedures. For each consecutive individual, which is represented by a pair of indices (in the vector for row indices and in the one for column indices), a random number is chosen in the interval from the current individual number to the total number of individuals in the original RxC table. This random number is used as the address (within the vectors) of another individual, with which swapping will occur. In this way of shuffling through swapping with randomly chosen individual the duration of the randomization procedure is linearly dependent on the sample size. Prior to analysis of each new file the random-number generator is initialized using computer time. The program CHIHW: The program calculates deviations of observed genotypic proportion from those expected under Hardy-Weinberg equilibrium and estimates their significance using conventional Chi-square test and pseudoprobability test, as described in: Hernandez,J.L. and Weir,B.S. A disequilibrium coefficient approach to Hardy-Weinberg testing. - Biometrics, 1989, 45: 53-70. The program estimates the probability of the data under null-hypothesis (agreement with H-W equilibrium) using Monte Carlo simulation. The program calculates as well Selander's index of excess of heterozygotes: D = ( Hobs - Hexp ) / Hexp. Significance of D (one tail and both tails tests) is obtained using the same approach as for the agreement with H-W equilibrium. In short, the program does the following. The explanation given below concerns the test based on chi-square values. Pseudoprobability test for significance of D is done similarly. The program generates random samples with the same numbers of genes of each category as in the original sample under analysis. For each generated table chi-square (X2) is computed and compared with chi-square for the original table (X2_or). The result of Monte Carlo simulation is the ratio of numbers of generated tables, for which X2 > or = X2_or, to the total number of generated tables ("number of runs"). This ratio is regarded as the probability of the data under null hypothesis for the original sample. The 95% confidence interval is constructed by using arcsine transformation. For D index the program calculates (by simulation) both the probability of chance obtaining of the same or higher values (of the same sign - one tail test), and the same or higher values disregarding the sign - two tails test. For more details on the problem of assessment of statistical significance of the test statistic (when some genotypes are rare) see the mentioned paper by Hernandez & Weir as well as the following: 1. Guo, S-W and E.A. Thompson. 1992. Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics 48:361-372. 2. Zaykin, D., L. Zhivotovsky, B.S. Weir. 1995. Exact tests for association between alleles at arbitrary number of loci. - Genetica, 96: pp. 169-178 To operate the program one must use the command: CHIHW INPT OUTPT Here INPT and OUTPT mean the names of input and output files (the latter is optional). After this command the program calculates the Chi-square and D values and asks if you wish to perform the Monte Carlo experiment. If "yes", the program will ask how many runs you would like to perform (each run produces a simulated sample with the numbers of each allele identical with those in the original sample under analysis). The default number of runs is 1000 and you should confirm it by answering "yes". If you prefer a different number of runs, you must answer "no" and the program will ask you to indicate the number you wish. You may interrupt simulation at any moment by pressing the ESC key. The results of the simulation will be displayed. The main results obtained by the program are displayed on the screen. More complete results you can write into an output file. The name of the output file may be the same for different samples under analysis. In this case the program will append to it the results of analyses of different samples indicating the names of input files. To analyze several samples successively you should type in the command line the name of the file (CHIHW), then names of all the input files you want to process; the last type the name of the output file. For example: CHIHW ESTD PGM GPI ADH AO XDH OUTP The input files with data are "ESTD", "PGM", "GPI", "ADH", "AO", and "XDH". Results of processing of all the input files will be in the last file, in our example it is "OUTP". If a file with this name existed before running the program, it will be appended. In the input file numbers of genotypes should be preceded by letters (in upper or low case) indicating the genotype and separated from numbers by colon (:) or hyphen (-). The data for each genotype can be separated by space(s), tabulation or a new line. Alleles in genotype designations may be in alphabetical order or not. Thus designations AB:10 and BA:10 are identical. Genotypes may be entered in any succession. Input is case sensitive, E.g. Aa:10 is not equal to aa:10. Some alleles can be omitted (e.g., in the input file example below there is no allele "A"). However, program works more efficient if allele names start from capital A (program runs faster and is capable to manage larger data sets), also, the sequence of names like A,B,C,D (where there are no gaps in allele names) would work more efficiently than the sequences A,B,M,N or B,C,D,E. For that reason use small letters (a,b,c...) only if you run out of alphabet. The format of input file: EE:26 FF:6 BE:2 CE:2 DF:1 DE:1 EF:44 GE:13 Output file with results for the example above: Name of input file: "EXAMPLE" ---------------------------------- calculated on Sun Apr 26 03:21:57 1992 allelic frequencies: B:0.01053 C:0.01053 D:0.01053 E:0.60000 F:0.30000 G:0.06842 ------------------------------------------------------------------------ gntp | obs | exp | X2 | M-Carlo Probability ------------------------------------------------------------------------ BB 0 0.0105 0.0105 1.0000 BC 0 0.0211 0.0211 1.0000 BD 0 0.0211 0.0211 1.0000 BE 2 1.2000 0.5333 0.5240 BF 0 0.6000 0.6000 0.5745 BG 0 0.1368 0.1368 1.0000 CC 0 0.0105 0.0105 1.0000 CD 0 0.0211 0.0211 1.0000 CE 2 1.2000 0.5333 0.5271 CF 0 0.6000 0.6000 0.5820 CG 0 0.1368 0.1368 1.0000 DD 0 0.0105 0.0105 1.0000 DE 1 1.2000 0.0333 1.0000 DF 1 0.6000 0.2667 1.0000 DG 0 0.1368 0.1368 1.0000 EE 26 34.2000 1.9661 0.0006 EF 44 34.2000 2.8082 0.0165 EG 13 7.8000 3.4667 0.0026 FF 6 8.5500 0.7605 0.2412 FG 0 3.9000 3.9000 0.0234 GG 0 0.4447 0.4447 1.0000 ________________________________________________________________________ totals: 95 95 16.4181 df = 15, X2 = 16.4181, chi-sq_05 = 24.996 H_obs = 0.66316 H_exp = 0.54499 Selander's D = 0.21683 (k-1)*N*D*D = 22.333 <- chi-sq_05 = 3.841 Monte Carlo testing: 16999 runs made: estimated probability (P) of agreement to H-W is 0.138891 estimated probability (P) of D=0 (one tail) is 0.00464733 estimated probability (P) of D=0 (both tails) is 0.00688276 95% confidence intervals for probabilities (P): for H-W -> 0.133733 - 0.14413; for D D(one tail) -> 0.00368091 - 0.0057257; D(both tails) -> 0.00569565 - 0.0081813; There is no fixed array limits in the program. All the arrays are organized dynamically. COMMENTS ON THE RANDOMIZATION ALGORITHM (for CHIHW program) Numbers of alleles of each category (Na, Nb, Nc, etc.) are counted and allelic pool is formed, which is a vector containing a's Na times, b's Nb times, etc. Then the vector is randomly shuffled (as described above in the section for CHIRXC). Then a pair of alleles is drawn form the vector. This will be first genotype in the new randomized sample. Then the second pair of alleles is drawn, then next, etc. until the end of the vector is reached and thus the new randomized sample has been formed. Prior to analysis of each new file the random-number generator is initialized using computer time.