This file is "readme" for chirxc.exe and chihw.exe, described in

Zaykin, D.V. and Pudovkin, A.I. 1993. Two programs to estimate
significance of Chi-square values using pseudo-probability test. -
J.Hered., 84(2), p. 152.

The authors, would appreciate receiving any comments on the programs,
algorithms and the theory.

Our addresses:

Alexander Pudovkin:
Institute of Marine Biology, Vladivostok 690041, Russia.
email: apud@biom.marine.su

Dmitri Zaykin: 
North Carolina State Univ, Statistics Dept, Box 8203, 
Raleigh, NC 27695-8203.
email: zaykin@statgen.ncsu.edu

Both programs are written in  Borland C++, ver. 3.1.

To get short instructions on operation of the programs
type in the name of one of the programs from the command line
prompt and press "Enter".

Monte-Carlo probabilities reported by the programs are just
"p-values". That is, they are (frequentist cumulative) probabilities
FOR the null hypothesis rather than probabilities OF the hypothesis.

CHIRXC:

   The program performs homogeneity tests for RxC contingency tables
and estimates the probability of the data under the null-hypothesis
(homogeneity) using Monte Carlo simulation as suggested by
Roff,D.A. and Bentzen,P. - Mol.Biol.Evol., 6(5):539-545, 1990.

   In short, the program does the following. It generates random RxC
tables with the same marginal totals as in the original RxC table
under analysis. For each generated table chi-square (X2) is computed
and compared with chi-square for the original table (X2_or). The
result of Monte Carlo simulation is ratio of numbers of generated
tables, for which X2 > or = X2_or, to the total number of generated
tables ("number of runs"). This ratio is regarded as the probability
of the data under null hypothesis for the original RxC table.

   For 2x2 tables the program offers computation of Fisher's
exact probability test as well. It is necessary to note that p
obtained through the exact probability test by Fisher does not
fully correspond to the p obtained by the pseudoprobability test.
The pseudoprobability test is based on the values of chi-square
for randomly produced 2x2 tables and p equals the ratio of the
tables for which chi-square is larger than that for original table.
But two tables (with equal marginal totals) may have identical
chi-squares (therefore the identical results of pseudoprobability
test) but their p-values obtained by the exact Fisher's test may be
different. Hence the discrepancy of the tests. In the region when
p approximately equals 1/2 the discrepancy may be substantial, but
in the region of critical importance (p is ca 1/20) the both tests
are concordant. For example, we have two 2x2-tables with identical
marginal totals:

 10 10
 23 12

X2 = 1.30952
both tails probability: 0.2702

 14 6
 19 16

X2 = 1.30952
both tails probability: 0.3912

Chi-squares for both of them are equal; thus the pseudoprobability
test will yield identical results. But the exact test gives
different probabilities: 0.2702 and 0.3912.

The program permits two input formats:
   1) conventional RxC contingency table;
      a) input from the keyboard
      b) input from a file
   2) table of allele frequencies and  sample sizes
      (number of individuals); Input from a file only.

To operate the program one must use the command:

   CHIRXC INPT OUTPT

Here INPT and OUTPT   mean the names of input and output files
(the latter is optional).

When the table of numbers you are analyzing is small, you may enter
it directly from the keyboard. In this case the command should be:

   CHIRXC OUTPT /S

The explanations for entering and editing the screen will be
displayed during execution.

   If the table under analysis is 2x2 the program offers computation
of Fisher's exact test. The prompts here are the following:
	2x2 table: wish to perform exact test? (y/n)
If the answer is "Yes", Fisher's test is calculated,
if "No", the user is offered to perform chi-square calculations
and pseudoprobability test.

   For RxC tables the program calculates the Chi-square value and
asks if you wish to perform the Monte Carlo experiment. If "Yes",
the program will ask how many runs you would like to perform (each
run produces a simulated RxC table with marginal totals identical
with those of the original RxC table under analysis). The default
number of runs is 1000 and you should confirm it by answering "Yes".
If you prefer a different  number of runs, you must answer "No"
and the program will ask you to indicate the number you wish.
It will look like this:

................................................
  file: "sample_1"
X2=40.944702;    chi-sq_05 (15)=24.996
wish to perform Monte Carlo experiment? (y/n) ...Y
number of runs is 1000:
(Y)es or (N)ot  - select key  ...N
enter number of runs -> 2000
.................................................


 You may interrupt simulation runs at any moment by pressing
the ESC key.  The main results obtained by the program are
displayed on the screen.


More complete results you can write into an output file.

It may be handy to use the same file both  for input and output.
The program will append the input file; then after your input
data will follow the results.

The name of the output file may be the same for different RxC
tables under analysis. In this case the program will append to
it the results of analyses of different tables indicating the
names of input files.

To analyze several tables successively you should type in the
command line the name of the executable file (CHIRXC), then
names of all the input files you want to process; the last type
the name of the output file.

     For example:

	CHIRXC ECOR1 HINF MVAL BME SAU OUTP

The input files with data are "ECOR1", "HINF", "MVAL", "BME",
and "SAU".

Results of processing of all the input files will be in the last
file, in our example it is "OUTP". If a file with this name
existed before running the program, it will be appended.

The formats of input files are given below:

a) If the input data are numbers of different categorical classes
(genotypes, phenotypes, morphs, etc.) in a conventional RxC table,
first goes the letter 'n' (numbers), then number of lines (samples)
and the number of columns (categorical classes); then follow the data.


	 Example of input file EXAM_NUM with numbers:


n
4 6
4 1 0 2 1 0
7 0 5 0 1 1
1 1 1 5 0 4
1 1 0 0 2 1


b) If the input data are allelic frequencies, then the letter is
'f' (frequencies), then follow the number of samples (lines) and
number of alleles (columns). Then follow data for each sample:
number of individuals (number of genomes - 2N - is computed
within the program) and the allele frequencies. In the example
below are allele frequencies in 4 samples; number of alleles in
each sample is 9.  In the first sample there are 71 individuals,
in the 2nd - 79, etc.  The program sums up the allele frequencies
and if this sum differs from 1 it is displayed and the program
asks if the user wishes to correct the input file:

 file: "vlad_bay"

 sum of frequencies in sample # 2 is 1.001000
  wish you check the input file? (y/n) ...N
 sum of frequencies in sample # 5 is 0.899000
  wish you check the input file? (y/n) ...Y
.................................................


	 Example of input file EXAM_FRE with frequencies:

f 4 7
71   .000 .007 .218 .021 .000 .733 .021
79   .006 .000 .266 .000 .000 .703 .025
89   .006 .006 .180 .017 .017 .752 .022
96   .000 .010 .193 .000 .063 .708 .026


   Entries (numbers, frequencies, etc.) may be separated
   by space(s), tabs or carriage returns.


	Examples of output files:

a) input data are numbers

 name of input file: "EXAM_NUM"
 row x col = 4 x 6
 calculated on Fri Jul 31 00:52:28 1992


 Row frequencies:

 1	 0.500   0.125   0.000   0.250   0.125   0.000
 2	 0.500   0.000   0.357   0.000   0.071   0.071
 3	 0.083   0.083   0.083   0.417   0.000   0.333
 4	 0.200   0.200   0.000   0.000   0.400   0.200
av.fr	 0.333   0.077   0.154   0.179   0.103   0.154


 Expected counts:

	 1       2       3       4       5       6
 1	 2.667   0.615   1.231   1.436   0.821   1.231
 2	 4.667   1.077   2.154   2.513   1.436   2.154
 3	 4.000   0.923   1.846   2.154   1.231   1.846
 4	 1.667   0.385   0.769   0.897   0.513   0.769


 Cell contributions to the total X2:

	 1       2       3       4       5       6       sum
 1	 0.667   0.240   1.231   0.222   0.039   1.231   3.629
 2	 1.167   1.077   3.761   2.513   0.132   0.618   9.268
 3	 2.250   0.006   0.388   3.761   1.231   2.513   10.149
 4	 0.267   0.985   0.769   0.897   4.313   0.069   7.300

sum	 4.350   2.308   6.149   7.393   5.715   4.431
totals	 13      3       6       7       4       6
grand total=39


df = 15, X2 = 30.3461, chi-sq_05 = 24.996


  Monte Carlo testing:
 1000 runs made
 estimated probability of homogeneity (P) is 0.007
 95% confidence interval for P: 0.00278247 - 0.0131108

b) input data are numbers from the keyboard
The output is identical to shown in (a) except for inclusion
of the input table


c) input data are frequencies

 name of input file: "EXAM_FRE"
 row x col = 4 x 7
 calculated on Fri Jul 31 01:10:41 1992


 Expected counts:

	 1       2       3       4       5       6       7
 1	 0.424   0.848   30.096  1.272   3.179   102.79  3.391
 2	 0.472   0.943   33.487  1.415   3.537   114.37  3.773
 3	 0.531   1.063   37.725  1.594   3.985   128.85  4.251
 4	 0.573   1.146   40.693  1.719   4.299   138.99  4.585


 Cell contributions to the total X2:

	 1       2       3       4       5       6       7      sum
 1	 0.424   0.027   0.027   2.349   3.179   0.014   0.045  6.066
 2	 0.592   0.943   2.164   1.415   3.537   0.099   0.014  8.765
 3	 0.413   0.004   0.869   1.240   0.244   0.206   0.015  2.990
 4	 0.573   0.636   0.335   1.719   13.799  0.064   0.038  17.164

sum	 2.002   1.610   3.396   6.724   20.758  0.384   0.111
totals	 2       4       142     6       15      485     16
grand total=335
av.fr	 0.003   0.006   0.212   0.009   0.022   0.724   0.024


df = 18, X2 = 34.9846, chi-sq_05 = 28.869


  Monte Carlo testing:
 1000 runs made
 estimated probability of homogeneity (P) is 0.004
 95% confidence interval for P: 0.00104276 - 0.00886206


	There is no fixed array limits in the program. All the
arrays are organized dynamically.


 COMMENTS ON THE RANDOMIZATION ALGORITHM   (for CHIRXC program)


	Our randomization algorithm is somewhat different from
those of Roff & Bentzen. Instead of searching for indices of the
first, second, etc. row (as in their procedure) we enter each
individual after every randomization of the RxC table into its
proper cell of the new zeroed RxC table. It is done like this.

	Each individual (i.e. item in the original RxC table) is
regarded as a pair of indices for the row and column. Thus, if
in a cell of the original table (say 3d row and 5th column)
there are 10 individuals, the program writes into two tied
vectors 10 pairs of 3's and 5's. The same is done for all the
cells of the RxC table. Then row or column indices are shuffled
in one vector relative to the other, in which the order of the
indices is fixed. Row and column indices are shuffled
alternatively: one time - row indices, next time - column
indices. Then, after the shuffling of tied vectors a new RxC
table is being formed. For that end pairs of indices (from the
tied vectors) are taken one pair at a time. If the pair is, say,
4 and 6, then contents of the cell with coordinates 4, 6 in the
table being formed is increased by 1. This permits to fill in
the table after each randomization without using cumbersome
nested loops.
        The shuffling of indices of individuals (within
randomization procedure) is done without   sorting procedures.
For each consecutive individual, which is represented by a pair
of indices (in the vector for row indices and in the one for
column indices), a random number is chosen in the interval from
the current individual number to the total number of individuals
in the original RxC table.
        This random number is used as the address (within the
vectors)  of another individual, with which swapping will occur.
In this way of shuffling through swapping with randomly chosen
individual the duration of the randomization procedure is
linearly dependent on the sample size.
        Prior to analysis of each new file the random-number
generator is initialized using computer time.

   The program CHIHW:

   The program calculates deviations of observed genotypic
proportion from those expected under Hardy-Weinberg equilibrium
and estimates their significance using conventional Chi-square
test and pseudoprobability test, as described in:
   Hernandez,J.L. and Weir,B.S. A disequilibrium coefficient
      approach to Hardy-Weinberg testing. - Biometrics, 1989,
      45: 53-70.

   The program estimates the probability of the data under
null-hypothesis (agreement with H-W equilibrium) using Monte Carlo
simulation.

   The program calculates as well Selander's index of excess of
heterozygotes: D = ( Hobs - Hexp ) / Hexp. Significance of D
(one tail and both tails tests) is obtained using the same
approach as for the agreement with H-W equilibrium.

   In short, the program does the following. The explanation given
below concerns the test based on chi-square values.  Pseudoprobability
test for significance of D is done similarly.  The program generates
random samples with the same numbers of genes of each category as in
the original sample under analysis.  For each generated table
chi-square (X2) is computed and compared with chi-square for the
original table (X2_or). The result of Monte Carlo simulation is the
ratio of numbers of generated tables, for which X2 > or = X2_or, to
the total number of generated tables ("number of runs"). This ratio is
regarded as the probability of the data under null hypothesis for the
original sample. The 95% confidence interval is constructed by using
arcsine transformation.

  For D index the program calculates (by simulation) both the
probability of chance obtaining of the same or higher values
(of the same sign - one tail test), and the same or higher values
disregarding the sign - two tails test.

   For more details on the problem of assessment of statistical
significance of the test statistic (when some genotypes are
rare) see the mentioned paper by Hernandez & Weir as well as
the following:

1. Guo, S-W and E.A. Thompson. 1992. Performing the exact test
      of Hardy-Weinberg proportion for multiple alleles.
      Biometrics 48:361-372.

2. Zaykin, D., L. Zhivotovsky, B.S. Weir. 1995. Exact tests for
      association between alleles at arbitrary number of loci. -
      Genetica, 96: pp. 169-178

To operate the program one must use the command:


   CHIHW INPT OUTPT


Here INPT and OUTPT   mean the names of input and output files
(the latter is optional).  After this command the program
calculates the Chi-square and D values and asks if you wish to
perform the Monte Carlo experiment. If "yes", the program will ask
how many runs you would like to perform (each run produces a
simulated sample with the numbers of each allele identical
with those in the original sample under analysis). The default
number of runs is 1000 and you should confirm it by answering
"yes". If you prefer a different  number of runs, you must
answer "no" and the program will ask you to indicate the number
you wish. You may interrupt simulation at any moment by pressing
the ESC key. The results of the simulation will be displayed.
The main results obtained by the program are displayed  on the
screen. More complete results you can write into an output file.

The name of the output file may be the same for different samples
under analysis. In this case the program will append to
it the results of analyses of different samples indicating the
names of input files.
To analyze several samples successively you should type in the
command line the name of the file (CHIHW), then names of all
the input files you want to process; the last type the name of
the output file. For example:

	CHIHW ESTD PGM GPI ADH AO XDH OUTP

The input files with data are "ESTD", "PGM", "GPI", "ADH",
"AO", and "XDH".

Results of processing of all the input files will be in the last
file, in our example it is "OUTP". If a file with this name
existed before running the program, it will be appended.

In the input file numbers of genotypes should be preceded by
letters (in upper or low case) indicating the genotype and
separated from numbers by colon (:) or hyphen (-). The data
for each genotype can be separated by space(s), tabulation
or a new line. Alleles in genotype designations may be in
alphabetical order or not. Thus designations AB:10 and BA:10
are identical.

Genotypes may be entered in any succession.

Input is case sensitive, E.g. Aa:10 is not equal to aa:10.
Some alleles can be omitted (e.g., in the input file example below
there is no allele "A").

However, program works more efficient if allele names start from
capital A (program runs faster and is capable to manage larger data
sets), also, the sequence of names like A,B,C,D (where there are no
gaps in allele names) would work more efficiently than the sequences
A,B,M,N or B,C,D,E. For that reason use small letters (a,b,c...)
only if you run out of alphabet.


	 The format of input file:

EE:26   FF:6   BE:2
CE:2    DF:1   DE:1
EF:44   GE:13


Output file with results for the example above:

	Name of input file: "EXAMPLE"
	----------------------------------

calculated on Sun Apr 26 03:21:57 1992


 allelic frequencies:

 B:0.01053 C:0.01053 D:0.01053 E:0.60000 F:0.30000
 G:0.06842

------------------------------------------------------------------------
 gntp |     obs     |    exp      |      X2     |  M-Carlo Probability
------------------------------------------------------------------------
  BB        0            0.0105          0.0105          1.0000
  BC        0            0.0211          0.0211          1.0000
  BD        0            0.0211          0.0211          1.0000
  BE        2            1.2000          0.5333          0.5240
  BF        0            0.6000          0.6000          0.5745
  BG        0            0.1368          0.1368          1.0000
  CC        0            0.0105          0.0105          1.0000
  CD        0            0.0211          0.0211          1.0000
  CE        2            1.2000          0.5333          0.5271
  CF        0            0.6000          0.6000          0.5820
  CG        0            0.1368          0.1368          1.0000
  DD        0            0.0105          0.0105          1.0000
  DE        1            1.2000          0.0333          1.0000
  DF        1            0.6000          0.2667          1.0000
  DG        0            0.1368          0.1368          1.0000
  EE        26           34.2000         1.9661          0.0006
  EF        44           34.2000         2.8082          0.0165
  EG        13           7.8000          3.4667          0.0026
  FF        6            8.5500          0.7605          0.2412
  FG        0            3.9000          3.9000          0.0234
  GG        0            0.4447          0.4447          1.0000

________________________________________________________________________

totals:     95          95             16.4181


df = 15, X2 = 16.4181, chi-sq_05 = 24.996


			H_obs = 0.66316
			H_exp = 0.54499
		 Selander's D = 0.21683
		  (k-1)*N*D*D = 22.333 <- chi-sq_05 = 3.841

	Monte Carlo testing:
	16999 runs made:
		estimated probability (P) of agreement to H-W is 0.138891
		estimated probability (P) of D=0   (one tail) is 0.00464733
		estimated probability (P) of D=0 (both tails) is 0.00688276


95% confidence intervals for probabilities (P):
for H-W
              -> 0.133733 - 0.14413;

for D
D(one tail)   -> 0.00368091 - 0.0057257;
D(both tails) -> 0.00569565 - 0.0081813;


	There is no fixed array limits in the program. All the
arrays are organized dynamically.


 COMMENTS ON THE RANDOMIZATION ALGORITHM  (for CHIHW  program)

	Numbers of alleles of each category (Na, Nb, Nc, etc.)
are counted and allelic pool is formed, which is a vector
containing a's Na times, b's Nb times, etc.
        Then the vector is randomly shuffled (as described above 
in the section for CHIRXC).
        Then a pair of alleles is drawn form the vector.
This will be first genotype in the new randomized sample. Then
the second pair of alleles is drawn, then next, etc. until the
end of the vector is reached and thus the new randomized sample
has been formed.
        Prior to analysis of each new file the random-number
generator is initialized using computer time.