Kuo C-L and Zaykin DV 2011.
Novel rank-based approaches for discovery and replication in
genome wide association studies. Genetics 189:329--340. (PDF)
First, please download all scripts at once: everything.zip
and unzip to a folder.
Option 1: specify some a priori likely values.
These, along with some other parameters can be specified by
editing the top section of the file Get-noncentralities-from-table.r
(under "USER-DEFINED SECTION")
Next, edit "USER-DEFINED SECTION" of rFDR.r and execute (source)
that script, which will compute rFDR based on the formula given by
equation 3.
Another script, posterior.r
takes a file with p-values as an input. The format is
comma-separated (e.g. produced by saving an Excel sheet as ".CSV".
The column with p-values is assumed to have the name Pvalue, other
columns can be added: for example, in the sample file pvalues.csv
there is an additional column, "SNP". This column tracks which
p-values correspond to true- and which to false effects, denoted
by TA and FA. This sample file was generated by a simulation
script, simulate-pvalues.r.
You can generate a different sample input file by re-running the
simulation script (note, the simulation script assumes the effect
distribution specified in Get-noncentralities-from-table.r).
Next, run posterior.r
again. How well the proportion of true effects is estimated? The
true proportion in the simulated file can be printed with the R
command:
cat("True proportion = ",
length(d[d$SNP == "TA",]$Pvalue) / length(d$Pvalue), "\n")
Option 2: estimate the effect distribution from data.
This option is experimental: you may want to let us know if you
are using this option. The methods have been developed and tested,
but not yet published. Briefly, parameters (effect distribution
and the proportion of nulls) are estimated using p-values,
thresholded at some reasonably small alpha level. The null
distribution may also be adjusted if there is a systematic
inflation that results in a deviation from the uniform. The
estimation procedure requires a large number of p-values (e.g.
from a GWAS). The script name is Estimate-effects-distribution.r.
Again, "USER-DEFINED SECTION" part may have to be edited. The
script reads from a file, the hard-coded file name is pvalues-all.csv.gz.
Currently, this file contains p-values created by the simulation
script, described in the previous section. This file is in exactly
the same format as pvalues.csv,
except that it contains all 1e6 p-values, rather than top 500, and
it is compressed with gzip (hence the .gz in the name).
Alternatively, you can specify zip compression or no compression
(see notes in "USER-DEFINED SECTION" part of the script).
Next, rFDR.r or posterior.r
can be ran, but you have to uncomment the line in these files that
calls the estimation script just described (Estimate-effects-distribution.r).
See comments in these files, under "USER-DEFINED SECTION".
Our scripts assume that p-values under the hypothesis of
association come from a distribution that is close to a
distribution of 1-df chisquare test statistic.
Estimation using GWAS p-values ("Option 2") gives the shape and
the scale parameters, assigned at the end of the script to
variables qShape and qScale. These values are utilized by rFDR.r and
posterior.r,
in the line "Params <-
c(qShape, qScale, At)". In principle, the values can be
fit once and then reused in a different study of the same disease.
However, if the sample size changes, the scale parameter changes,
because it is the multiple of (approximately) the harmonic mean of
case and control sample sizes. In this case, "qScale" would need to be
adjusted as (n2/n1)*qScale,
where n1 is the
harmonic mean of the sample sizes for the original study (GWAS),
and n2 is the
harmonic mean of the sample sizes for the new study.
Sometimes, it is useful to estimate just the probability that a
single true positive will rank among a certain number of top
p-values (i.e., ranking probability), or equivalently, the number
of top results given the specified ranking probability. We provide
scrips to do that: one for case-control, GetRanks.r,
and one continuous data, GetRanks-T.r.
These scripts need to be edited at the top (put in your assumed
relative risk, or mean difference for continuous trait, the number
of p-values, etc.)