Rank-FDR scripts

R scripts for methods in

Kuo C-L and Zaykin DV 2011. Novel rank-based approaches for discovery and replication in genome wide association studies. Genetics 189:329--340. (PDF)

First, please download all scripts at once: everything.zip and unzip to a folder.

To estimate the proportion of true positives and related measures, one needs to assume some typical effect sizes and their frequencies.

Option 1: specify some a priori likely values. These, along with some other parameters can be specified by editing the top section of the file Get-noncentralities-from-table.r (under "USER-DEFINED SECTION")

Next, edit "USER-DEFINED SECTION" of rFDR.r and execute (source) that script, which will compute rFDR based on the formula given by equation 3.

Another script, posterior.r takes a file with p-values as an input. The format is comma-separated (e.g. produced by saving an Excel sheet as ".CSV". The column with p-values is assumed to have the name Pvalue, other columns can be added: for example, in the sample file pvalues.csv there is an additional column, "SNP". This column tracks which p-values correspond to true- and which to false effects, denoted by TA and FA. This sample file was generated by a simulation script, simulate-pvalues.r. You can generate a different sample input file by re-running the simulation script (note, the simulation script assumes the effect distribution specified in Get-noncentralities-from-table.r). Next, run posterior.r again. How well the proportion of true effects is estimated? The true proportion in the simulated file can be printed with the R command:
cat("True proportion = ", length(d[d$SNP == "TA",]$Pvalue) / length(d$Pvalue), "\n")

Option 2: estimate the effect distribution from data. This option is experimental: you may want to let us know if you are using this option. The methods have been developed and tested, but not yet published. Briefly, parameters (effect distribution and the proportion of nulls) are estimated using p-values, thresholded at some reasonably small alpha level. The null distribution may also be adjusted if there is a systematic inflation that results in a deviation from the uniform. The estimation procedure requires a large number of p-values (e.g. from a GWAS). The script name is Estimate-effects-distribution.r. Again, "USER-DEFINED SECTION" part may have to be edited. The script reads from a file, the hard-coded file name is pvalues-all.csv.gz. Currently, this file contains p-values created by the simulation script, described in the previous section. This file is in exactly the same format as pvalues.csv, except that it contains all 1e6 p-values, rather than top 500, and it is compressed with gzip (hence the .gz in the name). Alternatively, you can specify zip compression or no compression (see notes in "USER-DEFINED SECTION" part of the script).

Next, rFDR.r or posterior.r can be ran, but you have to uncomment the line in these files that calls the estimation script just described (Estimate-effects-distribution.r). See comments in these files, under "USER-DEFINED SECTION".

Notes

Our scripts assume that p-values under the hypothesis of association come from a distribution that is close to a distribution of 1-df chisquare test statistic.

Estimation using GWAS p-values ("Option 2") gives the shape and the scale parameters, assigned at the end of the script to variables qShape and qScale. These values are utilized by rFDR.r and posterior.r, in the line "Params <- c(qShape, qScale, At)". In principle, the values can be fit once and then reused in a different study of the same disease. However, if the sample size changes, the scale parameter changes, because it is the multiple of (approximately) the harmonic mean of case and control sample sizes. In this case, "qScale" would need to be adjusted as (n2/n1)*qScale, where n1 is the harmonic mean of the sample sizes for the original study (GWAS), and n2 is the harmonic mean of the sample sizes for the new study.

Sometimes, it is useful to estimate just the probability that a single true positive will rank among a certain number of top p-values (i.e., ranking probability), or equivalently, the number of top results given the specified ranking probability. We provide scrips to do that: one for case-control, GetRanks.r, and one continuous data, GetRanks-T.r. These scripts need to be edited at the top (put in your assumed relative risk, or mean difference for continuous trait, the number of p-values, etc.)