False-Discovery rate correction for Hardy-Weinberg tests on a VCF

It is common to apply a “quality check” filter on a VCF file to retain only variants that do not depart from a neutrality expectation, namely the Hardy-Weinberg equilibrium (HWE). In fact it’s also a way to discard potential genotyping errors. Note that the Hardy-Weinberg test can be two-sided or one-sided, in which case it is refered to as “excess of heterozygosity” (HWE vs ExcHet in bcftools +fill-tags). The latter test might be preferable, as loss of heterozygosity compared to HWE usually means population structure, something that we want to retain in some cases.

I have started to work on population genomics only recently (although I have had some experience awhile ago) so if you see anything incorrect, please reach out.

Why multiple-test correction?

We are applying the test to many thousands of variants. Even if they are not strictly independent, there is a large probability that we reject some of them that actually conform to the null hypothesis.

Choosing a p-value cutoff in this case is not straightforward, as $p \lt 0.01$ might really be too stringent with many variants.

Why False-discovery rate correction?

The Bonferroni correction is deemed too conservative. In our case it means it might not reject enough variants. A related problem is that it assumes independent tests, so we need to determine what is our effective number of independent variants (ie. correcting for linkage disequilibrium).

Also, the paper for the HWE exact test (Wigginton et al. 2005¹) states:

For large datasets, rather than fixing and arbitrary threshold for rejecting HWE, we suggest that methods based on the false-discovery rate (Benjamini and Hochberg 1995) be used to identify a subset of markers whose genotypes do not conform to the expected equilibrium distribution.

The drawback of the FDR method like Benjamini-Hochberg or Benjamini-Yekutieli is that it requires sorting of all p-values, but on my data it was reasonably fast.

How to apply FDR correction for the HWE test of bcftools?

We will apply the Benjamini-Yekutieli method which is based on Benjamini-Hochberg but does not assume independence between tests (at the expense of being more conservative). See White et al. 2019².

From my literature search, this is what is used for filtering variants out of HWE prior to GWAS studies. In this case it is important to filter out genotyping errors as well as variants correlated with population structure that could be confounders. However potential variants of interest should be retained, so this is why the filtering should be moderate. In my use-case, we want to run population structure analyses, so the concerns are different, and we should probably only apply the “excess of heterozygosity” test.

Here is the step-by-step guide using Bash³. It ends up being a few more lines of code than I expected when I started writing it.

Inputs:

$vcf: our VCF input file.
$subpopulations: a two-column file specifying a population for which sample. For example, it makes sense to only keep ingroup samples. The following assumes that our ingroup population is named in, so the format can be:
```
samplename1  in
samplename2  in
```
and outgroup samples omitted.

Set up

First, because of using the utility sort with numbers, the “locale” matters, so ensure you are using a standard one:

export LC_ALL=C  # For sorting

Compute HWE or ExcHet independent tests.

We first let bcftools +fill-tags compute the HWE or ExcHet p-values, which we then output into a table. We sort this table by increasing p-values, and add a rank column. The CHROM,POS,REF,ALT columns we output in the table are determined by the requirement of the bcftools annotate command, which we will use afterwards to insert back our adjusted test results.

# Name for the output tab delimited file
tab=$(basename "$vcf" .vcf.gz).hwe.tab

bcftools view --max-alleles 2 "$vcf" -Ou \
| bcftools +fill-tags -Ou -- -S "$subpopulations" -t HWE,ExcHet \
| bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%HWE_in\t%ExcHet_in' \
| sort -k5,5 -g \
| awk 'BEGIN{OFS="\t"}{print $0, NR}' > "$tab"

Compute the FDR-adjusted significance

The Benjamini-Yekutieli procedure requires to compute the correction factor $\mathrm{correction}(i)$ for each p-value $p_i$ in order, starting from the smallest, until we find a p-value such that $p_i > \mathrm{correction}(i)\times \alpha$. Then all preceeding tests are declared significant, and all subsequent tests non-significant.

# Total number of tests:
ntests=$(tail -n 1 "$tab" | cut -f7)

# Name of the tab file with the significance column after BY correction.
# This file must be sorted by genomic coordinates and indexed, because we will
# use it to annotate the VCF.
tabBY=$(basename "$vcf" .vcf.gz).hweBY.tab.gz

# Benjamini-Yekutieli FDR correction term.
# The function 'h()' is the harmonic series, which can be approximated to log(n) + euler-mascheroni constant
awk -v "m=$ntests" -v "alpha=0.05" 'function h(n, s,i){
        if(n>100){
            return log(n)+0.577215664901532
        }else{
            s=0; for(i=1;i<=n;++i){s+=1/i} return s
        }
    }
    BEGIN{signif=1; OFS="\t"}
    { if(signif){corr=$7/(m * h($7)); signif=($5<=corr*alpha)}else{corr="."};
        print $0, 1*signif, corr
    }' "$tab" \
| sort -k1V,1 -k2,2n \
| bgzip > "$tabBY"
tabix -s1 -b2 -e2 "$tabBY"

Finally merge back results into the VCF and select variants

# The 'annotate' command requires a header text to describe the new INFO variables.
cat > tmp_header.hdr <<'EOF'
##INFO=<ID=HWE_in,Number=A,Type=Float,Description="p-value of the HWE exact test (ingroup)">
##INFO=<ID=ExcHet_in,Number=A,Type=Float,Description="Excess heterozygosity test (ingroup)">
##INFO=<ID=rankHWE_in,Number=1,Type=Integer,Description="rank of the HWE test p-value">
##INFO=<ID=BYcorr,Number=1,Type=Float,Description="Benjamini-Yekutieli correction term for the HWE_in p-values">
##INFO=<ID=BYsignif,Number=1,Type=Float,Description="HWE_in significance after Benjamini-Yekutieli correction">
EOF

# Output file names.
out_kept=$(basename "$vcf" .vcf.gz).hweBYfiltered.vcf.gz
out_removed=$(basename "$vcf" .vcf.gz).hweBYsignif.vcf.gz

bcftools annotate -h tmp_header.hdr -a "$tabBY" -c 'CHROM,POS,REF,ALT,HWE_in,ExcHet_in,rankHWE_in,BYsignif,BYcorr' $vcf -Ob -o tmp_hwe.bcf

# Apply the filter
bcftools filter -e 'BYsignif == 1' tmp_hwe.bcf -Oz -o "$out_kept"
bcftools filter -i 'BYsignif == 1' tmp_hwe.bcf -Oz -o "$out_removed"

# Cleanup
rm -v tmp_header.hdr tmp_hwe.bcf "$tab" "$tabBY"

Done! Yay!

See this code as a single file on my gitlab.

Wigginton et al. 2005, A note on exact tests of Hardy-Weinberg equilibrium ↩
White et al. 2019, Beyond Bonferroni revisited: concerns over inflated false positive research findings in the fields of conservation genetics, biology, and medicine. ↩
If you paste this in a script, I recommend using the so-called “strict mode” by having this line: set -euo pipefail. ↩