Pre-filtering variants from VCFs: switching to bcftools

I am currently (re)familiarizing with VCF files and their processing.

It happens that they are many softwares to process them, in particular bcftools, vcftools (which are quite different from the former but whose functionalities overlap), and also GATK.

My current work is to update an existing workflow from the lab, which uses each of them. I was thinking, can I get rid of one or more dependencies?

I seemed to me that bcftools was the most flexible so this is the one I choose to keep.

Prerequisites

Some knowledge of the VCF format is required. See the docs here. In particular the 8th column is called INFO and contains an arbitrary number key=value entries. Some of these variables have standard names, like MQ for mapping quality. The 7th column is the FILTER column which is either PASS or a semicolon separated list of failed filters.

Quality check of variants with genotype calling metrics : vcffilter VS bcftools

The VCF were generated by a genotype caller which recommends the following filtering command:

vcffilter -f 'ABHet < 0.0 | ABHet > 0.33' \
          -f 'ABHom < 0.0 | ABHom > 0.97' \
          -f 'MaxAASR > 0.4' \
          -f 'MQ > 30' \
          -t graphtyper_filter \
          -F PASS \
          $vcf \
| bgzip > marked.vcf.gz

This is a soft filter that only adds a FILTER tag (options -t and -F). Actual removal can be done in a separate command.

In this case, the variables being filtered are all present as INFO tags, so it’s straightforward to translate the command to bcftools:

bcftools filter \
    -i '(ABHet<0 || ABHet>0.33)
        && (ABHom<0 || ABHom>0.97) \
        && MaxAASR>0.4 \
        && MQ>30' \
    --soft-filter 'graphtyper_filter' \
    --mode x \
    $vcf -Oz -o marked.vcf.gz

Note: the & and && have slightly different meanings in bcftools but it should not matter in this case.

Sample-specific genotype quality filter: GATK4 vs bcftools

Instead of removing the full row, we can decide to discard only some genotypes, by setting them to the missing value based on a sample-specific quality metric, here GQ.

GQ is therefore a FORMAT tag, i.e. it applies to each sample.

The original code used GATK4:

gatk VariantFiltration -V $vcf \
    --genotype-filter-name 'GQ_filter' \
    --genotype-filter-expression 'GQ<20' \
    --set-filtered-genotype-to-no-call \
    -O GQ_filtered.vcf.gz

equivalent for bcftools:

bcftools filter --exclude 'GQ<20' --set-GTs '.' $vcf -Oz -o gq_filtered.vcf.gz

Allele frequency filter: vcftools VS bcftools

vcftools was only used for the following “MAF” (Minor allele frequency) and “HWE” (Hardy-Weinberg equilibrium test) filter:

vcftools --gzvcf $vcf \
    --maf 0.01 --hwe 0.01 --max-missing 0.2 \
    --recode --recode-INFO-all --stdout \
| bgzip > hwe_filtered.vcf.gz

This command removes variants with:

MAF strictly less than 0.01
OR with a p-value for the HWE test strictly less than 0.01
OR a proportion of missing genotypes strictly less than 0.8 (yes this option flag is counter-intuitively named).

vcftools calculates these statistics on the fly and filters simultaneously.

However although bcftools can calculate the variables MAF and F_MISSING on the fly, for the HWE p-value we have to add the right INFO entry to the vcf before applying a fixed-threshold filter with the filter command.

This can be done with the fill-tags plugin. The command that is exactly equivalent to the above vcftools call is the following:

# Using bcftools 1.9 with htslib 1.9.

bcftools +fill-tags $vcf -Ou -- -t HWE \
| bcftools filter \
    -M 2 --exclude 'MAF < 0.01 || HWE < 0.01 || F_MISSING > 0.8' \
    -Oz -o hwe_filtered.vcf.gz

Note the option -M 2: while vcftools discards variants with more than 1 alternative allele when filtering for HWE, bcftools actually calculates one value of HWE (and MAF) for each alternative allele. So if you don’t explicitly remove such variants, the HWE (and MAF) entries are actually arrays, and the filtering command HWE < 0.01 is equivalent to HWE[0] < 0.01 which is probably not intended.

Remark on HWE filter: there are better ways to filter for HWE outliers. At the very least multiple test correction should be applied (see FDR correction for the HWE test on a vcf). Also HWE should be considered separately for each subpopulation. In case they are not known, more elaborate methods have been designed, such as Kwong et al. 2021.

Conclusion

I got rid of three dependencies and improved consistency, so I am happier now.