Pre-filtering variants from VCFs: switching to bcftools
I am currently (re)familiarizing with VCF files and their processing.
It happens that they are many softwares to process them, in particular bcftools
,
vcftools
(which are quite different from the former but whose functionalities
overlap), and also GATK
.
My current work is to update an existing workflow from the lab, which uses each of them. I was thinking, can I get rid of one or more dependencies?
I seemed to me that bcftools
was the most flexible so this is the one I
choose to keep.
Prerequisites
Some knowledge of the VCF format is required. See the docs here.
In particular the 8th column is called INFO
and contains an arbitrary number key=value
entries. Some of these variables have standard names, like MQ
for mapping quality.
The 7th column is the FILTER
column which is either PASS
or a semicolon
separated list of failed filters.
Quality check of variants with genotype calling metrics : vcffilter VS bcftools
The VCF were generated by a genotype caller which recommends the following filtering command:
vcffilter -f 'ABHet < 0.0 | ABHet > 0.33' \
-f 'ABHom < 0.0 | ABHom > 0.97' \
-f 'MaxAASR > 0.4' \
-f 'MQ > 30' \
-t graphtyper_filter \
-F PASS \
$vcf \
| bgzip > marked.vcf.gz
This is a soft filter that only adds a FILTER
tag (options -t
and -F
).
Actual removal can be done in a separate command.
In this case, the variables being filtered are all present as INFO
tags, so
it’s straightforward to translate the command to bcftools
:
bcftools filter \
-i '(ABHet<0 || ABHet>0.33)
&& (ABHom<0 || ABHom>0.97) \
&& MaxAASR>0.4 \
&& MQ>30' \
--soft-filter 'graphtyper_filter' \
--mode x \
$vcf -Oz -o marked.vcf.gz
Note: the &
and &&
have slightly different meanings in bcftools
but it
should not matter in this case.
Sample-specific genotype quality filter: GATK4 vs bcftools
Instead of removing the full row, we can decide to discard only some genotypes,
by setting them to the missing value based on a sample-specific quality metric, here GQ
.
GQ
is therefore a FORMAT
tag, i.e. it applies to each sample.
The original code used GATK4:
gatk VariantFiltration -V $vcf \
--genotype-filter-name 'GQ_filter' \
--genotype-filter-expression 'GQ<20' \
--set-filtered-genotype-to-no-call \
-O GQ_filtered.vcf.gz
equivalent for bcftools
:
bcftools filter --exclude 'GQ<20' --set-GTs '.' $vcf -Oz -o gq_filtered.vcf.gz
Allele frequency filter: vcftools VS bcftools
vcftools
was only used for the following “MAF” (Minor allele frequency) and “HWE” (Hardy-Weinberg equilibrium test) filter:
vcftools --gzvcf $vcf \
--maf 0.01 --hwe 0.01 --max-missing 0.2 \
--recode --recode-INFO-all --stdout \
| bgzip > hwe_filtered.vcf.gz
This command removes variants with:
- MAF strictly less than 0.01
- OR with a p-value for the HWE test strictly less than 0.01
- OR a proportion of missing genotypes strictly less than 0.8 (yes this option flag is counter-intuitively named).
vcftools
calculates these statistics on the fly and filters simultaneously.
However although bcftools
can calculate the variables MAF
and F_MISSING
on the fly, for the HWE p-value we have to add the right INFO entry to the vcf before
applying a fixed-threshold filter with the filter
command.
This can be done with the fill-tags
plugin. The command that is exactly equivalent to the above vcftools
call is the following:
# Using bcftools 1.9 with htslib 1.9.
bcftools +fill-tags $vcf -Ou -- -t HWE \
| bcftools filter \
-M 2 --exclude 'MAF < 0.01 || HWE < 0.01 || F_MISSING > 0.8' \
-Oz -o hwe_filtered.vcf.gz
Note the option -M 2
: while vcftools
discards variants with more than 1
alternative allele when filtering for HWE, bcftools
actually calculates one
value of HWE (and MAF) for each alternative allele. So if you don’t explicitly
remove such variants, the HWE (and MAF) entries are actually arrays, and the
filtering command HWE < 0.01
is equivalent to HWE[0] < 0.01
which is
probably not intended.
Remark on HWE filter: there are better ways to filter for HWE outliers. At the very least multiple test correction should be applied (see FDR correction for the HWE test on a vcf). Also HWE should be considered separately for each subpopulation. In case they are not known, more elaborate methods have been designed, such as Kwong et al. 2021.
Conclusion
I got rid of three dependencies and improved consistency, so I am happier now.