VariantPlaner¶
variantplaner is a set of tools that converts a large set of vcf's into an interoperable data structure efficiently.
To show the capabilities of the variantplaner
, we will use a small example.
The purpose of this short tutorial is to present:
- how to convert vcf into a more suitable format
- how data can be restructured for querying
- how to integrate variant annotation databases
- how these different files can be used to obtain interesting biological information
This tutorial suggests an organization of files, but you're under no obligation to follow it variantplaner
is quite flexible in its organization.
Setup¶
This tutorial assume you are on unix like system, you have python setup and you install variantplaner
Requirements list:
- curl
- gunzip
- pqrs (only for transmission computation, otherwise optional)
Optional: - gnu-parallel
Quering dataset:
Download data¶
mkdir -p vp_tuto/vcf/
cd vp_tuto
curl https://raw.githubusercontent.com/natir/variantplaner/main/tests/data/grch38.92.csv > grch38.92.csv
URI_ROOT="https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release"
curl ${URI_ROOT}/NA12878_HG001/latest/GRCh38/HG001_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | gunzip - | sed 's/^chr//' > vcf/HG001.vcf
curl ${URI_ROOT}/AshkenazimTrio/HG002_NA24385_son/latest/GRCh38/HG002_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | gunzip - | sed 's/^chr//' > vcf/HG002.vcf
curl ${URI_ROOT}/AshkenazimTrio/HG003_NA24149_father/latest/GRCh38/HG003_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | gunzip - | sed 's/^chr//' > vcf/HG003.vcf
curl ${URI_ROOT}/AshkenazimTrio/HG004_NA24143_mother/latest/GRCh38/HG004_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | gunzip - | sed 's/^chr//' > vcf/HG004.vcf
curl ${URI_ROOT}/ChineseTrio/HG006_NA24694_father/latest/GRCh38/HG006_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | gunzip - | sed 's/^chr//' > vcf/HG006.vcf
curl ${URI_ROOT}/ChineseTrio/HG007_NA24695_mother/latest/GRCh38/HG007_GRCh38_1_22_v4.2.1_benchmark.vcf.gz | gunzip - | sed 's/^chr//' > vcf/HG007.vcf
Variant planner presentation¶
variantplaner is a python module with command line tools, it is composed of several subcommands (they will be detailed later) but has two global options, one for parallelization and another for the level of verbosity you want.
Usage: variantplaner [OPTIONS] COMMAND [ARGS]...
Run VariantPlanner.
Options:
-t, --threads INTEGER RANGE Number of threads usable [default: 1; x>=0]
-v, --verbose Verbosity level [0<=x<=4]
--help Show this message and exit.
Commands:
annotations Convert an annotation variation file in parquet.
metadata Convert an metadata file in parquet.
parquet2vcf Convert parquet in vcf.
struct Struct operation on parquet file.
vcf2parquet Convert a vcf in parquet.
vcf2parquet¶
First step is to convert vcf data in parquet, it's a column oriented format with better performance than indexed vcf.
We split vcf in two part on for variant information and another for genotype information.
mkdir -p variants genotypes/samples/
for vcf_path in $(ls vcf/*.vcf)
do
sample_name=$(basename ${vcf_path} .vcf)
variantplaner -t 4 vcf2parquet -i ${vcf_path} \
-c grch38.92.csv \
-v variants/${sample_name}.parquet \
-g genotypes/samples/${sample_name}.parquet \
-f GT:PS:DP:ADALL:AD:GQ
done
We iterate over all vcf, variants are store in variants/{sample_name}.parquet
, genotype information are store in variants/{sample_name}.parquet
. Only genotypes with a format value equal to the -f
parameter value are retained.
gnu-parallel method
find vcf -type f -name *.vcf -exec basename {} .vcf \; | \
parallel variantplaner -t 2 vcf2parquet -c grch38.92.csv -i vcf/{}.vcf \
-v variants/{}.parquet -g genotypes/samples/{}.parquet -f GT:PS:DP:ADALL:AD:GQ
Parquet variants file contains 5 column:
- pos: Position of variant
- ref: Reference sequence
- alt: Alternative sequence
- id: An hash of other value collision isn't check but highly improbable check api documentation
variants parquet file content
You can inspect content of parquet file generate with pqrs
pqrs head variants/samples/HG001.parquet
{id: 17886044532216650390, chr: 1, pos: 783006, ref: "A", alt: "G"}
{id: 7513336577790240873, chr: 1, pos: 783175, ref: "T", alt: "C"}
{id: 17987040642944149052, chr: 1, pos: 784860, ref: "T", alt: "C"}
{id: 10342734968077036194, chr: 1, pos: 785417, ref: "G", alt: "A"}
{id: 890514037559296207, chr: 1, pos: 797392, ref: "G", alt: "A"}
Parquet genotypes file contains column:
- id: Same as variant id
- gt: vcf GT value 1 -> heterozygote 2 -> homozygote (phasing information is lost)
- ps: Phase set in which this variant falls
- dp: vcf DP coverage of the variant for this sample
- adall: Net allele depths across all datasets
- ad: vcf AD per allele reads depth
- gq: vcf GQ quality of variant for this sample
genotypes parquet file content
You can inspect content of parquet file generate with pqrs
pqrs head genotypes/HG001.parquet
{id: 17886044532216650390, sample: "HG001", gt: 2, ps: null, dp: 652, adall: [16, 234], ad: [0, 82], gq: 312}
{id: 7513336577790240873, sample: "HG001", gt: 2, ps: null, dp: 639, adall: [0, 218], ad: [0, 84], gq: 194}
{id: 17987040642944149052, sample: "HG001", gt: 2, ps: null, dp: 901, adall: [105, 406], ad: [0, 74], gq: 301}
{id: 10342734968077036194, sample: "HG001", gt: 2, ps: null, dp: 820, adall: [125, 383], ad: [0, 70], gq: 339}
{id: 890514037559296207, sample: "HG001", gt: 1, ps: null, dp: 760, adall: [161, 142], ad: [25, 37], gq: 147}
Structuration of data¶
Merge all variant¶
We can now aggregate all variant present in our dataset to perform this operation we use divide to conquer merge method by generate temporary file. By default, file are written in /tmp
but you can control where these files are written by set TMPDIR
, TEMP
or TMP
directory.
input=$(ls variants/ | xargs -I {} -x echo "-i variants/"{} | tr '\n' ' ')
variantplaner -t 8 struct $(echo $input) variants -o variants.parquet
File variants.parquet
contains all unique variants present in dataset.
Genotypes structuration¶
By samples¶
This structurations data is already down in vcf2parquet step check content of genotypes/samples
:
➜ ls genotypes/samples
HG001.parquet HG002.parquet HG003.parquet HG004.parquet HG006.parquet HG007.parquet
By variants¶
Here, we'll organize the genotypes information by variants to make it easier to find samples where a variant is present or not.
mkdir -p genotypes/variants/
input=$(ls genotypes/samples/ | xargs -I {} -x echo "-i genotypes/samples/"{} | tr '\n' ' ')
variantplaner -t 8 struct $(echo $input) genotypes -p genotypes/variants
All genotypes information are split in hive like structure to optimize request on data.
Compute transmission mode¶
If you are working with families, variantplaner
can calculate the modes of transmission of the variants.
For these step, we need to concatenate all genotypes of a AshkenazimTrio in one parquet sample.
pqrs merge -i genotypes/samples/HG002.parquet genotypes/samples/HG003.parquet genotypes/samples/HG004.parquet -o genotypes/samples/AshkenazimTrio.parquet
mkdir -p genotypes/transmissions/
variantplaner generate transmission -i genotypes/samples/AshkenazimTrio.parquet -I HG002 -m HG003 -f HG004 -t genotypes/transmissions/AshkenazimTrio.parquet
-I
parameter is use for index sample, -m
parameter is use for mother sample, -f
parameter is use for father sample only the index sample is mandatory if mother sample or father sample isn't present command work, you could also use a pedigree file with parameter -p
.
transmission parquet file content
{id: 10201716324449815219, index_gt: 2, index_ps: null, index_dp: 1066, index_adall: [0, 284], index_ad: [118, 586], index_gq: 598, mother_gt: null, mother_ps: null, mother_dp: null, mother_adall: null, mother_ad: null, mother_gq: null, father_gt: null, father_ps: null, father_dp: null, father_adall: null, father_ad: null, father_gq: null, origin: "#~~"}
{id: 8292180701257594706, index_gt: 1, index_ps: null, index_dp: 1122, index_adall: [177, 165], index_ad: [310, 283], index_gq: 556, mother_gt: null, mother_ps: null, mother_dp: null, mother_adall: null, mother_ad: null, mother_gq: null, father_gt: null, father_ps: null, father_dp: null, father_adall: null, father_ad: null, father_gq: null, origin: ""~~"}
{id: 1728452411043401356, index_gt: 1, index_ps: null, index_dp: 1365, index_adall: [225, 222], index_ad: [348, 380], index_gq: 658, mother_gt: null, mother_ps: null, mother_dp: null, mother_adall: null, mother_ad: null, mother_gq: null, father_gt: null, father_ps: null, father_dp: null, father_adall: null, father_ad: null, father_gq: null, origin: ""~~"}
{id: 4237549706021671868, index_gt: 1, index_ps: null, index_dp: 1019, index_adall: [154, 153], index_ad: [277, 282], index_gq: 517, mother_gt: null, mother_ps: null, mother_dp: null, mother_adall: null, mother_ad: null, mother_gq: null, father_gt: null, father_ps: null, father_dp: null, father_adall: null, father_ad: null, father_gq: null, origin: ""~~"}
{id: 1361753917441299167, index_gt: 1, index_ps: null, index_dp: 1033, index_adall: [159, 170], index_ad: [265, 273], index_gq: 552, mother_gt: null, mother_ps: null, mother_dp: null, mother_adall: null, mother_ad: null, mother_gq: null, father_gt: null, father_ps: null, father_dp: null, father_adall: null, father_ad: null, father_gq: null, origin: ""~~"}
Parquet transmissions file contains column all genotypes information with suffix _index
, _mother
and _father
plus a origin
column
Origin column contains a string with 3 character:
#~"
││└ ASCII_value_of(father genotype + 33)
│└─ ASCII_value_of(mother genotype + 33)
└── ASCII_value_of(index genotype + 33)
In this example case, variants is homozygotes in index, mother information is missing, variants is heterozygotes in father.
Maximal genotype value is 92, which corresponds to the character }
, ~
match with value 93, this value also mean unknow genotype.
Add annotations¶
To work on your variant, you probably need and annotations.
Snpeff annotations¶
First convert your unique variants in parquet format (variants.parquet
) in vcf:
variantplaner -t 8 parquet2vcf -i variants.parquet -o variants.vcf
parquet2vcf
subcommand have many more options but we didn't need it now.
Next annotate this variants.vcf
with snpeff, we assume you generate a file call variants.snpeff.vcf
.
To convert annotated vcf in parquet, keep 'ANN' info column and rename vcf id column in snpeff_id you can run:
mkdir -p annotations
variantplaner -t 8 annotations -c grch38.92.csv -i variants.snpeff.vcf -o annotations/snpeff.parquet vcf -i ANN -r snpeff_id
If you didn't set any value of option -i
in vcf subsubcommand all info column are keep.
Clinvar annotations¶
Download last clinvar version:
mkdir -p annotations
curl https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz \
| gunzip - > annotations/clinvar.vcf
Convert clinvar vcf file in parquet file:
variantplaner annotations -c grch38.92.csv -i annotations/clinvar.vcf -o annotations/clinvar.parquet vcf -r clinvar_id
Parquet file produce contains many columns:
- clinvar_id: content of vcf id column if option
-r
is not set, column name isvid
- id: variantplaner id
- All INFO filed
Annotations subcommand try to make match between vcf info type and parquet type.
annotations/clinvar.parquet
file content
pqrs head annotations/clinvar.parquet
{clinvar_id: 2205837, id: 11650605831284591550, AF_ESP: null, AF_EXAC: null, AF_TGP: null, ALLELEID: 2193183, CLNDN: ["Inborn_genetic_diseases"], CLNDNINCL: null, CLNDISDB: ["MeSH:D030342", "MedGen:C0950123"], CLNDISDBINCL: null, CLNHGVS: ["NC_000001.11:g.69134A>G"], CLNREVSTAT: ["criteria_provided", "_single_submitter"], CLNSIG: ["Likely_benign"], CLNSIGCONF: null, CLNSIGINCL: null, CLNVC: "single_nucleotide_variant", CLNVCSO: "SO:0001483", CLNVI: null, DBVARID: null, GENEINFO: "OR4F5:79501", MC: ["SO:0001583|missense_variant"], ORIGIN: ["1"], RS: null}
{clinvar_id: 2252161, id: 2295086632353399847, AF_ESP: null, AF_EXAC: null, AF_TGP: null, ALLELEID: 2238986, CLNDN: ["Inborn_genetic_diseases"], CLNDNINCL: null, CLNDISDB: ["MeSH:D030342", "MedGen:C0950123"], CLNDISDBINCL: null, CLNHGVS: ["NC_000001.11:g.69581C>G"], CLNREVSTAT: ["criteria_provided", "_single_submitter"], CLNSIG: ["Uncertain_significance"], CLNSIGCONF: null, CLNSIGINCL: null, CLNVC: "single_nucleotide_variant", CLNVCSO: "SO:0001483", CLNVI: null, DBVARID: null, GENEINFO: "OR4F5:79501", MC: ["SO:0001583|missense_variant"], ORIGIN: ["1"], RS: null}
{clinvar_id: 2396347, id: 11033100074712141168, AF_ESP: null, AF_EXAC: null, AF_TGP: null, ALLELEID: 2386655, CLNDN: ["Inborn_genetic_diseases"], CLNDNINCL: null, CLNDISDB: ["MeSH:D030342", "MedGen:C0950123"], CLNDISDBINCL: null, CLNHGVS: ["NC_000001.11:g.69682G>A"], CLNREVSTAT: ["criteria_provided", "_single_submitter"], CLNSIG: ["Uncertain_significance"], CLNSIGCONF: null, CLNSIGINCL: null, CLNVC: "single_nucleotide_variant", CLNVCSO: "SO:0001483", CLNVI: null, DBVARID: null, GENEINFO: "OR4F5:79501", MC: ["SO:0001583|missense_variant"], ORIGIN: ["1"], RS: null}
{clinvar_id: 2288999, id: 10487392163259126218, AF_ESP: null, AF_EXAC: null, AF_TGP: null, ALLELEID: 2278803, CLNDN: ["Inborn_genetic_diseases"], CLNDNINCL: null, CLNDISDB: ["MeSH:D030342", "MedGen:C0950123"], CLNDISDBINCL: null, CLNHGVS: ["NC_000001.11:g.69769T>C"], CLNREVSTAT: ["criteria_provided", "_single_submitter"], CLNSIG: ["Uncertain_significance"], CLNSIGCONF: null, CLNSIGINCL: null, CLNVC: "single_nucleotide_variant", CLNVCSO: "SO:0001483", CLNVI: null, DBVARID: null, GENEINFO: "OR4F5:79501", MC: ["SO:0001583|missense_variant"], ORIGIN: ["1"], RS: null}
{clinvar_id: 2351346, id: 5356120651941363990, AF_ESP: null, AF_EXAC: null, AF_TGP: null, ALLELEID: 2333177, CLNDN: ["Inborn_genetic_diseases"], CLNDNINCL: null, CLNDISDB: ["MeSH:D030342", "MedGen:C0950123"], CLNDISDBINCL: null, CLNHGVS: ["NC_000001.11:g.69995G>C"], CLNREVSTAT: ["criteria_provided", "_single_submitter"], CLNSIG: ["Uncertain_significance"], CLNSIGCONF: null, CLNSIGINCL: null, CLNVC: "single_nucleotide_variant", CLNVCSO: "SO:0001483", CLNVI: null, DBVARID: null, GENEINFO: "OR4F5:79501", MC: ["SO:0001583|missense_variant"], ORIGIN: ["1"], RS: null}
With option of subcommand vcf -i
you can select which column are included in parquet file For example command:
variantplaner annotations -i annotations/clinvar.vcf -o annotations/clinvar.parquet vcf -r clinvar_id -i ALLELEID -i CLNDN -i AF_ESP -i GENEINFO
Produce a annotations/clinvar.parquet
with columns:
- clinvar_id
- id
- ALLELEID
- CLNDN
annotations/clinvar.parquet
file content
➜ pqrs head annotations/clinvar.parquet
{clinvar_id: 2205837, id: 11650605831284591550, ALLELEID: 2193183, CLNDN: ["Inborn_genetic_diseases"]}
{clinvar_id: 2252161, id: 2295086632353399847, ALLELEID: 2238986, CLNDN: ["Inborn_genetic_diseases"]}
{clinvar_id: 2396347, id: 11033100074712141168, ALLELEID: 2386655, CLNDN: ["Inborn_genetic_diseases"]}
{clinvar_id: 2288999, id: 10487392163259126218, ALLELEID: 2278803, CLNDN: ["Inborn_genetic_diseases"]}
{clinvar_id: 2351346, id: 5356120651941363990, ALLELEID: 2333177, CLNDN: ["Inborn_genetic_diseases"]}
Querying¶
You can use any tool or software library supporting the parquet format to use the files generated by variantplaner
.
We show you how to use files with polars-cli and duckdb.
polars-cli¶
Count variants¶
〉select count(*) from read_parquet('variants.parquet');
┌─────────┐
│ count │
│ --- │
│ u32 │
╞═════════╡
│ 7852699 │
└─────────┘
check result with pqrs
We can check we have same result with pqrs
➜ pqrs rowcount variants.parquet
File Name: variants.parquet: 7852699 rows
Filter variants from annotations:¶
Get all variant with a AF_ESP upper than 0.9999
〉select chr, pos, ref, alt, AF_ESP from read_parquet('variants.parquet') as v left join read_parquet('annotations/clinvar.parquet') as c on c.id=v.id where AF_ESP>0.9999;
┌─────┬──────────┬─────┬─────┬─────────┐
│ chr ┆ pos ┆ ref ┆ alt ┆ AF_ESP │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u8 ┆ u64 ┆ str ┆ str ┆ f64 │
╞═════╪══════════╪═════╪═════╪═════════╡
│ 10 ┆ 16901372 ┆ G ┆ C ┆ 0.99992 │
│ 11 ┆ 78121030 ┆ T ┆ A ┆ 0.99992 │
└─────┴──────────┴─────┴─────┴─────────┘
Get sample have variant¶
Get all variant and sample with GENEINFO equal to 'SAMD11:148398'
〉select distinct chr, pos, ref, alt, sample from read_parquet('variants.parquet') as v left join read_parquet('genotypes/samples/*') as g on v.id=g.id left join read_parquet('annotations/clinvar.parquet') as a on v.id=a.id WHERE GENEINFO='SAMD11:148398';
┌─────┬────────┬─────┬─────┬────────┐
│ chr ┆ pos ┆ ref ┆ alt ┆ sample │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u8 ┆ u64 ┆ str ┆ str ┆ str │
╞═════╪════════╪═════╪═════╪════════╡
│ 1 ┆ 942451 ┆ T ┆ C ┆ HG003 │
│ 1 ┆ 942451 ┆ T ┆ C ┆ HG001 │
│ 1 ┆ 942451 ┆ T ┆ C ┆ HG007 │
│ 1 ┆ 942451 ┆ T ┆ C ┆ HG004 │
│ … ┆ … ┆ … ┆ … ┆ … │
│ 1 ┆ 942934 ┆ G ┆ C ┆ HG003 │
│ 1 ┆ 943937 ┆ C ┆ T ┆ HG004 │
│ 1 ┆ 942451 ┆ T ┆ C ┆ HG002 │
│ 1 ┆ 942451 ┆ T ┆ C ┆ HG006 │
└─────┴────────┴─────┴─────┴────────┘
duckdb¶
Count variants¶
D select count(*) from read_parquet('variants.parquet');
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 7852699 │
└──────────────┘
check result with pqrs
We can check we have same result with pqrs
➜ pqrs rowcount variants.parquet
File Name: variants.parquet: 7852699 rows
Filter variants from annotations:¶
Get all variant with a AF_ESP upper than 0.9999
D select chr, pos, ref, alt, AF_ESP from read_parquet('variants.parquet') as v left join read_parquet('annotations/clinvar.parquet') as c on c.id=v.id where AF_ESP>0.9999;
┌───────┬──────────┬─────────┬─────────┬─────────┐
│ chr │ pos │ ref │ alt │ AF_ESP │
│ uint8 │ uint64 │ varchar │ varchar │ double │
├───────┼──────────┼─────────┼─────────┼─────────┤
│ 10 │ 16901372 │ G │ C │ 0.99992 │
│ 11 │ 78121030 │ T │ A │ 0.99992 │
└───────┴──────────┴─────────┴─────────┴─────────┘
Get sample have variant¶
Get all variant and sample with GENEINFO equal to 'SAMD11:148398'
D select distinct chr, pos, ref, alt, sample from read_parquet('variants.parquet') as v left join read_parquet('genotypes/samples/*') as g on v.id=g.id left join read_parquet('annotations/clinvar.parquet') as a on v.id=a.id WHERE GENEINFO='SAMD11:148398';
┌───────┬────────┬─────────┬─────────┬─────────┐
│ chr │ pos │ ref │ alt │ sample │
│ uint8 │ uint64 │ varchar │ varchar │ varchar │
├───────┼────────┼─────────┼─────────┼─────────┤
│ 1 │ 942451 │ T │ C │ HG002 │
│ 1 │ 942934 │ G │ C │ HG002 │
│ 1 │ 942451 │ T │ C │ HG003 │
│ 1 │ 942934 │ G │ C │ HG003 │
│ 1 │ 942451 │ T │ C │ HG007 │
│ 1 │ 943937 │ C │ T │ HG007 │
│ 1 │ 942451 │ T │ C │ HG001 │
│ 1 │ 942451 │ T │ C │ HG004 │
│ 1 │ 943937 │ C │ T │ HG004 │
│ 1 │ 942451 │ T │ C │ HG006 │
├───────┴────────┴─────────┴─────────┴─────────┤
│ 10 rows 5 columns │
└──────────────────────────────────────────────┘
Use genotype partition¶
In this example, I'll show how I interact with the data structures created by variantplaner.
Import¶
import duckdb
import polars
import variantplaner
Get variants¶
query = f"""
SELECT
*
FROM
read_parquet('variants.parquet') as v
WHERE
v.chr == '19'
"""
variants = duckdb.query(query).pl()
Add annotations¶
query = f"""
SELECT
v.*, c.CLNSIG
FROM
variants as v
JOIN
read_parquet('annotations/clinvar.parquet') as c
ON
v.id == c.id
WHERE
c.CLNSIG LIKE '%Patho%'
"""
annotations = duckdb.query(query).pl()
Add genotypes¶
def worker(name_data: (str, polars.DataFrame)) -> polars.DataFrame:
"""Request genotype homozygote variant."""
name, data = name_data
query = f"""
SELECT
data.*, g.*,
FROM
data
JOIN
read_parquet('genotypes/variants/id_part={name}/0.parquet') as g ON data.id = g.id
WHERE
g.gt == 2
"""
df = duckdb.query(query).pl()
return df
annotations = variantplaner.normalization.add_id_part(annotations)
all_genotypes = []
for data in map(worker, annotations.group_by(by="id_part")):
if data is not None:
all_genotypes.append(data)
genotypes = polars.concat(all_genotypes)
genotypes is polars.DataFrame with pathogene homozygote variants in chromosome 19.