VariantPlaner

A toolkit to manage many variants from many samples, with limited resources.

Installation

pip install git+https://github.com/natir/variantplaner.git@0.3.1

With pipx:

python -m pip install --user pipx
pipx install git+https://github.com/natir/variantplaner.git@0.3.1

Usage

This section presents basic usage. For a more complete example, check out our usage page.

Warning

variantplaner doesn't support compressed VCFs. This is an upstream issue that we are aware of and sorry about.

Extract data from one vcf into several parquet files

With variantplaner, you can parse an input VCF and save the relevant data into several parquet files.

variantplaner vcf2parquet -i input.vcf -v variants.parquet -g genotypes.parquet -a annotations.parquet
  • The -g option isn't mandatory. If it is not set, genotype information is lost, and if the GT field is present in the input VCF, only heterozygous or homozygous variants are kept.
  • The -a option isn't mandatory. If it is not set, the "INFO" field information is lost.

Genotypes encoding:

gt field in parquet file    Meaning
0                           variant not present
1                           heterozygote
2                           homozygote
3                           no information (only used in transmission file)
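
When reading a genotypes parquet file back, the gt codes listed above can be translated into labels. The following is a minimal Python sketch, assuming the codes are stored as plain integers. The names GT_MEANING and describe_gt are hypothetical and are not part of variantplaner.

```python
# Hypothetical helper, not part of variantplaner: maps the documented
# gt codes to their meaning.
GT_MEANING = {
    0: "variant not present",
    1: "heterozygote",
    2: "homozygote",
    3: "no information",  # only used in transmission files
}


def describe_gt(gt: int) -> str:
    """Return the documented meaning of a gt code, or raise on an unknown code."""
    try:
        return GT_MEANING[gt]
    except KeyError:
        raise ValueError(f"unknown gt code: {gt}") from None


print(describe_gt(1))  # heterozygote
```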

Convert parquet files back to vcf

variantplaner parquet2vcf -i variants.parquet -g genotypes.parquet -o output.vcf

The -g option isn't mandatory; if it is not set, genotype information isn't added. This subcommand has many options that control its behavior; we apologize for this complexity.

Structuration of data

Merge variants

Danger

This command can use a huge amount of memory and disk space.

variantplaner struct -i variants/1.parquet -i variants/2.parquet -i variants/3.parquet  -i variants/n.parquet variants -o variants.parquet

Tip

By default temporary files are written to /tmp, but you can set your TMPDIR, TEMP or TMP environment variables to change this behavior.

This command uses a divide-and-conquer algorithm to merge variants. The -b|--bytes-memory-limit option controls the size (in bytes) of each file chunk. Empirically, RAM usage is about ten times this limit.
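
To illustrate the idea, here is a minimal Python sketch of a divide-and-conquer merge on in-memory sets, together with the rule-of-thumb RAM estimate. It is not variantplaner's implementation, which works on parquet file chunks; merge_variants, estimated_ram_bytes, and the sample variants are hypothetical names and made-up data.

```python
from __future__ import annotations

# A variant is identified here by (chromosome, position, reference, alternative).
Variant = tuple[str, int, str, str]


def merge_variants(chunks: list[set[Variant]]) -> set[Variant]:
    """Merge chunks of variants by recursively halving the list of chunks."""
    if not chunks:
        return set()
    if len(chunks) == 1:
        return set(chunks[0])
    middle = len(chunks) // 2
    left = merge_variants(chunks[:middle])    # conquer the first half
    right = merge_variants(chunks[middle:])   # conquer the second half
    return left | right                       # combine, dropping duplicates


def estimated_ram_bytes(bytes_memory_limit: int, factor: int = 10) -> int:
    """Rule of thumb from the text above: RAM usage is about ten times the limit."""
    return bytes_memory_limit * factor


# Made-up chunks, as if each came from one input file.
chunks = [
    {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C")},
    {("chr1", 100, "A", "T")},
    {("chr2", 50, "T", "G")},
]
print(sorted(merge_variants(chunks)))
print(estimated_ram_bytes(1_000_000_000))  # 10000000000
```

Under this rule of thumb, a limit of 1 GB per chunk would need roughly 10 GB of RAM.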

Partitioning genotypes

Danger

This command can use a huge amount of disk space.

variantplaner struct -i genotypes/1.parquet -i genotypes/2.parquet -i genotypes/3.parquet  -i genotypes/n.parquet genotypes -p partition_prefix/

Annotations

You can export annotation fields from VCF or CSV/TSV files into parquet files. To do so, variantplaner provides the annotations subcommand.

Command:

variantplaner annotations -i $INPUT_FILE -o $OUTPUT_PARQUET $INPUT_TYPE [OPTIONS...]

Where:

  • -i|--input-path is the input file (required)
  • -o|--output-path is the output parquet file (required)
  • $INPUT_TYPE is either vcf or csv (see below for the different input types)

The OPTIONS that follow are specific to the input type (see below).

VCF format

If you wish to export CLNDN and AF_ESP fields from annotations.vcf into clinvar.parquet, you can run the following command:

variantplaner annotations -i annotations.vcf -o clinvar.parquet vcf -r annot_id --info CLNDN --info AF_ESP

clinvar.parquet will contain the id of each variant as well as all the info fields you've selected with the --info option. If --info is not set, all the info columns will end up in the output file.

Options:

  • -r|--rename-id: Can be used to rename vcf id column name (default is vid).
  • -i|--info: Lets you select the info fields you wish to output. If not set, this will export them all.
  • vcf: Indicates that the input file type is VCF.
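
The effect of selecting info fields can be pictured with a small Python sketch that parses a VCF INFO string (semicolon-separated key=value entries) and keeps only the requested keys. This only illustrates the selection logic and is not how variantplaner is implemented; select_info is a hypothetical name and the INFO string is made up.

```python
from __future__ import annotations


def select_info(info: str, wanted: list[str] | None = None) -> dict[str, str]:
    """Parse a VCF INFO string and keep only the wanted keys (all keys if None)."""
    fields: dict[str, str] = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue  # skip empty entries and the "missing" marker
        key, _, value = entry.partition("=")  # flag entries have no "=value" part
        fields[key] = value
    if wanted is None:
        return fields
    return {key: fields[key] for key in wanted if key in fields}


# Made-up INFO column content.
info = "CLNDN=not_provided;AF_ESP=0.01;CLNSIG=Benign"
print(select_info(info, ["CLNDN", "AF_ESP"]))  # {'CLNDN': 'not_provided', 'AF_ESP': '0.01'}
```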

Tip

Mind the vcf argument, as the following options depend on the input file type.

CSV or TSV format

variantplaner annotations -i annotations.tsv -o annotations.parquet csv -c chr -p pos -r ref -a alt -s$'\t' --info CLNDN --info AF_ESP

Unlike the VCF format, variantplaner has no way to tell which columns in the CSV/TSV file correspond to the relevant fields of a variant file. This is why you need to specify the column names in the options (requires a header).

Options:

  • -c|--chromosome: Name of chromosome column
  • -p|--position: Name of position column
  • -r|--reference: Name of reference column
  • -a|--alternative: Name of alternative column
  • -i|--info: Lets you select the info fields you'd like to output. If not set, this will export them all.
  • -s|--separator: A single byte character to use as a delimiter in the input file (defaults to ,)
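
The reason a header is required can be shown with a short Python sketch that reads a tab-separated file and looks up the variant columns by name. It uses only the standard csv module and does not reflect variantplaner's internals; read_variants and the sample content are hypothetical.

```python
import csv
import io

# Made-up TSV content with a header line; the column names are arbitrary.
TSV = "chr\tpos\tref\talt\tCLNDN\n1\t12345\tA\tG\tnot_provided\n"


def read_variants(text, chromosome, position, reference, alternative, separator=","):
    """Yield (chromosome, position, reference, alternative) using header names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=separator)
    for row in reader:
        # Without a header there would be no names to look the columns up by.
        yield (row[chromosome], int(row[position]), row[reference], row[alternative])


variants = list(read_variants(TSV, "chr", "pos", "ref", "alt", separator="\t"))
print(variants)  # [('1', 12345, 'A', 'G')]
```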

Metadata

JSON format

variantplaner metadata -i metadata.json -o metadata.parquet json -f sample -f link -f kindex

CSV format

variantplaner metadata -i metadata.csv -o metadata.parquet csv -c sample -c link -c kindex

Generate

Variants transmission

If you study germline variants it's useful to calculate the familial origin of variants.

variantplaner generate transmission -i genotypes.parquet -I index_sample_name -m mother_sample_name -f father_sample_name -t transmission.parquet

genotypes.parquet is a file containing the variants of the whole family. It must contain the gt and samples columns.

In transmission.parquet, each line contains a variant of the index sample, the genotypes of the index, mother, and father samples, and an origin column.

The origin column contains a three-digit number:

#~"
││└ father genotype
│└─ mother genotype
└── index genotype
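
As an illustration, the following Python sketch splits an origin value into its three genotypes. It assumes that each digit uses the gt encoding from the "Genotypes encoding" section and that a value stored as an integer may have lost leading zeros; decode_origin is a hypothetical helper, not part of variantplaner.

```python
from __future__ import annotations


def decode_origin(origin: int | str) -> dict[str, int]:
    """Split a three-digit origin value into index, mother, and father genotypes."""
    digits = str(origin).zfill(3)  # restore leading zeros lost by integer storage
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError(f"origin must be a three-digit number, got {origin!r}")
    index, mother, father = (int(digit) for digit in digits)
    return {"index": index, "mother": mother, "father": father}


print(decode_origin(210))  # {'index': 2, 'mother': 1, 'father': 0}
```

Under these assumptions, 210 would describe a variant that is homozygous in the index sample, heterozygous in the mother, and absent in the father.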

You can also use a pedigree file:

variantplaner generate transmission -i genotypes.parquet -p family.ped -t transmission.parquet

Danger

This command can use a lot of RAM (proportional to the number of variants in the index sample).

Contribution

All contributions are welcome, see our "How to contribute" page.