ariantPlaner¶
A toolkit to manage many variants from many samples, with limited resources.
Installation¶
pip install git+https://github.com/natir/variantplaner.git@0.3.1
With uv
:
python -m pip install --user pipx
pipx install git+https://github.com/natir/variantplaner.git@0.3.1
Usage¶
This section presents basic usage. For a more complete exemple checkout our usage page.
Warning
variantplaner
doesn't support compressed VCFs. This is a downstream trouble we are aware of and sorry about.
Extract data from one vcf into several parquet files¶
With variantplaner
, you can parse an input VCF and save the relevant data into several parquet files.
variantplaner vcf2parquet -i input.vcf -v variants.parquet -g genotypes.parquet -a annotations.parquet
-g
option isn't mandatory. If not set you will lose genotyping information, and ifGT
field is present in the input VCF then only heterozygote or homozygote variants will be kept.-a
option isn't mandatory. If not set you will lose "INFO" fields information.
Genotypes encoding:
gt field in parquet file | Meaning |
---|---|
0 | variant not present |
1 | heterozygote |
2 | homozygote |
3 | no information (only used in transmission file) |
Convert parquet files back to vcf¶
variantplaner parquet2vcf -i variants.parquet -g genotypes.parquet -o output.vcf
-g
option isn't mandatory if not set the information isn't added. This options has many options that control the behavior of this subcommand, we apologize for this complexity.
Structuration of data¶
Merge variants¶
Danger
This command can have huge memory and disk usage
variantplaner struct -i variants/1.parquet -i variants/2.parquet -i variants/3.parquet … -i variants/n.parquet variants -o variants.parquet
Tip
By default temporary files are written to /tmp, but you can set your TMPDIR
, TEMP
or TMP
environment variables to change this behavior.
This command uses the divide-and-conquer algorithm to perform variants merging. The -b|--bytes-memory-limit
option controls the size (in bytes) of each file chunk. Empirically RAM usage will be ten times this limit.
Partitioning genotypes¶
Danger
This command can have huge disk usage
variantplaner struct -i genotypes/1.parquet -i genotypes/2.parquet -i genotypes/3.parquet … -i genotypes/n.parquet genotypes -p partition_prefix/
Annotations¶
You can export annotation fields from VCF or CSV/TSV files into parquet files. To do so, variantplaner provides the annotations
subcommand.
Command:
variantplaner annotations -i $INPUT_FILE -o $OUTPUT_PARQUET $INPUT_TYPE [OPTIONS...]
Where:
-i|--input-path
is the input file (required)-o|--output-path
is the output parquet file (required)$INPUT_TYPE
is whether VCF or CSV (see below for the different value types)
Following OPTIONS are input type-specific (see below).
VCF format¶
If you wish to export CLNDN
and AF_ESP
fields from annotations.vcf
into clinvar.parquet
, you can run the following command:
variantplaner annotations -i annotations.vcf -o clinvar.parquet vcf -r annot_id --info CLNDN --info AF_ESP
clinvar.parquet
will contain id
of variant as well as all the info fields you've selected with the info
option. If not set, all the info columns will end up in the output file.
Options:
-r|--rename-id
: Can be used to rename vcf id column name (default isvid
).-i|--info
: Lets you select the info fields you wish to output. If not set, this will export them all.vcf
: If the input file type is VCF
Tip
Mind the vcf
argument, as the following options depend on the input file type.
CSV or TSV format¶
variantplaner annotations -i annotations.tsv -o annotations.parquet csv -c chr -p pos -r ref -a alt -s$'\t' --info CLNDN --info AF_ESP
Unlike the VCF format, variantplaner
has no way to tell which columns in the CSV/TSV file correspond to the relevant fields of a variant file. This is why you need to specify the column names in the options (requires a header).
Options:
-c|--chromosome
: Name of chromosome column-p|--position
: Name of position column-r|--reference
: Name of reference column-a|--alternative
: Name of alternative column-i|--info
: Lets you select the info fields you'd like to output. If not set, this will export them all.-s|--separator
: A single byte character to use as a delimiter in the input file (defaults to,
)
Metadata¶
JSON format¶
variantplaner metadata -i metadata.json -o metadata.parquet json -f sample -f link -f kindex
Csv format¶
variantplaner metadata -i metadata.csv -o metadata.parquet csv -c sample -c link -c kindex
Generate¶
Variants transmission¶
If you study germline variants it's useful to calculate the familial origin of variants.
variantplaner generate transmission -i genotypes.parquet -I index_sample_name -m mother_sample_name -f father_sample_name -t transmission.parquet
genotypes.parquet
file with variants of all family. This file must contains gt
and samples
columns.
In transmission.parquet
each line contains an index sample variants, index, mother, father genotypes sample information and also column origin.
Origin column contains a number with 3 digit:
#~"
││└ father genotype
│└─ mother genotype
└── index genotype
You can also use pedigree file:
variantplaner generate transmission -i genotypes.parquet -p family.ped -t transmission.parquet
Danger
This command could have important RAM usage (propotionaly to number of sample index variants)
Contribution¶
All contributions are welcome, see our "How to contribute" page.