Sake Request¶
Provides an object and a few functions to help users query sake.
It's a wrapper around duckdb and thriller functions, so if sake_request doesn't meet your needs, feel free to draw inspiration from it.
Create request object¶
import pathlib
import sake
sake_path = pathlib.Path("/path/to/your/sake")
sake_db = sake.Sake(sake_path)
The sake_db object stores:
- the paths used by sake requests
- the number of threads that can be used; by default, the value returned by os.cpu_count()
- whether tqdm is activated; by default it is not
- a db object that holds the duckdb connection
sake_db = sake.Sake(
# mandatory argument
sake_path,
# optional argument
threads=3,
activate_tqdm=True,
# overwrite annotations_path
annotations_path="my_annotations"
)
This sake_db object uses 3 threads, activates the tqdm progress bar, and sets the annotations path to sake_path / "my_annotations" instead of the default value.
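As a small illustration of the two values mentioned above, the sketch below only uses the standard library and does not call sake. It assumes a relative annotations_path is joined onto sake_path as described; the fallback to 1 thread when os.cpu_count() returns None is an assumption of this sketch, not documented sake behavior.

```python
import os
import pathlib

# Placeholder root, same as in the examples above (PurePosixPath never touches the disk).
sake_path = pathlib.PurePosixPath("/path/to/your/sake")

# Override resolved against the sake root, as described in the text.
annotations_path = sake_path / "my_annotations"

# Default thread count when `threads` is not given.
# os.cpu_count() may return None, so this sketch falls back to 1 (assumption).
default_threads = os.cpu_count() or 1

print(annotations_path)
print(default_threads)
```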
Get variants from a genomic region¶
df = sake_db.get_interval("germline", 10, 329_034, 1_200_340)
df is a polars.DataFrame; you can convert it to pandas with to_pandas() and build one from pandas with from_pandas(). The result contains the chr, pos, ref and alt columns, which are the minimum needed to define a variant, plus an id column, which is sake's almost-unique variant id.
If you have multiple regions, you can run:
target_chrs = ["1", "2", "3"]
target_start = [10_000, 40_232, 80_000]
target_stop = [199_232, 50_123, 800_000]
df = sake_db.get_intervals(
"germline",
target_chrs,
target_start,
target_stop
)
You can think of get_intervals as a loop over get_interval.
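To make the "loop over get_interval" idea concrete, here is a minimal pure-Python sketch on toy data. The names toy_get_interval and toy_get_intervals are hypothetical and are not part of sake; treating both bounds as inclusive is also an assumption, since the text does not say how sake handles interval boundaries.

```python
# Toy variants; in sake the data comes back as a polars.DataFrame instead.
VARIANTS = [
    {"chr": "1", "pos": 15_000, "ref": "A", "alt": "T"},
    {"chr": "1", "pos": 250_000, "ref": "G", "alt": "C"},
    {"chr": "2", "pos": 45_000, "ref": "C", "alt": "G"},
    {"chr": "3", "pos": 90_000, "ref": "T", "alt": "A"},
]


def toy_get_interval(variants, chrom, start, stop):
    """Keep variants on `chrom` with start <= pos <= stop (inclusive bounds assumed)."""
    return [v for v in variants if v["chr"] == chrom and start <= v["pos"] <= stop]


def toy_get_intervals(variants, chroms, starts, stops):
    """Query each region in turn and concatenate the results."""
    if not (len(chroms) == len(starts) == len(stops)):
        raise ValueError("chroms, starts and stops must have the same length")
    result = []
    for chrom, start, stop in zip(chroms, starts, stops):
        result.extend(toy_get_interval(variants, chrom, start, stop))
    return result


hits = toy_get_intervals(
    VARIANTS,
    ["1", "2", "3"],
    [10_000, 40_232, 80_000],
    [199_232, 50_123, 800_000],
)
print([v["pos"] for v in hits])
```

The three lists are read position by position, so the first region is chromosome 1 from 10,000 to 199,232, and so on.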
Get variants from prescription¶
df = sake_db.get_variant_of_prescription("AAAA", "germline")
The DataFrame contains all variant (id, chr, pos, …) and genotype (gt, ad, …) information for prescription AAAA in the germline dataset.
Get variants from an annotation¶
df = sake_db.get_annotations("clinvar", "20241103", "germline")
The DataFrame contains all variant (id, chr, pos, …) and annotation information. By default, columns are renamed with the annotation name as a prefix; pass rename_column=False in the call to change this behavior. If you only want some columns, use the select_columns parameter with the original names, without the prefix.
Add variants to a dataframe¶
Your dataframe must contain an id column (see variants).
df = sake_db.add_variants(df, "germline")
Now df stores variant information:
- chr: chromosome name
- pos: position of the variant
- ref: reference sequence
- alt: alternative sequence
Add genotypes to variants¶
Your dataframe must contain an id column (see variants).
df = sake_db.add_genotypes(df, "germline")
Now df stores variants with sample and genotyping information:
- gt: number of 1 in the GT column of the vcf; phasing and . information are lost
- ad: string that stores the AD column of the vcf
- dp: DP column of the vcf
- gq: GQ column of the vcf
df = sake_db.add_genotypes(df, "germline", drop_column=["gq"])
This df does not store the gq column. If you don't need a column, add it to drop_column.
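The gt encoding described in this section (the number of 1 in the VCF GT field, with phasing and . dropped) can be sketched as below. The helper count_alt_alleles is hypothetical and is not sake's implementation; in particular, returning 0 for a fully missing genotype is an assumption of this sketch.

```python
import re


def count_alt_alleles(gt):
    """Count alleles equal to "1" in a VCF GT string such as "0/1" or "1|1".

    Both "/" (unphased) and "|" (phased) separators are accepted, so the
    phasing information is discarded. Missing alleles (".") are not counted.
    """
    alleles = re.split(r"[/|]", gt)
    return sum(1 for allele in alleles if allele == "1")


for example in ["0/0", "0/1", "1|0", "1/1", "./."]:
    print(example, count_alt_alleles(example))
```

With this encoding, "0/1" and "1|0" become the same value, which is why the original phasing cannot be recovered from gt.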
Add annotations¶
df = sake_db.add_annotations(df, "gnomad", "3.1.2")
By default, all columns of an annotation are prefixed with the annotation name. It's likely that not all columns are of interest to you; use the select_columns parameter to list the columns you want, using the original names, without the prefix.
df = sake_db.add_annotations(
df,
"gnomad",
"genomes.4.1",
rename_column=False,
select_columns=["AC"]
)
This call adds to df a column AC from the gnomad annotations.
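The interplay between select_columns and rename_column can be illustrated with a small pure-Python sketch of the resulting column names. The function annotated_columns is hypothetical, and the "_" separator between the annotation name and the column name is an assumption: the text only says the annotation name is used as a prefix.

```python
def annotated_columns(name, columns, select_columns=None, rename_column=True, sep="_"):
    """Return the column names an annotation would contribute.

    `select_columns` uses the original (unprefixed) names.
    `sep` is an assumed separator; the real prefix format may differ.
    """
    if select_columns is None:
        selected = list(columns)
    else:
        selected = [c for c in columns if c in select_columns]
    if rename_column:
        return [f"{name}{sep}{c}" for c in selected]
    return selected


gnomad_columns = ["AC", "AN", "AF"]  # toy column list, not the real gnomad schema

print(annotated_columns("gnomad", gnomad_columns))
print(annotated_columns("gnomad", gnomad_columns, select_columns=["AC"]))
print(annotated_columns("gnomad", gnomad_columns, select_columns=["AC"], rename_column=False))
```

The selection is applied to the original names first and the prefix is added afterwards, which matches the advice to pass unprefixed names to select_columns.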
Add sample information¶
Your dataframe must contain a sample column (see genotypes).
df = sake_db.sample_info(df)
You can select which columns to add to your dataframe:
df = sake_db.sample_info(df, select_columns=["pid_crc"])
Only the new pid_crc column is added to the result.
Add transmission information¶
Transmission information is available only for the germline dataset and for kindex samples. Your dataframe must contain a pid_crc column (see sample information).
index_transmission = sake_db.add_transmissions(df)
The result contains only variants of kindex samples, with genotype columns for the index sample, the father and the mother, each with the corresponding prefix, plus an origin column. More details on how the origin column is built are in the variantplaner documentation.