normalization
Functions used to normalize data.
Functions:
- add_id_part – Add column id part.
- add_variant_id – Add a column id of variants.
add_id_part
add_id_part(
lf: LazyFrame, number_of_bits: int = 8
) -> LazyFrame
Add column id part.
If the id is a large-variant id, id_part is set to 255; otherwise, id_part is taken from the 8 most significant bits of the id.
Parameters:
- lf (LazyFrame) – polars.LazyFrame containing an id column.
Returns:
- LazyFrame – polars.LazyFrame with the id_part column added.
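The rule described above can be illustrated for a single id in plain Python. This is only a sketch, not the library's implementation. The helper name `id_part_of` is hypothetical, and the sketch assumes that a large-variant id is one whose most significant bit is 1, that the flag bit is counted among the leading bits, and that the "255" value generalizes to an all-ones value for other bit widths.

```python
def id_part_of(variant_id: int, number_of_bits: int = 8) -> int:
    """Illustrative partition key for an unsigned 64-bit variant id."""
    if not 0 <= variant_id < 2**64:
        raise ValueError("variant_id must fit in an unsigned 64-bit integer")
    if variant_id >> 63 == 1:
        # Assumption: the most significant bit flags a large-variant id.
        # All-ones gives 255 for the default number_of_bits=8.
        return (1 << number_of_bits) - 1
    # Otherwise keep only the number_of_bits most significant bits.
    return variant_id >> (64 - number_of_bits)
```

With the default of 8 bits, ids that share the same leading byte land in the same part, and every large-variant id lands in part 255.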
Source code in src/variantplaner/normalization.py
add_variant_id
add_variant_id(
lf: LazyFrame, chrom2length: LazyFrame
) -> LazyFrame
Add a column id of variants.
The id is computed with one of two algorithms, chosen according to the cumulative length of the reference and alternative sequences.
If the cumulative length of the reference and alternative sequences is short, the leftmost bit of the id is set to 0, then a unique 63-bit hash of the variant is calculated.
If the cumulative length of the reference and alternative sequences is long, the leftmost bit of the id is set to 1, and the remaining 63 bits hold a hash of the chromosome, position, reference and alternative sequences, computed with the hash function used in Firefox and truncated to fit.
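The bit layout can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the flag is assumed to be the most significant bit of an unsigned 64-bit id (as in the short-variant case), the helper names are hypothetical, and BLAKE2b from the standard library stands in for the Firefox hash function, which is not reproduced here. The short-variant encoding itself is not specified in this page, so the sketch takes its 63-bit payload as given.

```python
import hashlib

MASK_63 = (1 << 63) - 1  # the low 63 bits of the id


def short_variant_id(payload: int) -> int:
    """Tag a precomputed 63-bit payload as a short-variant id (flag bit 0)."""
    if not 0 <= payload <= MASK_63:
        raise ValueError("payload must fit in 63 bits")
    return payload


def long_variant_id(chrom: str, pos: int, ref: str, alt: str) -> int:
    """Tag a 63-bit hash of the variant as a long-variant id (flag bit 1)."""
    key = f"{chrom}:{pos}:{ref}:{alt}".encode()
    # Stand-in hash: BLAKE2b, NOT the Firefox hash mentioned in the text.
    digest = hashlib.blake2b(key, digest_size=8).digest()
    payload = int.from_bytes(digest, "big") & MASK_63
    return (1 << 63) | payload


def is_long_variant_id(variant_id: int) -> bool:
    """Read the flag bit back from an id."""
    return variant_id >> 63 == 1
```

Reserving one bit for the flag means the two id families can never collide with each other, whatever the payloads are.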
If lf.columns contains SVTYPE and SVLEN, then for each variant whose alt matches the regex <([^:]+).*> with the captured group equal to SVTYPE, alt is replaced by the concatenation of SVTYPE and the first value of SVLEN.
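The symbolic-allele rewrite can be sketched for a single record with the standard `re` module. The function name `rewrite_alt` is hypothetical, and the exact concatenation (no separator, SVLEN kept with its sign) is an assumption. The library applies the equivalent logic to LazyFrame columns.

```python
import re

# Regex quoted in the description: captures the text before the first ':'.
SYMBOLIC_ALT = re.compile(r"<([^:]+).*>")


def rewrite_alt(alt: str, svtype: str, svlen: list[int]) -> str:
    """Replace a symbolic alt such as <DEL> by SVTYPE + first SVLEN value."""
    match = SYMBOLIC_ALT.fullmatch(alt)
    if match and match.group(1) == svtype and svlen:
        # Assumption: plain string concatenation, no separator.
        return f"{svtype}{svlen[0]}"
    # Sequence alleles and non-matching symbolic alleles are left untouched.
    return alt
```

For example, `rewrite_alt("<DUP:TANDEM>", "DUP", [120, 5])` captures `DUP` from the symbolic allele and uses only the first SVLEN value.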
Parameters:
- lf (LazyFrame) – polars.LazyFrame containing the chr, pos, ref and alt columns.
- chrom2length (LazyFrame) – polars.LazyFrame containing the chr and length columns.
Returns:
- LazyFrame – polars.LazyFrame with the id column added.
Source code in src/variantplaner/normalization.py