Genotypes

Utility functions for working with genotype data.

See also the examples at:

anhima.gt.is_called(genotypes)

Find non-missing genotype calls.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

Returns:

is_called : ndarray, bool

An array where elements are True if the genotype call is non-missing.

Notes

Applicable to polyploid genotype calls.

Applicable to multiallelic variants.

anhima.gt.is_missing(genotypes)

Find missing genotype calls.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

Returns:

is_missing : ndarray, bool

An array where elements are True if the genotype call is missing.

Notes

Applicable to polyploid genotype calls.

Applicable to multiallelic variants.
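The relationship between is_called and is_missing can be illustrated with plain NumPy. This is a minimal sketch rather than the library's implementation: the helper names are hypothetical, and it assumes that a genotype counts as missing as soon as any of its alleles is -1, which the descriptions above do not spell out.

```python
import numpy as np

def sketch_is_called(genotypes):
    # Assumption: a call is non-missing only if every allele index is >= 0.
    return np.all(np.asarray(genotypes) >= 0, axis=-1)

def sketch_is_missing(genotypes):
    # Assumption: a call is missing if any allele index is negative.
    return np.any(np.asarray(genotypes) < 0, axis=-1)

# 2 variants x 3 samples x ploidy 2
g = np.array([[[0, 0], [0, 1], [-1, -1]],
              [[1, 1], [-1, -1], [2, 1]]])
called = sketch_is_called(g)    # shape (n_variants, n_samples)
missing = sketch_is_missing(g)  # the logical complement of called
```

Under these assumptions the two masks are complementary, and the ploidy dimension is collapsed in the result.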

anhima.gt.is_hom(genotypes)

Find homozygous genotype calls.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

Returns:

is_hom : ndarray, bool

An array where elements are True if the genotype call is homozygous.

Notes

Applicable to polyploid genotype calls.

Applicable to multiallelic variants.

anhima.gt.is_het(genotypes)

Find heterozygous genotype calls.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

Returns:

is_het : ndarray, bool

An array where elements are True if the genotype call is heterozygous.

Notes

Applicable to polyploid genotype calls, although note that all types of heterozygous genotype (i.e., anything not completely homozygous) will give an element value of True.

Applicable to multiallelic variants, although note that the element value will be True in any case where the two alleles in a genotype are different, e.g., (0, 1), (0, 2), (1, 2), etc.
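The polyploid and multiallelic behaviour described in these notes can be sketched with NumPy. The helper names below are hypothetical and this is not the library's code; it also assumes that missing calls are treated as neither homozygous nor heterozygous.

```python
import numpy as np

def sketch_is_hom(genotypes):
    g = np.asarray(genotypes)
    first = g[..., :1]  # first allele, kept broadcastable against g
    # Homozygous: called, and every allele equals the first one.
    return np.all(g >= 0, axis=-1) & np.all(g == first, axis=-1)

def sketch_is_het(genotypes):
    g = np.asarray(genotypes)
    first = g[..., :1]
    # Heterozygous: called, and at least one allele differs from the first.
    return np.all(g >= 0, axis=-1) & np.any(g != first, axis=-1)

diploid = np.array([[0, 0], [0, 1], [0, 2], [1, 2], [2, 2], [-1, -1]])
triploid = np.array([[0, 0, 0], [0, 0, 1], [1, 1, 1]])
hom_diploid = sketch_is_hom(diploid)
het_diploid = sketch_is_het(diploid)
het_triploid = sketch_is_het(triploid)
```

Here (0, 1), (0, 2) and (1, 2) are all flagged as heterozygous, and the triploid call (0, 0, 1) is flagged as well because it is not completely homozygous.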

anhima.gt.is_hom_ref(genotypes)

Find homozygous reference genotype calls.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

Returns:

is_hom_ref : ndarray, bool

An array where elements are True if the genotype call is homozygous reference.

Notes

Applicable to polyploid genotype calls.

Applicable to multiallelic variants.

anhima.gt.is_hom_alt(genotypes)

Find homozygous non-reference genotype calls.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

Returns:

is_hom_alt : ndarray, bool

An array where elements are True if the genotype call is homozygous non-reference.

Notes

Applicable to polyploid genotype calls.

Applicable to multiallelic variants.
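To make the distinction between homozygous reference and homozygous non-reference calls concrete, here is a small NumPy sketch. It illustrates the documented semantics only and is not taken from the library; the variable names are illustrative.

```python
import numpy as np

g = np.array([[0, 0], [0, 1], [1, 1], [2, 2], [1, 2], [-1, -1]])

# True where every allele in the call equals the first allele.
same = np.all(g == g[..., :1], axis=-1)

hom_ref = same & (g[..., 0] == 0)  # all alleles are the reference allele
hom_alt = same & (g[..., 0] > 0)   # all alleles are the same alternate allele
```

Both (1, 1) and (2, 2) count as homozygous non-reference, while the missing call (-1, -1) falls into neither category.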

anhima.gt.count_called(genotypes, axis=None)

Count non-missing genotype calls.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

axis : int, optional

The axis along which to count (0 = variants, 1 = samples).

Returns:

n : int or array

If axis is None, returns the number of called (i.e., non-missing) genotypes. If axis is specified, returns the sum along the given axis.

See also

is_called
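The effect of the axis argument is easiest to see on a small array. The sketch below assumes that each count function amounts to summing the boolean mask produced by the corresponding is_* function; that is an assumption for illustration, not a statement about the library's internals.

```python
import numpy as np

# 2 variants x 3 samples x ploidy 2
g = np.array([[[0, 0], [0, 1], [-1, -1]],
              [[1, 1], [-1, -1], [-1, -1]]])
called = np.all(g >= 0, axis=-1)        # shape (n_variants, n_samples)

n_total = int(np.sum(called))           # axis=None: one overall count
n_per_sample = np.sum(called, axis=0)   # axis=0: sum over variants
n_per_variant = np.sum(called, axis=1)  # axis=1: sum over samples
```

Summing along axis 0 collapses the variants dimension and so yields one count per sample; summing along axis 1 yields one count per variant.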

anhima.gt.count_missing(genotypes, axis=None)

Count missing genotype calls.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

axis : int, optional

The axis along which to count (0 = variants, 1 = samples).

Returns:

n : int or array

If axis is None, returns the number of missing genotypes. If axis is specified, returns the sum along the given axis.

See also

is_missing

anhima.gt.count_hom(genotypes, axis=None)

Count homozygous genotype calls.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

axis : int, optional

The axis along which to count (0 = variants, 1 = samples).

Returns:

n : int or array

If axis is None, returns the number of homozygous genotypes. If axis is specified, returns the sum along the given axis.

See also

is_hom

anhima.gt.count_het(genotypes, axis=None)

Count heterozygous genotype calls.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

axis : int, optional

The axis along which to count (0 = variants, 1 = samples).

Returns:

n : int or array

If axis is None, returns the number of heterozygous genotypes. If axis is specified, returns the sum along the given axis.

See also

is_het

anhima.gt.count_hom_ref(genotypes, axis=None)

Count homozygous reference genotype calls.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

axis : int, optional

The axis along which to count (0 = variants, 1 = samples).

Returns:

n : int or array

If axis is None, returns the number of homozygous reference genotypes. If axis is specified, returns the sum along the given axis.

See also

is_hom_ref

anhima.gt.count_hom_alt(genotypes, axis=None)

Count homozygous non-reference genotype calls.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

axis : int, optional

The axis along which to count (0 = variants, 1 = samples).

Returns:

n : int or array

If axis is None, returns the number of homozygous non-reference genotypes. If axis is specified, returns the sum along the given axis.

See also

is_hom_alt

anhima.gt.max_allele(genotypes, axis=None)

Return the highest allele index.

Parameters:

genotypes : array_like

An array of shape (n_variants, n_samples, ploidy) where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

axis : int, optional

The axis along which to determine the maximum (0 = variants, 1 = samples). If not given, return the highest overall.

Returns:

n : int

The value of the highest allele index present in the genotypes array.

anhima.gt.as_haplotypes(genotypes)

Reshape an array of genotypes to view it as haplotypes by dropping the ploidy dimension.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

Returns:

haplotypes : ndarray

An array of shape (n_variants, n_samples * ploidy).

Notes

Note that if genotype calls are unphased, the haplotypes returned by this function will bear no resemblance to the true haplotypes.

Applicable to polyploid genotype calls.

Applicable to multiallelic variants.
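The reshaping described here can be reproduced with a plain NumPy reshape. This is a sketch of the documented shape change, not the library's code.

```python
import numpy as np

# 2 variants x 2 samples x ploidy 2
g = np.array([[[0, 1], [1, 1]],
              [[0, 0], [2, -1]]])
n_variants = g.shape[0]

# Merge the sample and ploidy axes into a single haplotype axis.
h = g.reshape(n_variants, -1)  # shape (n_variants, n_samples * ploidy)
```

Each row of h lists the alleles of the first sample followed by those of the second, so consecutive columns belong to the same individual.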

anhima.gt.as_n_alt(genotypes)

Transform genotypes into the number of non-reference alleles per call.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

Returns:

gn : ndarray, uint8

An array where each genotype is coded as a single integer counting the number of alternate alleles.

See also

as_012

Notes

Applicable to polyploid genotype calls.

Applicable to multiallelic variants, although this function simply counts the number of non-reference alleles and makes no distinction between different non-reference alleles.

Note that this function returns 0 for missing genotype calls and for homozygous reference genotype calls, because in both cases the number of non-reference alleles is zero.
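The counting rule, including the caveat about missing calls, can be sketched in one line of NumPy. This illustrates the documented behaviour and is not necessarily how the library computes it.

```python
import numpy as np

g = np.array([[0, 0], [0, 1], [1, 2], [2, 2], [-1, -1]])

# Count alleles with index > 0; missing (-1) and reference (0) add nothing.
n_alt = np.sum(g > 0, axis=-1).astype(np.uint8)
```

The calls (1, 2) and (2, 2) both give 2, and the missing call (-1, -1) gives 0 just like (0, 0), so missing calls need to be masked separately if the difference matters.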

anhima.gt.as_012(genotypes, fill=-1)

Transform genotypes, recoding homozygous reference calls as 0, heterozygous calls as 1, homozygous non-reference calls as 2, and missing calls as the fill value (-1 by default).

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

fill : int, optional

Value used to code missing calls.

Returns:

gn : ndarray, int8

An array where each genotype is coded as a single integer as described above.

See also

as_n_alt

Notes

Applicable to polyploid genotype calls, although note that all types of heterozygous genotype (i.e., anything not completely homozygous) will be coded as 1.

Applicable to multiallelic variants, although note the following. All heterozygous genotypes, e.g., (0, 1), (0, 2), (1, 2), ..., will be coded as 1. All homozygous non-reference genotypes, e.g., (1, 1), (2, 2), ..., will be coded as 2.
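The recoding rules can be sketched as follows. The function name sketch_as_012 is hypothetical and the body is an illustration of the coding described above, not the library's implementation.

```python
import numpy as np

def sketch_as_012(genotypes, fill=-1):
    g = np.asarray(genotypes)
    called = np.all(g >= 0, axis=-1)
    same = np.all(g == g[..., :1], axis=-1)
    # Start with every call set to the fill value (i.e., missing).
    out = np.full(g.shape[:-1], fill, dtype=np.int8)
    out[called & same & (g[..., 0] == 0)] = 0  # homozygous reference
    out[called & ~same] = 1                    # any heterozygous call
    out[called & same & (g[..., 0] > 0)] = 2   # homozygous non-reference
    return out

g = np.array([[0, 0], [0, 2], [1, 2], [2, 2], [-1, -1]])
coded = sketch_as_012(g)
coded_fill = sketch_as_012(g, fill=9)
```

Both (0, 2) and (1, 2) are coded as 1, and only the code for the missing call changes when a different fill value is supplied.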

anhima.gt.as_allele_counts(genotypes, alleles=None)

Transform genotypes into allele counts per call.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

alleles : sequence of ints, optional

The alleles to count. If not specified, all alleles will be counted.

Returns:

gac : ndarray, uint8

An array where the ploidy dimension has been replaced by counts of each allele.
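A small sketch may help to picture the output shape. The helper name is hypothetical, and the sketch assumes that "all alleles" means every index from 0 up to the highest allele index present; the library may decide this differently.

```python
import numpy as np

def sketch_allele_counts(genotypes, alleles=None):
    g = np.asarray(genotypes)
    if alleles is None:
        # Assumption: "all alleles" means 0 up to the highest index present.
        alleles = range(int(g.max()) + 1)
    # One count per requested allele, taken over the ploidy axis.
    counts = [np.sum(g == a, axis=-1) for a in alleles]
    return np.stack(counts, axis=-1).astype(np.uint8)

# 2 variants x 2 samples x ploidy 2
g = np.array([[[0, 0], [0, 1]],
              [[1, 2], [-1, -1]]])
gac = sketch_allele_counts(g)                      # shape (2, 2, 3)
gac_alt = sketch_allele_counts(g, alleles=[1, 2])  # shape (2, 2, 2)
```

The last dimension now indexes alleles rather than chromosome copies, and a missing call contributes zero to every allele.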

anhima.gt.pack_diploid(genotypes)

Pack diploid genotypes into a single byte for each genotype, using the left-most 4 bits for the first allele and the right-most 4 bits for the second allele. Allows single byte encoding of diploid genotypes for variants with up to 15 alleles.

Parameters:

genotypes : array_like, int

An array of shape (n_variants, n_samples, ploidy) or (n_variants, ploidy) or (n_samples, ploidy), where each element of the array is an integer corresponding to an allele index (-1 = missing, 0 = reference allele, 1 = first alternate allele, 2 = second alternate allele, etc.).

Returns:

packed : ndarray, int8

An array of genotypes where the ploidy dimension has been collapsed by bit packing the two alleles for each genotype into a single byte.

See also

unpack_diploid

anhima.gt.unpack_diploid(packed)

Unpack an array of diploid genotypes that have been bit packed into single bytes.

Parameters:

packed : array_like

An array of genotypes where the ploidy dimension has been collapsed by bit packing the two alleles for each genotype into a single byte.

Returns:

genotypes : ndarray, int8

An array of genotypes where the ploidy dimension has been restored by unpacking the input array.

See also

pack_diploid
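The 4-bit packing scheme can be sketched as a round trip. The helper names are hypothetical, and two details are assumptions rather than documented behaviour: missing alleles are handled by adding 1 before packing (so -1 is stored as 0), and the packed array is held as unsigned bytes to keep the bit arithmetic simple.

```python
import numpy as np

def sketch_pack(genotypes):
    # Shift by one so that missing (-1) becomes 0 and alleles 0..14 become 1..15.
    g = np.asarray(genotypes, dtype=np.int16) + 1
    first, second = g[..., 0], g[..., 1]
    # First allele in the high 4 bits, second allele in the low 4 bits.
    return ((first << 4) | (second & 0x0F)).astype(np.uint8)

def sketch_unpack(packed):
    p = np.asarray(packed, dtype=np.uint8).astype(np.int16)
    first = (p >> 4) - 1     # high 4 bits, shifted back
    second = (p & 0x0F) - 1  # low 4 bits, shifted back
    return np.stack([first, second], axis=-1).astype(np.int8)

g = np.array([[0, 1], [2, 2], [-1, -1], [14, 0]], dtype=np.int8)
packed = sketch_pack(g)
restored = sketch_unpack(packed)
```

With this offset, each 4-bit half can hold 16 values: one for missing plus allele indices 0 to 14, which is consistent with the limit of 15 alleles stated for pack_diploid.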

anhima.gt.count_genotypes(gn, t, axis=None)

Count genotypes of a given type.

Parameters:

gn : array_like, int

An array of shape (n_variants, n_samples) or (n_variants,) or (n_samples,) where each element is a genotype call coded as a single integer.

t : int

The genotype to count.

axis : int, optional

The axis along which to count (0 = variants, 1 = samples).

Returns:

n : int or array

If axis is None, returns the total number of matching genotypes. If axis is specified, returns the sum along the given axis.

anhima.gt.windowed_genotype_counts(pos, gn, t, window_size, start_position=None, stop_position=None)

Count genotype calls of a given type for a single sample in non-overlapping windows over the genome.

Parameters:

pos : array_like, int

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

gn : array_like, int

A 1-D array of genotypes for a single sample, where each genotype is coded as a single integer.

t : int

The genotype to count.

window_size : int

The size in base-pairs of the windows.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

Returns:

counts : ndarray, int

Genotype counts for each window.

bin_edges : ndarray, float

The edges of the windows.
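Windowed counting of this kind can be sketched with np.histogram. The helper below is hypothetical and makes assumptions that the description does not confirm: the region defaults to the span of pos, and windows are laid out every window_size base pairs starting from the start position.

```python
import numpy as np

def sketch_windowed_counts(pos, gn, t, window_size,
                           start_position=None, stop_position=None):
    pos = np.asarray(pos)
    gn = np.asarray(gn)
    # Assumption: the region defaults to the span of the positions.
    start = pos.min() if start_position is None else start_position
    stop = pos.max() if stop_position is None else stop_position
    # Window edges every window_size base pairs, covering start..stop.
    bin_edges = np.arange(start, stop + window_size, window_size)
    # Count only the positions whose genotype matches t.
    counts, _ = np.histogram(pos[gn == t], bins=bin_edges)
    return counts, bin_edges

pos = np.array([1, 12, 25, 38, 41, 59])
gn = np.array([1, 0, 1, 1, 1, 0])
counts, bin_edges = sketch_windowed_counts(
    pos, gn, t=1, window_size=20, start_position=0, stop_position=60)
```

The result has one count per window and one more edge than there are windows.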

anhima.gt.windowed_genotype_density(pos, gn, t, window_size, start_position=None, stop_position=None)

Compute per-base-pair density of genotype calls of a given type for a single sample in non-overlapping windows over the genome.

Parameters:

pos : array_like, int

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

gn : array_like, int

A 1-D array of genotypes for a single sample, where each genotype is coded as a single integer.

t : int

The genotype to count.

window_size : int

The size in base-pairs of the windows.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

Returns:

density : ndarray, float

Genotype density for each window.

bin_edges : ndarray, float

The edges of the windows.

anhima.gt.windowed_genotype_rate(pos, gn, t, window_size, start_position=None, stop_position=None)

Compute per-variant rate of genotype calls of a given type for a single sample in non-overlapping windows over the genome.

Parameters:

pos : array_like, int

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

gn : array_like, int

A 1-D array of genotypes for a single sample, where each genotype is coded as a single integer.

t : int

The genotype to count.

window_size : int

The size in base-pairs of the windows.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

Returns:

rate : ndarray, float

Per-variant rate for each window.

bin_edges : ndarray, float

The edges of the windows.
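The difference between a per-base-pair density (as in windowed_genotype_density) and a per-variant rate can be shown on the same toy data. This sketch is an interpretation of the two descriptions, not the library's code; in particular, leaving windows without any variants as NaN is an assumption.

```python
import numpy as np

pos = np.array([1, 12, 25, 38, 41, 59])
gn = np.array([1, 0, 1, 1, 1, 0])
t, window_size = 1, 20
bin_edges = np.arange(0, 60 + window_size, window_size)

# Matching calls per window, and all variants per window.
n_matching, _ = np.histogram(pos[gn == t], bins=bin_edges)
n_variants, _ = np.histogram(pos, bins=bin_edges)

# Density: matching calls divided by the window length in base pairs.
density = n_matching / window_size

# Rate: matching calls divided by the number of variants in the window.
# Windows with no variants are left as NaN (an assumption).
rate = np.full(n_matching.shape, np.nan)
np.divide(n_matching, n_variants, out=rate, where=n_variants > 0)
```

Density depends on the window size, whereas the rate depends on how many variants fall in each window, so the two can diverge in regions of uneven variant density.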

anhima.gt.plot_windowed_genotype_counts(pos, gn, t, window_size, start_position=None, stop_position=None, ax=None, plot_kwargs=None)

Plot counts of genotype calls of a given type for a single sample in non-overlapping windows over the genome.

Parameters:

pos : array_like, int

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

gn : array_like, int

A 1-D array of genotypes for a single sample, where each genotype is coded as a single integer.

t : int

The genotype to count.

window_size : int

The size in base-pairs of the windows.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

plot_kwargs : dict-like

Additional keyword arguments passed through to plt.plot.

Returns:

ax : axes

The axes on which the plot was drawn.

anhima.gt.plot_windowed_genotype_density(pos, gn, t, window_size, start_position=None, stop_position=None, ax=None, plot_kwargs=None)

Plot per-base-pair density of genotype calls of a given type for a single sample in non-overlapping windows over the genome.

Parameters:

pos : array_like, int

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

gn : array_like, int

A 1-D array of genotypes for a single sample, where each genotype is coded as a single integer.

t : int

The genotype to count.

window_size : int

The size in base-pairs of the windows.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

plot_kwargs : dict-like

Additional keyword arguments passed through to plt.plot.

Returns:

ax : axes

The axes on which the plot was drawn.

anhima.gt.plot_windowed_genotype_rate(pos, gn, t, window_size, start_position=None, stop_position=None, ax=None, plot_kwargs=None)

Plot per-variant rate of genotype calls of a given type for a single sample in non-overlapping windows over the genome.

Parameters:

pos : array_like, int

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

gn : array_like, int

A 1-D array of genotypes for a single sample, where each genotype is coded as a single integer.

t : int

The genotype to count.

window_size : int

The size in base-pairs of the windows.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

plot_kwargs : dict-like

Additional keyword arguments passed through to plt.plot.

Returns:

ax : axes

The axes on which the plot was drawn.

anhima.gt.plot_discrete_calldata(a, labels=None, colors='wbgrcmyk', states=None, ax=None, pcolormesh_kwargs=None)

Plot a color grid from discrete calldata (e.g., genotypes).

Parameters:

a : array_like, int, shape (n_variants, n_samples)

2-dimensional array of integers containing the call data to plot.

labels : sequence of strings, optional

Axis labels (e.g., sample IDs).

colors : sequence, optional

Colors to use for different values of the array.

states : sequence, optional

Manually specify discrete calldata states (if not given will be determined from the data).

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

pcolormesh_kwargs : dict-like, optional

Additional keyword arguments passed through to plt.pcolormesh.

Returns:

ax : axes

The axes on which the plot was drawn.

anhima.gt.plot_continuous_calldata(a, labels=None, ax=None, pcolormesh_kwargs=None)

Plot a color grid from continuous calldata (e.g., DP).

Parameters:

a : array_like, shape (n_variants, n_samples)

2-dimensional array of integers or floats containing the call data to plot.

labels : sequence of strings, optional

Axis labels (e.g., sample IDs).

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

pcolormesh_kwargs : dict-like, optional

Additional keyword arguments passed through to plt.pcolormesh.

Returns:

ax : axes

The axes on which the plot was drawn.

anhima.gt.plot_diploid_genotypes(gn, labels=None, colors='wbgr', states=(-1, 0, 1, 2), ax=None, colormesh_kwargs=None)

Plot diploid genotypes as a color grid.

Parameters:

gn : array_like, int, shape (n_variants, n_samples)

An array where each genotype is coded as a single integer (e.g., as returned by as_012).

labels : sequence of strings, optional

Axis labels (e.g., sample IDs).

colors : sequence, optional

Colors to use for different values of the array.

states : sequence, optional

Manually specify discrete calldata states (if not given will be determined from the data).

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

colormesh_kwargs : dict-like

Additional keyword arguments passed through to plt.pcolormesh.

Returns:

ax : axes

The axes on which the plot was drawn.

anhima.gt.plot_genotype_counts_by_sample(gn, states=(-1, 0, 1, 2), colors='wbgr', labels=None, ax=None, width=1, orientation='vertical', bar_kwargs=None)

Plot a bar graph of genotype counts by sample.

Parameters:

gn : array_like, int, shape (n_variants, n_samples)

An array where each genotype is coded as a single integer (e.g., as returned by as_012).

states : sequence, optional

The genotype states to count.

colors : sequence, optional

Colors to use for corresponding states.

labels : sequence of strings, optional

Axis labels (e.g., sample IDs).

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

width : float, optional

Width of the bars (will be used as height if orientation == 'horizontal').

orientation : {'horizontal', 'vertical'}

Which type of bar to plot.

bar_kwargs : dict-like

Additional keyword arguments passed through to plt.bar.

Returns:

ax : axes

The axes on which the plot was drawn.

anhima.gt.plot_genotype_counts_by_variant(gn, states=(-1, 0, 1, 2), colors='wbgr', ax=None, width=1, orientation='vertical', bar_kwargs=None)

Plot a bar graph of genotype counts by variant.

Parameters:

gn : array_like, int, shape (n_variants, n_samples)

An array where each genotype is coded as a single integer (e.g., as returned by as_012).

states : sequence, optional

The genotype states to count.

colors : sequence, optional

Colors to use for corresponding states.

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

width : float, optional

Width of the bars (will be used as height if orientation == 'horizontal').

orientation : {'horizontal', 'vertical'}

Which type of bar to plot.

bar_kwargs : dict-like

Additional keyword arguments passed through to plt.bar.

Returns:

ax : axes

The axes on which the plot was drawn.

anhima.gt.plot_continuous_calldata_by_sample(a, labels=None, ax=None, orientation='vertical', boxplot_kwargs=None)

Plot a boxplot of continuous call data (e.g., DP) by sample.

Parameters:

a : array_like, shape (n_variants, n_samples)

2-dimensional array of integers or floats containing the call data to plot.

labels : sequence of strings, optional

Axis labels (e.g., sample IDs).

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

orientation : {'horizontal', 'vertical'}

Orientation of the boxes in the plot.

boxplot_kwargs : dict-like

Additional keyword arguments passed through to plt.boxplot.

Returns:

ax : axes

The axes on which the plot was drawn.