Linkage disequilibrium

Utilities for calculating and plotting linkage disequilibrium.

See also the examples at:

anhima.ld.pairwise_genotype_ld(gn)

Given a set of genotypes at biallelic variants, calculate the square of the correlation coefficient between all distinct pairs of variants.

Parameters:

gn : array_like

A 2-dimensional array of shape (n_variants, n_samples) where each element is a genotype call coded as a single integer counting the number of non-reference alleles.

Returns:

r_squared : ndarray, float

A 2-dimensional array of squared correlation coefficients between each pair of variants.
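To make the quantity concrete, here is a minimal NumPy sketch of a squared genotype correlation matrix. It is not anhima's implementation: the helper name pairwise_r_squared is hypothetical, it assumes no missing calls, and it returns a full square matrix, which may differ from the layout anhima returns.

```python
import numpy as np

def pairwise_r_squared(gn):
    """Squared Pearson correlation between every pair of rows (variants)."""
    gn = np.asarray(gn, dtype=float)
    # np.corrcoef treats each row as one variable, i.e. one variant here.
    r = np.corrcoef(gn)
    return r ** 2

# Toy data: 4 variants x 6 samples, coded as non-reference allele counts.
gn = np.array([
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # identical to variant 0
    [2, 1, 0, 2, 1, 0],  # mirror image of variant 0
    [0, 2, 0, 2, 0, 2],  # uncorrelated with variant 0
])
r_squared = pairwise_r_squared(gn)
```

Squaring removes the sign, so a variant and its mirror image are in complete LD. A variant with no variation across samples has zero variance, and its correlations come out as NaN in this sketch.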

anhima.ld.plot_pairwise_ld(r_squared, cmap=u'Greys', flip=True, ax=None)

Make a classic triangular linkage disequilibrium plot, given an array of pairwise squared correlation coefficients between variants.

Parameters:

r_squared : array_like

A square 2-dimensional array of squared correlation coefficients between pairs of variants.

cmap : color map, optional

The color map to use when plotting. Defaults to ‘Greys’ (0=white, 1=black).

flip : bool, optional

If True, draw the triangle upside down.

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

Returns:

ax : axes

The axes on which the plot was drawn.

anhima.ld.plot_windowed_ld(gn, pos, window_size, start_position=None, stop_position=None, percentiles=(5, 95), ax=None, median_plot_kwargs=None, percentiles_plot_kwargs=None)

Plot average LD within non-overlapping genome windows.

Parameters:

gn : array_like

A 2-dimensional array of shape (n_variants, n_samples) where each element is a genotype call coded as a single integer counting the number of non-reference alleles.

pos : array_like

A 1-dimensional array of genomic positions of variants.

window_size : int

The size in base-pairs of the windows.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

percentiles : sequence of integers, optional

Percentiles to plot in addition to the median.

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

median_plot_kwargs : dict, optional

Keyword arguments to pass through when plotting the median line.

percentiles_plot_kwargs : dict, optional

Keyword arguments to pass through when plotting the percentiles.

Returns:

ax : axes

The axes on which the plot was drawn.
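The numbers behind such a plot can be sketched without any plotting code. The following is an illustration under stated assumptions, not anhima's implementation: the helper name windowed_ld_summary is hypothetical, windows are taken as half-open intervals [left, left + window_size), the region defaults to the range of pos, and exactly two percentiles are expected.

```python
import numpy as np

def windowed_ld_summary(gn, pos, window_size, start_position=None,
                        stop_position=None, percentiles=(5, 95)):
    """Median and two percentiles of pairwise r^2 per genome window."""
    gn = np.asarray(gn, dtype=float)
    pos = np.asarray(pos)
    start = pos.min() if start_position is None else start_position
    stop = pos.max() if stop_position is None else stop_position
    rows = []
    for left in range(int(start), int(stop) + 1, window_size):
        in_window = (pos >= left) & (pos < left + window_size)
        n = int(in_window.sum())
        if n < 2:
            continue  # no pairs to summarize in this window
        r2 = np.corrcoef(gn[in_window]) ** 2
        values = r2[np.triu_indices(n, k=1)]  # distinct pairs only
        lo, hi = np.percentile(values, percentiles)
        rows.append((left, float(np.median(values)), float(lo), float(hi)))
    return rows

gn = np.array([
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 1, 0, 2, 1, 0],
    [0, 2, 0, 2, 0, 2],
])
pos = np.array([5, 8, 105, 120])
rows = windowed_ld_summary(gn, pos, window_size=100,
                           start_position=0, stop_position=199)
```

Each row holds the window start, the median r-squared, and the lower and upper percentile, which is the kind of data a median line with a percentile band would be drawn from.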

anhima.ld.pairwise_ld_decay(r_squared, pos, step=1)

Compile data on linkage disequilibrium, separation (in number of variants), and physical distance between pairs of variants.

Parameters:

r_squared : array_like

A square 2-dimensional array of squared correlation coefficients between pairs of variants.

pos : array_like

A 1-dimensional array of genomic positions of variants.

step : int, optional

The number of variants to advance by at each step when compiling the data.

Returns:

cor : ndarray, float

Each element in the array is the squared genotype correlation coefficient between a distinct pair of variants.

sep : ndarray, int

Each element in the array is the separation (in number of variants) between a distinct pair of variants.

dist : ndarray, int

Each element in the array is the physical distance between a distinct pair of variants.
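The three returned arrays are parallel: element k of each describes the same pair of variants. A minimal sketch of that layout, assuming a full square r_squared matrix and ignoring the step parameter (the helper name ld_decay_table is hypothetical and this is not anhima's code):

```python
import numpy as np

def ld_decay_table(r_squared, pos):
    """Flatten a square r^2 matrix into per-pair cor, sep and dist arrays."""
    r_squared = np.asarray(r_squared, dtype=float)
    pos = np.asarray(pos)
    # Indices of the upper triangle, excluding the diagonal: each distinct
    # pair (i, j) with i < j appears exactly once.
    i, j = np.triu_indices(r_squared.shape[0], k=1)
    cor = r_squared[i, j]
    sep = j - i                      # separation in number of variants
    dist = np.abs(pos[j] - pos[i])   # physical distance in base pairs
    return cor, sep, dist

r_squared = np.array([
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])
pos = np.array([100, 150, 400])
cor, sep, dist = ld_decay_table(r_squared, pos)
```

For n variants this yields n * (n - 1) / 2 pairs, so memory grows quadratically with the number of variants.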

anhima.ld.windowed_ld_decay(gn, pos, window_size, step=1)

Compile data on linkage disequilibrium, separation (in number of variants), and physical distance between pairs of variants, sampling only pairs that fall within the same window.

Parameters:

gn : array_like

A 2-dimensional array of shape (n_variants, n_samples) where each element is a genotype call coded as a single integer counting the number of non-reference alleles.

pos : array_like

A 1-dimensional array of genomic positions of variants.

window_size : int

The number of variants to work with at a time.

step : int, optional

The number of variants to advance by at each step when compiling the data within each window.

Returns:

cor : ndarray, float

Each element in the array is the squared genotype correlation coefficient between a distinct pair of variants.

sep : ndarray, int

Each element in the array is the separation (in number of variants) between a distinct pair of variants.

dist : ndarray, int

Each element in the array is the physical distance between a distinct pair of variants.

Notes

Similar to pairwise_ld_decay() except that, to speed up computation and use less memory, not all pairs of variants are sampled. Variants are divided into non-overlapping windows of window_size variants each, and genotype LD is calculated for all pairs within each window.
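The windowing described in these notes can be sketched as follows. This is an illustration only: the helper name windowed_pairs is hypothetical, the step parameter is ignored, missing calls are assumed absent, and a trailing window with fewer than two variants is skipped.

```python
import numpy as np

def windowed_pairs(gn, pos, window_size):
    """Collect cor, sep and dist for pairs within non-overlapping windows."""
    gn = np.asarray(gn, dtype=float)
    pos = np.asarray(pos)
    cor, sep, dist = [], [], []
    for start in range(0, gn.shape[0], window_size):
        stop = min(start + window_size, gn.shape[0])
        n = stop - start
        if n < 2:
            continue  # a single variant forms no pairs
        r2 = np.corrcoef(gn[start:stop]) ** 2
        i, j = np.triu_indices(n, k=1)
        cor.append(r2[i, j])
        sep.append(j - i)
        dist.append(np.abs(pos[start:stop][j] - pos[start:stop][i]))
    if not cor:
        empty = np.array([], dtype=int)
        return np.array([], dtype=float), empty, empty
    return np.concatenate(cor), np.concatenate(sep), np.concatenate(dist)

gn = np.array([
    [0, 1, 2, 1],
    [0, 1, 2, 0],
    [2, 0, 1, 1],
    [1, 1, 0, 2],
    [0, 2, 1, 1],
])
pos = np.array([10, 25, 70, 90, 200])
cor, sep, dist = windowed_pairs(gn, pos, window_size=2)
```

With five variants and a window of two, only 2 pairs are evaluated instead of the 10 that an all-pairs computation would produce. The trade-off is that the largest separation sampled is window_size - 1.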

anhima.ld.plot_ld_decay_by_separation(cor, sep, max_separation=100, percentiles=(5, 95), ax=None, median_plot_kwargs=None, percentiles_plot_kwargs=None)

Plot the decay of linkage disequilibrium with separation between variants.

Parameters:

cor : array_like

A 1-dimensional array of squared correlation coefficients between pairs of variants.

sep : array_like

A 1-dimensional array of separations (in number of variants) between pairs of variants.

max_separation : int, optional

Maximum separation to consider.

percentiles : sequence of integers, optional

Percentiles to plot in addition to the median.

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

median_plot_kwargs : dict, optional

Keyword arguments to pass through when plotting the median line.

percentiles_plot_kwargs : dict, optional

Keyword arguments to pass through when plotting the percentiles.

Returns:

ax : axes

The axes on which the plot was drawn.
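As a rough illustration of the summary such a plot would show, the sketch below groups cor by sep and computes the median and two percentiles for each separation. The helper name summarize_by_separation is hypothetical, and whether anhima aggregates in exactly this way is an assumption.

```python
import numpy as np

def summarize_by_separation(cor, sep, max_separation=100, percentiles=(5, 95)):
    """Median and two percentiles of r^2 for each separation 1..max_separation."""
    cor = np.asarray(cor, dtype=float)
    sep = np.asarray(sep)
    rows = []
    for s in range(1, max_separation + 1):
        values = cor[sep == s]
        if values.size == 0:
            continue  # no pairs observed at this separation
        lo, hi = np.percentile(values, percentiles)
        rows.append((s, float(np.median(values)), float(lo), float(hi)))
    return rows

cor = np.array([0.9, 0.7, 0.5, 0.3, 0.1])
sep = np.array([1, 1, 1, 2, 2])
rows = summarize_by_separation(cor, sep, max_separation=3)
```

Each row is (separation, median, lower percentile, upper percentile); separations with no observed pairs are omitted rather than filled with NaN.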

anhima.ld.plot_ld_decay_by_distance(cor, dist, bins, percentiles=(5, 95), ax=None, median_plot_kwargs=None, percentiles_plot_kwargs=None)

Plot the decay of linkage disequilibrium with physical distance between variants.

Parameters:

cor : array_like

A 1-dimensional array of squared correlation coefficients between pairs of variants.

dist : array_like

A 1-dimensional array of physical distances between pairs of variants.

bins : int or sequence of ints

The number of distance bins, or a sequence of bin edges, within which to summarize LD.

percentiles : sequence of integers, optional

Percentiles to plot in addition to the median.

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

median_plot_kwargs : dict, optional

Keyword arguments to pass through when plotting the median line.

percentiles_plot_kwargs : dict, optional

Keyword arguments to pass through when plotting the percentiles.

Returns:

ax : axes

The axes on which the plot was drawn.
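The binning step can be sketched with NumPy alone. This is a hedged illustration, not anhima's code: the helper name summarize_by_distance is hypothetical, bins is interpreted the way np.histogram_bin_edges interprets it, pairs outside the outermost edges are dropped, and the last bin is treated as closed on the right.

```python
import numpy as np

def summarize_by_distance(cor, dist, bins, percentiles=(5, 95)):
    """Median and two percentiles of r^2 within bins of physical distance."""
    cor = np.asarray(cor, dtype=float)
    dist = np.asarray(dist, dtype=float)
    # Accepts either a bin count or explicit bin edges.
    edges = np.histogram_bin_edges(dist, bins=bins)
    inside = (dist >= edges[0]) & (dist <= edges[-1])
    cor, dist = cor[inside], dist[inside]
    # np.digitize is 1-based; fold the right-most edge into the last bin.
    idx = np.minimum(np.digitize(dist, edges) - 1, len(edges) - 2)
    rows = []
    for b in range(len(edges) - 1):
        values = cor[idx == b]
        if values.size == 0:
            continue
        lo, hi = np.percentile(values, percentiles)
        rows.append((float(edges[b]), float(edges[b + 1]),
                     float(np.median(values)), float(lo), float(hi)))
    return rows

cor = np.array([0.9, 0.7, 0.4, 0.2, 0.3])
dist = np.array([10, 20, 110, 120, 130])
rows = summarize_by_distance(cor, dist, bins=[0, 100, 200])
```

Each row is (bin start, bin end, median, lower percentile, upper percentile). Explicit edges keep bins comparable across datasets, whereas an integer count adapts the bins to the observed distance range.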

anhima.ld.ld_prune_pairwise(gn, window_size=100, window_step=10, max_r_squared=0.2)

Given a set of genotypes at biallelic variants, find a subset of the variants which are in approximate linkage equilibrium with each other.

Parameters:

gn : array_like

A 2-dimensional array of shape (n_variants, n_samples) where each element is a genotype call coded as a single integer counting the number of non-reference alleles.

window_size : int, optional

The number of variants to work with at a time.

window_step : int, optional

The number of variants to shift the window by.

max_r_squared : float, optional

The maximum value of the squared genotype correlation coefficient, above which variants will be excluded.

Returns:

included : ndarray, bool

A boolean array of the same length as the number of variants, where a True value indicates the variant at the corresponding index is included, and a False value indicates the corresponding variant is excluded.

Notes

The algorithm is as follows. A window of window_size variants is taken from the beginning of the genotypes array. The squared genotype correlation coefficient is calculated between each pair of variants in the window. The first variant in the window is considered, and any other variants in the window whose squared correlation with the first variant is above max_r_squared are excluded. The next non-excluded variant in the window is then considered, and so on. The window then shifts along by window_step variants, and the process is repeated.
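The procedure described above can be sketched in a few lines of NumPy. This is a deliberately simple, unoptimized illustration rather than anhima's implementation: the function name prune_sketch is hypothetical, correlations are computed pair by pair instead of per window, and missing calls are assumed absent.

```python
import numpy as np

def prune_sketch(gn, window_size=100, window_step=10, max_r_squared=0.2):
    """Greedy sliding-window LD pruning; returns a boolean inclusion mask."""
    gn = np.asarray(gn, dtype=float)
    n_variants = gn.shape[0]
    included = np.ones(n_variants, dtype=bool)
    for start in range(0, n_variants, window_step):
        stop = min(start + window_size, n_variants)
        for i in range(start, stop):
            if not included[i]:
                continue  # excluded variants do not exclude others
            for j in range(i + 1, stop):
                if not included[j]:
                    continue
                r2 = np.corrcoef(gn[i], gn[j])[0, 1] ** 2
                if r2 > max_r_squared:
                    included[j] = False
    return included

gn = np.array([
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],  # identical to variant 0
    [2, 1, 0, 2, 1, 0],  # mirror image of variant 0
    [0, 2, 0, 2, 0, 2],  # uncorrelated with variant 0
])
included = prune_sketch(gn, window_size=3, window_step=1)
```

Here variants 1 and 2 are dropped because they are in complete LD with variant 0, while variant 3 is retained. Because the procedure is greedy, the result depends on variant order: earlier variants are kept in preference to later ones. In this sketch a variant with no variation yields a NaN correlation, which never exceeds the threshold, so such variants are kept.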