Linkage disequilibrium
Utilities for calculating and plotting linkage disequilibrium.
- anhima.ld.pairwise_genotype_ld(gn)
Given a set of genotypes at biallelic variants, calculate the square of the correlation coefficient between all distinct pairs of variants.
Parameters: gn : array_like
A 2-dimensional array of shape (n_variants, n_samples) where each element is a genotype call coded as a single integer counting the number of non-reference alleles.
Returns: r_squared : ndarray, float
A 2-dimensional array of squared correlation coefficients between each pair of variants.
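As an illustration of the quantity being returned, the following is a minimal plain-Python sketch of the squared Pearson correlation between genotype vectors. It is not the library's implementation: the helper names r_squared_pair and pairwise_r_squared are hypothetical, missing genotype calls are not handled, and a monomorphic variant is assumed to yield NaN.

```python
def r_squared_pair(a, b):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        # Assumption: a variant with no variation has an undefined correlation.
        return float("nan")
    return cov ** 2 / (var_a * var_b)


def pairwise_r_squared(gn):
    """Symmetric matrix of r**2 values for every pair of rows (variants)."""
    n_variants = len(gn)
    out = [[0.0] * n_variants for _ in range(n_variants)]
    for i in range(n_variants):
        for j in range(i, n_variants):
            out[i][j] = out[j][i] = r_squared_pair(gn[i], gn[j])
    return out


# Rows are variants, columns are samples, values count non-reference alleles.
gn = [
    [0, 1, 2, 0],
    [0, 1, 2, 0],  # identical to variant 0
    [2, 1, 0, 2],  # perfectly anti-correlated with variant 0
    [0, 0, 1, 1],
]
r2 = pairwise_r_squared(gn)
```

Because the coefficient is squared, perfectly correlated and perfectly anti-correlated variants both score 1.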
- anhima.ld.plot_pairwise_ld(r_squared, cmap=u'Greys', flip=True, ax=None)
Make a classic triangular linkage disequilibrium plot, given an array of pairwise squared correlation coefficients between variants.
Parameters: r_squared : array_like
A square 2-dimensional array of squared correlation coefficients between pairs of variants.
cmap : color map, optional
The color map to use when plotting. Defaults to ‘Greys’ (0=white, 1=black).
flip : bool, optional
If True, draw the triangle upside down.
ax : axes, optional
The axes on which to draw. If not provided, a new figure will be created.
Returns: ax : axes
The axes on which the plot was drawn.
- anhima.ld.plot_windowed_ld(gn, pos, window_size, start_position=None, stop_position=None, percentiles=(5, 95), ax=None, median_plot_kwargs=None, percentiles_plot_kwargs=None)
Plot average LD within non-overlapping genome windows.
Parameters: gn : array_like
A 2-dimensional array of shape (n_variants, n_samples) where each element is a genotype call coded as a single integer counting the number of non-reference alleles.
pos : array_like
A 1-dimensional array of genomic positions of variants.
window_size : int
The size in base-pairs of the windows.
start_position : int, optional
The start position for the region over which to work.
stop_position : int, optional
The stop position for the region over which to work.
percentiles : sequence of integers, optional
Percentiles to plot in addition to the median.
ax : axes, optional
The axes on which to draw. If not provided, a new figure will be created.
median_plot_kwargs : dict, optional
Keyword arguments to pass through when plotting the median line.
percentiles_plot_kwargs : dict, optional
Keyword arguments to pass through when plotting the percentiles.
Returns: ax : axes
The axes on which the plot was drawn.
- anhima.ld.pairwise_ld_decay(r_squared, pos, step=1)
Compile data on linkage disequilibrium, separation (in number of variants), and physical distance between pairs of variants.
Parameters: r_squared : array_like
A square 2-dimensional array of squared correlation coefficients between pairs of variants.
pos : array_like
A 1-dimensional array of genomic positions of variants.
step : int, optional
The number of variants to advance by at each iteration when compiling the data.
Returns: cor : ndarray, float
Each element in the array is the squared genotype correlation coefficient between a distinct pair of variants.
sep : ndarray, int
Each element in the array is the separation (in number of variants) between a distinct pair of variants.
dist : ndarray, int
Each element in the array is the physical distance between a distinct pair of variants.
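The relationship between the three returned arrays can be sketched in plain Python as below. This is an illustration only, not the library's code: the function name ld_decay_pairs is hypothetical, and the interpretation of step (the stride applied to the first variant of each pair) is an assumption.

```python
def ld_decay_pairs(r_squared, pos, step=1):
    """Collect r**2, separation and distance for pairs of variants.

    Assumption: `step` is the stride over the first variant of each pair.
    """
    cor, sep, dist = [], [], []
    n = len(pos)
    for i in range(0, n, step):
        for j in range(i + 1, n):
            cor.append(r_squared[i][j])          # squared correlation
            sep.append(j - i)                    # separation in variants
            dist.append(abs(pos[j] - pos[i]))    # physical distance
    return cor, sep, dist


r_squared = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.5],
    [0.1, 0.5, 1.0],
]
pos = [100, 150, 400]
cor, sep, dist = ld_decay_pairs(r_squared, pos)
```

The three outputs are parallel: element k of each describes the same pair of variants.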
- anhima.ld.windowed_ld_decay(gn, pos, window_size, step=1)
Compile data on linkage disequilibrium, separation (in number of variants), and physical distance between pairs of variants.
Parameters: gn : array_like
A 2-dimensional array of shape (n_variants, n_samples) where each element is a genotype call coded as a single integer counting the number of non-reference alleles.
pos : array_like
A 1-dimensional array of genomic positions of variants.
window_size : int
The number of variants to work with at a time.
step : int, optional
The number of variants to advance by at each iteration when compiling the data within each window.
Returns: cor : ndarray, float
Each element in the array is the squared genotype correlation coefficient between a distinct pair of variants.
sep : ndarray, int
Each element in the array is the separation (in number of variants) between a distinct pair of variants.
dist : ndarray, int
Each element in the array is the physical distance between a distinct pair of variants.
Notes
Similar to pairwise_ld_decay(), except that only a subset of variant pairs is sampled, which speeds up computation and uses less memory. Variants are divided into non-overlapping windows of window_size variants each, and genotype LD is calculated for all pairs within each window. Pairs that span two windows are not evaluated.
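To make the saving concrete, here is a small sketch that enumerates which index pairs would be compared under non-overlapping windows. The function name window_pairs is hypothetical, and both the handling of a short final window and the meaning of step are assumptions rather than documented behaviour.

```python
def window_pairs(n_variants, window_size, step=1):
    """Index pairs (i, j) that fall inside the same non-overlapping window."""
    pairs = []
    for start in range(0, n_variants, window_size):
        # Assumption: a final window may be shorter than window_size.
        stop = min(start + window_size, n_variants)
        for i in range(start, stop, step):
            for j in range(i + 1, stop):
                pairs.append((i, j))
    return pairs


windowed = window_pairs(6, window_size=3)
all_pairs = window_pairs(6, window_size=6)  # one window covering everything
```

With six variants, windows of three give 6 pairs instead of the 15 needed for a full pairwise comparison; the gap widens quickly as the number of variants grows.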
- anhima.ld.plot_ld_decay_by_separation(cor, sep, max_separation=100, percentiles=(5, 95), ax=None, median_plot_kwargs=None, percentiles_plot_kwargs=None)
Plot the decay of linkage disequilibrium with separation between variants.
Parameters: cor : array_like
A 1-dimensional array of squared correlation coefficients between pairs of variants.
sep : array_like
A 1-dimensional array of separations (in number of variants) between pairs of variants.
max_separation : int, optional
Maximum separation to consider.
percentiles : sequence of integers, optional
Percentiles to plot in addition to the median.
ax : axes, optional
The axes on which to draw. If not provided, a new figure will be created.
median_plot_kwargs : dict, optional
Keyword arguments to pass through when plotting the median line.
percentiles_plot_kwargs : dict, optional
Keyword arguments to pass through when plotting the percentiles.
Returns: ax : axes
The axes on which the plot was drawn.
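The summary statistics behind such a plot can be sketched without any plotting code. The example below groups r² values by separation and computes the median and the requested percentiles for each group. The names percentile and summarise_by_separation are hypothetical, the linear-interpolation percentile rule is one of several common definitions, and treating max_separation as inclusive is an assumption.

```python
from collections import defaultdict
from statistics import median


def percentile(values, q):
    """q-th percentile (0 to 100), interpolating linearly between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def summarise_by_separation(cor, sep, max_separation=100, percentiles=(5, 95)):
    """Map each separation to (median r**2, tuple of percentile values)."""
    groups = defaultdict(list)
    for c, s in zip(cor, sep):
        if s <= max_separation:  # assumption: the limit is inclusive
            groups[s].append(c)
    return {
        s: (median(vals), tuple(percentile(vals, q) for q in percentiles))
        for s, vals in sorted(groups.items())
    }


cor = [0.9, 0.7, 0.5, 0.2, 0.1]
sep = [1, 1, 1, 2, 3]
summary = summarise_by_separation(cor, sep, max_separation=2)
```

Each key of the result corresponds to one x-axis position, with the median and percentile values supplying the lines drawn at that position.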
- anhima.ld.plot_ld_decay_by_distance(cor, dist, bins, percentiles=(5, 95), ax=None, median_plot_kwargs=None, percentiles_plot_kwargs=None)
Plot the decay of linkage disequilibrium with physical distance between variants.
Parameters: cor : array_like
A 1-dimensional array of squared correlation coefficients between pairs of variants.
dist : array_like
A 1-dimensional array of physical distances between pairs of variants.
bins : int or sequence of ints
The number of distance bins, or a sequence of bin edges, within which LD is summarised.
percentiles : sequence of integers, optional
Percentiles to plot in addition to the median.
ax : axes, optional
The axes on which to draw. If not provided, a new figure will be created.
median_plot_kwargs : dict, optional
Keyword arguments to pass through when plotting the median line.
percentiles_plot_kwargs : dict, optional
Keyword arguments to pass through when plotting the percentiles.
Returns: ax : axes
The axes on which the plot was drawn.
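When bins is given as a sequence of edges, the grouping step can be pictured as follows. This is a sketch under stated assumptions, not the library's method: the name median_ld_by_distance is hypothetical, bins are treated as half-open intervals, and pairs falling outside the edges are dropped.

```python
from bisect import bisect_right
from statistics import median


def median_ld_by_distance(cor, dist, edges):
    """Median r**2 in each half-open distance bin [edges[k], edges[k + 1])."""
    binned = [[] for _ in range(len(edges) - 1)]
    for c, d in zip(cor, dist):
        k = bisect_right(edges, d) - 1  # index of the bin containing d
        if 0 <= k < len(binned):        # ignore distances outside the edges
            binned[k].append(c)
    # Empty bins have no median, so they are reported as None.
    return [median(vals) if vals else None for vals in binned]


cor = [0.9, 0.8, 0.4, 0.3, 0.1]
dist = [10, 90, 150, 199, 450]
edges = [0, 100, 200, 300, 500]
medians = median_ld_by_distance(cor, dist, edges)
```

Here the third bin (200 to 300) receives no pairs, which is why an explicit placeholder for empty bins is needed before plotting.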
- anhima.ld.ld_prune_pairwise(gn, window_size=100, window_step=10, max_r_squared=0.2)
Given a set of genotypes at biallelic variants, find a subset of the variants which are in approximate linkage equilibrium with each other.
Parameters: gn : array_like
A 2-dimensional array of shape (n_variants, n_samples) where each element is a genotype call coded as a single integer counting the number of non-reference alleles.
window_size : int, optional
The number of variants to work with at a time.
window_step : int, optional
The number of variants to shift the window by.
max_r_squared : float, optional
The maximum value of the squared genotype correlation coefficient, above which variants will be excluded.
Returns: included : ndarray, bool
A boolean array of the same length as the number of variants, where a True value indicates the variant at the corresponding index is included, and a False value indicates the corresponding variant is excluded.
Notes
The algorithm is as follows. A window of window_size variants is taken from the beginning of the genotypes array. The squared genotype correlation coefficient is calculated between each pair of variants in the window. The first variant in the window is considered, and any other variants in the window whose squared correlation with the first variant is above max_r_squared are excluded. The next non-excluded variant in the window is then considered, and so on. The window then shifts along by window_step variants, and the process is repeated.
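The steps above can be sketched in plain Python as follows. To keep the example short, it takes a precomputed matrix of r² values rather than genotypes, which is a simplification of the documented signature; the name ld_prune is hypothetical and the treatment of a short final window is an assumption.

```python
def ld_prune(r_squared, window_size=100, window_step=10, max_r_squared=0.2):
    """Return a list of booleans marking which variants survive pruning."""
    n = len(r_squared)
    included = [True] * n
    for start in range(0, n, window_step):
        # Assumption: the last windows may hold fewer than window_size variants.
        stop = min(start + window_size, n)
        for i in range(start, stop):
            if not included[i]:
                continue  # an excluded variant cannot exclude others
            for j in range(i + 1, stop):
                if included[j] and r_squared[i][j] > max_r_squared:
                    included[j] = False
    return included


r_squared = [
    [1.00, 0.90, 0.10, 0.00],
    [0.90, 1.00, 0.80, 0.00],
    [0.10, 0.80, 1.00, 0.05],
    [0.00, 0.00, 0.05, 1.00],
]
included = ld_prune(r_squared, window_size=4, window_step=2)
```

In this example variant 1 is excluded because of its link to variant 0. Variant 2 is strongly linked to variant 1 but survives, since variant 1 has already been removed and variant 2 is only weakly linked to variant 0.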