Locating samples and variants

Utilities for locating samples and variants.

See also the examples at:

anhima.loc.view_sample(a, selection, all_samples=None)[source]

View a single column from the array a corresponding to a selected sample.

Parameters:

a : array_like

An array with 2 or more dimensions, where the second dimension corresponds to samples.

selection : int or object

A sample identifier or column index.

all_samples : sequence, optional

A sequence (e.g., list) of sample identifiers corresponding to the second dimension of a, used to map selection to a column index. If not given, assume selection is a column index.

Returns:

b : ndarray

An array obtained from a by taking the column corresponding to the selected sample.

anhima.loc.take_samples(a, selection, all_samples=None)[source]

Extract columns from the array a corresponding to selected samples.

Parameters:

a : array_like

An array with 2 or more dimensions, where the second dimension corresponds to samples.

selection : sequence of ints or objects

A sequence of sample identifiers or column indices.

all_samples : sequence, optional

A sequence (e.g., list) of sample identifiers corresponding to the second dimension of a, used to map selection to column indices. If not given, assume selection is a sequence of column indices.

Returns:

b : ndarray

An array obtained from a by taking columns corresponding to the selected samples.
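
The identifier-to-column mapping described for view_sample and take_samples can be sketched in plain NumPy. This sketch does not call anhima; the array contents, the sample names, and the helper column_indices are hypothetical and only illustrate the documented behaviour.

```python
import numpy as np

# Hypothetical genotype-like array: 3 variants (rows) x 4 samples (columns).
a = np.arange(12).reshape(3, 4)
all_samples = ["S1", "S2", "S3", "S4"]  # hypothetical sample identifiers


def column_indices(selection, all_samples=None):
    """Map sample identifiers to column indices (illustrative helper)."""
    if all_samples is None:
        # No identifiers given: treat the selection as column indices already.
        return list(selection)
    return [all_samples.index(s) for s in selection]


# Single sample: basic indexing along the second dimension gives a view.
single = a[:, all_samples.index("S3")]

# Several samples: np.take along axis 1 gives a copy with the chosen columns,
# in the order requested.
several = np.take(a, column_indices(["S4", "S1"], all_samples), axis=1)
```

Here single holds the third column and several holds the fourth and first columns, in that order.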

anhima.loc.query_variants(expression, variants)[source]

Evaluate expression with respect to the given variants.

Parameters:

expression : string

The query expression to apply. The expression will be evaluated by numexpr against the provided variants.

variants : dict-like

The variables to include in scope for the expression evaluation.

Returns:

result : ndarray

The result of evaluating expression against variants.

anhima.loc.compress_variants(a, condition)[source]

Extract rows from the array a corresponding to a boolean condition.

Parameters:

a : array_like

An array to extract rows from (e.g., genotypes).

condition : array_like, bool

A 1-D boolean array of the same length as the first dimension of a.

Returns:

b : ndarray

An array obtained from a by taking rows corresponding to the selected variants.
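
As a rough illustration of how a boolean condition selects rows, the sketch below builds a condition with ordinary NumPy comparisons and applies it with np.compress. It does not call anhima or numexpr; the field names DP and MQ and all values are hypothetical.

```python
import numpy as np

# Hypothetical per-variant fields, as they might appear in a dict-like `variants`.
variants = {
    "DP": np.array([10, 35, 22, 8]),
    "MQ": np.array([60, 40, 55, 58]),
}
# Hypothetical genotype-like array: 4 variants x 2 samples.
a = np.array([[0, 1], [1, 1], [0, 0], [2, 1]])

# Plain NumPy equivalent of an expression such as "(DP > 9) & (MQ > 50)".
condition = (variants["DP"] > 9) & (variants["MQ"] > 50)

# Keep only the rows (variants) where the condition is True.
b = np.compress(condition, a, axis=0)
```

The condition must have one element per row of a, which is why it is built from per-variant fields of the same length.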

anhima.loc.take_variants(a, indices, mode='raise')[source]

Extract rows from the array a corresponding to indices.

Parameters:

a : array_like

An array to extract rows from (e.g., genotypes).

indices : sequence of integers

The variant indices to extract.

mode : {‘raise’, ‘wrap’, ‘clip’}, optional

Specifies how out-of-bounds indices will behave.

Returns:

b : ndarray

An array obtained from a by taking rows corresponding to the selected variants.
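
The mode values are the same ones accepted by numpy.take. Assuming take_variants behaves like numpy.take along the first axis (an assumption, not confirmed by this page), the three modes differ as sketched below with hypothetical data.

```python
import numpy as np

a = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])  # hypothetical: 4 variants x 2 samples
indices = [0, 2, 5]  # 5 is out of bounds for 4 rows

# 'wrap': out-of-bounds indices wrap around, so 5 becomes 5 % 4 == 1.
wrapped = np.take(a, indices, axis=0, mode="wrap")

# 'clip': out-of-bounds indices are clipped to the last valid row, 3.
clipped = np.take(a, indices, axis=0, mode="clip")

# 'raise' (the default): out-of-bounds indices are rejected.
try:
    np.take(a, indices, axis=0, mode="raise")
    raised = False
except IndexError:
    raised = True
```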

anhima.loc.locate_position(pos, p)[source]

Locate the index of coordinate p within a sorted array of genomic positions pos.

Parameters:

pos : array_like

A sorted 1-dimensional array of genomic positions from a single chromosome/contig, with no duplicates.

p : int

The position to locate.

Returns:

index : int or None

The index of p in pos if present, else None.
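
Because pos is sorted and free of duplicates, the lookup can be done with a binary search. The following is a minimal standard-library sketch of that idea, not anhima's implementation; the function name locate_position_sketch and the positions are hypothetical.

```python
from bisect import bisect_left


def locate_position_sketch(pos, p):
    """Return the index of p in sorted, duplicate-free pos, else None."""
    i = bisect_left(pos, p)  # leftmost index at which p could be inserted
    if i < len(pos) and pos[i] == p:
        return i
    return None


pos = [3, 7, 12, 20]  # hypothetical genomic positions
found = locate_position_sketch(pos, 12)    # present in pos
missing = locate_position_sketch(pos, 13)  # absent from pos
```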

anhima.loc.view_position(a, pos, p)[source]

View a slice along the first dimension of a corresponding to a genome position.

Parameters:

a : array_like

The array to extract from.

pos : array_like

A sorted 1-dimensional array of genomic positions from a single chromosome/contig, with no duplicates.

p : int

The position to locate.

Returns:

b : ndarray

A view of a obtained by slicing along the first dimension.

See also

locate_position

anhima.loc.locate_interval(pos, start_position=0, stop_position=None)[source]

Locate the start and stop indices within the pos array that include all positions within the start_position and stop_position range.

Parameters:

pos : array_like

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

start_position : int

Start position of interval.

stop_position : int

Stop position of interval.

Returns:

loc : slice

A slice object with the start and stop indices that include all positions within the interval.
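
One way to obtain such a slice is a pair of binary searches. The sketch below uses only the standard library and assumes the interval is inclusive at both ends; that assumption, the function name locate_interval_sketch, and the positions are illustrative rather than taken from anhima.

```python
from bisect import bisect_left, bisect_right


def locate_interval_sketch(pos, start_position=0, stop_position=None):
    """Return a slice covering all positions in [start_position, stop_position]."""
    start_index = bisect_left(pos, start_position)
    # With no stop position the slice is left open-ended.
    stop_index = None if stop_position is None else bisect_right(pos, stop_position)
    return slice(start_index, stop_index)


pos = [3, 7, 12, 20, 31]  # hypothetical sorted positions
loc = locate_interval_sketch(pos, 7, 20)
inside = pos[loc]                                # positions within the interval
open_ended = pos[locate_interval_sketch(pos, 10)]  # everything from 10 onwards
```

The same slice can be applied to any other array whose first dimension is aligned with pos.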

anhima.loc.view_interval(a, pos, start_position, stop_position)[source]

View a contiguous slice along the first dimension of a corresponding to a genome interval defined by start_position and stop_position.

Parameters:

a : array_like

The array to extract from.

pos : array_like

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

start_position : int

Start position of interval.

stop_position : int

Stop position of interval.

Returns:

b : ndarray

A view of a obtained by slicing along the first dimension.

See also

locate_interval

anhima.loc.locate_positions(pos1, pos2)[source]

Find the intersection of two sets of positions.

Parameters:

pos1, pos2 : array_like

Sorted 1-dimensional arrays of genomic positions, each from a single chromosome/contig and with no duplicates.

Returns:

cond1 : ndarray, bool

An array of the same length as pos1 where an element is True if the corresponding item in pos1 is also found in pos2.

cond2 : ndarray, bool

An array of the same length as pos2 where an element is True if the corresponding item in pos2 is also found in pos1.
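
The two boolean arrays can be pictured as membership tests in each direction. This NumPy sketch uses np.isin for that purpose; it is not anhima's implementation, and the positions are hypothetical.

```python
import numpy as np

pos1 = np.array([3, 7, 12, 20])      # hypothetical positions, callset 1
pos2 = np.array([7, 9, 20, 31, 40])  # hypothetical positions, callset 2

# True where an item of pos1 also occurs in pos2, and vice versa.
cond1 = np.isin(pos1, pos2)
cond2 = np.isin(pos2, pos1)

# Applying each condition to its own array yields the shared positions.
shared1 = pos1[cond1]
shared2 = pos2[cond2]
```

Both selections yield the same positions, so cond1 and cond2 can be used to bring two callsets into alignment.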

anhima.loc.locate_intervals(pos, start_positions, stop_positions)[source]

Locate items within the pos array that fall within any of the intervals given by start_positions and stop_positions.

Parameters:

pos : array_like

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

start_positions : array_like, int

Start positions of intervals.

stop_positions : array_like, int

Stop positions of intervals.

Returns:

cond1 : ndarray, bool

An array of the same length as pos where an element is True if the corresponding item in pos falls within any of the intervals.

cond2 : ndarray, bool

An array of the same length as the number of intervals, where an element is True if the corresponding interval contains one or more positions in pos.
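
The relationship between the two returned arrays can be sketched with the standard library as follows. The sketch assumes intervals are inclusive at both ends, which is an assumption rather than something stated here, and all values are hypothetical.

```python
from bisect import bisect_left, bisect_right

pos = [3, 7, 12, 20, 31]       # hypothetical sorted positions
start_positions = [5, 15, 40]  # hypothetical interval starts
stop_positions = [12, 18, 50]  # hypothetical interval stops

cond1 = [False] * len(pos)  # one flag per position
cond2 = []                  # one flag per interval
for start, stop in zip(start_positions, stop_positions):
    lo = bisect_left(pos, start)   # first index with pos >= start
    hi = bisect_right(pos, stop)   # one past the last index with pos <= stop
    for i in range(lo, hi):
        cond1[i] = True            # this position lies in at least one interval
    cond2.append(hi > lo)          # this interval contains at least one position
```

Only the first interval contains any positions here (7 and 12), so the other two intervals are flagged False in cond2.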

anhima.loc.plot_variant_locator(pos, step=1, ax=None, start_position=None, stop_position=None, flip=False, line_args=None)[source]

Plot lines indicating the physical genome location of variants. By default the top x axis is in variant index space, and the bottom x axis is in genome position space.

Parameters:

pos : array_like

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

step : int, optional

Plot a line for every step variants.

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

flip : bool, optional

Flip the plot upside down.

line_args : dict-like, optional

Additional keyword arguments passed through to plt.Line2D.

Returns:

ax : axes

The axes on which the plot was drawn.

anhima.loc.windowed_variant_counts(pos, window_size, start_position=None, stop_position=None)[source]

Count variants in non-overlapping windows over the genome.

Parameters:

pos : array_like

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

window_size : int

The size in base-pairs of the windows.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

Returns:

counts : ndarray, int

The number of variants in each window.

bin_edges : ndarray, int

The edge positions of each window. Note that this has length len(counts)+1. To determine bin centers use (bin_edges[:-1] + bin_edges[1:]) / 2. To determine bin widths use np.diff(bin_edges).

See also

plot_windowed_variant_counts, windowed_variant_density
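
The relationship between counts, bin_edges, bin centers, and bin widths can be illustrated with np.histogram. This is a sketch under assumptions: the way the edges are constructed below and all values are hypothetical, and anhima may choose its window boundaries differently.

```python
import numpy as np

pos = np.array([3, 7, 12, 20, 31, 38])  # hypothetical sorted positions
window_size = 10
start_position, stop_position = 0, 40

# Hypothetical edge construction: evenly spaced edges covering the region.
bin_edges = np.arange(start_position, stop_position + window_size, window_size)

# np.histogram uses half-open bins [left, right), except the last, which is closed.
counts, bin_edges = np.histogram(pos, bins=bin_edges)

# Derived quantities, as described above.
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2
bin_widths = np.diff(bin_edges)
```

Note that position 20 falls in the window starting at 20, not the one ending at 20, because each window includes its left edge only.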

anhima.loc.plot_windowed_variant_counts(pos, window_size, start_position=None, stop_position=None, ax=None, plot_kwargs=None)[source]

Plot windowed variant counts.

Parameters:

pos : array_like

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

window_size : int

The size in base-pairs of the windows.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

plot_kwargs : dict-like, optional

Additional keyword arguments passed through to plt.plot.

Returns:

ax : axes

The axes on which the plot was drawn.

See also

windowed_variant_counts, plot_windowed_variant_density

anhima.loc.windowed_variant_density(pos, window_size, start_position=None, stop_position=None)[source]

Calculate per-base-pair density of variants in non-overlapping windows over the genome.

Parameters:

pos : array_like

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

window_size : int

The size in base-pairs of the windows.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

Returns:

density : ndarray, float

The density of variants in each window.

bin_edges : ndarray, int

The edge positions of each window. Note that this has length len(density)+1. To determine bin centers use (bin_edges[:-1] + bin_edges[1:]) / 2. To determine bin widths use np.diff(bin_edges).

See also

plot_windowed_variant_density, windowed_variant_counts
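
Per-base-pair density is the count in each window divided by that window's width. A minimal sketch with hypothetical counts and edges, not taken from anhima:

```python
import numpy as np

counts = np.array([2, 1, 1, 2])            # hypothetical variants per window
bin_edges = np.array([0, 10, 20, 30, 40])  # hypothetical window edges

# Variants per base pair: count divided by window width (true division).
density = counts / np.diff(bin_edges)
```

Dividing by the actual widths rather than by window_size keeps the result correct if the final window is shorter than the others.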

anhima.loc.plot_windowed_variant_density(pos, window_size, start_position=None, stop_position=None, ax=None, plot_kwargs=None)[source]

Plot windowed variant density.

Parameters:

pos : array_like

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

window_size : int

The size in base-pairs of the windows.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

ax : axes, optional

The axes on which to draw. If not provided, a new figure will be created.

plot_kwargs : dict-like, optional

Additional keyword arguments passed through to plt.plot.

Returns:

ax : axes

The axes on which the plot was drawn.

See also

windowed_variant_density, plot_windowed_variant_counts

anhima.loc.windowed_statistic(pos, values, window_size, start_position=None, stop_position=None, statistic='mean')[source]

Calculate a statistic for values binned in non-overlapping windows over the genome.

Parameters:

pos : array_like

A sorted 1-dimensional array of genomic positions from a single chromosome/contig.

values : array_like

A 1-D array of the same length as pos.

window_size : int

The size in base-pairs of the windows.

start_position : int, optional

The start position for the region over which to work.

stop_position : int, optional

The stop position for the region over which to work.

statistic : string or function, optional

The function to apply to values in each bin.

Returns:

stats : ndarray

The values of the statistic within each bin.

bin_edges : ndarray

The edge positions of each window. Note that this has length len(stats)+1. To determine bin centers use (bin_edges[:-1] + bin_edges[1:]) / 2. To determine bin widths use np.diff(bin_edges).
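
The binning step can be sketched in NumPy as follows, using the mean as the statistic. This is illustrative only: the edges and values are hypothetical, windows are treated as half-open [left, right), and anhima's own handling of window boundaries and empty windows may differ.

```python
import numpy as np

pos = np.array([3, 7, 12, 20, 31, 38])             # hypothetical sorted positions
values = np.array([1.0, 3.0, 5.0, 2.0, 4.0, 8.0])  # one value per position
bin_edges = np.array([0, 10, 20, 30, 40])          # hypothetical window edges

# Window index of each position; all positions here lie below the final edge.
window_index = np.digitize(pos, bin_edges) - 1

# Mean of the values in each window; NaN marks a window with no positions.
stats = np.array([
    values[window_index == i].mean() if np.any(window_index == i) else np.nan
    for i in range(len(bin_edges) - 1)
])
```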

anhima.loc.evenly_downsample_variants(a, k)[source]

Evenly downsample an array along the first dimension to length k (or as near as possible), assuming the first dimension corresponds to variants.

Parameters:

a : array_like

The array to downsample.

k : int

The target number of variants.

Returns:

b : array_like

A downsampled view of a.
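
Even downsampling can be done with a strided slice, which is why the result is a view and why its length is only approximately k. The step calculation below is an assumption made for illustration, and the data are hypothetical.

```python
import numpy as np

a = np.arange(20).reshape(10, 2)  # hypothetical: 10 variants x 2 samples
k = 3                             # target number of variants

# Take every step-th variant; integer division means the result may exceed k.
step = max(1, a.shape[0] // k)
b = a[::step]
```

With 10 variants and k = 3 the step is 3, so b has 4 rows (variants 0, 3, 6, and 9) and shares memory with a.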

anhima.loc.randomly_downsample_variants(a, k)[source]

Randomly downsample an array along the first dimension to length k, assuming the first dimension corresponds to variants.

Parameters:

a : array_like

The array to downsample.

k : int

The target number of variants.

Returns:

b : array_like

A downsampled copy of a.
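
Random downsampling selects k distinct row indices and gathers them, which produces a copy rather than a view. The sketch below is one way to do this and is not necessarily how anhima does it; sorting the indices to preserve variant order, the fixed seed, and the data are all assumptions made for illustration.

```python
import random

import numpy as np

a = np.arange(20).reshape(10, 2)  # hypothetical: 10 variants x 2 samples
k = 4                             # target number of variants

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
# Sample k distinct row indices, then sort them to keep genomic order.
indices = sorted(rng.sample(range(a.shape[0]), k))

# Gathering by index produces a new array that does not share memory with a.
b = np.take(a, indices, axis=0)
```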