Reference
See below for descriptions of relevant methods for computing IHD.
adjust_spectra_for_nuccomp(a_spectra, b_spectra, a_denom, b_denom)
Scale the counts in the input mutation spectra
such that they are adjusted in terms of nucleotide context.
For a given mutation type (e.g., C>T), if the a_spectra
contains more C>[A/T/G] mutations, it may simply be because
the samples represented in a_spectra
had more cytosines
that were accessible to variant calling. If a_denom
contains
more C nucleotides than b_denom
, we adjust the counts of the
C>[A/T/G] mutations in a_spectra
by a scaling factor
(b_denom
/ a_denom
). And vice versa.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
a_spectra |
ndarray
|
1D numpy array containing the aggregate mutation spectrum in group "A." |
required |
b_spectra |
ndarray
|
1D numpy array containing the aggregate mutation spectrum in group "B." |
required |
a_denom |
ndarray
|
1D numpy array containing the aggregate number of callable base pairs in group "A." At each index i in this array, the total count of accessible nucleotides should correspond to the "reference" nucleotide of the corresponding mutation type in |
required |
b_denom |
ndarray
|
1D numpy array containing the aggregate number of callable base pairs in group "B." At each index i in this array, the total count of accessible nucleotides should correspond to the "reference" nucleotide of the corresponding mutation type in |
required |
Returns:
Name | Type | Description |
---|---|---|
adj_a_spectra |
ndarray
|
The first of the input mutation spectra, adjusted for nucleotide context. |
adj_b_spectra |
ndarray
|
The second of the input mutation spectra, adjusted for nucleotide context. |
Source code in ihd/utils.py
231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 |
|
calculate_confint(spectra, genotype_matrix, covariate_matrix, distance_method=compute_manual_chisquare, n_permutations=10000, progress=True, adjust_statistics=True, conf_int=90.0)
Calculate a confidence interval around the maximum observed
distance peak by performing bootstrap resampling. In each of the
n_permutations
trials do the following: 1. resample the input mutation spectra
with replacement. 2. run an IHD scan by computing the distance between
the aggregate mutation spectrum of samples with either genotype
at every marker in the genotype_matrix
. 3. record the maximum distance encountered
at any marker, as well as the marker index at which that distance was observed.
We'll consider that marker to be the likely "peak."
Then, we return the bounds of the markers that contain 95% of all observed
peaks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
spectra |
ndarray
|
A 2D numpy array of mutation spectra in all genotyped samples, of size (N, M) where N is the number of samples and M is the number of mutation types. |
required |
genotype_matrix |
ndarray
|
A 2D numpy array of genotypes at every genotyped marker, of size (G, N), where G is the number of genotyped sites and N is the number of samples. |
required |
covariate_matrix |
ndarray
|
A 2D numpy array of size (C, N), where C is the number of covariates and N is the number of samples. |
required |
distance_method |
Callable
|
Callable method to compute the distance between aggregate mutation spectra. Must accept two 1D numpy arrays and return a single floating point value. Defaults to |
compute_manual_chisquare
|
n_permutations |
int
|
Number of permutations to perform (i.e., number of times to resample the spectra with replacement. Defaults to 1_000. |
10000
|
progress |
bool
|
Whether to output a count of how many permutations have completed. Defaults to False. |
True
|
adjust_statistics |
bool
|
Whether to compute adjusted statistics at each marker by regressing genotype similarity against statistics. Defaults to True. |
True
|
conf_int |
float
|
The confidence interval to compute around the marker with the highest distance value. Defaults to 95. |
90.0
|
Returns:
Name | Type | Description |
---|---|---|
conf_ints |
Tuple
|
Tuple of two integers, corresponding to the lower and upper bound markers defining the specified confidence interval. |
Source code in ihd/utils.py
581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 |
|
calculate_covariate_by_marker(covariate_matrix, genotype_matrix)
For each covariate in the covariate_matrix
, compute the ratio of
covariate values in samples with A or B genotypes at each marker in the
genotype_matrix
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
covariate_matrix |
ndarray
|
A 2D numpy array of size (C, N), where C is the number of covariates and N is the number of samples. |
required |
genotype_matrix |
ndarray
|
A 2D numpy array of genotypes at every genotyped marker, of size (G, N), where G is the number of genotyped sites and N is the number of samples. |
required |
Returns:
Type | Description |
---|---|
ndarray
|
np.ndarray: A 2D numpy array of size (G, C), where G is the number of genotyped sites and C is the number of covariates. |
Source code in ihd/utils.py
414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 |
|
compute_allele_frequency(genotype_matrix)
Given a genotype matrix of size (G, N) where G is the number of genotyped sites and N is the number of samples, compute the allele frequency at every marker.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
genotype_matrix |
ndarray
|
A 2D numpy array of genotypes at every genotyped marker, of size (G, N), where G is the number of genotyped sites and N is the number of samples. |
required |
Returns:
Name | Type | Description |
---|---|---|
allele_frequencies |
ndarray
|
A 1D numpy array of size (G, ) containing allele frequencies at every genotyped site in the collection of N samples. |
Source code in ihd/utils.py
61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 |
|
compute_genotype_similarity(genotype_matrix)
Compute the genetic similarity between groups of haplotypes at every genotyped marker. At each marker, divide the haplotypes into two groups based on the parental allele each haplotype inherited. In each group, calculate allele frequencies at every marker along the genome. Then, calculate the Pearson correlation coefficient between the two groups' allele frequency vectors.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
genotype_matrix |
ndarray
|
A 2D numpy array of genotypes at every genotyped marker, of size (G, N), where G is the number of genotyped sites and N is the number of samples. |
required |
Returns:
Name | Type | Description |
---|---|---|
genotype_similarity |
ndarray
|
A 1D numpy array of size (G, ), where G is the number of genotyped sites. |
Source code in ihd/utils.py
192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 |
|
compute_haplotype_distance(a_haps, b_haps, distance_method=compute_manual_chisquare)
Compute the distance between the aggregate mutation spectrum of two collections of haplotype mutation data. The input arrays should both be 2D numpy arrays of size (N, M), with rows and columns corresponding to samples and mutation types, respectively. This method will first aggregate the mutation spectra across samples to create two new 1D arrays, each of size (M, ). Then, it will compute the distance between those two 1D arrays using the provided distance method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
a_haps |
ndarray
|
2D array of size (N, M) containing the mutation spectrum of each sample, where N is the number of samples and M is the number of mutation types. |
required |
b_haps |
ndarray
|
2D array of size (N, M) containing the mutation spectrum of each sample, where N is the number of samples and M is the number of mutation types. |
required |
distance_method |
Callable
|
Callable method to compute the distance between aggregate mutation spectra. Must accept two 1D numpy arrays and return a single floating point value. Defaults to |
compute_manual_chisquare
|
Returns:
Name | Type | Description |
---|---|---|
distance |
float64
|
Distance between the aggregate mutation spectra of the two haplotypes. |
Source code in ihd/utils.py
105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 |
|
compute_manual_chisquare(a, b)
Compute a chi-square test of independence between two
groups of observations. Since numba doesn't cooperate with the
scipy.stats
implementation, I've written a "manual" version
of the calculation so that it's jit-able.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
a |
ndarray
|
1D numpy array of (N, ) observations. |
required |
b |
ndarray
|
1D numpy array of (N, ) obseravtions. |
required |
Returns:
Name | Type | Description |
---|---|---|
chisquare_stat |
float64
|
The Chi-square statistic comparing observed vs. expected values of each observation. |
Source code in ihd/utils.py
7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 |
|
compute_manual_cosine_distance(a, b)
Compute the cosine distance between
two 1D numpy arrays. Although methods to compute
cosine similarity and distance exist in scipy
and
sklearn
, they are not numba.njit
'able.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
a |
ndarray
|
A 1D numpy array of size (N, ). |
required |
b |
ndarray
|
A 1D numpy array of size (N, ). |
required |
Returns:
Name | Type | Description |
---|---|---|
cosine_distance |
float64
|
Cosine distance between a and b. |
Source code in ihd/utils.py
84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 |
|
compute_nansum(a, row=True)
Compute the sum of a 2D numpy array
on a per-row basis, ignoring nans. Since numba
does not
support kwargs in the np.nansum
function, it's
necessary to piece this out into its own function
so that it can be decorated with numba.njit
.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
a |
ndarray
|
A 2D numpy array of size (N, M). |
required |
row |
bool
|
Whether to calculate means by row or column. Defaults to True. |
True
|
Returns:
Name | Type | Description |
---|---|---|
rowsums |
ndarray
|
A 1D numpy array of size (N, ) containing sums of every row in the input. |
Source code in ihd/utils.py
36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 |
|
compute_residuals(X, y)
Use ordinary least-squares (OLS) to fit a linear model
predicting y
as a function of X
. Then, compute the residuals
between the predicted y-values and the true y-values.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X |
ndarray
|
2D numpy array of size (M, N), where M is the number of observations and N is the number of independent variables. |
required |
y |
ndarray
|
1D numpy array of size (M, ), containing the dependent variables for all M observations. |
required |
Returns:
Name | Type | Description |
---|---|---|
resids |
ndarray
|
1D numpy array of size (M, ) containing residuals. |
Source code in ihd/utils.py
458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 |
|
compute_spectra(mutations, k=1, cpg=True)
Compute the mutation spectrum of every sample in the input pd.DataFrame of mutation data. The input dataframe should either contain a single entry for every mutation observed in every sample, or the aggregate count of each mutation type observed in each sample. The input dataframe must have at least three columns: 'Strain' (denoting the sample), 'kmer' (denoting the 3-mer context of the mutation -- e.g, CCT>CAT), and 'count' (denoting the number of times a mutation of type 'kmer' was observed in 'sample'). If the dataframe contains a single entry for every mutation observed in every sample, then the 'count' column should contain a value of 1 in every row.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
mutations |
DataFrame
|
Pandas dataframe containing information about the mutations observed in each sample. The dataframe must have at least three columns: 'sample' (denoting the sample), 'kmer' (denoting the 3-mer context of the mutation), and 'count' (denoting the number of times a mutation of type 'kmer' was observed in 'sample'). |
required |
k |
int
|
k-mer size of mutations (e.g., k=3 means that we will treat mutations as NXN NYN, where Ns represent the nucleotide contexts on either side of the mutation, whereas k=1 means we will treat them as X Y). Defaults to 1. |
1
|
cpg |
bool
|
whether to separate CpG>TpG mutations into their own mutation type, distinct from C>T. Defaults to True. |
True
|
Returns:
Name | Type | Description |
---|---|---|
samples |
List[str]
|
A list of samples in the dataframe (which are indexed in the same order as the rows of the 2D |
mutation_types |
List[str]
|
A list of the unique mutation types observed in the dataframe. |
spectra |
ndarray
|
A 2D numpy array of mutation spectra in all genotyped samples, of size (N, M) where N is the number of samples and M is the number of mutation types. |
Source code in ihd/utils.py
704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761 762 763 764 765 766 767 768 769 770 771 772 773 |
|
get_covariate_matrix(pheno, samples, covariate_cols=['n_generations'])
Generate a matrix of covariate values for each sample. Returns
a 2D numpy array of size (C, N), where C is the number of covariates
specified in the covariate_cols
argument, and N is the number of samples.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pheno |
DataFrame
|
Pandas dataframe containing metadata about each sample. This dataframe should have a "sample" column and should contain each of the columns specified in |
required |
samples |
List[str]
|
List of samples to subset the dataframe. |
required |
covariate_cols |
List[str]
|
List of column names in |
['n_generations']
|
Returns:
Type | Description |
---|---|
ndarray
|
np.ndarray: A 2D numpy array of size (C, N), where C is the number of covariates and N is the number of samples. |
Source code in ihd/utils.py
381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 |
|
perform_ihd_scan(spectra, genotype_matrix, genotype_similarity, covariate_ratios, distance_method=compute_manual_chisquare, adjust_statistics=True)
Iterate over every genotyped marker in the genotype_matrix
,
divide the haplotypes into two groups based on sample genotypes at the
marker, and compute the distance between the aggregate mutation spectra
of each group. Return a list of cosine distances of length (G, ), where
G is the number of genotyped sites.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
spectra |
ndarray
|
A 2D numpy array of mutation spectra in all genotyped samples, of size (N, M) where N is the number of samples and M is the number of mutation types. |
required |
genotype_matrix |
ndarray
|
A 2D numpy array of genotypes at every genotyped marker, of size (G, N), where G is the number of genotyped sites and N is the number of samples. |
required |
genotype_similarity |
ndarray
|
A 1D numpy array of size (G, ), where G is the number of genotyped sites. Contains the correlation coefficient between genome-wide allele frequencies of haplotypes with A vs. B genotypes at every site. |
required |
covariate_ratios |
ndarray
|
A 1D numpy array of size (G, ), where G is the number of genotyped sites. Contains the ratio of covariate values between haplotypes with A vs. B genotypes at every site. |
required |
distance_method |
Callable
|
Callable method to compute the distance between aggregate mutation spectra. Must accept two 1D numpy arrays and return a single floating point value. Defaults to |
compute_manual_chisquare
|
adjust_statistics |
bool
|
Whether to compute adjusted statistics at each marker by regressing genotype similarity against statistics. Defaults to True. |
True
|
Returns:
Name | Type | Description |
---|---|---|
distances |
List[float64]
|
List of inter-haplotype distances at every marker. |
Source code in ihd/utils.py
303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 |
|
perform_permutation_test(spectra, genotype_matrix, genotype_similarity, covariate_ratios, strata, distance_method=compute_manual_chisquare, n_permutations=1000, progress=False, adjust_statistics=True)
Conduct a permutation test to assess the significance of
any observed IHD peaks. In each of the n_permutations
trials,
do the following: 1. create a shuffled version of the input mutation spectra
, so that
sample names/indices no longer correspond to the appropriate mutation
spectrum. 2. run an IHD scan by computing the distance between
the aggregate mutation spectrum of samples with either genotype
at every marker in the genotype_matrix
. 3. store the maximum distance encountered at any marker.
Then, return a list of the maximum distances encountered in each of
the trials. Alternatively, if comparison_wide
is True, return a matrix of
size (P, G), where P is the number of permutations and G is the number of
genotyped markers, in which we store the distance value encountered at
every marker in every permutation trial.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
spectra |
ndarray
|
A 2D numpy array of mutation spectra in all genotyped samples, of size (N, M) where N is the number of samples and M is the number of mutation types. |
required |
genotype_matrix |
ndarray
|
A 2D numpy array of genotypes at every genotyped marker, of size (G, N), where G is the number of genotyped sites and N is the number of samples. |
required |
genotype_similarity |
ndarray
|
A 1D numpy array of correlation coefficients of size (G, ), where G is the number of genotyped sites. At each element of the array, we store the correlation coefficient between genome-wide D allele frequencies calculated in samples with either allele at the corresponding site G_i. |
required |
covariate_ratios |
ndarray
|
A 1D numpy array of size (G, ), where G is the number of genotyped sites. Contains the ratio of covariate values between haplotypes with A vs. B genotypes at every site. |
required |
strata |
ndarray
|
A 1D numpy array of "group labels" of size (N, ), where N is the number of samples. If samples are assigned different group labels, their spectra will be permuted within those groups during the permutation testing step. |
required |
distance_method |
Callable
|
Callable method to compute the distance between aggregate mutation spectra. Must accept two 1D numpy arrays and return a single floating point value. Defaults to |
compute_manual_chisquare
|
n_permutations |
int
|
Number of permutations to perform (i.e., number of times to shuffle the spectra and compute IHDs at each marker). Defaults to 1_000. |
1000
|
progress |
bool
|
Whether to output a count of how many permutations have completed. Defaults to False. |
False
|
adjust_statistics |
bool
|
Whether to compute adjusted statistics at each marker by regressing genotype similarity against statistics. Defaults to True. |
True
|
Returns:
Name | Type | Description |
---|---|---|
null_distances |
ndarray
|
2D numpy array of size (P, G) where P is the number of permutations and G is either 1 (if we're computing a genome-wide distance threshold) or the number of genotyped markers (if we're computing thresholds at each individual marker). |
Source code in ihd/utils.py
485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 |
|
shuffle_spectra(spectra, groups)
Randomly shuffle the rows of a 2D numpy array of
mutation spectrum data of size (N, M), where N is the number
of samples and M is the number of mutation types. Shuffled array
is returned such that sample mutation spectra no longer correspond to the
appropriate sample indices into the rows of the array. Samples are also
shuffled within the specified groups
. In other words, we only shuffle
the indices of samples that correspond to a unique value in the groups
array.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
spectra |
ndarray
|
2D numpy array of mutation spectrum data of shape (N, M), where N is the number of samples and M is the number of mutation types. |
required |
groups |
ndarray
|
1D numpy array of group labels (e.g., epochs) that each sample belongs to, of shape (N, ) where N is the number of samples. Mutation spectra will only be shuffled within their assigned group. |
required |
Returns:
Name | Type | Description |
---|---|---|
shuffled_spectra |
ndarray
|
The input array, but with shuffled rows. |
Source code in ihd/utils.py
150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 |
|