Stats Utility Functions#

This module provides access to functions to perform various statistical operations.

sample_standard_Normal(size, random_seed=None)[source]#: Sample a standard normal random variable, of shape=size

sample_Rademacher(size, dtype=<class 'float'>, random_seed=None)[source]#: Sample a Rademacher distribution, and return a sample of the passed size (Numpy array is returned for any size)

trace_estimator(matvec_fun, vector_size, sample_size=100, optimize_sample_size=False, min_sample_size=10, max_sample_size=500, distribution='Rademacher', random_seed=None, accuracy=0.01, verbose=False)[source]#

Given a matvec_fun that returns the result of multiplying the matrix (e.g., underlying covariance matrix) by a vector, estimate the trace of the matrix using randomization

Parameters:

matvec_fun – a function that returns the result of multiplying the matrix (e.e., underlying covariance matrix) by a vector, estimate the trace of the matrix using randomization
vector_size – sample size to use for randomized trace estimation
optimize_sample_size (bool) – if True, sample_size is replaced with an optimized version, where the optimal sample size minimizes the variance of the estimator, and must be between min_sample_size, and max_sample_size
min_sample_size (int)
max_sample_size (int)
distribution (str) – The probability distribution used for random sampling. Both ‘Rademacher’ and ‘Gaussian’ are supported.
accuracy (float) – convergence criterion
verbose (bool) – screen verbosity of progress

Returns:

converged: (bool) a flag indicating whether the estimator converged or not
trace: randomization estimate of the trace of the matrix whose product by a vector is evaulated by matvec_fun
sample_history
trace_history

logdet_estimator(matvec_fun, vector_size, lanczos_degree=16, sample_size=32, randomization_vectors=None, distribution='Rademacher', random_seed=None, verbose=False, **kwargs)[source]#

Given a matvec_fun that returns the result of multiplying the matrix (e.g., underlying covariance matrix) by a vector, estimate the logarithm of the determinant (logdet) of the matrix using a stochastic Lanczos approximatioa stochastic Lanczos approximationn.

Note

We implement the algorithm described in the following paper:

Parameters:

matvec_fun – a function that returns the result of multiplying the matrix (e.e., underlying covariance matrix) by a vector. NOTE: It is assumed that matvec_fun represents a symmetric positive definite linear operator. No check is performed to verify this.
vector_size – size of the vector to use for the matvec function
lanczos_degree – degree of the Lanczos approximation. Default is 16.
sample_size – number of random vectors to use for the stochastic Lanczos approximation. Default is 32. Ignored if randomization_vectors is provided.
randomization_vectors – a list of random vectors to use for the stochastic Lanczos approximation. If provided, sample_size is ignored. Default is None.
distribution – The probability distribution used for random sampling. Both ‘Rademacher’ and ‘Gaussian’ are supported. Default is ‘Rademacher’.
random_seed – random seed for random number generator. Default is None.
verbose – screen verbosity of progress

Returns:

A tuple containing the following elements:

converged: a flag indicating whether the estimator converged or not.
logdet: randomization estimate of the logdet of the matrix whose product by a vector is evaulated by matvec_fun
sample_history
logdet_history

Return type:

tuple[bool, float, list[ndarray], list[ndarray]]

lanczos(H, v0, k=20, termination_tol=1e-10)[source]#

Perform k iterations of Lanczos’ algorithm.

Note

See chapter 10 of the following reference for the implemented algorithm:

Golub, G. H., & Van Loan, C. F. (2013). Matrix computations. The Johns Hopkins University Press.

Parameters:

H (np.ndarray | sp.spmatrix | splinalg.LinearOperator) – Matrix to perform Lanczos iterations on. Does not need to be of these types, but must implement the .shape property, .T method and @ operator.
psi0 – Initial vector for the Lanczos iteration.
k (int) – Number of iterations to perform.
termination_tol (float) – Tolerance for early termination of the Lanczos iteration.

Returns:

A tuple containing the following elements:

T: Tridiagonal matrix of size k x k containing the Lanczos iterates.
vecs: Matrix of size n x k containing the Lanczos vectors.

Return type:

tuple[ndarray, ndarray]

rsvd(A, k, q=1, compute_uv=True, Omega=None, p=20, distribution='Gaussian', random_seed=None, verbose=False)[source]#

Perform a Singular Value Decomposition (SVD) of a matrix A using a randomized SVD (rSVD) algorithm.

Note

See the following paper for more details of the algorithm:

Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. In SIAM Review (Vol. 53, Issue 2, pp. 217–288). Society for Industrial & Applied Mathematics (SIAM). https://doi.org/10.1137/090771806

Parameters:

A (np.ndarray | sp.spmatrix | splinalg.LinearOperator) – Matrix to decompose. Does not need to be of these types, but must implement the .shape property, .T method and @ operator.
k (int) – Number of singular values/vectors to compute.
q (int) – Number of randomized subspace iterations to perform. Default is 1.
is_symmetric (bool) – If True, the matrix is assumed to be symmetric, and the algorithm is optimized for this case. Default is False.
compute_uv (bool) – If True, the left and right singular vectors are computed and returned in U and Vh, respectively. Default is True.
Omega (np.ndarray | sp.spmatrix) – Random matrix to use for the rSVD algorithm. If not specified, a random matrix is generated internally. Default is None.
p (int) – Number of columns of Omega to use. If Omega is provided, this setting is ignored. Default is 20.
distribution (str) – The probability distribution used for random sampling. Both ‘Rademacher’ and ‘Gaussian’ are supported. Default is ‘Gaussian’.
random_seed (int) – random seed for random number generator. Default is None.
verbose (bool) – if True, print debug information.

calculate_rmse(first, second, ignore_nan=True)[source]#

Calculate the root mean squared error between two vectors of the same type.

Parameters:

first – first array/vector
second – second array/vector
ignore_nan – if True np.nan values are ignored from the comparison and counting, otherwise any nan value in the difference will result in nan accounted for.

Returns:

rmse: the corresponding root mean squared error

factorize_covariances(covariance, lower=True)[source]#

Genereate Cholesky decomposition of the passed covariance matrix/kernel. For scalar covariance, this is the standard deviation. For matrices, this is the lower triangular matrix by default.

Parameters:: covariance – float, or 2d numpy.ndarray or scipy.spmatrix
Returns:: square root of covariance if scalar, otherwise, lower triangular portion of the matrix Cholesky decomposition. Return type is same as covariance type unless a callable is passed. In this case, a scipy linear operator is returned.

create_covariance_matrix(size=2, corr=False, stdev=None, return_factor=False, random_seed=None)[source]#

construct a numpy covariance matrix of dimension (size x size), with/without correlations if return factor is True, the Cholesky factor (lower) is returned, otherwise the covariance matrix is returned

The passed standard deviations stdev are not exactly the same as square root of the variances as they are used to set the Cholesky factor of the covariance matrix in the presence of correlations.

posterior_covariance_matrix(B, R, F=None, w=None)[source]#

Assuming Gaussianity, and given prior covariance matrix, and observation covariance matrix, construct the posterior covariance matrix

Parameters:

B – Prior Covariance matrix
R – Observation error covariance matrix
F – Forward model: \(y = F x + \eta; \eta \sim N(0, R)\); F is set to I if None is passed
w – Design variable; set to 1 if None is Passed

Returns:

A: Posterior covariance matrix \((F^T R^{-1/2} W R^{-1/2} F + B^{-1})^{-1}\)

calculate_oed_criterion(post_cov, oed_criterion='A')[source]#

Given a matrix post_cov, calculate (exactly) the optimality criterion

Parameters:

post_cov – the posterior covariance matrix, out of which optimality criterion is evaluated
oed_criterion – ‘A’ for A-optimality (Trace(post_cov)), ‘D’ for D-optimality (log-det (post_cov))

Returns:

oed_objective:

Trace(post_cov) if oed_criterion is set to ‘A’,
log-det(post_cov) if oed_criterion is set to ‘D’

create_complex_covariance_matrices(size=2, stdev=None, random_seed=None)[source]#

construct numpy covariance matrices of dimension (size x size) for complex random variables/vectors, with/without correlations and/or cross-correlations between real and imaginary parts.

This function calculates covariances and pseudo covariances by sampling real and imaginary parts, and then calculate covariances from the sample

The passed standard deviations stdev are not exactly the same as square root of the variances; they are just used to sample real and imaginary parts which are then aggregated.

real_to_complex_covariances(C_RR, C_IR, C_RI, C_II)[source]#

Convert covariances and cross covariances or real and imaginary parts to covariances and pseudo-covariances (relations) of complex random variable/vector

Parameters:

C_RR – covariances of the real part of a complex random vector/variable
C_IR – cross-covariances of the real with the imaginary part of a complex random vector/variable
C_RI – cross-covariances of the imaginary with the real part of a complex random vector/variable
C_II – covariances of the imaginary part of a complex random vector/variable

Returns:

K: the covariances of the complex variable
J: the pseudo covariances

complex_to_real_covariances(K, J, ignore_design=False)[source]#

Convert covariances and cross covariances or real and imaginary parts to covariances and pseudo-covariances (relations) of a complex random variable/vector

Parameters:

K – the covariances of the complex variable
J – the pseudo covariances
ignore_design (bool) – if False the internal design is ignored, otherwise, the matrices are projected on the space spanned by the active entries of the design.

Returns:

C_RR: covariances of the real part of a complex random vector/variable
C_IR: cross-covariances of the real with the imaginary part of a complex random vector/variable
C_RI: cross-covariances of the imaginary with the real part of a complex random vector/variable
C_II: covariances of the imaginary part of a complex random vector/variable

complex_sample_covariance(sample, ddof=1)[source]#

Evaluate sample covariances and pseudo covariances for a given sample from the distribution of a complex-valued random variable/vector. Sample covariances are evaluated based on conjugate transpose, but pseudo covariances are evaluated based on transpose of the deviations.

Parm sample:

ArrayLike iterable of two dimensions. Each row represetns a sample of the underlying distribution

Parameters:

ddof (int) – ddof=1 will return unbiased estimate since in the covariance formula, we divide by sample_size-ddof.

Returns:

covariances: sample-based covariance matrix
pseudo_covariances: sample-based pseudo covariance matrix