Stats Utility Functions#

Statistical and linear algebra utilities for PyOED.

This module provides randomized estimators (trace, log-determinant), Lanczos iteration, randomized SVD, covariance matrix construction and factorization, RMSE computation, and optimality-criterion evaluation used throughout PyOED.

sample_standard_Normal(size, random_seed=None)[source]#

Sample from a standard normal distribution.

Parameters:
  • size – shape of the output array (scalar or tuple).

  • random_seed (int | None) – seed for the random number generator.

Returns:

array of standard normal samples with the given shape.

Return type:

numpy.ndarray

sample_Rademacher(size, dtype=<class 'float'>, random_seed=None)[source]#

Sample from a Rademacher distribution (\(\pm 1\) with equal probability).

Parameters:
  • size – shape of the output array (scalar or tuple).

  • dtype (type) – data type of the returned array. Default is float.

  • random_seed (int | None) – seed for the random number generator.

Returns:

array of \(\pm 1\) values with the given shape.

Return type:

numpy.ndarray

trace_estimator(matvec_fun, vector_size, sample_size=100, optimize_sample_size=False, min_sample_size=10, max_sample_size=500, distribution='Rademacher', random_seed=None, accuracy=0.01, verbose=False)[source]#

Given a matvec_fun that returns the result of multiplying the matrix (e.g., underlying covariance matrix) by a vector, estimate the trace of the matrix using randomization

Parameters:
  • matvec_fun – a function that returns the result of multiplying the matrix (e.e., underlying covariance matrix) by a vector, estimate the trace of the matrix using randomization

  • vector_size – sample size to use for randomized trace estimation

  • optimize_sample_size (bool) – if True, sample_size is replaced with an optimized version, where the optimal sample size minimizes the variance of the estimator, and must be between min_sample_size, and max_sample_size

  • min_sample_size (int)

  • max_sample_size (int)

  • distribution (str) – The probability distribution used for random sampling. Both ‘Rademacher’ and ‘Gaussian’ are supported.

  • accuracy (float) – convergence criterion

  • verbose (bool) – screen verbosity of progress

Returns:

  • converged: (bool) a flag indicating whether the estimator converged or not

  • trace: randomization estimate of the trace of the matrix whose product by a vector is evaulated by matvec_fun

  • sample_history

  • trace_history

logdet_estimator(matvec_fun, vector_size, lanczos_degree=16, sample_size=32, randomization_vectors=None, distribution='Rademacher', random_seed=None, verbose=False, **kwargs)[source]#

Given a matvec_fun that returns the result of multiplying the matrix (e.g., underlying covariance matrix) by a vector, estimate the logarithm of the determinant (logdet) of the matrix using a stochastic Lanczos approximatioa stochastic Lanczos approximationn.

Note

We implement the algorithm described in the following paper:

Parameters:
  • matvec_fun – a function that returns the result of multiplying the matrix (e.e., underlying covariance matrix) by a vector. NOTE: It is assumed that matvec_fun represents a symmetric positive definite linear operator. No check is performed to verify this.

  • vector_size – size of the vector to use for the matvec function

  • lanczos_degree – degree of the Lanczos approximation. Default is 16.

  • sample_size – number of random vectors to use for the stochastic Lanczos approximation. Default is 32. Ignored if randomization_vectors is provided.

  • randomization_vectors – a list of random vectors to use for the stochastic Lanczos approximation. If provided, sample_size is ignored. Default is None.

  • distribution – The probability distribution used for random sampling. Both ‘Rademacher’ and ‘Gaussian’ are supported. Default is ‘Rademacher’.

  • random_seed – random seed for random number generator. Default is None.

  • verbose – screen verbosity of progress

Returns:

A tuple containing the following elements:

  • converged: a flag indicating whether the estimator converged or not.

  • logdet: randomization estimate of the logdet of the matrix whose product by a vector is evaulated by matvec_fun

  • sample_history

  • logdet_history

Return type:

tuple[bool, float, list[ndarray], list[ndarray]]

lanczos(H, v0, k=20, termination_tol=1e-10)[source]#

Perform k iterations of Lanczos’ algorithm.

Note

See chapter 10 of the following reference for the implemented algorithm:

Golub, G. H., & Van Loan, C. F. (2013). Matrix computations. The Johns Hopkins University Press.

Parameters:
  • H (np.ndarray | sp.spmatrix | splinalg.LinearOperator) – Matrix to perform Lanczos iterations on. Does not need to be of these types, but must implement the .shape property, .T method and @ operator.

  • psi0 – Initial vector for the Lanczos iteration.

  • k (int) – Number of iterations to perform.

  • termination_tol (float) – Tolerance for early termination of the Lanczos iteration.

Returns:

A tuple containing the following elements:

  • T: Tridiagonal matrix of size k x k containing the Lanczos iterates.

  • vecs: Matrix of size n x k containing the Lanczos vectors.

Return type:

tuple[ndarray, ndarray]

rsvd(A, k, q=1, compute_uv=True, Omega=None, p=20, distribution='Gaussian', random_seed=None, verbose=False)[source]#

Perform a Singular Value Decomposition (SVD) of a matrix A using a randomized SVD (rSVD) algorithm.

Note

See the following paper for more details of the algorithm:

Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. In SIAM Review (Vol. 53, Issue 2, pp. 217–288). Society for Industrial & Applied Mathematics (SIAM). https://doi.org/10.1137/090771806

Parameters:
  • A (np.ndarray | sp.spmatrix | splinalg.LinearOperator) – Matrix to decompose. Does not need to be of these types, but must implement the .shape property, .T method and @ operator.

  • k (int) – Number of singular values/vectors to compute.

  • q (int) – Number of randomized subspace iterations to perform. Default is 1.

  • is_symmetric (bool) – If True, the matrix is assumed to be symmetric, and the algorithm is optimized for this case. Default is False.

  • compute_uv (bool) – If True, the left and right singular vectors are computed and returned in U and Vh, respectively. Default is True.

  • Omega (np.ndarray | sp.spmatrix) – Random matrix to use for the rSVD algorithm. If not specified, a random matrix is generated internally. Default is None.

  • p (int) – Number of columns of Omega to use. If Omega is provided, this setting is ignored. Default is 20.

  • distribution (str) – The probability distribution used for random sampling. Both ‘Rademacher’ and ‘Gaussian’ are supported. Default is ‘Gaussian’.

  • random_seed (int) – random seed for random number generator. Default is None.

  • verbose (bool) – if True, print debug information.

calculate_rmse(first, second, ignore_nan=True)[source]#

Calculate the root mean squared error (RMSE) between two vectors.

Parameters:
  • first – first array/vector.

  • second – second array/vector (must be the same type and size as first).

  • ignore_nan (bool) – if True, NaN values are excluded from the comparison and the count.

Returns:

the root mean squared error.

Return type:

float

factorize_covariances(covariance, lower=True)[source]#

Generate Cholesky decomposition of a covariance matrix/kernel.

Deprecated since version This: function is no longer available. Use pyoed.utility.math.factorize_spsd_matrix() instead.

Parameters:
  • covariancefloat, or 2-D numpy.ndarray or scipy.spmatrix.

  • lower (bool) – if True, return the lower triangular factor.

Returns:

square root of covariance if scalar, otherwise the lower triangular Cholesky factor.

Raises:

NotImplementedError – always.

create_covariance_matrix(size=2, corr=False, stdev=None, return_factor=False, random_seed=None)[source]#

Construct a numpy covariance matrix of dimension (size x size).

Parameters:
  • size (int) – dimension of the covariance matrix. Must be >= 1.

  • corr (bool) – if True, include off-diagonal correlations; otherwise the matrix is diagonal.

  • stdev – standard deviations for each dimension. If None, random values in (1e-5, 1] are used.

  • return_factor (bool) – if True, return the lower Cholesky factor instead of the full covariance matrix.

  • random_seed (int | None) – seed for the random number generator.

Returns:

covariance matrix (or its Cholesky factor if return_factor is True).

Return type:

numpy.ndarray

Note

The passed standard deviations are not exactly the square root of the variances; they are used to set the Cholesky factor, which is then used to construct the covariance via \(C = L L^T\).

posterior_covariance_matrix(B, R, F=None, w=None)[source]#

Construct the posterior covariance matrix assuming Gaussianity.

\[\mathbf{A} = \left( \mathbf{F}^T \mathbf{R}^{-1/2} \mathbf{W} \mathbf{R}^{-1/2} \mathbf{F} + \mathbf{B}^{-1} \right)^{-1}\]
Parameters:
  • B (numpy.ndarray) – prior covariance matrix of shape (Ns, Ns).

  • R (numpy.ndarray) – observation error covariance matrix of shape (No, No).

  • F (numpy.ndarray | None) – forward model (observation operator) of shape (No, Ns). Set to identity if None.

  • w (numpy.ndarray | None) – design variable of length No. Set to all ones if None.

Returns:

posterior covariance matrix of shape (Ns, Ns).

Return type:

numpy.ndarray

calculate_oed_criterion(post_cov, oed_criterion='A')[source]#

Evaluate an OED optimality criterion on a posterior covariance matrix.

Parameters:
  • post_cov (numpy.ndarray) – the posterior covariance matrix (must be square, 2-dimensional).

  • oed_criterion (str) –

    the criterion to evaluate:

    • 'A' — A-optimality: \(\operatorname{Tr}(\mathbf{A})\)

    • 'D' — D-optimality: \(\log\det(\mathbf{A})\)

Returns:

the scalar value of the optimality criterion.

Return type:

float

Raises:

ValueError – if oed_criterion is not recognized.

create_complex_covariance_matrices(size=2, stdev=None, random_seed=None)[source]#

Construct covariance matrices for complex random variables/vectors.

Calculates covariances and pseudo-covariances by sampling real and imaginary parts independently and computing sample statistics.

Parameters:
  • size (int) – dimension of the complex random vector. Must be >= 1.

  • stdev – standard deviations for real and imaginary parts (length 2*size). If None, random values are used.

  • random_seed (int | None) – seed for the random number generator.

Returns:

a tuple (covariances, pseudo_covariances) of numpy arrays.

Return type:

tuple[numpy.ndarray, numpy.ndarray]

Note

The passed standard deviations are not exactly the square root of the variances; they are used to scale the real and imaginary samples.

real_to_complex_covariances(C_RR, C_IR, C_RI, C_II)[source]#

Convert covariances and cross covariances or real and imaginary parts to covariances and pseudo-covariances (relations) of complex random variable/vector

Parameters:
  • C_RR – covariances of the real part of a complex random vector/variable

  • C_IR – cross-covariances of the real with the imaginary part of a complex random vector/variable

  • C_RI – cross-covariances of the imaginary with the real part of a complex random vector/variable

  • C_II – covariances of the imaginary part of a complex random vector/variable

Returns:

  • K: the covariances of the complex variable

  • J: the pseudo covariances

complex_to_real_covariances(K, J, ignore_design=False)[source]#

Convert covariances and cross covariances or real and imaginary parts to covariances and pseudo-covariances (relations) of a complex random variable/vector

Parameters:
  • K – the covariances of the complex variable

  • J – the pseudo covariances

  • ignore_design (bool) – if False the internal design is ignored, otherwise, the matrices are projected on the space spanned by the active entries of the design.

Returns:

  • C_RR: covariances of the real part of a complex random vector/variable

  • C_IR: cross-covariances of the real with the imaginary part of a complex random vector/variable

  • C_RI: cross-covariances of the imaginary with the real part of a complex random vector/variable

  • C_II: covariances of the imaginary part of a complex random vector/variable

complex_sample_covariance(sample, ddof=1)[source]#

Evaluate sample covariances and pseudo covariances for a given sample from the distribution of a complex-valued random variable/vector. Sample covariances are evaluated based on conjugate transpose, but pseudo covariances are evaluated based on transpose of the deviations.

Parameters:
  • sample – array-like of two dimensions. Each row represents a sample of the underlying distribution.

  • ddof (int) – ddof=1 will return unbiased estimate since in the covariance formula, we divide by sample_size-ddof.

Returns:

  • covariances: sample-based covariance matrix

  • pseudo_covariances: sample-based pseudo covariance matrix