Policy Optimization#

This module provides generic implementations of policy optimization methods those can be used for any probabilistic policy model.

ReinforceOptimizerConfigs(*[, debug, ...])

Configuration dataclass for the ReinforceOptimizer.

ReinforceOptimizer([configs])

Implementation of the REINFORCE algorithm for binary optimization (maximization or minimization).

Standard gradient-based policy optimization algorithms.

class ReinforceOptimizerConfigs(*, debug=False, verbose=False, output_dir='./_PYOED_RESULTS_', size=None, name='ReinforceOptimizer: Gradient-Based Stochastic Policy Optimization', screen_output_iter=1, file_output_iter=100, policy_distribution=None, policy_parameter_bounds=None, policy_parameter_projector=None, fun=None, initial_policy_parameter=None, maximize=None, def_stepsize=0.1, decay_step=False, stochastic_gradient_sample_size=32, importance_sampling=False, importance_sampling_distribution=None, baseline='none', optimal_baseline_sample_size=32, optimal_baseline_n_batches=1, fresh_sample_for_baseline=False, optimal_sample_size=5, maxiter=10000, pgtol=1e-12, monitor_iterations=True, random_seed=None, num_resets=0, uniform_random_sample_size=0)[source]#

Bases: OptimizerConfigs

Configuration dataclass for the ReinforceOptimizer.

Parameters:
  • verbose (bool) – a boolean flag to control verbosity of the object.

  • debug (bool) – a boolean flag that enables adding extra functionality in a debug mode

  • output_dir (str | Path) – the base directory where the output files will be saved.

  • name (str) – name of the optimizer. Default is None.

  • size (int | None) – dimension of the target/solution variable. Not all optimizers use this, but it is added for unification.

  • screen_output_iter (int) – iteration interval for screen output. Default is 1. Note that this should be a positive integer to enforce proper effect.

  • file_output_iter (int) – iteration interval for file output. Default is 1. Note that this should be a positive integer to enforce proper effect.

  • policy_distribution (Distribution | None) – a probabilistic distribution of the stochastic policy. This is the distribution of the random variable (action). The optimizer aims to optimize the parameter of this distribution.

  • policy_parameter_bounds (Sequence | None) – bounds (if any) on the policy parameters (e.g., box constraints)

  • projector – optional projection operator for the gradient. If passed, this need to be a callable that take as input a policy parameter vector, and returns another version with projected components. This projector is passed a parameter (after applying any bounds (if any)).

  • fun (Callable | None) – the objective function to be optimization. For a given state/design of size size it returns a scalar.

  • initial_policy_parameter (ndarray | Iterable | None) – the initial probabilistic policy parameter. If not passed, the parameters in the policy distribution are used.

  • maximize (bool | None) – if True the objective is to be maximized, otherwise minimization is assumed

  • def_stepsize (float) – default step size (learning rate) in the direction (or the opposite) of the gradient. Default is 1.

  • decay_step (bool) – if True reduce the learning rate (step size) as the algorithm proceeds

  • stochastic_gradient_sample_size (int) – size of the sample to generate (from current policy at each iteration) to estimate (using MC) the policy gradient

  • importance_sampling (bool) – if True apply importance sampling to minimize variability of the stochastic gradient.

  • importance_sampling_distribution (Distribution | None) – if importance_sampling is True, this must be a valid distribution that is used for sampling and evaluating PMF against underlying target policy distribution.

  • baseline (str | None) –

    name/type of the baseline used to reduce variability of the stochastic estimate of the gradient to use (if any); possible values are None, ‘none’, ‘optimal’, ‘heuristic’. These are explained as follows:

    • ’none’ | None: no baseline is used.

    • ’optimal’: a constant baseline is found by approximating the optimal baseline where optimality is defined as minimum variability of the estimator.

  • optimal_baseline_sample_size (int) – size of the batch/sample to use for optimal baseline estimation; used if baseline==True.

  • optimal_baseline_n_batches (int) – number of batches for optimal baseline estimation. Used only if baseline==True.

  • fresh_sample_for_baseline (bool) – if True generate new sample(s) for estimating the optimal baseling instesad of reusing the same sample used for estimating the gradient (the stochastic estimate).

  • optimal_sample_size (int) – size of the sample to return from the final/optimal policy upon termination of the optimizer

  • num_resets (None | int) – number of resets during optimization

  • uniform_random_sample_size (int) – size of a random sample generated uniformly (e.g., with 0.5 probability) for performance evaluation.

policy_distribution: Distribution | None#
policy_parameter_bounds: Sequence | None#
policy_parameter_projector: Callable | None#
fun: Callable | None#
initial_policy_parameter: ndarray | Iterable | None#
maximize: bool | None#
def_stepsize: float#
decay_step: bool#
stochastic_gradient_sample_size: int#
importance_sampling: bool#
importance_sampling_distribution: Distribution | None#
baseline: str | None#
optimal_baseline_sample_size: int#
optimal_baseline_n_batches: int#
fresh_sample_for_baseline: bool#
optimal_sample_size: int#
maxiter: int#
pgtol: float#
monitor_iterations: bool#
random_seed: int | None#
num_resets: None | int#
uniform_random_sample_size: int#
__init__(*, debug=False, verbose=False, output_dir='./_PYOED_RESULTS_', size=None, name='ReinforceOptimizer: Gradient-Based Stochastic Policy Optimization', screen_output_iter=1, file_output_iter=100, policy_distribution=None, policy_parameter_bounds=None, policy_parameter_projector=None, fun=None, initial_policy_parameter=None, maximize=None, def_stepsize=0.1, decay_step=False, stochastic_gradient_sample_size=32, importance_sampling=False, importance_sampling_distribution=None, baseline='none', optimal_baseline_sample_size=32, optimal_baseline_n_batches=1, fresh_sample_for_baseline=False, optimal_sample_size=5, maxiter=10000, pgtol=1e-12, monitor_iterations=True, random_seed=None, num_resets=0, uniform_random_sample_size=0)#
class ReinforceOptimizerResults(*, policy_distribution=None, initial_policy_parameter=None, converged=False, maximize=None, message='', maxiter_reached=False, importance_sampling_distribution=None, sampled_designs=None, sampled_designs_objval=None, optimal_policy_parameter=None, optimal_policy_sample=None, optimal_policy_sample_objval=None, optimal_design=None, optimal_design_objval=None, optimization_trajectory=None, optimization_trajectory_stoch_objval=None, optimization_trajectory_best_sample=None, optimization_trajectory_best_sample_objval=None, num_resets=None, reset_at_iterations=None, uniform_random_sample=None, uniform_random_sample_objval=None, configurations=None)[source]#

Bases: OptimizerResults

Container for results/data generated by the ReinforceOptimizer.

Parameters:
  • policy_distribution (Type[Distribution] | None) – the class used to create the policy distribution.

  • initial_policy_parameter (ndarray | Iterable | None) – the initial probabilistic policy parameter. If not passed, the parameters in the policy distribution are used.

  • maximize (bool | None) – if True the objective is to be maximized, otherwise minimization is assumed. If None, it is not set

  • converged (bool) – True if the algorithm converged, False otherwise.

  • message (str) – a string message describing the final status of the algorithm

  • maxiter_reached (bool) – whether the maximum number of iterations of the algorithm has been reached/exceeded.

  • sampled_designs (list[list[Iterable]] | None) – a list (of lists) containing all sampled design variables (samples) for each iteration of the algorithm. The length of this list is equal to the number of iterations taken by the algorithm.

  • sampled_designs_objval (list[list[float]] | None) – a list (of lists) of the objective value of each design variable/sample in sampled_designs.

  • optimal_policy – the parameters of the optimal policy (probability of activation of each entry of the design)

  • optimal_policy_sample (ndarray | Iterable | None) – design sample generated by sampling the optimal policy (Bernoulli samples with probababilities set to the optimal policy parameters optimal_policy_parameters)

  • optimal_policy_sample_objval (ndarray | Iterable | None) – value of the objective function \(J\) evaluated at each design in optimal_sample

  • optimal_design (ndarray | Iterable | None) – the optimal design (binary vector) sampled from the optimal_policy.

  • optimal_design_objval (float | None) – the value of the objective function evaluated at the above optimal design.

  • optimization_trajectory (ndarray | Iterable | None) – trajectory (consecutive parameters) of the policy parameters generated along the optimization iterations; populated only if monitor_iterations configuration is set to True.

  • optimization_trajectory_stoch_objval (ndarray | Iterable | None) – value (MC estimate) of the stochastic objective function, that is \(E[J]\) along the optimization trajectory; populated only if monitor_iterations configuration is set to True.

  • optimization_trajectory_best_sample (ndarray | Iterable | None) – the best sample point at each iteration. This is the best sampled design at each iteration. For maximization, it is the sampled design with maximum objective value (among the sample)

  • optimization_trajectory_best_sample_objval (ndarray | Iterable | None) – the objective value corresponding to each design in optimization_trajectory_best_sample.

  • num_resets (int | None) – number of resets during optimization (if set in the configurations). if given it must be be a positive integer indicating the number of times the optimizer is reset to initial_policy after convergence. If either 0 or None is given, num_resets is set to ‘0’ thus not resets are carried out

  • uniform_random_sample (None | list[Any]) – Sample (list of designs/etc.) generated with uniform probability.

  • uniform_random_sample_objval (None | list[float]) – the objective value corresponding to each sample point in uniform_random_sample.

  • reset_at_iterations (list | None) – a list holding the iterations after which solver was reset (if valid value of num_resets is set in the configurations).

  • configurations (None | dict) – a dictionary holding problem configurations. This can be obtained by calling asdict() of the configurations object used by the optimizer, e.g., BinaryReinforceOptimizerConfigs.

policy_distribution: Type[Distribution] | None#
initial_policy_parameter: ndarray | Iterable | None#
converged: bool#
maximize: bool | None#
message: str#
maxiter_reached: bool#
importance_sampling_distribution: Type[Distribution] | None#
sampled_designs: list[list[Iterable]] | None#
sampled_designs_objval: list[list[float]] | None#
optimal_policy_parameter: ndarray | Iterable | None#
optimal_policy_sample: ndarray | Iterable | None#
optimal_policy_sample_objval: ndarray | Iterable | None#
optimal_design: ndarray | Iterable | None#
optimal_design_objval: float | None#
optimization_trajectory: ndarray | Iterable | None#
optimization_trajectory_stoch_objval: ndarray | Iterable | None#
optimization_trajectory_best_sample_objval: ndarray | Iterable | None#
optimization_trajectory_best_sample: ndarray | Iterable | None#
num_resets: int | None#
reset_at_iterations: list | None#
uniform_random_sample: None | list[Any]#
uniform_random_sample_objval: None | list[float]#
configurations: None | dict#
property best_along_the_route#

Return a dictionary with two entries/keys (str):

  • ‘objval’: this key/value entry holds the best objective value seen during the exploration (before collecting the sample from the optimal policy).

  • ‘designs’: a list of lists of of (one or more) designs corresponding to the objective value ‘objval’.

Note

The dictionary will be empty if the optimizer did not monitor iterations.

__init__(*, policy_distribution=None, initial_policy_parameter=None, converged=False, maximize=None, message='', maxiter_reached=False, importance_sampling_distribution=None, sampled_designs=None, sampled_designs_objval=None, optimal_policy_parameter=None, optimal_policy_sample=None, optimal_policy_sample_objval=None, optimal_design=None, optimal_design_objval=None, optimization_trajectory=None, optimization_trajectory_stoch_objval=None, optimization_trajectory_best_sample=None, optimization_trajectory_best_sample_objval=None, num_resets=None, reset_at_iterations=None, uniform_random_sample=None, uniform_random_sample_objval=None, configurations=None)#
class ReinforceOptimizer(configs=None)[source]#

Bases: Optimizer

Implementation of the REINFORCE algorithm for binary optimization (maximization or minimization). The algorithm assumes a generic probabilistic policy and considers optimizing the parameters of this distribution to optimizes a given function/objective that evaluates at samples from this policy.

Note

This class considers optimization of a parametric policy. Thus, the configurations must provide a valid policy_distribution which gives access to:

  1. parameter getter/setter and parameter_manager context manager that sets the distriubtion parameter.

  2. sample generates random variables (actions/designs) from this distribution according to the distribution parameters.

  3. pmf evaluate the probability mass function

  4. log_pmf evaluate the logarithm of the probability mass function

  5. grad_pmf evaluate the gradient of the probability mass function

  6. grad_log_pmf evaluate the gradient of the log-probability mass function

This is an implementation of the REINFORCE algorithm for stochastic binary optimization. The objective is to solve one of the following optimization problems \(\max_{d} E[J(d)]\) or \(\min_{d} E[J(d)]\) based on the value of maximize. The optimization is assumed to be unconstrained and any soft constraints are already absorbed into fun passed in the configurations dictionary.

Parameters:

configs (ReinforceOptimizerConfigs | dict | None) – Configurations of the ReinforceOptimizer, either as an instance of ReinforceOptimizerConfigs, a dictionary, or None. The default is None, in which case, the default configurations provided by ReinforceOptimizerConfigs are used.

__init__(configs=None)[source]#
validate_configurations(configs, raise_for_invalid=True)[source]#

Check the passed configuratios and make sure they are conformable with each other, and with current configurations once combined. This guarantees that any key-value pair passed in configs can be properly used

Note

Here only the locally-defined configurations in ReinforceOptimizerConfigs are validated. Finally, super classes validators are called.

Parameters:
Returns:

flag indicating whether passed configurations dictionary is valid or not

Raises:
Return type:

bool

update_configurations(**kwargs)[source]#

Take any set of keyword arguments, and lookup each in the configurations, and update as nessesary/possible/valid

Raises:

PyOEDConfigsValidationError – if invalid configurations passed

validate_policy_parameter(parameter)[source]#

Given a parameter; validate it as a parameter of the underlying policy distribution. If invalid raise TypeError, otherwise, return a flattened 1d array of the policy parameter.

validate_design(design)[source]#

Given a design/variable/sample; validate it as a sample from the underlying policy. If invalid raise TypeError, otherwise, return a flattened 1d array of the policy sample.

objective_function_value(design)[source]#

A function to prevent redundant computations of the objective (utility function)

Parameters:

design – binary state/design variable to evaluate objective value at

Returns:

value of the objective/utility function at the passed design

Raises:

TypeError if the passed design is not of the right size

stochastic_policy_objective_value(parameter, fun=None, sample=None, sample_size=32)[source]#

The stochastic objective function is the expectation of the configured objective function This is estimated here based on Monte-Carlo (average) approximation.

Here, we define the stochastic objective function \(E(J(x))\) where \(U\) is the deterministic objective fun (utility function) defined in the configurations, and \(x\) is a design variable following a the configured policy distribution with parameter parameter.

Parameters:
  • parameter (numpy.ndarray|iterable) – the policy parameters of the underlying policy distribution

  • fun (callable) – the utility/objective function to calculate the stochastic objective. If None, the fun from the configurations dictionary is used. This argument is provided to enable altering the objective while calculating the derivatives (e.g., by adding a baseline) to reduce variance of the estimator(s).

  • sample (numpy.ndarray) – a sample generated from the passed policy (provided in case samples are generated by other routines).

  • sample_size (int) – number of sample points to generater.

Returns:

value/estimate of the stochastic objective value

Return type:

float

Raises:

TypeError if a sample is provided but has wrong number of columns (must be equal to the dimension of the underlying Bernoulli distribution).

stochastic_policy_objective_gradient(parameter, fun=None, sample=None, sample_objval=None, sample_grad_log_pmf=None, sample_size=None, optimal_baseline=False)[source]#

Evaluate the gradient (approximate using MC sampling) of the stochastic objective function reformulation of the binary optimization problem. The gradient is approximated by employing the kernel trick of the expectation, and by using the gradient-log-probability of the policy distribution under the passed parameter.

Parameters:
  • parameter (numpy.ndarray|iterable) – the policy parameters (success probability of the underlying Bernoulli distribution)

  • fun (callable) – the utility/objective function to calculate the stochastic objective. If None, the fun from the configurations dictionary is used. This argument is provided to enable altering the objective while calculating the derivatives (e.g., by adding a baseline) to reduce variance of the estimator(s).

  • sample (numpy.ndarray) – a sample generated from the passed policy (provided in case samples are generated by other routines).

  • sample_objval – The objective value (fun(x) where fun is the registered function to be optimized) for each x in the passed sample.

  • sample_grad_log_pmf – The gradient of the logarithm of the probability mass function (grad-log-PMF) of the underlying bernoulli (conditional or not) model. function to be optimized) for each x in the passed sample.

  • sample_size (None|int) – number of sample points to generate. Set to the size of the passed sample, otherwise if not an integer, the sample size is loaded from the configurations.

..note::

If a valid sample is passed, then sample_size is discarded. In this case, sample_size is set to the number of sample points in sample. Both sample, and sample_size, are of course discarded if exact is True.

Returns:

gradient of the stochastic objective value (exact or approximate)

Return type:

numpy.ndarray

evaluate_pmf(design, parameter)[source]#

Evaluate the value of the probability mass function (PMF) for a given sampled design.

Parameters:
  • design – binary state/design variable to evaluate objective value at

  • parameter (numpy.ndarray|iterable) – the policy parameters (success probability of the underlying Bernoulli distribution)

evaluate_grad_log_pmf(design, parameter)[source]#

Evaluate the gradient of the log of PMF value for a given sampled design.

Parameters:
  • design – binary state/design variable to evaluate objective value at

  • parameter (numpy.ndarray|iterable) – the policy parameters (success probability of the underlying Bernoulli distribution)

Returns:

the gradient of the log probability associated with the passed design

evaluate_batch_grad_log_pmf(design, parameter, batch_as_column=False)[source]#
evaluate_grad_pmf(design, parameter)[source]#

Evaluate the gradient of the PMF value for a given sampled design.

Parameters:
  • design – binary state/design variable to evaluate objective value at

  • parameter (numpy.ndarray|iterable) – the policy parameters (success probability of the underlying Bernoulli distribution)

Returns:

the gradient of the log probability associated with the passed design

generate_sample(parameter, sample_size=1)[source]#

Generate sample of size sample_size for a given policy parameter (parameter).

Parameters:
  • parameter (numpy.ndarray|iterable) – the policy parameters (success probability of the underlying Bernoulli distribution)

  • sample_size (int) – number of desings/states to sample from the Bernoulli distribution with with the passed parameter parameter

Returns:

  • sample: 2d array of shape (sample_size x dim) where dim is the size of parameter

generate_uniform_random_sample(sample_size, parameter=None)[source]#

Sample the probability space uniformly (assuming 0.5 probabilities).

Parameters:

sample_size (int) – sample size

Returns:

  • uniform_random_sample: a list/array of experimental designs (e.g., actions)

  • uniform_random_sample_objval: 1d array contianing objective values corresponding to sample

optimal_baseline_estimate(parameter, fun=None, sample=None, sample_objval=None, sample_grad_log_pmf=None, sample_size=32, n_batches=None)[source]#

Use the passed sample to calculate a sample based estimate of the optimal baseline value.

Note

Developing a general approach for estimating the optimal baseline. The goal is that the implementation here is valid for any policy distribution, and the user can extend this functionality for more accurate optimal baseline estimate. The estimate is \(E[\widehat{g}^{T} d}] / var(d)\) where \(var\) is the variance (total variance) and \(g\) is the gradient without baseline, and \(d\) is the average of gradient log-probability, that is \(d=\frac{1}{ns} \sum_{k=1}^{ns} \nabla_{p} log Prob(x|p)\) with \(p\) the distribution parameter and \(x\) is the random variable of interest.

Parameters:
  • parameter (numpy.ndarray|iterable) – the policy parameters (success probability of the underlying Bernoulli distribution)

  • fun (callable) – the utility/objective function to calculate the stochastic objective. If None, the fun from the configurations dictionary is used. This argument is provided to enable altering the objective while calculating the derivatives (e.g., by adding a baseline) to reduce variance of the estimator(s).

  • sample – Given sample

  • sample_objval – The objective value (fun(x) where fun is the registered function to be optimized) for each x in the passed sample.

  • sample_grad_log_pmf – The gradient of the logarithm of the probability mass function (grad-log-PMF) of the underlying bernoulli (conditional or not) model. function to be optimized) for each x in the passed sample.

  • sample_size (int) – number of samples to use for the MC approximation. Only used if sample is not passed.

  • n_batches (int) – number of batches of samples to approximate the optimal baselin. Multiple samples are used to calculate both vectors \(d\) and \(g\) and then sample-based average is used for both \(E[d^T]\) and E[D^T g].

Returns:

optimal baseline estimate

Return type:

float

bruteforce_objective_values(fun=None)[source]#

Evaluate the value of the objective (obj) at all possible binary combinations;

Returns:

a dictionary of objective values evaluated for each possible binary instance. The dictionary indexed by unique integer indexes.

Return type:

dict[int,float]

..note::

This function should be used ONLY for testing in small-dimensional problems.

solve(initial_policy_parameter=None)[source]#

Start solving the binary optimization problem using the stochastic formulatioon.

Parameters:

initial_policy_parameterNone or ``iterable` to be used as the initial policy parameter of the underlying Bernoulli distribution

Returns:

an instance of ReinforceOptimizerResults.

Raises:

TypeError if the initial policy has wrong size or any of the values fall outside the interval [0, 1]

Note

Steps of the algorithm are as follows:

  1. Start with the initial state

  2. Calculate the stochastic gradient with/without baseline

  3. Optionally, calculate Hessian (approximation)

  4. Take a step in the direction of:

  • -stepsize gradient ; for minimization (maximize is False)

  • +stepsize` gradient ; for maximization (maximize is True)

multi_solve(initial_policy_parameter=None, num_solves=10)[source]#

A wrapper around sovle() which re-solves the optimization problem multiple times dectated by num_solves and returns a list of dictionaries holding results for each time the problem is solved.

..note::

This function only iterates over solve() This enables empirical verification of the robustness of the algorithm

Parameters:
  • initial_policy_parameter – the initial policy parameter of the underlying Bernoulli distribution. This is the start solution of each of the num_solves. If None, the one in the configurations is used.

  • num_solves (int) – number of times the optimization problem is solved

Returns:

A list with each entry being a dictionary holding results returned by solve().

plot_results(results, overwrite=False, bruteforce=False, fun_symbol='U', output_dir=None, keep_plots=False, fontsize=22, line_width=3, usetex=True, show_axis_grids=True, axis_grids_alpha=(0.25, 0.4), plots_format='pdf', dpi=600, **kwargs)[source]#

Given the results returned by solve() (or partial results saved through successive iterations), visualize optimization results. This function, creates the following plots:

  1. A plot showing the value of the objective function of sampled designs/realizations from the underlying policy. This shows the mean and the uncertainty envelop over successive iterations.

  2. A box plot similar to the first plot, but instead of showing uncertainty envelop box plots are used so that outliers can be detected.

Note

Adding additional kwargs to enable high flexibility in calling this method.

Parameters:
  • results (dict | ReinforceOptimizerResults) – results returned by solve() (or a dictionary created from it, e.g., by calling ReinforceOptimizerResults.asdict())

  • bruteforce (bool) –

    create bruteforce plot(s) (if accessible through the policy_distribution)

    Warning

    This can take forever if the cardinality (number of possiblity) is large. Be careful when to set this to True as it is only meant for testing in very small systems.

  • fun_symbol (str) – symbol to use for labeling the objective function.

  • output_dir – location to save plots to; if None the current working directory is used

  • keep_plots (bool) – if True, the created figures will not be closed. This can be useful, e.g., for displaying results in notebooks

  • fontsize (int) – font size for plots

  • usetex (bool) – use LaTeX for displaying equations on plots (requires LaTeX support)

  • plots_format (str) – plots format/extension

  • dpi – depth per inch

Returns:

a dictionary with two entries/keys:

  • ’figures’ a dictionary indexed by plot identifier with value set to the corresponding figure handle (or None if keep_plots is set to False)

  • ’stats’: a dictionary holding statistics including number of function evaluations, and percentage of space explored.

Raises:

TypeError if results is invalid

property size#

Dimension/size of the optimziation variable.

property parameter_size#

Dimension/size of the optimziation variable.

property policy_distribution#
property objective_value_tracker#