Policy Optimization#
This module provides generic implementations of policy optimization methods those can be used for any probabilistic policy model.
|
Configuration dataclass for the |
|
Implementation of the REINFORCE algorithm for binary optimization (maximization or minimization). |
Standard gradient-based policy optimization algorithms.
- class ReinforceOptimizerConfigs(*, debug=False, verbose=False, output_dir='./_PYOED_RESULTS_', size=None, name='ReinforceOptimizer: Gradient-Based Stochastic Policy Optimization', screen_output_iter=1, file_output_iter=100, policy_distribution=None, policy_parameter_bounds=None, policy_parameter_projector=None, fun=None, initial_policy_parameter=None, maximize=None, def_stepsize=0.1, decay_step=False, stochastic_gradient_sample_size=32, importance_sampling=False, importance_sampling_distribution=None, baseline='none', optimal_baseline_sample_size=32, optimal_baseline_n_batches=1, fresh_sample_for_baseline=False, optimal_sample_size=5, maxiter=10000, pgtol=1e-12, monitor_iterations=True, random_seed=None, num_resets=0, uniform_random_sample_size=0)[source]#
Bases:
OptimizerConfigsConfiguration dataclass for the
ReinforceOptimizer.- Parameters:
verbose (bool) – a boolean flag to control verbosity of the object.
debug (bool) – a boolean flag that enables adding extra functionality in a debug mode
output_dir (str | Path) – the base directory where the output files will be saved.
name (str) – name of the optimizer. Default is None.
size (int | None) – dimension of the target/solution variable. Not all optimizers use this, but it is added for unification.
screen_output_iter (int) – iteration interval for screen output. Default is 1. Note that this should be a positive integer to enforce proper effect.
file_output_iter (int) – iteration interval for file output. Default is 1. Note that this should be a positive integer to enforce proper effect.
policy_distribution (Distribution | None) – a probabilistic distribution of the stochastic policy. This is the distribution of the random variable (action). The optimizer aims to optimize the parameter of this distribution.
policy_parameter_bounds (Sequence | None) – bounds (if any) on the policy parameters (e.g., box constraints)
projector – optional projection operator for the gradient. If passed, this need to be a callable that take as input a policy parameter vector, and returns another version with projected components. This projector is passed a parameter (after applying any bounds (if any)).
fun (Callable | None) – the objective function to be optimization. For a given state/design of size
sizeit returns a scalar.initial_policy_parameter (ndarray | Iterable | None) – the initial probabilistic policy parameter. If not passed, the parameters in the policy distribution are used.
maximize (bool | None) – if
Truethe objective is to be maximized, otherwise minimization is assumeddef_stepsize (float) – default step size (learning rate) in the direction (or the opposite) of the gradient. Default is 1.
decay_step (bool) – if
Truereduce the learning rate (step size) as the algorithm proceedsstochastic_gradient_sample_size (int) – size of the sample to generate (from current policy at each iteration) to estimate (using MC) the policy gradient
importance_sampling (bool) – if
Trueapply importance sampling to minimize variability of the stochastic gradient.importance_sampling_distribution (Distribution | None) – if
importance_samplingisTrue, this must be a valid distribution that is used for sampling and evaluating PMF against underlying target policy distribution.baseline (str | None) –
name/type of the baseline used to reduce variability of the stochastic estimate of the gradient to use (if any); possible values are
None, ‘none’, ‘optimal’, ‘heuristic’. These are explained as follows:’none’ |
None: no baseline is used.’optimal’: a constant baseline is found by approximating the optimal baseline where optimality is defined as minimum variability of the estimator.
optimal_baseline_sample_size (int) – size of the batch/sample to use for optimal baseline estimation; used if baseline==True.
optimal_baseline_n_batches (int) – number of batches for optimal baseline estimation. Used only if baseline==True.
fresh_sample_for_baseline (bool) – if
Truegenerate new sample(s) for estimating the optimal baseling instesad of reusing the same sample used for estimating the gradient (the stochastic estimate).optimal_sample_size (int) – size of the sample to return from the final/optimal policy upon termination of the optimizer
num_resets (None | int) – number of resets during optimization
uniform_random_sample_size (int) – size of a random sample generated uniformly (e.g., with 0.5 probability) for performance evaluation.
- policy_distribution: Distribution | None#
- importance_sampling_distribution: Distribution | None#
- __init__(*, debug=False, verbose=False, output_dir='./_PYOED_RESULTS_', size=None, name='ReinforceOptimizer: Gradient-Based Stochastic Policy Optimization', screen_output_iter=1, file_output_iter=100, policy_distribution=None, policy_parameter_bounds=None, policy_parameter_projector=None, fun=None, initial_policy_parameter=None, maximize=None, def_stepsize=0.1, decay_step=False, stochastic_gradient_sample_size=32, importance_sampling=False, importance_sampling_distribution=None, baseline='none', optimal_baseline_sample_size=32, optimal_baseline_n_batches=1, fresh_sample_for_baseline=False, optimal_sample_size=5, maxiter=10000, pgtol=1e-12, monitor_iterations=True, random_seed=None, num_resets=0, uniform_random_sample_size=0)#
- class ReinforceOptimizerResults(*, policy_distribution=None, initial_policy_parameter=None, converged=False, maximize=None, message='', maxiter_reached=False, importance_sampling_distribution=None, sampled_designs=None, sampled_designs_objval=None, optimal_policy_parameter=None, optimal_policy_sample=None, optimal_policy_sample_objval=None, optimal_design=None, optimal_design_objval=None, optimization_trajectory=None, optimization_trajectory_stoch_objval=None, optimization_trajectory_best_sample=None, optimization_trajectory_best_sample_objval=None, num_resets=None, reset_at_iterations=None, uniform_random_sample=None, uniform_random_sample_objval=None, configurations=None)[source]#
Bases:
OptimizerResultsContainer for results/data generated by the
ReinforceOptimizer.- Parameters:
policy_distribution (Type[Distribution] | None) – the class used to create the policy distribution.
initial_policy_parameter (ndarray | Iterable | None) – the initial probabilistic policy parameter. If not passed, the parameters in the policy distribution are used.
maximize (bool | None) – if
Truethe objective is to be maximized, otherwise minimization is assumed. IfNone, it is not setconverged (bool) –
Trueif the algorithm converged,Falseotherwise.message (str) – a string message describing the final status of the algorithm
maxiter_reached (bool) – whether the maximum number of iterations of the algorithm has been reached/exceeded.
sampled_designs (list[list[Iterable]] | None) – a list (of lists) containing all sampled design variables (samples) for each iteration of the algorithm. The length of this list is equal to the number of iterations taken by the algorithm.
sampled_designs_objval (list[list[float]] | None) – a list (of lists) of the objective value of each design variable/sample in
sampled_designs.optimal_policy – the parameters of the optimal policy (probability of activation of each entry of the design)
optimal_policy_sample (ndarray | Iterable | None) – design sample generated by sampling the optimal policy (Bernoulli samples with probababilities set to the optimal policy parameters
optimal_policy_parameters)optimal_policy_sample_objval (ndarray | Iterable | None) – value of the objective function \(J\) evaluated at each design in
optimal_sampleoptimal_design (ndarray | Iterable | None) – the optimal design (binary vector) sampled from the
optimal_policy.optimal_design_objval (float | None) – the value of the objective function evaluated at the above optimal design.
optimization_trajectory (ndarray | Iterable | None) – trajectory (consecutive parameters) of the policy parameters generated along the optimization iterations; populated only if
monitor_iterationsconfiguration is set toTrue.optimization_trajectory_stoch_objval (ndarray | Iterable | None) – value (MC estimate) of the stochastic objective function, that is \(E[J]\) along the optimization trajectory; populated only if
monitor_iterationsconfiguration is set toTrue.optimization_trajectory_best_sample (ndarray | Iterable | None) – the best sample point at each iteration. This is the best sampled design at each iteration. For maximization, it is the sampled design with maximum objective value (among the sample)
optimization_trajectory_best_sample_objval (ndarray | Iterable | None) – the objective value corresponding to each design in
optimization_trajectory_best_sample.num_resets (int | None) – number of resets during optimization (if set in the configurations). if given it must be be a positive integer indicating the number of times the optimizer is reset to
initial_policyafter convergence. If either 0 orNoneis given,num_resetsis set to ‘0’ thus not resets are carried outuniform_random_sample (None | list[Any]) – Sample (list of designs/etc.) generated with uniform probability.
uniform_random_sample_objval (None | list[float]) – the objective value corresponding to each sample point in
uniform_random_sample.reset_at_iterations (list | None) – a list holding the iterations after which solver was reset (if valid value of
num_resetsis set in the configurations).configurations (None | dict) – a dictionary holding problem configurations. This can be obtained by calling
asdict()of the configurations object used by the optimizer, e.g.,BinaryReinforceOptimizerConfigs.
- policy_distribution: Type[Distribution] | None#
- importance_sampling_distribution: Type[Distribution] | None#
- property best_along_the_route#
Return a dictionary with two entries/keys (str):
‘objval’: this key/value entry holds the best objective value seen during the exploration (before collecting the sample from the optimal policy).
‘designs’: a list of lists of of (one or more) designs corresponding to the objective value ‘objval’.
Note
The dictionary will be empty if the optimizer did not monitor iterations.
- __init__(*, policy_distribution=None, initial_policy_parameter=None, converged=False, maximize=None, message='', maxiter_reached=False, importance_sampling_distribution=None, sampled_designs=None, sampled_designs_objval=None, optimal_policy_parameter=None, optimal_policy_sample=None, optimal_policy_sample_objval=None, optimal_design=None, optimal_design_objval=None, optimization_trajectory=None, optimization_trajectory_stoch_objval=None, optimization_trajectory_best_sample=None, optimization_trajectory_best_sample_objval=None, num_resets=None, reset_at_iterations=None, uniform_random_sample=None, uniform_random_sample_objval=None, configurations=None)#
- class ReinforceOptimizer(configs=None)[source]#
Bases:
OptimizerImplementation of the REINFORCE algorithm for binary optimization (maximization or minimization). The algorithm assumes a generic probabilistic policy and considers optimizing the parameters of this distribution to optimizes a given function/objective that evaluates at samples from this policy.
Note
This class considers optimization of a parametric policy. Thus, the configurations must provide a valid policy_distribution which gives access to:
parameter getter/setter and parameter_manager context manager that sets the distriubtion parameter.
sample generates random variables (actions/designs) from this distribution according to the distribution parameters.
pmf evaluate the probability mass function
log_pmf evaluate the logarithm of the probability mass function
grad_pmf evaluate the gradient of the probability mass function
grad_log_pmf evaluate the gradient of the log-probability mass function
This is an implementation of the REINFORCE algorithm for stochastic binary optimization. The objective is to solve one of the following optimization problems \(\max_{d} E[J(d)]\) or \(\min_{d} E[J(d)]\) based on the value of maximize. The optimization is assumed to be unconstrained and any soft constraints are already absorbed into fun passed in the configurations dictionary.
- Parameters:
configs (ReinforceOptimizerConfigs | dict | None) – Configurations of the
ReinforceOptimizer, either as an instance ofReinforceOptimizerConfigs, a dictionary, or None. The default is None, in which case, the default configurations provided byReinforceOptimizerConfigsare used.
- validate_configurations(configs, raise_for_invalid=True)[source]#
Check the passed configuratios and make sure they are conformable with each other, and with current configurations once combined. This guarantees that any key-value pair passed in configs can be properly used
Note
Here only the locally-defined configurations in
ReinforceOptimizerConfigsare validated. Finally, super classes validators are called.- Parameters:
configs (dict | ReinforceOptimizerConfigs) – full or partial (subset) configurations to be validated
raise_for_invalid (bool) – if True raise
TypeErrorfor invalid configrations type/key. Default True
- Returns:
flag indicating whether passed configurations dictionary is valid or not
- Raises:
AttributeError – if any (or a group) of the configurations does not exist in the model configurations
ReinforceOptimizerConfigs.PyOEDConfigsValidationError – if the configurations are invalid and raise_for_invalid is set to True.
- Return type:
- update_configurations(**kwargs)[source]#
Take any set of keyword arguments, and lookup each in the configurations, and update as nessesary/possible/valid
- Raises:
PyOEDConfigsValidationError – if invalid configurations passed
- validate_policy_parameter(parameter)[source]#
Given a parameter; validate it as a parameter of the underlying policy distribution. If invalid raise TypeError, otherwise, return a flattened 1d array of the policy parameter.
- validate_design(design)[source]#
Given a design/variable/sample; validate it as a sample from the underlying policy. If invalid raise TypeError, otherwise, return a flattened 1d array of the policy sample.
- objective_function_value(design)[source]#
A function to prevent redundant computations of the objective (utility function)
- Parameters:
design – binary state/design variable to evaluate objective value at
- Returns:
value of the objective/utility function at the passed design
- Raises:
TypeErrorif the passed design is not of the right size
- stochastic_policy_objective_value(parameter, fun=None, sample=None, sample_size=32)[source]#
The stochastic objective function is the expectation of the configured objective function This is estimated here based on Monte-Carlo (average) approximation.
Here, we define the stochastic objective function \(E(J(x))\) where \(U\) is the deterministic objective fun (utility function) defined in the configurations, and \(x\) is a design variable following a the configured policy distribution with parameter parameter.
- Parameters:
parameter (numpy.ndarray|iterable) – the policy parameters of the underlying policy distribution
fun (callable) – the utility/objective function to calculate the stochastic objective. If None, the fun from the configurations dictionary is used. This argument is provided to enable altering the objective while calculating the derivatives (e.g., by adding a baseline) to reduce variance of the estimator(s).
sample (numpy.ndarray) – a sample generated from the passed policy (provided in case samples are generated by other routines).
sample_size (int) – number of sample points to generater.
- Returns:
value/estimate of the stochastic objective value
- Return type:
- Raises:
TypeErrorif a sample is provided but has wrong number of columns (must be equal to the dimension of the underlying Bernoulli distribution).
- stochastic_policy_objective_gradient(parameter, fun=None, sample=None, sample_objval=None, sample_grad_log_pmf=None, sample_size=None, optimal_baseline=False)[source]#
Evaluate the gradient (approximate using MC sampling) of the stochastic objective function reformulation of the binary optimization problem. The gradient is approximated by employing the kernel trick of the expectation, and by using the gradient-log-probability of the policy distribution under the passed parameter.
- Parameters:
parameter (numpy.ndarray|iterable) – the policy parameters (success probability of the underlying Bernoulli distribution)
fun (callable) – the utility/objective function to calculate the stochastic objective. If None, the fun from the configurations dictionary is used. This argument is provided to enable altering the objective while calculating the derivatives (e.g., by adding a baseline) to reduce variance of the estimator(s).
sample (numpy.ndarray) – a sample generated from the passed policy (provided in case samples are generated by other routines).
sample_objval – The objective value (fun(x) where fun is the registered function to be optimized) for each x in the passed sample.
sample_grad_log_pmf – The gradient of the logarithm of the probability mass function (grad-log-PMF) of the underlying bernoulli (conditional or not) model. function to be optimized) for each x in the passed sample.
sample_size (None|int) – number of sample points to generate. Set to the size of the passed sample, otherwise if not an integer, the sample size is loaded from the configurations.
- ..note::
If a valid sample is passed, then sample_size is discarded. In this case, sample_size is set to the number of sample points in sample. Both sample, and sample_size, are of course discarded if exact is True.
- Returns:
gradient of the stochastic objective value (exact or approximate)
- Return type:
- evaluate_pmf(design, parameter)[source]#
Evaluate the value of the probability mass function (PMF) for a given sampled design.
- Parameters:
design – binary state/design variable to evaluate objective value at
parameter (numpy.ndarray|iterable) – the policy parameters (success probability of the underlying Bernoulli distribution)
- evaluate_grad_log_pmf(design, parameter)[source]#
Evaluate the gradient of the log of PMF value for a given sampled design.
- Parameters:
design – binary state/design variable to evaluate objective value at
parameter (numpy.ndarray|iterable) – the policy parameters (success probability of the underlying Bernoulli distribution)
- Returns:
the gradient of the log probability associated with the passed design
- evaluate_grad_pmf(design, parameter)[source]#
Evaluate the gradient of the PMF value for a given sampled design.
- Parameters:
design – binary state/design variable to evaluate objective value at
parameter (numpy.ndarray|iterable) – the policy parameters (success probability of the underlying Bernoulli distribution)
- Returns:
the gradient of the log probability associated with the passed design
- generate_sample(parameter, sample_size=1)[source]#
Generate sample of size sample_size for a given policy parameter (parameter).
- Parameters:
parameter (numpy.ndarray|iterable) – the policy parameters (success probability of the underlying Bernoulli distribution)
sample_size (int) – number of desings/states to sample from the Bernoulli distribution with with the passed parameter parameter
- Returns:
sample: 2d array of shape
(sample_size x dim)wheredimis the size of parameter
- generate_uniform_random_sample(sample_size, parameter=None)[source]#
Sample the probability space uniformly (assuming 0.5 probabilities).
- Parameters:
sample_size (int) – sample size
- Returns:
uniform_random_sample: a list/array of experimental designs (e.g., actions)
uniform_random_sample_objval: 1d array contianing objective values corresponding to sample
- optimal_baseline_estimate(parameter, fun=None, sample=None, sample_objval=None, sample_grad_log_pmf=None, sample_size=32, n_batches=None)[source]#
Use the passed sample to calculate a sample based estimate of the optimal baseline value.
Note
Developing a general approach for estimating the optimal baseline. The goal is that the implementation here is valid for any policy distribution, and the user can extend this functionality for more accurate optimal baseline estimate. The estimate is \(E[\widehat{g}^{T} d}] / var(d)\) where \(var\) is the variance (total variance) and \(g\) is the gradient without baseline, and \(d\) is the average of gradient log-probability, that is \(d=\frac{1}{ns} \sum_{k=1}^{ns} \nabla_{p} log Prob(x|p)\) with \(p\) the distribution parameter and \(x\) is the random variable of interest.
- Parameters:
parameter (numpy.ndarray|iterable) – the policy parameters (success probability of the underlying Bernoulli distribution)
fun (callable) – the utility/objective function to calculate the stochastic objective. If None, the fun from the configurations dictionary is used. This argument is provided to enable altering the objective while calculating the derivatives (e.g., by adding a baseline) to reduce variance of the estimator(s).
sample – Given sample
sample_objval – The objective value (fun(x) where fun is the registered function to be optimized) for each x in the passed sample.
sample_grad_log_pmf – The gradient of the logarithm of the probability mass function (grad-log-PMF) of the underlying bernoulli (conditional or not) model. function to be optimized) for each x in the passed sample.
sample_size (int) – number of samples to use for the MC approximation. Only used if sample is not passed.
n_batches (int) – number of batches of samples to approximate the optimal baselin. Multiple samples are used to calculate both vectors \(d\) and \(g\) and then sample-based average is used for both \(E[d^T]\) and E[D^T g].
- Returns:
optimal baseline estimate
- Return type:
float
- bruteforce_objective_values(fun=None)[source]#
Evaluate the value of the objective (obj) at all possible binary combinations;
- Returns:
a dictionary of objective values evaluated for each possible binary instance. The dictionary indexed by unique integer indexes.
- Return type:
dict[int,float]
- ..note::
This function should be used ONLY for testing in small-dimensional problems.
- solve(initial_policy_parameter=None)[source]#
Start solving the binary optimization problem using the stochastic formulatioon.
- Parameters:
initial_policy_parameter – None or ``iterable` to be used as the initial policy parameter of the underlying Bernoulli distribution
- Returns:
an instance of ReinforceOptimizerResults.
- Raises:
TypeErrorif the initial policy has wrong size or any of the values fall outside the interval [0, 1]
Note
Steps of the algorithm are as follows:
Start with the initial state
Calculate the stochastic gradient with/without baseline
Optionally, calculate Hessian (approximation)
Take a step in the direction of:
-stepsize gradient ; for minimization (maximize is False)
+stepsize` gradient ; for maximization (maximize is True)
- multi_solve(initial_policy_parameter=None, num_solves=10)[source]#
A wrapper around
sovle()which re-solves the optimization problem multiple times dectated by num_solves and returns a list of dictionaries holding results for each time the problem is solved.- ..note::
This function only iterates over
solve()This enables empirical verification of the robustness of the algorithm
- Parameters:
initial_policy_parameter – the initial policy parameter of the underlying Bernoulli distribution. This is the start solution of each of the num_solves. If None, the one in the configurations is used.
num_solves (int) – number of times the optimization problem is solved
- Returns:
A list with each entry being a dictionary holding results returned by
solve().
- plot_results(results, overwrite=False, bruteforce=False, fun_symbol='U', output_dir=None, keep_plots=False, fontsize=22, line_width=3, usetex=True, show_axis_grids=True, axis_grids_alpha=(0.25, 0.4), plots_format='pdf', dpi=600, **kwargs)[source]#
Given the results returned by
solve()(or partial results saved through successive iterations), visualize optimization results. This function, creates the following plots:A plot showing the value of the objective function of sampled designs/realizations from the underlying policy. This shows the mean and the uncertainty envelop over successive iterations.
A box plot similar to the first plot, but instead of showing uncertainty envelop box plots are used so that outliers can be detected.
Note
Adding additional kwargs to enable high flexibility in calling this method.
- Parameters:
results (dict | ReinforceOptimizerResults) – results returned by
solve()(or a dictionary created from it, e.g., by callingReinforceOptimizerResults.asdict())bruteforce (bool) –
create bruteforce plot(s) (if accessible through the policy_distribution)
Warning
This can take forever if the cardinality (number of possiblity) is large. Be careful when to set this to
Trueas it is only meant for testing in very small systems.fun_symbol (str) – symbol to use for labeling the objective function.
output_dir – location to save plots to; if
Nonethe current working directory is usedkeep_plots (bool) – if
True, the created figures will not be closed. This can be useful, e.g., for displaying results in notebooksfontsize (int) – font size for plots
usetex (bool) – use LaTeX for displaying equations on plots (requires LaTeX support)
plots_format (str) – plots format/extension
dpi – depth per inch
- Returns:
a dictionary with two entries/keys:
’figures’ a dictionary indexed by plot identifier with value set to the corresponding figure handle (or None if keep_plots is set to False)
’stats’: a dictionary holding statistics including number of function evaluations, and percentage of space explored.
- Raises:
TypeErrorifresultsis invalid
- property size#
Dimension/size of the optimziation variable.
- property parameter_size#
Dimension/size of the optimziation variable.
- property policy_distribution#
- property objective_value_tracker#