Policies

Model Based Base Policy

class blackbox_mpc.policies.ModelBasedBasePolicy(trajectory_evaluator)

__init__(trajectory_evaluator)

Base class for model-based policies that control the agent.

Parameters: trajectory_evaluator (EvaluatorBase) – The trajectory evaluator that the optimizer uses to evaluate trajectories.

__weakref__: list of weak references to the object (if defined)

act(observations, t, exploration_noise=False)

Returns the action to execute at the current time step. Call it once per time step.

Parameters

observations (tf.float32) – Defines the current observations received from the environment.
t (tf.float32) – Defines the current timestep.
exploration_noise (bool) – Defines if exploration noise should be added to the action to be executed.

Returns

action (tf.float32) – The action to be executed by each runner (dims = runners × dim_U).
next_observations (tf.float32) – The next observations, predicted with the dynamics function learned so far.
rewards_of_next_state (tf.float32) – The reward predicted for executing the action, computed from the predicted observations.

reset(): Resets the policy. Call it at the beginning of each episode.
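
The calling convention described above (reset once per episode, then act once per time step) can be sketched as follows. This is a minimal illustration in plain Python, not the library's implementation: ConstantPolicy and run_episode are hypothetical names, and nested lists stand in for the tf.float32 tensors the real classes use.

```python
class ConstantPolicy:
    """Hypothetical stand-in that mimics the documented act/reset interface."""

    def __init__(self, num_runners, dim_u):
        self.num_runners = num_runners
        self.dim_u = dim_u
        self.steps_since_reset = 0

    def reset(self):
        # Called at the beginning of every episode to clear per-episode state.
        self.steps_since_reset = 0

    def act(self, observations, t, exploration_noise=False):
        # Returns the documented triple: actions (runners x dim_U),
        # predicted next observations, and predicted rewards.
        self.steps_since_reset += 1
        actions = [[0.0] * self.dim_u for _ in range(self.num_runners)]
        next_observations = [list(obs) for obs in observations]
        rewards = [0.0] * self.num_runners
        return actions, next_observations, rewards


def run_episode(policy, initial_observations, horizon):
    """Drive a policy for one episode: reset once, then act at every step."""
    policy.reset()
    observations = initial_observations
    trajectory = []
    for t in range(horizon):
        actions, predicted_next, predicted_rewards = policy.act(observations, t)
        trajectory.append((actions, predicted_rewards))
        # A real loop would step the environment here; this sketch simply
        # feeds the policy's own prediction back in.
        observations = predicted_next
    return trajectory
```

Keeping reset outside the step loop makes it explicit that per-episode state is cleared exactly once, before the first call to act.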

Model Predictive Control Policy

class blackbox_mpc.policies.MPCPolicy(trajectory_evaluator=None, optimizer=None, tf_writer=None, log_dir=None, reward_function=None, env_action_space=None, env_observation_space=None, dynamics_function=None, dynamics_handler=None, true_model=False, optimizer_name=None, num_agents=None, save_model_frequency=1, saved_model_dir=None, **optimizer_args)

__init__(trajectory_evaluator=None, optimizer=None, tf_writer=None, log_dir=None, reward_function=None, env_action_space=None, env_observation_space=None, dynamics_function=None, dynamics_handler=None, true_model=False, optimizer_name=None, num_agents=None, save_model_frequency=1, saved_model_dir=None, **optimizer_args)

The model predictive control (MPC) policy for controlling the agent.

Parameters

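trajectory_evaluator (EvaluatorBase) – The trajectory evaluator that the optimizer uses to evaluate trajectories.
optimizer (OptimizerBaseClass) – The optimizer to use; it searches for the best action sequence and returns its first action.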
tf_writer (tf.summary) – Tensorflow writer to be used in logging the data.
optimizer_name (str) – Name of the optimizer, one of [‘CEM’, ‘CMA-ES’, ‘PI2’, ‘RandomSearch’, ‘PSO’, ‘SPSA’].
env_action_space (gym.ActionSpace) – Defines the action space of the gym environment.
env_observation_space (gym.ObservationSpace) – Defines the observation space of the gym environment.
dynamics_function (DeterministicDynamicsFunctionBaseClass) – Defines the system dynamics function.
dynamics_handler (SystemDynamicsHandler) – Handles the state, action, and target processing functions as well as the dynamics function.
reward_function (tf_function) – Defines the reward function with the prototype tf_func_name(current_state, current_actions, next_state), where current_state is Batch × dim_S, next_state is Batch × dim_S, and current_actions is Batch × dim_U.
true_model (bool) – Whether the dynamics function is the true model of the system.
log_dir (string) – Defines the log directory to save the normalization statistics in.
num_agents (tf.int32) – Defines the number of runners running in parallel.
saved_model_dir (string) – The directory in which the model is saved; used when loading a saved model.
save_model_frequency (int) – Defines how often the model should be saved (relative to the number of refining iterations).
optimizer_args (args) – other arguments specific to the optimizer.
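
The reward_function prototype above takes batched inputs. A minimal sketch of a function with that shape is shown below, written in plain Python so that it runs without TensorFlow. The quadratic cost, the name quadratic_reward, and the choice to return one reward per batch row are assumptions made for illustration; the library itself expects a TensorFlow function operating on tensors.

```python
def quadratic_reward(current_state, current_actions, next_state):
    """Hypothetical reward: penalize the distance of the next state from the
    origin and the magnitude of the action.

    current_state:   Batch x dim_S (list of lists, unused by this toy reward)
    current_actions: Batch x dim_U (list of lists)
    next_state:      Batch x dim_S (list of lists)
    Returns a list with one reward per batch row (an assumption).
    """
    rewards = []
    for actions_row, next_row in zip(current_actions, next_state):
        state_cost = sum(s * s for s in next_row)
        action_cost = 0.1 * sum(u * u for u in actions_row)
        rewards.append(-(state_cost + action_cost))
    return rewards
```

All three arguments are kept in the signature even though this toy reward ignores current_state, so that the function matches the documented prototype.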

act(observations, t, exploration_noise=False)

Returns the action to execute at the current time step. Call it once per time step.

Parameters

observations (tf.float32) – Defines the current observations received from the environment.
t (tf.float32) – Defines the current timestep.
exploration_noise (bool) – Defines if exploration noise should be added to the action to be executed.

Returns

action (tf.float32) – The action to be executed by each runner (dims = runners × dim_U).
next_observations (tf.float32) – The next observations, predicted with the dynamics function learned so far.
rewards_of_next_state (tf.float32) – The reward predicted for executing the action, computed from the predicted observations.
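
To illustrate conceptually what a model predictive control step computes, the sketch below performs one receding-horizon step with plain random search: it samples candidate action sequences, rolls each one through a dynamics function, scores it with a reward function, and returns only the first action of the best sequence together with the predicted next observation and reward. It handles a single runner with a scalar state and action. The names toy_dynamics, toy_reward, and mpc_act are hypothetical, and the library's optimizers and tensor shapes differ.

```python
import random


def toy_dynamics(state, action):
    # Hypothetical 1-D system: the action nudges the state directly.
    return state + action


def toy_reward(state, action, next_state):
    # Hypothetical reward: stay close to the origin.
    return -(next_state ** 2)


def mpc_act(state, horizon=5, num_candidates=200, rng=None):
    """One receding-horizon step using random search over action sequences."""
    if rng is None:
        rng = random.Random(0)
    best_return = float("-inf")
    best_sequence = None
    for _ in range(num_candidates):
        sequence = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        total, s = 0.0, state
        for action in sequence:
            s_next = toy_dynamics(s, action)
            total += toy_reward(s, action, s_next)
            s = s_next
        if total > best_return:
            best_return, best_sequence = total, sequence
    # Only the first action of the best sequence is executed.
    action = best_sequence[0]
    next_observation = toy_dynamics(state, action)
    predicted_reward = toy_reward(state, action, next_observation)
    return action, next_observation, predicted_reward
```

Re-planning at every step and discarding the rest of the sequence is what lets the controller correct for model error as new observations arrive.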

reset(): Resets the policy. Call it at the beginning of each episode.

switch_optimizer(optimizer=None, optimizer_name='', **optimizer_args)

Switches the optimizer used by the model predictive control policy.

Parameters

optimizer (OptimizerBaseClass) – The optimizer to use; it searches for the best action sequence and returns its first action.
optimizer_name (str) – Name of the optimizer, one of [‘CEM’, ‘CMA-ES’, ‘PI2’, ‘RandomSearch’, ‘PSO’, ‘SPSA’].
optimizer_args (args) – other arguments specific to the optimizer.
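
The two ways of selecting an optimizer (passing an instance, or passing a name plus keyword arguments) suggest a dispatch along the following lines. This is a hedged sketch, not the library's code: resolve_optimizer, OPTIMIZER_FACTORIES, and the dictionaries standing in for optimizer objects are hypothetical, and only the six names come from the parameter list above. Letting an explicit instance take precedence over a name is also an assumption.

```python
SUPPORTED_OPTIMIZERS = ("CEM", "CMA-ES", "PI2", "RandomSearch", "PSO", "SPSA")


def make_placeholder(name):
    # Hypothetical factory: returns a dict in place of a real optimizer object.
    def factory(**optimizer_args):
        return {"name": name, "args": dict(optimizer_args)}
    return factory


OPTIMIZER_FACTORIES = {name: make_placeholder(name) for name in SUPPORTED_OPTIMIZERS}


def resolve_optimizer(optimizer=None, optimizer_name="", **optimizer_args):
    """Return the optimizer to use: an explicit instance wins (an assumption),
    otherwise one is built from its name and keyword arguments."""
    if optimizer is not None:
        return optimizer
    if optimizer_name not in OPTIMIZER_FACTORIES:
        raise ValueError(
            f"Unknown optimizer {optimizer_name!r}; expected one of {SUPPORTED_OPTIMIZERS}"
        )
    return OPTIMIZER_FACTORIES[optimizer_name](**optimizer_args)
```

For example, resolve_optimizer(optimizer_name="CEM", population_size=64) returns the placeholder for CEM with its arguments attached; population_size is an illustrative argument name, not a documented one.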

Model Free Base Policy

class blackbox_mpc.policies.ModelFreeBasePolicy

__init__(): Base class for model-free policies that control the agent.

__weakref__: list of weak references to the object (if defined)

act(observations, t, exploration_noise=False)

Returns the action to execute at the current time step. Call it once per time step.

Parameters

observations (tf.float32) – Defines the current observations received from the environment.
t (tf.float32) – Defines the current timestep.
exploration_noise (bool) – Defines if exploration noise should be added to the action that will be executed.

Returns

action (tf.float32) – The action to be executed by each runner (dims = runners × dim_U).

reset(): Resets the policy. Call it at the beginning of each episode.

Random Policy

class blackbox_mpc.policies.RandomPolicy(number_of_agents, env_action_space)

__init__(number_of_agents, env_action_space)

The random policy for controlling the agent.

Parameters

env_action_space (gym.ActionSpace) – Defines the action space of the gym environment.
number_of_agents (tf.int32) – Defines the number of runners running in parallel.

act(observations, t, exploration_noise=False)

Returns the action to execute at the current time step. Call it once per time step.

Parameters

observations (tf.float32) – Defines the current observations received from the environment.
t (tf.float32) – Defines the current timestep.
exploration_noise (bool) – Defines if exploration noise should be added to the action to be executed.

Returns

action (tf.float32) – The action to be executed by each runner (dims = runners × dim_U).

reset(): Resets the policy. Call it at the beginning of each episode.
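
A random policy of this kind can be sketched as sampling each action component uniformly within the action-space bounds, once per runner. The version below uses Python's random module and plain lists. The name sample_random_actions and the low and high arguments are hypothetical stand-ins for the bounds of a gym action space, and the uniform distribution is an assumption about how actions are drawn.

```python
import random


def sample_random_actions(number_of_agents, low, high, rng=None):
    """Sample one action per runner, uniformly within per-dimension bounds.

    low, high: lists of length dim_U giving the action bounds (hypothetical
    stand-ins for a bounded gym action space).
    Returns a nested list of shape number_of_agents x dim_U.
    """
    if len(low) != len(high):
        raise ValueError("low and high must have the same length")
    if rng is None:
        rng = random.Random()
    return [
        [rng.uniform(lo, hi) for lo, hi in zip(low, high)]
        for _ in range(number_of_agents)
    ]
```

Passing a seeded random.Random instance as rng makes the sampled actions reproducible, which is convenient when a random policy is used to collect initial data.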