Policies
Model Based Base Policy
class blackbox_mpc.policies.ModelBasedBasePolicy(trajectory_evaluator)
__init__(trajectory_evaluator)
This is the base class for model-based policies that control the agent.
- Parameters
trajectory_evaluator (EvaluatorBase) – Defines the trajectory evaluator to be used in the optimizer to evaluate trajectories.
__weakref__
List of weak references to the object (if defined).
act(observations, t, exploration_noise=False)
The act function of the model-based policy base class; call it to obtain the action to execute at the current time step.
- Parameters
observations (tf.float32) – Defines the current observations received from the environment.
t (tf.float32) – Defines the current timestep.
exploration_noise (bool) – Defines if exploration noise should be added to the action to be executed.
- Returns
action (tf.float32) – The action to be executed by each runner (dims = runners × dim_U).
next_observations (tf.float32) – The next observations predicted using the dynamics function learned so far.
rewards_of_next_state (tf.float32) – The reward predicted for executing the action, computed from the predicted observations.
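The three return values of act can be illustrated with a framework-free sketch. This is not the library's implementation: `ToyModelBasedPolicy`, its toy dynamics, and its toy reward are hypothetical stand-ins, and plain Python lists replace the tf.float32 tensors. It only shows the documented shapes, with one row per runner.

```python
import random


class ToyModelBasedPolicy:
    """Hypothetical stand-in mirroring the documented act() contract."""

    def __init__(self, dim_u):
        self.dim_u = dim_u

    def _dynamics(self, obs, action):
        # Assumed toy dynamics: every state component shifts by the mean action.
        shift = sum(action) / len(action)
        return [s + shift for s in obs]

    def _reward(self, obs, action, next_obs):
        # Assumed toy reward: negative squared distance of the next state from the origin.
        return -sum(s * s for s in next_obs)

    def act(self, observations, t, exploration_noise=False):
        actions, next_observations, rewards = [], [], []
        for obs in observations:  # one row per runner
            # Placeholder for the optimizer: push the state toward the origin.
            action = [-sum(obs) / len(obs)] * self.dim_u
            if exploration_noise:
                action = [a + random.gauss(0.0, 0.1) for a in action]
            next_obs = self._dynamics(obs, action)
            actions.append(action)
            next_observations.append(next_obs)
            rewards.append(self._reward(obs, action, next_obs))
        return actions, next_observations, rewards


policy = ToyModelBasedPolicy(dim_u=2)
observations = [[1.0, -1.0, 0.5], [0.0, 2.0, 1.0]]  # 2 runners, dim_S = 3
action, next_observations, rewards_of_next_state = policy.act(observations, t=0)
```

Here `action` has one row of length dim_U per runner, `next_observations` has one row of length dim_S per runner, and `rewards_of_next_state` holds one value per runner.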
Model Predictive Control Policy
class blackbox_mpc.policies.MPCPolicy(trajectory_evaluator=None, optimizer=None, tf_writer=None, log_dir=None, reward_function=None, env_action_space=None, env_observation_space=None, dynamics_function=None, dynamics_handler=None, true_model=False, optimizer_name=None, num_agents=None, save_model_frequency=1, saved_model_dir=None, **optimizer_args)
__init__(trajectory_evaluator=None, optimizer=None, tf_writer=None, log_dir=None, reward_function=None, env_action_space=None, env_observation_space=None, dynamics_function=None, dynamics_handler=None, true_model=False, optimizer_name=None, num_agents=None, save_model_frequency=1, saved_model_dir=None, **optimizer_args)
This is the model predictive control policy for controlling the agent.
- Parameters
trajectory_evaluator (EvaluatorBase) – Defines the trajectory evaluator to be used in the optimizer to evaluate trajectories.
tf_writer (tf.summary) – Tensorflow writer to be used in logging the data.
optimizer_name (str) – Name of the optimizer to use, one of ['CEM', 'CMA-ES', 'PI2', 'RandomSearch', 'PSO', 'SPSA'].
env_action_space (gym.ActionSpace) – Defines the action space of the gym environment.
env_observation_space (gym.ObservationSpace) – Defines the observation space of the gym environment.
dynamics_function (DeterministicDynamicsFunctionBaseClass) – Defines the system dynamics function.
dynamics_handler (SystemDynamicsHandler) – Handles the state, action, and target processing functions as well as the dynamics function.
reward_function (tf_function) – Defines the reward function with the prototype: tf_func_name(current_state, current_actions, next_state), where current_state and next_state are Batch × dim_S and current_actions is Batch × dim_U.
true_model (bool) – Defines whether the dynamics function is the true model of the system.
log_dir (string) – Defines the log directory to save the normalization statistics in.
num_agents (tf.int32) – Defines the number of runners running in parallel.
saved_model_dir (string) – Defines the directory the saved model is read from when loading a model.
save_model_frequency (int) – Defines how often the model should be saved (relative to the number of refining iterations).
optimizer_args (args) – Other arguments specific to the optimizer.
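The reward_function prototype above can be sketched as follows. The library expects a TensorFlow function operating on tensors; this dependency-free version uses nested lists purely to show the argument order and batch shapes. The quadratic cost and the 0.1 weight are assumptions chosen for illustration.

```python
def reward_function(current_state, current_actions, next_state):
    """Return one reward per batch row.

    current_state, next_state: Batch x dim_S (list of rows)
    current_actions:           Batch x dim_U (list of rows)
    """
    rewards = []
    # current_state is part of the prototype but unused by this toy reward.
    for action, nxt in zip(current_actions, next_state):
        state_cost = sum(s * s for s in nxt)             # squared distance of next state from origin
        action_cost = 0.1 * sum(a * a for a in action)   # small penalty on control effort
        rewards.append(-(state_cost + action_cost))
    return rewards


current_state = [[0.0, 1.0], [2.0, 0.0]]   # Batch = 2, dim_S = 2
current_actions = [[1.0], [0.0]]           # Batch = 2, dim_U = 1
next_state = [[0.0, 0.0], [1.0, 1.0]]
rewards = reward_function(current_state, current_actions, next_state)
```

The result contains one reward per batch row, so its length matches the first dimension of all three inputs.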
act(observations, t, exploration_noise=False)
The act function of the model predictive control policy; call it to obtain the action to execute at the current time step.
- Parameters
observations (tf.float32) – Defines the current observations received from the environment.
t (tf.float32) – Defines the current timestep.
exploration_noise (bool) – Defines if exploration noise should be added to the action to be executed.
- Returns
action (tf.float32) – The action to be executed by each runner (dims = runners × dim_U).
next_observations (tf.float32) – The next observations predicted using the dynamics function learned so far.
rewards_of_next_state (tf.float32) – The reward predicted for executing the action, computed from the predicted observations.
reset()
The reset function of the model predictive control policy; call it at the beginning of each episode.
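The calling order of reset and act within an episode can be sketched as below. `CountingPolicy`, `run_episode`, and the toy environment step are hypothetical stand-ins rather than part of blackbox_mpc; the sketch assumes only the documented contract that reset is called once at the start of an episode and act once per time step.

```python
class CountingPolicy:
    """Hypothetical stand-in exposing reset() and act() like the documented policy."""

    def __init__(self):
        self.steps_since_reset = 0

    def reset(self):
        # Clear per-episode state.
        self.steps_since_reset = 0

    def act(self, observations, t, exploration_noise=False):
        self.steps_since_reset += 1
        actions = [[-0.5 * obs[0]] for obs in observations]  # toy proportional action
        next_observations = [[obs[0] + a[0]] for obs, a in zip(observations, actions)]
        rewards = [-abs(n[0]) for n in next_observations]
        return actions, next_observations, rewards


def run_episode(policy, initial_observations, horizon):
    policy.reset()  # called once, at the beginning of the episode
    observations = initial_observations
    total_rewards = [0.0] * len(observations)
    for t in range(horizon):
        actions, predicted_next, rewards = policy.act(observations, t)
        # Toy environment: assume the prediction is exact and step to it.
        observations = predicted_next
        total_rewards = [tot + r for tot, r in zip(total_rewards, rewards)]
    return observations, total_rewards


policy = CountingPolicy()
final_obs, returns = run_episode(policy, [[4.0], [-2.0]], horizon=3)
```

In a real setup the environment, not the policy's prediction, would supply the next observations; the shortcut here only keeps the example self-contained.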
switch_optimizer(optimizer=None, optimizer_name='', **optimizer_args)
Switches the optimizer used by the model predictive control policy.
- Parameters
optimizer (OptimizerBaseClass) – Optimizer to be used that optimizes for the best action sequence and returns the first action.
optimizer_name (str) – Name of the optimizer to use, one of ['CEM', 'CMA-ES', 'PI2', 'RandomSearch', 'PSO', 'SPSA'].
optimizer_args (args) – Other arguments specific to the optimizer.
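One way to picture how the optimizer, optimizer_name, and optimizer_args parameters might interact is the sketch below. It does not reproduce the library's logic: `resolve_optimizer` is a hypothetical helper, the precedence of an explicit optimizer object over the name is an assumption, and `population_size` is a made-up optimizer argument. Only the list of names comes from the documentation above.

```python
SUPPORTED_OPTIMIZERS = ("CEM", "CMA-ES", "PI2", "RandomSearch", "PSO", "SPSA")


def resolve_optimizer(optimizer=None, optimizer_name="", **optimizer_args):
    """Hypothetical helper: return the optimizer instance or a (name, args) spec."""
    if optimizer is not None:
        # Assumption: an explicit optimizer object takes precedence over the name.
        return optimizer
    if optimizer_name not in SUPPORTED_OPTIMIZERS:
        raise ValueError(
            f"optimizer_name must be one of {SUPPORTED_OPTIMIZERS}, got {optimizer_name!r}"
        )
    # Stand-in for constructing the named optimizer with its specific arguments.
    return {"name": optimizer_name, "args": dict(optimizer_args)}


# population_size is a made-up argument used only for illustration.
spec = resolve_optimizer(optimizer_name="CEM", population_size=64)
```

Validating the name up front surfaces a misspelled optimizer at the call site rather than later during planning.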
Model Free Base Policy
class blackbox_mpc.policies.ModelFreeBasePolicy
__weakref__
List of weak references to the object (if defined).
act(observations, t, exploration_noise=False)
The act function of the model-free policy base class; call it to obtain the action to execute at the current time step.
- Parameters
observations (tf.float32) – Defines the current observations received from the environment.
t (tf.float32) – Defines the current timestep.
exploration_noise (bool) – Defines if exploration noise should be added to the action that will be executed.
- Returns
action – The action to be executed by each runner (dims = runners × dim_U).
- Return type
tf.float32
Random Policy
class blackbox_mpc.policies.RandomPolicy(number_of_agents, env_action_space)
__init__(number_of_agents, env_action_space)
This is the random policy for controlling the agent.
- Parameters
env_action_space (gym.ActionSpace) – Defines the action space of the gym environment.
number_of_agents (tf.int32) – Defines the number of runners running in parallel.
act(observations, t, exploration_noise=False)
The act function of the random policy; call it to obtain the action to execute at the current time step.
- Parameters
observations (tf.float32) – Defines the current observations received from the environment.
t (tf.float32) – Defines the current timestep.
exploration_noise (bool) – Defines if exploration noise should be added to the action to be executed.
- Returns
action – The action to be executed by each runner (dims = runners × dim_U).
- Return type
tf.float32
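The behaviour of a random policy can be sketched without TensorFlow or gym. `ToyRandomPolicy` is a hypothetical stand-in, not the library class: it assumes the action space is a box described by per-dimension lower and upper bounds (passed here as `action_low` and `action_high` in place of env_action_space) and that actions are drawn uniformly within them.

```python
import random


class ToyRandomPolicy:
    """Hypothetical stand-in: samples uniform actions within box bounds."""

    def __init__(self, number_of_agents, action_low, action_high):
        # action_low / action_high play the role of the bounds of env_action_space.
        self.number_of_agents = number_of_agents
        self.action_low = action_low
        self.action_high = action_high

    def act(self, observations, t, exploration_noise=False):
        # observations and t are ignored: each runner's action is drawn independently.
        return [
            [random.uniform(lo, hi) for lo, hi in zip(self.action_low, self.action_high)]
            for _ in range(self.number_of_agents)
        ]


random.seed(0)  # seeded only to make the example repeatable
policy = ToyRandomPolicy(number_of_agents=3, action_low=[-1.0, 0.0], action_high=[1.0, 2.0])
action = policy.act(observations=[[0.0]] * 3, t=0)
```

The returned `action` has one row per runner and one column per action dimension, matching the documented runners × dim_U layout.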