Useful Utilities

Learning Dynamics

blackbox_mpc.utils.dynamics_learning.learn_dynamics_from_policy(env, policy, number_of_rollouts, task_horizon, dynamics_function=None, system_dynamics_handler=None, epochs=30, learning_rate=0.001, validation_split=0.2, batch_size=128, is_normalized=True, nn_optimizer=<class 'tensorflow.python.keras.optimizer_v2.adam.Adam'>, tf_writer=None, exploration_noise=False, log_dir=None, save_model_frequency=1, saved_model_dir=None, start_episode=0)

Samples number_of_rollouts rollouts per agent using the provided policy (for example, a random policy) and then uses these rollouts to learn a dynamics function for the system.

Parameters

env (parallelgymEnv) – a wrapped gym environment using blackbox.environment_utils.EnvironmentWrapper funcs
policy (ModelFreeBasePolicy or ModelBasedBasePolicy) – the policy used for learning the dynamics.
number_of_rollouts (Int) – Number of rollouts/episodes to perform for each of the agents in the vectorized environment.
task_horizon (Int) – The task horizon/episode length.
dynamics_function (DeterministicDynamicsFunctionBaseClass) – Defines the system dynamics function.
learning_rate (float) – Learning rate to be used in training the dynamics function.
epochs (Int) – Number of epochs to be used in training the dynamics function every time train is called.
validation_split (float32) – Defines the validation split to be used of the rollouts collected.
batch_size (int) – Defines the batch size to be used for training the model.
nn_optimizer (tf.keras.optimizers) – Defines the optimizer to use with the neural network.
is_normalized (bool) – Defines if the dynamics function should be trained with normalization or not.
log_dir (string) – Defines the log directory to save the normalization statistics in.
tf_writer (tf.summary) – Tensorflow writer to be used in logging the data.
system_dynamics_handler (SystemDynamicsHandler) – The system_dynamics_handler is a handler of the state, actions and targets processing funcs as well
saved_model_dir (string) – Defines the saved model directory where the model is saved in, in case of loading the model.
save_model_frequency (Int) – Defines how often the model should be saved (defined relative to the number of refinement iterations).
start_episode (Int) – the episode index for tensorflow logging purposes
exploration_noise (bool) – Defines if exploration noise should be added to the action to be executed.

Returns

system_dynamics_handler – The system_dynamics_handler holds the trained system dynamics.

Return type

SystemDynamicsHandler
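The collect-then-fit idea behind this function can be illustrated with a minimal, library-free sketch. Everything in it is an assumption made for illustration: the 1-D system in true_step, the helper names, and the use of a closed-form least-squares fit in place of the neural network training (epochs, learning_rate, batch_size) that the real function performs.

```python
import random


def true_step(state, action):
    # Hypothetical 1-D system standing in for the wrapped gym environment.
    return 0.9 * state + 0.5 * action


def random_policy(state, rng):
    # Hypothetical policy: ignores the state and samples a uniform action.
    return rng.uniform(-1.0, 1.0)


def collect_rollouts(policy, number_of_rollouts, task_horizon, rng):
    """Return a flat list of (state, action, next_state) transitions."""
    transitions = []
    for _ in range(number_of_rollouts):
        state = rng.uniform(-1.0, 1.0)
        for _ in range(task_horizon):
            action = policy(state, rng)
            next_state = true_step(state, action)
            transitions.append((state, action, next_state))
            state = next_state
    return transitions


def fit_linear_dynamics(transitions):
    """Least-squares fit of delta = a * state + b * action (normal equations)."""
    sss = sum(s * s for s, _, _ in transitions)
    ssu = sum(s * u for s, u, _ in transitions)
    suu = sum(u * u for _, u, _ in transitions)
    ssd = sum(s * (n - s) for s, _, n in transitions)
    sud = sum(u * (n - s) for s, u, n in transitions)
    det = sss * suu - ssu * ssu
    a = (ssd * suu - ssu * sud) / det
    b = (sss * sud - ssu * ssd) / det
    return a, b


rng = random.Random(0)
data = collect_rollouts(random_policy, number_of_rollouts=5, task_horizon=20, rng=rng)
a, b = fit_linear_dynamics(data)
print(a, b)
```

Because the toy system is linear and noise-free, the fitted coefficients should match the true delta model (next_state - state = -0.1 * state + 0.5 * action) up to floating-point error.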

Model Based RL

blackbox_mpc.utils.iterative_mpc.learn_dynamics_iteratively_w_mpc(env, number_of_initial_rollouts, number_of_rollouts_for_refinement, number_of_refinement_steps, task_horizon, env_action_space=None, env_observation_space=None, initial_policy=None, refinement_policy=None, planning_horizon=None, reward_function=None, is_normalized=True, optimizer_name='CEM', optimizer=None, num_agents=None, nn_optimizer=<class 'tensorflow.python.keras.optimizer_v2.adam.Adam'>, dynamics_function=None, system_dynamics_handler=None, log_dir=None, tf_writer=None, save_model_frequency=1, saved_model_dir=None, exploration_noise=False, epochs=30, learning_rate=0.001, validation_split=0.2, batch_size=128, start_episode=0, **optimizer_args)

Learns the dynamics function iteratively with an MPC policy: samples rollouts using an initial policy, uses them to learn a dynamics function for the system, and then uses that function to sample further rollouts that refine it.

Parameters

env (parallelgymEnv) – a wrapped gym environment using blackbox.environment_utils.EnvironmentWrapper funcs
env_action_space (gym.ActionSpace) – Defines the action space of the gym environment.
env_observation_space (gym.ObservationSpace) – Defines the observation space of the gym environment.
num_agents (tf.int32) – Defines the number of runners running in parallel.
dynamics_function (DeterministicDynamicsFunctionBaseClass) – Defines the system dynamics function.
system_dynamics_handler (SystemDynamicsHandler) – The system_dynamics_handler is a handler of the state, actions and targets processing funcs as well.
number_of_initial_rollouts (Int) – Number of initial rollouts/episodes to perform for each of the agents in the vectorized environment.
number_of_rollouts_for_refinement (Int) – Number of refinement rollouts/episodes to perform for each of the agents in the vectorized environment.
number_of_refinement_steps (Int) – Number of refinement steps (train, collect, train, etc.) to run for.
task_horizon (Int) – The task horizon/episode length.
initial_policy (ModelBasedBasePolicy or ModelFreeBasePolicy) – The policy to be used in collecting the initial episodes from the different agents.
refinement_policy (ModelBasedBasePolicy) – The policy to be used in collecting the followup episodes to refine the policy.
exploration_noise (bool) – If noise should be added to the actions to help in exploration.
learning_rate (float) – Learning rate to be used in training the dynamics function.
epochs (Int) – Number of epochs to be used in training the dynamics function every time train is called.
validation_split (float32) – Defines the validation split to be used of the rollouts collected.
batch_size (int) – Defines the batch size to be used for training the model.
nn_optimizer (tf.keras.optimizers) – Defines the optimizer to use with the neural network.
is_normalized (bool) – Defines if the dynamics function should be trained with normalization or not.
reward_function (tf_function) – Defines the reward function with the prototype: tf_func_name(current_state, current_actions, next_state), where current_state is BatchXdim_S, next_state is BatchXdim_S and current_actions is BatchXdim_U.
planning_horizon (tf.int32) – Defines the planning horizon for the optimizer (how many steps to lookahead and optimize for).
optimizer (OptimizerBaseClass) – Optimizer to be used that optimizes for the best action sequence and returns the first action.
optimizer_name (str) – Optimizer name, one of [‘CEM’, ‘CMA-ES’, ‘PI2’, ‘RandomSearch’, ‘PSO’, ‘SPSA’].
saved_model_dir (string) – Defines the saved model directory where the model is saved in, in case of loading the model.
save_model_frequency (Int) – Defines how often the model should be saved (defined relative to the number of refinement iterations).
start_episode (Int) – the episode index for tensorflow logging purposes
log_dir (string) – Defines the log directory to save the normalization statistics in.
tf_writer (tf.summary) – Tensorflow writer to be used in logging the data.

Returns

system_dynamics_handler (SystemDynamicsHandler) – The system_dynamics_handler holds the trained system dynamics.
mpc_policy (ModelBasedBasePolicy) – The policy that was refined to be used as a control policy
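The train, collect, train loop described above can be sketched without the library as follows. This is only an illustration under stated assumptions: the 1-D system, the reward, and all helper names are hypothetical; the dynamics model is a least-squares fit rather than a neural network; and the planner is plain random shooting rather than one of the optimizers listed for optimizer_name.

```python
import random


def true_step(state, action):
    # Hypothetical 1-D system standing in for the real environment.
    return 0.9 * state + 0.5 * action


def reward(current_state, current_action, next_state):
    # Hypothetical reward: drive the state towards 1.0.
    return -(next_state - 1.0) ** 2


def fit(transitions):
    """Least-squares fit of delta = a * state + b * action."""
    sss = sum(s * s for s, _, _ in transitions)
    ssu = sum(s * u for s, u, _ in transitions)
    suu = sum(u * u for _, u, _ in transitions)
    ssd = sum(s * (n - s) for s, _, n in transitions)
    sud = sum(u * (n - s) for s, u, n in transitions)
    det = sss * suu - ssu * ssu
    return (ssd * suu - ssu * sud) / det, (sss * sud - ssu * ssd) / det


def make_mpc_policy(model, planning_horizon, num_candidates, rng):
    """Random-shooting MPC: return the first action of the best sampled plan."""
    a, b = model

    def policy(state):
        best_return, best_action = float("-inf"), 0.0
        for _ in range(num_candidates):
            plan = [rng.uniform(-1.0, 1.0) for _ in range(planning_horizon)]
            s, total = state, 0.0
            for u in plan:
                s_next = s + a * s + b * u  # predicted delta added back to the state
                total += reward(s, u, s_next)
                s = s_next
            if total > best_return:
                best_return, best_action = total, plan[0]
        return best_action

    return policy


def rollouts(policy, number_of_rollouts, task_horizon):
    transitions = []
    for _ in range(number_of_rollouts):
        state = 0.0
        for _ in range(task_horizon):
            action = policy(state)
            next_state = true_step(state, action)
            transitions.append((state, action, next_state))
            state = next_state
    return transitions


rng = random.Random(0)


def initial_policy(state):
    return rng.uniform(-1.0, 1.0)


# Initial data collection and first fit.
data = rollouts(initial_policy, number_of_rollouts=3, task_horizon=20)
model = fit(data)

# Refinement: plan with the current model, collect more data, refit.
for _ in range(2):  # plays the role of number_of_refinement_steps
    mpc_policy = make_mpc_policy(model, planning_horizon=5, num_candidates=50, rng=rng)
    data += rollouts(mpc_policy, number_of_rollouts=1, task_horizon=20)
    model = fit(data)
```

In this toy setting the first fit is already exact, so refinement adds little; with a learned neural model and a real environment, the refinement rollouts are what expose the model to the states the MPC policy actually visits.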

Pendulum Reward Function

blackbox_mpc.utils.pendulum.pendulum_reward_function(current_state, next_state, actions)

The pendulum state reward function

Parameters

current_state (tf.float32) – represents the current state of the system (Bxdim_S)
next_state (tf.float32) – represents the next state of the system (Bxdim_S)
actions (tf.float32) – represents the actions applied to the system (Bxdim_U)
Returns

rewards – The reward corresponding to each of the pairs current_state, next_state

Return type

tf.float32
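As a rough illustration of what a pendulum reward of this shape can look like, the sketch below uses the cost commonly associated with Gym's classic Pendulum task (angle squared plus small penalties on angular velocity and torque). That cost, its coefficients, and the function name are assumptions; the library's own implementation may differ, and plain Python lists stand in for tensors.

```python
import math


def pendulum_reward_sketch(current_state, next_state, actions):
    """Per-row rewards for states laid out as [cos(theta), sin(theta), dtheta].

    next_state is accepted to match the documented signature but is not
    used by this particular (assumed) cost.
    """
    rewards = []
    for (cos_th, sin_th, dtheta), (torque,) in zip(current_state, actions):
        theta = math.atan2(sin_th, cos_th)  # angle in [-pi, pi], 0 is upright
        cost = theta ** 2 + 0.1 * dtheta ** 2 + 0.001 * torque ** 2
        rewards.append(-cost)
    return rewards


states = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]  # upright, hanging down
actions = [[0.0], [0.0]]
rewards = pendulum_reward_sketch(states, states, actions)
```

The upright, motionless state receives the maximum reward of zero, and the reward decreases as the pendulum moves away from upright.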

Pendulum True Model

class blackbox_mpc.utils.pendulum.PendulumTrueModel(name=None)

__call__(x, train)

This is the call function for the pendulum true model.

Parameters

x (tf.float32) – Defines the (s_t, a_t) which is the state and action stacked on top of each other, (dims = Batch X (dim_S + dim_U)) [cos(theta), sin(theta), dtheta, u]
train (tf.bool) – Placeholder to conform to the base class.

Returns

output – Defines the next state (s_t+1) with (dims = Batch X dim_S), [cos(theta), sin(theta), dtheta]

Return type

tf.float32
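The input and output layout described above can be made concrete with a small sketch of a pendulum step. The equations and constants below follow Gym's classic Pendulum environment as commonly documented and are assumptions here, as is the function name; the class itself may differ in details such as clipping order.

```python
import math

# Assumed physical constants and limits (not taken from this documentation).
G, MASS, LENGTH, DT = 10.0, 1.0, 1.0, 0.05
MAX_SPEED, MAX_TORQUE = 8.0, 2.0


def pendulum_step_sketch(x):
    """Map rows [cos(theta), sin(theta), dtheta, u] to next-state rows
    [cos(theta), sin(theta), dtheta]."""
    next_states = []
    for cos_th, sin_th, dtheta, u in x:
        theta = math.atan2(sin_th, cos_th)
        u = max(-MAX_TORQUE, min(MAX_TORQUE, u))  # clip the torque
        new_dtheta = dtheta + (
            3.0 * G / (2.0 * LENGTH) * math.sin(theta)
            + 3.0 / (MASS * LENGTH ** 2) * u
        ) * DT
        new_dtheta = max(-MAX_SPEED, min(MAX_SPEED, new_dtheta))  # clip the speed
        new_theta = theta + new_dtheta * DT
        next_states.append([math.cos(new_theta), math.sin(new_theta), new_dtheta])
    return next_states


batch = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
out = pendulum_step_sketch(batch)
```

The first row (upright, at rest, no torque) stays where it is; the second row shows a unit torque producing a small positive angular velocity.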

__init__(name=None)

This is the pendulum true model for the gym environment

Parameters

name (String) – Defines the name of the block of the pendulum true model.

Recording Videos

blackbox_mpc.utils.recording.record_rollout(env, horizon, policy, record_file_path)

Samples one episode of the specified length using the provided policy and records it as a video.

Parameters

horizon (Int) – The task horizon/episode length.
policy (ModelBasedBasePolicy or ModelFreeBasePolicy) – The policy to be used in collecting the episodes from the different agents.
record_file_path (String) – Specifies the file path where the recorded video is saved.

Rollout Collection

blackbox_mpc.utils.rollouts.perform_rollouts(env, number_of_rollouts, task_horizon, policy, exploration_noise=False, tf_writer=None, start_episode=0)

Samples number_of_rollouts episodes of the specified length using the provided policy.

Parameters

env (parallelgymEnv) – a wrapped gym environment using blackbox.environment_utils.EnvironmentWrapper funcs
number_of_rollouts (Int) – Number of rollouts/episodes to perform for each of the agents in the vectorized environment.
task_horizon (Int) – The task horizon/episode length.
policy (ModelBasedBasePolicy or ModelFreeBasePolicy) – The policy to be used in collecting the episodes from the different agents.
exploration_noise (bool) – If noise should be added to the actions to help in exploration.
tf_writer (tf.summary) – Tensorflow writer to be used in logging the data.
start_episode (Int) – the episode index for tensorflow logging purposes

Returns

traj_obs ([np.float32]) – List with length=number_of_rollouts which holds the observations starting from the reset observations.
traj_acs ([np.float32]) – List with length=number_of_rollouts which holds the actions taken by the policy.
traj_rews ([np.float32]) – List with length=number_of_rollouts which holds the rewards received by the policy.
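The shape of the three returned lists can be illustrated with a minimal, library-free rollout loop. The ToyEnv class, its simplified two-value step return, and the function name are hypothetical; the sketch also assumes a single agent and that each observation list holds task_horizon + 1 entries because it starts from the reset observation.

```python
import random


class ToyEnv:
    """Hypothetical single-agent environment with a simplified reset/step API."""

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        self.state = 0.9 * self.state + 0.5 * action
        reward = -(self.state - 1.0) ** 2
        return self.state, reward


def perform_rollouts_sketch(env, number_of_rollouts, task_horizon, policy):
    traj_obs, traj_acs, traj_rews = [], [], []
    for _ in range(number_of_rollouts):
        # Each trajectory starts from the reset observation.
        observations, actions, rewards = [env.reset()], [], []
        for _ in range(task_horizon):
            action = policy(observations[-1])
            observation, reward = env.step(action)
            observations.append(observation)
            actions.append(action)
            rewards.append(reward)
        traj_obs.append(observations)
        traj_acs.append(actions)
        traj_rews.append(rewards)
    return traj_obs, traj_acs, traj_rews


rng = random.Random(0)
traj_obs, traj_acs, traj_rews = perform_rollouts_sketch(
    ToyEnv(),
    number_of_rollouts=3,
    task_horizon=10,
    policy=lambda obs: rng.uniform(-1.0, 1.0),
)
```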

Target Transforms

blackbox_mpc.utils.transforms.default_inverse_transform_targets(current_state, delta)

This is the default inverse transform targets function, which reverses the preprocessing of the dynamics function's targets to recover the absolute next state rather than the relative one. The default is (next_state = delta + current_state).

Parameters

current_state (tf.float32) – The current_state has a shape of (Batch X dim_S)
delta (tf.float32) – The delta has a shape of (Batch X dim_S) which is equivalent to the target of the network.

blackbox_mpc.utils.transforms.default_transform_targets(current_state, next_state)

This is the default transform targets function, which preprocesses the network targets before the dynamics function is trained on the inputs and targets. The default is (target = next_state - current_state).

Parameters

current_state (tf.float32) – The current_state has a shape of (Batch X dim_S)
next_state (tf.float32) – The next_state has a shape of (Batch X dim_S)
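The two default transforms are inverses of each other, which a short round trip makes visible. The sketch below uses plain Python lists in place of tensors, and the function names are illustrative stand-ins rather than the library's own implementations.

```python
def transform_targets_sketch(current_state, next_state):
    # target = next_state - current_state, element-wise over (Batch x dim_S)
    return [
        [n - c for c, n in zip(cur_row, next_row)]
        for cur_row, next_row in zip(current_state, next_state)
    ]


def inverse_transform_targets_sketch(current_state, delta):
    # next_state = current_state + delta, element-wise over (Batch x dim_S)
    return [
        [c + d for c, d in zip(cur_row, delta_row)]
        for cur_row, delta_row in zip(current_state, delta)
    ]


current = [[1.0, 2.0], [0.5, -0.5]]
following = [[1.5, 1.0], [0.5, 0.25]]

delta = transform_targets_sketch(current, following)
recovered = inverse_transform_targets_sketch(current, delta)
```

Training on the delta rather than the absolute next state keeps the targets small when consecutive states are close, and the inverse transform restores the absolute state at prediction time.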