Useful Utilities

Learning Dynamics

blackbox_mpc.utils.dynamics_learning.learn_dynamics_from_policy(env, policy, number_of_rollouts, task_horizon, dynamics_function=None, system_dynamics_handler=None, epochs=30, learning_rate=0.001, validation_split=0.2, batch_size=128, is_normalized=True, nn_optimizer=<class 'tensorflow.python.keras.optimizer_v2.adam.Adam'>, tf_writer=None, exploration_noise=False, log_dir=None, save_model_frequency=1, saved_model_dir=None, start_episode=0)[source]

This is the learn dynamics function for the runner class. It samples number_of_rollouts rollouts with the given policy (such as a random policy) and then uses these rollouts to learn a dynamics function for the system.

Parameters
  • env (parallelgymEnv) – a wrapped gym environment using blackbox.environment_utils.EnvironmentWrapper funcs

  • policy (ModelFreeBasePolicy or ModelBasedBasePolicy) – the policy used for learning the dynamics.

  • number_of_rollouts (Int) – Number of rollouts/ episodes to perform for each of the agents in the vectorized environment.

  • task_horizon (Int) – The task horizon/ episode length.

  • dynamics_function (DeterministicDynamicsFunctionBaseClass) – Defines the system dynamics function.

  • learning_rate (float) – Learning rate to be used in training the dynamics function.

  • epochs (Int) – Number of epochs to be used in training the dynamics function every time train is called.

  • validation_split (float32) – Defines the validation split to be used of the rollouts collected.

  • batch_size (int) – Defines the batch size to be used for training the model.

  • nn_optimizer (tf.keras.optimizers) – Defines the optimizer to use with the neural network.

  • is_normalized (bool) – Defines if the dynamics function should be trained with normalization or not.

  • log_dir (string) – Defines the log directory to save the normalization statistics in.

  • tf_writer (tf.summary) – Tensorflow writer to be used in logging the data.

  • system_dynamics_handler (SystemDynamicsHandler) – Handler for the state, action and target processing functions.

  • saved_model_dir (string) – Defines the saved model directory where the model is saved in, in case of loading the model.

  • save_model_frequency (Int) – Defines how often the model should be saved (relative to the number of refinement iterations).

  • start_episode (Int) – the episode index for tensorflow logging purposes

  • exploration_noise (bool) – Defines if exploration noise should be added to the action to be executed.

Returns

system_dynamics_handler – The system_dynamics_handler holds the trained system dynamics.

Return type

SystemDynamicsHandler
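The collect-then-fit workflow described above can be illustrated with a small, library-free sketch. Everything in it is an assumption made for illustration: the 1-D linear system in true_step, the helper names collect_rollouts and fit_linear_dynamics, and the least-squares fit that stands in for neural-network training. It does not call blackbox_mpc and is not how learn_dynamics_from_policy is implemented.

```python
import random


def true_step(state, action):
    # Hypothetical 1-D system standing in for the real environment.
    return 0.9 * state + 0.5 * action


def collect_rollouts(policy, number_of_rollouts, task_horizon, rng):
    # Run the policy and record (state, action, next_state) transitions.
    transitions = []
    for _ in range(number_of_rollouts):
        state = rng.uniform(-1.0, 1.0)
        for _ in range(task_horizon):
            action = policy(state, rng)
            next_state = true_step(state, action)
            transitions.append((state, action, next_state))
            state = next_state
    return transitions


def fit_linear_dynamics(transitions):
    # Least-squares fit of next_state ~ w_s * state + w_a * action,
    # solved through the 2x2 normal equations.
    s_s = sum(s * s for s, _, _ in transitions)
    s_a = sum(s * a for s, a, _ in transitions)
    a_a = sum(a * a for _, a, _ in transitions)
    s_n = sum(s * n for s, _, n in transitions)
    a_n = sum(a * n for _, a, n in transitions)
    det = s_s * a_a - s_a * s_a
    w_s = (a_a * s_n - s_a * a_n) / det
    w_a = (s_s * a_n - s_a * s_n) / det
    return w_s, w_a


def random_policy(state, rng):
    # Ignores the state and draws a uniform action.
    return rng.uniform(-1.0, 1.0)


rng = random.Random(0)
transitions = collect_rollouts(random_policy, number_of_rollouts=5,
                               task_horizon=20, rng=rng)
w_s, w_a = fit_linear_dynamics(transitions)
```

In this toy setting the fitted weights recover the coefficients of true_step; in the library the same role is played by the trained dynamics_function held by the returned SystemDynamicsHandler.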

Model Based RL

blackbox_mpc.utils.iterative_mpc.learn_dynamics_iteratively_w_mpc(env, number_of_initial_rollouts, number_of_rollouts_for_refinement, number_of_refinement_steps, task_horizon, env_action_space=None, env_observation_space=None, initial_policy=None, refinement_policy=None, planning_horizon=None, reward_function=None, is_normalized=True, optimizer_name='CEM', optimizer=None, num_agents=None, nn_optimizer=<class 'tensorflow.python.keras.optimizer_v2.adam.Adam'>, dynamics_function=None, system_dynamics_handler=None, log_dir=None, tf_writer=None, save_model_frequency=1, saved_model_dir=None, exploration_noise=False, epochs=30, learning_rate=0.001, validation_split=0.2, batch_size=128, start_episode=0, **optimizer_args)[source]

This function learns the dynamics iteratively with an MPC policy for the runner class. It samples rollouts using an initial policy, uses them to learn a dynamics function for the system, and then uses that function to sample further rollouts that refine the dynamics function.

Parameters
  • env (parallelgymEnv) – a wrapped gym environment using blackbox.environment_utils.EnvironmentWrapper funcs

  • env_action_space (gym.ActionSpace) – Defines the action space of the gym environment.

  • env_observation_space (gym.ObservationSpace) – Defines the observation space of the gym environment.

  • num_agents (tf.int32) – Defines the number of runners running in parallel.

  • dynamics_function (DeterministicDynamicsFunctionBaseClass) – Defines the system dynamics function.

  • system_dynamics_handler (SystemDynamicsHandler) – Handler for the state, action and target processing functions.

  • number_of_initial_rollouts (Int) – Number of initial rollouts/ episodes to perform for each of the agents in the vectorized environment.

  • number_of_rollouts_for_refinement (Int) – Number of refinement rollouts/ episodes to perform for each of the agents in the vectorized environment.

  • number_of_refinement_steps (Int) – Number of refinement steps (train, collect, train, etc.) to run for.

  • task_horizon (Int) – The task horizon/ episode length.

  • initial_policy (ModelBasedBasePolicy or ModelFreeBasePolicy) – The policy to be used in collecting the initial episodes from the different agents.

  • refinement_policy (ModelBasedBasePolicy) – The policy to be used in collecting the followup episodes to refine the policy.

  • exploration_noise (bool) – If noise should be added to the actions to help in exploration.

  • learning_rate (float) – Learning rate to be used in training the dynamics function.

  • epochs (Int) – Number of epochs to be used in training the dynamics function every time train is called.

  • validation_split (float32) – Defines the validation split to be used of the rollouts collected.

  • batch_size (int) – Defines the batch size to be used for training the model.

  • nn_optimizer (tf.keras.optimizers) – Defines the optimizer to use with the neural network.

  • is_normalized (bool) – Defines if the dynamics function should be trained with normalization or not.

  • reward_function (tf_function) – Defines the reward function with the prototype: tf_func_name(current_state, current_actions, next_state), where current_state is Batch X dim_S, next_state is Batch X dim_S and current_actions is Batch X dim_U.

  • planning_horizon (tf.int32) – Defines the planning horizon for the optimizer (how many steps to lookahead and optimize for).

  • optimizer (OptimizerBaseClass) – Optimizer to be used that optimizes for the best action sequence and returns the first action.

  • optimizer_name (str) – Name of the optimizer, one of [‘CEM’, ‘CMA-ES’, ‘PI2’, ‘RandomSearch’, ‘PSO’, ‘SPSA’].

  • saved_model_dir (string) – Defines the saved model directory where the model is saved in, in case of loading the model.

  • save_model_frequency (Int) – Defines how often the model should be saved (relative to the number of refinement iterations).

  • start_episode (Int) – the episode index for tensorflow logging purposes

  • exploration_noise – Defines if exploration noise should be added to the action to be executed.

  • log_dir (string) – Defines the log directory to save the normalization statistics in.

  • tf_writer (tf.summary) – Tensorflow writer to be used in logging the data.

Returns

  • system_dynamics_handler (SystemDynamicsHandler) – The system_dynamics_handler holds the trained system dynamics.

  • mpc_policy (ModelBasedBasePolicy) – The policy that was refined to be used as a control policy
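The train, collect, train loop described above can be sketched without the library. The sketch below is a conceptual illustration under stated assumptions: a hypothetical 1-D linear system, a least-squares fit in place of neural-network training, and a plain random-search planner in place of the optimizers listed for optimizer_name. The names true_step, fit, rollout, random_policy and mpc_policy are invented for this example.

```python
import random


def true_step(state, action):
    # Hypothetical 1-D system standing in for the environment.
    return 0.9 * state + 0.5 * action


def reward(current_state, current_action, next_state):
    # Hypothetical reward: stay close to the origin.
    return -(next_state ** 2)


def fit(transitions):
    # Least-squares fit of next_state ~ w_s * state + w_a * action.
    s_s = sum(s * s for s, _, _ in transitions)
    s_a = sum(s * a for s, a, _ in transitions)
    a_a = sum(a * a for _, a, _ in transitions)
    s_n = sum(s * n for s, _, n in transitions)
    a_n = sum(a * n for _, a, n in transitions)
    det = s_s * a_a - s_a * s_a
    return ((a_a * s_n - s_a * a_n) / det, (s_s * a_n - s_a * s_n) / det)


def random_policy(state, model, rng):
    return rng.uniform(-1.0, 1.0)


def mpc_policy(state, model, rng, planning_horizon=5, candidates=64):
    # Random search: score sampled action sequences with the learned
    # model and return the first action of the best sequence.
    w_s, w_a = model
    best_return, best_first = float("-inf"), 0.0
    for _ in range(candidates):
        sequence = [rng.uniform(-1.0, 1.0) for _ in range(planning_horizon)]
        s, total = state, 0.0
        for u in sequence:
            s_next = w_s * s + w_a * u
            total += reward(s, u, s_next)
            s = s_next
        if total > best_return:
            best_return, best_first = total, sequence[0]
    return best_first


def rollout(policy, model, task_horizon, rng):
    transitions, s = [], rng.uniform(-1.0, 1.0)
    for _ in range(task_horizon):
        u = policy(s, model, rng)
        s_next = true_step(s, u)
        transitions.append((s, u, s_next))
        s = s_next
    return transitions


rng = random.Random(0)
data = []
for _ in range(5):                      # number_of_initial_rollouts
    data += rollout(random_policy, None, 20, rng)
model = fit(data)
for _ in range(3):                      # number_of_refinement_steps
    for _ in range(2):                  # number_of_rollouts_for_refinement
        data += rollout(mpc_policy, model, 20, rng)
    model = fit(data)                   # retrain on all data collected so far
```

Retraining on the accumulated data after each refinement step is a design choice of this sketch; whether the library retrains on all rollouts or only on the newest ones is not stated in this section.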

Pendulum Reward Function

blackbox_mpc.utils.pendulum.pendulum_reward_function(current_state, next_state, actions)[source]

The pendulum state reward function

Parameters
  • current_state (tf.float32) – represents the current state of the system (Bxdim_S)

  • next_state (tf.float32) – represents the next state of the system (Bxdim_S)

Returns

rewards – The reward corresponding to each of the pairs current_state, next_state

Return type

tf.float32
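As a rough illustration of what a batched pendulum reward can look like, the sketch below uses the cost from the classic Gym pendulum task (squared angle, plus weighted squared angular velocity and torque) and negates it. That choice of cost is an assumption; the source does not state which formula pendulum_reward_function uses. Plain Python lists stand in for tf.float32 tensors, and the name pendulum_reward_sketch is hypothetical.

```python
import math


def pendulum_reward_sketch(current_state, next_state, actions):
    # current_state rows: [cos(theta), sin(theta), dtheta]; actions rows: [u].
    # next_state is accepted to match the documented signature but is
    # not used by this particular (assumed) cost.
    rewards = []
    for (cos_th, sin_th, dtheta), (u,) in zip(current_state, actions):
        theta = math.atan2(sin_th, cos_th)  # angle wrapped to [-pi, pi]
        cost = theta ** 2 + 0.1 * dtheta ** 2 + 0.001 * u ** 2
        rewards.append(-cost)
    return rewards


upright = [[1.0, 0.0, 0.0]]    # theta = 0, at rest
hanging = [[-1.0, 0.0, 0.0]]   # theta = pi, at rest
no_torque = [[0.0]]

r_upright = pendulum_reward_sketch(upright, upright, no_torque)
r_hanging = pendulum_reward_sketch(hanging, hanging, no_torque)
```

The upright, motionless pendulum receives the highest reward (zero), and the hanging one receives the negative squared angle.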

Pendulum True Model

class blackbox_mpc.utils.pendulum.PendulumTrueModel(name=None)[source]
__call__(x, train)[source]

This is the call function for the pendulum true model.

Parameters
  • x (tf.float32) – Defines the (s_t, a_t) which is the state and action stacked on top of each other, (dims = Batch X (dim_S + dim_U)) [cos(theta), sin(theta), dtheta, u]

  • train (tf.bool) – Placeholder to conform to the base class.

Returns

output – Defines the next state (s_t+1) with (dims = Batch X dim_S), [cos(theta), sin(theta), dtheta]

Return type

tf.float32

__init__(name=None)[source]

This is the pendulum true model for the gym environment

Parameters

name (String) – Defines the name of the block of the pendulum true model.
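The input and output layout of __call__ can be illustrated with a library-free sketch of one pendulum step. The constants and update rule below follow the classic Gym pendulum environment as commonly documented (g = 10, m = 1, l = 1, dt = 0.05, torque clipped to [-2, 2], speed clipped to [-8, 8]); whether PendulumTrueModel uses exactly these values and this ordering of operations is an assumption. The name pendulum_step_sketch is hypothetical, and lists stand in for tensors.

```python
import math

# Assumed constants, mirroring the classic Gym pendulum task.
G, M, L, DT = 10.0, 1.0, 1.0, 0.05
MAX_SPEED, MAX_TORQUE = 8.0, 2.0


def pendulum_step_sketch(x):
    # x rows: [cos(theta), sin(theta), dtheta, u]
    # returns rows: [cos(theta'), sin(theta'), dtheta']
    output = []
    for cos_th, sin_th, dtheta, u in x:
        theta = math.atan2(sin_th, cos_th)
        u = max(-MAX_TORQUE, min(MAX_TORQUE, u))
        accel = 3.0 * G / (2.0 * L) * math.sin(theta) + 3.0 / (M * L ** 2) * u
        new_dtheta = dtheta + accel * DT
        new_dtheta = max(-MAX_SPEED, min(MAX_SPEED, new_dtheta))
        new_theta = theta + new_dtheta * DT
        output.append([math.cos(new_theta), math.sin(new_theta), new_dtheta])
    return output


at_rest = pendulum_step_sketch([[1.0, 0.0, 0.0, 0.0]])
pushed = pendulum_step_sketch([[1.0, 0.0, 0.0, 2.0]])
```

An upright pendulum with no torque stays where it is, while applying the maximum torque for one step changes the angular velocity by 3 / (m l^2) * u * dt.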

Recording Videos

blackbox_mpc.utils.recording.record_rollout(env, horizon, policy, record_file_path)[source]

This is the recording function for the runner class which samples one episode with a specified length using the provided policy and records it in a video.

Parameters
  • horizon (Int) – The task horizon/ episode length.

  • policy (ModelBasedBasePolicy or ModelFreeBasePolicy) – The policy to be used in collecting the episodes from the different agents.

  • record_file_path (String) – Specifies the file path where the recorded video will be saved.

Rollout Collection

blackbox_mpc.utils.rollouts.perform_rollouts(env, number_of_rollouts, task_horizon, policy, exploration_noise=False, tf_writer=None, start_episode=0)[source]

This is the perform_rollouts function for the runner class which samples n episodes with a specified length using the provided policy.

Parameters
  • env (parallelgymEnv) – a wrapped gym environment using blackbox.environment_utils.EnvironmentWrapper funcs

  • number_of_rollouts (Int) – Number of rollouts/ episodes to perform for each of the agents in the vectorized environment.

  • task_horizon (Int) – The task horizon/ episode length.

  • policy (ModelBasedBasePolicy or ModelFreeBasePolicy) – The policy to be used in collecting the episodes from the different agents.

  • exploration_noise (bool) – If noise should be added to the actions to help in exploration.

  • tf_writer (tf.summary) – Tensorflow writer to be used in logging the data.

  • start_episode (Int) – the episode index for tensorflow logging purposes

Returns

  • traj_obs ([np.float32]) – List with length=number_of_rollouts which holds the observations starting from the reset observations.

  • traj_acs ([np.float32]) – List with length=number_of_rollouts which holds the actions taken by the policy.

  • traj_rews ([np.float32]) – List with length=number_of_rollouts which holds the rewards received by the policy.
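The shape of the three returned lists can be illustrated with a minimal rollout loop. This is a sketch under assumptions: ToyEnv is a hypothetical single-agent environment with a simplified reset/step interface, not the wrapped vectorized environment the library expects, and perform_rollouts_sketch is an invented name. It also assumes each observation trajectory holds task_horizon + 1 entries because it starts from the reset observation.

```python
import random


class ToyEnv:
    """Hypothetical 1-D environment with a simplified reset/step interface."""

    def reset(self):
        self.state = 1.0
        return [self.state]

    def step(self, action):
        self.state = 0.9 * self.state + 0.5 * action[0]
        reward = -(self.state ** 2)
        return [self.state], reward


def perform_rollouts_sketch(env, number_of_rollouts, task_horizon, policy):
    traj_obs, traj_acs, traj_rews = [], [], []
    for _ in range(number_of_rollouts):
        obs = env.reset()
        ep_obs, ep_acs, ep_rews = [obs], [], []   # starts from the reset observation
        for _ in range(task_horizon):
            action = policy(obs)
            obs, rew = env.step(action)
            ep_obs.append(obs)
            ep_acs.append(action)
            ep_rews.append(rew)
        traj_obs.append(ep_obs)
        traj_acs.append(ep_acs)
        traj_rews.append(ep_rews)
    return traj_obs, traj_acs, traj_rews


rng = random.Random(0)
traj_obs, traj_acs, traj_rews = perform_rollouts_sketch(
    ToyEnv(), number_of_rollouts=3, task_horizon=10,
    policy=lambda obs: [rng.uniform(-1.0, 1.0)])
```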

Target Transforms

blackbox_mpc.utils.transforms.default_inverse_transform_targets(current_state, delta)[source]

This is the default inverse transform targets function, which reverses the preprocessing of the targets of the dynamics function to obtain the actual next state rather than the relative one. The default one is (next_state = delta + current_state).

Parameters
  • current_state (tf.float32) – The current_state has a shape of (Batch X dim_S)

  • delta (tf.float32) – The delta has a shape of (Batch X dim_S) which is equivalent to the target of the network.

blackbox_mpc.utils.transforms.default_transform_targets(current_state, next_state)[source]

This is the default transform targets function used, which preprocesses the targets of the network before training the dynamics function using the inputs and targets. The default one is (target = next_state - current_state).

Parameters
  • current_state (tf.float32) – The current_state has a shape of (Batch X dim_S)

  • next_state (tf.float32) – The next_state has a shape of (Batch X dim_S)
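The two default transforms are inverses of each other: one turns the next state into a delta for the network to predict, the other adds the predicted delta back onto the current state. A minimal sketch of that round trip follows, with nested Python lists of shape (Batch X dim_S) standing in for tf.float32 tensors. The function names are hypothetical stand-ins, not the library functions.

```python
def transform_targets_sketch(current_state, next_state):
    # target = next_state - current_state, element-wise per batch row.
    return [[n - c for c, n in zip(c_row, n_row)]
            for c_row, n_row in zip(current_state, next_state)]


def inverse_transform_targets_sketch(current_state, delta):
    # next_state = current_state + delta, element-wise per batch row.
    return [[c + d for c, d in zip(c_row, d_row)]
            for c_row, d_row in zip(current_state, delta)]


current_state = [[1.0, 2.0], [0.5, -1.0]]
next_state = [[1.5, 1.0], [0.25, 0.0]]

delta = transform_targets_sketch(current_state, next_state)
recovered = inverse_transform_targets_sketch(current_state, delta)
```

Predicting deltas rather than absolute next states keeps the regression targets small when consecutive states are close, which is the usual motivation for this kind of target preprocessing.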