Create an I2A agent which collects all necessary information from the environment
- Figure out how to pass the current PPO agent into the I2A model
- Use the PPO actor network as the rollout policy
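A minimal sketch of reusing the PPO actor as the rollout policy during imagination; the `actor` module here is a stand-in for the real PPO actor network, and the latent/action dimensions are assumptions:

```python
import torch
import torch.nn as nn

# Stand-in for the trained PPO actor (assumed shapes: latent_dim=32, 4 actions);
# in practice this would be the actor passed in from the PPO agent.
actor = nn.Linear(32, 4)
latent = torch.zeros(8, 32)       # batch of 8 imagined states

with torch.no_grad():             # the rollout policy is not updated during imagination
    dist = torch.distributions.Categorical(logits=actor(latent))
    actions = dist.sample()       # (8,) actions used to step the environment model
```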
- Create an interface that receives an observation (and action) and outputs the next observation + a predicted reward
- Implement this functionality as a recurrent NN (which loss function: MSE? cross-entropy?). Prediction happens in pixel space
- Implement a conditioned $\beta$-VAE (C-$\beta$-VAE) to move away from pixel space
- Figure out how to train the C-$\beta$-VAE online while training policies (how to fit its training into our RL loop)
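A sketch of the environment-model interface described above, assuming prediction in a learned latent space (e.g. the C-$\beta$-VAE code) rather than pixels; all names and dimensions are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EnvModel(nn.Module):
    """Hypothetical recurrent environment model:
    (latent, action, hidden) -> (next latent, predicted reward, new hidden)."""
    def __init__(self, latent_dim=32, n_actions=4, hidden_dim=128):
        super().__init__()
        self.n_actions = n_actions
        self.rnn = nn.GRUCell(latent_dim + n_actions, hidden_dim)
        self.next_latent = nn.Linear(hidden_dim, latent_dim)
        self.reward = nn.Linear(hidden_dim, 1)

    def forward(self, latent, action, hidden):
        a = F.one_hot(action, self.n_actions).float()      # condition on the action taken
        hidden = self.rnn(torch.cat([latent, a], dim=-1), hidden)
        return self.next_latent(hidden), self.reward(hidden).squeeze(-1), hidden
```

For latent-space targets an MSE loss is the natural choice; cross-entropy fits better when predicting discretized pixels directly.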
- The rollout encoding is carried out by a recurrent convolutional network
- Create an LSTM which encodes the imagined trajectories backwards into rollout embeddings
- Many model architectures can be used here (cf. the PlaNet algorithm and related papers)
- Implement a vanilla approach first, taking great care with data-representation manipulation, i.e. switching between batched and sequence representations
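The backwards-encoding LSTM above can be sketched as follows; the time-major `(T, B, F)` sequence layout is the one `nn.LSTM` expects by default, which is the batched-vs-sequence representation issue mentioned above. Names and sizes are assumptions:

```python
import torch
import torch.nn as nn

class RolloutEncoder(nn.Module):
    """Encodes one imagined trajectory into a fixed-size rollout embedding.
    Inputs are time-major: features (T, B, feat_dim), rewards (T, B)."""
    def __init__(self, feat_dim=32, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + 1, hidden_dim)  # state features + predicted reward

    def forward(self, features, rewards):
        x = torch.cat([features, rewards.unsqueeze(-1)], dim=-1)
        x = x.flip(0)                 # reverse time: the LSTM reads the rollout backwards
        _, (h, _) = self.lstm(x)
        return h.squeeze(0)           # (B, hidden_dim) rollout embedding
```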
- Create an aggregator which simply concatenates the rollout embeddings into an imagination code.
- Alternatively, use an attention mechanism over the rollout embeddings
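The plain concatenation aggregator is a one-liner; a sketch under the assumption that one embedding is produced per rollout (an attention-based aggregator would replace the concatenation with a weighted sum):

```python
import torch

def aggregate(embeddings):
    """Concatenate per-rollout embeddings (n_rollouts, B, D) into one
    imagination code of shape (B, n_rollouts * D)."""
    n, b, d = embeddings.shape
    return embeddings.permute(1, 0, 2).reshape(b, n * d)
```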
- Concatenate the imagination code with the last layer of the model-free path (PPO) and feed the result into a fully connected layer which outputs an action distribution and a value function
- Compute the loss and backpropagate gradients (FIGURE OUT: which components receive gradients?)
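A sketch of the output head joining both paths; dimensions are assumptions. On the loss question: in the original I2A paper the whole network is trained end-to-end with the standard policy-gradient loss (here that would be the PPO objective), plus an auxiliary distillation cross-entropy keeping the rollout policy close to the full I2A policy, while the environment model is trained separately on its prediction loss:

```python
import torch
import torch.nn as nn

class I2AHead(nn.Module):
    """Joins the imagination code with the model-free (PPO) features and
    outputs policy logits and a state value. Dimensions are assumptions."""
    def __init__(self, code_dim=320, mf_dim=256, n_actions=4, hidden=256):
        super().__init__()
        self.fc = nn.Linear(code_dim + mf_dim, hidden)
        self.policy = nn.Linear(hidden, n_actions)
        self.value = nn.Linear(hidden, 1)

    def forward(self, code, mf_feat):
        h = torch.relu(self.fc(torch.cat([code, mf_feat], dim=-1)))
        return self.policy(h), self.value(h).squeeze(-1)
```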