Probabilistic Artificial Intelligence 2023

Task 0: Implementation of Bayesian Inference

The goal of this simple exercise is to calculate the posterior probabilities $P(H_i|X)$ for $i=1,2,3$ using Bayes' theorem. According to this formula, the probability of the hypothesis $H_i$ given $X$ is obtained by multiplying the likelihood of $X$ given $H_i$ by the prior probability of the hypothesis $H_i$ and dividing the result by the marginal likelihood. When $X$ consists of several independent observations, this likelihood is the product of the per-observation likelihoods, which becomes a sum of log-probabilities in log space.
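The computation can be sketched in log space as follows. This is a minimal illustration, not the submitted code: the Gaussian hypotheses, the sample values, and the priors are hypothetical, and the function names are made up for this example.

```python
import math

def gaussian_log_likelihood(xs, mean, std):
    """Sum of per-observation log densities of i.i.d. Gaussian samples."""
    return sum(
        -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)
        for x in xs
    )

def posterior_probabilities(log_likelihoods, priors):
    """Return P(H_i | X) for each hypothesis via Bayes' theorem."""
    # Unnormalized log posterior: log p(X | H_i) + log P(H_i)
    log_joint = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    # Log marginal likelihood via log-sum-exp for numerical stability
    m = max(log_joint)
    log_evidence = m + math.log(sum(math.exp(lj - m) for lj in log_joint))
    return [math.exp(lj - log_evidence) for lj in log_joint]

# Hypothetical setup: three Gaussian hypotheses (mean, std) and their priors
samples = [0.1, -0.3, 0.2, 0.05]
hypotheses = [(0.0, 1.0), (1.0, 1.0), (-1.0, 1.0)]
priors = [0.35, 0.25, 0.40]

log_liks = [gaussian_log_likelihood(samples, mu, sd) for mu, sd in hypotheses]
posterior = posterior_probabilities(log_liks, priors)
```

Working with log-probabilities avoids numerical underflow when many observations are multiplied together.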

Task 1: Gaussian Process Regression

The goal of this task was to model air pollution measurements using Gaussian Process Regression and to predict the pollution at new locations. To implement the Gaussian process regressor I used the Sklearn library. Since the GP prior has zero mean, I enabled the option to normalize the target values; the normalization is reverted before the predictions are returned. To find a suitable kernel function I implemented a 3-fold cross-validation that tested 5 different kernels and computed the asymmetric cost for each model. The sum of a Matern kernel with nu=1.5 and length_scale=0.054 and a white kernel with noise_level=0.00528 achieved the lowest average cost over all folds. The white kernel was added because the training data contains noise; for comparison, I also tested one kernel without a white kernel. The other kernels I tested can be found inside the model_selection method, and the corresponding results are listed as comments in the main method.

To make underprediction less likely in areas with binary value 1, I set the predictions for these areas to the sum of the posterior mean and the posterior standard deviation.

To handle the large dataset, I used the train_test_split method from Sklearn to randomly sample 70% of the data as the training set. This keeps the computational load manageable while retaining most of the data.
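The prediction adjustment for the flagged areas can be sketched as below. This is a simplified stand-alone version that works on plain lists; the function name and the example values are hypothetical, and the posterior mean and standard deviation are assumed to come from the fitted GP.

```python
def adjust_predictions(post_mean, post_std, area_flags):
    """Shift predictions up by one posterior std where the area flag is 1."""
    return [
        mean + std if flag == 1 else mean
        for mean, std, flag in zip(post_mean, post_std, area_flags)
    ]

# Hypothetical posterior statistics for three test locations
post_mean = [10.0, 12.5, 8.0]
post_std = [1.5, 0.5, 2.0]
area_flags = [1, 0, 1]

predictions = adjust_predictions(post_mean, post_std, area_flags)
```

Locations with flag 0 keep the posterior mean, so the upward shift is applied only where underprediction is penalized more heavily.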

Task 2: Approximate Bayesian Inference in Neural Networks via SWA-Gaussian

The goal of this task is to fit an approximate Gaussian posterior over the weights of the neural network using SWAG. This approach yields a better calibrated model, whose confidence is more closely aligned with its actual performance. First, I implemented SWAG-Diagonal, which approximates the posterior distribution as N(theta_swa, Sigma_diag) and managed to pass the easy baseline without adjusting the hyperparameters. The full SWAG approach (described in Algorithm 1 of Maddox et al. 2019) extends the variance estimate by a low-rank covariance approximation obtained from the deviation matrix D_hat, whose columns store the deviations of the last K parameter snapshots from the running mean. With it, all baselines could be passed with cost 0.829 and ECE 0.049, again with the hyperparameters given by the template.

To further improve the predictions of the model, I implemented a learning rate schedule that keeps the learning rate constant at 0.045 until epoch 10 and then decays it linearly to a final learning rate of 0.03. The constant phase allows the model to explore the solution space more broadly. The linear decay phase helps in fine-tuning the weights, which leads to a more precise convergence to a (hopefully) global minimum. Increasing the number of epochs from 30 to 40 improved the performance further. With this setup, a public cost of 0.795 and an ECE of 0.067 were reached.

I also tested an additional post-hoc calibration using temperature scaling (motivated by the successes reported in papers), but it improved neither the cost nor the ECE. The likely reason is overfitting to the validation data, since this data set was used to fit the temperature with LBFGS (learning rate of 0.01, at most 40 iterations). A possible way to reduce this overfitting would be to use cross-validation.
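The learning rate schedule can be sketched as a plain function of the epoch index. This is an illustration under assumptions: epochs are taken to be zero-indexed, the decay is assumed to reach the final value exactly in the last of 40 epochs, and the function name is hypothetical.

```python
def learning_rate(epoch, total_epochs=40, constant_epochs=10,
                  initial_lr=0.045, final_lr=0.03):
    """Constant learning rate first, then linear decay to final_lr."""
    if epoch < constant_epochs:
        return initial_lr
    # Fraction of the decay phase completed: 0 at its start, 1 in the last epoch
    decay_steps = max(total_epochs - 1 - constant_epochs, 1)
    progress = min((epoch - constant_epochs) / decay_steps, 1.0)
    return initial_lr + progress * (final_lr - initial_lr)

# One learning rate per epoch for the 40-epoch setup
schedule = [learning_rate(epoch) for epoch in range(40)]
```

In a training loop, the value for the current epoch would be written into the optimizer before the epoch starts.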

Task 3: Bayesian Optimization

The goal of this task is to implement Bayesian Optimization to tune the structural features of a drug candidate. First, we implement the Gaussian process regressors for the functions f and v based on the values provided in the description. We achieve the best fit using an RBF kernel with a length scale of 0.5 and variance of 0.5 for the function f; for the constraint function v, we use a Matern kernel with a length scale of 1 and nu = 2.5. Since the function f maps the structural features into a range between 0 and 1, we achieve better results by normalizing the targets using the normalize_y=True argument of the GP regressor.

For the acquisition function, we tested the upper confidence bound, Thompson sampling, and expected improvement. We implemented the constraint v using Lagrangian relaxation for the first two approaches. For the latter, we used the approach described in the first paper mentioned in the task description. For regions that satisfy the constraint with high probability (≥ 0.95), the acquisition function returned EI((mean_f - f_optimal - xi)/std_f)*cdf(4, mean_v, std_v); otherwise it returned only cdf(4, mean_v, std_v), so that x was optimized for safety alone until a safe region had been found. None of these approaches consistently reached the hard baseline, mainly because they returned too many unsafe points. To decrease the number of unsafe points, one can assume that the probabilistic constraint is violated everywhere. In this case, x_optimal is found by using only the cumulative distribution function of the constraint v, which steers the search toward a safe region. This leads us to our best acquisition function, called ImprovementBasedOnConstraintCDF. Although this approach favors safe points, it is not necessarily the best method to reduce regret. For that, the objective function needs to be considered in the optimization process too.
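The constrained expected improvement variant can be sketched for a single candidate point as follows. This is a minimal sketch, assuming a maximization problem, Gaussian posteriors for f and v, and the standard closed-form EI; the function names and the default xi=0.01 are assumptions made for this example.

```python
import math

def normal_cdf(x, mean=0.0, std=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def expected_improvement(mean_f, std_f, f_best, xi=0.01):
    """Closed-form EI over the best observed value f_best (maximization)."""
    if std_f <= 0.0:
        return max(mean_f - f_best - xi, 0.0)
    z = (mean_f - f_best - xi) / std_f
    return std_f * (z * normal_cdf(z) + normal_pdf(z))

def constrained_acquisition(mean_f, std_f, mean_v, std_v, f_best,
                            threshold=4.0, min_safe_prob=0.95, xi=0.01):
    """EI weighted by P(v(x) < threshold) in likely-safe regions, else P alone."""
    prob_safe = normal_cdf(threshold, mean_v, std_v)
    if prob_safe >= min_safe_prob:
        return expected_improvement(mean_f, std_f, f_best, xi) * prob_safe
    return prob_safe

# Hypothetical posterior values at a likely-safe and a likely-unsafe point
safe_score = constrained_acquisition(0.8, 0.1, 2.0, 0.5, f_best=0.6)
unsafe_score = constrained_acquisition(0.8, 0.1, 5.0, 0.5, f_best=0.6)
```

Note that the two branches live on different scales, so an optimizer comparing values across both regions may prefer a point from the fallback branch over a point with a small weighted EI.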

For the next_recommendation method, we call optimize_acquisition_function. In an earlier version, we added some randomization whenever the last consecutive points were too close to each other, to avoid getting stuck in a local optimum. This did not lead to any major improvement.

The best solution is selected by first filtering out all points that do not satisfy the constraint v(x) < 4. From the remaining feasible points, the structural feature x_optimal is the one with the largest value of f. For a better understanding of the Bayesian optimization process, we implemented a plot function that shows the objective and constraint functions, the estimated optimal point, the initial point, and all other evaluated points.
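The selection of the final solution can be sketched as below. The function name and the example observations are hypothetical; returning None when no point is feasible is an assumption of this sketch, not necessarily what the submitted code does.

```python
def select_best_feasible(xs, f_values, v_values, threshold=4.0):
    """Return the x with the largest f among points satisfying v(x) < threshold."""
    feasible = [(f, x) for x, f, v in zip(xs, f_values, v_values) if v < threshold]
    if not feasible:
        return None  # assumption: no feasible point observed so far
    best_f, best_x = max(feasible, key=lambda pair: pair[0])
    return best_x

# Hypothetical observations: the point with the highest f (x=3.5) is infeasible
xs = [1.0, 3.5, 6.2, 8.0]
f_values = [0.2, 0.9, 0.7, 0.4]
v_values = [3.0, 4.5, 3.9, 2.0]

x_optimal = select_best_feasible(xs, f_values, v_values)
```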

Task 4: Soft Actor-Critic Implementation

In this task, I implemented SAC with automatic entropy adjustment, using one actor and two critic neural networks.

The actor class needs to return the action and the corresponding log probability. The given papers suggest using a stochastic policy for training and a deterministic policy for testing. The stochastic policy samples an action from a clamped normal distribution, with mean and standard deviation returned by the actor NN, and bounds it using tanh. The deterministic policy returns the tanh-bounded mean. To calculate the log_prob of the action in the stochastic case I implemented Eq 21 (ref. in code). To prevent log(1 - tanh^2(a)) = log(0) I added a small value (1e-6) to the argument.
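The two policy modes and the log-probability correction can be sketched for a one-dimensional action as follows. This is a simplified illustration without a neural network: the function name is hypothetical, the clamp is assumed to act on the log standard deviation, and the clamp range [-20, 2] is an assumption rather than a value from the report.

```python
import math
import random

def sample_squashed_action(mean, log_std, deterministic=False, eps=1e-6, rng=random):
    """Return (action, log_prob) for a 1-D tanh-squashed Gaussian policy."""
    if deterministic:
        # Testing: return the bounded mean, no log probability needed
        return math.tanh(mean), None
    log_std = max(min(log_std, 2.0), -20.0)  # assumed clamp range
    std = math.exp(log_std)
    u = rng.gauss(mean, std)  # pre-squash sample
    action = math.tanh(u)
    # Log density of u under the Gaussian
    log_prob_u = -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * math.log(2 * math.pi)
    # Change-of-variables correction for tanh; eps avoids log(0)
    log_prob = log_prob_u - math.log(1.0 - action ** 2 + eps)
    return action, log_prob

train_action, train_log_prob = sample_squashed_action(0.3, -1.0, rng=random.Random(0))
test_action, _ = sample_squashed_action(0.3, -1.0, deterministic=True)
```

For multi-dimensional actions, the per-dimension terms would be summed to obtain the log probability of the whole action.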

To set up the agent class I added an actor, two critics, and two target critics (all with 2 hidden layers of size 256). Inside the get_action method, I return the action computed by the actor, which uses either a deterministic policy (testing) or a stochastic policy (training).

In the training method, I first implemented the critic update according to Eq 7 (ref. in code). Since Q_hat is an expectation, I approximated it with samples using "r + gamma * (min(Q1, Q2) - alpha * log(pi(a'|s')))", where a' is sampled from the policy given s'. After calculating the MSE loss for both critics, a gradient step is taken and the target networks are updated using Polyak averaging. Next, the policy is updated based on the cost function described in Eq. 12 (ref. in code). Finally, I optimize the entropy temperature using Eq. 18 (ref. in code).
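The sampled critic target and the Polyak update can be sketched on plain numbers as follows. This is a minimal illustration with hypothetical function names; the values of gamma, alpha, and tau are assumptions, and parameters are represented as flat lists instead of network weights.

```python
def soft_q_target(reward, q1_next, q2_next, log_prob_next, alpha, gamma):
    """Sampled target: r + gamma * (min(Q1, Q2) - alpha * log pi(a'|s'))."""
    return reward + gamma * (min(q1_next, q2_next) - alpha * log_prob_next)

def polyak_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [(1.0 - tau) * t + tau * p for t, p in zip(target_params, online_params)]

# Hypothetical transition: both Q-values are evaluated at (s', a')
target = soft_q_target(reward=1.0, q1_next=2.0, q2_next=1.5,
                       log_prob_next=-0.5, alpha=0.2, gamma=0.99)

# Hypothetical parameters with an assumed tau of 0.005
new_target_params = polyak_update([0.0, 1.0], [1.0, 3.0], tau=0.005)
```

Taking the minimum of the two Q-values counteracts overestimation, and the small tau keeps the target networks changing slowly between updates.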