The great power of deep learning, particularly on large unstructured data, comes at the price of an intrinsic fragility of these methods. Adversarial attacks pose serious cybersecurity concerns and have attracted a large body of work in the past few years.
In this area, we have explored the intrinsic robustness of Bayesian methods, proving that a fully trained Bayesian model (i.e. in the limit of an infinite dataset and training to zero loss) is intrinsically robust to gradient-based attacks. We have also examined the finite-data regime, showing that properly trained Bayesian methods are more robust and that their robustness correlates with training accuracy, and we have studied how approximate inference affects these conclusions. Inspired by this work, we investigated the stability of explanations under adversarial attacks in the Bayesian setting. We are now exploring purification-based strategies to improve robustness, having previously investigated noise-based defences leveraging random projections.
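To fix notation for what a gradient-based attack means in this setting, the sketch below is an illustrative, minimal example only (the network list, variable names and FGSM-style step are our own assumptions, not code from the cited work): it perturbs an input along the input gradient of the loss of the posterior-averaged predictive distribution, which is precisely the quantity whose posterior expectation the intrinsic-robustness result shows to vanish in the infinite-data, zero-loss limit.

```python
# Illustrative sketch (not the papers' code): an FGSM-style gradient attack
# against a Bayesian predictor approximated by a set of posterior samples.
import torch
import torch.nn.functional as F

def fgsm_on_posterior(posterior_nets, x, y, eps):
    """Perturb x along the sign of the input gradient of the loss computed on
    the posterior-averaged predictive. `posterior_nets` is assumed to be a list
    of networks sampled from (an approximation of) the posterior."""
    x = x.clone().requires_grad_(True)
    # Posterior predictive: average the per-sample class probabilities.
    probs = torch.stack([F.softmax(net(x), dim=-1) for net in posterior_nets]).mean(0)
    loss = F.nll_loss(torch.log(probs + 1e-12), y)
    grad, = torch.autograd.grad(loss, x)
    # If the expected input gradient vanishes, this attack direction carries
    # essentially no information about the model's vulnerabilities.
    return (x + eps * grad.sign()).detach()
```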
Keywords: adversarial robustness of Bayesian methods, adversarial defences based on purification, robustness of explanations for Bayesian methods, noise-based adversarial defences, adversarial attacks on protein language models.
What follows is a reasoned bibliography, in which each class of contributions is briefly described and the relevant references are cited.