Proceedings of Machine Learning Research 189, 2022 ACML 2022
BayesAdapter: Being Bayesian, Inexpensively and Reliably,
via Bayesian Fine-tuning
Zhijie Deng [email protected]
Qing Yuan Research Institute, Shanghai Jiao Tong University
Jun Zhu
Dept. of Comp. Sci. & Tech., BNRist Center, THU-Bosch Joint ML Center, Tsinghua University
Editors: Emtiyaz Khan and Mehmet Gönen
Abstract
Despite their theoretical appeal, Bayesian neural networks (BNNs) lag behind in real-world adoption, mainly due to persistent concerns about their scalability, accessibility, and reliability. In this work, we develop the BayesAdapter framework to relieve these concerns. In particular, we propose to adapt pre-trained deterministic NNs into variational BNNs via cost-effective Bayesian fine-tuning. Technically, we develop a modularized implementation for the learning of variational BNNs, and refurbish the generally applicable exemplar reparameterization trick through exemplar parallelization to efficiently reduce the gradient variance in stochastic variational inference. Based on this lightweight Bayesian learning paradigm, we conduct extensive experiments on a variety of benchmarks, and show that our method consistently induces posteriors of higher quality than competitive baselines while significantly reducing training overhead. Code is available at https://github.com/thudzj/ScalableBDL.
Keywords: Bayesian neural networks; variational inference; uncertainty quantification.
1. Introduction
Much effort has been devoted to developing expressive Bayesian neural networks (BNNs)
to make accurate and reliable decisions (MacKay, 1992; Neal, 1995; Graves, 2011; Blundell
et al., 2015). The principled uncertainty quantification capacity of BNNs is critical for
realistic decision-making, finding applications in scenarios ranging from model-based rein-
forcement learning (Depeweg et al., 2016), active learning (Hernández-Lobato and Adams,
2015) to healthcare (Leibig et al., 2017) and autonomous driving (Kendall and Gal, 2017).
BNNs are also known to be capable of resisting over-fitting and over-confidence.
Nonetheless, BNNs are falling far behind in terms of adoption in real-world applications
compared with deterministic NNs (He et al., 2016a; Vaswani et al., 2017), due to various
issues. For example, typical approximate inference methods for BNNs often struggle to maintain efficacy and scalability simultaneously (Zhang et al., 2018; Maddox et al., 2019).
Implementing a BNN algorithm requires substantially more expertise than implementing a
deterministic NN program. Moreover, as revealed, BNNs trained from scratch without the
“cold posterior” trick are often systematically worse than their point-estimate counterparts
in terms of predictive performance (Wenzel et al., 2020a); some easy-to-use BNNs (e.g.,
Figure 1: (Left): BayesAdapter boosts the accuracy of ImageNet classifiers without compromising
model calibration (estimated by expected calibration error (ECE) (Guo et al., 2017)). By contrast,
deterministic fine-tuning only marginally improves the accuracy of pre-trained models yet aggra-
vates over-confidence. RN refers to ResNet (He et al., 2016a) and DN refers to DenseNet (Huang
et al., 2017). (Right): BayesAdapter learns a CIFAR-10 classifier which approaches or outperforms
competing baselines in terms of ECE for CIFAR-10 corruptions (Hendrycks and Dietterich, 2019).
Each box summarizes the ECE across 19 types of skew. We perform Bayesian fine-tuning for only
4 and 12 epochs on ImageNet and CIFAR-10 respectively. More details are deferred to Sec 4.
Monte Carlo dropout) tend to suffer from mode collapse in function space and thus usually give uncertainty estimates of poor fidelity (Fort et al., 2019).
To mitigate these issues, we develop a pre-training & fine-tuning workflow for learning variational BNNs, leveraging an inherent connection between variational BNNs (Blundell et al., 2015) and regular deep neural networks (DNNs). The resultant BayesAdapter framework
learns a variational BNN by performing several rounds of Bayesian fine-tuning, starting
from a pre-trained deterministic NN. BayesAdapter is effective and lightweight, and con-
joins the complementary benefits from deterministic training and Bayesian reasoning, e.g.,
performance matching the pre-trained deterministic models, resistance to over-fitting, reli-
able uncertainty estimates, etc. (see Figure 1 for evidence).
To improve the usability of BayesAdapter, we provide a modularized implementation
for the stochastic variational inference (SVI) under multiple representative variational dis-
tributions, including mean-field Gaussian and parameter-sharing ensemble. Reducing the
variance of stochastic gradients is crucial for stabilizing and accelerating SVI, but the
pioneering works such as local reparameterization (Kingma et al., 2015) and Flipout (Wen
et al., 2018) can only deal with specific variational distributions, e.g., Gaussians and dis-
tributions whose samples can be reparameterized with symmetric perturbations. To tackle
this issue, we refurbish the widely-criticized exemplar reparameterization (Kingma et al.,
2015) by accelerating the exemplar-wise computations through parallelization, giving rise
to an efficient and general-purpose gradient variance reduction technique.
We conduct extensive experiments to validate the advantages of BayesAdapter over com-
peting baselines, in aspects covering efficiency, predictive performance, and quality of un-
certainty estimates. Desirably, we scale up BayesAdapter to big data (e.g., ImageNet (Deng
et al., 2009)), deep architectures (e.g., ResNets (He et al., 2016a)), and practical scenarios
(e.g., face recognition (Deng et al., 2019)), and observe promising results. We also perform
a series of ablation studies to reveal the characteristics of the proposed approach.
2. Background
In this section, we motivate BayesAdapter by drawing a connection between variational
BNNs and DNNs trained by maximum a posteriori (MAP) estimation.
Let $\mathcal{D} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ be a given training set, where $x^{(i)} \in \mathbb{R}^d$ and $y^{(i)} \in \mathcal{Y}$ denote the input data and label, respectively. A DNN model can be fit via MAP estimation:
$$\max_{w}\ \frac{1}{n}\sum_{i}\big[\log p(y^{(i)}\mid x^{(i)}; w)\big] + \frac{1}{n}\log p(w). \qquad (1)$$
We use $w \in \mathbb{R}^p$ to denote the high-dimensional model parameters, with $p(y|x; w)$ as the predictive distribution associated with the model. The prior $p(w)$, when taking the form of an isotropic Gaussian $\mathcal{N}(w; 0, \sigma_0^2 I)$, reduces to the weight decay regularizer with coefficient $\lambda = 1/(\sigma_0^2 n)$ in optimization. Nevertheless, deterministic training may easily cause over-fitting and over-confidence, rendering the learned models of poor reliability (see Figure 1).
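As a concrete instance of this correspondence (the numbers below simply instantiate the formula with the CIFAR-10 settings from Sec. 4.1; they are our illustration, not an additional result):
$$\lambda = \frac{1}{\sigma_0^2 n} \;\Longrightarrow\; \sigma_0^2 = \frac{1}{\lambda n} = \frac{1}{2\times 10^{-4}\cdot 5\times 10^{4}} = 0.1,$$
i.e., a weight decay coefficient of 2e-4 on CIFAR-10 ($n = 50{,}000$ training images) implicitly encodes an isotropic Gaussian prior with variance 0.1.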
Naturally, BNNs come into the picture to address these limitations.
Typically, BNNs learn by inferring the posterior p(w|D) given the prior p(w) and the
likelihood p(D|w). Among the wide spectrum of BNN algorithms (MacKay, 1992; Neal,
1995; Graves, 2011; Blundell et al., 2015; Liu and Wang, 2016; Gal and Ghahramani, 2016),
variational BNNs are particularly promising due to their analogy to ordinary backprop.
Formally, variational BNNs use a θ-parameterized variational distribution q(w|θ) to ap-
proximate p(w|D), by maximizing the evidence lower bound (ELBO) (scaled by 1/n):
$$\max_{\theta}\ \underbrace{\mathbb{E}_{q(w|\theta)}\Big[\frac{1}{n}\sum_{i}\log p(y^{(i)}\mid x^{(i)}; w)\Big]}_{L_{\mathrm{ell}}} \;-\; \underbrace{\frac{1}{n} D_{\mathrm{KL}}\big(q(w|\theta)\,\|\,p(w)\big)}_{L_{c}}, \qquad (2)$$
where $L_{\mathrm{ell}}$ is the expected log-likelihood and $L_{c}$ is the complexity loss. By casting posterior inference into optimization, Eq. (2) makes the training of BNNs resemble that of DNNs.
After training, the variational posterior is leveraged for prediction through marginalization:
$$p(y|x, \mathcal{D}) \approx \mathbb{E}_{q(w|\theta)}\,p(y|x; w) \approx \frac{1}{S}\sum_{s=1}^{S} p(y|x; w^{(s)}), \qquad (3)$$
where $w^{(s)} \sim q(w|\theta)$, $s = 1, ..., S$, with $S$ denoting the number of Monte Carlo (MC) samples. Eq. (3) is known as the posterior predictive, Bayes ensemble, or Bayes model average.
We can simultaneously quantify the epistemic uncertainty with these MC samples. A
principled uncertainty metric is the mutual information between the model parameter and
the prediction (Smith and Gal, 2018), estimated by (H denotes the Shannon entropy):
$$I(w, y|x, \mathcal{D}) \approx H\!\left(\frac{1}{S}\sum_{s=1}^{S} p(y|x; w^{(s)})\right) - \frac{1}{S}\sum_{s=1}^{S} H\!\left(p(y|x; w^{(s)})\right). \qquad (4)$$
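To make Eqs. (3)-(4) concrete, below is a minimal PyTorch-style sketch of Bayes model averaging and the mutual-information estimate; `sample_fn` is a hypothetical hook that redraws $w \sim q(w|\theta)$ inside the model, and the code is our illustration rather than the paper's released implementation.

import torch
import torch.nn.functional as F

def bayes_predict(model, x, sample_fn, S=20):
    # Posterior predictive (Eq. 3) and epistemic uncertainty (Eq. 4) from S MC samples.
    # sample_fn(model) is assumed to redraw w ~ q(w|theta) inside the model.
    probs = []
    with torch.no_grad():
        for _ in range(S):
            sample_fn(model)
            probs.append(F.softmax(model(x), dim=-1))
    probs = torch.stack(probs)                                # [S, batch, classes]
    mean = probs.mean(dim=0)                                  # Bayes model average, Eq. (3)
    entropy = lambda p: -(p * p.clamp_min(1e-12).log()).sum(dim=-1)
    mutual_info = entropy(mean) - entropy(probs).mean(dim=0)  # mutual information, Eq. (4)
    return mean, mutual_info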
However, most of the existing variational BNNs exhibit limitations in scalability and
performance (Osawa et al., 2019; Wenzel et al., 2020a), compared with their determinis-
tic counterparts. This is mainly because learning high-dimensional distributions from scratch is harder than learning point estimates.
[Figure 2 diagram: a pre-trained DNN is adapted into a variational BNN; the MC estimate of the expected log-likelihood and the complexity loss provide gradients to the variational parameters during Bayesian fine-tuning.]
Figure 2: BayesAdapter adapts pre-trained DNNs to be variational BNNs and then launches a few rounds of Bayesian fine-tuning. We provide a modularized implementation for Bayesian fine-tuning, allowing users to learn a variational BNN as if training a DNN under a weight decay regularizer.
Given that MAP converges to a mode of the Bayesian posterior, it might be plausible to adapt pre-trained deterministic DNNs to be Bayesian economically. Following this hypothesis, we repurpose the converged parameters $w^*$ of MAP: we take $w^*$ as the initialization of the parameters of the approximate posterior. Laplace approximation (MacKay, 1992) is a classic method in this spirit, which assumes a Gaussian approximate posterior and adopts $w^*$ and the local curvature at $w^*$ as the Gaussian mean and variance, respectively. Yet, Laplace approximation is inflexible and usually computationally prohibitive (only after the introduction of the Gauss-Newton approximation, the KFAC approximation, and particularly the last-layer approximation does the cost of Laplace approximation become affordable (Daxberger et al., 2021)). Alternatively, we develop the more practical Bayesian fine-tuning scheme, whose core notion is to fine-tune the imperfect approximate posterior by maximizing the ELBO.
3. BayesAdapter
We describe BayesAdapter in this section; Figure 2 illustrates the framework.
In BayesAdapter, the configuration of the variational distribution $q(w|\theta)$ plays a decisive role. Although a wealth of variational families have emerged for adoption (Louizos and Welling, 2016; Li and Turner, 2017; Wen et al., 2018), on one hand, more complicated ones (Louizos and Welling, 2017; Shi et al., 2018b) are routinely accompanied by less scalable learning; on the other hand, the aforementioned hypothesis inspiring Bayesian fine-tuning entails an explicit alignment between the DNN parameters $w^*$ and the variational parameters $\theta$. Thus, we primarily consider the typical mean-field Gaussian distribution as well as a more powerful one that resembles Deep Ensemble (Lakshminarayanan et al., 2017).
3.1. Mean-field Gaussian (MFG) Variational
Without loss of generality, we write the MFG variational as $q(w|\theta) = \mathcal{N}(w; \mu, \mathrm{diag}(\exp(2\psi)))$, with $\mu, \psi \in \mathbb{R}^p$ denoting the mean and the logarithm of the standard deviation, respectively. In this sense, we can naturally initialize $\mu$ with $w^*$ at the beginning of fine-tuning to ease approximate inference and to enable the investigation of more qualified posterior modes.
As in MAP, we also assume an isotropic Gaussian prior $p(w) = \mathcal{N}(w; 0, \sigma_0^2 I)$. Then the gradients of the complexity loss can be derived analytically:
$$\nabla_{\mu} L_{c} = \lambda\mu, \qquad \nabla_{\psi} L_{c} = \lambda\exp(2\psi) - \frac{1}{n}, \qquad \text{with } \lambda = \frac{1}{\sigma_0^2 n}. \qquad (5)$$
Intuitively, the above gradients for the variational parameters correspond to a variant of the vanilla weight decay in DNNs. Having identified this, we can implement a module similar to weight decay to implicitly account for the complexity loss, leaving only the expected log-likelihood $L_{\mathrm{ell}}$ to be handled explicitly. We will elaborate on the details of maximizing $L_{\mathrm{ell}}$ after presenting a more expressive variational configuration.
3.2. Parameter-sharing Ensemble (PSE) Variational
Despite its simplicity, the MFG variational can be limited in expressiveness for capturing the multi-modal parameter posterior of over-parameterized neural networks. Motivated by the observation that Deep Ensemble (Lakshminarayanan et al., 2017) is a compelling Bayesian marginalization mechanism in deep learning (Wilson and Izmailov, 2020), we intend to develop a low-cost ensemble-like variational for more practical Bayesian deep learning.
Specifically, we first define the variational as a uniform mixture of $C$ Gaussians: $q(w|\theta) = \frac{1}{C}\sum_{c}\mathcal{N}(w; w^{(c)}, \Sigma^{(c)})$, where $\Sigma^{(c)} \in \mathbb{R}^{p\times p}$ is positive-definite and its elements are independent of the dimension $p$.¹ In this sense, the complexity loss boils down to the KL divergence between a mixture of Gaussians and a Gaussian, which, however, cannot be calculated analytically in general. Nevertheless, under the mild assumption that $w^{(c)} \in \mathbb{R}^p$ is normally distributed and $p$ is large enough, the KL divergence can be approximated by a weighted sum of the KL divergences between the Gaussian components and the Gaussian prior (refer to (Gal and Ghahramani, 2015) for detailed discussion and proof). Namely,
$$\frac{1}{n} D_{\mathrm{KL}}\!\left(\frac{1}{C}\sum_{c}\mathcal{N}(w; w^{(c)}, \Sigma^{(c)})\,\Big\|\,\mathcal{N}(w; 0, \sigma_0^2 I)\right) \approx \frac{1}{2\sigma_0^2 n C}\sum_{c=1}^{C}\Big(\|w^{(c)}\|_2^2 + \mathrm{trace}(\Sigma^{(c)}) - \sigma_0^2 \log|\Sigma^{(c)}|\Big) + \text{constant}. \qquad (6)$$
Based on the observation that Bayesian model average benefits significantly more from the exploration of new modes than from navigation around a local mode (Wilson and Izmailov, 2020), we assume $\Sigma^{(c)}$, $c = 1, ..., C$, to be a constant diagonal matrix $\sigma^2 I$ with $\sigma^2$ approaching 0. Namely, we purely chase multi-mode exploration and leave the joint optimization of $\Sigma^{(c)}$ and $w^{(c)}$ for future investigation. Then, $q(w|\theta)$ almost amounts to a mixture of deltas (i.e., an ensemble), and with high probability we can approximate the realisation of $w$ by a uniform sample from $\{w^{(1)}, ..., w^{(C)}\}$. Meanwhile, the complexity loss approximately becomes $\frac{\lambda}{2C}\sum_{c=1}^{C}\|w^{(c)}\|_2^2 + \text{constant}$, and we can easily implement a weight decay-like module to account for its gradient. We comment here that it may be more plausible to alternatively leverage the rigorous quasi-KL divergence (Hron et al., 2018) for estimating the divergence between a mixture of deltas and the Gaussian prior, which we leave as future work.
1. We define the variational as a mixture of Gaussians instead of a mixture of deltas to ensure the variational
is absolutely continuous w.r.t. the prior. This avoids the singularity issue in variational inference (Hron
et al., 2018).
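Analogously to the MFG case, a weight-decay-like sketch accounting for this approximate complexity loss could look as follows (again our own illustration with hypothetical names):

import torch

@torch.no_grad()
def add_pse_complexity_grad(components, lam):
    # components: the C per-component weight tensors w^(c), with .grad already populated
    C = len(components)
    for w_c in components:
        w_c.grad.add_((lam / C) * w_c)   # gradient of (lam / (2C)) * ||w^(c)||_2^2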
Simulating an ensemble is far from our ultimate goal due to the required high cost.
To make the variational economical, we explore a valuable insight from recent works (Wen
et al., 2020; Wenzel et al., 2020b) that the parameters of different ensemble components
can be partially shared without undermining effectiveness.
Specifically, abusing $w$ to denote the parameter matrix of size $m_{\mathrm{in}} \times m_{\mathrm{out}}$ in a neural network layer, we generate the $C$ components via $w^{(c)} = \big(l^{(c)} r^{(c)}\big) \odot \bar{w}$, $c = 1, ..., C$, where $\bar{w} \in \mathbb{R}^{m_{\mathrm{in}} \times m_{\mathrm{out}}}$ are the shared parameters and $l^{(c)} \in \mathbb{R}^{m_{\mathrm{in}} \times r}$ and $r^{(c)} \in \mathbb{R}^{r \times m_{\mathrm{out}}}$ correspond to a rank-$r$ decomposition of some perturbations; $\odot$ denotes element-wise multiplication. The shared parameters $\bar{w}$ can be initialized as $w^*$ to ease and speed up Bayesian fine-tuning. When the rank $r$ is suitably small, the above design can significantly reduce the model size and save training effort. Note that the previous works (Wen et al., 2020; Wenzel et al., 2020b) confine $r$ to be 1 to permit the adoption of a specific gradient variance reduction trick. In contrast, we loosen this constraint by using a more generally applicable variance reduction tactic, detailed below.
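A minimal sketch of this parameter-sharing scheme for a single layer (our own PyTorch illustration; tensor names are ours, not the library's):

import torch

def pse_components(w_bar, l, r):
    # w_bar: [m_in, m_out] shared parameters (initialized from the pre-trained w*)
    # l: [C, m_in, rank]; r: [C, rank, m_out] low-rank perturbation factors
    perturb = torch.bmm(l, r)              # [C, m_in, m_out] rank-r perturbations
    return perturb * w_bar.unsqueeze(0)    # w^(c) = (l^(c) r^(c)) elementwise-multiplied with w_bar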
3.3. A Reliable Estimation of the Expected Log-likelihood $L_{\mathrm{ell}}$
Given the high non-linearity of deep NNs and the large volume of data in real-world scenarios, we follow the stochastic variational inference (SVI) paradigm for estimating $L_{\mathrm{ell}}$.
Formally, given a mini-batch of data $\mathcal{B} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{|\mathcal{B}|}$, we solve
$$\max_{\theta}\ L_{\mathrm{ell}} = \frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\log p(y^{(i)}\mid x^{(i)}; w), \qquad (7)$$
where w is drawn from the MFG or PSE variational via reparameterization (Kingma and
Welling, 2013). The gradients w.r.t. the variational parameters can be derived automati-
cally with autodiff libraries, thus the training resembles that of regular DNNs.
However, the gradients derived from $L_{\mathrm{ell}}$ might exhibit high variance, caused by sharing the sampled parameters $w$ across the data in the mini-batch (Kingma et al., 2015). Popular techniques for addressing this issue typically assume a restrictive form of variational distribution (Kingma et al., 2015; Wen et al., 2018), struggling to handle structured distributions like the proposed PSE with rank $r > 1$. Fortunately, there is a generally applicable strategy for reducing gradient variance in stochastic variational inference named exemplar reparameterization (ER), which samples dedicated parameters for every exemplar in the mini-batch when estimating $L_{\mathrm{ell}}$:
$$L_{\mathrm{ell}} = \frac{1}{|\mathcal{B}|}\sum_{i=1}^{|\mathcal{B}|}\log p(y^{(i)}\mid x^{(i)}; w^{(i)}), \qquad w^{(i)} \stackrel{\text{i.i.d.}}{\sim} q(w|\theta),\ i = 1, ..., |\mathcal{B}|. \qquad (8)$$
We can see that the computational cost of ER is identical to that of vanilla reparameterization, but ER was criticized because the involved exemplar-wise computations could not be performed efficiently within the popular computation libraries of 2015 (Kingma et al., 2015). Given the rapid development of high-performance, device-specific kernel backends (e.g., cuDNN (Chetlur et al., 2014)) in recent years, we wonder whether this criticism still holds.
To this end, we first refurbish ER to fit modern ML frameworks. Our key insight here is to perform multiple exemplar-wise computations in parallel with a single kernel launch,
from torch.nn.functional import conv2d

def ER_conv(input, theta, stride, padding, groups):
    """input: [b, i, h, w]; theta: variational parameters."""
    b = input.shape[0]
    # sample a batch of conv filters w: [b, o, i, k, k], one filter set per exemplar
    w = mc_sample(theta, num_mc_samples=b)
    # reshape w to have shape [b*o, i, k, k]
    w = w.flatten(start_dim=0, end_dim=1)
    # reshape input to have shape [1, b*i, h, w]
    input = input.flatten(start_dim=0, end_dim=1).unsqueeze(0)
    # perform b convolutions in parallel via a single grouped convolution
    output = conv2d(input, w, stride=stride, padding=padding, groups=groups * b)
    # reshape the result to the standard [b, o, h', w'] format
    return output.view(b, -1, output.shape[2], output.shape[3])
Figure 3: (Left): Implementation of exemplar reparametrization for 2D convolution in
PyTorch (Paszke et al., 2019). (Right): Memory and time cost comparison among exemplar
reparametrization (ER), vanilla reparametrization (VR) (Kingma and Welling, 2013), local
reparametrization (LR) (Kingma et al., 2015), and Flipout (Wen et al., 2018) with mean-
field Gaussian variational used (estimated on ImageNet with ResNet-50 architecture).
e.g., organize exemplar-wise matrix multiplications as a batch matrix multiplication;
organize exemplar-wise convolutions as a group convolution (see Figure 3 (Left)). We
then conduct an empirical study on the computation cost of ER and relevant methods
using MFG variational. Figure 3 (Right) shows the results.
Surprisingly, ER’s time efficiency is comparable with that of local reparameteriza-
tion (Kingma et al., 2015) and Flipout (Wen et al., 2018), while its memory cost is even
lower. This is perhaps because local reparameterization and Flipout both need to calculate
and store one extra mini-batch of feature maps, which are rather large in ImageNet models.
Note that the added memory cost of ER over vanilla reparameterization comes from the
storage of a mini-batch of temporary parameters. We clarify that the primary merit of ER
over existing methods is the higher generality rather than better learning outcomes. We
hope that ER will benefit the further development of new variational distributions.
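As a companion to the convolution routine in Figure 3, the batch-matrix-multiplication route mentioned above can be sketched for a fully connected layer as follows (our own illustration assuming an MFG variational; not the library's exact code):

import torch

def ER_linear(input, mu, psi):
    # input: [b, d_in]; mu, psi: variational mean / log-std of shape [d_in, d_out]
    b = input.shape[0]
    # draw one dedicated weight sample per exemplar: [b, d_in, d_out]
    eps = torch.randn(b, *mu.shape, device=input.device)
    w = mu + torch.exp(psi) * eps
    # fuse b exemplar-wise matrix multiplications into one batched matmul
    return torch.bmm(input.unsqueeze(1), w).squeeze(1)   # [b, d_out]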
3.4. A Plug-and-play Library
We wrap the aforementioned modularized stochastic variational inference and the ER strategy for MFG and PSE into a plug-and-play Python library, freeing users from the difficulty of implementing BayesAdapter themselves.
4. Experiments
We apply BayesAdapter to a diverse set of benchmarks for empirical verification.
Settings. In general, we pre-train DNNs following standard protocols or fetch the pre-
trained checkpoints available online, and then perform Bayesian fine-tuning. We randomly
initialize the newly added variational parameters (e.g., ψ, l
(c)
, r
(c)
). Unless otherwise stated,
we set r = 1 and C = 20 for PSE and use the ER trick during training. We use S = 20 MC
samples to make predictions and quantify epistemic uncertainty. We conduct experiments
on 8 RTX 2080Ti GPUs. Full details are deferred to Appendix B.
Baselines. We consider extensive baselines including: (1) MAP, which is the fine-tuning starting point, (2) Laplace Approx., which performs Laplace approximation with a diagonal Fisher information matrix, (3) MC Dropout, which is a dropout variant of MAP,
Table 1: Comparison on test accuracy and negative log-likelihood (NLL) on the CIFAR-10 benchmark. We use underlining to mark results obtained with significantly more training effort. For BayesAdapter, we repeat every experiment three times and report the error bar.
Method Accuracy (%) NLL
MAP 96.92 0.1312
Laplace Approx. 96.41 0.1204
MC Dropout 96.95 0.1151
SWAG 96.32 0.1122
Deep Ensemble 97.40 0.0869
VBNN (MFG) 96.95 0.0994
VBNN (PSE) 96.88 0.1328
BayesAdapter (MFG) 97.10±0.03 0.1007±0.0014
BayesAdapter (PSE) 97.13±0.03 0.0936±0.0010
(4) VBNN, which refers to from-scratch trained variational BNNs. In particular, the vari-
ational BNN methods like BayesAdapter and VBNN are evaluated on both the MFG and
PSE variationals. We also include Deep Ensemble (Lakshminarayanan et al., 2017) and SWAG (Maddox et al., 2019) as baselines on the CIFAR-10 benchmark (Krizhevsky et al., 2009).²
4.1. CIFAR-10 Classification
We first conduct experiments on CIFAR-10 with wide-ResNet-28-10 architecture (Zagoruyko
and Komodakis, 2016). We perform Bayesian fine-tuning for 12 epochs with the weight
decay coefficient λ set as 2e-4. Table 1 outlines the comparison on prediction performance.
It is worth noting that BayesAdapter substantially outperforms MAP, Laplace Approx., and MC Dropout in terms of predictive performance. BayesAdapter also surpasses SWAG because SWAG does not directly benefit from high-performing pre-trained MAP models. The accuracy upper bound is Deep Ensemble, which trains 5 isolated MAPs and aggregates their predictions to explicitly investigate diverse function modes, but it is much more expensive than BayesAdapter. VBNN is clearly outperformed by BayesAdapter, confirming our claim that performing Bayesian fine-tuning from converged deterministic checkpoints helps explore more qualified posteriors. BayesAdapter (PSE) surpasses BayesAdapter (MFG), especially in terms of NLL. Unless specified otherwise, we refer to BayesAdapter (PSE) as BayesAdapter in the following.
Converged ELBO. We compare the converged (training) ELBO of BayesAdapter and VBNN: the former gives $L_{\mathrm{ell}} = -0.019$ and $L_{c} = 2806.8$ while the latter gives $L_{\mathrm{ell}} = -0.032$ and $L_{c} = 2384.3$. This implies that Bayesian fine-tuning makes the approximate posterior converge to somewhere with better data fitting than from-scratch VI.
CIFAR-10 corruptions. We then assess the quality of predictive uncertainty on
CIFAR-10 corruptions (Hendrycks and Dietterich, 2019). Figure 1 (Right) shows the results,
which reflect the efficacy of BayesAdapter for promoting model calibration.
2. Currently, we have not scaled Deep Ensemble and SWAG, which both require storing tens of copies of NN weights, up to ImageNet due to resource constraints.
Table 2: Comparison on test accuracy and NLL on ImageNet benchmark.
Method Accuracy (%) NLL
MAP 76.13 0.9618
Laplace Approx. 75.89 0.9739
MC Dropout 74.88 0.9884
VBNN (MFG) 75.97 0.9435
VBNN (PSE) 75.12 0.9865
BayesAdapter (MFG) 76.45±0.05 0.9303±0.0005
BayesAdapter (PSE) 76.80±0.03 0.9159±0.0010
Figure 4: Error vs. confidence plots
for models trained on CIFAR-10 and
tested on both CIFAR-10 and SVHN.
CIFAR-10 vs SVHN. Following He et al. (2020),
we evaluate the trained models on both CIFAR-10 and
SVHN (Netzer et al., 2011) test sets. For every confidence threshold $0 \le \tau < 1$, we compute the average error rate for predictions with confidence $\ge \tau$ (all predictions on SVHN data are regarded as incorrect). We depict the error vs. confidence curves in Figure 4. It is clear that BayesAdapter (PSE) makes more conservative predictions on the out-of-distribution (OOD) SVHN data than all baselines. BayesAdapter (PSE) even outperforms the expensive Deep Ensemble, implying that the parameter-sharing mechanism may impose further regularization on learning. The comparison also confirms that from-scratch variational BNNs, even with the PSE variational, have difficulties finding good posteriors.
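A minimal sketch of how such an error-versus-confidence curve can be computed (our own illustration of the protocol above, not the paper's evaluation code):

import numpy as np

def error_vs_confidence(confidences, correct, num_thresholds=100):
    # confidences: max predictive probability per test point (CIFAR-10 + SVHN)
    # correct: boolean array, with every OOD (SVHN) prediction marked as incorrect
    thresholds = np.linspace(0.0, 0.99, num_thresholds)
    errors = []
    for tau in thresholds:
        kept = confidences >= tau
        # average error rate among predictions whose confidence is at least tau
        errors.append(1.0 - correct[kept].mean() if kept.any() else 0.0)
    return thresholds, np.array(errors)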
Speedup. BayesAdapter requires 200 epochs of deterministic training plus 12 epochs of
variational training, while VBNN requires 200 epochs of variational training. Considering
the cost of variational training is several times (about 2.1×) that of deterministic training,
the training time saved by BayesAdapter is considerable.
4.2. ImageNet Classification
We then scale up BayesAdapter to ImageNet with ResNet-50 (He et al., 2016a) architecture.
We launch fine-tuning for merely 4 epochs with the weight decay coefficient λ set as 1e-4.
Table 2 reports the empirical comparison. As expected, most results are consistent with
those on CIFAR-10. In this large-scale scenario, it is even clearer that the from-scratch learning baseline VBNN suffers from local optima. The striking improvement of BayesAdapter over MAP validates the benefits of the Bayesian treatment. Zooming in, we also note that BayesAdapter (PSE) achieves remarkably higher accuracy than BayesAdapter (MFG), testifying to the superior expressiveness of PSE over MFG.
4.3. Face Recognition
To demonstrate the universality of BayesAdapter, we further apply it to the challenging
face recognition task based on MobileNetV2 architecture (Sandler et al., 2018). We train
Table 3: Accuracy comparison on open-set face recognition with MobileNetV2 architecture.
Method LFW CPLFW CALFW CFP-FF CFP-FP
MAP 98.2% 84.0% 87.6% 97.8% 92.7%
MC Dropout 98.2% 83.6% 87.3% 97.8% 92.8%
BayesAdapter (MFG) 98.4% 83.9% 85.8% 97.6% 92.9%
BayesAdapter (PSE) 98.4% 84.7% 87.8% 97.8% 93.1%
Table 4: Ablation study on the rank r of PSE. (ImageNet)
Method Accuracy (%) # of Param. (M)
MAP 76.13 25.56
BayesAdapter (PSE, r=1) 76.80 27.21
BayesAdapter (PSE, r=8) 76.78 38.76
BayesAdapter (PSE, r=16) 76.80 51.95
models on the CASIA dataset (Yi et al., 2014), and perform comprehensive evaluation on
face verification datasets including LFW (Huang et al., 2007), CPLFW (Zheng and Deng,
2018), CALFW (Zheng et al., 2017), and CFP (Sengupta et al., 2016). We launch fine-tuning for 4 epochs with λ = 5e-4. We compare our method to MAP and MC Dropout, two popular baselines in face recognition. We report the recognition accuracy in Table 3.
It is noteworthy that the Bayesian principle can induce better predictive performance for face recognition models. BayesAdapter (PSE) outperforms the fine-tuning starting point MAP and the popular baseline MC Dropout on most verification datasets, despite being fine-tuned for only a few epochs.
4.4. More Empirical Analyses
Table 5: Comparison on model calibration (ECE).
Method CIFAR-10 ImageNet
MAP 0.0198 0.0373
SWAG 0.0088 -
Deep Ensemble 0.0057 -
VBNN (MFG) 0.0074 0.0183
VBNN (PSE) 0.0188 0.0202
BayesAdapter (MFG) 0.0091 0.0289
BayesAdapter (PSE) 0.0058 0.0129
Model calibration on in-distribution data. We estimate the model calibration, measured by ECE, of various methods on in-distribution data, and report the results in Table 5. The ECE of SWAG on ImageNet is based on ResNet-152 (He et al., 2016b). Notably, the ECE of BayesAdapter (PSE) is on par with Deep Ensemble, significantly better than the other baselines.
The impact of the rank r for PSE. As stated, we set r = 1 in all the above studies for maximal parameter saving. Yet, does such a small rank confine the expressiveness of PSE? We perform an ablation study to pursue the answer. As shown in Table 4, the capacity of PSE is already sufficiently unleashed at rank 1, which introduces only marginally more parameters than MAP. This indicates the merit of PSE for efficient learning.
The effectiveness of exemplar reparameterization. We build a toy model with
only a convolutional layer and fix the model input and the target output. We employ the
Figure 5: (Left): Test accuracy varies w.r.t. the number of MC samples for Bayes ensemble.
(Right): Comparison on the accuracy for instance buckets of equal size but with rising uncertainty.
(BayesAdapter (MFG), ImageNet)
MFG variational on the convolutional parameters and compute the variance of the stochastic gradients across 500 runs. We average the gradient variance of µ and ψ over all their coordinates, and observe that vanilla reparameterization typically introduces 100× more variance than exemplar reparameterization.
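A sketch of how such a variance estimate can be obtained (our own illustration; `loss_fn` is assumed to rebuild the stochastic loss, drawing a fresh parameter sample, on the fixed input/target pair):

import torch

def empirical_grad_variance(loss_fn, named_params, runs=500):
    # named_params: list of (name, torch.nn.Parameter), e.g., [("mu", mu), ("psi", psi)]
    grads = {name: [] for name, _ in named_params}
    for _ in range(runs):
        for _, p in named_params:
            p.grad = None
        loss_fn().backward()                 # fresh MC sample of w inside loss_fn
        for name, p in named_params:
            grads[name].append(p.grad.detach().clone())
    # per-coordinate variance across runs, averaged over coordinates
    return {name: torch.stack(g).var(dim=0).mean().item() for name, g in grads.items()}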
The impact of the ensemble number. We plot the change of test accuracy w.r.t. the number of MC samples S for the Bayes ensemble in Figure 5 (Left). The model is trained by BayesAdapter (MFG) on ImageNet. The points on the red line represent the individual accuracies of the 100 parameter samples. The yellow dashed line refers to deterministic inference with only the Gaussian mean. The green line displays the effect of the Bayes ensemble: the predictive performance increases quickly from below 74% to above 76% before seeing 20 parameter samples, and gradually saturates after that.
Uncertainty-based rejective decision. In practice, we expect our models to be accurate on the data that they are certain about. In this spirit, we gather the epistemic uncertainty estimates for the ImageNet validation data given by BayesAdapter (MFG), based on which we divide the data into 10 equally sized buckets with increasing uncertainty. We depict the average accuracy of each bucket in Figure 5 (Right). As expected, our BNN is more accurate on instances with smaller uncertainty.
5. Related Work
Fruitful works have emerged in the BNN community in the last decade (Graves, 2011;
Welling and Teh, 2011; Blundell et al., 2015; Kingma and Welling, 2013; Balan et al., 2015;
Liu and Wang, 2016; Kendall and Gal, 2017; Wu et al., 2018). However, most of the existing
works fall short of practicality. For example, some works trade learning efficiency for flexible variational posteriors, leading to limited scalability (Louizos and
Welling, 2016, 2017; Shi et al., 2018a; Sun et al., 2019). Khan et al. (2018); Zhang et al.
(2018); Osawa et al. (2019) build Adam-like optimizers to do variational inference, but their
parallel training throughput and compatibility with data augmentation are inferior to SGD.
Approximate Bayesian methods like Monte Carlo dropout (Gal and Ghahramani, 2016) and
Deep Ensemble (Lakshminarayanan et al., 2017) can maintain good predictive performance
but suffer from degenerated uncertainty estimates (Fort et al., 2019) or high cost.
Laplace approximation (MacKay, 1992; Ritter et al., 2018) is a known approach to transform a DNN into a BNN, but it is inflexible due to its post-processing nature and the strong assumptions it makes for practical concerns. Alternatively, BayesAdapter works in the style of fine-tuning, which is more natural and economical for deep networks. Bayesian modeling of only the last layer of a DNN was proposed recently (Kristiadi et al., 2020), and its combination with BayesAdapter deserves investigation. BayesAdapter connects to MOPED (Krish-
nan et al., 2020) in that their variational configurations are both based on MAP. However,
MOPED solves the prior specification problem for BNNs while BayesAdapter constitutes a
practical framework to bring variational BNNs to the masses. In detail, MOPED uses MAP
to define the prior while BayesAdapter uses MAP to initialize the parameters of the varia-
tional distribution. Beyond that, this work makes valuable technical contributions including
the PSE variational and the refinement of exemplar reparameterization. We have also done
a thorough study on how pre-training benefits VI. Moreover, the results in Appendix C
show that MOPED suffers from more serious over-fitting than BayesAdapter.
6. Conclusion
This work proposes the BayesAdapter framework to ease the learning of variational BNNs.
Our core idea is to perform Bayesian fine-tuning instead of expensive from-scratch Bayesian
learning. We develop plug-and-play implementations for the stochastic variational inference
under two representative variational distributions, and refine exemplar reparameterization
to efficiently reduce gradient variance. We evaluate BayesAdapter in diverse scenarios
and report promising results. One limitation of BayesAdapter is that practitioners may
need to carefully tune the optimization configurations for Bayesian fine-tuning to achieve
reasonable performance. Regarding future work, the application of BayesAdapter to more
exciting scenarios like contextual bandits deserves further investigation.
Acknowledgments
This work was supported by NSFC Projects (Nos. 62061136001, 62076145, 62076147,
U19B2034, U1811461, U19A2081, 61972224), Beijing NSF Project (No. JQ19016), BN-
Rist (BNR2022RC01006), Tsinghua Institute for Guo Qiang, and the High Performance
Computing Center, Tsinghua University. J.Z. is also supported by the XPlorer Prize.
References
Anoop Korattikara Balan, Vivek Rathod, Kevin P Murphy, and Max Welling. Bayesian
dark knowledge. In Advances in Neural Information Processing Systems, pages 3438–
3446, 2015.
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight un-
certainty in neural network. In International Conference on Machine Learning, pages
1613–1622, 2015.
Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan
Catanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. arXiv
preprint arXiv:1410.0759, 2014.
Erik Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias
Bauer, and Philipp Hennig. Laplace redux-effortless bayesian deep learning. Advances in
Neural Information Processing Systems, 34:20089–20103, 2021.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale
Hierarchical Image Database. In CVPR09, 2009.
Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular
margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 4690–4699, 2019.
Stefan Depeweg, José Miguel Hernández-Lobato, Finale Doshi-Velez, and Steffen Udluft.
Learning and policy search in stochastic dynamical systems with Bayesian neural net-
works. arXiv preprint arXiv:1605.07127, 2016.
Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. Deep ensembles: A loss landscape
perspective. arXiv preprint arXiv:1912.02757, 2019.
Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: appendix. arXiv
preprint arXiv:1506.02157, 420, 2015.
Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing
model uncertainty in deep learning. In International Conference on Machine Learning,
pages 1050–1059, 2016.
Alex Graves. Practical variational inference for neural networks. In Advances in Neural
Information Processing Systems, pages 2348–2356, 2011.
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern
neural networks. arXiv preprint arXiv:1706.04599, 2017.
Bobby He, Balaji Lakshminarayanan, and Yee Whye Teh. Bayesian deep ensembles via the
neural tangent kernel. arXiv preprint arXiv:2007.05864, 2020.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for
image recognition. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 770–778, 2016a.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep
residual networks. In European Conference on Computer Vision, pages 630–645, 2016b.
Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to com-
mon corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable
learning of Bayesian neural networks. In International Conference on Machine Learning,
pages 1861–1869, 2015.
Jiri Hron, Alex Matthews, and Zoubin Ghahramani. Variational bayesian dropout: pitfalls
and fixes. In International Conference on Machine Learning, pages 2019–2028. PMLR,
2018.
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely
connected convolutional networks. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 4700–4708, 2017.
Gary B Huang, Marwan Mattar, Tamara Berg, and Eric Learned-Miller. Labeled faces
in the wild: A database for studying face recognition in unconstrained environments. In
Technical report, 2007.
Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning
for computer vision? In Advances in Neural Information Processing Systems, pages
5574–5584, 2017.
Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash
Srivastava. Fast and scalable Bayesian deep learning by weight-perturbation in adam. In
International Conference on Machine Learning, pages 2616–2625, 2018.
Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint
arXiv:1312.6114, 2013.
Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local
reparameterization trick. In Advances in Neural Information Processing Systems, pages
2575–2583, 2015.
Ranganath Krishnan, Mahesh Subedar, and Omesh Tickoo. Specifying weight priors in
bayesian deep neural networks with empirical bayes. In Proceedings of the AAAI Con-
ference on Artificial Intelligence, volume 34, pages 4477–4484, 2020.
Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being bayesian, even just a bit,
fixes overconfidence in relu networks. arXiv preprint arXiv:2002.10118, 2020.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny
images. 2009.
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable
predictive uncertainty estimation using deep ensembles. In Advances in Neural Informa-
tion Processing Systems, pages 6402–6413, 2017.
Christian Leibig, Vaneeda Allken, Murat Seçkin Ayhan, Philipp Berens, and Siegfried Wahl.
Leveraging uncertainty information from deep neural networks for disease detection. Sci-
entific Reports, 7(1):1–14, 2017.
Yingzhen Li and Richard E Turner. Gradient estimators for implicit models. arXiv preprint
arXiv:1705.07107, 2017.
Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian
inference algorithm. In Advances in Neural Information Processing Systems, pages 2378–
2386, 2016.
Christos Louizos and Max Welling. Structured and efficient variational deep learning with
matrix gaussian posteriors. In International Conference on Machine Learning, pages
1708–1716, 2016.
Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian
neural networks. In International Conference on Machine Learning, pages 2218–2227,
2017.
David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural
Computation, 4(3):448–472, 1992.
Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon
Wilson. A simple baseline for bayesian uncertainty in deep learning. In Advances in
Neural Information Processing Systems, pages 13153–13164, 2019.
Radford M Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto,
1995.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng.
Reading digits in natural images with unsupervised feature learning. 2011.
Kazuki Osawa, Siddharth Swaroop, Anirudh Jain, Runa Eschenhagen, Richard E Turner,
Rio Yokota, and Mohammad Emtiyaz Khan. Practical deep learning with Bayesian
principles. arXiv preprint arXiv:1906.02506, 2019.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imper-
ative style, high-performance deep learning library. In Advances in neural information
processing systems, pages 8026–8037, 2019.
Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable laplace approximation
for neural networks. In 6th International Conference on Learning Representations, ICLR
2018-Conference Track Proceedings, volume 6. International Conference on Representa-
tion Learning, 2018.
Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen.
Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE con-
ference on computer vision and pattern recognition, pages 4510–4520, 2018.
Soumyadip Sengupta, Jun-Cheng Chen, Carlos Castillo, Vishal M Patel, Rama Chellappa,
and David W Jacobs. Frontal to profile face verification in the wild. In 2016 IEEE Winter
Conference on Applications of Computer Vision (WACV), pages 1–9. IEEE, 2016.
Jiaxin Shi, Shengyang Sun, and Jun Zhu. Kernel implicit variational inference. In Interna-
tional Conference on Learning Representations, 2018a.
Jiaxin Shi, Shengyang Sun, and Jun Zhu. A spectral approach to gradient estimation for
implicit distributions. arXiv preprint arXiv:1806.02925, 2018b.
Lewis Smith and Yarin Gal. Understanding measures of uncertainty for adversarial example
detection. arXiv preprint arXiv:1803.08533, 2018.
Shengyang Sun, Guodong Zhang, Jiaxin Shi, and Roger Grosse. Functional variational
Bayesian neural networks. In International Conference on Learning Representations,
2019.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N
Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances
in neural information processing systems, pages 5998–6008, 2017.
Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics.
In Proceedings of the 28th international conference on machine learning (ICML-11), pages
681–688, 2011.
Yeming Wen, Paul Vicol, Jimmy Ba, Dustin Tran, and Roger Grosse. Flipout: Ef-
ficient pseudo-independent weight perturbations on mini-batches. arXiv preprint
arXiv:1803.04386, 2018.
Yeming Wen, Dustin Tran, and Jimmy Ba. Batchensemble: an alternative approach to
efficient ensemble and lifelong learning. arXiv preprint arXiv:2002.06715, 2020.
Florian Wenzel, Kevin Roth, Bastiaan S Veeling, Jakub Swiatkowski, Linh Tran, Stephan
Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin.
How good is the bayes posterior in deep neural networks really? arXiv preprint
arXiv:2002.02405, 2020a.
Florian Wenzel, Jasper Snoek, Dustin Tran, and Rodolphe Jenatton. Hyperparameter en-
sembles for robustness and uncertainty quantification. arXiv preprint arXiv:2006.13570,
2020b.
Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic
perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.
Anqi Wu, Sebastian Nowozin, Edward Meeds, Richard E Turner, Jose Miguel Hernandez-
Lobato, and Alexander L Gaunt. Deterministic variational inference for robust bayesian
neural networks. arXiv preprint arXiv:1810.03958, 2018.
Dong Yi, Zhen Lei, Shengcai Liao, and Stan Z Li. Learning face representation from scratch.
arXiv preprint arXiv:1411.7923, 2014.
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint
arXiv:1605.07146, 2016.
Guodong Zhang, Shengyang Sun, David Duvenaud, and Roger Grosse. Noisy natural gra-
dient as variational inference. In International Conference on Machine Learning, pages
5847–5856, 2018.
Tianyue Zheng and Weihong Deng. Cross-pose lfw: A database for studying cross-pose face
recognition in unconstrained environments. Beijing University of Posts and Telecommu-
nications, Tech. Rep, 5, 2018.
Tianyue Zheng, Weihong Deng, and Jiani Hu. Cross-age lfw: A database for studying cross-
age face recognition in unconstrained environments. arXiv preprint arXiv:1708.08197,
2017.