Bayesian ways of doing logistic regression

Logistic distribution is ones of the most widely used distributions in statistics. Even though we are not conscious of it, we are using the logistic distribution whenever we do logistic regression. The frequentist way of fitting a logistic model is quite straightforward. However, the Bayesian logistic model isn’t quite as easy due to its non-conjugacy. In those cases, it is very useful to use the hierarchical representation of the distribution. This post is aimed at reviewing 2 types of modelling logistic regression in a Bayesian way.

The first is to use the logistic distribution directly. Choi and Hobert (2013) proposes a Gibbs sampler using the Polyá-Gamma distribution which isn’t easy to sample from. Those who are interested in the random variate generation algorithms are referred to Devroye (1986). In fact, Devroye released the book free of charge online (here). Anyway, let’s review logistic regression briefly. The probability density function (PDF) of a logistic distribution is as follows:

$f(x\,|\,\mu,s) = \dfrac{e^{-(x-\mu)/s}}{s\left(1+e^{-(x-\mu)/s}\right)^{2}}$

for location parameter $\mu\in\mathbf{R}$ and and scale parameter $s>0$ . The logistic regression is modelled as the linear combination of regressors mapped to the probability parameter of a Bernoulli distribution.

$y_{i} \sim \mathrm{Ber}\left(F\left(\mathbf{x}_{i}'\boldsymbol{\beta}\right)\right)$

Such a function $F$ can be any cumulative distribution function (CDF). If it’s chosen as the CDF of a normal distribution, it’s called the ‘probit regression’ and if it is the CDF of a logistic distribution, it’s called the ‘logistic regression’, as we all know. This is why the likelihood function of $Y$ looks like

$P(Y\,|\,\boldsymbol{\beta}) = \displaystyle\prod_{i=1}^{n}\dfrac{e^{y_{i}\mathbf{x}_{i}'\boldsymbol{\beta}}}{1+e^{\mathbf{x}_{i}'\boldsymbol{\beta}}}$

For simplicity, let’s assign a standard multivariate normal distribution for $\boldsymbol{\beta} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ . The posterior would be

$\pi(\boldsymbol{\beta}\,|\,Y) \propto \left[\displaystyle\prod_{i=1}^{n}\dfrac{e^{y_{i}\mathbf{x}_{i}'\boldsymbol{\beta}}}{1+e^{\mathbf{x}_{i}'\boldsymbol{\beta}}}\right]e^{-\boldsymbol{\beta}'\boldsymbol{\beta}/2}$

The trick here is to introduce another random variable that is totally irrelevant if $\boldsymbol{\beta}$ is given and could be integrated out so as to restore the original posterior. But we do it so that it makes sampling easier for $\boldsymbol{\beta}$ . We all know that the denominator $1+e^{\mathbf{x}_{i}'\boldsymbol{\beta}}$ is the hindrance here so let’s make it gone.

$\begin{array}{lcl} \pi(\boldsymbol{\beta}\,|\,Y) &\propto & \left[\displaystyle\prod_{i=1}^{n}\dfrac{e^{y_{i}\mathbf{x}_{i}'\boldsymbol{\beta}}}{1+e^{\mathbf{x}_{i}'\boldsymbol{\beta}}}\right]e^{-\boldsymbol{\beta}'\boldsymbol{\beta}/2}\\ \pi(\boldsymbol{\beta},w\,|\,Y) &\propto&\left[\displaystyle\prod_{i=1}^{n}\dfrac{e^{y_{i}\mathbf{x}_{i}'\boldsymbol{\beta}}}{1+e^{\mathbf{x}_{i}'\boldsymbol{\beta}}}\times \dfrac{1+e^{\mathbf{x}_{i}'\boldsymbol{\beta}}}{2e^{\mathbf{x}_{i}'\boldsymbol{\beta}}}e^{-\frac{(\mathbf{x}_{i}'\boldsymbol{\beta})^{2}w_{i}}{2}}g(w_{i}) \right]e^{-\boldsymbol{\beta}'\boldsymbol{\beta}/2}\\ g(w) &=& \displaystyle\sum_{k=0}^{\infty}(-1)^{k}\dfrac{(2k+1)}{\sqrt{2\pi w^{3}}}e^{-\frac{(2k+1)^{2}}{8w}}\mathbb{I}_{(0,\infty)}(w) \end{array}$

Let’s not care about that ugly function $g$ and if we use the identity

$\cosh x = \dfrac{1+e^{2x}}{2e^{x}}$

the joint posterior $\pi(\boldsymbol{\beta},w\,|\,Y)$ is expressed as

$\pi(\boldsymbol{\beta},w\,|\,Y) \propto \left[\displaystyle\prod_{i=1}^{n}\dfrac{e^{y_{i}\mathbf{x}_{i}'\boldsymbol{\beta}}}{1+e^{\mathbf{x}_{i}'\boldsymbol{\beta}}}\times \cosh\left(\dfrac{|\mathbf{x}_{i}'\boldsymbol{\beta}|}{2} \right)e^{-\frac{(\mathbf{x}_{i}'\boldsymbol{\beta})^{2}w_{i}}{2}}g(w_{i}) \right]e^{-\boldsymbol{\beta}'\boldsymbol{\beta}/2}$

Surprisingly, $f(x;c) = \cosh(c/2)e^{-c^{2}x/2}g(x)$ is the density function of a Polyá-Gamma distribution with parameters (1,c), i.e. $\mathrm{PG}(1,c)$ . Since we have introduced a random variable that is irrelevant, we call this the ‘data augmentation’ technique. This implies that if we can generate Polyá-Gamma random variates, it gets much easier to sample from $\pi(\boldsymbol{\beta}\,|\,w,Y)$ rather than $\pi(\boldsymbol{\beta}\,|\,Y)$ since the density of the Polyá-Gamma density cancels out the cumbersome denominator of the logistic CDF. For completeness,

$\begin{array}{lcl} \pi(\boldsymbol{\beta}\,|\,w,Y) &\propto& \left[\displaystyle\prod_{i=1}^{n} \exp\left(y_{i}\mathbf{x}'\boldsymbol{\beta}-\dfrac{\mathbf{x}_{i}'\boldsymbol{\beta}}{2}-\dfrac{w_{i}\boldsymbol{\beta}'\mathbf{x}_{i}\mathbf{x}_{i}'\boldsymbol{\beta} }{2}\right)\right]e^{-\boldsymbol{\beta}'\boldsymbol{\beta}/2}\\ &=& \exp\left(Y'X\boldsymbol{\beta}-\dfrac{\boldsymbol{\beta}'X\mathbf{1}}{2}-\dfrac{\boldsymbol{\beta}'\left(X'WX+\mathbf{I}\right)\boldsymbol{\beta} }{2} \right)\\ &=& \exp\left[-\dfrac{1}{2}\left(\boldsymbol{\beta}'\left(X'WX+\mathbf{I} \right)\boldsymbol{\beta}-2\boldsymbol{\beta}'X'\left(Y-\dfrac{1}{2}\mathbf{I}\right) \right) \right] \end{array}$

where $W=\mathrm{diag}(w_{1},w_{2},\ldots,w_{n})$ . So the resulting expression is a multivariate normal distribution whose mean vector and covariance matrix $\mathbf{m},\Sigma$ are

$\begin{array}{lcl} \Sigma &=&\left(X'WX+\mathbf{I}\right)^{-1}\\ \mathbf{m} &=& \Sigma X'\left(Y-\dfrac{1}{2}\mathbf{I}\right) \end{array}$

Therefore, by introducing Polyá-Gamma random variables, the Gibbs sampler proceeds by repeating sampling from the following two distributions

$\begin{array}{rcl} \boldsymbol{\beta} &\sim& \mathcal{N}\left(\mathbf{m},\Sigma\right)\\ w_{i} &\sim& \mathrm{PG}(1,\mathbf{x}_{i}'\boldsymbol{\beta}) \end{array}$

Next method is using the hierarchical representation of the logistic distribution as a scale mixture of normal as proposed in Stefanski (1991). That is the logistic PDF can be recast as

$f(x) = \dfrac{e^{-x}}{(1+e^{-x})^{2}} = \displaystyle \int_{0}^{\infty} \left[\dfrac{1}{2\nu\sqrt{2\pi}}\exp\left\{-\dfrac{1}{2}\left(\dfrac{x}{2\nu} \right)^{2} \right\} \right]\pi(\nu)\,d\nu$

where $\nu$ follows the Kolmogorov-Smirnov (KS) distribution whose density is

$\pi(\nu) = 8\displaystyle\sum_{\alpha=1}^{\infty}(-1)^{(\alpha+1)}\alpha^{2}\nu\exp\left(-2\alpha^{2}\nu^{2}\right)$

This time we don’t model the regression model directly but we will rather resort to the latent variable representation. That is, we will assume there is a latent variable $y_{i}^{\star}$ that decides which value $y_{i}$ would take on between $\left\{0,1\right\}$ .

$\begin{array}{lcl}y_{i}^{\star} &=& \mathbf{x}_{i}'\boldsymbol{\beta}+e_{i},\quad e_{i}\sim \mathrm{Logistic}(0,1) \\y_{i} &=& \begin{cases}1, & \text{if }y_{i}^{\star}>0\\0, & \text{if }y_{i}^{\star}\leq 0 \end{cases}\end{array}$

We merely replace $e_{i}\sim \mathrm{Logistic}(0,1)$ with its hierarchical representation. Let $\phi$ be the PDF of the standard normal distribution. Then the joint posterior becomes

$\pi(\boldsymbol{\beta},\nu\,|\,Y) = \left[\displaystyle\prod_{i=1}^{n}\left(\mathbb{I}\left(y_{i}^{\star}>0\right)\mathbb{I}\left(y_{i}=1\right)+\mathbb{I}\left(y_{i}^{\star}\leq 0\right)\mathbb{I}\left(y_{i}=0\right)\right)\dfrac{1}{2\nu}\phi\left(\dfrac{y_{i}^{\star}-\mathbf{x}'\boldsymbol{\beta}}{2\nu}\right)\right]\pi(\boldsymbol{\beta},\nu)$

The key point to remember is that if we change our perspective about the order in which $y_{i}^{\star}$ and $y_{i}$ are determined, that is, the modelling assumed once $y_{i}^{\star}$ is determined (though it is not observed), we get to see what $y_{i}$ is. However, during the estimation, we think since $y_{i}$ is decided, $y_{i}^{\star}$ should be have the correct value accordingly. For example, if $y_{i} = 0$ , then $y_{i}^{\star} \leq 0$ because we have already observed $y_{i}=0$ which means $y_{i}^{\star}$ must have been some negative value. So the indicator functions before the normal density just tell us that the support (informally the range of values that a random variable can take on) of $y_{i}^{\star}$ should be ‘truncated’.

$y_{i}^{\star}\,|\,\text{rest} \sim \begin{cases}\mathcal{N}_{+}\left(\mathbf{x}_{i}'\boldsymbol{\beta},4\nu^{2}\right),& \text{if } y_{i}=1\\ \mathcal{N}_{-}\left(\mathbf{x}_{i}'\boldsymbol{\beta},4\nu^{2}\right),& \text{if } y_{i}=0 \end{cases}$

where $\mathcal{N}_{+},\mathcal{N}_{-}$ are left-truncated and right-truncated normal distributions at zero. $\boldsymbol{\beta}$ isn’t affected by the indicator functions so we go as usual:

$\begin{array}{lcl} \boldsymbol{\beta} &\sim&\mathcal{N}\left(\mathbf{m},\Sigma\right)\\ \Sigma &=& \left(\dfrac{1}{4\nu^{2}}X'X+\mathbf{I}\right)^{-1}\\ \mathbf{m} &=& \Sigma\left(\dfrac{1}{4\nu^{2}}X'Y^{\star}\right) \end{array}$

Next is $\nu$ . Since we do not know what distribution the full conditional of $\nu$ is, we shall resort to a Metropolis-Hastings scheme. That is, we sample from the prior $\nu^{\star} \sim \mathrm{KS}$ , and accept it with probability

$\begin{array}{lcl}\alpha &=& \min\left(1,\mathrm{LR}\right)\\ \mathrm{LR} &=& \sqrt{\dfrac{\left(\nu^{(k)}\right)^{2}}{\left(\nu^{\star}\right)^{2}}}\exp\left(-\left(\dfrac{1}{8\left(\nu^{\star}\right)^{2}}-\dfrac{1}{8\left(\nu^{(k)}\right)^{2}} \right)\displaystyle\sum_{i=1}^{n}\left(y_{i}^{\star}-\mathbf{x}_{i}'\boldsymbol{\beta}\right)^{2}\right)\\ \nu^{(k+1)}&=&\begin{cases}\nu^{\star},&\text{with }\alpha\\\nu^{(k)}, &\text{with } 1-\alpha \end{cases} \end{array}$

So I have reviewed 2 modelling methods for Bayesian logistic regression and they both have advantages. However, I would say the Kolmogorov-Smirnov modelling is more general in the sense that it can be used with more complex models such as nonparametric regression.

For the samplers, I have Matlab codes here.

Choi, H. M., & Hobert, J. P. (2013). The Polya-Gamma Gibbs sampler for Bayesian logistic regression is uniformly ergodic. Electronic Journal of Statistics, 7, 2054-2064.
“Non-Uniform Random Variate Generation, Luc Devroye, Springer-Verlag, 1986,
the University of California, 16 Dec 2010, ISBN:0387963057, 9780387963051”
Stefanski, L. A. (1991). A normal scale mixture representation of the logistic distribution. Statistics & Probability Letters, 11(1), 69-70.

Perchance

Bayesian ways of doing logistic regression

One thought on “Bayesian ways of doing logistic regression”

Leave a comment Cancel reply

Share this:

One thought on “Bayesian ways of doing logistic regression”

Leave a comment Cancel reply