Today I want to introduce a very interesting paper by Bhattacharya et al. [1]. It is a short note on how to reduce the time complexity of sampling a multivariate normal random vector. Of course, it does not apply to every normal distribution; rather, it exploits a special structure of the covariance matrix and the mean vector that arises very often in Bayesian statistics.

The paper is really succinct and contains everything that's necessary, so I don't think I can add much explanation to it, nor do I think it can be summarized any further. I'll just reiterate what's explained in the paper.

Most of the time, Bayesian inference resorts to Markov chain Monte Carlo (MCMC), especially the Gibbs sampler when there is no special difficulty. Because of the product structure of the likelihood and the prior, i.e.,

\pi\left(\boldsymbol{\beta}\,|\,\mathbf{Y}\right)\propto L\left(\boldsymbol{\beta}\,|\,\mathbf{Y}\right)\pi\left(\boldsymbol{\beta}\right)

the full conditional distribution of \boldsymbol{\beta} often takes the following form:

\begin{array}{rcl} \boldsymbol{\beta}\,|\,\text{rest} &\sim &\mathcal{N}\left(\mu,\Sigma\right)\\ \Sigma &= &\left(\mathbf{X}'\mathbf{X}+\mathbf{D}^{-1}\right)^{-1},\quad \mu=\Sigma\mathbf{X}'\boldsymbol{\alpha} \end{array}

where \mathbf{D}\in\mathbf{R}^{p\times p} is a symmetric positive definite matrix, \mathbf{X}\in \mathbf{R}^{n\times p}, and \boldsymbol{\alpha}\in \mathbf{R}^{n}. For instance, in the Gaussian linear model \mathbf{Y}\,|\,\boldsymbol{\beta}\sim\mathcal{N}\left(\mathbf{X}\boldsymbol{\beta},\mathbf{I}_{n}\right) with prior \boldsymbol{\beta}\sim\mathcal{N}\left(\mathbf{0}_{p},\mathbf{D}\right), the full conditional is exactly of this form with \boldsymbol{\alpha}=\mathbf{Y}. Normally, we use the Cholesky decomposition to sample from such a distribution, i.e.

\boldsymbol{\beta}^{(k)} = \Sigma^{1/2}Z+\mu,\qquad Z\sim \mathcal{N}\left(0,\mathbf{I}_{p}\right)
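For concreteness, here is a minimal sketch of this standard approach in R (the function naive_draw and its internals are my own illustration, not code from the paper):

naive_draw <- function(X, D, alpha)
{
  p     <- ncol(X)
  Sigma <- solve(crossprod(X) + solve(D))       # (X'X + D^{-1})^{-1}
  mu    <- drop(Sigma %*% crossprod(X, alpha))  # Sigma X' alpha
  R     <- chol(Sigma)                          # upper triangular, R'R = Sigma
  drop(mu + crossprod(R, rnorm(p)))             # mu + R'z, z ~ N(0, I_p)
}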

The paper points out, however, that computing the Cholesky factor costs O\left(p^{3}\right) operations, which quickly becomes a huge burden as p grows larger. So the authors suggest the following algorithm (an R implementation is given further below); I will simply restate what's written in the paper.

  1. Sample u \sim \mathcal{N}\left(\mathbf{0}_{p},\mathbf{D}\right) and \delta \sim \mathcal{N}\left(\mathbf{0}_{n},\mathbf{I}_{n}\right) independently.

  2. Set v=\mathbf{X}u+\delta.

  3. Solve \left(\mathbf{XDX}'+\mathbf{I}_{n}\right)w=\boldsymbol{\alpha}-v to get w.

  4. Set \boldsymbol{\beta} = u+\mathbf{DX}'w.

That this produces a draw from the same distribution as the Cholesky approach follows easily from the Sherman-Morrison-Woodbury matrix identity.
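Here is a quick way to see it. Write \mathbf{M}=\mathbf{DX}'\left(\mathbf{XDX}'+\mathbf{I}_{n}\right)^{-1}, so that step 4 reads \boldsymbol{\beta}=u+\mathbf{M}\left(\boldsymbol{\alpha}-\mathbf{X}u-\delta\right). This is Gaussian, and a short calculation gives

\begin{array}{rcl} \mathbb{E}\left[\boldsymbol{\beta}\right] &= &\mathbf{M}\boldsymbol{\alpha}=\left(\mathbf{X}'\mathbf{X}+\mathbf{D}^{-1}\right)^{-1}\mathbf{X}'\boldsymbol{\alpha}=\mu\\ \text{Cov}\left(\boldsymbol{\beta}\right) &= &\left(\mathbf{I}_{p}-\mathbf{MX}\right)\mathbf{D}\left(\mathbf{I}_{p}-\mathbf{MX}\right)'+\mathbf{MM}'=\mathbf{D}-\mathbf{MXD}=\left(\mathbf{X}'\mathbf{X}+\mathbf{D}^{-1}\right)^{-1}=\Sigma \end{array}

where the two equalities \mathbf{M}=\left(\mathbf{X}'\mathbf{X}+\mathbf{D}^{-1}\right)^{-1}\mathbf{X}' and \mathbf{D}-\mathbf{DX}'\left(\mathbf{XDX}'+\mathbf{I}_{n}\right)^{-1}\mathbf{XD}=\left(\mathbf{X}'\mathbf{X}+\mathbf{D}^{-1}\right)^{-1} are exactly of the Sherman-Morrison-Woodbury type. So \boldsymbol{\beta}\sim\mathcal{N}\left(\mu,\Sigma\right), as required.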

The paper also offers an empirical study of how much faster the algorithm is compared to the Cholesky approach; since step 3 only involves an n\times n linear system, the gain is largest when p\gg n. The value of this new algorithm therefore shines in high-dimensional regression, so it is natural that the paper focuses on the high-dimensional setting where the prior is designed to separate the signals from the noise, the horseshoe prior in particular.


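The following R function implements the four steps above; the diag_flag argument lets us sample u with p independent normals when \mathbf{D} is diagonal (as it is, conditionally on the scale parameters, under the horseshoe prior), instead of calling a general multivariate normal sampler.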
library(mvnfast)  # for rmvn(), used when D is a general covariance matrix

Bhattacharya <- function(X, D, alpha, diag_flag = TRUE)
{
  p <- nrow(D)
  n <- nrow(X)
  # Step 1: u ~ N(0, D) and delta ~ N(0, I_n), independently
  if (diag_flag) {
    u <- rnorm(p) * sqrt(diag(D))       # cheap draw when D is diagonal
  } else {
    u <- drop(rmvn(1, numeric(p), D))   # general positive definite D
  }
  delta <- rnorm(n)
  # Step 2: v = X u + delta
  v <- drop(X %*% u) + delta
  # Step 3: solve (X D X' + I_n) w = alpha - v, an n x n system
  U <- tcrossprod(D, X)                 # U = D X'
  w <- solve(X %*% U + diag(n), alpha - v)
  # Step 4: beta = u + D X' w
  u + drop(U %*% w)
}
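
And a minimal usage example (the dimensions here are arbitrary; naive_draw is the illustrative baseline sketched earlier in this post):

set.seed(1)
n <- 100; p <- 2000
X     <- matrix(rnorm(n * p), n, p)
D     <- diag(runif(p, 0.5, 2))   # diagonal prior covariance
alpha <- rnorm(n)

beta_draw <- Bhattacharya(X, D, alpha, diag_flag = TRUE)
length(beta_draw)                 # p

# rough timing against the O(p^3) baseline
system.time(Bhattacharya(X, D, alpha))
system.time(naive_draw(X, D, alpha))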

[1] Bhattacharya, Anirban, Antik Chakraborty, and Bani K. Mallick. “Fast sampling with Gaussian scale-mixture priors in high-dimensional regression.” Biometrika 103.4 (2016): 985.
