In some modern geostatistical problems, statisticians need to analyze multiple correlated responses with the number of observations reaching a million. This big data problem has prompted a rich literature on scalable methodologies for analyzing large multivariate spatial datasets. Scalable spatial process models in the Bayesian paradigm have been found especially attractive because they are flexible and fit readily into hierarchical model settings. However, a major computational bottleneck for obtaining full Bayesian inference, including inference for latent processes,
arises from slow MCMC sampling over a high-dimensional parameter space. This work devises massively scalable Bayesian approaches that rapidly deliver full Bayesian inference on
multivariate spatial processes, inference that is practically indistinguishable from that obtained using more expensive alternatives. One key strategy we develop uses the conjugate gradient method to accelerate posterior sampling for a Bayesian linear model of coregionalization (LMC) built
on Nearest-Neighbor Gaussian processes. The algorithm yields MCMC chains with a convergence rate comparable to that of a Gibbs sampler over a low-dimensional parameter space, while our model provides full Bayesian inference over a parameter space whose dimension is linear in
the number of observations. Simulation studies reveal that our algorithm provides inference comparable to that from the widely used Bayesian LMC model, while being around 200 times faster. We applied the algorithm to a real dataset with two responses over 1.2 million observed locations and completed the analysis within 1 day and 14 hours on a standard machine.
Authors: Lu Zhang and Sudipto Banerjee, UCLA Department of Biostatistics
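
To illustrate how a conjugate gradient solver can stand in for a matrix factorization when sampling a high-dimensional latent process, here is a minimal sketch. It is not the authors' algorithm: it assumes a univariate toy model y = w + noise on a line, where each location is conditioned on a single previous neighbor. The weight `a`, the conditional variances `d`, the noise variance `tau2`, and all function names are hypothetical choices made for this illustration. The draw uses the generic "perturb, then solve" idea: add Gaussian noise whose covariance equals the posterior precision Q to the right-hand side, then solve one linear system in Q.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=1000):
    """Solve Q x = b for symmetric positive definite Q, given only x -> Q x."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        # Stop once the relative residual norm is small enough.
        if np.sqrt(rs_old) <= tol * np.linalg.norm(b):
            break
        Qp = matvec(p)
        alpha = rs_old / (p @ Qp)
        x += alpha * p
        r -= alpha * Qp
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def B_mul(w, a):
    """Apply B = I - A, where A holds weight a on the single previous neighbor."""
    out = w.copy()
    out[1:] -= a * w[:-1]
    return out

def Bt_mul(v, a):
    """Apply the transpose of B."""
    out = v.copy()
    out[:-1] -= a * v[1:]
    return out

def make_precision_matvec(a, d, tau2):
    """Posterior precision of w: Q = B^T D^{-1} B + I / tau2 (assumed toy model)."""
    def matvec(w):
        return Bt_mul(B_mul(w, a) / d, a) + w / tau2
    return matvec

def sample_latent(y, a, d, tau2, rng):
    """Draw w ~ N(Q^{-1} y / tau2, Q^{-1}) with one conjugate gradient solve."""
    n = y.shape[0]
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    # eta has covariance B^T D^{-1} B + I / tau2, which equals Q.
    eta = Bt_mul(z1 / np.sqrt(d), a) + z2 / np.sqrt(tau2)
    return conjugate_gradient(make_precision_matvec(a, d, tau2), y / tau2 + eta)

# Small demonstration on simulated data (all values are illustrative).
rng = np.random.default_rng(1)
n, a, tau2 = 200, 0.9, 0.5
d = np.full(n, 1.0 - a**2)
d[0] = 1.0
y = rng.standard_normal(n)
w_draw = sample_latent(y, a, d, tau2, rng)
```

Because Q is only ever accessed through matrix-vector products, the cost per draw scales with the number of nonzero neighbor weights rather than with a dense factorization, which is the property that makes this kind of solver attractive for sparse nearest-neighbor structures.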