Tips & Tricks

Implementing models with bayes_spec is an interactive and iterative practice. Here we provide some helpful guidelines for new and experienced users alike.

Checking model specification

The first objective for any implementation of a bayes_spec model is to ensure that the model is specified properly. To do so, the user should ensure that the model predicts the expected spectral data for a given set of model parameters. The simplest test is to generate synthetic observations from the model (see Basic Tutorial).

Check prior distributions

Once the model seems to be working, the user should draw prior predictive samples to ensure that the specified prior distributions are reasonable given the data. To aid in posterior sampling, these prior distributions should be at least somewhat constraining. This is particularly true for the spectral centroid of the emission (i.e., the velocity or frequency of the emission). If all of the apparent emission appears over a small range of velocity, for example, then that small range of velocity should be described by the specified prior.

Sampling with Variational Inference (VI)

As described in the tutorials, variational inference is an algorithm that approximates the posterior distribution of a model given some data. It is only an approximation and will tend to produce sub-optimal results for complicated models. This is evidenced by differences in the Bayesian Information Criterion (BIC), a goodness-of-fit metric, for a model fit with VI compared to the same model sampled with Monte Carlo Markov Chain methods (see below).

Nonetheless, VI is a good tool for a quick inspection of model performance. The user should test that their model can be fit to their data using VI for two reasons: (1) if a model fails to converge with VI, then there may be a model specification issue or poorly constrained priors (see above), and (2) VI can be used as an initialization for component optimization (see below), so it is useful to determine the optimal hyper-parameters for VI.

There are four important tunable parameters for VI: n, the maximum number of VI iterations, rel_tolerance and abs_tolerance, the relative and absolute convergence thresholds, and learning_rate, the learning rate. The user must determine the values of these hyper-parameters that work best for their model and their data. In general:

Set n to some large number, like 100_000, and then tune the other parameters to ensure convergence before n iterations are reached.
Start with large thresholds and learning_rate (say, abs_threshold=0.1, rel_treshold=0.01, and learning_rate=0.01)
If VI converges too quickly, it may fail to accurately approximate the posterior distribution. This will be evidenced by poor posterior predictive draws (see below). Try decreasing the convergences thresholds and learning rate by a factor of 2 to 10.
Once a working set of VI hyper-parameters are found, use them to initialize MCMC (see below).
Ultimately, the goal is to find the set of VI hyper-parameters that produce accurate results with minimal runtime.

Sampling with Monte Carlo Markov Chain (MCMC) methods

MCMC methods provide more robust constraints on the model posterior distribution. Additionally, with many Markov Chains, we can test whether or not MCMC has “worked” by looking at cross-chain convergence diagnostics. There are multiple MCMC sampling methods in bayes_spec, including Hamiltonian Monte Carlo (the No-U-Turn Sampler; NUTS; implemented by sample()) and Sequential Monte Carlo (SMC; implemented by sample_smc()).

For NUTS, there are several hyper-parameters, including init, the initialization strategy, init_kwargs, the initialization hyper-parameters, chains, the number of Markov Chains, and nuts_kwargs, the NUTS hyper-parameters. The user must determine the optimal NUTS hyper-parameters for their model and data. In general:

The bayes_spec default initialization strategy is advi+adapt_diag, which initializes MCMC using VI. This tends to be good for many models (especially when init_kwargs are set from the VI experiments described above), but the user should also explore how different initialization strategies affect their results. In particular, the user may wish to try the pymc default strategy (either init="auto" or init="jitter+adapt_diag"), which implements a more traditional initialization using a “tuning” phase.
The number of chains should be set as high as allowed by the machine. Ideally, the number of chains should equal the number of available CPU cores. Be sure to also set cores to the same number in order to sample all chains in parallel.
The most important NUTS parameter is target_accept, implemented like nuts_kwargs={"target_accept": 0.8} (the default). This parameter effectively dictates the compromise between sampling efficiency and accuracy. The default value should be good for most models, but if a model is failing to converge (see below), then increasing target_accept to 0.9 or 0.95 may help at the expense of longer sampling time.

For SMC, there are fewer hyper-parameters. The most important are draws, which dictates the number of posterior samples drawn at the end as well as the number of independent Markov Chains per individual SMC chain, and chains, the number of independent SMC chains. In general:

Again, set chains and cores as high as your machine can take.
If the model fails to converge (see below), try increasing draws in increments of 1000.

Convergence diagnostics

There are several diagnostics that the user should check to ensure that MCMC has worked well for their model and data.

Divergences. Quoting the pymc diagnostics guide, divergences “indicate the Hamiltonian Markov chain has encountered regions of high curvature in the target distribution which it cannot adequately explore.” Divergences indicate that something has gone awry. In general, the number of divergences reported at the end of MCMC sampling should be small, much smaller than the number of samples. For complicated physical models, some divergences are inevitable. If the number of divergences is large, check your model specification for places where derived quantities might become non-physical (zero or negative line widths, non-physical intensities, etc.). Otherwise, try increasing target_accept.
Effective sample size (labeled as ess in the output of pm.summary()). The effective sample size for every parameter should be more than 100, ideally close to the number of posterior samples. If the effective sample size is small, then r_hat will likely also indicate poor convergence (see below).
r_hat. This is a cross-chain convergence statistic. A value close to 1, less than 1.01 or so, indicates that each chain has converged to the same posterior distribution, and thus we can be confident that the MCMC results are robust. A large r_hat indicates that the model has not yet converged, and the user should either re-parameterize their model or experiment with the MCMC hyper-parameters described above.
Posterior predictive checks. The user should generate posterior predictive samples and ensure that they reasonably represent the data.

Optimization

In general, once you have tuned the VI and MCMC hyper-parameters as described above, you can simply use the same parameters to optimize the model for the number of “cloud” components. One important caveat is the use of the approx parameter. By default, approx=True, in which case the optimizer performs a first-pass fit for all models using VI. As described above, for some models and data, this approximation might not be good enough, in which case one should use approx=False to sample each model with MCMC. This is slow, but more accurate.

Other Tips and Tricks

Like any model fitting algorithm, MCMC methods are sensitive to the “initial guess”. For a Bayesian model, this typically means having good prior distribution specifications (see above). Additionally, the user may wish to specify more specific initial values for certain model parameters in order to aid in sampling efficiency. A notable example is in the spectral centroid parameter (i.e., velocity or frequency). We can “trick” MCMC into exploring this parameter space by initializing the Markov chains across a range of velocities. For example, in the GaussModel defined in Basic Tutorial, we could specify initial values for the normalized velocity centroid. The following example demonstrates how we initialize the normalized velocity of each cloud equidistant between -1 and 1. The initial velocity is thus equidistant across the width of the velocity prior distribution.

velocity_norm = pm.Normal(
    "velocity_norm",
    mu=0.0,
    sigma=1.0,
    dims="cloud",
    initval=np.linspace(-1.0, 1.0, self.n_clouds),
)
_ = pm.Deterministic(
    "velocity",
    prior_velocity[0] + prior_velocity[1] * velocity_norm,
    dims="cloud",
)