# Stochaskell

Probabilistic programming systems automate the task of translating specifications of probabilistic models into executable inference procedures. Here we present *Stochaskell*, an embedded domain-specific language for specifying probabilistic programs that can be compiled to multiple existing probabilistic programming languages. Programs are written as Haskell code, which transparently constructs an intermediate representation that is then passed to specialised code generation modules. This allows tight integration with existing probabilistic programming systems, with each system providing inference for different parameters of a model. These model-specific inference strategies are also written as Haskell code, providing a powerful means of structuring and composing inference methods.

## Motivation

At one end of the probabilistic programming spectrum, systems like Stan produce high-performance inference programs utilising sampling algorithms such as Hamiltonian Monte Carlo (HMC). However, a consequence of this is that Stan does not support sampling discrete model parameters, requiring users to manually marginalise them out of the model prior to implementing it in Stan.See §7 of *Stan User’s Guide*.

In particular, this precludes the ability to perform inference on models in which the dimensionality of the parameter space is itself a random variable. At the other end of the spectrum, implementations of Church (Goodman et al. 2008) emphasise the ability to perform inference with few limitations on the models permitted.

Unfortunately, the model specification language is often tightly coupled with the system for translating the specification into an executable inference program. This requires users to rewrite their models in a number of different languages in order to fully take advantage of the variety of available probabilistic programming systems. Not only is this a burden on the user, but it also limits the ability to integrate a number of different probabilistic programming systems in the implementation of a single model. For instance, it may be desirable to infer continuous model parameters with Stan, whilst delegating the discrete and trans-dimensional parameters of the same model to a more universally applicable method.

Here we present Stochaskell, a probabilistic programming language designed for portability, allowing models and inference strategies to be written in a single language, but inference to be performed by a variety of probabilistic programming systems. This is achieved through runtime code generation, whereby code is automatically produced to perform inference via an external probabilistic programming system on subsets of the model specified by the user. In this way, users can benefit from the diverse probabilistic programming ecosystem without the cost of manually rewriting models in multiple different languages.

Many probabilistic programming systems — particularly those implementing the technique described by Wingate, Stuhlmüller, and Goodman (2011) — rely on code generation by translating probabilistic programs to a *general-purpose* programming language. Probabilistic C (Paige and Wood 2014) goes further by positioning itself as a code generation *target* language for other probabilistic programming systems. Hakaru (Narayanan et al. 2016) performs inference through program transformation, but both the source and target are expressed in the same probabilistic programming language. The Fun language (Borgström et al. 2011) is implemented through compilation to Infer.NET programs. To the best of our knowledge, Stochaskell is the first probabilistic programming language capable of generating code for multiple probabilistic programming systems, spanning diverse probabilistic programming paradigms.

## Probabilistic Modelling

In this section, we will introduce Stochaskell through a number of example probabilistic programs.Also see A Gentle Introduction to Haskell and the Stochaskell API documentation.

We begin with a simple program to simulate a homogeneous Poisson process over a fixed interval [0,t], using the standard method (Kroese, Taimre, and Botev 2011, sec. 5.4):

```
poissonProcess :: R -> R -> P (Z,RVec)
poissonProcess rate t = do
n <- poisson (rate * t)
s <- orderedSample n (uniform 0 t)
return (n,s)
```

The program takes two real numbers as inputEach represented by the type `R`

in the type signature. Likewise, `Z`

represents integers, `RVec`

a vector of reals, and `P (Z,RVec)`

a probability distribution over pairs of these.

specifying the \text{rate}>0 of the process, and the length t>0 of the interval, respectively. The number of points n is simulated according to a Poisson distribution with mean \text{rate}\times t, then the point locations are sampled uniformly from the interval [0,t] and stored in an ordered vector s. The output of the program is the (random) pair of these two values. We can run the simulation within a Haskell interactive environment, for example:

Of course, the real benefit of probabilistic programming is not just the ability to simulate data, but to infer model parameters given some observed data. However, we defer further discussion of this to the following section, focusing for now on just the programs themselves.

We now consider the following program, which implements a Gaussian process (GP) model (Rasmussen and Williams 2006, sec. 2.2):

```
gp :: (R -> R -> R) -> Z -> RVec -> P RVec
gp kernel n x = do
let mu = vector [ 0 | i <- 1...n ]
cov = matrix [ kernel (x!i) (x!j) | i <- 1...n, j <- 1...n ]
g <- normal mu cov
return g
```

The definitions of `mu`

and `cov`

illustrate the usage of Stochaskell’s abstract arrays.Using the *monad comprehensions* language extension.

In contrast to concrete arrays,Such as those provided by the standard `Data.Array`

module.

the dimensions of abstract arrays can be specified in terms of random variables. Array elements can be accessed using the conventional subscript operator, as witnessed by the expression `(x!i)`

providing the value of the `i`

th location at which we are evaluating the GP.

It is also worth drawing attention to the fact that the first input to this program is a *function* which takes two real locations and returns the desired covariance of the GP at those locations.The kernel function should be symmetric and positive semidefinite (Rasmussen and Williams 2006, sec. 4.1).

This highlights an advantage of the embedded approach, whereby existing Haskell functions can be freely used within Stochaskell programs. For instance, in the following code,Here we are utilising the `Data.Boolean.Overload`

module to ensure the definition is sufficiently polymorphic.

`kernelSE1`

specifies a squared exponential kernel function with unit *signal variance* and *length-scale* hyper-parameters (Rasmussen and Williams 2006, sec. 4.2), and a small Gaussian observation noise of variance 10^{-6}.

```
kernelSE1 = kernelSE (log 1) (log 1)
kernelSE lsv lls2 a b =
exp (lsv - (a - b)^2 / (2 * exp lls2))
+ if a == b then 1e-6 else 0
```

Although it is possible to use the provided `normal`

primitive to sample from a multivariate Gaussian distribution, Plot of five independent simulations of `gp kernelSE1`

.

it is also possible to do so more explicitly by transforming a vector of (univariate) standard normal samples, which can be useful for fine-tuning inference efficiency. The following program defines such a transformation with the aid of a Cholesky decomposition (Kroese, Taimre, and Botev 2011, sec. 4.3):

```
normalChol :: Z -> RVec -> RMat -> P RVec
normalChol n mu cov = do
w <- joint vector [ normal 0 1 | i <- 1...n ]
return (mu + chol cov #> w)
```

The `joint`

keyword indicates that we are sampling a random vector whose elements have the given marginal distributions. Note that, although the elements of this random vector are (conditionally) independent, correlations can be induced by applying a deterministic transformation, as with the matrix-vector multiplication (`#>`

) in the above program.

Building upon this, we can write a GP classification model (Rasmussen and Williams 2006, sec. 3.3), where `bernoulliLogit`

represents the logit-parametrised \mathrm{Bernoulli}(\mathrm{logit}^{-1}(\cdot)) distribution:

```
gpClassifier :: (R -> R -> R) -> Z -> RVec -> P (RVec,BVec)
gpClassifier kernel n x = do
g <- gpChol kernel n x
phi <- joint vector [ bernoulliLogit (g!i) | i <- 1...n ]
return (g,phi)
```

In this example, `gpChol`

is itself a probabilistic program, defined much like the `gp`

program earlier, but utilising the `normalChol`

program instead of the `normal`

primitive. This demonstrates how sub-programs can be easily composed in Stochaskell, another benefit we obtain for free from embedding.

Now for a somewhat different example, this program implements the stick-breaking process of Sethuraman (1994) with concentration parameter \alpha>0:

```
stickBreak :: R -> P RVec
stickBreak alpha = do
sticks <- joint vector [ beta 1 alpha | i <- 1...infinity ] :: P RVec
let sticks' = vector [ 1 - (sticks!i) | i <- 1...infinity ]
rems = scanl (*) 1 sticks'
probs = vector [ (sticks!i) * (rems!i) | i <- 1...infinity ]
return probs
```

Here we are utilising the higher-order function *scan*. Much like the closely-related *fold* operation (sometimes called *reduce* or *accumulate*), scan allows us to recursively combine elements of an array. In this example, `rems`

is a vector containing the cumulative product of `sticks'`

. Although fold and scan provide only a restricted form of recursion, they are powerful enough to be able to implement practically useful models such as Dirichlet processes (DP), as demonstrated in the following program. Stochaskell supports using higher-order functions like these, rather than allowing unlimited recursion, in order to provide a balance between expressiveness and portability, in consideration of the fact that some probabilistic programming systems may support only limited forms of recursion, if at all.

```
dirichletProcess :: R -> P R -> P (P R)
dirichletProcess alpha base = do
probs <- stickBreak alpha
atoms <- joint vector [ base | i <- 1...infinity ]
let randomDistribution = do
stick <- pmf probs
return (atoms!stick)
return randomDistribution
```

Note that we can see from the type signature that this program encodes a *distribution over distributions*. These random distributions can be used like any other sub-program, as shown in this implementation of a DP mixture model (Neal 2000):

```
dpmm :: Z -> P RVec
dpmm n = do
let base = uniform 0 100
paramDist <- dirichletProcess 5 base
params <- joint vector [ paramDist | j <- 1...n ]
values <- joint vector [ normal (params!j) 1 | j <- 1...n ]
return values
```

### Intermediate Representation

See Roberts, Gallagher, and Taimre (2019, sec. 3.1).

## Probabilistic Inference

Statistical inference, within a Bayesian paradigm, involves the consideration of the posterior distribution of model parameters conditioned on observed data. For clarity, we will focus on the common case of performing inference by drawing approximate samples from the posterior via Markov chain Monte Carlo (MCMC) methods. However, note that Stochaskell does not restrict itself to this case, and integration with a wide range of different inference methods is possible. Posterior distributions can be expressed in Stochaskell with a familiar notation,Thanks to monad comprehensions, and a generalised `guard`

function.

for instance given a probabilistic program `prior`

and some `observed`

data:

Stochaskell supports programmable inference, as introduced by Mansinghka, Selsam, and Perov (2014), with inference strategies expressed in the same language as the model, i.e. Haskell. This allows us to seamlessly combine different inference methods, be they provided by native Haskell code or external probabilistic programming systems. In the latter case, runtime code generation allows us to offload inference of different parts of a model without any manual code duplication.

### Stan Integration

Stochaskell integrates with Stan, to provide high-performance inference for (conditionally) fixed-dimensional, continuous model parameters. This is achieved by automatically translating Stochaskell programs into Stan code, compiling the resulting code with CmdStan, then executing the compiled model and parsing the results so that they can be further processed within Haskell. This allows the user to offload inference to Stan with a single line of code, receiving a list of posterior samples in return: Plot of posterior samples from the Gaussian process classification model program from the previous section, sampled via the Stan backend.

The IR outlined in the previous section can be translated to Stan code in a fairly straightforward manner. Each Apply node is translated to variable definition with an automatically generated identifier. Array definitions correspond to a for-loop, or a set of nested for-loops, which iterate over each possible index value and store the appropriate element value into an array. A Fold node produces code that first assigns the default value to a variable, then iteratively updates it by looping over the elements of the vector.

Similar backends are also provided for integration with PyMC3 and Edward.

### Church Integration

To demonstrate the portability of Stochaskell across a variety of probabilistic programming paradigms, we also provide integration with Church. The implementation and usage of this is quite similar to the Stan backend, allowing inference to be easily offloaded: Histogram of n=10^4 samples from the Dirichlet process mixture model program from the previous section, via the WebChurch implementation of Church.

The main conceptual difference from the Stan code generation is that for Church, arrays are represented by *memoised functions*. In particular, random arrays — constructed with the `joint`

keyword in Stochaskell — utilise Church’s support for *stochastic memoisation* (Goodman et al. 2008, sec. 2.1).

The figure on the right illustrates a simulation of the DP mixture model program presented in the previous section, sampled via the Church backend.

### Metropolis–Hastings

Stochaskell provides a native Haskell library for both simulating probabilistic programs, as well as computing the probability density function of the distribution they represent where possible. A notable feature of the latter is that we provide limited support for computing the density of transformed random variables. Suppose we have Z=h(X), where X and Z are real random vectors of common dimension. In the case that the transformation h is invertible, the density of Z can be computed from that of X by applying the formula

p_Z(z)=\frac{p_X(h^{-1}(z))}{|J_h(h^{-1}(z))|}

where the denominator is the absolute value of the determinant of the Jacobian matrix of the transformation h (Kroese, Taimre, and Botev 2011, sec. A.6). The Jacobian matrix is computed using automatic differentiation, and the inverse transform h^{-1}(z) is calculated by the following procedure. Given an expression graph representing h(x), we associate the root node with the value of z, then walk the graph associating values with nodes whenever they can be fully determined by the values of neighbouring nodes. When this process succeeds, the nodes representing x will be associated with the value of h^{-1}(z).

This density computation library allows us to provide a simple implementation of the Metropolis–Hastings (M–H) inference algorithm, with support for *user-defined* proposal distributions:Here `density`

includes the Jacobian adjustment, whereas `density'`

does not. The latter allows us to use proposals which make few random choices relative to the size of the state, under the assumption that they do not involve non-trivial transformations.

*Note: these functions have been deprecated in favour of the more efficient lpdf and lpdfAux functions.*

```
mh target proposal x = do
y <- simulate (proposal x)
let f = density target
q z = density' (proposal z)
a = fromLogFloat $
(f y * q y x) / (f x * q x y)
accept <- bernoulli (if a > 1 then 1 else a)
return (if accept then y else x)
```

Proposals are specified by probabilistic programs written in Stochaskell, alongside the model. For instance, given the following random walk proposal:

we can sample from the distribution represented by some `program`

via M–H as follows:

where `n`

is the number of iterations and `x0`

is some initial state.

### Reversible Jump Markov Chain Monte Carlo

See Roberts, Gallagher, and Taimre (2019).

## Usage

To quickly get started using Stochaskell, it can be launched via Binder. It can also be launched on a local machine by installing Docker and running the following commands:

```
docker pull davidar/stochaskell
docker run -it -p8888:8888 davidar/stochaskell
```

To build from source, rather than using a pre-built image, run the following commands:

```
git clone --recursive https://github.com/davidar/stochaskell.git
docker build -t stochaskell:latest stochaskell/
docker run -it -p8888:8888 stochaskell:latest
```

## References

Borgström, Johannes, Andrew D. Gordon, Michael Greenberg, James Margetson, and Jurgen Van Gael. 2011. “Measure Transformer Semantics for Bayesian Machine Learning.” In *Programming Languages and Systems*, 77–96. https://doi.org/10.1007/978-3-642-19718-5_5.

Goodman, Noah D., Vikash K. Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. 2008. “Church: A Language for Generative Models.” In *UAI 2008, Proceedings of the 24th Conference in Uncertainty in Artificial Intelligence, Helsinki, Finland, July 9-12, 2008*, edited by David A. McAllester and Petri Myllymäki, 220–29. AUAI Press. https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=1346&proceeding_id=24.

Green, Peter J. 1995. “Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination.” *Biometrika* 82 (4): 711–32. https://doi.org/10.1093/BIOMET/82.4.711.

Green, Peter J., and David I. Hastie. 2009. “Reversible Jump MCMC.” https://people.maths.bris.ac.uk/~mapjg/papers/rjmcmc_20090613.pdf.

Kroese, Dirk P., Thomas Taimre, and Zdravko I. Botev. 2011. *Handbook of Monte Carlo Methods*. New York: Wiley.

Mansinghka, Vikash K., Daniel Selsam, and Yura N. Perov. 2014. “Venture: A Higher-Order Probabilistic Programming Platform with Programmable Inference.” *CoRR* abs/1404.0099. http://arxiv.org/abs/1404.0099.

Narayanan, Praveen, Jacques Carette, Wren Romano, Chung-chieh Shan, and Robert Zinkov. 2016. “Probabilistic Inference by Program Transformation in Hakaru (System Description).” In *Functional and Logic Programming*, 62–79. https://doi.org/10.1007/978-3-319-29604-3_5.

Neal, Radford M. 2000. “Markov Chain Sampling Methods for Dirichlet Process Mixture Models.” *Journal of Computational and Graphical Statistics* 9 (2): 249–65. https://doi.org/10.1080/10618600.2000.10474879.

Paige, Brooks, and Frank D. Wood. 2014. “A Compilation Target for Probabilistic Programming Languages.” In *Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014*, 32:1935–43. JMLR Workshop and Conference Proceedings. JMLR.org. http://jmlr.org/proceedings/papers/v32/paige14.html.

Rasmussen, Carl Edward, and Christopher K I Williams. 2006. *Gaussian Processes for Machine Learning*. MIT Press. http://www.gaussianprocess.org/gpml/chapters/.

Roberts, David A, Marcus Gallagher, and Thomas Taimre. 2019. “Reversible Jump Probabilistic Programming.” In *Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics*, edited by Kamalika Chaudhuri and Masashi Sugiyama, 89:634–43. PMLR. http://proceedings.mlr.press/v89/roberts19a.html.

Sethuraman, Jayaram. 1994. “A Constructive Definition of Dirichlet Priors.” *Statistica Sinica* 4 (2): 639–50. http://www.jstor.org/stable/24305538.

Wingate, David, Andreas Stuhlmüller, and Noah D. Goodman. 2011. “Lightweight Implementations of Probabilistic Programming Languages via Transformational Compilation.” In *Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, Usa, April 11-13, 2011*, edited by Geoffrey J. Gordon, David B. Dunson, and Miroslav Dudı́k, 15:770–78. JMLR Proceedings. JMLR.org. http://www.jmlr.org/proceedings/papers/v15/wingate11a/wingate11a.pdf.