A short statistical reasoning test

The second – likelihoodist – is to create a profile likelihood and take the \(q\) quantile. I personally find this approach more intuitive in general because it is contextually picking model parameters, rather than directly making claims about degrees of belief: we are just trying to pick \(p\) such that it captures the first 5% of the likelihood sum of our binomial model.
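A minimal sketch of that idea, assuming a simple grid over \(p\) and \(q = 0.05\). The function name `lower_bound_lik` and the grid size are my own illustrative choices:

```r
# Evaluate the binomial likelihood of the observed k out of n on a grid of p,
# then return the smallest p at which the normalised cumulative likelihood
# reaches q.
lower_bound_lik <- function(k, n, q = 0.05, grid_size = 10001) {
  p <- seq(0, 1, length.out = grid_size)  # candidate values of p
  lik <- dbinom(k, n, p)                  # likelihood of the data at each p
  cum <- cumsum(lik) / sum(lik)           # cumulative share of the likelihood sum
  p[which(cum >= q)[1]]
}

# Hypothetical example: 8 successes in 10 trials
lower_bound_lik(8, 10)
```

With an evenly spaced grid, this should agree closely with the \(q\) quantile of a \(beta(k+1, n-k+1)\) distribution, which is what the Bayesian route gives under a uniform prior.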

There are at least two general – from first principles – approaches to calculating a lower bound on the fraction without knowing specific formulae. The first – Bayesian – is to assign a prior to \(p\) and then take the \(q\) quantile of the posterior distribution as a lower bound estimate. One example of this is to use a beta distribution prior (which is conjugate to the binomial distribution, so the posterior has a closed form); the posterior is then given by \(p \sim beta(k+\alpha, n-k+\beta)\), where \(beta(\alpha, \beta)\) was your prior. In R:
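A minimal sketch of the posterior quantile, assuming a uniform \(beta(1, 1)\) prior and \(q = 0.05\). Both defaults and the function name `lower_bound_beta` are illustrative choices rather than values fixed by the problem:

```r
# Lower bound on p as the q quantile of the beta posterior.
# alpha and beta are the prior parameters; (1, 1) is the uniform prior.
lower_bound_beta <- function(k, n, q = 0.05, alpha = 1, beta = 1) {
  qbeta(q, k + alpha, n - k + beta)
}

# Hypothetical example: the same observed fraction with different sample sizes
lower_bound_beta(8, 10)
lower_bound_beta(80, 100)
```

The two calls have the same raw fraction, but the larger sample yields a higher lower bound, which is exactly the behaviour wanted when ordering fractions.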

Each fraction has to be modelled as a distribution in order to take its uncertainty into account. The binomial distribution is a good choice, such that \(k \sim Binom(n,p)\) where \(k\) is the number of successes, \(n\) the number of trials and \(p\) the fraction of successful trials. We are given \(k\) and \(n\), but we are interested in the distribution of \(p\). There are many specific ways to produce a confidence interval for \(p\), the lower bound of which can be used to order the fractions, and so control the risk of over-estimation.
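As one concrete instance of such a "specific way", base R's `binom.test` returns an exact (Clopper-Pearson) interval. The sketch below uses made-up counts and a helper name of my own, `lower_bound_exact`, to show how ordering by the one-sided lower bound differs from ordering by the raw fraction:

```r
# Hypothetical data: successes k out of n trials for three items
trials <- data.frame(k = c(9, 80, 480), n = c(10, 100, 600))

# One-sided exact lower bound at level 1 - q (q = 0.05 is an assumed choice)
lower_bound_exact <- function(k, n, q = 0.05) {
  binom.test(k, n, alternative = "greater", conf.level = 1 - q)$conf.int[1]
}

trials$fraction <- trials$k / trials$n
trials$lower <- mapply(lower_bound_exact, trials$k, trials$n)

# Order by the lower bound rather than the raw fraction
trials[order(-trials$lower), ]
```

Here the 9/10 item has the highest raw fraction but drops to last place once its small sample size is taken into account.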

A simple approach – with assumptions about the variance – would be to assume the count is Poisson distributed and to model it as \(x \sim Poisson(\alpha X + \beta)\), where \(X\) is the factor matrix, and then to order the cells by the probability of the observed count under the fitted model, \(p(x \mid X, \alpha, \beta)\), and take the top \(k\). A more advanced approach may be to use a continuous distribution (of which there are more choices) and to also model the variance prior to sorting by the probability.
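A rough sketch of the Poisson version on simulated data. Everything here is an assumption for illustration: the factor names `houses` and `other_crime`, the coefficients, the planted anomaly in cell 17, and the use of `glm`'s default log link rather than the linear rate written above. I also read "top \(k\)" as the \(k\) cells whose observed count is least probable under the fitted model.

```r
set.seed(1)
n_cells <- 200

# Hypothetical "unsurprising" factors for each cell
houses <- rpois(n_cells, 50)
other_crime <- rpois(n_cells, 20)

# Simulated burglary counts driven by those factors
lambda <- exp(0.5 + 0.02 * houses + 0.03 * other_crime)
burglaries <- rpois(n_cells, lambda)
burglaries[17] <- burglaries[17] + 40  # plant one unexpected cell

cells <- data.frame(burglaries, houses, other_crime)

# Fit the expected count from the unsurprising factors (log link assumed)
fit <- glm(burglaries ~ houses + other_crime, family = poisson(), data = cells)

# Probability of each observed count under its fitted Poisson mean
prob <- dpois(cells$burglaries, fitted(fit))

# The k least probable cells, most surprising first
top_k <- order(prob)[1:5]
top_k
```

Note that this ranks both unexpectedly high and unexpectedly low counts as surprising; a tail probability via `ppois` would be one way to restrict it to high counts only.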

Lots of unsurprising factors will explain the number of domestic burglaries in a cell, such as the number of property-related crimes more generally, the number of houses/apartments, average income, etc. You should realise that an “expected” count will be well modelled by the unsurprising factors and that an unexpected count will be a deviation from this model.

3. Estimating the number of buses

It often helps to consider what the data generating distribution for a problem could be. In this case, some hidden \(k\) buses, each with the same chance of getting picked, and a sample size of 14 is just like rolling a \(k\)-sided die 14 times and writing down how many times each side occurred: the multinomial distribution. We can simulate our situation by generating a count vector given \(n\) samples for \(k\) buses, and removing the zeros.

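library(magrittr)  # provides the %>% pipe

# Draw n samples uniformly from k buses and keep only the non-zero counts
simulate <- \(k, n) rmultinom(1, n, rep(1/k, k)) %>% as.vector %>% .[. > 0]

# For k=10 buses
simulate(10, 14)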

## [1] 1 2 1 3 2 4 1

The complication here is that it does not help us to use the multinomial distribution on the resulting sample, because we cannot fix \(k\) and so do not know which “slot” each count should go in, nor how many zero counts there are. However, it is not difficult to come up with a density function from first principles.
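Before deriving anything, the simulation itself can serve as a sanity check. The sketch below is not that density function; it just repeats the draw many times for an assumed \(k = 10\) and tabulates how many distinct buses show up in 14 samples. The helper name `distinct_seen` and the number of repetitions are my own choices.

```r
set.seed(42)

# Number of distinct buses observed in n uniform samples from k buses
distinct_seen <- function(k, n) sum(rmultinom(1, n, rep(1 / k, k)) > 0)

# Repeat the experiment for k = 10 buses and n = 14 samples
draws <- replicate(2000, distinct_seen(10, 14))

table(draws)  # how often each number of distinct buses occurs
mean(draws)   # compare with the expectation k * (1 - (1 - 1/k)^n)
```

For uniform sampling the expected number of distinct buses is \(k(1 - (1 - 1/k)^n)\), so the simulated mean gives a quick way to check that the generator behaves as intended.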
