Bayesian methods are well-established analytical tools. Along with frequentist methods (see Chapter 2), these comprise the two dominant approaches to statistical inference. (A third group of methods, called fiducial or likelihood inference, has found less favor.) Reviews of all three perspectives can be found in Young et al. (2005).
The Bayesian approach differs from the frequentist approach in several important ways. First, a Bayesian expresses uncertainty as a probability, which a frequentist is not allowed to do. For example, when a frequentist sets a confidence interval, he may not say something such as “the probability that the interval contains the true mean of the population is 0.95.” Instead, he must say that “95% of similarly constructed intervals will contain the true mean.” This is because the interval either contains the true mean or it does not. The frequentist does not know which of these is correct, but he may not express his uncertainty as a probability.
In contrast, a Bayesian seeing the same data will construct a credible interval, a region which, to a Bayesian, has some pre-specified probability of containing the unknown parameter value. In some cases a credible interval with probability 0.95 of containing the parameter value is in exact numerical agreement with the 95% confidence interval. But the philosophy and interpretation of a credible interval is very different from that of a confidence interval.
A second difference is that the Bayesian may integrate over the parameter space. This is important in several ways, but one application is in hypothesis testing. A frequentist will test the null hypothesis against the alternative, and he will either reject or fail to reject the null hypothesis at some pre-specified alpha level (often 0.05 or 0.01). In contrast, the Bayesian will calculate her posterior distribution, a probability distribution that describes her belief about the value of the unknown parameter after observing the data. Then she can integrate her posterior distribution over the region corresponding to the null hypothesis, and thus calculate her probability that the null hypothesis is true.
The third difference is the most controversial. Bayesians often start with subjective beliefs about the probabilities of events. For example, based on life experience or just plain prejudice, a Bayesian juror might have an initial belief that the chance that the defendant is guilty is 0.9, which should then be modified through the use of a mathematical tool called “Bayes’ Rule” as evidence is presented. This initial belief is difficult to square with the legal tenet that a defendant should be presumed innocent until guilt is proven, but it may be a more realistic representation of juror thinking than to assume each juror is a blank slate upon which the lawyers paint their narratives.
Savage (1954) proved that under a very general and reasonable definition of rationality, any rational person must act as a Bayesian. Specifically, if agents have beliefs that can be characterized through a probability function on some space, then rational agents should make decisions that maximize their expected utility function (i.e., decisions that, on average, provide the largest benefit or least loss where the benefit or loss need not be monetary). And this compels them to act as Bayesians. On the other hand, another theorem shows that a committee (such as a jury) cannot act collectively as a Bayesian unless they initially hold identical opinions (prior beliefs or simply priors) about the probability distributions for all relevant unknowns in the problem and also agree on the mechanism for generating the observations (cf. Kadane et al., 1993). The usual way of addressing this problem is to invoke another theorem, which states that under general conditions, two Bayesians who observe the same stream of data will converge in their beliefs. Thus, in the context of a jury trial, each juror will have separate prior beliefs, and the lawyers would attempt to present enough evidence that all members of the jury would converge towards unanimity (as required in many states), so that all would vote for acquittal or conviction. Some jurors might need little evidence, others a great deal, but all opinions should be changing in the same direction.
Bayes’ Theorem (or Bayes’ Rule) is a completely standard and universally accepted consequence of elementary probability. Recall that the conditional probability of an event A given that event B is observed is defined as

P(A | B) = P(A and B) / P(B),   provided P(B) > 0.
Let A1, …, Ak be a finite partition of the set of possible outcomes, where a finite partition means that one of A1, …, Ak must happen, but the events are mutually exclusive (disjoint), meaning that it is impossible for two or more of them to happen at the same time. In drawing a card from a standard deck, one possible finite partition is Clubs, Diamonds, Hearts and Spades; another finite partition is Red and Black; a third is aces, twos, …, kings.
Bayes’ Rule says that if A1, …, Ak is a finite partition and one observes the event B, then the probability of Ai given B is

P(Ai | B) = P(B | Ai) P(Ai) / [P(B | A1) P(A1) + ⋯ + P(B | Ak) P(Ak)].   (3.1)
Suppose a random person is given a blood test (e.g., the person wants to join the army, and the blood test is a routine part of the physical examination). And suppose the ELISA test signals that the person has HIV. What is the chance that the ELISA test is correct and the subject is truly HIV positive? Note that having HIV or not having HIV constitutes a finite partition with just two events: one of the two must be true, but it is not possible for both to be true. Using Bayes’ Rule, one finds that the probability of being HIV positive given a positive ELISA test is

P(HIV+ | test+) = P(test+ | HIV+) P(HIV+) / [P(test+ | HIV+) P(HIV+) + P(test+ | HIV−) P(HIV−)].
This calculation requires that the person being tested is chosen at random (i.e., in a manner that is unrelated to HIV status). It would not apply to someone who came to the physician because he felt ill. Also note that this example can be generalized to breathalyzer tests, fingerprint matches, and other forensic-science tests.
In this example, the Bayesian analyst did not need to use subjective probability, because she knew the baseline prevalence of HIV in the U.S. population and the problem assumed that the person being tested was a random draw from that population, allowing her to use an objective probability. However, if the person were being tested because he felt ill or had engaged in risky behavior, then a Bayesian physician would have a different, probably subjective, probability of HIV infection, and that probability would be larger than 0.0034.
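This updating can be sketched in a few lines of Python. The prevalence, sensitivity, and specificity below are hypothetical placeholders chosen only for illustration; they are not the chapter’s figures.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """Bayes' Rule for P(infected | positive test) over the two-event
    partition {infected, not infected}."""
    true_pos = sensitivity * prevalence            # P(test+ | HIV+) P(HIV+)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # P(test+ | HIV-) P(HIV-)
    return true_pos / (true_pos + false_pos)

# Hypothetical operating characteristics: prevalence 0.3%,
# sensitivity 99.7%, specificity 98.5%.
p = posterior_given_positive(0.003, 0.997, 0.985)
```

With a low base rate, even a very accurate test yields a surprisingly modest posterior probability, which is the point of the example.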
Although Bayesian inference has compelling intellectual properties, in its mathematical form it generally requires the evaluation of complex integrals. Specifically, the integral form of Bayes’ Rule, which is the analogue of the discrete form given in (3.1), is

π(θ | x1, …, xn) = f(x1, …, xn | θ) π(θ) / ∫ f(x1, …, xn | θ′) π(θ′) dθ′,   (3.2)

where π(θ) is the prior density for the unknown parameter θ and f is the distribution that generates the data.
The integral in the denominator of (3.2) was a stumbling block to the use of Bayesian inference for several centuries. In most cases, the integral could not be evaluated, and thus the updating calculation could not be completed. Essentially, the only circumstances in which tractable calculation was possible were with conjugate families. In conjugate families, the prior belief on the value of the unknown parameter is assumed to be a member of a convenient family, and the distribution that generates the data, conditional on the value of the parameter, is also assumed to be a member of a convenient family. The convenience of these two families derives from the fact that when the prior belief about the parameter is updated by the data to obtain the new (posterior) belief about the parameter, it turns out that the new belief is in the same distributional family as the prior. The three most prominent of these conjugate families are the beta-binomial, the normal-normal, and the gamma-Poisson families.
Two hypothetical applications illustrate different facets of Bayesian reasoning. These examples will use the beta-binomial and gamma-Poisson families.
The beta-binomial is used when making inferences about a proportion or a probability. For example, suppose a forensic accountant wants to estimate the proportion of fraudulent claims filed with an insurance company by an automobile repair chain. Formally, one would have to assume that an infinite number of claims have been filed, and that the forensic accountant is sampling this population of claims at random. In practice, it is sufficient to simply assume that a very large number of claims have been filed, but it is still essential that the sampling be random.
The binomial distribution describes the number of “successes” in a fixed number of trials, where the chance of success on each trial is independent with constant probability θ. For n trials, the probability of exactly k successes is

P(X = k) = (n choose k) θ^k (1 − θ)^(n−k),   for k = 0, 1, …, n.
In our example, the forensic accountant decided how many claims to sample; this is her n. She then draws a random sample of n claims and audits each. A success is finding a fraudulent claim. If she records a 0 for a failure and a 1 for a success, then, in the notation of (3.2), she observes x1, …, xn, where each observation xi is either a one or a zero. Based on this sample, she now wants to make an inference about θ, the proportion of fraudulent claims.
As a Bayesian, she has a prior opinion about the value of θ. If that opinion happens to be expressible as a beta distribution, then she is in a conjugate family and can calculate the awkward integral in the denominator of (3.2).
A beta random variable may take any value between 0 and 1, inclusive. Since the beta distribution describes a continuous random variable, the probability that it takes any specific value is zero, but if one integrates the density function of a beta distribution between, say, 0 and 0.5, one obtains the probability that its value is less than or equal to 0.5.
The beta distribution is indexed by two parameters, traditionally represented as α and β. Both α and β must be greater than zero. The density function of a random variable θ with Beta(α, β) distribution is

f(θ) = [Γ(α + β) / (Γ(α) Γ(β))] θ^(α−1) (1 − θ)^(β−1),   for 0 ≤ θ ≤ 1,   (3.3)

where Γ denotes the gamma function.
The beta family is quite flexible, and can represent many possible beliefs about θ. Figure 3.1 shows the shape of the density function for three possible pairs of indices (α, β).
Figure 3.1 This figure shows three different densities for the beta distribution, which illustrate the wide range of prior beliefs that can be incorporated. The case α = β = 1 is uninformative about the probability of success θ; the case α = β = 0.5 represents the belief that θ is probably close to zero or one; and the case α = 3, β = 1 puts more weight on larger values of θ.
Note that when α = β = 1, the density is flat. The forensic accountant might use this density as her prior if she were completely agnostic about θ and thought that all values between 0 and 1 were equally likely. This would be a noninformative prior, since no value of θ is favored. (For a discussion of using a flat prior with data from a Supreme Court case, see Kaye, 1982, p. 779.) Alternatively, it might be reasonable to think that the automobile repair chain is either mostly honest or mostly dishonest, in which case she could use the informative beta prior whose density has α = β = 0.5. Or, based on other evidence, she might believe that the chain is a criminal enterprise, and then she might pick the density corresponding to α = 3 and β = 1.
At this point the Bayesian forensic accountant has collected the sample x1, …, xn and selected the α and β that specify her beta prior. She now wants to solve (3.2) in order to determine her new belief about the density of θ, the probability of a fraudulent claim given what she has learned from her sample. She can replace x1, …, xn by the number of successes, i.e., the number of fraudulent claims. (In statistics, there is often a sufficient statistic, one that summarizes all the relevant information in the data; in this case one can show that the number of successes k = x1 + ⋯ + xn is sufficient for inference on θ.) Then the posterior density is

π(θ | k) = θ^(k+α−1) (1 − θ)^(n−k+β−1) / ∫₀¹ u^(k+α−1) (1 − u)^(n−k+β−1) du.
The trick to making this conjugate family work is to recognize that the denominator has integrated out θ, so it is some constant. And the numerator is a member of the beta family, since its kernel (the terms in the integrand that depend upon θ) is the product of θ to a power times 1 − θ to a power. So the constant in the denominator, combined with the constants in the numerator, must be the value that forces the integral of the density to equal 1. From (3.3), it is clear that the constant must be Γ(n + α + β)/[Γ(k + α) Γ(n − k + β)].
The posterior density is the complete expression of the forensic accountant’s new belief. She can use it in several ways. For example, the district attorney may have an informal rule of thumb that if the proportion of invalid claims is less than 3%, then it is probably error rather than fraud, and not worth prosecutorial effort. In that case, the forensic accountant would integrate her posterior for θ over the region from 0 to 0.03, to calculate her probability that the district attorney should decline to prosecute.
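A minimal sketch of this conjugate update in Python, using hypothetical audit numbers (a flat Beta(1, 1) prior and 4 fraudulent claims found among 200 sampled); the Monte Carlo step approximates the integral of the posterior over the district attorney’s 0 to 0.03 region:

```python
import random

def beta_binomial_update(alpha, beta, k, n):
    """Conjugate update: a Beta(alpha, beta) prior and k successes in
    n trials yield a Beta(alpha + k, beta + n - k) posterior."""
    return alpha + k, beta + n - k

# Hypothetical audit: flat prior, 4 fraudulent claims among 200 sampled.
a_post, b_post = beta_binomial_update(1, 1, 4, 200)

# Exact posterior mean, using mu = alpha / (alpha + beta).
post_mean = a_post / (a_post + b_post)

# Monte Carlo estimate of P(theta <= 0.03), the region where the
# district attorney's rule of thumb says "decline to prosecute".
random.seed(42)
draws = [random.betavariate(a_post, b_post) for _ in range(20_000)]
p_decline = sum(d <= 0.03 for d in draws) / len(draws)
```

The sampling step stands in for the integral; with a conjugate posterior one could also evaluate the beta distribution function directly.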
But it is instructive to look at the mean and variance of her posterior distribution, to see how the new data have affected her previous opinion. One can show that the mean μ and variance σ² of a beta distribution with parameters α and β are

μ = α/(α + β)   and   σ² = αβ/[(α + β)²(α + β + 1)].
Looking at the posterior variance, one sees that

σ² = (k + α)(n − k + β)/[(n + α + β)²(n + α + β + 1)],

which shrinks toward zero as the sample size n grows. Thus the data steadily sharpen the analyst’s belief, and the influence of the prior fades.
Before leaving this mock example, there are a few points to underscore regarding the use of subjective priors. First, Savage’s results on the rationality of Bayesian decision makers require the analyst to use her true subjective prior, rather than a noninformative (or objective) prior. Second, noninformative priors are not simple; they may not exist, they may not be unique, and they may be improper (meaning that their integral is infinite, rather than equal to 1). For the beta-binomial example, the Haldane prior, the Jeffreys prior, and the maximum entropy prior are all, in different technical senses, objective priors (cf. Kass and Wasserman, 1996). Third, as the sample size grows, all Bayesians who put non-zero probability on the entire domain of θ will converge to the same opinion, no matter which prior each chose as a starting point.
For another mock case study that illustrates several facets of Bayesian reasoning, suppose that a hospital review system flags a nurse who has had an unusually large number of his patients die during his shift. The police are consulted, and the district attorney wants to determine how improbable that observed number of deaths might be. The matter is complicated since well-known cases, such as that of Lucia de Berk (cf. Derksen and Meijsing, 2009), have highlighted the fact that, just by chance, an innocent nurse may have an unusual number of fatalities on his shifts. And if hospitals monitor many, many nurses, then it is certain that some innocent nurses will seem suspicious.
The analysis in this situation could entail the use of the gamma-Poisson conjugate family. The Poisson distribution describes the number of independent events that occur in a fixed amount of time, a fixed area, or a fixed volume. Assuming that an innocent nurse has independent deaths on his watch, it is reasonable to model the number of deaths as a Poisson random variable with unknown mean θ. For expository simplicity, it is helpful to assume that the period of time in this application is one year.
A Poisson random variable can only take the values 0, 1, …, without any upper limit. (Obviously, in applications, there is generally some upper bound, so this is technically only an approximation, but it is widely used and broadly accurate.) The probability of observing exactly k deaths is

P(X = k) = θ^k e^(−θ)/k!,   for k = 0, 1, 2, ….
To form a conjugate family, the prior belief about θ must be represented by a gamma distribution. The investigator would have to form that prior based on personal experience, other data, or input from experts. For example, if the nurse in question worked with geriatric patients, then one would expect more natural deaths than if he worked in the pediatric unit. And hospital administrators could provide data on death rates in similar units at similar hospitals. The investigator would combine this information with her own subjective beliefs to specify her prior.
The gamma family is a group of distributions indexed by two parameters, α and β, both of which must be greater than zero. The gamma density function for a random variable θ is written as

f(θ) = [β^α/Γ(α)] θ^(α−1) e^(−βθ),   for θ > 0.
The gamma family includes the exponential distributions and the chi-squared distributions as special cases. Figure 3.2 illustrates some of the beliefs about θ that can be described through a gamma distribution.
Figure 3.2 This figure shows three different densities for the gamma distribution, which illustrate the kinds of prior belief that may be represented. The case α = 1, β = 0.5 corresponds to an exponential distribution; the other two cases show different means, variances, and skewness.
The mean of a gamma distribution is α/β, and the variance is α/β². When specifying a subjective prior, it is often helpful to try to quantify one’s uncertainty by thinking about the mean ν of the random variable θ and one’s uncertainty about its value (perhaps expressed as the range [L, U] that one believes has probability 0.95 of including the true mean of the distribution). Then one can solve the system of equations

ν = α/β   and   σ² = α/β²   (3.4)

for α and β, where σ is the standard deviation implied by one’s elicited uncertainty.
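A sketch of this elicitation in Python, assuming the uncertainty is summarized by a mean and standard deviation rather than the interval [L, U]:

```python
def gamma_from_mean_sd(mean, sd):
    """Match a prior mean and standard deviation to gamma parameters by
    solving alpha / beta = mean and alpha / beta**2 = sd**2."""
    beta = mean / sd**2
    alpha = mean * beta
    return alpha, beta

# A prior mean of 2 deaths per year with standard deviation 2 gives the
# Gamma(1, 0.5) prior used in the mock case below.
alpha, beta = gamma_from_mean_sd(2.0, 2.0)
```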
If there are data on the number of deaths for nurses who are not under suspicion, say x1, …, xn, then the analyst can update her subjective prior through Bayes’ Rule to find her posterior belief about the distribution of θ for innocent nurses. Suppose the initial subjective prior has parameters α and β. Then it turns out that the posterior will be a gamma density with parameters α + x1 + ⋯ + xn and β + n. This calculation can be a little tricky, since one must use data from other nurses who work with similar patients (e.g., geriatric or pediatric), and one must take care to adjust for the differing number of hours each nurse worked during the year.
Returning to the hypothetical case, suppose the investigator wanted to find the posterior distribution for the annual death rate θ of the suspect nurse. As will be seen, this is not the best thing to do, but it is a starting point for a naive Bayesian analysis.
Her use of a prior implicitly accords with the fact that not all nurses have the same value of θ. A well-trained and experienced nurse might have a θ value slightly smaller than a less experienced nurse. This is related to random effects models that are discussed in the next section.
If x people have died on the suspect nurse’s watch, then the conjugacy property shows that the posterior distribution for the suspect nurse’s annual death rate θ is a gamma distribution with parameters α + x and β + 1. If this density has a mean that is very much larger than the prior mean α/β, then that could be interpreted as evidence against the nurse.
A frequentist might want to decide whether the nurse’s annual death rate θ is less than or equal to θ0, the average value for all similar nurses, or perhaps a value that is set by hospital policy as a proficiency standard. The frequentist would then test the null hypothesis that the suspect nurse’s θ is less than or equal to θ0 against the alternative hypothesis that the nurse’s θ is greater than θ0. This entails calculation of a test statistic and, from that, a significance probability. If the significance probability were very low, say less than 0.01, then the frequentist would reject the null hypothesis and conclude that the nurse’s mean annual death rate was greater than θ0 (see Chapter 2). Technically, the significance probability is the chance of observing a death count as high or higher than the suspect nurse’s count when the null hypothesis is true.
From a legal standpoint, one must be careful about the conclusion drawn from rejection of the null hypothesis. The investigator decides that the suspect nurse’s fatality rate is larger than the specified null value, but this could be due to incompetence rather than criminality, or a management policy that assigns the most senior nurse to the sickest patients. Nonetheless, such analyses are sometimes undertaken as part of an investigation (for cases and discussions, see Fienberg and Kaye, 1991; Loue, 2010).
Although it may not be the best approach, the Bayesian analogue to this hypothesis test would calculate the posterior probability that the nurse’s θ was greater than θ0. The Bayesian calculates the posterior distribution for the nurse’s θ, and then integrates over the parameter space from θ0 to infinity, to find

P(θ > θ0 | x) = ∫_{θ0}^{∞} π(θ | x) dθ.
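For a gamma posterior with integer shape, this tail integral has a closed form via a Poisson sum. The sketch below uses the mock prior values α = 1 and β = 0.5 with x = 12 deaths, and a hypothetical threshold θ0 = 4; these inputs are illustrative only:

```python
import math

def gamma_survival(a, b, t):
    """P(theta > t) for Gamma(a, b) with integer shape a, via the
    identity P(theta > t) = P(Poisson(b*t) <= a - 1)."""
    rate = b * t
    return sum(math.exp(-rate) * rate**j / math.factorial(j) for j in range(a))

# Posterior for the suspect nurse: Gamma(alpha + x, beta + 1).
a_post, b_post = 1 + 12, 0.5 + 1
p_exceeds = gamma_survival(a_post, b_post, 4.0)
```

A value of p_exceeds near 1 would say that, under this posterior, the nurse’s rate almost certainly exceeds the threshold.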
A different analysis might consider the probability of observing x or more deaths if the distribution for the nurse’s death rate θ were a draw from the prior distribution. This is not actually a Bayesian analysis, since, among other things, Bayesians do not condition on unobserved events (i.e., Bayesians do not find the probability of seeing more deaths than were counted). Instead, it is an attempt to harness a Bayesian tool to a frequentist purpose. However, note that in this situation, the frequentist logic forces the analyst to calculate the probability of x or more deaths, since the probability of seeing exactly x deaths can be misleadingly small; the probability of seeing exactly ten ants in a garden is low, but the chance of seeing ten or more could be large.
As before, this calculation requires integration over the parameter space. This is not permitted in frequentist analyses, but it allows the Bayesian investigator to calculate the probability of observing exactly x deaths when the uncertainty about the death rate is expressed as a distribution, which can then be repeated for x + 1, x + 2, and so forth. This hybrid Bayes-frequentist approach incorporates the fact that not all nurses have the same value of θ—innocent nurses presumably have θ values centered at some lower value than a guilty nurse’s.
Specifically, the Bayesian investigator solves

P(X ≥ x) = Σ_{k≥x} ∫₀^∞ [θ^k e^(−θ)/k!] f(θ; α, β) dθ,

where f(θ; α, β) is the gamma prior density.
If the probability of observing a nurse with x or more deaths in, say, a year is very small, then this could be considered evidence of homicide. A numerical example is helpful. Suppose the investigator assumes that nurses in this kind of ward will have an average of 2 patient deaths per year, with a standard deviation of 2. Then, from (3.4), the gamma prior has α = 1 and β = 0.5. Suppose that the suspect nurse has had 12 patients die during his shift this year. Then, for this prior, the probability of exactly that many deaths is

P(X = 12) = ∫₀^∞ [θ^12 e^(−θ)/12!] f(θ; 1, 0.5) dθ = (1/3)(2/3)^12 ≈ 0.00257.
To find the probability of 12 or more deaths in a year, the investigator calculates

P(X ≥ 12) = 1 − Σ_{k=0}^{11} P(X = k) = (2/3)^12 ≈ 0.00771.
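These two quantities are easy to check numerically. Integrating the Poisson likelihood against a gamma prior gives a negative binomial distribution, a standard result; the sketch below reproduces the 0.00257 and 0.00771 figures:

```python
import math

def prior_predictive_pmf(k, alpha, beta):
    """P(X = k) when X | theta ~ Poisson(theta) and theta ~ Gamma(alpha, beta);
    integrating theta out gives a negative binomial distribution."""
    log_coef = math.lgamma(k + alpha) - math.lgamma(alpha) - math.lgamma(k + 1)
    return math.exp(log_coef) * (beta / (beta + 1))**alpha * (1 / (beta + 1))**k

# Mock case: Gamma(1, 0.5) prior and 12 observed deaths.
p_exactly_12 = prior_predictive_pmf(12, 1.0, 0.5)
p_12_or_more = 1.0 - sum(prior_predictive_pmf(k, 1.0, 0.5) for k in range(12))
```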
However, recall that there are probably many, many nurses who work in hospitals that monitor staff for unusual fatality patterns. Just by chance, one unlucky nurse will surely accumulate a suspicious number of deaths. To correct for this, one must calculate the probability that among m monitored nurses who are innocent, at least one of them will have 12 or more fatalities during the year.
There are different strategies for making this correction. In principle, a well-calibrated Bayesian need not make any adjustment, because her prior for the guilt of the nurse takes account of the fact that there are many nurses being monitored and thus the chance that this specific nurse is criminal is automatically low. Alternatively, the False Discovery Rate of Benjamini and Hochberg (1995) admits a Bayesian interpretation (cf. Muller et al., 2006). Scott and Berger (2006) develop a Bayesian solution for multiple testing when each of the hypotheses is independent, in the sense that the guilt or innocence of one nurse is unaffected by the guilt or innocence of another, and Gelman et al. (2012) tackle the more complicated case when dependence is present. These approaches are generally somewhat technical.
For an easier path, consider a Bayesian perspective on adjustment for multiple testing (Westfall et al., 1997). Let p be the probability that an innocent nurse has k or more fatalities in a year. Then among m monitored nurses who are innocent of murder, the probability that all of them will have fewer than k fatalities is (1 − p) m . This is simple probability, and only assumes that the nurses are independent. From this, it follows that the probability that one or more nurses has k or more fatalities is 1 − (1 − p) m .
Recall the mock example, in which the nurse’s probability of having 12 or more deaths in one year is 0.00771. If 10 comparable and innocent nurses are being monitored, then the probability that at least one of them reaches this level of suspicion is 0.0745. If there are 20 comparable nurses, then the probability that one or more seems suspicious is 0.1434, and if there are 100, then the probability is 0.5388. In a large hospital system, it seems quite likely that many nurses are monitored, and so the evidence in this example is probably not determinative on its own.
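The adjustment is easy to reproduce:

```python
def prob_at_least_one_flagged(p, m):
    """Probability that at least one of m independent innocent nurses
    reaches the suspicion threshold, when each does so with probability p."""
    return 1.0 - (1.0 - p)**m

# p = 0.00771 is P(12 or more deaths in a year) from the mock example.
results = {m: round(prob_at_least_one_flagged(0.00771, m), 4)
           for m in (10, 20, 100)}
```

As m grows, the chance that some innocent nurse looks suspicious approaches certainty, which is why the raw tail probability cannot be taken at face value.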
Finally, before leaving this example, consider the classical Bayesian approach, which avoids the awkward hybrids and compromises discussed previously. It is directly comparable to the ELISA example. Either the nurse is innocent or guilty—that provides a finite partition. The investigator (or a juror) has a personal subjective prior probability for innocence or guilt, which is based upon all that they know about the circumstances before seeing the data (i.e., the number of fatalities during the year on that nurse’s shift).
To find the posterior probability of guilt, a Bayesian must build a probability model for the number of deaths for an innocent nurse and the number of deaths for a guilty nurse. Different people would likely build that model in different ways. Earlier in this section, there was a model that found the probability of exactly 12 deaths for an innocent nurse as 0.00257. Similar reasoning could build a model for the probability of a guilty nurse having 12 fatalities in one year. It is likely that conjugate families would not seem reasonable to some of these Bayesians, in which case they would use Markov chain Monte Carlo methods, as discussed in the next section.
In any case, after eliciting their personal priors for innocence and guilt, and calculating the probability of exactly 12 deaths under the model for innocent and guilty nurses, all of the Bayesians would then find the posterior probability as

P(guilty | x) = P(x | guilty) P(guilty) / [P(x | guilty) P(guilty) + P(x | innocent) P(innocent)].
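A sketch of this calculation in Python. The probability of the data under innocence (0.00257) comes from the earlier model; the prior probability of guilt and the probability of the data under guilt are hypothetical placeholders that a real analyst would have to elicit:

```python
def posterior_guilt(prior_guilt, p_data_guilty, p_data_innocent):
    """Bayes' Rule over the two-event partition {guilty, innocent}."""
    num = p_data_guilty * prior_guilt
    den = num + p_data_innocent * (1.0 - prior_guilt)
    return num / den

# Hypothetical prior P(guilty) = 0.01 and hypothetical P(12 deaths | guilty) = 0.05;
# P(12 deaths | innocent) = 0.00257 from the mock gamma-Poisson model.
post = posterior_guilt(prior_guilt=0.01, p_data_guilty=0.05, p_data_innocent=0.00257)
```

Even with a likelihood ratio of roughly 20, the small prior keeps the posterior probability of guilt well below one half.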
In Bayesian hypothesis testing, one decides which of two models (e.g., those for guilt and innocence) is more likely. The Bayes factor measures the degree to which the data support one or the other model, and it corresponds to a likelihood ratio (see Chapter 2). Formally, the Bayes factor BF(H0, HA) is the ratio of the posterior odds of the null hypothesis H0 and alternative hypothesis HA to the prior odds of the null and alternative. Specifically,

BF(H0, HA) = [P(H0 | x)/P(HA | x)] / [P(H0)/P(HA)].
A Bayes factor of, say, 5 in favor of the alternative (i.e., BF(HA, H0) = 5) means that after seeing the data, the alternative hypothesis is five times more likely relative to the null hypothesis than it was a priori, before seeing the data. In scientific settings, a Bayes factor of 10 or more is generally considered to be strong evidence for the alternative hypothesis (Jeffreys, 1998, p. 432). But in forensic settings, the standard is generally much more stringent. The Department of Justice (2018, p. 4) proposed uniform language for legal testimony in which a Bayes factor between 2 and 100 is described as “limited support,” a factor between 100 and 10,000 corresponds to “moderate support,” a factor between 10,000 and 1,000,000 is “strong support,” and a value greater than 1,000,000 is “very strong support.”
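As a small worked example with hypothetical probabilities, the Bayes factor can be computed directly from the posterior and prior odds:

```python
def bayes_factor(posterior_null, posterior_alt, prior_null, prior_alt):
    """BF(H0, HA): the posterior odds of H0 against HA divided by the
    prior odds of H0 against HA."""
    return (posterior_null / posterior_alt) / (prior_null / prior_alt)

# With even prior odds, the Bayes factor is just the posterior odds:
# posterior probabilities 0.2 and 0.8 give BF(H0, HA) = 0.25, i.e., the
# data favor the alternative by a factor of 4.
bf = bayes_factor(0.2, 0.8, 0.5, 0.5)
```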
The logarithm of the Bayes factor is called the weight of evidence. Good (1985) has a slightly theoretical discussion of this concept, partly in the context of the legal system. Also, for more extensive discussions of the use of Bayesian reasoning in court cases, see Fienberg and Kadane (1983); Kaye et al. (2011).
Conjugacy is sometimes helpful, but most realistic applications are more complex. It is likely that the distribution of the data and the distribution of the honest Bayesian belief do not correspond to a conjugate family, and thus, in general, there is no solution that can be expressed in a finite number of standard mathematical terms—some numerical approximation is needed, and these approximations can be computationally intractable or difficult to interpret. This limited the use of Bayesian methods for almost a century.
Gelfand and Smith (1990) found a way forward. Instead of trying to approximate the posterior density analytically, they developed a computer-intensive procedure that could sample repeatedly from the posterior distribution. Variants of this procedure are called Gibbs Sampling and Markov chain Monte Carlo (MCMC). The former is a special case of the latter, and these procedures have been further developed and improved by many researchers.
When one can draw large samples from the posterior density, it is possible to estimate the density with as much accuracy as the application requires. With a sample of size n, one can put weight 1/n on each observation to obtain an estimator that is a (possibly multivariate) histogram that converges to the true density function. If more accuracy is needed, one can simply increase n.
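To illustrate, here is a minimal random-walk Metropolis sampler, a simple MCMC variant, targeting the Beta(5, 197) posterior from a hypothetical audit (4 successes in 200 trials under a flat prior). The beta target is chosen only because its true mean, 5/202, is known exactly and can be used to check the output:

```python
import math
import random

def log_beta_kernel(theta, a, b):
    """Log of the Beta(a, b) density kernel; the normalizing constant is
    omitted, which is all Metropolis needs."""
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

def metropolis_beta(a, b, n_draws, burn_in=2000, step=0.03, seed=0):
    """Random-walk Metropolis sampler whose stationary distribution is Beta(a, b)."""
    rng = random.Random(seed)
    theta = 0.5  # deliberately poor starting point; the burn-in absorbs it
    draws = []
    for i in range(burn_in + n_draws):
        proposal = theta + rng.uniform(-step, step)
        if 0 < proposal < 1:  # proposals outside the support have density 0
            log_ratio = log_beta_kernel(proposal, a, b) - log_beta_kernel(theta, a, b)
            if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
                theta = proposal
        if i >= burn_in:  # record only post-burn-in draws
            draws.append(theta)
    return draws

# The histogram of these draws approximates the posterior density, and the
# sample mean approximates the exact posterior mean 5/202.
draws = metropolis_beta(5, 197, 50_000)
mcmc_mean = sum(draws) / len(draws)
```

In real applications the target is a posterior with no closed form, and only its unnormalized density need be evaluated, which is exactly what makes MCMC so useful.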
These procedures are not foolproof, and a number of technical problems can arise (although these pathologies generally do not occur in reasonably posed problems—MCMC is robust).
One potential problem is that the Markov chain can take a long time to converge to its stationary distribution, the point at which the samples that are drawn are taken from the true posterior distribution. The usual solution is to let the chain run for several thousand iterations, discard those draws, and then begin recording the sample values (this initial period is called the burn-in). To confirm that the chain has reached its stationary distribution, the statistician will typically monitor trace plots. These plots help diagnose whether convergence has been reached, as well as other technical problems that can occur.
A second issue is that, at convergence, the sampled observations are draws from the posterior, but they are not an independent (random) sample. Consecutive draws are correlated, and so one needs to draw very large samples to ensure that the empirical distribution closely approximates the true posterior distribution.
A third potential issue arises when the support of the posterior density (i.e., the region where the density is greater than zero or, equivalently, the region where it is possible to observe a data point) has a complex shape. Recall that the support of the beta density was the interval from 0 to 1, and the support of the gamma density was the positive numbers. These are simple supports. But in multivariate settings, certain geometries of the support lead to slow mixing (i.e., slow exploration of the support), so the analyst could be deceived into thinking that the histogram of the data accurately reflects the true distribution when it does not.
Multivariate parameter spaces arise when measuring the amounts of several trace elements in a glass fragment or bullet lead. For example, the FBI used to perform a test called Compositional Analysis of Bullet Lead (CABL). The test estimated the proportion of seven trace elements in a bullet: antimony, copper, arsenic, silver, tin, bismuth, and cadmium. This implies a seven-dimensional parameter space, and the goal is to estimate (θ1, …, θ7), the proportions of the trace elements in the bullet. When these means were very similar to the estimates from another bullet, the FBI experts used to testify that the bullets were manufactured on the same day, from the same “melt” of lead.
CABL testimony is no longer given, following scientific criticism (National Research Council, 2004), but it is illustrative to note that in this example the support of the posterior density will be a seven-dimensional simplex, since θi ≥ 0 for all i = 1, …, 7 and θ1 + ⋯ + θ7 ≤ 1. This simplicial region is not problematic for MCMC, but more exotic regions are.
Suppose, for example, that the support of the multivariate density looked like a dumbbell. Then the MCMC algorithm might easily wander around in one of the end weights for a very long time before finding the bar that connects to the other end weight. In that case the traceplot would look fine for a long time, and the analyst might wrongly conclude that convergence had been achieved, even though less than half of the support had been explored. (But, as is probably evident, this kind of pathology arises very rarely, and should not be problematic for routine Bayesian applications.)
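The pathology can be simulated directly. In the sketch below (a hypothetical one-dimensional "dumbbell": two well-separated normal modes), a random-walk Metropolis chain with a small step size never crosses the bar, yet its traceplot within the visited mode would look perfectly stable:

```python
import numpy as np
from scipy import stats

def log_dens(x):
    # Unnormalized mixture with two well-separated modes at -10 and +10:
    # a one-dimensional "dumbbell".
    return np.logaddexp(stats.norm.logpdf(x, -10, 1), stats.norm.logpdf(x, 10, 1))

rng = np.random.default_rng(0)
x, draws = -10.0, []
for _ in range(20000):
    prop = x + 0.5 * rng.normal()  # step is far too small to cross the "bar"
    if np.log(rng.uniform()) < log_dens(prop) - log_dens(x):
        x = prop
    draws.append(x)
draws = np.array(draws)

# Half of the support (the mode at +10) is never visited:
print(f"fraction of draws near the other mode: {np.mean(draws > 0):.3f}")
```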
The main message is that Bayesian statisticians are now able to analyze problems that are much more complex and realistic than used to be possible. Bayesian analysis has been used in many cases to interpret complex DNA mixtures (cf. Chapter 11). MCMC, the computational algorithm for doing such analyses, is a mature and reliable technique, but it should be implemented by a trained statistician, since there are certain kinds of pitfalls that require special expertise to avoid.
For additional discussion of MCMC, there are a number of possible books, at varying levels of technical detail. One that is both popular and fairly accessible is Gelman et al. (2013). Gamerman and Lopes (2006) is a bit older and more theoretical. McElreath (2018) is recent, fairly introductory, and has many examples.
Modern Bayesian analysis can address essentially any question that frequentist methods do, but there are some areas where the Bayesian perspective is especially elegant. These situations are probably rare in legal contexts, but it may be useful to sketch a few examples.
First, consider random effects models. To return to the automobile repair example, suppose one suspects that a particular shop routinely overcharges for, say, an engine tune-up. A sample of data from different repair shops reveals what each has charged for tune-ups performed during the previous year. Some shops have done many repairs, but for others, there are only a few observations. The goal is to calculate each shop’s average cost, borrowing strength from those with many observations to improve estimates for those with few.
Let Yᵢⱼ be the charge for the ith tune-up at the jth shop. Then the fixed effects model is

Yᵢⱼ = μⱼ + εᵢⱼ,

where μⱼ is the (fixed but unknown) average charge at shop j and the errors εᵢⱼ are independent N(0, σ²) random variables.
In a fixed effects model, it is very difficult to borrow strength. But in a random effects model, it is easy. Fixed effects models are not Bayesian hierarchical models, but random effects models are. One can use the Durbin-Wu-Hausman test (Hausman, 1978) to decide whether to use a fixed effects or random effects model.
The random effects model is similar to the fixed effects model but allows the parameters to be random. The model is

Yᵢⱼ = μⱼ + εᵢⱼ, with μⱼ ~ N(μ, τ²) and εᵢⱼ ~ N(0, σ²),

so the shop means are themselves draws from a common distribution centered at the overall mean μ.
Shrinkage borrows strength by saying that the charges at the jth shop tend to be similar to those at the other shops. This could enable an attorney to understand how much variation there is in average tune-up fees across different shops, and thus to better assess whether the suspect shop’s charges are extreme. This methodology would be directly applicable to cases alleging wage discrimination, excessive pollution, or medical malpractice, where the observations would be salaries in similar jobs, daily emissions of nitrogen oxide, or survival times of patients following a procedure, respectively.
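A minimal numerical sketch of this shrinkage, with entirely hypothetical charges and an assumed between-shop variance, is:

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical tune-up charges (all numbers assumed for illustration):
# shop 4 overcharges but has only three observations.
true_means = [200, 210, 190, 260]
sizes = [50, 40, 45, 3]
shops = [rng.normal(m, 30, n) for m, n in zip(true_means, sizes)]

sigma2 = 30.0 ** 2   # within-shop variance (taken as known for simplicity)
tau2 = 25.0 ** 2     # assumed between-shop variance
grand = np.mean([y.mean() for y in shops])   # overall mean across shops

for j, y in enumerate(shops, 1):
    n = len(y)
    # Posterior mean under the random effects model: a precision-weighted
    # average of the shop's own mean and the grand mean (shrinkage).
    w = (n / sigma2) / (n / sigma2 + 1 / tau2)
    est = w * y.mean() + (1 - w) * grand
    print(f"shop {j}: n={n:2d}, raw mean {y.mean():6.1f}, shrunken estimate {est:6.1f}")
```

The shop with only three observations is pulled strongly toward the grand mean, while the well-sampled shops are barely changed; this is the sense in which the model borrows strength.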
Besides fixed effects and random effects models, there are mixed effects models (cf. Sahai and Ageel, 2012, for a discussion of all three types of models). These arise when some of the variables are not random, but others are. In the repair shop example, where older vehicles would be expected to require more labor, there might be information on the age of the vehicle, Aᵢ, which is a fixed effect. In that case, the model is

Yᵢⱼ = μⱼ + βAᵢ + εᵢⱼ, with μⱼ ~ N(μ, τ²) and εᵢⱼ ~ N(0, σ²),

where β is a fixed (non-random) coefficient for vehicle age.
Let the vector of charges be Y, the vector of ages be A, and the vector of errors (individual effects) be ε (all are n × 1). There are k shops; let the k × 1 vector of shop effects be u, and let Z be the n × k incidence matrix whose (i, j) entry is 1 when observation i comes from shop j and 0 otherwise. Suppose u ~ N(0, τ²Iₖ), ε ~ N(0, σ²Iₙ), and that these two random vectors are independent.
With these assumptions (which can be made much more general), the point estimates are

β̂ = (AᵀV⁻¹A)⁻¹AᵀV⁻¹Y and û = τ²ZᵀV⁻¹(Y − Aβ̂), where V = σ²Iₙ + τ²ZZᵀ and Z is the n × k matrix assigning observations to shops:

the generalized least squares estimate of the fixed effect and the best linear unbiased predictor (BLUP) of the shop effects.
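As a concrete sketch (with simulated, entirely hypothetical data), the generalized least squares estimate of the fixed effects and the best linear unbiased predictor (BLUP) of the shop effects can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_per = 4, 10
n = k * n_per
Z = np.kron(np.eye(k), np.ones((n_per, 1)))               # n x k shop incidence matrix
A = np.column_stack([np.ones(n), rng.uniform(1, 15, n)])  # intercept + vehicle age
sigma2, tau2 = 25.0, 100.0                                # assumed variance components
u = rng.normal(0, np.sqrt(tau2), k)                       # true (simulated) shop effects
beta = np.array([150.0, 4.0])                             # true intercept and age slope
Y = A @ beta + Z @ u + rng.normal(0, np.sqrt(sigma2), n)

# Generalized least squares estimate of the fixed effects,
# and the BLUP of the shop effects:
V = sigma2 * np.eye(n) + tau2 * (Z @ Z.T)
Vinv = np.linalg.inv(V)
beta_hat = np.linalg.solve(A.T @ Vinv @ A, A.T @ Vinv @ Y)
u_hat = tau2 * Z.T @ Vinv @ (Y - A @ beta_hat)

print("beta_hat:", beta_hat.round(2))
print("u_hat:   ", u_hat.round(2))
```

With forty observations, the estimated age slope lands close to the true value of 4, and the predicted shop effects are shrunken versions of the simulated ones.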
The important point is that one can improve estimates using Bayesian methods to borrow strength from similar cases, while still incorporating known covariates. Thus, an age discrimination suit could control for fixed effects, such as years of education, while still using random effects models to identify inequities. Or one could use fixed effects such as hours of operation to assess the amount of pollution from a factory, or the severity of illness when looking at patient survival times.
A second useful technique is hierarchical Bayesian models. In these, one places a prior over the parameters in the prior distribution, and one can extend this to higher levels by placing a third prior on the parameters in the second prior (forming a hierarchy). These models tend to be robust, in the sense that the posterior distribution is less sensitive to the prior specification. Additionally, there are many situations where this approach facilitates the construction of reasonable models.
The random and mixed effects repair-shop cases were two-level hierarchical models. There was a prior on the distribution of the effect, which was input to the model for the data. In general, a three-level hierarchical model finds the posterior distribution

π(θ | x) ∝ f(x | θ) ∫∫ π₁(θ | λ) π₂(λ | η) π₃(η) dη dλ,

where f(x | θ) is the likelihood, π₁ is the prior on θ with hyperparameters λ, π₂ is the second-level prior on λ with hyperparameters η, and π₃ is the third-level prior on η.
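For illustration, a Gibbs sampler for a simple three-level normal hierarchy (data within groups, group means, and a hypermean, with the variances taken as known for simplicity; all numbers are hypothetical) takes only a few lines:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 4 groups of 12 observations each (assumed for illustration).
data = [rng.normal(m, 5, 12) for m in (48, 50, 52, 60)]
sigma2, tau2, s0_var = 25.0, 16.0, 100.0  # data, group, and hyperprior variances

mu, theta = 0.0, np.zeros(4)
mu_draws = []
for it in range(4000):
    # Level 1 -> 2: update each group mean given its data and the hypermean mu
    for j, y in enumerate(data):
        prec = len(y) / sigma2 + 1 / tau2
        mean = (y.sum() / sigma2 + mu / tau2) / prec
        theta[j] = rng.normal(mean, np.sqrt(1 / prec))
    # Level 2 -> 3: update the hypermean mu given the group means
    prec = len(theta) / tau2 + 1 / s0_var
    mean = (theta.sum() / tau2) / prec    # hyperprior mean is 0
    mu = rng.normal(mean, np.sqrt(1 / prec))
    mu_draws.append(mu)

print(f"posterior mean of the hypermean mu ~ {np.mean(mu_draws[1000:]):.1f}")
```

Each Gibbs step samples one level of the hierarchy conditional on the others, which is exactly the integration over λ and η carried out numerically.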
A third area in which Bayesian methods are attractive is variable selection. A famous example in which this would apply is the case brought against Harris Trust & Savings Bank for wage discrimination against women (U.S. Department of the Treasury v. Harris Trust and Savings Bank, 1979). The bank denied discriminating and contended that wages were based only on seniority, experience, education, and age.
The case was complex and had many facets, but a statistical analysis by Harold Roberts presented during the trial used a multiple linear regression model in which the response variable was salary and the explanatory variables were seniority, experience, education, age, and gender. Roberts did a standard frequentist t-test to determine whether the gender coefficient was significantly different from zero. The test showed that the coefficient was very different from zero, which strongly suggested that gender was indeed a factor in determining wages. (Although experts continue to use such multiple linear regression in discrimination litigation, there are better tools.)
The Bayesian alternative (not used in the case) has a prior that puts probability p on the gender coefficient being exactly zero, and probability 1 − p on the coefficient being a random draw from a very dispersed distribution, such as a normal distribution with mean 0 and standard deviation 5,000. The Bayesian would use the salary data to obtain a posterior distribution on the coefficient, and presumably would find that the posterior probability that the coefficient is zero is quite small, and that the posterior probability that the coefficient is negative (indicating discrimination against women) is large. The analyst would also calculate the posterior mean of the coefficient, which would be useful in assessing damages.
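A simplified sketch of this "spike-and-slab" calculation, using a made-up summary statistic (not the Harris Trust data): suppose the regression had estimated a gender coefficient of −2,000 dollars with a standard error of 400.

```python
import numpy as np
from scipy import stats

# Hypothetical summary statistic (an assumption for illustration):
d_hat, se = -2000.0, 400.0
p_spike, slab_sd = 0.5, 5000.0  # prior prob. of "exactly zero"; slab dispersion

# Marginal likelihood of d_hat under the spike (coefficient = 0) and under the
# slab (coefficient ~ N(0, slab_sd^2), so d_hat ~ N(0, se^2 + slab_sd^2)).
m_spike = stats.norm.logpdf(d_hat, 0.0, se)
m_slab = stats.norm.logpdf(d_hat, 0.0, np.hypot(se, slab_sd))
post_spike = 1.0 / (1.0 + (1 - p_spike) / p_spike * np.exp(m_slab - m_spike))

# Given the slab component, the coefficient's posterior is normal,
# with the usual conjugate shrinkage toward zero:
shrink = slab_sd**2 / (slab_sd**2 + se**2)
post_mean, post_sd = shrink * d_hat, se * np.sqrt(shrink)
p_negative = (1 - post_spike) * stats.norm.cdf(0.0, post_mean, post_sd)

print(f"P(coefficient is exactly zero | data) = {post_spike:.2e}")
print(f"P(coefficient is negative | data)     = {p_negative:.4f}")
print(f"posterior mean of coefficient         = {post_mean:.0f}")
```

With these assumed numbers, the posterior probability of a zero coefficient is tiny, the probability of a negative coefficient is near one, and the posterior mean is close to the observed −2,000, just as the text describes.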
The Bayesian world is big, and there are many other kinds of methods and applications that are relevant to legal questions. But Bayesian tools should not be applied without guidance from experts with advanced degrees in statistics. Nor, for that matter, should frequentist analyses be done by non-professionals.
Bayesian inference is a versatile and well-established method of modern statistical inference. People are generally more likely to express opinions in terms of probability, as a Bayesian does, than in the contorted language of frequentist exposition.
However, Bayesian methods are less common in legal cases than they are in published scientific articles. There are several reasons for this: courts have accepted the use of other methods in the past; realistic applications are computer-intensive; and using prior information in framing an analysis and defending its conclusions can create difficulties for lawyers.
A much richer discussion of the Bayesian perspective in the context of the law is given in the book Statistics and the Law, edited by DeGroot et al. (1986). It is dated only in that it precedes the invention of MCMC techniques, but the fundamental issues and philosophy are addressed from many angles. Case law and evidentiary issues are discussed in Kaye et al. (2011).
This work was done as part of the research program on Forensic Statistics that was organized by the Statistical and Applied Mathematical Sciences Institute, with support from the National Science Foundation.