
THE AMERICAN STATISTICIAN
2019, VOL. 73, NO. 2, 186–190: Teacher's Corner
https://doi.org/10.1080/00031305.2017.1341334

Teaching Bayes' Theorem: Strength of Evidence as Predictive Accuracy

Jeffrey N. Rouder (a,b) and Richard D. Morey (c)

(a) Department of Cognitive Science, University of California Irvine, Irvine, CA; (b) Department of Psychological Science, University of Missouri, Columbia, MO; (c) School of Psychology, Cardiff University, Cardiff, United Kingdom

ARTICLE HISTORY
Received March 2016
Revised May 2017

KEYWORDS
Bayes factor; Bayes rule; Bayes theorem; statistical evidence

ABSTRACT
Although teaching Bayes' theorem is popular, the standard approach, which targets posterior distributions of parameters, may be improved. We advocate teaching Bayes' theorem in a ratio form in which the posterior beliefs relative to the prior beliefs equal the conditional probability of data relative to the marginal probability of data. This form leads to an interpretation that the strength of evidence is relative predictive accuracy. With this approach, students are encouraged to view Bayes' theorem as an updating mechanism, to obtain a deeper appreciation of the role of the prior and of marginal data, and to view estimation and model comparison from a unified perspective.

As Bayesian statistics increases in popularity, it is essential to
have effective ways of teaching Bayes' theorem. In this note, we present an approach suitable for advanced undergraduates and beginning graduate students in introduction-to-mathematical-statistics or Bayesian-analysis courses. There has been much work on best methods for teaching conditional probability in introductory courses, especially with the use of frequencies and intersections (e.g., Gigerenzer and Hoffrage 1995; Albert 1997; Berry 1997). Bayes' theorem, however, should be treated separately from conditional probability. Conditional probability is useful to Bayesians and frequentists alike, but only Bayesians use conditional probability to update probabilities on parameters and models. Our focus is on the teaching of Bayes' theorem as a means of statistical inference. We assume that students are familiar with the concepts of conditional, joint, and marginal probabilities and probability distributions.

1. The Proportional Form of Bayes' Theorem

In the vast majority of texts, Bayes' theorem is stated as

$$\pi(\theta \mid Y) = \frac{p(Y \mid \theta)\,\pi(\theta)}{p(Y)}, \qquad (1)$$

where $\pi(\theta \mid Y)$ and $\pi(\theta)$ denote the posterior and prior distributions of the (possibly multivariate) parameter $\theta$, and $p(Y \mid \theta)$ and $p(Y)$ are the likelihood and marginal likelihood of the (possibly multivariate) data $Y$. For a continuous parameter, the marginal likelihood is

$$p(Y) = \int_{\Theta} p(Y \mid \theta)\,\pi(\theta)\,d\theta,$$

where $\Theta$ represents the parameter space of $\theta$. For discrete parameters, the integration is replaced by summation.

CONTACT Jeffrey N. Rouder, [email protected], Department of Cognitive Science, University of California Irvine, Irvine, CA 92697.
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/TAS.

After providing (1), many texts introduce a proportional form of Bayes' theorem:

$$\pi(\theta \mid Y) \propto l(\theta; Y)\,\pi(\theta), \qquad (2)$$

where the likelihood $l$ is the probability of the observed data $Y$, that is, $p(Y \mid \theta)$ for fixed $Y$ and variable $\theta$. This form has a handy mnemonic: "the posterior is proportional to the likelihood times the prior." The proportional form may be illustrated with graphs of priors, likelihoods, and posteriors such as those in Figure 1. Here, it may be seen that the posterior reflects the pull of both the likelihood and the prior, and that there is no posterior mass where there is no prior mass.
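To make Figure 1 concrete, the proportional form can be evaluated on a grid. The following Python sketch is our illustration, not code from the article; it borrows the beta(2.5, 1) prior and the 7-heads-in-10-flips datum used later in Section 2.

```python
# A minimal grid illustration of Equation (2): the unnormalized posterior is
# likelihood times prior, and renormalizing over the grid recovers the
# posterior without ever computing p(Y) analytically.
import numpy as np
from scipy.stats import beta, binom

theta = np.linspace(0.001, 0.999, 999)       # grid over the parameter space
prior = beta.pdf(theta, 2.5, 1)              # example prior
like = binom.pmf(7, 10, theta)               # likelihood of 7 heads in 10 flips
post = like * prior                          # proportional form
post /= post.sum() * (theta[1] - theta[0])   # numerical normalization

# The posterior peak lies between the likelihood peak (0.7) and the prior's
# pull toward 1; the posterior has no mass where the prior has none.
print(theta[np.argmax(like)], theta[np.argmax(post)])
```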

When the proportional form is used, instructors will often (correctly) state that because $p(Y)$ is not a function of $\theta$, it is simply a normalizing constant with an unknown value. Commonly used posterior sampling methods require only knowledge of the posterior distribution up to a constant of proportionality; thus, $p(Y)$ may be safely ignored. Because many courses on Bayesian statistics make heavy use of such sampling methods, it is perhaps not surprising that the proportional version is the one predominantly stressed in Bayesian texts, including Gelman et al. (2004), Jackman (2009), and Kruschke (2014), as well as in introductory mathematical statistics texts such as Hogg and Craig (1978).
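As a concrete illustration of why $p(Y)$ may be ignored in sampling, consider a bare-bones random-walk Metropolis sampler. This is our sketch under the binomial-likelihood, beta(2.5, 1)-prior example, not code from the article; the acceptance ratio uses only the posterior up to proportionality, so $p(Y)$ cancels.

```python
# Random-walk Metropolis needs only the log posterior up to a constant:
# log p(Y | theta) + log pi(theta), with p(Y) dropped.
import numpy as np

rng = np.random.default_rng(1)
Y, N = 7, 10  # 7 heads in 10 flips

def unnorm_log_post(theta):
    if not 0.0 < theta < 1.0:
        return -np.inf            # zero prior mass outside (0, 1)
    # binomial log likelihood (constants dropped) + log beta(2.5, 1) prior kernel
    return Y * np.log(theta) + (N - Y) * np.log1p(-theta) + 1.5 * np.log(theta)

theta, chain = 0.5, []
for _ in range(20000):
    prop = theta + 0.1 * rng.standard_normal()         # symmetric proposal
    if np.log(rng.uniform()) < unnorm_log_post(prop) - unnorm_log_post(theta):
        theta = prop
    chain.append(theta)

print(np.mean(chain))  # near the exact beta(9.5, 4) posterior mean, 9.5/13.5
```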

We find, however, that when the proportional version is stressed, our students miss out on critical Bayesian elements. First, they do not develop an intuition about $p(Y)$, the marginal probability of the data. This marginal is a uniquely Bayesian concept, and intuition about it is critical for understanding Bayesian model comparison and Bayesian model criticism. Second, they tend to stress the output, the posterior, rather than the process of updating. Third, students often take a dim view of priors. In our experience, they view priors as subjective and likelihoods as objective. Consequently, students left to their own devices often favor flat, diffuse priors. Fourth, because $p(Y)$ is not used in estimation and is critical in model comparison, students see estimation and model comparison as separate rather than unified.

Figure 1. The relationship between the prior, likelihood, and posterior. The relationship is based on proportionality, and values on the y-axis need not be included.

2. The Ratio Form of Bayes' Theorem

To address these difficulties and to promote a deeper understanding of Bayes' theorem, we follow a line of argument perhaps first presented by Carnap (1962). We augment the proportional form with the following ratio form:

$$\frac{\pi(\theta \mid Y)}{\pi(\theta)} = \frac{p(Y \mid \theta)}{p(Y)}. \qquad (3)$$

Though (3) is simply a rearrangement of (1), this form makes clear some important implications of Bayes' theorem. We teach it as follows:

The left-hand side of (3) concerns probabilities over parameters, and these probabilities serve as beliefs. The ratio describes how beliefs about values of $\theta$ are updated in light of data. Figure 2(a) shows an example where $\theta$ is the parameter in a binomial model. The datum in this case is 7 heads in 10 flips. The prior is a beta(2.5, 1) that slightly favors larger values of $\theta$; the posterior is beta(9.5, 4). Two example points are provided, $\theta = 0.75$ and $\theta = 0.3$. For $\theta = 0.75$, the posterior and prior densities are 3.24 and 1.62, respectively, and hence the updating factor is 2.0. Here, the datum has increased the plausibility of the point. For $\theta = 0.30$, the updating factor is 0.07, indicating that the datum has decreased the plausibility of the point. The left-hand side of (3), the updating factor, is shown as a function of $\theta$ in Figure 2(b). As side exercises, we ask students to find intervals where the data have decreased the plausibility by more than 10-to-1. We also ask students to explore how the prior affects the updating by plotting the left-hand side of (3) for different priors. This exercise can be extended to improper priors, say a beta(0, 0) prior, where the updating factor must be infinitely large. Such a peculiar state is not obvious in the proportional form, and it motivates the need for caution when using improper priors.
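The updating factors quoted above are quick to verify numerically; the following is our sketch, not the article's code:

```python
# Updating factor pi(theta | Y) / pi(theta) for the beta(2.5, 1) prior and
# the beta(9.5, 4) posterior of Figure 2(a)-(b).
from scipy.stats import beta

prior, posterior = beta(2.5, 1), beta(9.5, 4)
for theta in (0.75, 0.30):
    print(theta, posterior.pdf(theta) / prior.pdf(theta))
# roughly 2.0 at theta = 0.75 and 0.07 at theta = 0.30, as in the text
```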

We follow Jeffreys (1961), who called this updating factor the strength of the evidence from the data about $\theta$. Evidence from data is how the data license an update of beliefs.

The right-hand side of (3) concerns probabilities of observed data, and the term $p(Y)$ is new to many students who have not had Bayesian statistics. We find that some students have the mistaken intuition that this term should be 1.0 to reflect that the observed data were observed. To combat this intuition, we find it helpful to refer to probability mass functions over outcomes as predictions. We start with $p(Y \mid \theta)$, the numerator, as it is most accessible. If $\theta$ is specified, say at $\theta = 0.75$, then $p(Y \mid \theta)$ provides a probability distribution over outcomes (Figure 2(c)). Here, we can ask how well the observed datum, 7 heads in 10 flips, was predicted by the setting $\theta = 0.75$ (see the starred point). We can compare this prediction to predictions from other models, and Figure 2(d) shows the case for $\theta = 0.30$. We also can introduce other prediction patterns at this point, and show Figure 2(e) as an example. We point out to students that whatever these patterns are, they must sum to 1.0. The implication is that if the prediction for one outcome is increased, the predictions for the others must be decreased to maintain the sum. With these three figures, students can compute ratios of how much better one pattern predicted the observed datum than another. We stress understanding these plots as a priori predictions, and then using these predictions to compare models.
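A short sketch of ours makes these conditional predictions tangible; each fixed $\theta$ yields a full distribution over the eleven possible outcomes:

```python
# Conditional predictions p(Y | theta) as in Figure 2(c)-(d).
from scipy.stats import binom

N, y_obs = 10, 7
for theta in (0.75, 0.30):
    pred = binom.pmf(range(N + 1), N, theta)  # one prediction per outcome
    print(theta, pred.sum(), pred[y_obs])     # predictions sum to 1.0
# theta = 0.75 predicts the observed 7 heads with probability ~0.25;
# theta = 0.30 predicts it with probability ~0.009
```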

Once students have been introduced to the concept of comparing predictions, we switch over to the concepts of conditional and marginal predictions. Figures 2(c) and 2(d) are conditional predictions, conditional on specific values of $\theta$. Figure 2(e) is marginal over the beta(2.5, 1) prior in Figure 2(a). Here, we introduce the denominator, $p(Y)$, and its computation through the Law of Total Probability, namely, $p(Y) = \int p(Y \mid \theta)\,\pi(\theta)\,d\theta$ (Rice 2010 also uses this approach of marginalizing to form probabilities over data from priors). We call $p(Y)$ the marginal prediction of the data for prior $\pi(\theta)$. The terms $p(Y \mid \theta)$ and $p(Y)$ are plotted as a function of $\theta$ in Figure 2(f) and labeled "conditional" and "marginal," respectively. The right-hand side of (3), the ratio, is shown in Figure 2(g). We term this ratio the gain in predictive accuracy for $\theta$.
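Computationally, the marginal prediction is just the prior-weighted average of the conditional predictions; the sketch below (ours, using numerical integration) reproduces Figure 2(e):

```python
# Marginal predictions p(Y) under the beta(2.5, 1) prior (Law of Total
# Probability), one value per possible outcome.
from scipy.stats import beta, binom
from scipy.integrate import quad

N = 10
p_marg = [quad(lambda t: binom.pmf(y, N, t) * beta.pdf(t, 2.5, 1), 0, 1)[0]
          for y in range(N + 1)]
print(sum(p_marg))  # the marginal predictions also sum to 1.0
print(p_marg[7])    # marginal prediction of the observed 7 heads
```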

Bayes' theorem states the equality of the left and right sides of (3), which can be seen by noting the equality of Figures 2(b) and 2(g). The updating factor for a value of $\theta$, the strength of evidence from the data, is how well the data are predicted when conditioned on this value relative to the marginal prediction. In words, we say that "strength of evidence for a parameter value is precisely the relative gain in predictive accuracy when conditioning on it" (see Morey, Romeijn, and Rouder 2016). We may even use the short-hand mnemonic, "strength of evidence is relative predictive accuracy." We find that allowing students to make this connection between evidence and prediction provides them with a deeper insight into Bayes' theorem than is afforded by the proportional form.
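Students can check this equality numerically at any point; here is our one-point sketch:

```python
# Numerical check that the two sides of Equation (3) agree at theta = 0.75,
# i.e., that Figures 2(b) and 2(g) trace the same curve.
from scipy.stats import beta, binom
from scipy.integrate import quad

N, y = 10, 7
p_marg = quad(lambda t: binom.pmf(y, N, t) * beta.pdf(t, 2.5, 1), 0, 1)[0]
theta = 0.75
lhs = beta.pdf(theta, 9.5, 4) / beta.pdf(theta, 2.5, 1)  # posterior / prior
rhs = binom.pmf(y, N, theta) / p_marg                    # conditional / marginal
print(lhs, rhs)  # both ~2.0: evidence as relative predictive accuracy
```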


Figure 2. Updating with Bayes' theorem. (a) Prior and posterior distributions for 7 heads in 10 flips. (b) The left-hand side of (3), which is the updating factor and may be defined as the strength of evidence from the data for values of $\theta$. (c)-(d) Probability of outcomes for $\theta = 0.75$ and $\theta = 0.30$, respectively. The starred point is the observed value of 7 heads in 10 flips. We find that students best understand these as predictions about where data should be observed. (e) Marginal probability of outcomes with marginalization across $\theta$ with respect to the prior. These may be called the marginal predictions. (f) $p(Y \mid \theta)$ and $p(Y)$ as a function of $\theta$ for 7 heads in 10 flips. (g) The ratio $p(Y \mid \theta)/p(Y)$, or the gain in predictive accuracy for values of $\theta$. Bayes' rule is a statement of the equality of plots (b) and (g).

3. Unified Estimation and Model Comparison

To unify estimation and model comparison, we find it useful to introduce the concept of relative strength of evidence for competing parameter values. To start, we ask students to compare the relative evidence for two values $\theta_0$ and $\theta_1$:

$$\frac{\pi(\theta_1 \mid Y)/\pi(\theta_1)}{\pi(\theta_0 \mid Y)/\pi(\theta_0)} = \frac{p(Y \mid \theta_1)/p(Y)}{p(Y \mid \theta_0)/p(Y)} = \frac{p(Y \mid \theta_1)}{p(Y \mid \theta_0)}.$$

Here, the relative strength of evidence is the ratio of probabilities of data, or the gain in predictive accuracy. This development is an example of model comparison. We are comparing one model with a specific point value of $\theta_0$ versus another model with a specific point value of $\theta_1$. This example may be leveraged to introduce model comparison more generally, and we do so next. In general, the relative strength of evidence is the Bayes factor, and the above example shows the Bayes factor for these two constrained models.

One advantage of the ratio form is that it seamlessly unifies parameter estimation and model comparison. The inclusion of $p(Y)$ in the right-hand side of (3) indicates that even parameter estimation yields only relative evidence. All specific $\theta$ values can be thought of as restrictions on a general model with a prior across all $\theta$. Each of these restrictions is implicitly compared to the general model in which it is nested, whose marginal likelihood is $p(Y)$.

To show students the unification, we let $M_A$ be our previous model on $\theta$, the probability of heads on a flip of a certain coin, defined by

$$Y \mid \theta \sim \mathrm{Binomial}(\theta, N),$$
$$\theta \sim \mathrm{Uniform}(0, 1).$$


Figure 3. (a)-(b) The probability of outcomes under the general model, $M_A$, and under the fair-coin model, $M_B$, respectively. (c) The ratio of these probabilities is the Bayes factor between the models.

We integrate out $\theta$ to obtain $p(Y) = 0.0909$ when $Y = 7$ for 10 flips. In fact, $p(Y) = 0.0909 = 1/11$ for all values of $Y$ under this uniform prior. Figure 3(a) shows the probability of data, the predictions, for all the outcomes of the ten-flip experiment.
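This fact is easy to confirm numerically; the sketch below (ours) integrates the binomial probability against the uniform prior for every outcome:

```python
# Under the uniform prior, every outcome of the ten-flip experiment is
# equally probable a priori: p(Y) = 1/11 (Figure 3(a)).
from scipy.stats import binom
from scipy.integrate import quad

N = 10
p = [quad(lambda t: binom.pmf(y, N, t), 0, 1)[0] for y in range(N + 1)]
print(p)  # each entry ~0.0909 = 1/11
```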

Suppose we wish to compare this general model to a fair-coin model, denoted $M_B$. The fair-coin model is

$$Y \sim \mathrm{Binomial}(0.5, N).$$

The predictions of the model are given by $\binom{N}{Y}(0.5)^{N}$ and are shown in Figure 3(b).
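These predictions are a one-line computation (our sketch):

```python
# Fair-coin predictions choose(N, Y) * 0.5**N for each outcome (Figure 3(b)).
from math import comb

N = 10
preds = [comb(N, y) * 0.5 ** N for y in range(N + 1)]
print(preds[5], sum(preds))  # peaked at Y = 5 (~0.246); predictions sum to 1.0
```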
The strength of evidence for each model is given, respectively, by

$$\frac{\pi(M_A \mid Y)}{\pi(M_A)} = \frac{p(Y \mid M_A)}{p(Y)}, \quad \text{and} \quad \frac{\pi(M_B \mid Y)}{\pi(M_B)} = \frac{p(Y \mid M_B)}{p(Y)}.$$

The marginal density of the data is now marginal over all considered models. Let $\mathcal{M}$ be the class of $I$ models indexed $M_1, M_2, \ldots, M_I$. Then,

$$p(Y) = \sum_{i} p(Y \mid M_i)\,\pi(M_i).$$

The example helps students understand that the probability (or density) of data can be expressed three ways: conditional on a particular model and parameter value; conditional on a particular model but marginal across all parameters in the space for that model; or marginal across several models. For example, the probability of four heads may be conditional on a specific value of $\theta$, say $\theta = 0.5$ in $M_A$, and this probability is found from a simple binomial calculation (0.205). It may be conditional on $M_A$ but marginalized across all parameters; this calculation involves an integral across simple binomial calculations, and the value is 0.0909. Finally, the probability can be marginalized across uncertainty in whether the appropriate model is $M_A$ or $M_B$; the appropriate calculation is the weighted average of 0.0909 and 0.205, where the weights reflect $\pi(M_i)$.
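The three expressions can be computed side by side. In the sketch below (ours), equal prior model probabilities $\pi(M_A) = \pi(M_B) = 1/2$ are an assumption we add for illustration; the article leaves the weights general.

```python
# Three ways to express the probability of Y = 4 heads in 10 flips.
from scipy.stats import binom

N, y = 10, 4
p_cond = binom.pmf(y, N, 0.5)     # conditional on theta = 0.5: ~0.205
p_within_MA = 1 / (N + 1)         # marginal within MA (uniform prior): ~0.0909
p_across = 0.5 * p_within_MA + 0.5 * p_cond  # across models, assumed equal weights
print(p_cond, p_within_MA, p_across)
```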

The usefulness of relative strength of evidence for models is now a straightforward extension of the previous development. The relative strength of evidence, the Bayes factor for these models, is the ratio of predictive probabilities (densities):

$$\frac{\pi(M_A \mid Y)/\pi(M_A)}{\pi(M_B \mid Y)/\pi(M_B)} = \mathrm{BF}_{AB} = \frac{p(Y \mid M_A)}{p(Y \mid M_B)}. \qquad (4)$$

The ratio of these predictions, the Bayes factor, is shown in Figure 3(c). Had we observed a low number of heads, say $Y = 2$, we could note that the observation is better predicted under model $M_A$ (with probability 0.0909) than under model $M_B$ (with probability 0.044). The ratio is 2-to-1 in favor of $M_A$. Conversely, had we observed a moderate number, say $Y = 5$, then the observation is better predicted by the fair-coin model $M_B$ than by the more general model $M_A$. The ratio here is 2.7-to-1 in favor of $M_B$.
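The full Bayes factor curve in Figure 3(c) follows from dividing the two prediction vectors; the sketch below (ours) recovers the 2-to-1 and 2.7-to-1 values quoted above:

```python
# Bayes factor BF_AB = p(Y | MA) / p(Y | MB) for each possible outcome.
from scipy.stats import binom

N = 10
for y in range(N + 1):
    p_A = 1 / (N + 1)           # general model with uniform prior
    p_B = binom.pmf(y, N, 0.5)  # fair-coin model
    print(y, p_A / p_B)
# ~2.1 in favor of MA at Y = 2; ~0.37 at Y = 5, i.e., ~2.7-to-1 in favor of MB
```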

The definition of $p(Y \mid \theta)/p(Y)$ as predictive accuracy allows students to easily see one of the oft-touted benefits of Bayes factors: they automatically reward parsimonious models (e.g., Jefferys and Berger 1991). More parsimonious models are ones that make more specific predictions. The laws of probability require that a model with diffuse predictions has low probability (or density) for any observed data; hence, it will be harder to obtain strong evidence for a less parsimonious model unless the more parsimonious model predicts the data even less accurately.

4. Conclusion

The take-home message here is that teaching the ratio form of Bayes' theorem can have a number of benefits. First, students tend to see Bayes' theorem as a way of updating beliefs. This leads to a focus on the updating itself as much as on the resultant posterior. With such a focus, it is easier to show students how prior specification affects updating, the role of the prior in model specification, and the difficulties with improper priors. Second, students learn to reason about the strength of statistical evidence, which may not be a concept they have encountered except informally. Expressing strength of evidence as the degree to which a set of propositions can accurately predict data is particularly intuitive. Finally, students can unite estimation and model comparison as similar applications of Bayes' theorem. Students' decisions to use model comparison versus parameter estimation can then be driven by the question at hand, and not by blanket recommendations to avoid one or the other.

Funding

This research was supported by National Science Foundation grants BCS-
1240359 and SES-102408.

References

Albert, J. H. (1997), "Teaching Bayes' Rule: A Data-Oriented Approach," The American Statistician, 51, 247–253.

Berry, D. A. (1997), "Teaching Elementary Bayesian Statistics With Real Applications in Science," The American Statistician, 51, 241–246.

Carnap, R. (1962), Logical Foundations of Probability, Chicago, IL: The University of Chicago Press.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. (2004), Bayesian Data Analysis (2nd ed.), London: Chapman and Hall.

Gigerenzer, G., and Hoffrage, U. (1995), "How to Improve Bayesian Reasoning Without Instruction: Frequency Formats," Psychological Review, 102, 684–704.

Hogg, R. V., and Craig, A. T. (1978), Introduction to Mathematical Statistics, New York: MacMillan.

Jackman, S. (2009), Bayesian Analysis for the Social Sciences, Chichester, UK: Wiley.

Jeffreys, H. (1961), Theory of Probability (3rd ed.), New York: Oxford University Press.

Jefferys, W. H., and Berger, J. O. (1991), "Sharpening Ockham's Razor on a Bayesian Strop," Technical Report #91-44C, Department of Statistics, Purdue University. Available at http://quasar.as.utexas.edu/papers/ockham.pdf.

Kruschke, J. K. (2014), Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan (2nd ed.), Waltham, MA: Academic Press.

Morey, R. D., Romeijn, J. W., and Rouder, J. N. (2016), "The Philosophy of Bayes Factors and the Quantification of Statistical Evidence," Journal of Mathematical Psychology, 72, 6–18.

Rice, J. A. (2010), Mathematical Statistics and Data Analysis (3rd ed.), Belmont, CA: Thomson.



