# 20. Learning Probabilistic Models

The data used for Figure bayes-candy-figure on page bayes-candy-figure can be viewed as being generated by $h_5$. For each of the other four hypotheses, generate a data set of length 100 and plot the corresponding graphs for $P(h_i|d_1,\ldots,d_N)$ and $P(D_{N+1}=lime|d_1,\ldots,d_N)$. Comment on your results.

Repeat Exercise bayes-candy-exercise, this time plotting the values of $P(D_{N+1}=lime|h_{MAP})$ and $P(D_{N+1}=lime|h_{ML})$.

Suppose that Ann’s utilities for cherry and lime candies are $c_A$ and $\ell_A$, whereas Bob’s utilities are $c_B$ and $\ell_B$. (But once Ann has unwrapped a piece of candy, Bob won’t buy it.) Presumably, if Bob likes lime candies much more than Ann, it would be wise for Ann to sell her bag of candies once she is sufficiently sure of its lime content. On the other hand, if Ann unwraps too many candies in the process, the bag will be worth less. Discuss the problem of determining the optimal point at which to sell the bag. Determine the expected utility of the optimal procedure, given the prior distribution from Section statistical-learning-section.

Two statisticians go to the doctor and are both given the same
prognosis: A 40% chance that the problem is the deadly disease $A$, and
a 60% chance of the fatal disease $B$. Fortunately, there are anti-$A$
and anti-$B$ drugs that are inexpensive, 100% effective, and free of
side-effects. The statisticians have the choice of taking one drug,
both, or neither. What will the first statistician (an avid Bayesian)
do? How about the second statistician, who always uses the maximum
likelihood hypothesis?

The doctor does some research and discovers that disease $B$ actually
comes in two versions, dextro-$B$ and levo-$B$, which are equally likely
and equally treatable by the anti-$B$ drug. Now that there are three
hypotheses, what will the two statisticians do?

Explain how to apply the boosting method of Chapter concept-learning-chapter to naive Bayes learning. Test the performance of the resulting algorithm on the restaurant learning problem.

Consider $N$ data points $(x_j,y_j)$, where the $y_j$s are generated from the $x_j$s according to the linear Gaussian model in Equation (linear-gaussian-likelihood-equation). Find the values of $\theta_1$, $\theta_2$, and $\sigma$ that maximize the conditional log likelihood of the data.

Consider the noisy-OR model for fever described
in Section canonical-distribution-section. Explain how
to apply maximum-likelihood learning to fit the parameters of such a
model to a set of complete data. (*Hint*: use the chain
rule for partial derivatives.)

This exercise investigates properties of
the Beta distribution defined in
Equation (beta-equation).

1. By integrating over the range $[0,1]$, show that the normalization
constant for the distribution $[a,b]$ is given by
$\alpha = \Gamma(a+b)/\Gamma(a)\Gamma(b)$ where $\Gamma(x)$ is the **Gamma function**,
defined by $\Gamma(x+1)x\cdot\Gamma(x)$ and
$\Gamma(1)1$. (For integer $x$,
$\Gamma(x+1)x!$.)

2. Show that the mean is $a/(a+b)$.

3. Find the mode(s) (the most likely value(s) of $\theta$).

4. Describe the distribution $[\epsilon,\epsilon]$ for very
small $\epsilon$. What happens as such a distribution is updated?

Consider an arbitrary Bayesian network, a complete data set for that network, and the likelihood for the data set according to the network. Give a simple proof that the likelihood of the data cannot decrease if we add a new link to the network and recompute the maximum-likelihood parameter values.

Consider a single Boolean random variable $Y$ (the “classification”).
Let the prior probability $P(Y=true)$ be $\pi$. Let’s try to
find $\pi$, given a training set $D=(y_1,\ldots,y_N)$ with $N$
independent samples of $Y$. Furthermore, suppose $p$ of the $N$ are
positive and $n$ of the $N$ are negative.

1. Write down an expression for the likelihood of $D$ (i.e., the
probability of seeing this particular sequence of examples, given a
fixed value of $\pi$) in terms of $\pi$, $p$, and $n$.

2. By differentiating the log likelihood $L$, find the value of $\pi$
that maximizes the likelihood.

3. Now suppose we add in $k$ Boolean random variables
$X_1, X_2,\ldots,X_k$ (the “attributes”) that describe each sample,
and suppose we assume that the attributes are conditionally
independent of each other given the goal $Y$. Draw the Bayes net
corresponding to this assumption.

4. Write down the likelihood for the data including the attributes,
using the following additional notation:

- $\alpha_i$ is $P(X_i=true \| Y=true)$.

- $\beta_i$ is $P(X_i=true \| Y=false)$.

- $p_i^+$ is the count of samples for which $X_i=true$
and $Y=true$.

- $n_i^+$ is the count of samples for which $X_i=false$
and $Y=true$.

- $p_i^-$ is the count of samples for which $X_i=true$
and $Y=false$.

- $n_i^-$ is the count of samples for which $X_i=false$
and $Y=false$.

\[*Hint*: consider first the probability of seeing a
single example with specified values for $X_1, X_2,\ldots,X_k$ and
$Y$.\]

5. By differentiating the log likelihood $L$, find the values of
$\alpha_i$ and $\beta_i$ (in terms of the various counts) that
maximize the likelihood and say in words what these
values represent.

6. Let $k = 2$, and consider a data set with 4 all four possible
examples of thexor function. Compute the maximum
likelihood estimates of $\pi$, $\alpha_1$, $\alpha_2$, $\beta_1$,
and $\beta_2$.

7. Given these estimates of $\pi$, $\alpha_1$, $\alpha_2$, $\beta_1$,
and $\beta_2$, what are the posterior probabilities
$P(Y=true | x_1,x_2)$ for each example?

Consider the application of EM to learn the parameters for the network
in Figure mixture-networks-figure(a), given the true
parameters in Equation (candy-true-equation).
1. Explain why the EM algorithm would not work if there were just two
attributes in the model rather than three.
2. Show the calculations for the first iteration of EM starting from
Equation (candy-64-equation).
3. What happens if we start with all the parameters set to the same
value $p$? (*Hint*: you may find it helpful to
investigate this empirically before deriving the general result.)
4. Write out an expression for the log likelihood of the tabulated
candy data on page candy-counts-page in terms of the parameters,
calculate the partial derivatives with respect to each parameter,
and investigate the nature of the fixed point reached in part (c).