Consider a single Boolean random variable $Y$ (the “classification”). Let the prior probability $P(Y=true)$ be $\pi$. Let’s try to find $\pi$, given a training set $D=(y_1,\ldots,y_N)$ with $N$ independent samples of $Y$. Furthermore, suppose $p$ of the $N$ are positive and $n$ of the $N$ are negative.

Write down an expression for the likelihood of $D$ (i.e., the probability of seeing this particular sequence of examples, given a fixed value of $\pi$) in terms of $\pi$, $p$, and $n$.

By differentiating the log likelihood $L$, find the value of $\pi$ that maximizes the likelihood.
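
Once you have a closed-form answer, a numerical check can help confirm it. The following is a minimal sketch, not part of the exercise: it evaluates the log likelihood directly from a sample sequence and searches a grid of $\pi$ values for the maximum. The function names, the grid resolution, and the example data (3 positive and 7 negative samples) are illustrative assumptions.

```python
import math

def log_likelihood(pi, samples):
    """Log-probability of a sequence of independent Boolean samples,
    given P(Y=true) = pi."""
    return sum(math.log(pi if y else 1.0 - pi) for y in samples)

def grid_argmax(samples, steps=1000):
    """Return the grid value of pi in (0, 1) with the highest log likelihood.
    The endpoints 0 and 1 are excluded to avoid log(0)."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda pi: log_likelihood(pi, samples))

# Hypothetical data set: p = 3 positive samples, n = 7 negative samples.
samples = [True] * 3 + [False] * 7
best_pi = grid_argmax(samples)
print(best_pi)
```

Compare the printed value with what your derived formula gives for the same $p$ and $n$.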

Now suppose we add in $k$ Boolean random variables $X_1, X_2,\ldots,X_k$ (the “attributes”) that describe each sample, and suppose we assume that the attributes are conditionally independent of each other given the goal $Y$. Draw the Bayes net corresponding to this assumption.

Write down the likelihood for the data including the attributes, using the following additional notation:

$\alpha_i$ is $P(X_i=true \mid Y=true)$.

$\beta_i$ is $P(X_i=true \mid Y=false)$.

$p_i^+$ is the count of samples for which $X_i=true$ and $Y=true$.

$n_i^+$ is the count of samples for which $X_i=false$ and $Y=true$.

$p_i^-$ is the count of samples for which $X_i=true$ and $Y=false$.

$n_i^-$ is the count of samples for which $X_i=false$ and $Y=false$.

[Hint: consider first the probability of seeing a single example with specified values for $X_1, X_2,\ldots,X_k$ and $Y$.]


By differentiating the log likelihood $L$, find the values of $\alpha_i$ and $\beta_i$ (in terms of the various counts) that maximize the likelihood and say in words what these values represent.

Let $k = 2$, and consider a data set containing all four possible examples of the XOR function. Compute the maximum likelihood estimates of $\pi$, $\alpha_1$, $\alpha_2$, $\beta_1$, and $\beta_2$.

Given these estimates of $\pi$, $\alpha_1$, $\alpha_2$, $\beta_1$, and $\beta_2$, what are the posterior probabilities $P(Y=true \mid x_1,x_2)$ for each example?
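
To check your hand calculations for the XOR parts, a short script along these lines may be useful. It is only a sketch: it assumes the maximum likelihood estimates are the relative frequencies in the data (which the earlier parts ask you to derive), and the helper names `fit_naive_bayes` and `posterior_true` are hypothetical, not taken from any library.

```python
from itertools import product

def fit_naive_bayes(data, k):
    """Estimate (pi, alpha, beta) by relative-frequency counts.
    data is a list of (x, y) pairs, where x is a tuple of k Booleans
    and y is the Boolean class. Assumes both classes occur in data."""
    pos = [x for x, y in data if y]
    neg = [x for x, y in data if not y]
    pi = len(pos) / len(data)
    # alpha[i] estimates P(X_i=true | Y=true); beta[i] estimates P(X_i=true | Y=false)
    alpha = [sum(x[i] for x in pos) / len(pos) for i in range(k)]
    beta = [sum(x[i] for x in neg) / len(neg) for i in range(k)]
    return pi, alpha, beta

def posterior_true(x, pi, alpha, beta):
    """P(Y=true | x) under the conditional-independence assumption."""
    joint_pos = pi          # accumulates P(Y=true) * prod_i P(x_i | Y=true)
    joint_neg = 1.0 - pi    # accumulates P(Y=false) * prod_i P(x_i | Y=false)
    for xi, a, b in zip(x, alpha, beta):
        joint_pos *= a if xi else 1.0 - a
        joint_neg *= b if xi else 1.0 - b
    return joint_pos / (joint_pos + joint_neg)

# All four XOR examples: y is true exactly when x1 and x2 differ.
xor_data = [((x1, x2), x1 != x2) for x1, x2 in product([False, True], repeat=2)]

pi, alpha, beta = fit_naive_bayes(xor_data, k=2)
posteriors = {x: posterior_true(x, pi, alpha, beta) for x, _ in xor_data}

print(pi, alpha, beta)
for x, p in posteriors.items():
    print(x, p)
```

Comparing the printed posteriors with the true XOR labels shows how well the conditional-independence assumption fits this data set.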