
Naive Bayes

January 16, 2022 Classification Mathematics Statistics

Let's focus on the dataset in Table 1: N = 9 observations, each with a class label Y \in \{0,1,2\} (three observations per class) and binary features X_1, X_2, \ldots

Table 1

The subscript in X_d indexes the feature/dimension. The superscript in X^i indexes the observation (here the i^{th} observation).

First, we make some basic assumptions about how the data are generated.

Distributions

Each Y^i has a Categorical distribution:

Y^i | \pi \sim Cat(\pi)

(1)

The following is equivalent:

p(Y^i | \pi) = Cat(y^i|\pi) = \prod_{k= 0}^{K} \pi_{k}^{I(y^i=k)}

(2)

Here \pi =(\pi_{0},\pi_{1},\ldots,\pi_{K}) and \pi_c is the probability of the random variable Y^i taking on the value c . This means that each Y^i comes from a distribution whose possible outcomes are the set \{ 0,1,2,\ldots,K \} (in the table K=2 ). So:

p(Y^i = c | \pi) = \prod_{k= 0}^{K} \pi_{k}^{I(c=k)} = \pi_{c}

(3)

Since this is a probability distribution, we have the constraint \sum_{k=0}^{K} \pi_{k} = 1 . Given a value for Y , each X_d^i has a Bernoulli distribution (takes on values in \{0,1\}):

X_d^i | Y=c,\phi_{dc} \sim Ber(\phi_{dc})

(4)

The following is equivalent:

p(x_d^i | Y=c,\phi_{dc}) = Ber(x_d^i | \phi_{dc}) = \phi_{dc}^{x_d^i} (1-\phi_{dc})^{(1-x_d^i)}

(5)

The distribution parameters used above are sometimes summarised into a single parameter:

\theta = ( \pi , \phi )

(6)
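
As a quick illustration of these modelling assumptions, here is a minimal sketch in Python (using numpy) that samples a single observation from the model and evaluates the pmfs in Equations (3) and (5). The parameter values and variable names are illustrative only.

import numpy as np

rng = np.random.default_rng(0)

K = 2          # classes are {0, 1, ..., K}, so K = 2 gives three classes
D = 2          # number of binary features/dimensions
pi = np.array([1/3, 1/3, 1/3])         # Categorical parameters, must sum to 1
phi = rng.uniform(size=(D, K + 1))     # phi[d, k] = P(X_d = 1 | Y = k), illustrative values

# Sample one observation (y_i, x_i) from the generative model
y_i = rng.choice(K + 1, p=pi)                            # Y^i ~ Cat(pi), Equation (1)
x_i = (rng.uniform(size=D) < phi[:, y_i]).astype(int)    # X_d^i | Y^i ~ Ber(phi[d, y_i]), Equation (4)

# Evaluate the pmfs in Equations (3) and (5)
p_y = pi[y_i]                                                   # P(Y^i = y_i | pi)
p_x_given_y = phi[:, y_i]**x_i * (1 - phi[:, y_i])**(1 - x_i)   # one Bernoulli pmf per dimension d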

Likelihood

The probability of seeing a single data point S^i = (x^i, y^i) from the dataset \mathcal{S} = \{ S^i \}_{i=1}^{N}, given the parameters \theta , can be written as follows:

P(S^i | \theta ) = P(x^i,y^i | \theta ) = P(x^{i} | y^{i}, \theta ) P(y^{i} | \theta )

(7)

Note that P(x^i,y^i | \theta ) is the same as the strict notation P(X^i=x^i,Y^i=y^i | \theta ) . The last equality above follows from the definition of conditional probability. Since x^i depends only on \phi and y^i , and y^i depends only on \pi :

P(S^i | \theta ) = P(x^{i} | y^{i}, \phi ) P(y^{i} | \pi )

(8)

The “naive” assumption that the dimensions are independent allows us to write the above as follows:

P(S^i | \theta ) = \prod_{d=1}^{D} P(x_d^{i} | y^{i}, \phi ) P(y^{i} | \pi )

(9)

In Equation (4), the distribution of X_d^i is only defined conditional on a particular class Y^i = k . So we rewrite each factor as a product over all classes, using indicator exponents to pick out the observed class:

P(S^i | \theta ) = \prod_{d=1}^{D} \prod_{k=0}^{K} P(x_d^{i} | y^{i} = k, \phi )^{I(y^{i}=k)} \prod_{k=0}^{K} P(y^{i} = k | \pi )^{I(y^{i}=k)}

(10)

Replacing the distributions defined earlier:

P(S^i | \theta ) = \prod_{d=1}^{D} \prod_{k=0}^{K} \big[ \phi_{dk}^{x_d^i} (1-\phi_{dk})^{1-x_d^i} \big]^{I(y^{i}=k)} \prod_{k=0}^{K} \pi_{k}^{I(y^{i}=k)}

(11)

Equation (11) is the likelihood of a single observation. The likelihood for the entirety of the data is as follows:

P(S | \theta ) = \prod_{i=1}^{N} P(S^i | \theta ) = \prod_{d=1}^{D} \prod_{k=0}^{K} \prod_{i:y^{i}=k} \phi_{dk}^{x_d^i} (1-\phi_{dk})^{1-x_d^i} \; \prod_{k=0}^{K} \prod_{i:y^{i}=k} \pi_{k}^{I(y^{i}=k)}

(12)

Notice how the product over all observations has been split into a product over the observations belonging to each particular class.

MLE

To find the maximum likelihood estimate we maximise Equation (12) with respect to \pi and \phi , subject to the constraint \sum_{k=0}^{K} \pi_{k} = 1 from Equation (3). If we try to maximise Equation (12) directly, we will have a hard time with the derivatives. However, applying the log function to Equation (12) greatly simplifies the calculus:

\widehat{ \theta }_{MLE} = \underset{\theta}{\arg\max} \; P(S | \theta ) = \underset{\theta}{\arg\max} \; \log P(S | \theta )

(13)

We go ahead and take the log:

\log P(S | \theta ) = \sum_{d=1}^{D} \sum_{k=0}^{K} \sum_{i:y^{i}=k} \log \big[ \phi_{dk}^{x_d^i} (1-\phi_{dk})^{1-x_d^i} \big] + \sum_{k=0}^{K} \sum_{i:y^{i}=k} \log \big[ \pi_{k}^{I(y^{i}=k)} \big] = \sum_{d=1}^{D} \sum_{k=0}^{K} \sum_{i:y^{i}=k} \big[ x_d^i \log \phi_{dk} + (1-x_{d}^{i}) \log (1-\phi_{dk}) \big] + \sum_{k=0}^{K} \sum_{i:y^{i}=k} I(y^{i}=k) \log \pi_{k}

(14)

Noting that \sum_{i:y^{i}=k} x_d^i = N_{dk} is the total number of 1s in dimension d for class y=k (equivalently, the sum of column d over the observations of class k ), and that \sum_{i:y^{i}=k} I(y^{i}=k) = n_k is the number of observations in class y=k , we have:

\log P(S | \theta ) = \sum_{d=1}^{D} \sum_{k=0}^{K} \big[ N_{dk} \log \phi_{dk} + (n_k - N_{dk}) \log (1-\phi_{dk}) \big] + \sum_{k=0}^{K} n_k \log \pi_{k}

(15)
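
As a sketch of how Equation (15) could be evaluated in practice, the snippet below (Python with numpy, illustrative function and variable names) builds the count statistics N_{dk} and n_k from a binary data matrix X and label vector y and returns the log-likelihood; it assumes every \phi_{dk} lies strictly between 0 and 1.

import numpy as np

def log_likelihood(X, y, pi, phi, K):
    """Equation (15): sum_dk [N_dk log phi_dk + (n_k - N_dk) log(1 - phi_dk)] + sum_k n_k log pi_k."""
    N, D = X.shape
    n = np.array([np.sum(y == k) for k in range(K + 1)])             # n_k: observations in class k
    Ndk = np.array([[X[y == k, d].sum() for k in range(K + 1)]       # N_dk: number of 1s in
                    for d in range(D)])                              # dimension d within class k
    feature_term = np.sum(Ndk * np.log(phi) + (n - Ndk) * np.log(1 - phi))
    class_term = np.sum(n * np.log(pi))
    return feature_term + class_term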

We incorporate the constraint on \pi via a Lagrange multiplier:

ll = \log P(S | \theta ) + \lambda \big( 1- \sum_{k=0}^{K} \pi_{k} \big)

(16)

Taking the derivative of Equation (16) with respect to \lambda recovers the constraint:

\frac{\partial ll}{\partial \lambda}  = 1- \sum_{k=0}^{K} \pi_{k} = 0  \Rightarrow \sum_{k=0}^{K} \pi_{k} = 1

(17)

Next we take the derivative with respect to \pi_k :

\frac{\partial ll}{\partial \pi_k}  =  \frac{n_k}{\pi_k} - \lambda = 0  \Rightarrow n_k = \lambda \pi_k

(18)

Summing Equation (18):

\sum_{k=0}^{K} n_k = \lambda \sum_{k=0}^{K} \pi_k \Rightarrow N = \lambda

(19)

Putting Equation (19) into Equation (18) we obtain a solution for \widehat{\pi}_{MLE} with the constraint:

\widehat{\pi}_{k} = \frac{n_k}{N}

(20)

Similarly, we obtain \widehat{\phi}_{MLE} :

\frac{\partial ll}{\partial \phi_{dk}} = \sum_{i:y^i=k} \big[ \frac{x_d^i}{\phi_{dk}} - \frac{1-x_d^i}{1-\phi_{dk}} \big] = \frac{N_{dk}}{\phi_{dk}} - \frac{n_k - N_{dk}}{1-\phi_{dk}} = 0

(21)

Solving for \phi_{dk} :

\widehat{\phi}_{dk} = \frac{N_{dk}}{n_{k}}

(22)

We have the following MLE solutions:

\widehat{ \theta }_{MLE} = (\widehat{\pi}_{MLE},\widehat{\phi}_{MLE}) = \left( \left\{\frac{n_k}{N}\right\},\left\{ \frac{N_{dk}}{n_k} \right\} \right)

(23)

where

n_k = \sum_{i=1}^{N} I(y^i = k), \qquad N_{dk}=\sum_{i:y^i=k} x_d^i =\sum_{i=1}^{N} I(x_d^i = 1,\, y^i = k)


(24)
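
The MLE solutions in Equations (23) and (24) amount to simple counting. A minimal sketch, assuming the same (N, D) binary matrix X and label vector y as above:

import numpy as np

def fit_mle(X, y, K):
    """MLE estimates of Equations (23)-(24): pi_hat[k] = n_k / N, phi_hat[d, k] = N_dk / n_k."""
    N, D = X.shape
    n = np.array([np.sum(y == k) for k in range(K + 1)])             # n_k
    Ndk = np.array([[X[y == k, d].sum() for k in range(K + 1)]
                    for d in range(D)])                              # N_dk
    return n / N, Ndk / n                                            # (pi_hat, phi_hat)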

Classifying with Naive Bayes

Using the above, given a new data point X=(1,0) , the probability that Y=k is then:

P(Y=k | X=x,\theta) = \frac{P(Y=k,X=x|\theta)}{P(X=x|\theta)} \propto \prod_{d=1}^{D} \widehat{\phi}_{dk}^{x_d} (1-\widehat{\phi}_{dk})^{(1-x_d)} \, \widehat{\pi}_k


(25)

which is just Equation (11) with the MLE solutions for a single observation and for class k only. We can then find the class which has the maximum probability (notice that we don’t have to actually compute the denominator in Equation (25) since it’s the same for each k ):

\widehat{y} = \underset{k}{\arg\max} \; P(Y=k | X=x,\theta ) = \underset{k}{\arg\max} \; \prod_{d=1}^{D} \widehat{\phi}_{dk}^{x_d} (1-\widehat{\phi}_{dk})^{(1-x_d)} \, \widehat{\pi}_k

(26)
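
Equation (26) can then be implemented as a small prediction routine; the sketch below (illustrative names, numpy broadcasting over classes) returns the arg max together with the unnormalised scores.

import numpy as np

def predict_mle(x, pi_hat, phi_hat):
    """Equation (26): argmax_k of prod_d phi_dk^x_d (1 - phi_dk)^(1 - x_d) * pi_k for one binary x."""
    x = np.asarray(x)[:, None]                                       # shape (D, 1) for broadcasting over k
    scores = np.prod(phi_hat**x * (1 - phi_hat)**(1 - x), axis=0) * pi_hat
    return int(np.argmax(scores)), scores                            # scores are unnormalised probabilities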

Let's test it out on Table 1, using only the first three columns (Y, X_1, X_2). We first calculate the MLE solutions using Equations (23) and (24):

N = 9 \; \text{(total observations)}
n_0=n_1=n_2=3
N_{00}=2,\; N_{01}=0,\; N_{02}=1
N_{10}=1,\; N_{11}=2,\; N_{12}=1
\widehat{\pi}_0 = \frac{n_0}{N} = \frac{3}{9} = \frac{1}{3} = \widehat{\pi}_1 = \widehat{\pi}_2
\widehat{\phi}_{00} = \frac{N_{00}}{n_0} = \frac{2}{3}, \quad \widehat{\phi}_{01} = \frac{N_{01}}{n_1} = \frac{0}{3}, \quad \widehat{\phi}_{02} = \frac{N_{02}}{n_2} = \frac{1}{3}
\widehat{\phi}_{10} = \frac{N_{10}}{n_0} = \frac{1}{3}, \quad \widehat{\phi}_{11} = \frac{N_{11}}{n_1} = \frac{2}{3}, \quad \widehat{\phi}_{12} = \frac{N_{12}}{n_2} = \frac{1}{3}





(27)

Plugging these into Equation (26) we find \widehat{y} = 0 since:

P(Y=0 | X=(1,0),\widehat{\theta} ) \propto \widehat{\phi}_{00} (1-\widehat{\phi}_{10}) \widehat{\pi}_0 = \tfrac{2}{3} \cdot \tfrac{2}{3} \cdot \tfrac{1}{3} = \tfrac{4}{27} \approx 0.15
P(Y=1 | X=(1,0),\widehat{\theta} ) \propto \widehat{\phi}_{01} (1-\widehat{\phi}_{11}) \widehat{\pi}_1 = 0
P(Y=2 | X=(1,0),\widehat{\theta} ) \propto \widehat{\phi}_{02} (1-\widehat{\phi}_{12}) \widehat{\pi}_2 = \tfrac{1}{3} \cdot \tfrac{2}{3} \cdot \tfrac{1}{3} = \tfrac{2}{27} \approx 0.07

(28)

This makes sense: according to the data, a point with X=(1,0) is most likely to belong to class Y=0 . There is still some chance it belongs to class Y=2 , because X_1 = 1 has occurred and X_2 = 0 has occurred in that class, albeit not simultaneously. However, X_1 = 1 has never occurred for class Y=1 , so that class receives the lowest score. An important note about Equation (28) is that these values are unnormalised and need not sum to 1: the denominator in Equation (25) has been omitted since it is the same for every class.
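
As a check of Equations (27) and (28), the counts quoted above (the full table is not reproduced here) can be plugged straight into the decision rule:

import numpy as np

n = np.array([3, 3, 3])                       # n_0, n_1, n_2
Ndk = np.array([[2, 0, 1],                    # N_00, N_01, N_02
                [1, 2, 1]])                   # N_10, N_11, N_12
pi_hat = n / n.sum()                          # each equal to 1/3
phi_hat = Ndk / n                             # row d, column k

x = np.array([1, 0])[:, None]
scores = np.prod(phi_hat**x * (1 - phi_hat)**(1 - x), axis=0) * pi_hat
print(scores)                                 # approximately [0.148, 0.0, 0.074] -> class 0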

Bayesian Naive Bayes

The problem with the example above is that Y=1 is given zero probability, and this problem only gets worse in higher dimensions. In Equation (27), \widehat{\phi}_{01} takes the value 0 because N_{01} = 0 : there is no case in the data where X_1=1 within class Y=1 . In higher dimensions it becomes common for a few features of a new observation never to have co-occurred with a given class in the training data, yet we would still want P(Y=1 | X=(1,0),\theta ) to be non-zero. To achieve this we approach Naive Bayes from a Bayesian point of view, placing prior distributions on \phi and \pi .

Let's look at the general form of Equation (25) to see how we would classify Y in a Bayesian setting, given a new observation X :

P(Y=k|X=x,\mathcal{S})=\frac{P(Y=k,X=x,\mathcal{S})}{P(x,\mathcal{S})}=\frac{P(x|Y=k,\mathcal{S})P(Y=k|\mathcal{S})}{P(x,\mathcal{S})}=\frac{\prod_{d=1}^{D} P(x_d|Y=k,\mathcal{S})P(Y=k|\mathcal{S})}{P(x,\mathcal{S})}

(29)

Note that the last equality in Equation (29) makes the assumption that the random variables of different dimensions are independent of each other. We know from our previous assumption (Equation (4)) that X_d^i | Y=c,\phi_{dc} \sim Ber(\phi_{dc}) . So we make these Bernoulli parameters ( \phi_{dc} ) explicit in the first term in the numerator of Equation (29) via the law of total probability:

P(x_d|Y=k,\mathcal{S}) = \int_{\phi_{dk}} P(x_d,\phi_{dk}|Y=k,\mathcal{S})d\phi_{dk}

(30)

Applying the definition of conditional probability:

P(x_d|Y=k,\mathcal{S}) = \int_{\phi_{dk}} P(x_d|Y=k,\mathcal{S},\phi_{dk})P(\phi_{dk}|\mathcal{S})d\phi_{dk}=\int_{\phi_{dk}} Ber(x_d | Y=k,\phi_{dk})P(\phi_{dk}|\mathcal{S})d\phi_{dk}

(31)

Notice that in the last equality of Equation (31) we dropped the dependence of the first term on the data ( \mathcal{S} ), since conditional on \phi_{dk} the new observation is independent of the previous data. Similarly, the second term in the numerator of Equation (29) can be written:

P(Y=k|\mathcal{S}) = \int_{\pi} P(Y=k|\pi)P(\pi|\mathcal{S})d\pi=\int_{\pi} Cat(Y=k|\pi)P(\pi|\mathcal{S})d\pi

(32)

In Equation (32) we skipped the step of explicitly applying the theorem of total probability as we did in Equation (31). Putting equations (31) and (32) into Equation (29):

P(Y=k|x,\mathcal{S}) \propto \left[\prod_{d=1}^{D} \int_{\phi_{dk}}Ber(x_d | Y=k,\phi_{dk})P(\phi_{dk} |\mathcal{S})d\phi_{dk} \right] \left[\int_{\pi}Cat(Y=k|\pi)P(\pi|\mathcal{S})d\pi \right]

(33)

We removed the denominator in Equation (29), hence Equation (33) is a proportionality. Equation (33) contains two terms that are posterior distributions over the parameters:

P(\theta|\mathcal{S})=\frac{P(\mathcal{S}|\theta) P(\theta)}{P(\mathcal{S})}

(34)

where \theta is as defined in Equation (6). The first term in the numerator of Equation (34) is the likelihood for \theta , while the second term is the prior. The likelihood is given by Equation (12), which we derived during the MLE calculation. To make the multiplication of the likelihood and the prior easier, we choose a prior distribution for \theta with the same functional form as the likelihood (a conjugate prior). Here we make another assumption: the prior parameters are independent of each other, so that:

P(\theta) = \left[ \prod_{d=1}^{D} \prod_{k=0}^{K} P(\phi_{dk}) \right] P(\pi)

(35)

If we let

\phi_{dk} \sim Beta(\beta_0,\beta_1)

(36)

so that

P(\phi_{dk}|\beta_0,\beta_1) = Beta(\phi_{dk}|\beta_0,\beta_1) = \frac{1}{B(\beta_0,\beta_1)} \phi_{dk}^{\beta_0-1}(1-\phi_{dk})^{\beta_1-1}

(37)

and

\pi \sim Dir(\alpha)

(38)

so that

P(\pi|\alpha) = Dir(\pi|\alpha)=\frac{1}{B(\alpha)}\prod_{k=0}^{K} \pi_{k}^{\alpha_k-1}I(\pi \in [0,1]^K)

(39)

Equation (35) becomes

P(\theta) = \left[ \prod_{d=1}^{D} \prod_{k=0}^{K} \frac{1}{B(\beta_0,\beta_1)} \phi_{dk}^{\beta_0-1}(1-\phi_{dk})^{\beta_1-1} \right] \left[ \frac{1}{B(\alpha)}\prod_{k=0}^{K} \pi_{k}^{\alpha_k-1} \right]

(40)

such that \pi \in [0,1]^K  . This constraint has been moved out of the equation. It just means \pi_k is between 0 and 1 for any k .

Equation (40) is in the same form as the likelihood. Putting equations (12) and (40) into (34) to get the posterior:

P(\theta|\mathcal{S}) = a \times \left[ \prod_{d=1}^{D} \prod_{k=0}^{K} \prod_{i:y^{i}=k} \phi_{dk}^{x_d^i} (1-\phi_{dk})^{1-x_d^i} \right] \left[ \prod_{k=0}^{K} \prod_{i:y^{i}=k} \pi_{k}^{I(y^{i}=k)} \right] \times \left[ \prod_{d=1}^{D} \prod_{k=0}^{K} \frac{1}{B(\beta_0,\beta_1)} \phi_{dk}^{\beta_0-1}(1-\phi_{dk})^{\beta_1-1} \right] \left[ \frac{1}{B(\alpha)}\prod_{k=0}^{K} \pi_{k}^{\alpha_k-1} \right]


(41)

where the denominator in Equation (34) has been absorbed into the constant a . We now rearrange and collect terms, collapsing the products over i into sums in the exponents (i.e. \prod_{i=1}^{N} z^{a_i} = z^{\sum_{i=1}^{N} a_i} ):

P(\theta|\mathcal{S}) = a \times \left[ \prod_{d=1}^{D} \prod_{k=0}^{K} \phi_{dk}^{\sum_{i:y^i=k} x_d^i} (1-\phi_{dk})^{\sum_{i:y^i=k} (1-x_d^i)} \right] \left[ \prod_{k=0}^{K}\pi_k^{\sum_{i:y^i=k}I(y^i=k)} \right] \times \left[ \prod_{d=1}^{D} \prod_{k=0}^{K} \frac{1}{B(\beta_0,\beta_1)} \phi_{dk}^{\beta_0-1}(1-\phi_{dk})^{\beta_1-1} \right] \left[ \frac{1}{B(\alpha)}\prod_{k=0}^{K} \pi_{k}^{\alpha_k-1} \right]

= a \times \left[ \prod_{d=1}^{D} \prod_{k=0}^{K} \phi_{dk}^{N_{dk}} (1-\phi_{dk})^{n_k-N_{dk}} \right] \left[ \prod_{k=0}^{K}\pi_k^{n_k} \right] \times \left[ \prod_{d=1}^{D} \prod_{k=0}^{K} \frac{1}{B(\beta_0,\beta_1)} \phi_{dk}^{\beta_0-1}(1-\phi_{dk})^{\beta_1-1} \right] \left[ \frac{1}{B(\alpha)}\prod_{k=0}^{K} \pi_{k}^{\alpha_k-1} \right]

= a \times \left[ \prod_{d=1}^{D} \prod_{k=0}^{K} \frac{1}{B(\beta_0,\beta_1)} \phi_{dk}^{N_{dk} + \beta_0-1} (1-\phi_{dk})^{n_k-N_{dk}+\beta_1-1} \right] \left[ \frac{1}{B(\alpha)} \prod_{k=0}^{K}\pi_k^{n_k + \alpha_k - 1} \right]

= \prod_{d=1}^{D} \prod_{k=0}^{K} Beta(\phi_{dk}|N_{dk}+\beta_0,\,n_k-N_{dk}+\beta_1) \times Dir(\pi|\mathbf{n}+\alpha)




(42)

Here \mathbf{n} = (n_0, n_1, \ldots, n_K) is the vector of class counts. Two observations justify the last equality in Equation (42): the posterior is a probability distribution, so it must integrate to one, and the penultimate line has exactly the functional form of a product of Beta distributions and a Dirichlet distribution. Hence these factors must indeed be Beta and Dirichlet distributions, with a and the other constants combining to ensure integration to unity. This is an informal justification, but it suffices for our purposes here.
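
Reading the posterior parameters off Equation (42) is again just counting; a minimal sketch, assuming the same X, y inputs and illustrative function names:

import numpy as np

def posterior_params(X, y, K, beta0=1.0, beta1=1.0, alpha=None):
    """Posterior parameters from Equation (42): Beta(N_dk + beta0, n_k - N_dk + beta1) and Dir(n + alpha)."""
    N, D = X.shape
    alpha = np.ones(K + 1) if alpha is None else np.asarray(alpha)
    n = np.array([np.sum(y == k) for k in range(K + 1)])
    Ndk = np.array([[X[y == k, d].sum() for k in range(K + 1)] for d in range(D)])
    return (Ndk + beta0, n - Ndk + beta1), n + alpha       # ((a, b) of each Beta posterior, Dirichlet parameters)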

Putting Equation (42) into Equation (33) gives us a way for making a classification (Bayesian Naive Bayes):

P(Y=k|x,\mathcal{S}) \propto \prod_{d=1}^{D} \int_{\phi_{dk}} Ber(x_d|Y=k,\phi_{dk})P(\phi_{dk}|\mathcal{S})d\phi_{dk} \times \int_{\pi}Cat(Y=k|\pi)P(\pi|\mathcal{S})d\pi

= \prod_{d=1}^{D} \int_{\phi_{dk}} \phi_{dk}^{x_d} (1-\phi_{dk})^{1-x_d}\, Beta(\phi_{dk}|N_{dk}+\beta_0,\,n_k-N_{dk}+\beta_1)\,d\phi_{dk} \times \int_{\pi} \pi_{k}\, Dir(\pi|\mathbf{n}+\alpha)\,d\pi

= \prod_{d:x_d=1} \int_{\phi_{dk}} \phi_{dk}\, Beta(\phi_{dk}|N_{dk}+\beta_0,\,n_k-N_{dk}+\beta_1)\, d\phi_{dk} \times \prod_{d:x_d=0} \int_{\phi_{dk}} (1-\phi_{dk})\, Beta(\phi_{dk}|N_{dk}+\beta_0,\,n_k-N_{dk}+\beta_1)\, d\phi_{dk} \times \int_{\pi} \pi_k\, Dir(\pi|\mathbf{n}+\alpha)\, d\pi





(43)

In the second equality we simply substituted the definitions of the Bernoulli and Categorical distributions (note that Cat(Y=k|\pi)=\pi_k ). In the last equality we split the product over dimensions into the terms with exponent 1 and the terms with exponent 0. Notice that each integral is the expectation of the corresponding distribution (i.e. if X \sim \mathcal{P} , then \int_x x P(x)dx = E_{\mathcal{P}}[X] ). We can then rewrite Equation (43) as:

P(Y=k|x,\mathcal{S}) \propto \prod_{d:x_d=1}E_{\phi_{dk}}[\phi_{dk}] \prod_{d:x_d=0}E_{\phi_{dk}}[1-\phi_{dk}]\; E_{\pi}[\pi_k] = \prod_{d=1}^{D} \bar{\phi}_{dk}^{x_d} (1-\bar{\phi}_{dk})^{1-x_d}\, \bar{\pi}_k


(44)

The mean of a Beta(a,b) distribution is \frac{a}{a+b} , and under a Dir(\alpha) distribution the mean of \pi_k is \frac{\alpha_k}{\sum_{j}\alpha_j} . The means in Equation (44) are then:

\bar{\phi}_{dk}=\frac{N_{dk}+\beta_0}{n_k+\beta_0+\beta_1}

\bar{\pi}_k=\frac{n_k+\alpha_k}{N+\sum_j\alpha_j}


(45)

Typically, for an uninformative prior we set \beta_0=\beta_1=1 and \alpha = (1,1,\ldots,1) ; these are just the uniform distribution and its higher-dimensional analogue. Notice that Equation (44) has exactly the same form as Equation (25), but the parameter estimates now come from a Bayesian perspective: they are the means of the corresponding posterior distributions, which incorporate the prior as well as the likelihood.
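
Putting Equations (44) and (45) together gives a Bayesian prediction routine. The sketch below computes the posterior means directly from the counts and reuses the same product rule as the MLE classifier; the default \beta_0=\beta_1=1 and \alpha=(1,\ldots,1) correspond to the uninformative priors mentioned above, and all names are illustrative.

import numpy as np

def predict_bayes(x, X, y, K, beta0=1.0, beta1=1.0, alpha=None):
    """Equations (44)-(45): plug the posterior means of phi and pi into the naive Bayes product."""
    N, D = X.shape
    alpha = np.ones(K + 1) if alpha is None else np.asarray(alpha)
    n = np.array([np.sum(y == k) for k in range(K + 1)])
    Ndk = np.array([[X[y == k, d].sum() for k in range(K + 1)] for d in range(D)])
    phi_bar = (Ndk + beta0) / (n + beta0 + beta1)            # Equation (45), shape (D, K + 1)
    pi_bar = (n + alpha) / (N + alpha.sum())                 # Equation (45)
    x = np.asarray(x)[:, None]
    scores = np.prod(phi_bar**x * (1 - phi_bar)**(1 - x), axis=0) * pi_bar
    return int(np.argmax(scores)), scores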

Classifying with Bayesian Naive Bayes

Now we classify the same observation x=(1,0) in the Bayesian setting. The parameter values in Equation (27) become:

N = 9 \; \text{(total observations)}, \quad \beta_0=\beta_1=\alpha_k=1 \; \forall \; k
n_0=n_1=n_2=3
N_{00}=2,\; N_{01}=0,\; N_{02}=1
N_{10}=1,\; N_{11}=2,\; N_{12}=1
\bar{\pi}_0 = \frac{n_0 + \alpha_0}{N + \sum_{j}\alpha_j} = \frac{3+1}{9+3} = \frac{1}{3} = \bar{\pi}_1 = \bar{\pi}_2
\bar{\phi}_{00} = \frac{N_{00}+\beta_0}{n_0+\beta_0+\beta_1} = \frac{3}{5}, \quad \bar{\phi}_{01} = \frac{1}{5}, \quad \bar{\phi}_{02} = \frac{2}{5}
\bar{\phi}_{10} = \frac{2}{5}, \quad \bar{\phi}_{11} = \frac{3}{5}, \quad \bar{\phi}_{12} = \frac{2}{5}





(46)

The (unnormalised) posterior probabilities for each class are then:

P(Y=0|X=(1,0),\mathcal{S}) \propto \bar{\phi}_{00}(1-\bar{\phi}_{10})\,\bar{\pi}_0 = \tfrac{3}{5} \cdot \tfrac{3}{5} \cdot \tfrac{1}{3} = \tfrac{3}{25}
P(Y=1|X=(1,0),\mathcal{S}) \propto \bar{\phi}_{01}(1-\bar{\phi}_{11})\,\bar{\pi}_1 = \tfrac{1}{5} \cdot \tfrac{2}{5} \cdot \tfrac{1}{3} = \tfrac{2}{75}
P(Y=2|X=(1,0),\mathcal{S}) \propto \bar{\phi}_{02}(1-\bar{\phi}_{12})\,\bar{\pi}_2 = \tfrac{2}{5} \cdot \tfrac{3}{5} \cdot \tfrac{1}{3} = \tfrac{2}{25}


(47)

Just as in Equation (28), we have 0 as the most likely class with class 2 the second most likely. However, this time we predict class 1 with a non-zero probability.
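
As a numeric check of Equations (46) and (47), using only the counts quoted from the table:

import numpy as np

n = np.array([3, 3, 3])
Ndk = np.array([[2, 0, 1],
                [1, 2, 1]])
beta0 = beta1 = 1.0
alpha = np.ones(3)

phi_bar = (Ndk + beta0) / (n + beta0 + beta1)      # [[3/5, 1/5, 2/5], [2/5, 3/5, 2/5]]
pi_bar = (n + alpha) / (n.sum() + alpha.sum())     # [1/3, 1/3, 1/3]

x = np.array([1, 0])[:, None]
scores = np.prod(phi_bar**x * (1 - phi_bar)**(1 - x), axis=0) * pi_bar
print(scores)                                      # approximately [0.12, 0.027, 0.08] -> class 0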
