13.3. Bayes Plugin Classifier#

In this next section, we will explain some of the intuition of the Naive Bayes classifer [1], which might require a bit of a probability and statistics background.

The core idea of the Naive Bayes classifier is, if our data have features which are zeros and ones, we can use the class-conditional probabilities to determine whether a point is from class \(1\) or class \(2\). The idea is as follows. First, we assume that all of the individual features of our data (the features, in our case, are the edges of the network that are in the signal subnetwork) are independent, then the likelihood of observing a particular sequence of edges in the signal subnetwork of the \(m^{th}\) network if that network is in class \(y\) is as follows. First, we will just write down some simpler notation for the quantity we want:

\[\begin{align*} Pr(\text{observing }A^{(m)}\text{ given we assume $m$ is in class $y$} \text{ where the signal subnetwork is $\mathcal S$}) = Pr(A^{(m)} | y_m = y; \mathcal S) \end{align*}\]

Nothing happened in this step just yet; we just made a smaller notation for the quantity on the left that we will use later on. In this next step, we use the fact that, if edges existing and not existing are independent, then the probability of observing a sequence of edges is the product of the probabilities of observing each individual edge. This is called an independent edge assumption. We proceed as follows:

\[\begin{align*} Pr(A^{(m)}| y_m = y; \mathcal S) &= \prod_{(i,j) \in \mathcal S} Pr(a_{ij}^{(m)} | y_m = y; \mathcal S) \end{align*}\]

Note that we are taking the product of each pair of edges, \((i,j)\), which are in the signal sub-network. Next, if we remember back to the Independent-Edge Random Network (IER), we assumed that if \(\mathbf A^{(m)}\) was an \(IER_n(P^y)\) network, that every edge \(\mathbf a_{ij}^{(m)}\) had a probability of \(p^y_{ij}\) of taking a value of \(1\) (the edge exists), and a probability of \(1 - p_{ij}^y\) of taking a value of \(0\) (the edge does not exist). What this means is that \(\mathbf a_{ij}^{(m)}\) is something called a Bernoulli distributed random variable. The probability of seeing a particular realization \(a_{ij}^{(m)}\) of a Bernoulli distributed random variable is relatively straightforward to see. We want the quantity \(Pr(A^{(m)} | y_m = y; \mathcal S)\) to reflect the following two facts we’ve already discussed:

  1. If \(a_{ij}^{(m)}\) is \(1\), then \(Pr(A^{(m)} | \mathbf y_m = y; \mathcal S)\) is \(p_{ij}^y\).

  2. If \(a_{ij}^{(m)}\) is \(0\), then \(Pr(A^{(m)} | \mathbf y_m = y; \mathcal S)\) is \(1 - p_{ij}^y\). Is there a succinct expression that we can express this with? Yes! Try the following equation:

\[\begin{align*} Pr(A^{(m)} | \mathbf y_m = y; \mathcal S) &= (p_{ij}^y)^{a_{ij}^{(m)}} (1 - p_{ij}^y)^{1 - a_{ij}^{(m)}} \end{align*}\]

Notice that if \(a_{ij}^{(m)}\) is \(1\), then \(1 - a_{ij}^{(m)}\) is \(0\). Therefore, \((1 - p_{ij}^y)^{1 - a_{ij}^{(m)}}\) is just \(1\), since any number raised to the \(0\) power is \(1\). On the other hand, \((p_{ij}^y)^{a_{ij}^{(m)}}\) is \(p_{ij}^y\), since any number raised to the \(1\) power is itself. Therefore, this expression fits the bill for us. So, we can simplify our expression as:

\[\begin{align*} Pr(A^{(m)} | y_m = y; \mathcal S) &= \prod_{(i,j) \in \mathcal S} (p_{ij}^y)^{a_{ij}^{(m)}} (1 - p_{ij}^y)^{1 - a_{ij}^{(m)}} \end{align*}\]

And we’re almost there! The next thing we’re going to do is a little tricky, but fortunately we can turn to Bayes’ Theorem for help. What we want to do is compute the probability of observing both \(A^{(m)}\) and \(y_m\) having the value \(y\), not the probability of observing \(A^{(m)}\) given that we assume that \(y_m\) is \(y\). In statistical notation, what this amounts to is:

\[\begin{align*} Pr(\text{observing }A^{(m)}\text{ and }\text{$m$ is in class $y$} \text{ where the signal subnetwork is $\mathcal S$}) = Pr(A^{(m)}, y_m = y; \mathcal S) \end{align*}\]

This expression is a little confusing, but its interpretation is relatively straightforward: it is the probability that we see both things happening at the same time (the random network \(\mathbf A^{(m)}\) takes the value \(A^{(m)}\) and the random class \(\mathbf y_m\) has the value \(y\)) rather than just assume that the random class \(\mathbf y_m\) already is \(y\). We can compute this quantity by remembering Bayes’ Theorem in [2]:

\[\begin{align*} Pr(A^{(m)} | \mathbb y_m = y; \mathcal S) &= \frac{Pr(A^{(m)}, \mathbb y_m = y; \mathcal S)}{Pr(\mathbb y_m = y)} \end{align*}\]

Note that this implies the following:

\[\begin{align*} Pr(A^{(m)}, \mathbb y_m = y; \mathcal S) &= Pr(A^{(m)} | \mathbb y_m = y; \mathcal S)Pr(\mathbb y_m = y) \end{align*}\]

Plugging in the value we obtained above:

\[\begin{align*} Pr(A^{(m)}, \mathbb y_m = y; \mathcal S) &= Pr(\mathbb y_m = y)\prod_{(i, j) \in \mathcal S} (p_{ij}^y)^{a_{ij}^{(m)}} (1 - p_{ij}^y)^{1 - a_{ij}^{(m)}} \end{align*}\]

Remember that the parameter of the signal subnetwork model, \(\pi_y\), represented the probability of our \(Y\)-sided die landing on class \(y\), and therefore was the probability that the random class \(\mathbf y_m\) took the value \(y\). Therefore, \(Pr(\mathbb y_m = y) = \pi_y\), since these two quantites represent the same thing! So:

\[\begin{align*} Pr(A^{(m)}, \mathbb y_m = y; \mathcal S) &= \pi_y \prod_{(i, j) \in \mathcal S} (p_{ij}^y)^{a_{ij}^{(m)}} (1 - p_{ij}^y)^{1 - a_{ij}^{(m)}} \end{align*}\]

Finally, we introduce a new quantity. This quantity is very similar to the above quantity we came across, but with some terms reversed:

\[\begin{align*} Pr(\text{$m$ is in class $y$}\text{ given we observe }A^{(m)} \text{ where the signal subnetwork is $\mathcal S$}) = Pr(\mathbb y_m = y | A^{(m)}; \mathcal S) \end{align*}\]

Remember that we saw that using Bayes Theorem, we can just rewrite this expression as:

\[\begin{align*} Pr(\mathbb y_m = y | A^{(m)}; \mathcal S) &= \frac{Pr(A^{(m)}, \mathbb y_m = y; \mathcal S)}{Pr(A^{(m)}; \mathcal S)} \end{align*}\]

But using the expression we just obtained for \(Pr(A^{(m)}, \mathbb y_m = y; \mathcal S)\), we can write this down as:

\[\begin{align*} Pr(\mathbb y_m = y | A^{(m)}; \mathcal S) &= \frac{\pi_y \prod_{(i, j) \in \mathcal S} (p_{ij}^y)^{a_{ij}^{(m)}} (1 - p_{ij}^y)^{1 - a_{ij}^{(m)}}}{Pr(A^{(m)}; \mathcal S)} \end{align*}\]

So, what this quantity tells us is the probability that item \(m\) is in class \(y\), given that we observe that item \(m\) has the network \(A^{(m)}\) with signal subnetwork \(\mathcal S\)! When we have a new piece of data, this is an excellent quantity to compute! The reason for this is that, if we see a new network and want to assign it to a class, we want to choose the class that we think is most reasonable, or most probable. Therefore, given a new network to classify, it is reasonable to just estimate the class of this new network to be the class which is most probable, which is:

\[\begin{align*} \hat y_m &= \text{argmax}_{y \in \left\{1, ..., Y\right\}}Pr(\mathbb y_m = y | A^{(m)}; \mathcal S) \\ &= \text{argmax}_{y \in \left\{1,..., Y\right\}}\frac{\pi_y \prod_{(i, j) \in \mathcal S} (p_{ij}^y)^{a_{ij}^{(m)}} (1 - p_{ij}^y)^{1 - a_{ij}^{(m)}}}{Pr(A^{(m)}; \mathcal S)} \end{align*}\]

What this states in words is, we check all possible values that \(y\) could take (1, 2, 3, … all the way to \(Y\)), and compute the probability that the class is \(y\) given the network we observed. Then, we just choose which of the values was most plausible, and return it for our prediction \(\hat y_m\). But wait! We can simplify this expression even further. Note that the quantity \(Pr(A^{(m)}; \mathcal S)\) has no dependence on the particular class we are checking for a given value of \(y\). This means that the denominator will be the same for all of the possible values of \(y\), and therefore won’t impact which of the \(y\)s is the actual maximum. Therefore, we can just drop this term entirely, giving us the quantity that will become the objective function for our classification task:

\[\begin{align*} \hat y_m &= \text{argmax}_{y \in \left\{1,..., Y\right\}}\pi_y \prod_{(i, j) \in \mathcal S} (p_{ij}^y)^{a_{ij}^{(m)}} (1 - p_{ij}^y)^{1 - a_{ij}^{(m)}} \end{align*}\]

Now what you might be wondering is, we don’t know \(\mathcal S\), the vector \(\vec \pi\), nor the probability matrices \(P_1,..., P_Y\). This is why this is called a “Bayes Plugin” classifier. What we do is we estimate the signal subnetwork \(\mathcal S\) using the approach we outlined above to produce \(\hat{\mathcal S}\), and then we use the class vector \(\vec y\) that we observed to produce estimates of \(\vec \pi\), where:

\[\begin{align*} \hat{\pi_y} &= \frac{M_y}{M} \end{align*}\]

where \(M\) is the total number of networks, and \(M_y\) is the number of networks where \(y_m = y\). Finally, we compute the estimated probability entries, using:

\[\begin{align*} \hat p_{ij}^y &= \frac{1}{M_y} \sum_{m : y_m = y} a_{ij}^{(m)} \end{align*}\]

What this means is that, for each edge \((i,j)\) which is in the estimated signal subnetwork \(\hat{\mathcal S}\), we look at the \(m\)s where the class \(y_m\) is \(y\), and then we just compute the fraction of the edges which exist across all of the \(M_y\) networks where \(y_m = y\). Finally, we estimate the class for our new network \(A^{(m)}\) by just “plugging in” these values to the objective function:

\[\begin{align*} \hat y_m &= \text{argmax}_{y \in \left\{1,..., Y\right\}}\hat \pi_y \prod_{(i, j) \in \hat{\mathcal S}} (\hat p_{ij}^y)^{a_{ij}^{(m)}} (1 - \hat p_{ij}^y)^{1 - a_{ij}^{(m)}} \end{align*}\]

which is the objective function for the Bayes Plugin classifier.

13.3.1. References#

1

Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, NY, USA, 2009. ISBN 978-0-38784884-6. URL: https://books.google.com/books/about/The_Elements_of_Statistical_Learning.html?id=eBSgoAEACAAJ&source=kp_book_description.

2

James Joyce. Bayes' Theorem. June 2003. [Online; accessed 4. Feb. 2023]. URL: https://plato.stanford.edu/entries/bayes-theorem.