6.1. Community Detection#
In Section 5.1 you learned how to estimate the parameters for an SBM via Maximum Likelihood Estimation (MLE). Unfortunately, we have skipped a relatively fundamental problem with SBM parameter estimation. You will notice that everything we have covered about the SBM to date assumes that you already know the assignment of each node to one of the communities.
Consider, for instance, the school example you have worked extensively with. In the context of the SBM, it makes sense to guess that two individuals will have a higher chance of being friends if they attend the same school than if they did not go to the same school. Remember, when you knew what school each student was from and could order the students by school ahead of time, the network looked like this:
from graspologic.simulations import sbm
import numpy as np
ns = [50, 50] # network with 100 nodes
B = [[0.5, 0.2], [0.2, 0.5]] # block matrix
# sample a single simple adjacency matrix from SBM(z, B)
A = sbm(n=ns, p=B, directed=False, loops=False)
zs = [1 for i in range(0, 50)] + [2 for i in range(0, 50)]
from graphbook_code import draw_multiplot, cmaps
import matplotlib.pyplot as plt
draw_multiplot(A, labels=zs, title="$SBM_n(z, B)$ Simulation, nodes ordered by school");

The block structure is completely obvious, and it seems like you could almost just guess which nodes are from which communities by looking at the adjacency matrix (with the proper ordering). And therein lies the issue: if you did not know which school each student was from, you would have no way of actually using the technique you described in the preceding chapter to estimate parameters for your SBM. If your nodes were just randomly ordered, you might see something instead like this:
# generate a reordering of the n nodes
vtx_perm = np.random.choice(np.sum(ns), size=np.sum(ns), replace=False)
Aperm = A[vtx_perm][:, vtx_perm]
draw_multiplot(Aperm, title="$SBM_n(z, B)$ Simulation, random node order");

It is extremely unclear which nodes are in which community, because there is no immediate block structure to the network. So, what can you do?
What you will do is use a clever technique we discussed in Section 5.3, called a spectral embedding, to learn not only about each edge, but to learn about each node in relation to all of the other nodes. What do we mean by this? What we mean is that, while looking at a single edge in your network, you only have two possible choices: the edge exists or it does not exist. However, if you instead consider nodes in relation to one another, you can start to deduce patterns about how your nodes might be organized in the community sense. While seeing that two students Bill and Stephanie were friends will not tell us whether Bill and Stephanie were in the same school, if you knew that Bill and Stephanie also shared many other friends (such as Denise, Albert, and Kwan), and those friends also tended to be friends, that piece of information might tell us that they all might happen to be school mates.
The embedding technique you will tend to employ, the spectral embedding, allows you to pick up on these community sorts of tendencies. When we call a set of nodes a community in the context of a network, what we mean is that these nodes are stochastically equivalent. What this means is that this particular group of nodes will tend to have similar other nodes that they are connected to. The spectral embeddings will help us to identify these communities of nodes, and hopefully, when you review the communities of nodes you learn, you will be able to derive some sort of insight as to what, exactly, these communities are. For instance, in your school example, ideally, you might pick up on two communities of nodes, one for each school. The process of learning community assignments for nodes in a network is known as community detection. This problem has been well-established in the network science literature, such as [5], [3], [1], and [4], to name a few.
6.1.1. Stochastic Block Model with unknown communities when you know the number of communities#
When you know the number of communities (even if you don’t know which community each node is in), the procedure for fitting a Stochastic Block Model to a network is relatively straightforward. Let’s consider a similar example to the scenario we evaluated in the lead-in to this section, but with three communities instead of two. This is a Stochastic Block Model in which the within-community edge probability is 0.6 and the between-community edge probability is 0.2:
from graspologic.simulations import sbm
ns = [50, 40, 30]
B = [[0.6, 0.2, 0.2],
[0.2, 0.6, 0.2],
[0.2, 0.2, 0.6]]
np.random.seed(1234)
A = sbm(n=ns, p = B)
# the true community labels
z = [0 for i in range(0,ns[0])] + [1 for i in range(0, ns[1])] + [2 for i in range(0, ns[2])]
draw_multiplot(A, labels=z, xticklabels=10, yticklabels=10, title="A, Simulated $SBM_{100}( \\vec z, B)$");

Remember, however, that you do not actually know the community labels of each node in your network.
Your goal here, then, is to determine this hidden, or latent, community assignment vector, so that you can learn things about the SBM that you are studying.
You begin by first embedding the adjacency matrix, using the adjacency spectral embedding with n_components=3:
from graspologic.embed import AdjacencySpectralEmbed
# adjacency spectral embedding when we look for 3 communities
ase = AdjacencySpectralEmbed(n_components=3)
Xhat = ase.fit_transform(A)
from graspologic.plot import pairplot
_ = pairplot(Xhat, title="Pairplot of embedding of $A$")

You have some really prominent latent structure here, which is evident from the tightly clustered blobs you have in your pairplot. To see just how well-separated these blobs are, let’s cheat for a second and look at the true labels of each of the points:
_ = pairplot(Xhat, labels=z, title="Pairplot of embedding of $A$", legend_name="Community (Unknown)")

So from what we can see, it’s pretty clear that these extremely tight “blobs” in your dataset correspond exactly to the true community labels of the nodes in your network. So, what you need is a technique which can take a latent position matrix, not know anything about these blobs, and provide us with a way to learn which points correspond to which blob. For this, we turn to unsupervised learning; in particular, the KMeans clustering algorithm.
In this next section, we will assume that you are familiar with several concepts from unsupervised learning, including the KMeans clustering algorithm.
6.1.1.1. Estimating community assignments via $K$-means#
First, you use KMeans() from sklearn to cluster the estimated latent positions:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
np.random.seed(42)
from sklearn.cluster import KMeans
labels_kmeans = KMeans(n_clusters = 3, random_state=1234).fit_predict(Xhat)
6.1.1.2. Evaluating your clustering#
In this instance, you have your true community labels already known, because this is just a simulation. When you know your true node labels ahead of time, you can do some special evaluations that take advantage of the fact that you have the true labels. The first of these is to study the confusion matrix, which you can produce with sklearn’s confusion_matrix():
from sklearn.metrics import confusion_matrix
# compute the confusion matrix between the true labels z
# and the predicted labels labels_kmeans
cf_matrix = confusion_matrix(z, labels_kmeans)
import seaborn as sns
fig, ax = plt.subplots(1,1, figsize=(6,4))
sns.heatmap(cf_matrix, cmap=cmaps["sequential"], ax=ax)
ax.set_title("Confusion matrix")
ax.set_ylabel("True Label (School)")
ax.set_xlabel("Predicted Label");

When you have a good clustering that aligns well with your true labels, you will see that the predictions will tend to be homogeneous: a particular true label will correspond to a particular predicted label, and vice-versa. You can evaluate this homogeneity (how well, in general, one true label corresponds to one predicted label, and vice-versa) using the Adjusted Rand Index (ARI), where a score closer to 1 indicates that the predicted labels agree with the true labels up to a relabeling of the communities, and a score near 0 indicates agreement no better than chance:
from sklearn.metrics import adjusted_rand_score
ari_kmeans = adjusted_rand_score(z, labels_kmeans)
print(ari_kmeans)
1.0
6.1.1.2.1. By remapping the labels, you can evaluate the error rate#
Looking back at the confusion matrix:
fig, ax = plt.subplots(1,1, figsize=(6,4))
sns.heatmap(cf_matrix, cmap=cmaps["sequential"], ax=ax)
ax.set_title("Confusion matrix")
ax.set_ylabel("True Label")
ax.set_xlabel("Predicted Label");

It seems pretty clear that there is a one-to-one correspondence between the true labels and the predicted labels: all of the nodes with a given true label are assigned the same predicted label, and vice-versa. However, the particular number that KMeans gives to each cluster is arbitrary, so the predicted label for a community will not, in general, equal its true label. For this reason, it is useful to remap the predicted labels so that they line up with the true labels as well as possible. graspologic makes this easy for us, with the remap_labels() function:
from graspologic.utils import remap_labels
labels_kmeans_remap = remap_labels(z, labels_kmeans)
You can use these remapped labels to understand when KMeans
is, or is not, producing reasonable labels for your investigation. You begin by first looking at a pairs plot, which now will color the points by their assigned community:
_ = pairplot(Xhat,
             labels=labels_kmeans_remap,
             title=f'KMeans on embedding, ARI: {str(ari_kmeans)[:5]}',
             legend_name='Predicted label (remapped)',
             height=3.5,
             palette='muted');

The final utility of the pairs plot is that you can investigate which points, if any, the clustering technique is getting wrong. You can do this by looking at the classification error of the nodes to each community:
error = z - labels_kmeans_remap # compute which assigned labels from labels_kmeans_remap differ from the true labels z
error = error != 0 # if the difference between the community labels is non-zero, an error has occurred
er_rt = np.mean(error) # error rate is the frequency of making an error
palette = {'Correct Pred.':(0,0.7,0.2),
'Wrong Pred.':(0.8,0.1,0.1)}
error_label = np.array(len(z)*['Correct Pred.']) # initialize numpy array for each node
error_label[error] = 'Wrong Pred.' # add label 'Wrong' for each error that is made
pairplot(Xhat,
         labels=error_label,
         title='Error from KMeans, Error rate: {:.3f}'.format(er_rt),
         legend_name='Error label',
         height=3.5,
         palette=palette);

Great! Our classification has not made any errors.
To learn about the block matrix, you can use the SBMEstimator class with your predicted labels:
from graspologic.models import SBMEstimator
model = SBMEstimator(directed=False, loops=False)
model.fit(A, y=labels_kmeans_remap)
Bhat = model.block_p_
from graphbook_code import heatmap
fig, axs = plt.subplots(1, 3, figsize=(18, 6))
mtxs = [Bhat, np.array(B), np.abs(Bhat - np.array(B))]
titles = [r"Prediction $\hat B_{SBM}$", r"True $B_{SBM}$", r"$|\hat B_{SBM} - B_{SBM}|$"]
for i, ax in enumerate(axs.flat):
    heatmap(mtxs[i],
            vmin=0,
            vmax=1,
            font_scale=1.5,
            title=titles[i],
            ax=ax)

The estimate appears to be relatively close to the true block matrix.
Recap of inference for Stochastic Block Model with known number of communities
You learned that the adjacency spectral embedding is a key algorithm for making sense of networks which are realizations of SBM random networks. The estimates of latent positions produced by ASE are critical for learning community assignments.
You learned that unsupervised learning (such as $K$-means) allows us to use the estimated latent positions to learn community assignments for each node in your network.
You can use remap_labels to align predicted labels with true labels, if true labels are known. This is useful for benchmarking techniques on networks with known community labels.
You evaluate the quality of unsupervised learning by plotting the predicted node labels and (if you know the true labels) the errorfully classified nodes. The ARI and the error rate summarize how effectively your unsupervised learning techniques performed.
6.1.2. Number of communities is not known#
In real data, you almost never have the beautiful canonical modular structure that is obvious to you from a Stochastic Block Model. This means that it will extremely infrequently be the case that you know exactly how you should set the number of communities, $K$, ahead of time.
Let’s first remember back to the single network models section, when you took a Stochastic Block Model with obvious community structure, and let’s see what happens when you just move the nodes of the adjacency matrix around. You begin with a similar adjacency matrix to the one above, again sampled from an SBM with three communities:
B = [[0.5, 0.2, 0.2],
[0.2, 0.5, 0.2],
[0.2, 0.2, 0.4]]
ns = [50, 40, 30]
np.random.seed(12)
A = sbm(n=ns, p = B)
# the true community labels
z = [0 for i in range(0,ns[0])] + [1 for i in range(0, ns[1])] + [2 for i in range(0, ns[2])]
draw_multiplot(A, labels=z, xticklabels=10, yticklabels=10, title="Simulated SBM($\pi$, B)");

Next, you permute the nodes around to reorder the realized adjacency matrix:
# generate a reordering of the n nodes
vtx_perm = np.random.choice(A.shape[0], size=A.shape[0], replace=False)
A_permuted = A[vtx_perm][:, vtx_perm]
z_perm = np.array(z)[vtx_perm]
from graphbook_code import heatmap as hm_code
from graphbook_code import draw_layout_plot as lp_code
fig, axs = plt.subplots(1, 2, figsize=(12, 6))
# heatmap
hm = hm_code(
    A_permuted,
    ax=axs[0],
    cbar=False,
    color="sequential",
    xticklabels=10,
    yticklabels=10
)
# layout plot
lp_code(A_permuted, ax=axs[1], pos=None, labels=z_perm, node_color="qualitative")
plt.suptitle("Simulated $SBM(\\vec z, B)$, reordered vertices", fontsize=20, y=1.1)
fig;

You only get to see the adjacency matrix in the left panel; the panel on the right is constructed using the true labels (which you do not have!). This means that you must proceed with statistical inference about the random network underlying your realized network using only the heatmap at left. From that heatmap alone, it is not immediately obvious that this is even a realization of an SBM random network with three communities.
Your procedure is very similar to what you did previously when the number of communities was known. You again embed, but this time, since you don’t know the number of communities you are looking for, you use the “elbow picking” technique to select the dimensionality:
ase_perm = AdjacencySpectralEmbed() # adjacency spectral embedding, with optimal number of latent dimensions selected using elbow picking
Xhat_permuted = ase_perm.fit_transform(A_permuted)
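If you want to check how many latent dimensions the elbow-picking procedure actually selected (a small check we add here for illustration), you can simply inspect the shape of the resulting embedding:
# the second entry is the number of latent dimensions chosen by elbow picking
print(Xhat_permuted.shape)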
You examine the pairs plot, just like in the section on pairs plots:
_ = pairplot(Xhat_permuted, title="SBM adjacency spectral embedding")

You can still see that there is some level of latent community structure apparent in the pairs plot, as you can see some “blob” separation.
Next, you have the biggest difference with the approach you took previously. Since you do not know the optimal number of clusters ahead of time, you need a way to compare clusterings that use different numbers of clusters. A common tool for this is the silhouette score, which measures, for each point, how close it is to the other points in its own cluster relative to the points in the nearest other cluster; the higher the average silhouette score, the better the clustering. A minimal sketch of computing it directly is shown below.
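As a sketch (our illustration, using sklearn’s silhouette_score() on the estimated latent positions Xhat_permuted from above), you could compute the average silhouette score for a few candidate numbers of clusters yourself:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
for k in range(2, 6):  # a few candidate numbers of clusters
    candidate_labels = KMeans(n_clusters=k, random_state=1234).fit_predict(Xhat_permuted)
    print(k, silhouette_score(Xhat_permuted, candidate_labels))  # average silhouette score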
6.1.2.1. Using the silhouette score to deduce an appropriate number of communities#
Now that you have the silhouette score, deducing an appropriate number of communities is pretty easy. You choose a range of cluster counts that you think might be appropriate for your network. Let’s say, for instance, you think there might be as many as 10 communities. You can perform KMeans clustering for every candidate number of clusters up to 10, and compare their silhouette scores, using graspologic’s KMeansCluster() class:
from graspologic.cluster import KMeansCluster
km_clust = KMeansCluster(max_clusters = 10, random_state=1234)
km_clust = km_clust.fit(Xhat_permuted);
Next, you visualize the Silhouette Score as a function of the number of clusters:
import seaborn as sns
from pandas import DataFrame as df
nclusters = range(2, 11) # graspologic nclusters goes from 2 to max_clusters
silhouette = km_clust.silhouette_ # obtain the respective silhouette
ss_df = df({"Number of Clusters": nclusters, "Silhouette Score": silhouette}) # place into pandas dataframe
fig, ax = plt.subplots(1,1,figsize=(12, 6))
sns.lineplot(data=ss_df,ax=ax, x="Number of Clusters", y="Silhouette Score");
ax.set_title("Silhouette Analysis of KMeans Clusterings")
fig;

As you can see, the silhouette analysis indicates a best number of clusters: the one at which the silhouette score is highest. KMeansCluster selects this clustering automatically, and you can obtain its predicted labels with fit_predict():
labels_autokmeans = km_clust.fit_predict(Xhat_permuted)
As an exercise, you should go through the preceding section on evaluation and recompute all of the diagnostics we computed last time for determining whether your predictions are reasonable. You will do this by comparing the true labels of the permuted nodes, z_perm, to the predicted labels on the permuted nodes, labels_autokmeans. A brief sketch of these diagnostics is given below.
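As a starting point, here is a minimal sketch of those diagnostics (our addition; it assumes z_perm and labels_autokmeans from the code above):
from graspologic.utils import remap_labels
from sklearn.metrics import adjusted_rand_score
# align the arbitrary predicted label numbering with the true labels
labels_autokmeans_remap = remap_labels(z_perm, labels_autokmeans)
# homogeneity of the clustering, and the fraction of misclassified nodes
ari_autokmeans = adjusted_rand_score(z_perm, labels_autokmeans)
error_rate = np.mean(np.array(z_perm) != labels_autokmeans_remap)
print(ari_autokmeans, error_rate)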
6.1.3. Community detection with other types of embeddings#
There are a number of ways to take a network and produce an embedding of its nodes. We have covered ASE and LSE so far in this book, and in the conclusion of this text, you will briefly learn about a few new ones, including GNNs and node2vec(). These techniques (and many others) can often be used in place of ASE in the workflow above, producing predicted community assignments that may be similar to, or different from, the ones above, depending on the parameters chosen for the embedding technique and the topology of the network.
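For instance, here is a minimal sketch (our illustration, reusing the permuted network and the known number of communities from the simulation) of the same pipeline with the Laplacian spectral embedding swapped in for ASE:
from graspologic.embed import LaplacianSpectralEmbed
from sklearn.cluster import KMeans
# embed with LSE instead of ASE, then cluster the latent positions as before
Xhat_lse = LaplacianSpectralEmbed(n_components=3).fit_transform(A_permuted)
labels_lse = KMeans(n_clusters=3, random_state=1234).fit_predict(Xhat_lse)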
6.1.4. Community detection with other types of clustering algorithms#
Similarly, there is no restriction to performing community detection with KMeans: any unsupervised clustering technique can be applied to the estimated latent positions, as the sketch after this paragraph shows.
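For example, here is a minimal sketch (our illustration, again reusing Xhat_permuted) that uses agglomerative (hierarchical) clustering from sklearn instead of KMeans:
from sklearn.cluster import AgglomerativeClustering
# cluster the estimated latent positions with a different unsupervised technique
labels_agg = AgglomerativeClustering(n_clusters=3).fit_predict(Xhat_permuted)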
In this next section, we’ll cover a slight augmentation of the SBM, called the Degree-Corrected SBM (DCSBM), wherein nodes are still grouped into communities, but each node additionally has its own degree-correction factor. While the block matrix plays the same role overall, the individual edge probabilities become:

$$p_{ij} = \theta_i \theta_j b_{z_i z_j}$$

where $\theta_i$ is the degree-correction factor for node $i$, $z_i$ is the community of node $i$, and $b_{kl}$ is the block matrix entry for communities $k$ and $l$.

Let’s take a look at a simulated example:
from graspologic.simulations import rdpg
def ohe(T):
    K = len(np.unique(T))
    ohe_dat = np.zeros((len(T), K))
    for t in np.unique(T):
        ohe_dat[:, t] = (T == t).astype(int)
    return ohe_dat
ns = [50, 100, 60]
# strategy for sampling a DCSBM
theta = np.hstack((np.linspace(1, 1, num=ns[0]), np.linspace(1, 1.2, num=ns[1]), np.linspace(0.8, 1.5, num=ns[2])))
B = [[0.3, 0.25, 0.3], [0.25, 0.6, 0.1], [0.3, 0.1, 0.4]]
z = [0 for i in range(0, ns[0])] + [1 for i in range(0, ns[1])] + [2 for i in range(0, ns[2])]
Z = ohe(z)
U, s, Vt = np.linalg.svd(B)
Bsqrt = U @ np.diag(np.sqrt(s))
X = np.diag(theta) @ Z @ Bsqrt  # latent positions of the DCSBM
P_dcsbm = X @ X.transpose()  # probability matrix
A = rdpg(X)  # sample a single adjacency matrix
fig, axs = plt.subplots(1, 3, figsize=(15, 6))
ax=sns.scatterplot(x=X[:,0], y=X[:,1], hue=z, ax=axs[0])
ax.legend_.set_title("Community (Unknown)")
ax.set_title("True Latent Positions")
ax.set_xlabel("Latent dimension 1")
ax.set_ylabel("Latent dimension 2")
heatmap(P_dcsbm, vmin=0, vmax=1, inner_hier_labels=z, title="Probability matrix, DCSBM", ax=axs[1])
heatmap(A, vmin=0, vmax=1, inner_hier_labels=z, title="Sample of DCSBM", ax=axs[2]);

As you can see, the probability matrix is no longer perfectly modular like it would be for an SBM, and in fact, the nodes have an extremely “erratic” looking structure in the probability matrix. However, there is still some fairly clear modularity, which we would hope that our community detection approach might pick up. Let’s see what a spectral embedding looks like for our network sample:
Xhat = AdjacencySpectralEmbed().fit_transform(A)
_ = pairplot(Xhat, labels=z, title="Estimate of latent positions", legend_name="Community (Unknown)")

The nodes still look pretty well-separated within-community in the estimates of the latent positions, which is good news since this means our clustering technique should be able to pick this up, right?
The answer is: it depends. Unfortunately, KMeans tends to struggle here. It looks for compact, roughly spherical blobs, but the degree-correction factors stretch each community’s latent positions out along a direction, so the clusters are elongated rather than spherical. Let’s see how it does:
# predict with k-means
labels_kmeans_erratic = KMeans(n_clusters = 3, random_state=1234).fit_predict(Xhat)
# remap the labels
labels_kmeans_erratic_remap = remap_labels(z, labels_kmeans_erratic)
# compute error rate
error = z - labels_kmeans_erratic_remap # compute which assigned labels from labels_kmeans_erratic_remap differ from the true labels z
error = error != 0 # if the difference between the community labels is non-zero, an error has occurred
er_rt = np.mean(error) # error rate is the frequency of making an error
palette = {'Correct Pred.':(0,0.7,0.2),
'Wrong Pred.':(0.8,0.1,0.1)}
error_label = np.array(len(z)*['Correct Pred.']) # initialize numpy array for each node
error_label[error] = 'Wrong Pred.' # add label 'Wrong' for each error that is made
_ = pairplot(Xhat,
             labels=error_label,
             title='Error from KMeans, Error rate: {:.3f}'.format(er_rt),
             legend_name='Error label',
             height=3.5,
             palette=palette);

So, KMeans makes a number of errors on this network. Next, we’re going to use a slight augmentation of this approach: if we instead use sklearn’s GaussianMixture() for clustering, we get something like this:
from sklearn.mixture import GaussianMixture
# predict with gmm
labels_gmm_erratic = GaussianMixture(n_components = 3, random_state=1234).fit_predict(Xhat)
# remap the labels
labels_gmm_erratic_remap = remap_labels(z, labels_gmm_erratic)
# compute error rate
error = z - labels_gmm_erratic_remap # compute which assigned labels from labels_gmm_erratic_remap differ from the true labels z
error = error != 0 # if the difference between the community labels is non-zero, an error has occurred
er_rt = np.mean(error) # error rate is the frequency of making an error
palette = {'Correct Pred.':(0,0.7,0.2),
'Wrong Pred.':(0.8,0.1,0.1)}
error_label = np.array(len(z)*['Correct Pred.']) # initialize numpy array for each node
error_label[error] = 'Wrong Pred.' # add label 'Wrong' for each error that is made
_ = pairplot(Xhat,
             labels=error_label,
             title='Error from GMM, Error rate: {:.3f}'.format(er_rt),
             legend_name='Error label',
             height=3.5,
             palette=palette);

GMM here does a bit better, making a lot fewer errors. This is because the GMM tends to work well in situations where the clusters are elliptical rather than spherical: unlike KMeans, it estimates a separate covariance for each cluster, so it can accommodate the elongated blobs that the degree-correction factors produce. A small sketch illustrating the effect of the covariance model follows.
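As a minimal sketch (our illustration; covariance_type is a standard sklearn parameter), you can compare a spherical-covariance GMM, which behaves much like KMeans, against the default full-covariance GMM on the same embedding:
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score
for cov in ["spherical", "full"]:
    labels_cov = GaussianMixture(n_components=3, covariance_type=cov,
                                 random_state=1234).fit_predict(Xhat)
    print(cov, adjusted_rand_score(z, labels_cov))  # agreement with the true labels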
6.1.4.1. Choosing number of clusters with GMM#
When you don’t know the number of clusters to use, you can still use the silhouette score; however, with GMM in particular, there is a better approach known as the Bayesian Information Criterion (BIC). If you remember back from your probability or statistics course, the likelihood is the probability of observing the data that you see (in this case, the estimated latent positions), given a particular set of parameters (which, in the case of the GMM, are the estimated means, covariances, and weights of each of the clusters).
The BIC is simply a penalized version of the negative log likelihood, where you penalize the negative log likelihood (make it bigger) when the number of clusters is larger. The BIC is $BIC = k\log(n) - 2\log(\hat L)$, where $k$ is the number of parameters in the model (which grows with the number of clusters), $n$ is the number of samples (here, the number of nodes), and $\hat L$ is the likelihood of the estimated latent positions under the fitted model.
The BIC penalizes more complicated models (it is higher when the number of clusters is high), so in essence, the BIC balances fit against parsimony: you want enough clusters that the data have a reasonably high likelihood (a relatively low negative log likelihood), but not so many that you end up with a really complicated and large model. A minimal sketch of this selection procedure is shown below.
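As a sketch (our addition, using sklearn’s built-in bic() method rather than graspologic), you could carry out this selection by hand:
from sklearn.mixture import GaussianMixture
bics = {}
for k in range(2, 11):
    gmm = GaussianMixture(n_components=k, random_state=1234).fit(Xhat)
    bics[k] = gmm.bic(Xhat)  # BIC of this fitted model on the embedding
best_k = min(bics, key=bics.get)  # number of clusters with the lowest BIC
print(best_k)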
To choose the optimal number of clusters with GMM using the BIC, we can use graspologic’s AutoGMMCluster() class. Again, we set max_components to the maximum number of clusters that we think would be reasonable; in this case, 10:
from graspologic.cluster.autogmm import AutoGMMCluster
autogmm_clust = AutoGMMCluster(max_components=10, random_state=1234)
labels_autogmm_erratic = autogmm_clust.fit_predict(Xhat)
To compute the “best” clustering, AutoGMMCluster() will automatically estimate GMMs with various numbers of components, parameter choices, and other attributes. The best clusterings are the ones with the lowest BIC, so to compare the BIC across each number of clusters, we should find the minimum BIC for a given number of components:
results = autogmm_clust.results_ # obtain the respective results
# summarize by taking the minimum for each number of clusters
results_summarized = results.groupby("n_components").agg({"bic/aic": "min"})
nclusters = range(2, 11) # graspologic nclusters goes from 2 to max_clusters
bic_df = df({"Number of Clusters": nclusters, "BIC": results_summarized["bic/aic"]}) # place into pandas dataframe
fig, ax = plt.subplots(1,1,figsize=(12, 6))
sns.lineplot(data=bic_df,ax=ax, x="Number of Clusters", y="BIC");
ax.set_title("BIC Analysis of GMM Clusterings")
fig;

The optimal number of clusters is identified as the one with the lowest BIC. Finally, let’s look at a normalized confusion matrix comparing the true labels to the labels predicted by AutoGMMCluster():
# compute the confusion matrix between the true labels z
# and the predicted labels labels_autogmm_erratic
cf_matrix = confusion_matrix(z, labels_autogmm_erratic)
cfm_norm = np.divide(cf_matrix, cf_matrix.sum(axis=0))
fig, ax = plt.subplots(1,1, figsize=(6,2))
sns.heatmap(cfm_norm, cmap=cmaps["sequential"], ax=ax)
ax.set_title("Normalized confusion matrix")
ax.set_ylabel("True Label (School)")
ax.set_xlabel("Predicted Label");

As we can see, for a particular predicted label, almost all of the points correspond to a single unique true label. Stated another way, our clusters are non-overlapping (with respect to the true labels), in that a single predicted label tends to correspond to a single true label.
6.1.5. References#
[1] Arash A. Amini, Aiyou Chen, Peter J. Bickel, and Elizaveta Levina. Pseudo-likelihood methods for community detection in large sparse networks. arXiv, July 2012. arXiv:1207.2340, doi:10.1214/13-AOS1138.
[2] M. E. J. Newman. Modularity and community structure in networks. Proc. Natl. Acad. Sci. U.S.A., 103(23):8577–8582, June 2006. doi:10.1073/pnas.0601602103.
[3] Avanti Athreya, Donniell E. Fishkind, Minh Tang, Carey E. Priebe, Youngser Park, Joshua T. Vogelstein, Keith Levin, Vince Lyzinski, and Yichen Qin. Statistical inference on random dot product graphs: a survey. J. Mach. Learn. Res., 18(1):8393–8484, January 2017. doi:10.5555/3122009.3242083.
[4] Bisma S. Khan and M. Niazi. Network Community Detection: A Review and Visual Survey. ArXiv, 2017. URL: https://www.semanticscholar.org/paper/Network-Community-Detection%3A-A-Review-and-Visual-Khan-Niazi/2d4e4dc3e46357440c987fc2048f917e196296d0.
[5] Douglas Reynolds. Gaussian Mixture Models. In Encyclopedia of Biometrics, pages 827–832. Springer, Boston, MA, USA, July 2015. doi:10.1007/978-1-4899-7488-4_196.