6.3. Model Selection
In the past two sections, we’ve covered quite a bit of ground in our understanding of Stochastic Block Models. We introduced the Structured Independent Edge Model, a generalization of the Stochastic Block Model that allows us to compare probabilities of various edge clusters in the network. Next, we introduced the community detection problem, and showed how the adjacency spectral embedding, coupled with various adjustments of unsupervised clustering techniques, allowed us to learn estimated community assignments.
As we have learned over the course of the book so far, one of our core aims in machine learning is to identify the simplest model that maintains a veridical representation of the underlying data with which we are presented. Stated another way, we want a model which is parsimonious: it has enough parameters to explain the data, but not so many that we are overfitting.
Imagine we toss three coins: two identical fair coins, each of which lands on heads with probability $0.5$, and a third coin which might be biased. A parsimonious description of this experiment would use a single shared parameter for the probability of heads of the two identical coins and a separate parameter for the third coin, rather than one parameter for every coin.
Model selection is the task of selecting among several possible statistical models which we think may describe our sample. We do this using quantitative strategies to determine which statistical models are best supported by the data. In our coin flip example with three coins, model selection amounts to deciding how many distinct probabilities of heads we need to faithfully describe the coins.
For instance, we might propose a statistical model where all three coins share a single probability of heads, a model where the first two coins share a probability of heads and the third coin has its own, or a model where each of the three coins has its own probability of heads. A sketch comparing these candidates follows.
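To make this concrete, here is a minimal sketch, with hypothetical data and parameter choices of our own (three coins, 500 flips each), that compares the three candidate models by penalized log-likelihood; BIC is just one common quantitative strategy for such a comparison:
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
n = 500  # flips per coin
# hypothetical data: two fair coins and one biased coin
heads = rng.binomial(n, [0.5, 0.5, 0.7])

def loglik(probs):
    # log-likelihood of the observed head counts, given per-coin probabilities
    return binom.logpmf(heads, n, probs).sum()

# maximum likelihood estimates under each candidate model
p_all = heads.sum() / (3 * n)     # one probability shared by all three coins
p_12 = heads[:2].sum() / (2 * n)  # coins 1 and 2 share a probability
models = {
    "one probability": (loglik(np.repeat(p_all, 3)), 1),
    "two probabilities": (loglik(np.array([p_12, p_12, heads[2] / n])), 2),
    "three probabilities": (loglik(heads / n), 3),
}
for name, (ll, nparams) in models.items():
    bic = nparams * np.log(3 * n) - 2 * ll  # penalizes extra parameters
    print(f"{name}: log-likelihood {ll:.1f}, BIC {bic:.1f}")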
6.3.1. Sequentially nested hypotheses allow us to choose the best hypothesis from the candidates
Let’s formalize this situation a little bit more. If $p_1$, $p_2$, and $p_3$ denote the probabilities that each of the three coins lands on heads, we have the following three hypotheses:
- $H_0: p_1 = p_2 = p_3$; all three coins share a single probability of heads.
- $H_1: p_1 = p_2 \neq p_3$; the first two coins share a probability of heads, and the third coin differs.
- $H_2$: $p_1$, $p_2$, and $p_3$ may all take different values.
A sequence of hypotheses like this one is called sequentially nested: every configuration of probabilities satisfying $H_0$ also satisfies $H_1$, and every configuration satisfying $H_1$ also satisfies $H_2$. In other words, each hypothesis is a special case of the one that comes after it.
What advantage does sequentially nesting hypotheses give us? In statistics, and in particular when dealing with sequences of hypothesis tests which all correspond to candidate models based on the Bernoulli distribution, statistical tests can be used to provably identify the best candidate model corresponding to a hypothesis. When we say provably, in this sense, we mean that if the network is big enough, we will get the right answer. Intuitively, this makes sense: if we were to collect a massive amount of data, we would want the test we designed to be able to decipher the correct answer. This means that we can select one of the candidate hypotheses, and hence one of the candidate models, in a statistically principled way, as in the sketch below.
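Here is a sketch of how such a sequence of tests might proceed on the coin example, assuming the same hypothetical data as above; the likelihood-ratio test stands in here for whichever test one prefers for nested Bernoulli models:
import numpy as np
from scipy.stats import binom, chi2

rng = np.random.default_rng(0)
n = 500
heads = rng.binomial(n, [0.5, 0.5, 0.7])  # two fair coins, one biased coin

def loglik(probs):
    return binom.logpmf(heads, n, probs).sum()

def lr_pvalue(ll_null, ll_alt, df):
    # under the null, 2*(ll_alt - ll_null) is approximately chi-squared(df)
    return chi2.sf(2 * (ll_alt - ll_null), df)

ll0 = loglik(np.repeat(heads.sum() / (3 * n), 3))  # H0: one shared probability
p12 = heads[:2].sum() / (2 * n)
ll1 = loglik(np.array([p12, p12, heads[2] / n]))   # H1: two probabilities
ll2 = loglik(heads / n)                            # H2: three probabilities

# walk up the nested sequence, stopping as soon as the more complex
# hypothesis is not a significant improvement over the simpler one
print("H0 vs H1: p =", lr_pvalue(ll0, ll1, df=1))
print("H1 vs H2: p =", lr_pvalue(ll1, ll2, df=1))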
6.3.2. Developing sequentially nested hypotheses for network data
What does this have to do with network data? Let’s imagine that we are presented with the school network that we are accustomed to.
from graspologic.simulations import sbm
import numpy as np
ns = [50, 50]  # network with 100 nodes
B = [[0.4, 0.2], [0.2, 0.4]]  # block matrix
# sample a single simple adjacency matrix from SBM(z, B)
A = sbm(n=ns, p=B, directed=False, loops=False)
# community assignments: the first 50 nodes are in community 1, the rest in community 2
zs = [1 for i in range(0, ns[0])] + [2 for i in range(0, ns[1])]
from graphbook_code import draw_multiplot
import matplotlib.pyplot as plt
draw_multiplot(A, labels=zs, title="$SBM_n(\\vec z, B)$ Simulation, nodes ordered by school");

Remember that in the two-block SBM, we have the following block matrix: $B = \begin{bmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{bmatrix}$.
The network is undirected by construction, so by definition, $b_{21} = b_{12}$: the block matrix is symmetric. We can estimate the block matrix directly from the adjacency matrix and the community assignments:
from graphbook_code import SBMEstimator
# estimate the block probability matrix from the adjacency matrix,
# using the known community assignments
sbm_mod = SBMEstimator(directed=False, loops=False)
sbm_mod.fit(A, zs)
Bhat = sbm_mod.block_p_
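As a quick sanity check of the estimate (our own addition, not part of the original analysis), we can verify that $\hat B$ is symmetric, as it must be for an undirected network, and see how far it falls from the true block matrix we simulated from:
# the estimate should be symmetric, since the network is undirected
print(np.allclose(Bhat, Bhat.T))
# the largest entrywise deviation of the estimate from the true block matrix
print(np.abs(Bhat - np.array(B)).max())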
import seaborn as sns
import matplotlib.pyplot as plt

def plot_block(X, title="", blockname="School", blocktix=[0.5, 1.5],
               blocklabs=["School 1", "School 2"]):
    fig, ax = plt.subplots(figsize=(8, 6))
    with sns.plotting_context("talk", font_scale=1):
        ax = sns.heatmap(X, cmap="Purples",
                         ax=ax, cbar_kws=dict(shrink=1), yticklabels=False,
                         xticklabels=False, vmin=0, vmax=1, annot=True)
        ax.set_title(title)
        cbar = ax.collections[0].colorbar
        ax.set(ylabel=blockname, xlabel=blockname)
        ax.set_yticks(blocktix)
        ax.set_yticklabels(blocklabs)
        ax.set_xticks(blocktix)
        ax.set_xticklabels(blocklabs)
        cbar.ax.set_frame_on(True)
    return

plot_block(Bhat, title="Estimated Block Probability Matrix, $\\hat B$")

Next, we assemble the following statistical models:
1. The network is Erdos-Renyi: the simplest possible model would be the model where all of the block probabilities are the same; that is, $b_{11} = b_{12} = b_{21} = b_{22}$. This is equivalent to an Erdos-Renyi random network with some probability $p$, $ER_n(p)$. This doesn't seem totally reasonable based on the estimated block probability matrix we see above, but it is better to ask questions and see whether something else is better supported by the data than it is to make outright assumptions. For this reason, we will let the null hypothesis be $H_0: b_{11} = b_{12} = b_{21} = b_{22}$, or that all four entries of the block matrix are the same value. It is typically the case when testing statistical models to let the null hypothesis be that the network is Erdos-Renyi. This hypothesis can be encoded in graspologic using the string "aaaa".
2. Planted partition: as we discussed above, the planted partition model is the model where $b_{11} = b_{22}$ and $b_{12} = b_{21}$, but $b_{11} \neq b_{12}$. The within-community edges share a common probability $b_{11}$, and the between-community edges share a common probability $b_{12}$ which is distinct from $b_{11}$. The corresponding hypothesis is indicated by $H_A: b_{11} = b_{22} \neq b_{12} = b_{21}$. This hypothesis can be encoded in graspologic using the string "abba".
3. Symmetric heterogeneous: as discussed above, the symmetric heterogeneous model is the model where $b_{12} = b_{21}$; that is, the block matrix is symmetric. However, there is heterogeneity in the within-community block probabilities; that is, $b_{11} \neq b_{22}$. This hypothesis is encoded in graspologic using the string "abbc".
To build intuition for how these strings encode constraints on the block matrix, see the sketch after this list.
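Here is a small hypothetical helper of our own (not a graspologic API): the characters of the pattern string are read row-wise across the block matrix, and entries that share a letter are constrained to share a probability.
import numpy as np

def block_matrix_from_pattern(pattern, values):
    # hypothetical helper for intuition only: read the pattern row-wise,
    # filling each entry with the probability assigned to its letter
    probs = [values[ch] for ch in pattern]
    return np.array(probs).reshape(2, 2)

print(block_matrix_from_pattern("aaaa", {"a": 0.3}))                      # Erdos-Renyi
print(block_matrix_from_pattern("abba", {"a": 0.4, "b": 0.2}))            # planted partition
print(block_matrix_from_pattern("abbc", {"a": 0.5, "b": 0.2, "c": 0.3}))  # symmetric heterogeneous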
6.3.3. Multiple Hypothesis Correction ensures the $p$-value we obtain is accurate
When performing multiple statistical tests, we run into the multiple hypothesis correction problem. Let's imagine that we have 5,000 fair coins, each of which we flip 500 times, and for each coin we test the null hypothesis that its probability of landing on heads is $0.5$. Since every coin really is fair, the null hypothesis is true for every single one of these tests.
import numpy as np
from scipy.stats import binomtest
ncoins = 5000  # the number of coins
p = 0.5  # the true probability of heads
n = 500  # the number of flips per coin
# the number of heads from each experiment
experiments = np.random.binomial(n, p, size=ncoins)
# perform a binomial test to see whether the number of heads we obtain
# supports that the true probability is 0.5
pvals = [binomtest(nheads_i, n, p=p).pvalue for nheads_i in experiments]
import pandas as pd
real_df = pd.DataFrame(pvals, columns=["pval"])
fig, ax = plt.subplots(1,1, figsize=(12, 6))
sns.histplot(real_df, x="pval", stat="probability",
             ax=ax, common_norm=False, binwidth=.05)
ax.set_ylabel("Fraction of $p$-values");
ax.set_xlabel("$p$-value");
ax.set_title("$p$-values if the null hypothesis is true");
ax.axvline(0.05, color="red")
ax.text(.08, .08, "Decision threshold, $\\alpha = 0.05$", color="red");

In general, we will correctly fail to reject the null hypothesis. This is great, since the null hypothesis is true here: every coin is fair, with probability of landing on heads $0.5$. Notice, however, that a fraction of the $p$-values still fall below the decision threshold of $\alpha = 0.05$, shown in red; for those coins, we falsely reject the null hypothesis even though it is true.
Basically, the problem here is that while each test individually only has a 5% chance of falsely rejecting the null hypothesis when it is true, across 5,000 tests we expect about $0.05 \times 5000 = 250$ false rejections by random chance alone.
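We can count the false rejections directly from the simulation above; under the null hypothesis, the count should land close to $\alpha \times 5000$:
alpha = 0.05
false_rejections = np.sum(np.array(pvals) < alpha)
print(false_rejections)     # close to the expected count, just by chance
print(int(alpha * ncoins))  # expected number of false rejections: 250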
As practitioners, it would be pretty problematic to report an analysis knowing that some arbitrary fraction of our conclusions are wrong just by random chance. For this reason, a focus of statistics in recent decades has been the development of methods which, in effect, inflate the $p$-values, so that the fraction of false rejections across all of the tests is controlled [2]. To adjust the $p$-values, you can use the multipletests() function from statsmodels. This will give you protection from this "multiple comparisons" issue in your analyses.
Let's see what happens when we adjust our $p$-values using the Holm-Bonferroni correction [1]:
from statsmodels.stats.multitest import multipletests
alpha = 0.05  # the desired alpha of the test
# adjust the p-values with the Holm-Bonferroni step-down procedure
_, adj_pvals, _, _ = multipletests(pvals, alpha=alpha, method="holm")
real_df = pd.DataFrame(adj_pvals, columns=["pval"])
fig, ax = plt.subplots(1,1, figsize=(12, 6))
sns.histplot(real_df, x="pval", stat="probability",
             ax=ax, common_norm=False, binwidth=.05)
ax.set_ylabel("Fraction of adjusted $p$-values");
ax.set_xlabel("Adjusted $p$-value");
ax.set_title("Adjusted $p$-values if the null hypothesis is true");
ax.axvline(0.05, color="red")
ax.text(.08, .08, "Decision threshold, $\\alpha = 0.05$", color="red");

After adjusting for multiple comparisons, we end up with all of the adjusted $p$-values well above the decision threshold of $\alpha = 0.05$: we no longer falsely reject the null hypothesis for any of the 5,000 coins.
In general, for multiple hypothesis correction, we recommend using Holm-Bonferroni correction [1], which is encoded with the parameter multitest_method="holm". We recommend the use of the Holm-Bonferroni approach because it ensures that our chance of falsely rejecting even a single true null hypothesis across all of the tests (the family-wise error rate) is at most $\alpha$, and it achieves this without any assumptions about the dependence between the tests.
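For intuition, here is a minimal re-implementation of the Holm-Bonferroni adjustment itself; this is a simplified sketch of ours, and in practice you should use multipletests as above:
import numpy as np

def holm_adjust(pvals):
    # Holm's step-down adjustment: scale the i-th smallest p-value by (m - i),
    # enforce monotonicity with a running maximum, and cap at 1
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    scaled = (m - np.arange(m)) * pvals[order]
    monotone = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = monotone
    return adjusted

print(holm_adjust([0.01, 0.04, 0.03]))  # [0.03, 0.06, 0.06], matching method="holm"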
6.3.3.1. Model Selection for unweighted networks
Finally, to implement the actual hypothesis test, the procedure is to estimate a test statistic comparing each pair of sequentially nested candidate models, and then correct the resulting $p$-values for multiple comparisons. graspologic makes this implementation easy for us:
alt_hypotheses = ['abba', 'abbd']
pval, opt_structure = sbm_mod.estimate_block_structure(
    A, np.array(zs), candidates=alt_hypotheses,
    test_method="fisher_exact", multitest_method="holm", alpha=.05
)
print("p-value: {:.3f}, optimal-structure: {:s}".format(pval, opt_structure))
WARNING:root:All simulated values are lower than table statistic : pval technically of 0.
WARNING:root:All simulated values are lower than table statistic : pval technically of 0.
p-value: 0.002, optimal-structure: abba
So, the procedure correctly reports that the planted partition structure, "abba", is the best supported by the data, with a $p$-value of about 0.002: we reject the Erdos-Renyi null hypothesis in favor of the planted partition model, and the data do not support the extra complexity of the symmetric heterogeneous model.
6.3.4. References
[1] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70, 1979. URL: http://www.jstor.org/stable/4615733.
[2] T. A. Ryan. Multiple comparisons in psychological research. Psychological Bulletin, 56(1):26–47, January 1959. doi:10.1037/h0042478.