Let \(g=(V,E)\) be a graph, where \(V\) is the set of vertices and \(E\) is the set of binary undirected edges between pairs of vertices. Let \(A\) be a binary adjacency matrix where \(a_{uv}=1\) if and only if there is an edge between \(u\) and \(v\); that is, \((uv) \in E\). Assume \(g\) is a realization of a random graph \(G\) sampled from a distribution \(F\), written \(G \sim F\).
We consider a random graph model that generalizes the stochastic block model: the structured independent edge model (SIEM), \(G \sim SIEM(\tau, \vec{p})\),
where \(\tau\) is a grouping of the \(|E|\) possible edges of \(G\) into \(C\) non-overlapping communities; that is,
\(\cup_{i=1}^{C} \tau_i = E\) and \(\tau_i \cap \tau_j = \emptyset\) for all \(i \neq j\), and \(\vec{p}\) is a vector of probabilities where \(p_i\) is the probability of an edge in community \(\tau_i\) existing.
Assuming the number of edges \(k\) that exist within each edge community is binomially distributed with \(n\) trials (the total number of possible edges in the community) and probability \(p\) (the probability of an edge existing at each possible location), the likelihood function for a single community takes the form:
\[\begin{align*} L(p | n, k) &= f_B(k | n, p) = \begin{pmatrix}n \\ k\end{pmatrix}p^k (1 - p)^{n - k} \\ \log L(p | n, k) &= \log\begin{pmatrix}n \\ k\end{pmatrix} + k\log (p) + (n - k)\log (1 - p) \end{align*}\]Maximizing with respect to \(p\):
\[\begin{align*} \frac{\partial \log L(p | n, k)}{\partial p} &= \frac{k}{p} - \frac{n - k}{1 - p} = 0 \\ \frac{k}{p} &= \frac{n - k}{1 - p} \\ \hat{p} &= \frac{k}{n} \end{align*}\]To get the variance term, we note that \(\hat{p} = \frac{k}{n}\), so \(Var(\hat{p}) = Var\left(\frac{k}{n}\right) = \frac{1}{n^2} Var(k)\). The binomial distribution can be thought of as an aggregation of \(n\) independent Bernoulli trials with probability \(p\); that is, \(k = \sum_{i=1}^n X_i\) where \(X_i \overset{iid}{\sim} Bern(p)\) and \(\mathbb{E}\left[X_i\right] = p\). Given that the variances of independent random variables sum, we can expand:
\[\begin{align*} Var(k) = Var\left(\sum_{i=1}^n X_i\right) &= \sum_{i=1}^n Var(X_i) = \sum_{i=1}^n \mathbb{E}\left[X_i^2\right] - \mathbb{E}\left[X_i\right]^2 \\ \mathbb{E}\left[X_i^2\right] &= 0^2(1-p) + 1^2(p) = p \\ Var(k) &= \sum_{i=1}^n \left(p - p^2\right) = np(1-p) \end{align*}\]Then \(\sigma^2_{\hat{p}} = \frac{Var(k)}{n^2} = \frac{p(1-p)}{n}\). By the central limit theorem, \(\hat{p}\) is approximately normally distributed:
\[\begin{align*} \hat{p} \sim \mathcal{N}\left(\mu_{p}, \sigma^2_{p}\right) \end{align*}\]Given the above model, for any pair of edge communities \(i, j\), the difference of their probabilities \(\delta_{ij} = p_i - p_j\) satisfies:
\[\begin{align*} \delta_{ij} \sim \mathcal{N}\left(\mu_{p_i} - \mu_{p_j}, \sigma^2_{p_i} + \sigma^2_{p_j}\right) \end{align*}\]This lets us test hypotheses of the form \(H_0: R(x_1, x_2)\) against \(H_A: !R(x_1, x_2)\), where \(R(x_1, x_2)\) is a relation between \(x_1\) and \(x_2\) (for example, equality) and \(!R(x_1, x_2)\) is its negation.
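As a sanity check on this normal approximation, we can compare the empirical standard deviation of \(\hat{p}\) over many simulated communities against the analytical value \(\sqrt{p(1-p)/n}\). This is a minimal sketch using only base R; the community size \(n = 200\) and probability \(p = 0.3\) are arbitrary choices for illustration:

```r
set.seed(1234)
n <- 200  # number of possible edges in the community
p <- 0.3  # true probability of each edge existing

# estimate p-hat for each of 10000 simulated communities
phat <- replicate(10000, mean(rbinom(n = n, size = 1, prob = p)))

mean(phat)             # empirical mean of p-hat, close to p
sd(phat)               # empirical standard deviation of p-hat
sqrt(p * (1 - p) / n)  # analytical standard deviation derived above
```

The empirical and analytical standard deviations should agree to within simulation error.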
Welch’s T-Test, for testing whether two populations have equal means when they have unequal variances in the univariate case, provides our test statistic:
\[\begin{align*} T_{observed} = \frac{x_1 - x_2}{\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}}, \end{align*}\]where \(s_1^2\) and \(s_2^2\) are the sample variances of the two groups.
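The degrees of freedom \(\nu\) used below are not the pooled \(n_1 + n_2 - 2\) of the equal-variance test; under Welch's test they are commonly approximated with the Welch-Satterthwaite equation:
\[\begin{align*} \nu \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2 / n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2 / n_2\right)^2}{n_2 - 1}} \end{align*}\]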
The null distribution, and therefore the p-value, is available in R using the t-distribution family of functions (`dt`, `pt`, `qt`, `rt`) from the stats package:
\[\begin{align*} p = 1 - \int_{-\infty}^{T_{observed}} f(x; \nu)dx \end{align*}\]where \(f(x; \nu)\) is the density of the \(t\) distribution with \(\nu\) degrees of freedom.
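To make this concrete, the following sketch computes the Welch statistic and one-sided p-value by hand with `pt` from base R, and checks it against the built-in `t.test`. The two edge-indicator vectors here are hypothetical simulated data, not values from our graphs:

```r
set.seed(42)
# hypothetical edge indicator vectors for two edge communities
x1 <- rbinom(200, size = 1, prob = 0.7)
x2 <- rbinom(200, size = 1, prob = 0.3)

n1 <- length(x1); n2 <- length(x2)
s1sq <- var(x1); s2sq <- var(x2)

# Welch test statistic
se <- sqrt(s1sq / n1 + s2sq / n2)
T.obs <- (mean(x1) - mean(x2)) / se

# Welch-Satterthwaite degrees of freedom
nu <- se^4 / ((s1sq / n1)^2 / (n1 - 1) + (s2sq / n2)^2 / (n2 - 1))

# one-sided p-value: area of the t density above T.obs
p.val <- 1 - pt(T.obs, df = nu)

# agrees (up to floating-point error) with the built-in test
t.test(x1, x2, alternative = "greater", var.equal = FALSE)$p.value
```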
In our below example, we simulate a graph where \(p_{c_1} = 0.3\) and \(p_{c_2} = 0.7\):
nv <- 20 # number of vertices in simulated graph
X <- array(0, dim=c(nv, nv)) # simulated graph initialized
com <- array(0, dim=c(nv, nv)) # community assignment at each edge
Es <- list()
split <- sample(nv^2, nv^2, replace=FALSE) # sample edges in random order
pedge <- data.frame(community=c(), result=c(), p=c())
Es[[1]] <- split[1:(nv^2/2)] # first half in edge-community 1
Es[[2]] <- split[(nv^2/2+1):(nv^2)] # second half in edge-community 2
p <- c(.3, .7) # probabilities between community 1 and 2 are vastly different
for (i in 1:length(Es)) {
X[Es[[i]]] <- rbinom(n=length(Es[[i]]), prob=p[i], size=1)
com[Es[[i]]] <- i
pedge <- rbind(pedge, data.frame(community=c(i), result=c("true"), p=c(p[i])))
pedge <- rbind(pedge, data.frame(community=c(i), result=c("sampled"), p=c(sum(X[Es[[i]]])/length(Es[[i]]))))
}
pedge$community <- factor(pedge$community)
ggplot(pedge, aes(x=community, y=p, fill=result)) +
geom_bar(stat="identity", width=.5, position = "dodge") +
ggtitle("Comparing True vs. Sim Edge Community, Sample 1")
gs.plot.plot_matrix(com, legend.name = "community", xlabel = "vertex",
ylabel = "vertex", title = "Community Each Edge is from, Sample 1", ffactor=TRUE)
gs.plot.plot_matrix(X, legend.name = "connection", xlabel = "vertex",
ylabel = "vertex", title = "Graph Simulated from SIEM, Sample 1", ffactor=TRUE)
Given a graph from a single sample, we might be interested in whether two estimators within that single sample differ. For example, we can test whether the probability of an edge in one community significantly exceeds the probability of an edge in another community for our example graph:
model <- gs.siem.fit(X, Es) # fit siem model
os_pval <- array(0, dim=c(2, 2))
for (i in 1:length(Es)) {
for (j in 1:length(Es)) {
os_pval[i, j] <- gs.siem.sample.test(model$pr[i], model$pr[j], model$var[i],
model$var[j], df=1, alt='greater')$p
}
}
gs.plot.plot_matrix(os_pval, legend.name = TeX("$p$-value"), xlabel = TeX("$c_i$"),
ylabel = TeX("$c_j$"), title = TeX("$p$-value that $p_{c_i} > p_{c_j}$, Sample 1"),
limits=c(0, 1), vfactor=TRUE)
As we can see, with \(\alpha = 0.05\), we reject the null in favor of the alternative for the combination \(p_{c_2} > p_{c_1}\), which intuitively makes sense given the parameters we simulated with, \(p_{c_2} = 0.7\) and \(p_{c_1} = 0.3\).
In our below example, we simulate a second graph where \(p_{c_1} = 0.5\) and \(p_{c_2} = 0.6\):
Given a second graph from a different sample, we may be interested in whether two estimators across the two samples differ. For example, we might ask whether the difference between the estimated edge probabilities of \(c_1\) and \(c_2\) in sample 1 exceeds the corresponding difference in sample 2; that is, whether \(\delta_{s_1} = x_{c_1, 1} - x_{c_2, 1} > \delta_{s_2} = x_{c_1, 2} - x_{c_2, 2}\):
X2 <- array(0, dim=c(nv, nv)) # simulated graph initialized
com2 <- array(0, dim=c(nv, nv)) # community assignment at each edge
pedge <- data.frame(community=c(), result=c(), p=c())
p2 <- c(.5, .6) # probabilities for community 1 and 2 differ only slightly
for (i in 1:length(Es)) {
X2[Es[[i]]] <- rbinom(n=length(Es[[i]]), prob=p2[i], size=1)
com2[Es[[i]]] <- i
pedge <- rbind(pedge, data.frame(community=c(i), result=c("true"), p=c(p2[i])))
pedge <- rbind(pedge, data.frame(community=c(i), result=c("sampled"), p=c(sum(X2[Es[[i]]])/length(Es[[i]]))))
}
pedge$community <- factor(pedge$community)
ggplot(pedge, aes(x=community, y=p, fill=result)) +
geom_bar(stat="identity", width=.5, position = "dodge") +
ggtitle("Comparing True vs. Sim Edge Community, Sample 2")
gs.plot.plot_matrix(com2, legend.name = "community", xlabel = "vertex",
ylabel = "vertex", title = "Community Each Edge is from, Sample 2", ffactor=TRUE)
gs.plot.plot_matrix(X2, legend.name = "connection", xlabel = "vertex",
ylabel = "vertex", title = "Graph Simulated from SIEM, Sample 2", ffactor=TRUE)
model2 <- gs.siem.fit(X2, Es) # fit siem model
gs.siem.sample.test(model$dpr[2, 1], model2$dpr[2, 1], model$dvar[2, 1],
model2$dvar[2, 1], df=2, alt='greater')$p
## [1] 0.02942545
and we can see that at \(\alpha=0.05\) we detect a significant difference between the two deltas.
First, we investigate the consistency of our estimators, or whether the estimators converge to the true parameters as sample size increases:
ns = round(10^seq(1, log10(1225), length=10))
ps = seq(0, 1, length=11)
ndat = length(ns)*length(ps)
empty_ar = array(NaN, dim=c(ndat))
results = data.frame(n = empty_ar, p = empty_ar, mu = empty_ar, var = empty_ar)
counter = 1
nsim = 10
for (n in ns) {
for (p in ps) {
v_ar = array(NaN, dim=c(nsim))
m_ar = array(NaN, dim=c(nsim))
for (i in 1:nsim) {
pemp = replicate(n, {
dat = rbinom(n = n, prob = p, size=1)
phat = sum(dat)/length(dat)
})
m_ar[i] = abs(mean(pemp) - p)
v_ar[i] = abs(var(pemp) - model.var(p, n))
}
results[counter,] = data.frame(n = n, p = p, mu = mean(m_ar),
var = mean(v_ar))
counter <- counter + 1
}
}
results$n = factor(results$n)
results$p = factor(results$p)
ggplot(results, aes(x = n, y = mu, group=p, color=p)) +
geom_line() +
ggtitle(TeX('Consistency of estimator $\\mu_{\\hat{p}}$, average of 10 simulations')) +
xlab("Number of possible edges") +
ylab(TeX('$\\left|p_{analytical} - \\mu_{\\hat{p}}\\right|$')) +
scale_color_discrete(name=TeX("$p_{analytical}$"))
ggplot(results, aes(x = n, y = var, group=p, color=p)) +
geom_line() +
ggtitle(TeX('Consistency of estimator $\\sigma^2_{\\hat{p}}$, average of 10 simulations')) +
xlab("Number of possible edges") +
ylab(TeX('$\\left|Var(p_{analytical}) - \\sigma^2_{\\hat{p}}\\right|$')) +
scale_color_discrete(name=TeX("$p_{analytical}$"))
As we can see, as our number of possible edges increases, our estimators for \(\mu_{\hat{p}}\) and \(\sigma^2_{\hat{p}}\) converge to the true parameters, indicating we have consistent estimators.
# computes the power of the model under a given significance level
# accepts params for a number of simulations to average power over, and a
# number of graphs for each computation
# number of edges defines the number of edges to use in the binomial simulation
t.power.p <- function(means, ne=1225, sig=.95, nsim=100, ngr=100) {
ucut = qt(sig, df=ngr) # critical value of the null t distribution at the given significance level, with df = ngr degrees of freedom
ts = replicate(nsim, { # replicate our described test nsim times
alt = replicate(ngr, sum(rbinom(n = ne, size=1, prob = means[1]))/ne)
null = replicate(ngr, sum(rbinom(n = ne, size=1, prob = means[2]))/ne)
t.test(alt, null, alternative = "greater", var.equal = FALSE)$statistic
})
ana_tstat = gs.siem.sample.test(means[1], means[2], model.var(means[1], n=ne),
model.var(means[2], n=ne), n1=ngr, n2=ngr)$stat
return(list(power=sum(ts > ucut)/nsim, diff=abs(mean(ts) - ana_tstat)/ana_tstat))
}
In this experiment, we analyze the power of the test we developed. Assuming that the entire graph has average \(p=0.5\), we simulate from a block model where the within-group edges have probability \(p_{within}=0.5 + \epsilon\), and the outside-of-group edges have \(p_{outside} = 0.5 - \epsilon\). We use a cutoff at the \(0.95\) quantile of the null \(T\) distribution (\(\alpha = 0.05\)), and vary the number of observations between \(10\) and \(\frac{2450}{2}=1225\), since our real data has \(2450\) total edges, yielding \(1225\) observations per group. Our simulation is structured as follows:
p = 0.5
diff = seq(0, 0.1, length=21)
ns = round(10^seq(1, log10(1225), length=10))
ndat = length(ns)*length(diff)
empty_ar = array(NaN, dim=c(ndat))
dat = data.frame(ns = empty_ar, diff=empty_ar, pow=empty_ar, tdiff=empty_ar)
counter = 1
for (j in 1:length(ns)) {
n = ns[j]
for (i in 1:length(diff)) {
in.p = p + diff[i]/2
out.p = p - diff[i]/2
# under the model, assume the p_in is the mean within group, and p_out is the mean outside of group
# compute the standard deviation according to the model
means = c(in.p, out.p)
result = t.power.p(means, ne=n)
dat[counter,] = c(ns=n, diff=diff[i], pow=result$power, tdiff=result$diff)
counter = counter + 1
}
}
First, we look at power as a function of the number of edges in our simulation, as we vary the difference between the within community and outside community probabilities:
dat$ns = factor(dat$ns)
dat$diff = factor(dat$diff)
thresh = data.frame(diff=diff, sig=.05)
thresh$diff = factor(thresh$diff)
ggplot(dat, aes(x = diff, y = pow, group=ns, color=ns)) +
geom_line() +
ggtitle(TeX('Power of Unequal-Variance T-Test with 100 simulations, 100 $\\frac{graphs}{simulation}$')) +
xlab(TeX('Difference in $p_{within} - p_{outside}$')) +
ylab('Power of Test') +
scale_color_discrete(name="number of edges") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
And we also look at how the analytical test-statistic computed from our trials compares to the empirical test-statistics estimated from our simulation procedure:
ggplot(dat, aes(x = diff, y = tdiff, group=ns, color=ns)) +
geom_line() +
ggtitle(TeX('Analytical T-Test compared to Empirical T-Test')) +
xlab(TeX('Difference in $\\left|p_{within} - p_{outside}\\right|$')) +
ylab(TeX('$\\frac{\\left|\\bar{T}_{empirical} - T_{analytical}\\right|}{T_{analytical}}$')) +
scale_color_discrete(name="number of edges") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Here, we verify that our estimator \(\hat{\delta}\) is consistent; that is, that our estimates become more accurate as the sample size increases:
ns = round(10^seq(1, log10(1225), length=10))
ds = seq(0, 1.0, length=21)
ndat = length(ns)*length(ds)
empty_ar = array(NaN, dim=c(ndat))
results = data.frame(n = empty_ar, d = empty_ar, mu = empty_ar, var = empty_ar)
counter = 1
nsim = 10
for (n in ns) {
for (d in ds) {
v_ar = array(NaN, dim=c(nsim))
m_ar = array(NaN, dim=c(nsim))
for (i in 1:nsim) {
p1 = 0.5 + d/2
p2 = 0.5 - d/2
demp = replicate(n, {
dat1 = rbinom(n = n, prob = p1, size=1)
dat2 = rbinom(n = n, prob = p2, size=1)
dhat = sum(dat1 - dat2)/length(dat1)
})
params <- gs.siem.model.params(demp)
m_ar[i] = abs(params$p - d)
v_ar[i] = abs(params$var - (model.var(p1, n) + model.var(p2, n)))
}
results[counter,] = data.frame(n = n, d = d, mu = mean(m_ar),
var = mean(v_ar))
counter <- counter + 1
}
}
results$n = factor(results$n)
results$d = factor(results$d)
ggplot(results, aes(x = n, y = mu, group=d, color=d)) +
geom_line() +
ggtitle(TeX('Consistency of estimator $\\mu_{\\hat{\\delta}}$, average of 10 simulations')) +
xlab("Number of possible edges") +
ylab(TeX('$\\left|\\delta_{analytical} - \\mu_{\\hat{\\delta}}\\right|$')) +
scale_color_discrete(name=TeX("$\\delta_{analytical}$"))
ggplot(results, aes(x = n, y = var, group=d, color=d)) +
geom_line() +
ggtitle(TeX('Consistency of estimator $\\sigma^2_{\\hat{\\delta}}$, average of 10 simulations')) +
xlab("Number of possible edges") +
ylab(TeX('$\\left|Var(\\delta_{analytical}) - \\sigma^2_{\\hat{\\delta}}\\right|$')) +
scale_color_discrete(name=TeX("$\\delta_{analytical}$"))
As we can see, as our number of possible edges increases, our estimators for \(\mu_{\hat{\delta}}\) and \(\sigma^2_{\hat{\delta}}\) converge to the true parameters, indicating we have consistent estimators.
# computes the power of the model under a given significance level
# accepts params for a number of simulations to average power over, and a
# number of graphs for each computation
# number of edges defines the number of edges to use in the binomial simulation
t.power.delta = function(diffs, ne=1225, sig=.95, nsim=100, ngr=100) {
ucut = qt(sig, df=ngr) # critical value of the null t distribution at the given significance level, with df = ngr degrees of freedom
ts = replicate(nsim, { # replicate our described test nsim times
alt = replicate(ngr, sum(rbinom(n = ne, size=1, prob = 0.5 + diffs[1]/2))/ne - sum(rbinom(n = ne, size=1, prob=0.5 - diffs[1]/2))/ne)
null = replicate(ngr, sum(rbinom(n = ne, size=1, prob = 0.5 + diffs[2]/2))/ne - sum(rbinom(n = ne, size=1, prob = 0.5 - diffs[2]/2))/ne)
t.test(alt, null, alternative = "greater", var.equal = FALSE)$statistic
})
v1 = model.var(0.5 + diffs[1]/2, ne) + model.var(0.5 - diffs[1]/2, ne)
v2 = model.var(0.5 + diffs[2]/2, ne) + model.var(0.5 - diffs[2]/2, ne)
ana_tstat = gs.siem.sample.test(diffs[1], diffs[2], v1, v2, n1=ngr, n2=ngr)$stat
return(list(power=sum(ts > ucut)/nsim, diff=abs(mean(ts) - ana_tstat)/ana_tstat))
}
In this experiment, we analyze the power of the test we developed. Assuming that the entire graph has average \(p=0.5\), we simulate from a block model where the within-group edges have probability \(p_{within}=0.5 + \epsilon\), and the outside-of-group edges have \(p_{outside} = 0.5 - \epsilon\), for two different populations of graphs. We use a cutoff at the \(0.95\) quantile of the null \(T\) distribution (\(\alpha = 0.05\)), and vary the number of observations between \(10\) and \(\frac{2450}{2}=1225\), since our real data has \(2450\) total edges, yielding \(1225\) observations per group. Our simulation is structured as follows:
maxdiff = 0.2
diff = seq(0, maxdiff, length=21)
ns = round(10^seq(1, log10(1225), length=10))
ndat = length(ns)*length(diff)
empty_ar = array(NaN, dim=c(ndat))
dat = data.frame(ns = empty_ar, diff=empty_ar, pow=empty_ar, tdiff=empty_ar)
counter = 1
for (j in 1:length(ns)) {
n = ns[j]
for (i in 1:length(diff)) {
in.d = maxdiff + diff[i]/2
out.d = maxdiff - diff[i]/2
# under the model, the two populations have within-minus-outside differences
# (deltas) centered at maxdiff and separated by diff[i]
diffs = c(in.d, out.d)
result = t.power.delta(diffs, ne=n)
dat[counter,] = c(ns=n, diff=diff[i], pow=result$power, tdiff=result$diff)
counter = counter + 1
}
}
First, we look at power as a function of the number of edges in our simulation, as we vary the difference between the \(\delta\) for each population of graphs:
dat$ns = factor(dat$ns)
dat$diff = factor(dat$diff)
thresh = data.frame(diff=diff, sig=.05)
thresh$diff = factor(thresh$diff)
ggplot(dat, aes(x = diff, y = pow, group=ns, color=ns)) +
geom_line() +
ggtitle(TeX('Power of Unequal-Variance T-Test with 100 simulations, 100 $\\frac{graphs}{simulation}$')) +
xlab(TeX('Difference in $\\left|\\delta_1 - \\delta_2\\right|$')) +
ylab('Power of Test') +
scale_color_discrete(name="number of edges") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
And we also look at how the analytical test-statistic computed from our trials compares to the empirical test-statistics estimated from our simulation procedure:
ggplot(dat, aes(x = diff, y = tdiff, group=ns, color=ns)) +
geom_line() +
ggtitle(TeX('Analytical T-Test compared to Empirical T-Test')) +
xlab(TeX('Difference in $\\left|\\delta_1 - \\delta_2\\right|$')) +
ylab(TeX('$\\frac{\\left|\\bar{T}_{empirical} - T_{analytical}\\right|}{T_{analytical}}$')) +
scale_color_discrete(name="number of edges") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))