Let \(g=(V,E)\) be a graph, where \(V\) is the set of vertices that is shared for all \(i\), and \(E_i\) is the set of binary undirected edges between pairs of vertices. Let \(A\) be a binary adjacency matrix where \(a_{uv}=1\) if and only if there is an edge between \(u\) and \(v\), that is, \((uv) \in E\). Assume \(g\) is a realization of a random graph \(G \sim F\), which is sampled from a distribution \(F\).
We consider a random graph models that generalizes the stochastic block model, the structured independent edge models (SIEM).

Setting

Statistical Model

\[\begin{align*} A \sim SIEM(P, \tau) \end{align*}\]

where \(\tau\) is a grouping of the \(|E|\) edges in \(G\) into \(C\) non-overlapping communities, that is,
\(\cup_{i=1}^{C} \tau_i =E\), and \(\tau_i \cap \tau_j = \emptyset\) for all \(i \neq j\), and \(P\) is a matrix of probabilities where \(P_{ij}\) represents the probability of an edge existing between \(\tau_i\) and \(\tau_j\).

Estimators

Probability

Assuming the number of edges that exist within each edge community \(k\) is binomially distributed with \(n\) trials (the total number of possible edges) and probability \(p\) (the probability of an edge existing at each possible location), the likelihood function is of the form:

\[\begin{align*} L(p | n, k) &= \prod_{k=0}^n f_B(n, k | p) = \prod_{k=0}^n \begin{pmatrix}n \\ k\end{pmatrix}p^k (1 - p)^{n - k} \\ log(L(p | n, k)) &= \sum_{k=0}^n \log\left(\begin{pmatrix}n \\ k\end{pmatrix}\right) + k\log (p) + (n - k)\log (1 - p) \end{align*}\]

Maximizing with respect to \(p\):

\[\begin{align*} \frac{\delta log(L(p | n, k))}{\delta p} &= \sum_{k=0}^n \frac{k}{p} - \frac{n - k}{1 - p} = 0 \\ \frac{k}{p} &= \frac{n - k}{1 - p} \\ \mu_p &= \mathbb{E}[p] = \frac{k}{n} \end{align*}\]

to get the variance term, we note that \(\hat{p} = \frac{k}{n}\), so then \(Var(p) = \sigma_p = Var\left(\frac{k}{n}\right) = \frac{1}{n^2} Var(k)\). The binomial distriibution can be thought of as an aggregation of \(n\) independent bernoulli trials with probability \(p\); that is, \(X_i \overset{iid}{\sim} Bern(p)\) where \(\mathbb{E}\left[X_i\right] = p\). Given that the variance of independent events sum, we can expand:

\[\begin{align*} Var(\sum_{i=1}^n X_i) &= \sum_{i=1}^n Var(X_i) = \sum_{i=1}^n E\left[X_i^2\right] - E\left[X_i\right]^2 \\ \mathbb{E}\left[X_i^2\right] &= 0^2(1-p) + 1^2(p) = p \\ Var(k) &= \sum_{i=1}^n \mathbb{E}\left[X_i^2\right] - \mathbb{E}\left[X_i\right]^2 \\ &= np(1-p) \end{align*}\]

Then \(\sigma_\hat{p} = \frac{p(1-p)}{n}\). As we can see, \(p\) is normally distributed where:

\[\begin{align*} p \sim \mathcal{N}\left(\mu_p, \sigma_p\right) \end{align*}\]

Difference in Probability

Given the above model, we can see that for any pair of edge communities \(i, j\), their difference \(\delta_{ij} = p_i - p_j\), then:

\[\begin{align*} \delta_{ij} \sim \mathcal{N}\left(\mu_{p_i} - \mu_{p_j}, \sigma_{p_i} + \sigma_{p_j}\right) \end{align*}\]

Hypothesis Testing

Our hypothesis test can be stated as follows: \[\begin{align*} H_0&: !R(x_{1}, x_{2}) \\ H_A&: R(x_{1}, x_{2}). \end{align*}\]

where \(R(x_1, x_2)\) is a relation between \(x_1\) and \(x_2\), and \(!R(x_1, x_2)\) is the opposite of the relation between \(x_1\) and \(x_2\).

Welch’s T-Test ~ for testing whether populations have equal means given that they have different variances in the univariate case provides out test statistic:

\[\begin{align*} T_{observed} = \frac{x_1 - x_2}{\sqrt{\frac{s_{1}^2}{n_{1}} + \frac{s_{2}^2}{n_{2}}}}, \end{align*}\]

where \(s_1 = \sigma_{\hat{x}_1},\;s_2 = \sigma_{\hat{x}_2}\).

The null distribution, and therefore p-value, is available from R, using the family of functions from the package:

\[\begin{align*} p = 1 - \int_{-\infty}^{T_{observed}} p(x, \nu)dx \end{align*}\]

where \(\nu\) is the number of degrees of freedom.

Example

In our below example, we simulate a graph where \(p_{c_1} = 0.3\) and \(p_{c_2} = 0.7\):

nv <- 20  # number of vertices in simulated graph
X <- array(0, dim=c(nv, nv))  # simulated graph initialized
com <- array(0, dim=c(nv, nv))  # probability at each edge
Es <- list()
split <- sample(nv^2, nv^2, replace=FALSE)  # sample edges in random order
pedge <- data.frame(community=c(), result=c(), p=c())

Es[[1]] <- split[1:(nv^2/2)]  # first half in edge-community 1
Es[[2]] <- split[(nv^2/2+1):(nv^2)]  # second half in edge-community 2

p <- c(.3, .7)  # probabilities between community 1 and 2 are vastly different

for (i in 1:length(Es)) {
  X[Es[[i]]] <- rbinom(n=length(Es[[i]]), prob=p[i], size=1)
  com[Es[[i]]] <- i
  pedge <- rbind(pedge, data.frame(community=c(i), result=c("true"), p=c(p[i])))
  pedge <- rbind(pedge, data.frame(community=c(i), result=c("sampled"), p=c(sum(X[Es[[i]]])/length(Es[[i]]))))
}

pedge$community <- factor(pedge$community)
ggplot(pedge, aes(x=community, y=p, fill=result)) +
  geom_bar(stat="identity", width=.5, position = "dodge") +
  ggtitle("Comparing True vs. Sim Edge Community, Sample 1")

gs.plot.plot_matrix(com, legend.name = "community", xlabel = "vertex",
                    ylabel = "vertex", title = "Community Each Edge is from, Sample 1", ffactor=TRUE)

gs.plot.plot_matrix(X, legend.name = "connection", xlabel = "vertex",
                    ylabel = "vertex", title = "Graph Simulated from SIEM, Sample 1", ffactor=TRUE)

One Sample Test

Given a graph from a single sample, we might be interested in whether two estimators within that single sample differ. For example, we can detect the probability of an edge in one community singificantly exceeding the probability of an edge in another community for our example graph:

model <- gs.siem.fit(X, Es)  # fit siem model
os_pval <- array(0, dim=c(2, 2))
for (i in 1:length(Es)) {
  for (j in 1:length(Es)) {
    os_pval[i, j] <- gs.siem.sample.test(model$pr[i], model$pr[j], model$var[i],
                                         model$var[j], df=1, alt='greater')$p
  }
}

gs.plot.plot_matrix(os_pval, legend.name = TeX("$p$-value"), xlabel = TeX("$c_i$"),
                     ylabel = TeX("$c_j$"), title = TeX("$p$-value that $p_{c_i} > p_{c_j}$, Sample 1"),
                     limits=c(0, 1), vfactor=TRUE)

as we can see, with \(\alpha = 0.05\), we reject the null in favor of the alternative for the combination \(p_{c_2} > p_{c_1}\), which intuitively makes sense under the parameters we estimated previously as \(p_{c_2} = 0.7\) and \(p_{c_1} = 0.3\).

Two Sample Test

In our below example, we simulate a second graph where \(p_{c_1} = 0.5\) and \(p_{c_2} = 0.6\):

Given a second graph from a different sample, we may be interested in whether two estimators across the 2 samples differ. For example, we might be interested in whether the difference in the estimator of the probability of an edge in \(c_2\) exceeds the probability of an edge in \(c_1\) from sample 1 to sample 2. That is, whether \(\delta_{s_1} = x_{c_1, 1} - x_{c_2. 1} > \delta_{s_2} = x_{c_1, 2} - x_{c_2. 2}\):

X2 <- array(0, dim=c(nv, nv))  # simulated graph initialized
com2 <- array(0, dim=c(nv, nv))  # probability at each edge
pedge <- data.frame(community=c(), result=c(), p=c())

p2 <- c(.5, .6)  # probabilities between community 1 and 2 are vastly different

for (i in 1:length(Es)) {
  X2[Es[[i]]] <- rbinom(n=length(Es[[i]]), prob=p2[i], size=1)
  com[Es[[i]]] <- i
  pedge <- rbind(pedge, data.frame(community=c(i), result=c("true"), p=c(p2[i])))
  pedge <- rbind(pedge, data.frame(community=c(i), result=c("sampled"), p=c(sum(X[Es[[i]]])/length(Es[[i]]))))
}

pedge$community <- factor(pedge$community)
ggplot(pedge, aes(x=community, y=p, fill=result)) +
  geom_bar(stat="identity", width=.5, position = "dodge") +
  ggtitle("Comparing True vs. Sim Edge Community, Sample 2")

gs.plot.plot_matrix(com, legend.name = "community", xlabel = "vertex",
                     ylabel = "vertex", title = "Community Each Edge is from, Sample 2", ffactor=TRUE)

gs.plot.plot_matrix(X, legend.name = "connection", xlabel = "vertex",
                     ylabel = "vertex", title = "Graph Simulated from SIEM, Sample 2", ffactor=TRUE)

model2 <- gs.siem.fit(X2, Es)  # fit siem model
gs.siem.sample.test(model$dpr[2, 1], model2$dpr[2, 1], model2$dvar[2, 1],
                    model2$dvar[2, 1], df=2, alt='greater')$p
## [1] 0.02942545

and we can see that at \(\alpha=0.05\) we detect a significant difference between the two.

Simulations

\(p\) Estimation

First, we investigate the consistency of our estimators, or whether the estimators converge to the true parameters as sample size increases:

ns = round(10^seq(1, log10(1225), length=10))
ps = seq(0, 1, length=11)
ndat = length(ns)*length(ps)
empty_ar = array(NaN, dim=c(ndat))
results = data.frame(n = empty_ar, p = empty_ar, mu = empty_ar, var = empty_ar)
counter = 1
nsim = 10
for (n in ns) {
  for (p in ps) {
    v_ar = array(NaN, dim=c(nsim))
    m_ar = array(NaN, dim=c(nsim))
    for (i in 1:nsim) {
      pemp = replicate(n, {
        dat = rbinom(n = n, p = p, size=1)
        phat = sum(dat)/length(dat)
        })
      m_ar[i] = abs(mean(pemp) - p)
      v_ar[i] = abs(var(pemp) - model.var(p, n))
    }
    results[counter,] = data.frame(n = n, p = p, mu = mean(m_ar),
                                   var = mean(v_ar))
    counter <- counter + 1
  }
}

results$n = factor(results$n)
results$p = factor(results$p)

ggplot(results, aes(x = n, y = mu, group=p, color=p)) +
  geom_line() +
  ggtitle(TeX('Consistency of estimator $\\mu_{\\hat{p}}$, average of 10 simulations')) +
  xlab("Number of possible edges") +
  ylab(TeX('$\\left|p_{analytical} - \\mu_{\\hat{p}}\\right|$')) +
  scale_color_discrete(name=TeX("$p_{analytical}$"))

ggplot(results, aes(x = n, y = var, group=p, color=p)) +
  geom_line() +
  ggtitle(TeX('Consistency of estimator $\\sigma^2_{\\hat{p}}$, average of 10 simulations')) +
  xlab("Number of possible edges") +
  ylab(TeX('$\\left|Var(p_{analytical}) - \\sigma^2_{\\hat{p}}\\right|$')) +
  scale_color_discrete(name=TeX("$p_{analytical}$"))

As we can see, as our number of possible edges increases, our estimators for \(\mu\) and \(\sigma^2\) converge, indicating we have consistent estimators.

Statistical Power

# computes the power of the model under a given significance level
# accepts params for a number of simulations to average power over, and a
# number of graphs for each computation
# number of edges defines the number of edges to use in the binomial simulation
t.power.p <- function(means, ne=1225, sig=.95, nsim=100, ngr=100) {
  ucut = qt(sig, df=ngr)  # t-statistic of null at the given significance level with ne-2 degrees of freedom
  ts = replicate(nsim, {  # replicate our described test n tsim times
    alt = replicate(ngr, sum(rbinom(n = ne, size=1, prob = means[1]))/ne)
    null = replicate(ngr, sum(rbinom(n = ne, size=1, prob = means[2]))/ne)
    t.test(alt, null, alternative = "greater", var.equal = FALSE)$statistic
  })
  ana_tstat = gs.siem.sample.test(means[1], means[2], model.var(means[1], n=ne),
                                  model.var(means[2], n=ne), n1=ngr, n2=ngr)$stat
  return(list(power=sum(ts > ucut)/nsim, diff=abs(mean(ts) - ana_tstat)/ana_tstat))
}

In this experiment, we will analyze the power of our test developed. Assuming that the entire graph has average \(p=0.5\), we will simulated from a block model where the probabiliy of the within-group edges have \(p_{within}=0.5 + \epsilon\), and the outside of group edges have \(p_{outside} = 0.5 - \epsilon\). We will assume a significance level of \(0.95\) for our \(T\) cutoff, and fix the number of observations between 0 and \(\frac{2550}{2}=1225\), since our real data has \(2450\) total edges yielding \(1225\) observations per-group. Our simulation will be structured as follows:

  • Simulate \(n\) edges from a binomial distribution given \(ne, p + \epsilon\), the alternative samples.
  • Simulate \(n\) edges from a binomial distribution given \(ne, p - \epsilon\), the null samples.
  • Compute the empirical distribution for \(\hat{p}\) for the alternative and null samples, respectively by repeating the above procedure \(ns\) times.
  • derive the power from the respective empirical distribution of \(\hat{p}\) as the fraction of test statistics more extreme than the critical test statistic.
  • compute the difference between the average simulated test statistic and the analytical test statistic.
p = 0.5
diff = seq(0,  0.1, length=21)
ns = round(10^seq(1, log10(1225), length=10))
ndat = length(ns)*length(diff)
empty_ar = array(NaN, dim=c(ndat))
dat = data.frame(ns = empty_ar, diff=empty_ar, pow=empty_ar, tdiff=empty_ar)
counter = 1
for (j in 1:length(ns)) {
  n = ns[j]
  for (i in 1:length(diff)) {
    in.p = p + diff[i]/2
    out.p = p - diff[i]/2
    # under the model, assume the p_in is the mean within group, and p_out is the mean outside of group
    # compute the standard deviation according to the model
    means = c(in.p, out.p)
    result = t.power.p(means, ne=n)
    dat[counter,] = c(ns=n, diff=diff[i], pow=result$power, tdiff=result$diff)
    counter = counter + 1
  }
}

First, we look at power as a function of the number of edges in our simulation, as we vary the difference between the within community and outside community probabilities:

dat$ns = factor(dat$ns)
dat$diff = factor(dat$diff)
thresh = data.frame(diff=diff, sig=.05)
thresh$diff = factor(thresh$diff)
ggplot(dat,  aes(x = diff, y = pow, group=ns, color=ns)) +
  geom_line() +
  ggtitle(TeX('Power of Unequal-Variance T-Test with 100 simulations, 100 $\\frac{graphs}{simulation}$')) +
  xlab(TeX('Difference in $p_{within} - p_{outside}$')) +
  ylab('Power of Test') +
  scale_color_discrete(name="number of edges") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

And we also look at how the analytical test-statistic computed from our trials compares to the empirical test-statistics estimated from our simulation procedure:

ggplot(dat, aes(x = diff, y = tdiff, group=ns, color=ns)) +
  geom_line() +
  ggtitle(TeX('Analytical T-Test compared to Empirical T-Test')) +
  xlab(TeX('Difference in $\\left|p_{within} - p_{outside}\\right|$')) +
  ylab(TeX('$\\frac{\\left|\\bar{T}_{empirical} - T_{analytical}\\right|}{T_{analytical}}')) +
  scale_color_discrete(name="number of edges") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Delta Estimation

Consistency of Estimators for \(\hat{\delta}\)

Here, we will verify that our estimators of \(\hat{\delta}\) are correct, that is, that our estimation gets more accurate as our sample size increases:

ns = round(10^seq(1, log10(1225), length=10))
ds = seq(0, 1.0, length=21)
ndat = length(ns)*length(ds)
empty_ar = array(NaN, dim=c(ndat))
results = data.frame(n = empty_ar, d = empty_ar, mu = empty_ar, var = empty_ar)
counter = 1
nsim = 10
for (n in ns) {
  for (d in ds) {
    v_ar = array(NaN, dim=c(nsim))
    m_ar = array(NaN, dim=c(nsim))
    for (i in 1:nsim) {
      p1 = 0.5 + d/2
      p2 = 0.5 - d/2
      demp = replicate(n, {
        dat1 = rbinom(n = n, p = p1, size=1)
        dat2 = rbinom(n = n, p = p2, size=1)
        dhat = sum(dat1 - dat2)/length(dat1)
        })
      params <- gs.siem.model.params(demp)
      m_ar[i] = abs(params$p - d)
      v_ar[i] = abs(params$var - (model.var(p1, n) + model.var(p2, n)))
    }
    results[counter,] = data.frame(n = n, d = d, mu = mean(m_ar),
                                   var = mean(v_ar))
    counter <- counter + 1
  }
}

results$n = factor(results$n)
results$d = factor(results$d)

ggplot(results, aes(x = n, y = mu, group=d, color=d)) +
  geom_line() +
  ggtitle(TeX('Consistency of estimator $\\mu_{\\hat{\\delta}}$, average of 10 simulations')) +
  xlab("Number of possible edges") +
  ylab(TeX('$\\left|\\delta_{analytical} - \\mu_{\\hat{\\delta}}\\right|$')) +
  scale_color_discrete(name=TeX("$\\delta_{analytical}$"))

ggplot(results, aes(x = n, y = var, group=d, color=d)) +
  geom_line() +
  ggtitle(TeX('Consistency of estimator $\\sigma^2_{\\hat{\\delta}}$, average of 10 simulations')) +
  xlab("Number of possible edges") +
  ylab(TeX('$\\left|Var(\\delta_{analytical}) - \\sigma^2_{\\hat{\\delta}}\\right|$')) +
  scale_color_discrete(name=TeX("$\\delta_{analytical}$"))

As we can see, as our number of possible edges increases, our estimators for \(\mu\) and \(\sigma^2\) converge, indicating we have consistent estimators.

Statistical Power

# computes the power of the model under a given significance level
# accepts params for a number of simulations to average power over, and a
# number of graphs for each computation
# number of edges defines the number of edges to use in the binomial simulation
t.power.delta = function(diffs, ne=1225, sig=.95, nsim=100, ngr=100) {
  ucut = qt(sig, df=ngr)  # t-statistic of null at the given significance level with ne-2 degrees of freedom
  ts = replicate(nsim, {  # replicate our described test n tsim times
    alt = replicate(ngr, sum(rbinom(n = ne, size=1, prob = 0.5 + diffs[1]/2))/ne - sum(rbinom(n = ne, size=1, prob=0.5 - diffs[1]/2))/ne)
    null = replicate(ngr, sum(rbinom(n = ne, size=1, prob = 0.5 + diffs[2]/2))/ne - sum(rbinom(n = ne, size=1, prob = 0.5 - diffs[2]/2))/ne)
    t.test(alt, null, alternative = "greater", var.equal = FALSE)$statistic
  })
  v1 =  model.var(0.5 + diffs[1]/2, ne) + model.var(0.5 - diffs[1]/2, ne)
  v2  =  model.var(0.5 + diffs[2]/2, ne) + model.var(0.5 - diffs[2]/2, ne)
  ana_tstat = gs.siem.sample.test(diffs[1], diffs[2], v1, v2, n1=ngr, n2=ngr)$stat
  return(list(power=sum(ts > ucut)/nsim, diff=abs(mean(ts) - ana_tstat)/ana_tstat))
}

In this experiment, we will analyze the power of our test developed. Assuming that the entire graph has average \(p=0.5\), we will simulated from a block model where the probabiliy of the within-group edges have \(p_{within}=0.5 + \epsilon\), and the outside of group edges have \(p_{outside} = 0.5 - \epsilon\) for two different populations of graphs. We will assume a significance level of \(0.95\) for our \(T\) cutoff, and fix the number of observations between 0 and \(\frac{2550}{2}=1225\), since our real data has \(2450\) total edges yielding \(1225\) observations per-group. Our simulation will be structured as follows:

  • Simulate \(n\) graphs with edges from a binomial distribution given \(ne, p + \epsilon_1\) and \(ne, p - \epsilon_1\), the alternative samples.
  • Simulate \(n\) graphs with edges from a binomial distribution given \(ne, p + \epsilon_2\) and \(ne, p - \epsilon_2\), the null samples.
  • Compute the empirical distribution for \(\hat{p}\) for the alternative and null samples, respectively by repeating the above procedure \(ns\) times.
  • derive the power from the respective empirical distribution of \(\hat{\delta}\) as the fraction of test statistics more extreme than the critical test statistic.
  • compute the difference between the average simulated test statistic and the analytical test statistic.
maxdiff = 0.2
diff = seq(0,  maxdiff, length=21)
ns = round(10^seq(1, log10(1225), length=10))
ndat = length(ns)*length(diff)
empty_ar = array(NaN, dim=c(ndat))
dat = data.frame(ns = empty_ar, diff=empty_ar, pow=empty_ar, tdiff=empty_ar)
counter = 1
for (j in 1:length(ns)) {
  n = ns[j]
  for (i in 1:length(diff)) {
    in.d = maxdiff + diff[i]/2
    out.d = maxdiff - diff[i]/2
    # under the model, assume the p_in is the mean within group, and p_out is the mean outside of group
    # compute the standard deviation according to the model
    diffs = c(in.d, out.d)
    result = t.power.delta(diffs, ne=n)
    dat[counter,] = c(ns=n, diff=diff[i], pow=result$power, tdiff=result$diff)
    counter = counter + 1
  }
}

First, we look at power as a function of the number of edges in our simulation, as we vary the difference between the \(\delta\) for each population of graphs:

dat$ns = factor(dat$ns)
dat$diff = factor(dat$diff)
thresh = data.frame(diff=diff, sig=.05)
thresh$diff = factor(thresh$diff)
ggplot(dat,  aes(x = diff, y = pow, group=ns, color=ns)) +
  geom_line() +
  ggtitle(TeX('Power of Unequal-Variance T-Test with 100 simulations, 100 $\\frac{graphs}{simulation}$')) +
  xlab(TeX('Difference in $\\left|\\delta_1 - \\delta_2\\right|$')) +
  ylab('Power of Test') +
  scale_color_discrete(name="number of edges") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

And we also look at how the analytical test-statistic computed from our trials compares to the empirical test-statistics estimated from our simulation procedure:

ggplot(dat, aes(x = diff, y = tdiff, group=ns, color=ns)) +
  geom_line() +
  ggtitle(TeX('Analytical T-Test compared to Empirical T-Test')) +
  xlab(TeX('Difference in $\\left|\\delta_1 - \\delta_2\\right|$')) +
  ylab(TeX('$\\frac{\\left|\\bar{T}_{empirical} - T_{analytical}\\right|}{T_{analytical}}')) +
  scale_color_discrete(name="number of edges") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))