11.3. Foundation

To understand network models, it is crucial to understand the concept of a network as a random quantity, taking a probability distribution. We have a realization $A$, and we think that this realization is random in some way. Stated another way, we think that there exists a network-valued random variable $\mathbf{A}$ that governs the realizations we get to see. Since $\mathbf{A}$ is a random variable, we can describe it using a probability distribution. The distribution of the random network $\mathbf{A}$ is the function $\mathbb{P}$ which assigns probabilities to every possible configuration that $\mathbf{A}$ could take. Notationally, we write $\mathbf{A} \sim \mathbb{P}$, which is read in words as "the random network $\mathbf{A}$ is distributed according to $\mathbb{P}$."

In the preceding description, we made a fairly substantial claim: $\mathbb{P}$ assigns probabilities to every possible configuration that realizations of $\mathbf{A}$, denoted by $A$, could take. How many possibilities are there for a network with $n$ nodes? Let's limit ourselves to simple networks: that is, $\mathbf{A}$ takes values $A$ that are unweighted ($A$ is binary), undirected ($A$ is symmetric), and loopless ($A$ is hollow). We will call the set of all such possible realizations $\mathcal{A}_n$. In words, $\mathcal{A}_n$ is the set of all possible adjacency matrices $A$ that correspond to simple networks with $n$ nodes. Stated another way: every $A$ that is found in $\mathcal{A}_n$ is a binary $n \times n$ matrix ($A \in \{0, 1\}^{n \times n}$), $A$ is symmetric ($A = A^\top$), and $A$ is hollow ($\operatorname{diag}(A) = 0$, or $a_{ii} = 0$ for all $i = 1, \dots, n$). We describe $\mathcal{A}_n$ as:

$$\mathcal{A}_n = \left\{A : A \text{ is an } n \times n \text{ matrix with 0s and 1s},\; A \text{ is symmetric},\; A \text{ is hollow}\right\}$$

To summarize the statement that $\mathbb{P}$ assigns probabilities to every possible configuration that realizations of $\mathbf{A}$ can take, we write that $\mathbb{P} : \mathcal{A}_n \rightarrow [0, 1]$. This means that for any $A \in \mathcal{A}_n$ which is a possible realization of a random network $\mathbf{A}$, $\mathbb{P}(\mathbf{A} = A)$ is a probability (it takes a value between 0 and 1). If it is completely unambiguous what the random variable $\mathbf{A}$ refers to, we might abbreviate $\mathbb{P}(\mathbf{A} = A)$ with $\mathbb{P}(A)$. This statement can alternatively be read as: the probability that the random variable $\mathbf{A}$ takes the value $A$ is $\mathbb{P}(A)$. Finally, let's address the question we had in the previous paragraph. How many possible adjacency matrices are in $\mathcal{A}_n$?

Let's imagine what just one $A \in \mathcal{A}_n$ can look like. Note that each matrix $A$ has $n \times n = n^2$ possible entries in total, since $A$ is an $n \times n$ matrix. There are $n$ possible self-loops for a network, but since $\mathbf{A}$ is simple, it is loopless. This means that we can subtract the $n$ diagonal entries from $n^2$, leaving us with $n^2 - n = n(n - 1)$ possible edges that might or might not exist. If we think in terms of a realization $A$, this means that we are ignoring the diagonal entries $a_{ii}$, for all $i \in [n]$. Remember that a simple network is also undirected. In terms of the realization $A$, this means that for every pair $i$ and $j$, $a_{ij} = a_{ji}$. If we were to learn an entry in the upper triangle of $A$, that is, an $a_{ij}$ where $j > i$, we would also have learned what $a_{ji}$ is. This symmetry of $A$ means that of the $n(n - 1)$ entries that are not on the diagonal of $A$, we would, in fact, "double count" the possible number of unique values that $A$ could have. This means that $A$ has a total of $\frac{1}{2}n(n - 1)$ entries which are free, which is equal to the expression $\binom{n}{2}$. Finally, note that each free entry of $A$ can take one of two possible values: 0 or 1. Stated formally, for every possible edge which is randomly determined, we have two possible values that edge could take. Let's build some intuition with small examples (a quick computational check follows the list below):

  1. If $A$ is $2 \times 2$, there is $\binom{2}{2} = 1$ unique entry of $A$, which takes one of 2 values. There are 2 possible ways that $A$ could look:

$$\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \quad\text{or}\quad \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}$$
  2. If $A$ is $3 \times 3$, there are $\binom{3}{2} = \frac{3 \times 2}{2} = 3$ unique entries of $A$, each of which takes one of 2 values. There are $2^3 = 8$ possible ways that $A$ could look:

$$\begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix} \text{ or } \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \text{ or } \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{bmatrix} \text{ or } \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix} \text{ or } \begin{bmatrix} 0 & 0 & 1 \\ 0 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix} \text{ or } \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix} \text{ or } \begin{bmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix} \text{ or } \begin{bmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$$
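To double-check this counting argument computationally, here is a minimal sketch (our own illustration, not from the original text): it enumerates every simple network on $n = 3$ nodes by filling in the $\binom{3}{2} = 3$ free entries above the diagonal, and confirms that there are $2^{\binom{3}{2}} = 8$ of them.

```python
# Sanity check of the counting argument: enumerate all simple networks on
# n = 3 nodes by assigning 0 or 1 to each of the binom(3, 2) free entries.
import numpy as np
from itertools import product
from math import comb

n = 3
free_entries = [(i, j) for i in range(n) for j in range(i + 1, n)]  # upper triangle

networks = []
for bits in product([0, 1], repeat=len(free_entries)):
    A = np.zeros((n, n), dtype=int)            # hollow: the diagonal stays 0
    for (i, j), bit in zip(free_entries, bits):
        A[i, j] = A[j, i] = bit                # symmetric: undirected edges
    networks.append(A)

print(len(networks), 2 ** comb(n, 2))          # both print 8
```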

How do we generalize this to an arbitrary choice of $n$? The answer is to use combinatorics. Basically, the approach is to look at each entry of $A$ which can take different values, and multiply the total number of possibilities by 2 for every element which can take different values. Stated another way, if there are 2 choices for each one of $x$ possible items, we have $2^x$ possible ways in which we could select those $x$ items. But we already know how many free elements there are in $A$, so we are ready to come up with an expression for the count. In total, there are $2^{\binom{n}{2}}$ unique adjacency matrices in $\mathcal{A}_n$. Stated another way, the cardinality of $\mathcal{A}_n$, described by the expression $|\mathcal{A}_n|$, is $2^{\binom{n}{2}}$. The cardinality here just means the number of elements that the set $\mathcal{A}_n$ contains. When $n$ is just 15, note that $|\mathcal{A}_{15}| = 2^{\binom{15}{2}} = 2^{105}$, which, when expressed as a power of 10, is more than $10^{30}$ possible networks that can be realized with just 15 nodes! As $n$ increases, how many unique possible networks are there? In the figure below, look at the value of $|\mathcal{A}_n| = 2^{\binom{n}{2}}$ as a function of $n$. As we can see, as $n$ gets big, $|\mathcal{A}_n|$ grows really, really fast!

import seaborn as sns
import numpy as np
from math import comb


# Numbers of nodes to evaluate, and log10 of |A_n| = 2^C(n, 2) for each n
n = np.arange(2, 51)
logAn = np.array([comb(ni, 2) for ni in n]) * np.log10(2)

# Plot the (log-scale) number of possible simple networks against n
ax = sns.lineplot(x=n, y=logAn)
ax.set_xlabel("Number of Nodes")
ax.set_ylabel("Number of Possible Graphs $|A_n|$ (log scale)")
ax.set_yticks([50, 100, 150, 200, 250, 300, 350])
ax.set_yticklabels(["$10^{{{pow:d}}}$".format(pow=d) for d in [50, 100, 150, 200, 250, 300, 350]])
ax;
Figure: the number of possible simple networks $|\mathcal{A}_n|$ as a function of the number of nodes $n$, on a log scale.

So, now we know that we have probability distributions on networks, and a set $\mathcal{A}_n$ which defines all of the adjacency matrices that every probability distribution must assign a probability to. Now, just what is a network model? A network model is a set $\mathcal{P}$ of probability distributions on $\mathcal{A}_n$. Stated another way, we can describe $\mathcal{P}$ to be:

$$\mathcal{P} \subseteq \left\{\mathbb{P} : \mathbb{P} \text{ is a probability distribution on } \mathcal{A}_n\right\}$$

In general, we will simplify $\mathcal{P}$ through something called parametrization. We define $\Theta$ to be the set of all possible parameters of the random network model, and $\theta \in \Theta$ is a particular parameter choice that governs a specific network-valued random variable $\mathbf{A}$. In this case, we will write $\mathcal{P}$ as the set:

$$\mathcal{P}(\Theta) = \left\{\mathbb{P}_\theta : \theta \in \Theta\right\}$$

If $\mathbf{A}$ is a random network that follows a network model, we will write that $\mathbf{A} \sim \mathbb{P}_\theta$, for some arbitrary set of parameters $\theta$.
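As a toy illustration of this notation (the two-node family below is our own made-up example, not a model used later in the chapter), consider $n = 2$: the sample space $\mathcal{A}_2$ contains only two networks, so a one-parameter family $\mathcal{P}(\Theta)$ with $\Theta = [0, 1]$ can be written as a map from $\theta$ to a distribution $\mathbb{P}_\theta$ on $\mathcal{A}_2$.

```python
# A toy parametrized network model on A_2 (illustrative only): theta = p is the
# probability that the single possible edge between the two nodes exists.
import numpy as np

def make_distribution(p):
    """Return P_theta as a function on A_2, for the parameter theta = p."""
    def prob(A):
        A = np.asarray(A)
        return p if A[0, 1] == 1 else 1 - p   # the two probabilities sum to 1
    return prob

P_theta = make_distribution(0.3)                              # one member of P(Theta)
print(P_theta([[0, 1], [1, 0]]), P_theta([[0, 0], [0, 0]]))   # 0.3 0.7
```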

If you are used to traditional univariate or multivariate statistical modelling, an extremely natural choice when you have a discrete sample space (like $\mathcal{A}_n$, which is discrete because we can count it) would be to use a categorical model. In the categorical model, we would have a single parameter for each possible configuration of an $n$-node network; that is, $|\theta| = |\mathcal{A}_n| = 2^{\binom{n}{2}}$. What is wrong with this model? The limitations are two-fold:

  1. As we explained previously, when $n$ is just 15, we would need over $10^{30}$ bits of storage just to define $\theta$. This amounts to more than $10^8$ zettabytes, which exceeds the storage capacity of the entire world (a quick back-of-the-envelope check of this arithmetic is sketched after this list).

  2. With a single network observed (or really, any number of networks we could collect in the real world), we would never be able to get a reasonable estimate of $2^{\binom{n}{2}}$ parameters for any reasonably non-trivial number of nodes $n$. For the case of one observed network $A$, an estimate of $\theta$ (referred to as $\hat{\theta}$) would simply be for $\hat{\theta}$ to have a 1 in the entry corresponding to our observed network, and a 0 everywhere else. Inferentially, this would imply that the network-valued random variable $\mathbf{A}$ which governs realizations $A$ is deterministic, even if this is not the case. Even if we collected potentially many observed networks, we would still (with very high probability) just get $\hat{\theta}$ as a series of point masses on the observed networks we see, and 0s everywhere else. This would mean our parameter estimates $\hat{\theta}$ would not generalize to new observations at all, with high probability.
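As a rough check of the storage claim in limitation 1 (a back-of-the-envelope sketch; the unit conversions are ours, not from the text), we can compute how large $\theta$ would be for $n = 15$ nodes at a single bit per entry:

```python
# Back-of-the-envelope check of limitation 1: size of theta for a categorical
# model with n = 15 nodes, assuming (at minimum) one bit per entry of theta.
from math import comb, log10

n = 15
log10_configs = comb(n, 2) * log10(2)              # log10 of |A_n| = 2^binom(15, 2)
print(f"|A_15| is about 10^{log10_configs:.1f}")   # about 10^31.6 configurations

bits = 2 ** comb(n, 2)                             # one bit per configuration
zettabytes = bits / 8 / 1e21                       # 1 zettabyte = 10^21 bytes
print(f"roughly 10^{log10(zettabytes):.1f} zettabytes")   # roughly 10^9.7
```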

So, what are some more reasonable descriptions of $\mathcal{P}$? We explore some choices below. In particular, we will be most interested in independent-edge networks. These are the families of networks in which the generative procedure which governs the random networks assumes that the edges of the network are generated independently. Statistical independence is a property which greatly simplifies many of the modelling assumptions that are crucial for proper estimation and rigorous statistical inference, and we will learn more about it in later chapters.

11.3.1. Equivalence Classes

In all of the models below, we will explore the concept of the probability equivalence class, or an equivalence class for short. The probability is a function which, in general, describes how effectively a particular observation $A$ can be described by a random variable $\mathbf{A}$ with parameters $\theta$, written $\mathbf{A} \sim \mathbb{P}_\theta$. The probability will be used to describe the probability $\mathbb{P}_\theta(A)$ of observing the realization $A$ if the underlying random variable $\mathbf{A}$ has parameters $\theta$. Why does this matter when it comes to equivalence classes? An equivalence class is a subset of the sample space $E \subseteq \mathcal{A}_n$ which has the following properties (a small worked example follows the list). Holding the parameters $\theta$ fixed:

  1. If $A$ and $A'$ are members of the same equivalence class $E$ (written $A, A' \in E$), then $\mathbb{P}_\theta(A) = \mathbb{P}_\theta(A')$.

  2. If $A$ and $A'$ are members of different equivalence classes; that is, $A \in E$ and $A' \in E'$ where $E, E'$ are different equivalence classes, then $\mathbb{P}_\theta(A) \neq \mathbb{P}_\theta(A')$.

  3. Using points 1 and 2, we can establish that if $E$ and $E'$ are two different equivalence classes, then $E \cap E' = \varnothing$. That is, the equivalence classes are mutually disjoint.

  4. We can use the preceding properties to deduce that given the sample space $\mathcal{A}_n$ and a probability function $\mathbb{P}_\theta$, we can define a partition of the sample space into equivalence classes $E_i$, where $i \in \mathcal{I}$ for some arbitrary index set $\mathcal{I}$. A partition of $\mathcal{A}_n$ is a sequence of sets which are mutually disjoint, and whose union is the whole space. That is, $\bigcup_{i \in \mathcal{I}} E_i = \mathcal{A}_n$.
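To make these properties concrete, here is a small worked example (the probability function it uses, with every possible edge present independently with probability 0.3, is our own placeholder and a special case of the independent-edge models formalized in the next section): it groups the 8 networks of $\mathcal{A}_3$ by their probability, producing four disjoint equivalence classes, one per possible number of edges, whose sizes sum to $|\mathcal{A}_3| = 8$.

```python
# Group the 8 networks of A_3 into equivalence classes under a placeholder
# probability function: each of the 3 possible edges is present independently
# with probability p = 0.3 (an illustrative assumption).
import numpy as np
from itertools import product
from collections import defaultdict

p = 0.3
free_entries = [(0, 1), (0, 2), (1, 2)]        # the binom(3, 2) free entries

def prob(A):
    """P_theta(A) under the placeholder edge-independent probability above."""
    return np.prod([p if A[i, j] else 1 - p for i, j in free_entries])

classes = defaultdict(list)                    # probability value -> networks
for bits in product([0, 1], repeat=3):
    A = np.zeros((3, 3), dtype=int)
    for (i, j), bit in zip(free_entries, bits):
        A[i, j] = A[j, i] = bit
    classes[round(float(prob(A)), 10)].append(A)

# Four disjoint classes (0, 1, 2, or 3 edges); their sizes sum to |A_3| = 8.
print({value: len(members) for value, members in classes.items()})
```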

We will see more below about how the equivalence classes come into play with network models, and in a later section, we will see their relevance to the estimation of the parameters $\theta$.

11.3.2. Independent-Edge Random Networks

The models below are all special families of something called independent-edge random networks. An independent-edge random network is a network-valued random variable in which the collection of edges are all independent. In words, this means that for every adjacency $\mathbf{a}_{ij}$ of the network-valued random variable $\mathbf{A}$, $\mathbf{a}_{ij}$ is independent of $\mathbf{a}_{i'j'}$ whenever $(i, j) \neq (i', j')$. When the networks are simple, the easiest thing to do is to assume that each edge $(i, j)$ is connected with some probability $p_{ij}$ (which might be different for each edge). We use the $ij$ subscript to denote that this probability is not necessarily the same for each edge. This simple model can be described as $\mathbf{a}_{ij}$ having the distribution $\mathrm{Bern}(p_{ij})$, for every $j > i$, independently of every other edge in $\mathbf{A}$. We only look at the entries $j > i$, since our networks are simple. This means that knowing a realization of $\mathbf{a}_{ij}$ also gives us the realization of $\mathbf{a}_{ji}$ (and thus $\mathbf{a}_{ji}$ is a deterministic function of $\mathbf{a}_{ij}$). Further, we know that the random network is loopless, which means that every $\mathbf{a}_{ii} = 0$. We will call the matrix $P = (p_{ij})$ the probability matrix of the network-valued random variable $\mathbf{A}$. In general, we will see a common theme for the probabilities of a realization $A$ of a network-valued random variable $\mathbf{A}$, which is that they greatly simplify our computations. Remember that if $\mathbf{x}$ and $\mathbf{y}$ are binary random variables which are independent, then $\Pr(\mathbf{x} = x, \mathbf{y} = y) = \Pr(\mathbf{x} = x)\Pr(\mathbf{y} = y)$. Using this fact:

$$\begin{aligned} \mathbb{P}(\mathbf{A} = A) &= \mathbb{P}(\mathbf{a}_{11} = a_{11}, \mathbf{a}_{12} = a_{12}, \dots, \mathbf{a}_{nn} = a_{nn}) \\ &= \mathbb{P}(\mathbf{a}_{ij} = a_{ij} \text{ for all } j > i) \\ &= \prod_{j > i} \mathbb{P}(\mathbf{a}_{ij} = a_{ij}) \qquad \text{Independence Assumption} \end{aligned}$$

Next, we will use the fact that if a random variable $\mathbf{a}_{ij}$ has the Bernoulli distribution with probability $p_{ij}$, then $\mathbb{P}(\mathbf{a}_{ij} = a_{ij}) = p_{ij}^{a_{ij}}(1 - p_{ij})^{1 - a_{ij}}$:

$$\mathbb{P}_\theta(A) = \prod_{j > i} p_{ij}^{a_{ij}} (1 - p_{ij})^{1 - a_{ij}}$$
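As a minimal computational sketch of this formula (the probability matrix $P$ and realization $A$ below are made-up values for illustration), we can evaluate $\mathbb{P}_\theta(A)$ by multiplying the Bernoulli probabilities of the upper-triangular entries:

```python
# Evaluate P_theta(A) = prod_{j > i} p_ij^{a_ij} (1 - p_ij)^{1 - a_ij} for a
# hypothetical probability matrix P and realization A.
import numpy as np

def prob_realization(P, A):
    """Probability of the realization A under the probability matrix P."""
    P, A = np.asarray(P, dtype=float), np.asarray(A)
    iu = np.triu_indices_from(A, k=1)            # the entries with j > i
    p, a = P[iu], A[iu]
    return np.prod(p ** a * (1 - p) ** (1 - a))

P = np.array([[0.0, 0.5, 0.2],
              [0.5, 0.0, 0.8],
              [0.2, 0.8, 0.0]])
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
print(prob_realization(P, A))   # 0.5 * (1 - 0.2) * 0.8 = 0.32
```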

Now that we've specified a probability and a very generalizable model, we've learned the full story behind network models and are ready to skip to estimating parameters, right? Wrong! Unfortunately, if we tried to estimate each $p_{ij}$ individually, we would obtain that $\hat{p}_{ij} = a_{ij}$ if we only have one realization $A$ (as illustrated below). Even if we had many realizations of $\mathbf{A}$, this still would not be very interesting, since we have a lot of $p_{ij}$s to estimate, and we've ignored any sort of structural model that might give us deeper insight into $\mathbf{A}$. In the sections below, we will learn successively less restrictive (and hence, more expressive) assumptions about the $p_{ij}$s, which will allow us to convey fairly complex random networks, while still leaving us plenty of interesting things to learn about later on.
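Here is a quick illustration of that degenerate estimate (a sketch with a made-up observed network): fitting each $p_{ij}$ from a single realization just copies $A$ into $\hat{P}$, and the fitted model then assigns the observed network probability one.

```python
# With a single observed network, the per-edge estimate p_hat_ij = a_ij simply
# reproduces A, so the fitted model treats the observed network as certain.
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])

P_hat = A.astype(float)                     # p_hat_ij = a_ij from one realization
iu = np.triu_indices_from(A, k=1)
prob_of_A = np.prod(P_hat[iu] ** A[iu] * (1 - P_hat[iu]) ** (1 - A[iu]))
print(prob_of_A)                            # 1.0 -- a deterministic, non-generalizing fit
```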