9.3. Network Sparsity#

In Section 3.2.3.1, you learned about a very important descriptive property of networks, called the network density. An understanding of the network density gives us the ability to describe another extremely ubiquitous property of networks: the network sparsity. First, let’s introduce a few concepts about sparsity, and then we’ll tie in how this comes into play with network data. As a quick forenote, this section is going to assume that you have a working knowledge of the concept of a sequence, that you have taken an introductory course in statistics enough to familiarize yourself with the concept of an expected value, and that you are familiar with some basic concepts from Calculus (L’Hopital’s rule and derivatives).

9.3.1. What is sparsity, and why does it matter?#

The concept of sparsity has been studied for several decades in matrix theory. Briefly, let’s suppose that we have a matrix \(X\), which has \(n\) rows and \(m\) columns. The matrix looks like this:

\[\begin{align*} X &= \begin{bmatrix} x_{11} & ... & x_{1m} \\ \vdots & \ddots & \vdots \\ x_{n1} & ... & x_{nm}. \end{bmatrix} \end{align*}\]

There are a number of suggestions for what it means for a matrix to be sparse. A matrix can be considered to be sparse if most of the elements are zero. Traditionally, there is no agreement on just how many of the possible number of entries (which is \(m \cdot n\)) have to be zero for the matrix to be considered sparse, but a common cutoff is if the number of non-zero elements is at most the number of rows or columns. For a more practical definition, one of the seminal works on sparse matrices [1] gives us a good standard. Wilkinson writes, “The matrix may be sparse, either with the non-zero elements concentrated on a narrow band centred on the diagonal or alternatively they may be distributed in a less systematic manner. We shall refer to a matrix as dense if the percentage of zero elements or its distribution is such as to make it uneconomic to take advantage of their presence.” For all intents and purposes, Wilkinson’s definition provides us with a foundational understanding of what it means to be sparse: a matrix is sparse if we can benefit from acknowledging its sparsity with the methods that we choose to store, process, and analyze it. Let’s take a look at what this means in practice.

9.3.1.1. The storage implications#

One of the more approachable ways that you might come into contact with sparse matrices is in terms of data storage. As you might be aware, numbers are stored on a computer as a sequence of zeros and ones. When storing data in our matrix, notice that we have \(m \cdot n\) elements. For the sake of simplicity, let’s assume we’re dealing with a matrix where each element is a double precision float (you might have seen this before in numpy as float64, or a floating point decimal with \(64\) bits per element). This means that for every element of the matrix, we use \(64\) zeros or ones, meaning that we will need around \(64 \cdot (m \cdot n)\) zeros or ones to represent the entire matrix.

Let’s say that of these \(n\) rows, we know ahead of time that a lot of the rows are sparse. By “row sparse”, what we mean is that \(x_{ij} = 0\) for all of these sparse rows \(i\). Let’s assume that of the \(n\) total rows, only \(n'\) are not sparse. We could, for instance, store the non-sparse rows in a little set \(\mathcal X\) which has \(n'\) elements telling us which rows are not sparse. For these non-sparse rows, we store all \(m\) pieces of column-wise information, but for the sparse rows, we just ignore them entirely. To store this entire matrix, we will need \(64 \cdot (n' \cdot m)\) (64 bits for each entry of a non-sparse row) \(+ 64 \cdot n'\) (64 bits to store each element of \(\mathcal X\)) \(+ 64\) (to store the total number of rows that the matrix has), for a total of \(64 \cdot (n' \cdot m + n' + 1)\) bits.

If the rows can be sparse, the columns could be too; let’s assume that we have a matrix where \(m'\) of the columns are sparse. Following a similar approach to the above, if we had a list \(\mathcal Y\) with \(m'\) elements telling us which columns were not sparse, we could just store the \(m'\) non-sparse columns (each of which has \(n\) rows), and then the list of the \(m'\) non-zero elements. Like above, we can store this information with \(64 \cdot (n \cdot m' + m' + 1)\) bits.

Finally, we might have both row and column sparsity, in which case it might make sense to just paired indices \((i, j)\) telling us each of the columns that are non-zero. Let’s assume this time that the total number of elements of \(X\) are non-zero is represented by some integer \(k\). To store data with row and column sparsity, We could store \(X\) like this:

Row index	Column Index	Value
\(r_1\)	\(c_1\)	\(x_{r_1, c_1}\)
…	…	…
\(r_k\)	\(c_k\)	\(x_{r_k, c_k}\)

and then we could store a single tuple \((n, m)\) that indicates the matrix’s total number of dimensions. In this case, we would have \(64 \cdot (3K + 2)\) (\(3K\) because there are \(3\) numbers in each row of the table we arranged above) bits of storage needed. Depending on just how sparse \(X\) is, this might allow us to store \(X\) a lot smaller than its original size!

For one big application of this implication that you are already familiar with, we can think about popular image compression algorithms. You have probably heard about jpeg images [2] (those files with the .jpg or jpeg file extension that you might have on your computer). Image compression uses complicated patterns arising in the image to discern the “general idea” of the image, and then “sparsifies” the image by discarding the extraneous information. Usually, with this technique, you can obtain an image that is nearly identical to the original, but might be orders of magnitude smaller in storage size. This goes a bit beyond simple row and column sparsity, but the end-result is the same: you can summarize a very close, but very sparse, approximation of an image using a file that is way smaller than storing the full image data for each pixel of the image individually.

9.3.1.2. The computational implications#

Let’s assume that we have a small task, where for each row in the matrix \(X\), we want to compute the row-wise sum. Stated another way, for a given row \(i\), the quantity that you want to compute is \(\sum_{j = 1}^m x_{ij}\). If you ignore sparsity all-together, you can do this operation pretty easily: there are \(n\) rows, and \(m\) terms that you need to add together for each row, which means that you will have \(n \cdot m\) total operations to perform (for each of \(n\) rows, perform an addition involving \(m\) terms).

Now’ let’s pretend that we have the same matrix \(X\) that we thought about above. If \(X\) is row-sparse, we can store the non-sparse rows \(i\) in a list \(\mathcal X\) with \(n'\) elements. This time, the sum will be:

\[\begin{align*} \sum_{j = 1}^m x_{ij} &= \begin{cases} \sum_{j = 1}^m x_{ij} & i\text{ is one of the non-sparse rows in }\mathcal X \\ 0 & i\text{ is a sparse row} \end{cases} \end{align*}\]

To do this, we have \(n'\) elements in \(\mathcal X\), which means that we need to use \(n' \cdot m\) operations (sum \(m\) elements for each of the \(n'\) non-sparse rows), and then just output \(0\) for the sparse rows.

Likewise, if \(X\) is column-sparse, we store the non-sparse columns \(j\) in a list \(\mathcal Y\) with \(m'\) elements. This time, the sum is:

\[\begin{align*} \sum_{j : j\text{ is a non-sparse column in }\mathcal Y} x_{ij} \end{align*}\]

So now, we have \(n\) rows to compute this quantity for, but we only need to sum \(m'\) elements per row, for a total of \(n \cdot m'\) operations.

If we have both row and column sparsity, we can combine these two ideas, like this:

\[\begin{align*} \sum_{j = 1}^m x_{ij} &= \begin{cases} \sum_{j : j \in \mathcal Y} x_{ij} & i \in \mathcal X \\ 0 & i\text{ is a sparse row} \end{cases} \end{align*}\]

Now, for each of the \(n'\) rows, we are adding \(m'\) terms, which requires \(n' \cdot m'\) operations.

The computational implications of sparsity are quite substantial. One feature of sparse matrices is that they tend to be really, really big. When we say really big, we mean that \(n\) is extremely large, \(m\) is extremely large, or both are really large. This tends to arise a lot in many fields, so we’ll take a look at two of them:

Sparsity in natural language processing

Imagine a matrix which stores information for a collection of documents. In this case, each row \(i\) indexes a single document in the collection, and each row \(j\) represents an individual word that could (or could not) appear in a given document. The entries of this matrix \(x_{ij}\) correspond to the count of a given word \(j\) in document \(i\). If the collection of documents is large enough, there may be many words that do not appear at all in many of the different documents. This arises in the word2vec algorithm, which we covered succinctly in Section 9.1.

Sparsity in genomics

Imagine that for a large group of \(n\) people (the rows of the matrix), you collect data about the person’s genome. The genome is, informally, the collection genetic material unique to each individual. If you remember back from biology, the genome consists of sequences of DNA. For our purposes, we won’t go into too much depth, but you can think about DNA consisting of really, really long sequences of molecules (these moledules are called nucleotides, adenine, thymine, guanine, and cytosine, represented with the letters A, T, G, C) that gives your body all of the instructions it needs to provide all of the functionality to develop and maintain an organism. In a human, for instance, the human genome will typically be a sequence of about three billion of these four nucleotides. As it turns out, we aren’t all identical, and the instructions aren’t either. One thing that genomists study is a phenomena known as a single nucleotide polymorphism, or SNP (pronounced snip). Basically, the idea is most people will have a specific nucleotide at any given location in the genome, but some people will have a different one (for instance, while most people will might have adenosine at a given SNP, some people might have guanine). A common investigation in genomics would be to collect hundreds or thousands of different SNPs (there are over 300 million SNPs identified so far in the human genome, so there are many combinations to study), and encode a matrix \(X\) where the rows correspond to each person being studied, and the columns consist of alternative bases (from the most common one) that people could have at a SNP in the genome. Let’s think about just one of these SNPs, that we’ll call SNP 1, and assume that this SNP is usually an A base, but can also be a G or T base. The first two columns of the data matrix would look like this:

Row number	SNP 1, alternative base G	SNP 1, alternative base T
Person 1	0	0
Person 2	0	1
…	…	…
Person \(n\)	1	0

We repeat this process for all of the SNPs under study, and end up with a matrix that is going to be extremely large depending on how many SNPs we have. However, somewhat by construction, this matrix is also extremely sparse: each column only represents infreuqent alternative bases that a person could have in a SNP, which means that most people are just going to have a zero for any given SNP.

In the situation where datasets have an enormous number of “features” (the columns of the matrices that we described above), one of the first things that many researchers will try to do before learning from the data explicitly is to try and make sense of the data in a simpler way than just the data as-given. This problem is known as dimensionality reduction, and in its simplest form, arises in an algorithm known as Principal Component Analysis, which is conceptually very similar to the spectral embedding we explained in Section 5 and we touched on briefly in Section 1.3.5.1. Basically, the idea is this: many of the “rows” of the sparse matrix examples we saw above might contain useful information that is shared across multiple other rows. In the document example, for instance, we could imagine that some documents (say, on a similar set of topics) might have similar word uses for a subset of words. Likewise, we could imagine that in this genomics dataset, people who are “oddballs” (in relation to the population’s “average genome”, they might be totally normal otherwise) might tend to have similarly odd combinations of SNPs.

If we were to use PCA on these datasets, just like in spectral embeddings, we might be able to “pick up” these informative similarities in the rows, and end up with a datset that succinctly summarizes a large number of features with just a small number of embedded features. However, the key is that PCA also use the svd, just like spectral embedding. If we run a full PCA on these datasets and the number of rows or the number of columns (or both) are really large, we might be left waiting a really, really long time. When the dataset is sparse, we can take advantage of this sparsity, and run a sparse version of PCA instead of using a full svd like the naive implementation of PCA does. For more details on this, check out [3].

9.3.1.3. The algorithmic implications#

Finally, when we have sparse data, computational and storage considerations aside, it can often directly inform which techniques we should (or should not) use. For a common situation that you are already familiar with, let’s rotate back to a discussion we had back in Section 6.2, and then repeated in Section 7.2 and Section 8.2 regarding contingency tables. Let’s relate this to “non-network” data a little more directly.

Imagine that you are the lead data analyst at a pharmaceutical company, and you are testing whether a new vaccine reduces the rate at which people get flu. You randomly select say, \(100,000\) people to be in your study, and of these people, you randomly provide \(2,000\) of these people your new flu vaccine. You observe that over the course of the year, \(100\) people got the flu, and in the group that did not obtain your vaccine, \(1000\) people got your flu. A good way to organize your data is with the same contingency table that we saw previously:

	Number of people who got the flu	Number of people who did not get the flu
Received vaccine	100	1900
Did not receive vaccine	10000	88000

If we wanted to decide how effective the vaccine was for preventing the flu in the general population, we have a few options. When we have contingency tables, the best option (for obtaining the right answer) is the same Fisher exact test that we discussed previously. This test is exact, in that when we run our analysis, the \(p\)-value that we end up with will faithfully represent the probability of observing the data that we saw (if there were no difference in flu rates between people who received our vaccine or did not receive our vaccine). In this case, it actually looks at all of the possible contingency tables that we could end up with where \(2000\) people obtain the vaccine, and \(98000\) people did not obtain the vaccine, and examines how extreme the table we saw (in terms of the number of people who got the flu) was if the probability of getting the flu is the same across the people who received and did not receive the vaccine (and there are a lot of them!). An extremely key (but enormous) caveat to this actually being the case in our hypothetical analysis is that we randomly selected people to our study, and randomly gave people our flu vaccine, which rarely holds in practice, but for now let’s just assume that such an experiment was conducted.

Unfortunately, Fisher’s exact test has a slight caveat: it can be extremely computationally intensive to compute, especially when the number of data observations that we have (in this case, \(200,000\)) is really big (it could be even bigger than \(200,000\)). Another statistic that people will use for these situations often is known as the chi-squared test. Basically, the idea is that, if you look at enough people, you actually don’t need to look at the exact distribution of these contingency tables, like you did for the Fisher’s exact test. When you sample enough people, patterns in these tables will emerge that you can pick up on (and exploit) in terms of what they would look like if there were no difference between the vaccinated and unvaccinated groups. Instead of comparing the table we saw to all possible tables, you can compute a summary function from the table and compare that summary function to compute the \(p\)-value in closed form (in that, you just execute a particular equation, and get your answer). For more details on the chi-squared test, and how it relates to contingency tables, check out [4].

There is a big reason that this “summary” function is reasonable: if we have enough data, and the outcome (getting the flu) isn’t too rare, the chi-squared test and the Fisher’s exact test will give you just about the same answer, but the chi-squared test will take a small fraction of the computation time. If you have to run many of these contingency tables (like we did for the case of the signal subnetwork in Section 8.2), this might turn into a bottleneck for your analysis (from a computation standpoint), and it might situationally make sense to use the chi-squared test (even though it isn’t quite exact).

However, situationally, this approximation can vary from not great to completely misleading. For instance, if our vaccine was for meningitis instead of the flu, which is a much more rare viral condition, we might have only seen \(1\) person who received the vaccine actually get the condition we were investigating, and maybe \(10\) or \(20\) people in our unvaccinated population actually get the condition. When the data is this sparse (in that, the outcome is quite rare) the chi-squared test is in all actuality a completely inappropriate thing to do. In this case, the “foot” that the chi-squared test stands on (the patterns that emerge in summary statistics of all possible contingency tables that we could have observed if there were no difference between the vaccinated and unvaccinated groups) does not apply, and the chi-squared test is unreasonable.

In this sense, the sparsity of the data can actually drive you away from strategies that might situationally be reasonable (such as the chi-squared test) and towards strategies that are more robust to sparsity (such as the Fisher’s exact test).

9.3.2. How does sparsity interplay with networks?#

So, now we have the big question: what does this have to do with networks? As with matrices, we have two interplaying (and in some cases, complementary) ways that we conceptualize sparsity in network data.

9.3.2.1. Sparsity as a property of the random network#

In matrices, we had the first way to conceptualize a sparse matrix being a matrix: a matrix where most of the entries are \(0\). With networks, however, this tends to work out a little bit differently. To ease into this section, we’ll first introduce the concept of a sequence of random variables, and expected values of functions of sequences of random variables.

9.3.2.1.1. Sequences of random variables#

The next building block to conceptualize sparsity of random networks is going to involve some information about sequences of random networks, which might be a lot to wrap your head around. Fortunately, we can conceptualize this a lot like a sequence of coin flips, like you are probably used to by this point.

Imagine that we have a coin, and its output can either be a \(1\) (the coin lands on heads) or a \(0\) (the coin lands on tails). The coin lands on heads with probability \(p\). We can use this coin to produce a sequence \(\vec{\mathbf x}^{(n)}\) which is the outcome of \(n\) random flips of this coin, where \(n\) (like in the network case) just indexes the number of coin flips that we made. These coin flips are all performed independently and identically (the outcome of one coin flip does not effect the outcome of other coin flips, and all of the coin flips have the same probability of landing on heads, \(p\)).

The element \(\mathbf x^{(n)}_i\) is a random variable which represents the outcome of the coin itself. So, a possible realization of \(\vec{\mathbf x}^{(5)}\) could be something like \(\vec{ x}^{(5)}\), where \(\vec{x}^{(5)} = \begin{bmatrix}0 & 0 & 1& 0 & 0\end{bmatrix}^\top\). This would correspond to flipping the coin \(5\) times, and getting \(2\) tails, followed by \(1\) heads, followed by \(2\) more tails. Let’s say that we are interested in the following: what is the sum of all of the elements of \(\vec{\mathbf x}^{(n)}\)? In other words, can we describe \(\sum_{i = 1}^n \mathbf x_i^{(n)}\)?

If you aren’t too familiar with statistics, basically the idea is this: since \(\sum_{i = 1}^n \mathbf x_i^{(n)}\) is a sum of random quantities (the outcomes of unrealized coin flips), we can’t quite describe \(\sum_{i = 1}^n \mathbf x_i^{(n)}\) exactly, in that, we can’t say for sure what value it is going to take. However, we can describe how it will be distributed, and perhaps more interesting for our case, we can describe what value we would expect to see.

In statistics, we define this operation, called the expected value, with the notation \(\mathbb E[\cdot]\). So, in this case where we want to describe the expected value of \(\sum_{i = 1}^n \mathbf x_i^{(n)}\), we would write:

\[\begin{align*} \mathbb E \left[\sum_{i = 1}^n \mathbf x_i^{(n)}\right]. \end{align*}\]

If you are not too familiar with statistics, some basic rules can help us to simplify this down a good deal. If you want more background as to exactly how this works, we would recommend checking out an introductory book in statistics, such as [5]. In this case, since \(\mathbf x_i^{(n)}\) is a finite random quantity (all realizations of it would just be \(1\) with probability \(p\) and \(0\) with probability \(1-p\), and both \(1\) and \(0\) are finite numbers) we can use something called the linearity of expectation, which basically just tells us that the expected value of a sum of random variables is the sum of the expected values of those random variables, so it simplifies like this:

\[\begin{align*} \mathbb E\left[\sum_{i = 1}^n \mathbf x_i^{(n)}\right] &= \sum_{i = 1}^n\mathbb E\left[ \mathbf x_i^{(n)}\right]. \end{align*}\]

This expression looks a bit easier to deal with: the only thing we are left to think about is the terms \(\mathbb E\left[ \mathbf x_i^{(n)}\right]\). First, we know that these terms are all identically distributed. One consequence of terms being identically distributed is that their expected values will all be the same, which means we only need to actually compute this quantity once for all of the elements of \(\vec{\mathbf x}^{(n)}\).

The final thing that we need to use is something which is commonly referred to as the Law of the Unconscious Statistician (LoTUS). Basically, what LoTUS tells us is that if \(\mathbf y\) is a discrete quantity (the possible numerical values it could take can be counted), then we can compute the expected value using a very simple relationship:

\[\begin{align*} \mathbb E\left[\mathbf y\right] &= \sum_{y_i \in \mathcal Y} y_i Pr.(\mathbf y = y_i). \end{align*}\]

Conceptually, what this formula tells us is that the expected value of \(\mathbf y\) is a weighted average of the possible values that it could be realized as, weighted by the probability that a particular value is realized. Let’s see what this looks like for our coin flips:

\[\begin{align*} \mathbb E\left[ \mathbf x_i^{(n)}\right] &= 1 \cdot Pr.( \mathbf x_i^{(n)} = 1) + 0 \cdot Pr.( \mathbf x_i^{(n)} = 0) \\ &= 1 \cdot p + 0 \cdot (1 - p) = p. \end{align*}\]

Here, the random variable \(\mathbf x_i^{(n)}\) could either be a \(1\) (the coin lands on heads) or a \(0\) (the coin lands on tails). We see the value \(1\) with probability \(p\), and the value \(0\) with probability \(1 - p\). When we take the weighted sum, this turns out to just be \(p\). Now, for the quantity that we were interested in:

\[\begin{align*} \mathbb E\left[\sum_{i = 1}^n \mathbf x_i^{(n)}\right] &= \sum_{i = 1}^n p. \end{align*}\]

And a sum of \(n\) \(p\)s is just \(n \cdot p\).

Now, let’s consider a similar question: what portion of the coin flips do we expect to be heads? To conceptualize this quantity, for a single \(\vec{\mathbf x}^{(n)}\), this quantity could be written:

\[\begin{align*} \mathbb E\left[\frac{1}{n} \sum_{i = 1}^n \mathbb x_i^{(n)}\right]. \end{align*}\]

As it turns out, if \(\mathbf y\) is a random variable, another useful result is that for any constant \(c\), that \(\mathbb E[c\mathbf y] = c\mathbb E[\mathbf y]\). Since \(\frac{1}{n}\) is just a constant, we end up with:

\[\begin{align*} \mathbb E\left[\frac{1}{n} \sum_{i = 1}^n \mathbb x_i^{(n)}\right] &= \frac{1}{n} \mathbb E\left[\sum_{i = 1}^n \mathbb x_i^{(n)}\right] \\ &= \frac{1}{n} n \cdot p = p. \end{align*}\]

Where in the last line, we used the result that we just calculated to simplify the expression down.

9.3.2.1.2. Sequences of random networks#

Now that we have that under our belt, let’s pivot back to networks.

Let’s assume that \(\mathbf A^{(n)}\) is the adjacency matrix of a random network which has \(n\) nodes, with entries \(\mathbf a_{ij}\) (which are zero or one). For each node in a simple network, remember that the total number of edges in the network is given by \(\sum_{i > j}a_{ij}\). Likewise, the total number of nodes in the random network with \(n\) nodes can be written the same way, except instead of actually summing realized edges, we are summing up random edges, with \(\sum_{i > j}\mathbf a_{ij}^{(n)}\). All that the superscript \((n)\) is doing here is telling us that this refers to the network with \(n\) nodes.

So, in this case, we have a sequence of random networks, \(\left\{\mathbf A^{(1)}, \mathbf A^{(2)}, ...\right\}\) for different numbers of nodes. For each of these networks, we can’t quite talk about the actual edge sum itself (directly) since they are random, like the above problem with discussing the actual coin outcomes. However, like the above, we can talk about the expected edge sum in the network, like this:

\[\begin{align*} \mathbb E\left[\sum_{i > j}\mathbf a_{ij}^{(n)}\right]. \end{align*}\]

Likewise, remember back to Section 3.2.3.1 that the total number of potential edges was given by \(\binom n 2\). We could also talk about the expected network density, which is:

\[\begin{align*} \mathbb E\left[\frac{\sum_{i > j}\mathbf a_{ij}^{(n)}}{\binom n 2}\right]. \end{align*}\]

Remembering that like above, we can “remove constants” from the expected value term, and we get that the expected network density is:

\[\begin{align*} \frac{\mathbb E\left[\sum_{i > j}\mathbf a_{ij}^{(n)}\right]}{\binom n 2}. \end{align*}\]

This should look extremely familiar to you by this point, as it is (virtually) the same quantity that we saw above. Just like the above, we could derive a similar relationship using the linearity of expectation, since the edge adjacencies are still finite (they are either \(0\) or \(1\)):

\[\begin{align*} \mathbb E\left[\sum_{i > j}\mathbf a_{ij}^{(n)}\right] &= \sum_{i > j}\mathbb E\left[\mathbf a_{ij}^{(n)}\right]. \end{align*}\]

Now that we have this under our belts, we are ready to define a sparse sequence of random networks. A sequence \(\{\mathbf A^{(1)}, A^{(2)}, ...\}\) of random networks is sparse if:

(9.1)#\[\lim_{n \rightarrow \infty}\frac{\mathbb E\left[\sum_{i > j}\mathbf a_{ij}^{(n)}\right]}{\binom n 2} = 0.\]

Conceptually, what this means is that as the network grows, the denominator becomes far larger than the numerator. This contrasts from what we saw above in the coinflip example, where the expected portion of coin flips that were heads was a constant value, \(p\).

Another way to conceptualize this would be to use L’Hopital’s rule, from calculus. In essence, what L’Hopital’s rule asserts is that for two functions \(f(x)\) and \(g(x)\):

\[\begin{align*} \lim_{x \rightarrow c} \frac{f(x)}{g(x)} = \lim_{x \rightarrow c} \frac{\frac{d}{dx}f(x)}{\frac{d}{dx}g(x)}. \end{align*}\]

Remember from calculus that a derivative \(\frac{d}{dx}f(x)\) can be conceptualized as the rate at which \(f(x)\) changes as \(x\) increases for a particular value of \(x\), this means that:

\[\begin{align*} \lim_{n \rightarrow \infty}\frac{\mathbb E\left[\sum_{i > j}\mathbf a_{ij}\right]}{\binom n 2} &= \lim_{n \rightarrow \infty}\frac{\frac{d}{dn}\mathbb E\left[\sum_{i > j}\mathbf a_{ij}\right]}{\frac{d}{dn}\binom n 2}, \end{align*}\]

so basically, the sequence of random networks are sparse if the expected number of edges is increasing at a much slower rate than the number of potential edges is increasing as \(n\) grows, because this ratio goes to zero.

Conceptually, we can equivalently write the sum \(\sum_{i > j}a_{ij}\) for a simple network as \(\frac{1}{2} \sum_{i, j = 1}^n a_{ij} = \frac{1}{2}\sum_{i = 1}^n \sum_{j = 1}^n a_{ij}\) (convince yourself of it by writing it down! Hint: it is because simple networks are symmetric and loopless). Remember, though, that the node degree from Section 3.2.2.2 was \(d_i = \sum_{j = 1}^n a_{ij}\), and similarly, the random node degree is \(\mathbf d_i = \sum_{j = 1}^n \mathbf a_{ij}\). This means that we could rewrite the expression in Equation (9.1) as:

\[\begin{align*} \frac{\mathbb E\left[\sum_{i > j} \mathbf a_{ij}^{(n)}\right]}{\binom n 2} = \frac{\frac{1}{2}\mathbb E\left[\sum_{i =1}^n \sum_{j = 1}^n \mathbf a_{ij}^{(n)}\right]}{\binom n 2} = \frac{\frac{1}{2}\mathbb E\left[\sum_{i =1}^n \mathbf d_i^{(n)}\right]}{\binom n 2}, \end{align*}\]

where \(\mathbf d_i^{(n)}\) is the random node degree for node \(i\) in the random network \(\mathbf A^{(n)}\) with \(n\) nodes. Let’s introduce a new random variable, \(\mathbf d^{(n)}\), and let \(\mathbf d^{(n)}\) be the average node degree in the network with \(n\) nodes. In words, we’ll let \(\mathbf d^{(n)}\) be:

\[\begin{align*} \mathbf d^{(n)} = \frac{1}{n}\sum_{i = 1}^n \mathbf d_i^{(n)}, \end{align*}\]

and therefore by multiplying both sides by \(n\):

\[\begin{align*} \sum_{i = 1}^n \mathbf d_i^{(n)} = n\mathbf d^{(n)}. \end{align*}\]

By expanding the term \(\binom n 2 = \frac{1}{2} n (n - 1)\), cancelling out the factor of \(\frac{1}{2}\), and applying this new result, we get:

\[\begin{align*} \frac{\mathbb E\left[\sum_{i =1}^n \mathbf d_i^{(n)}\right]}{n(n - 1)} &= \frac{\mathbb E\left[n\mathbf d^{(n)}\right]}{n(n - 1)} = \frac{n\mathbb E\left[\mathbf d^{(n)}\right]}{n(n - 1)}. \end{align*}\]

By cancelling out the \(n\) in the numerator and the denominator on the right-hand side of the expression:

\[\begin{align*} \frac{\mathbb E\left[\sum_{i =1}^n \mathbf d_i^{(n)}\right]}{n(n - 1)} &= \frac{\mathbb E\left[\mathbf d^{(n)}\right]}{n - 1}. \end{align*}\]

For the network to be sparse, therefore we need that:

\[\begin{align*} \lim_{n \rightarrow \infty}\frac{\mathbb E\left[\mathbf d^{(n)}\right]}{n - 1} = 0. \end{align*}\]

Again, using L’Hopital’s rule and that \(\frac{d}{dn}(n - 1) = 1\), we see that:

\[\begin{align*} \lim_{n \rightarrow \infty}\frac{\mathbb E\left[\mathbf d^{(n)}\right]}{n - 1} &= \lim_{n \rightarrow \infty}\frac{d}{dn}\mathbb E\left[d^{(n)}\right] = 0. \end{align*}\]

What this shows us is that, as \(n\) is increasing, the expected average node degree of a sparse sequence of random networks, \(\mathbb E\left[\mathbf d^{(n)}\right]\), grows at a rate (with respect to \(n\)) which is decreasing (to \(0\)) as \(n\) grows.

9.3.2.1.3. What does this notion of sparsity have to do with practical applications?#

In the strictest sense, this notion of “sparsity” as, in effect, a property of the average expected node degree in a sequence of random networks seems a bit overly theoretical to be practical. In practice, you observe a network; you don’t obtain an infinite sequence of random networks \(\mathbf A^{(n)}\) with an arbitrarily large number of nodes \(n\). However, this concept has a number of theoretical applications, in particular since we can easily conceptualize real networks having attributes which are like sparse networks. Let’s see how.

Consider, for instance, a social network. Let’s imagine one that’s really, really big, like facebook for instance. When a new person joins facebook (\(n\) is growing), it person is probably not going to become friends with the entire network. Realistically, they will have a fixed subset of friends that they connect with, and people with similar interests that they connect with, and the average degree of the network is probably going to scale by a lot less than the number of people who were added to the network (\(1\) person). As more and more people are added to the network, this rate of change is probably going to converge to \(0\), like we saw above. In this sense, social networks can usually be reasoned to behave like sparse networks.

Likewise, let’s consider another practical application of sparsity. Many algorithms for ranking search pages (especially in the early days of the internet) conceptualized the internet as an interconnected network of web pages. Nodes were individual pages on the web, and an edge exists between page \(i\) and page \(j\) if page \(i\) has a hyperlink to page \(j\). Early search query engines, such as Google’s Page Rank [6], up-rank a web page \(i\) in search results based on the number of other web pages that link to page \(i\) (the logic being that better pages are linked to more often, and hence, are desirable to return to searchers). As more and more web pages are indexed by google, it is likely that other pages that are similar in content might end up cross-linking to a given web page, but it is unlikely that the increase in the average number of web page links will scale with the number of web pages.

As another application of sparse networks, let’s imagine that you are a planner for a highway system. The nodes of the network are cities that have a highway that passes through them, and the edges are whether a pair of cities are connected directly by a highway (that goes from one straight to the other). As nodes are added to the network (linked by new highways), you would probably just put a highway to the nearest city that is already in the highway system, rather than build direct highways to all of the other cities in the system. In this sense, highway systems can be conceptualized to be sparse.

For these and many other reasons, this theoretical notion of sparsity becomes important when determining the expected behavior for algorithms or analysis methods designed for networks which can be conceptualized as sparse.

9.3.2.2. Sparsity as a tool to use when it makes sense#

There are much more direct reasons that we might need to think about sparsity, too. When we have an adjacency matrix \(A\) which has \(n\) nodes (the network is realized, and not a conceptualized sequence of random networks, like above), all of the considerations that we gave above for matrices in Section 9.3.1 (storage, computation, and algorithmic) considerations come into play. One of the most popular statistics to use to determine sparsity in realized networks is the network density, but there are many others that have their own advantages [7], [8].

9.3.2.2.1. Storage considerations#

For instance, if a network doesn’t have all that many edges, we might be able to store it more efficiently as an edge list. An edge list basically “flattens” the adjacency matrix, and just stores a list of the node pairs \(i\) and \(j\) where \(a_{ij} \neq 0\). Let’s conceptualize a simple network \(A\) that has an adjacency matrix that looks like this:

\[\begin{align*} A &= \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}. \end{align*}\]

An edge list for this network would look like this (note that the below list encodes all of the information about the adjacency matrix, due to the fact that the network is simple, and hence the adjacency matrix is symmetric). The rows of the edgelist are just the node pairings for all of the edges in the network:

Node \(i\)	Node \(j\)
1	2
2	3

One caveat with edge lists is that they can, occassionally, be a little bit confusing to deal with. Particularly, if you don’t know ahead of time that the network is simple, you might not know that it is undirected. When you reconstruct the adjacency matrix \(A\) from the edge list, it might be ambiguous to you whether since \(a_{12} = 1\) you should also reflect that \(a_{21} = 1\) (should you induce symmetry?). Likewise, any nodes that are disconnected will not show up at all in the edge list, and it might not be obvious that the node is part of the network (just disconnected). For this reason, when you analyze network data that is stored as an edge list, it is imperative that you understand what properties you expect the network to have ahead of time. For instance, if you read a network using the networkx function read_edgelist() or read_weighted_edgelist() and then attempt to convert it to a numpy adjacency matrix using the to_numpy_array() or to_numpy_matrix() functions, that you take care to first convert the graph object returned by networkx’s read functions to a loopless, undirected, unweighted matrix (if applicable), and then convert the network to a numpy array/matrix with a pre-specified nodelist option. You can do these things with the to_undirected() utility.

9.3.2.2.2. Computational considerations#

The computational considerations about an adjacency matrix which is sparse are virtually identical to those we noted above. We might be able to make significant use of an adjacency matrix’s sparsity to radically alter the computational techniques that we choose to employ.

9.3.2.2.3. Algorithmic considerations#

For a singificant implication of sparsity, we’ll turn to a well-known problem in the sparse network literature: the “EigenSpokes” problem [9]. The basic idea is that, when dealing with networks that have a very low edge-density, a very large number of nodes, and with a high degree of community-specific connectivity (that is, the within-community connectivity greatly exceeds the between-community connectivity), individual nodes do not tend to be “split up” into nice little “blobs” when we perform an adjacency spectral embedding, like in Section 5.3.

In these situations, the latent positions tend to form a “spoke-like” pattern (called, “Eigen spokes”). The name “eigen” comes from a relationship between the singular vectors and the eigenvectors in the singular value decomposition when the matrix being decomposed is symmetric, such as a symmetric adjacency matrix for an undirected network. The latent positions will have a value of \(0\) in some latent dimensions, and then “project” along other latent dimensions (the “spokes”). All of the community structure is not found in the “blob-like” structure a particular node is in, but rather, which projections it has (or does not have) a value of \(0\) in. In this sense, it would be particularly unreasonable to use something like adjacency spectral embedding followed by community detection like we did in Section 6.1; rather, we might need to exploit other patterns to find community structure, like the authors did for several extremely large and extremely large networks in their paper.

9.3.3. References#

1: James H. Wilkinson. The Algebraic Eigenvalue Problem. Clarendon Press, Oxford, 1965.
2: G. K. Wallace. The JPEG still picture compression standard. IEEE Trans. Consum. Electron., 38(1):xviii–xxxiv, February 1992. doi:10.1109/30.125072.
3: Hui Zou, Trevor Hastie, and Robert Tibshirani. Sparse Principal Component Analysis. J. Comput. Graph. Stat., 15(2):265–286, June 2006. doi:10.1198/106186006X113430.
4: Maria Kateri. Contingency Table Analysis. Springer, New York, NY, USA, 2014. ISBN 978-0-8176-4811-4. URL: https://link.springer.com/book/10.1007/978-0-8176-4811-4?source=shoppingads&locale=en-us&srsltid=Ad5pg_FPOKpA_3SJgeDkORDZmHF1axCY_F167q-fkvzRXNCpHK6DFPSePT8.
5: George Casella and Roger L. Berger. Statistical Inference. Cengage Learning, Boston, MA, USA, June 2001. ISBN 978-0-53424312-8.
6: Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1):107–117, April 1998. doi:10.1016/S0169-7552(98)00110-X.
7: Niall Hurley and Scott Rickard. Comparing Measures of Sparsity. IEEE Trans. Inf. Theory, 55(10):4723–4741, September 2009. doi:10.1109/TIT.2009.2027527.
8: Swati Goswami, C. A. Murthy, and Asit K. Das. Sparsity Measure of a Network Graph: Gini Index. arXiv, December 2016. arXiv:1612.07074, doi:10.48550/arXiv.1612.07074.
9: B. Aditya Prakash, Ashwin Sridharan, Mukund Seshadri, Sridhar Machiraju, and Christos Faloutsos. EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs. In Advances in Knowledge Discovery and Data Mining, pages 435–448. Springer, Berlin, Germany, 2010. doi:10.1007/978-3-642-13672-6_42.

Hands-on Network Machine Learning with Scikit-Learn and Graspologic

Network Sparsity

Contents