11.2. Background#

To understand this section, there are some important background points you should have some familiarity with. First, you should be readily familiar with the mathematical operations and concepts from probability and statistics which are explained directly in the Terminology. Here, we will introduce some of these background points, along with the specific notations we will use in this section. When you read different probability or statistics books, different instructors or professors will often use different notations for the same objects. We find these differences to be cumbersome, so we will outline here the conventions we use for the most common objects that people denote in multiple ways. They are:

11.2.1. Notational Tendencies#

  1. Extrapolatory ellipses: We will often summarize a sequence of natural numbers with ellipses; e.g., \(\{1, 2, 3, ..., n\}\). All the ellipses mean is to continue the indexing pattern until you reach the last index in the sequence. For instance, in the above sequence, the ellipses stand for \(4\), \(5\), \(6\), and so on, all the way up to \(n - 1\). These ellipses have the same interpretation in vectors and matrices: just continue the numbering pattern.

  2. Shorthand for sequences of natural numbers: We will denote a sequence of natural numbers that goes from \(1\) up to \(n\) using the notation \([n]\). Stated another way, \([n] = \{1, ..., n\}\).

  3. Useful numerical spaces: We will often use a number of numerical spaces in this book. The common ones will be typeset in blackboard bold (double-struck) print. They are the natural numbers (denoted \(\mathbb N\)), the integers (denoted \(\mathbb Z\)), the non-negative integers (denoted \(\mathbb Z_{\geq 0}\)), the positive integers (denoted \(\mathbb Z_+\)), and the real numbers (denoted \(\mathbb R\)).

  4. Shorthand for objects which can take (arbitrary) values from a particular numerical space: it will often be the case that, in describing a network model, our description applies regardless of the value of a particular number in that description. For this reason, we use notation to denote the arbitrariness of this choice. We will use the notation \(x \in \mathcal S\) to denote that the object \(x\) (which could be a scalar, a vector, or a matrix) takes values in the numerical space \(\mathcal S\). For instance, \(x \in \mathbb R\) means that \(x\) is an arbitrary real number. \(\vec x \in \mathbb R^d\) means that \(\vec x\) is an arbitrary vector with \(d\) elements, where each element is an arbitrary real number; e.g., \(x_i \in \mathbb R\). Another common vector representation we will see is \(\vec x \in [K]^d\) or \(\vec x \in \{1,..., K\}^d\), which in both cases means that each element \(x_i\) is an arbitrary natural number that is at most \(K\). \(X \in \mathbb R^{r \times c}\) means that \(X\) is an arbitrary matrix with \(r\) rows and \(c\) columns, where each element is an arbitrary real number; e.g., \(x_{ij} \in \mathbb R\). A short code sketch after this list makes these objects concrete.

  5. Vector in-line notation: with \(\vec x\) as a vector, we might sometimes resort to describing \(\vec x\) using an in-line notation which directly captures its dimensionality. We might say something like, \(\vec x = (x_i)_{i = 1}^d\), which just means that \(\vec x\) looks like this:

\[\begin{align*} \vec x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_d \end{bmatrix} \end{align*}\]
  6. Matrix in-line notation: with \(X\) as a matrix, we might similarly describe \(X\) using in-line notation which captures its number of rows and columns, with something like \(X = (x_{ij})_{i \in [r], j \in [c]}\) or \(\left((x_{ij})_{j = 1}^c\right)_{i = 1}^r\). What this means is that we first “unroll” \(X\) across the dimension being indexed by \(j\) from \(1\) to \(c\), and then down the dimension being indexed by \(i\) from \(1\) to \(r\). In this sense, \(X\) would be a matrix with \(r\) rows and \(c\) columns. It would look like this:

\[\begin{align*} X &= \begin{bmatrix} x_{11} & ... & x_{1c} \\ \vdots & \ddots & \vdots \\ x_{r1} & ... & x_{rc} \end{bmatrix} \end{align*}\]
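To make some of the notation above concrete, here is a minimal Python sketch (assuming `numpy` is available; the sizes `n`, `d`, `r`, `c`, and `K` are purely illustrative) showing the set \([n]\), a vector \(\vec x \in \mathbb R^d\), a vector in \([K]^d\), and a matrix \(X \in \mathbb R^{r \times c}\):

```python
import numpy as np

n, d, r, c, K = 6, 4, 3, 2, 5   # illustrative sizes; any positive integers work

# The set [n] = {1, ..., n}
index_set = list(range(1, n + 1))

# A vector x in R^d: d arbitrary real-valued elements
x = np.random.normal(size=d)
print(x[0])       # the element x_1 (numpy indexes from 0, while our math indexes from 1)

# A vector in [K]^d = {1, ..., K}^d: each element is a natural number that is at most K
z = np.random.randint(1, K + 1, size=d)

# A matrix X in R^{r x c}: r rows and c columns of real numbers
X = np.random.normal(size=(r, c))
print(X[0, 1])    # the element x_{12}: row 1, column 2
```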

11.2.2. Background in probability distributions and distribution functions#

  1. Distribution functions for random variables: \(\mathbf x \sim F\), which can be read as, “the random variable (which is denoted with a bold-faced \(\mathbf x\)) has a distribution which is delineated by the distribution function \(F\).” For instance, if \(\mathbf x \sim Bern(p)\), this means that the random variable \(\mathbf x\) has a distribution which can be described by a Bernoulli random variable with probability \(p\). Stated mathematically, \(Pr(\mathbf x = x) = \begin{cases}p & x = 1 \\1 - p & x = 0\end{cases}\). In this case, what this means is that realizations of \(\mathbf x\), denoted by \(x\) (no bold-face), are like a coin flip which lands on heads with probability \(p\) and tails with probability \(1 - p\), where “heads” is akin to \(x\) having a value of \(1\), and “tails” is akin to \(x\) having a value of \(0\).

    • We believe that familiarity with the Bernoulli distribution with probability parameter \(p\), denoted \(Bern(p)\), the Categorical (Multinoulli) distribution with probability vector \(\vec p\), the Normal distribution \(\mathcal N(\mu, \sigma^2)\), and the Uniform distribution \(Unif(a, b)\) with a minimum at \(a\) and a maximum at \(b\) will be valuable. If you have seen these terms before but forget exactly what they mean, we will describe them in-line as necessary, as well. A short code sketch after this list shows how realizations from these distributions can be simulated.

  2. Distribution functions for random vectors: \(\mathbf {\vec x} \sim F\), which can be read as, “the random vector \(\mathbf{\vec x}\) has a distribution which is delineated by the distribution function \(F\)”. In most cases, we will assume some level of statistical independence for random vectors, which basically means that instead of describing \(\mathbf{\vec x}\) itself directly as having a distribution, we can for our purposes describe each individual element of \(\mathbf {\vec x}\), written \(\mathbf x_i\), as having a distribution. This reduces cumbersome descriptions of distribution functions for random vectors to simpler distribution functions for random variables.

    • This brings us to an important aside. When we are working with vectors (either random or fixed), you might see us use the fancy word “dimensions” to describe individual elements of these vectors. The \(i^{th}\) dimension of a vector \(\vec x\) is just the \(i^{th}\) element of that vector; e.g., \(x_i\), which is a scalar. The \(i^{th}\) dimension of a random vector \(\mathbf {\vec x}\) is similar, where \(\mathbf x_i\) is a random variable.

  3. Distribution functions for random matrices: \(\mathbf X \sim F\), which can be read as, “the random matrix \(\mathbf X\) has a distribution which is delineated by the distribution function \(F\)”. Here, we will also tend to assume statistical independence for random matrices, which again means that instead of describing \(\mathbf X\), we can just look at the individual elements of \(\mathbf X\), denoted \(\mathbf x_{ij}\), as having distributions.

    • There is one important exception to this, which will arise for the a posteriori Random Dot Product Graph, which will use something called inner-product distributions. In this case, instead of describing \(\mathbf X\) itself, we will describe a family of distributions for random vectors, which comprise the rows of \(\mathbf X\). We will try our best to explain these in an intuitive way without going outside of the scope of a graduate understanding of statistics.

  4. Parametrized functions: In statistics, there is a concept called a parameter. A parameter is a number, or a set of numbers, which uniquely defines the behavior of something. In our case, we will often be concerned with parametrized random variables, such as \(\mathbf x \sim Bern(p)\), which states that \(\mathbf x\) is a random variable which is described by the Bernoulli distribution with parameter \(p\). In this case, in a sense, the “\(p\)” is static, in that for our particular random variable \(\mathbf x\) that we are talking about, \(p\) itself isn’t going to change. However, when we talk about realizations of \(\mathbf x\) (which are \(0\)s and \(1\)s), these realizations can, and will, change. In this sense, when we make probability statements about \(\mathbf x\), such as \(Pr(\mathbf x = 1, p) = p\) or \(Pr(\mathbf x = 0, p) = 1 - p\), the probability is a function of both the parameter \(p\) and the value \(x\) which \(\mathbf x\) takes. Simultaneously, however, when we study \(\mathbf x\), the \(p\) isn’t going to change for a given \(\mathbf x\). For this reason, we explicitly delineate this difference by instead dropping the \(p\) down as a subscript; e.g., \(Pr(\mathbf x = x, p)\) will instead be denoted as \(Pr_p(\mathbf x = x)\). This makes explicit that \(p\) is a parameter of the distribution of \(\mathbf x\), and not something that is realized (or changing) for different realizations of \(\mathbf x\).

  5. Arbitrary sets of parameters: When we describe random variables very generally, it is often the case that we want to be as unrestrictive as possible. For instance, if we are describing a generic random variable \(\mathbf x\) which could have a Bernoulli distribution with a parameter \(p\) or a Normal distribution with mean \(\mu\) and variance \(\sigma^2\), we have different sets of parameters depending on which distribution \(\mathbf x\) has. When a random variable could have different parameters, we will often describe the parameters using the notation \(\theta\), which is just an arbitrary parameter set. For instance, in the example we gave, \(\theta\) could be the set \(\{p\}\) or the set \(\{\mu, \sigma^2\}\), so we will just describe \(\mathbf x\) in terms of the generic parameter \(\theta\) instead of cumbersomely writing that the parameter set can be different every time. In this sense, \(\theta\) will denote an arbitrary set of parameters for a random variable \(\mathbf x\), a random vector \(\mathbf{\vec x}\), or a random matrix \(\mathbf X\).
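To connect the distributions named above with the idea of realizations, here is a minimal Python sketch (assuming `numpy` is available; the particular parameter values and the seed are arbitrary) which simulates realizations of \(Bern(p)\), \(\mathcal N(\mu, \sigma^2)\), and \(Unif(a, b)\) random variables, as well as a random vector whose elements are independent \(Bern(p)\) random variables:

```python
import numpy as np

rng = np.random.default_rng(0)          # seed chosen arbitrarily, for reproducibility

# A realization of x ~ Bern(p): a single 0 or 1, where 1 occurs with probability p
p = 0.3
x = rng.binomial(n=1, p=p)

# A realization of x ~ N(mu, sigma^2); note that numpy's scale is the standard deviation sigma
y = rng.normal(loc=2.0, scale=1.5)

# A realization of x ~ Unif(a, b), with minimum a and maximum b
u = rng.uniform(low=0.0, high=1.0)

# With element-wise independence, a random vector is just d independent random variables,
# so we can realize it one element (dimension) at a time
d = 10
xvec = rng.binomial(n=1, p=p, size=d)   # each element x_i ~ Bern(p), independently
```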

11.2.3. Abuses of notation#

In statistical work, there is a common practice called an "abuse" of notation: a particular notation choice which takes on multiple meanings. These abuses can really complicate your understanding if you see a notation defined and used as one thing, and then redefined and reused as another. We will use one particular abuse of notation fairly regularly. As it turns out, if we have a random variable, random vector, or random matrix, its distribution can be delineated entirely by its cumulative distribution function. This is an important aside, since cumulative distribution functions aren’t distributions themselves, but each one equivalently delineates a unique distribution. To make this description explicit, let’s say that \(\mathbf x \sim \mathcal N(\mu, \sigma^2)\), which means that \(\mathbf x\) has the Normal distribution with mean \(\mu\) and variance \(\sigma^2\). This statement is explicit, and will not stop being true. But, equivalently, we might use an abuse of notation, by defining the cumulative distribution function:

\[\begin{align*} F_{\mu, \sigma^2}(x) &= \int_{-\infty}^x f_{\mu, \sigma^2}(t) \,\text{d}t \end{align*}\]

where \(f_{\mu, \sigma^2}(t)\) is the probability density for the normal distribution with mean \(\mu\) and variance \(\sigma^2\), evaluated at the value \(t\). For a given choice of \(\mu\) and \(\sigma^2\), \(F_{\mu, \sigma^2}(x)\) has a unique value over the range of values which \(x\) could take. In this sense, if we were to say that \(\mathbf x \sim F_{\mu, \sigma^2}\), we have kind of “abused” the notation of the cumulative distribution function, in that the cumulative distribution function is not itself a description for a random variable. However, since \(F_{\mu,\sigma^2}\) is unique for a \(\mathcal N(\mu, \sigma^2)\) random variable, it “does the job” for us, and is “clear enough” for our purposes. The reason that we do this is that sometimes we might want to leave the type of distribution describing our random variable totally generic. For instance, we could say \(\mathbf x \sim F\), which just means that \(\mathbf x\) is a random variable with an arbitrary cumulative distribution function \(F\). This leaves the specifics of \(F\) generic, including the parameter choices that could be made, or the behavior of the specific family of random variables that it would define. This will come up when we study inner product distributions below.
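To illustrate the relationship between the density \(f_{\mu, \sigma^2}\) and the cumulative distribution function \(F_{\mu, \sigma^2}\) above, here is a minimal Python sketch (assuming `numpy` and `scipy` are available; the choices of \(\mu\), \(\sigma\), and \(x\) are arbitrary) which evaluates \(F_{\mu, \sigma^2}(x)\) directly and by numerically integrating the density:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, sigma = 1.0, 2.0   # illustrative choices of the parameters mu and sigma
x = 0.5                # the value at which we evaluate the CDF

# The cumulative distribution function F_{mu, sigma^2}(x) of a N(mu, sigma^2) random variable
F_x = stats.norm.cdf(x, loc=mu, scale=sigma)

# Equivalently, integrate the density f_{mu, sigma^2}(t) from -infinity up to x
F_x_numeric, _ = quad(lambda t: stats.norm.pdf(t, loc=mu, scale=sigma), -np.inf, x)

print(np.isclose(F_x, F_x_numeric))  # True: the CDF pins down the distribution exactly
```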