treeple.experimental.mutual_info_ksg
- treeple.experimental.mutual_info_ksg(X, Y, Z=None, k=0.2, metric='forest', algorithm='kd_tree', n_jobs=-1, transform='rank', random_seed=None)
Compute the generalized (conditional) mutual information KSG estimate.
- Parameters:
- X : ArrayLike of shape (n_samples, n_features_x)
The X covariate space.
- Y : ArrayLike of shape (n_samples, n_features_y)
The Y covariate space.
- Z : ArrayLike of shape (n_samples, n_features_z), optional
The Z covariate space, by default None. If None, the mutual information (MI) is computed. If Z is given, the conditional mutual information (CMI) is computed instead.
- k : float, optional
The number of neighbors to use in defining the radius, by default 0.2.
- metric : str
Any distance metric accepted by sklearn.neighbors.NearestNeighbors. If ‘forest’ (default), a treeple.UnsupervisedObliqueRandomForest is used to compute geodesic distances.
- algorithm : str, optional
Nearest-neighbor search method to use, by default ‘kd_tree’. Can be one of (‘ball_tree’, ‘kd_tree’, ‘brute’).
- n_jobs : int, optional
Number of parallel jobs, by default -1.
- transform : one of {‘rank’, ‘standardize’, ‘uniform’}
Preprocessing applied to the data, by default ‘rank’.
- random_seed : int, optional
Random seed, by default None.
- Returns:
- val : float
The estimated MI or CMI value.
Notes
Given a dataset with n samples, the KSG estimator proceeds by:
- For fixed k, get the distance to the kth nearest neighbor in the XYZ subspace; call it ‘r’.
- Get the number of nearest neighbors in the XZ subspace within radius ‘r’.
- Get the number of nearest neighbors in the YZ subspace within radius ‘r’.
- Get the number of nearest neighbors in the Z subspace within radius ‘r’.
- Apply the analytic solution for the KSG estimate.
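The neighbor-counting part of this procedure can be illustrated with a brute-force, standard-library-only sketch. It is not treeple's implementation: the helper names `chebyshev` and `ksg_neighbor_counts` are hypothetical, the max-norm distance is an assumed metric, k is taken to be an integer count of at least 1, and "within radius" is assumed to be inclusive, so every count is at least k.

```python
def chebyshev(a, b):
    """Max-norm distance between two equal-length points (assumed metric)."""
    return max(abs(p - q) for p, q in zip(a, b))


def ksg_neighbor_counts(X, Y, Z, k):
    """Return per-sample (n_xz, n_yz, n_z) counts for an integer k >= 1.

    X, Y and Z are lists of per-sample feature lists of equal length.
    """
    n = len(X)
    # Build the joint and marginal subspaces by concatenating features.
    xyz = [x + y + z for x, y, z in zip(X, Y, Z)]
    xz = [x + z for x, z in zip(X, Z)]
    yz = [y + z for y, z in zip(Y, Z)]

    def count_within(points, i, r):
        # Neighbors at distance <= r (inclusive), excluding sample i itself.
        return sum(
            1 for j in range(n)
            if j != i and chebyshev(points[i], points[j]) <= r
        )

    counts = []
    for i in range(n):
        # Radius: distance to the k-th nearest neighbor in the XYZ space.
        dists = sorted(chebyshev(xyz[i], xyz[j]) for j in range(n) if j != i)
        r = dists[k - 1]
        # Neighbor counts in the XZ, YZ and Z subspaces for that radius.
        counts.append(
            (count_within(xz, i, r), count_within(yz, i, r), count_within(Z, i, r))
        )
    return counts
```

The quadratic loop keeps the logic visible. A tree-based search such as the ‘kd_tree’ or ‘ball_tree’ options would replace it on real data.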
For MI, the analytical solution is:
\[\psi(k) - E[\psi(n_x) + \psi(n_y)] + \psi(n)\]
For CMI, the analytical solution is:
\[\psi(k) - E[\psi(n_{xz}) + \psi(n_{yz}) - \psi(n_{z})]\]
where \(\psi\) is the digamma function, and each expectation term is estimated by taking the sample average.
Note that each \(n_i\) term denotes the number of neighbors within radius ‘r’ in subspace ‘i’, where ‘i’ is, for example, the X, Y, or XZ subspace. The count does not include the sample itself.
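As a rough illustration of how the two closed-form expressions are evaluated once the neighbor counts are known, the sketch below uses only the standard library. The names `digamma`, `ksg_mi` and `ksg_cmi` are hypothetical helpers, not part of treeple, and the digamma approximation (recurrence plus a truncated asymptotic series) is a stand-in for a library routine such as `scipy.special.digamma`.

```python
import math


def digamma(x):
    """Approximate the digamma function for x > 0."""
    if x <= 0:
        raise ValueError("this sketch only supports x > 0")
    result = 0.0
    # Use psi(x) = psi(x + 1) - 1/x to shift x up until the series is accurate.
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # Asymptotic series: ln(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def ksg_mi(k, n_x, n_y, n):
    """psi(k) - mean(psi(n_x) + psi(n_y)) + psi(n), with a sample average."""
    mean_term = sum(digamma(a) + digamma(b) for a, b in zip(n_x, n_y)) / len(n_x)
    return digamma(k) - mean_term + digamma(n)


def ksg_cmi(k, n_xz, n_yz, n_z):
    """psi(k) - mean(psi(n_xz) + psi(n_yz) - psi(n_z)), with a sample average."""
    mean_term = sum(
        digamma(a) + digamma(b) - digamma(c)
        for a, b, c in zip(n_xz, n_yz, n_z)
    ) / len(n_xz)
    return digamma(k) - mean_term
```

Every count passed in must be at least 1, because the digamma function is undefined at 0. For example, `ksg_cmi(1, [1, 2, 2, 1], [1, 2, 2, 1], [3, 3, 3, 3])` evaluates to approximately 0.5.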