treeple.experimental.mutual_info_ksg

treeple.experimental.mutual_info_ksg(X, Y, Z=None, k=0.2, metric='forest', algorithm='kd_tree', n_jobs=-1, transform='rank', random_seed=None)

Compute the generalized (conditional) mutual information KSG estimate.

Parameters:

X ArrayLike of shape (n_samples, n_features_x): The X covariate space.
Y ArrayLike of shape (n_samples, n_features_y): The Y covariate space.
Z ArrayLike of shape (n_samples, n_features_z), optional: The Z covariate space, by default None. If None, the MI is computed. If Z is given, the CMI is computed.
k float, optional: The number of neighbors to use in defining the radius, by default 0.2.
metric str: Any distance metric accepted by sklearn.neighbors.NearestNeighbors. If ‘forest’ (default), a treeple.UnsupervisedObliqueRandomForest is used to compute geodesic distances.
algorithm str, optional: Nearest-neighbor search method to use, by default ‘kd_tree’. Can be one of (‘ball_tree’, ‘kd_tree’, ‘brute’).
n_jobs int, optional: Number of parallel jobs, by default -1.
transform one of {‘rank’, ‘standardize’, ‘uniform’}: Preprocessing applied to the data, by default ‘rank’.
random_seed int, optional: Random seed, by default None.

Returns:

val float: The estimated MI or CMI value.

Notes

Given a dataset with n samples, the KSG estimator proceeds by:

1. For fixed k, find the distance to the k-th nearest neighbor in the XYZ subspace; call it ‘r’.
2. Count the neighbors in the XZ subspace within radius ‘r’.
3. Count the neighbors in the YZ subspace within radius ‘r’.
4. Count the neighbors in the Z subspace within radius ‘r’.
5. Apply the analytic solution for the KSG estimate.
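The neighbor-counting steps above (everything before the analytic solution) can be sketched in plain Python as follows. This is an illustrative brute-force sketch, not treeple's implementation: the helper names `chebyshev`, `count_within`, and `neighbor_counts` are hypothetical, `k` is taken as an integer neighbor count, the max-norm is assumed as the distance, and "within radius ‘r’" is assumed to mean a non-strict `<=` comparison.

```python
def chebyshev(a, b):
    """Max-norm distance between two equal-length rows (assumed metric)."""
    return max(abs(u - v) for u, v in zip(a, b))


def count_within(space, i, r):
    """Count rows of `space` within distance r of row i, excluding row i itself."""
    return sum(
        1
        for j, row in enumerate(space)
        if j != i and chebyshev(space[i], row) <= r
    )


def neighbor_counts(X, Y, Z, k):
    """Return a list of (n_xz, n_yz, n_z) tuples, one per sample.

    X, Y, Z are lists of rows (lists of floats) with the same number of rows;
    k is an integer with 1 <= k <= n_samples - 1 (an assumption of this sketch).
    """
    n = len(X)
    # Build the subspaces by concatenating the feature rows.
    xz = [X[i] + Z[i] for i in range(n)]
    yz = [Y[i] + Z[i] for i in range(n)]
    xyz = [X[i] + Y[i] + Z[i] for i in range(n)]

    counts = []
    for i in range(n):
        # Radius r: distance to the k-th nearest neighbor in the joint XYZ subspace.
        joint = sorted(chebyshev(xyz[i], xyz[j]) for j in range(n) if j != i)
        r = joint[k - 1]
        # Neighbor counts within radius r in the XZ, YZ, and Z subspaces.
        counts.append(
            (count_within(xz, i, r), count_within(yz, i, r), count_within(Z, i, r))
        )
    return counts
```

Because a distance in a subspace can never exceed the max-norm distance in the joint space, each count under these assumptions is at least `k`. A real implementation would replace the quadratic loops with a tree-based neighbor search.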

For MI, the analytical solution is:

\[\psi(k) - E[(\psi(n_x) + \psi(n_y))] + \psi(n)\]

For CMI, the analytical solution is:

\[\psi(k) - E[(\psi(n_{xz}) + \psi(n_{yz}) - \psi(n_{z}))]\]

where \(\psi\) is the digamma function, \(n\) is the number of samples, and each expectation term is estimated by taking the sample average.

Note that each \(n_i\) term denotes the number of neighbors within radius ‘r’ in subspace ‘i’, where ‘i’ may be, for example, the X, Y, XZ, YZ, or Z subspace. The count does not include the sample itself.
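To make the MI formula concrete, here is a minimal pure-Python sketch for two one-dimensional samples. It is not treeple's implementation and is not claimed to reproduce its output: the names `digamma_int` and `ksg_mi` are hypothetical, `k` is an integer neighbor count, the max-norm is assumed in the joint space, and neighbors are counted with a non-strict `<=` so that every count is at least `k`. Since all arguments of \(\psi\) here are positive integers, the sketch uses the identity \(\psi(m) = -\gamma + \sum_{j=1}^{m-1} 1/j\) instead of a general digamma routine.

```python
EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def digamma_int(m):
    """Digamma at a positive integer m: psi(m) = -gamma + sum_{j=1}^{m-1} 1/j."""
    if m < 1:
        raise ValueError("digamma_int expects a positive integer")
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))


def ksg_mi(x, y, k=3):
    """Sketch of psi(k) - E[psi(n_x) + psi(n_y)] + psi(n) for 1-D lists x and y."""
    n = len(x)
    total = 0.0
    for i in range(n):
        # Per-coordinate distances from sample i to every other sample.
        dx = [abs(x[i] - x[j]) for j in range(n) if j != i]
        dy = [abs(y[i] - y[j]) for j in range(n) if j != i]
        # Radius r: k-th smallest max-norm distance in the joint XY space.
        r = sorted(max(a, b) for a, b in zip(dx, dy))[k - 1]
        # Neighbor counts in each marginal, excluding the sample itself.
        n_x = sum(1 for a in dx if a <= r)
        n_y = sum(1 for b in dy if b <= r)
        total += digamma_int(n_x) + digamma_int(n_y)
    # The expectation is estimated by the sample average.
    return digamma_int(k) - total / n + digamma_int(n)
```

For example, `ksg_mi([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0], k=1)` evaluates the formula by hand-checkable arithmetic: the two end points each contribute \(2\psi(1)\), the two interior points each contribute \(2\psi(2)\), and the result is \(\psi(1) - (1 - 2\gamma) + \psi(4) = 5/6\).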