API Documentation#

sktree:

Scikit manifold oblique random forests.

Scikit-learn Tree Estimators#

We provide drop-in replacements for the scikit-learn tree estimators, with experimental features that we have developed. These estimators remain compatible with the scikit-learn API. They all support binning features, which in theory can significantly improve runtime on high-dimensional, high-sample-size data.

Use at your own risk! We have not tested these estimators as extensively as the scikit-learn estimators.
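For example, a minimal sketch of the drop-in workflow (assuming the estimators are importable from the package root, as they are listed here):

    # Standard scikit-learn workflow; only the import changes.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    from sktree import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))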

RandomForestClassifier([n_estimators, ...])

A random forest classifier.

RandomForestRegressor([n_estimators, ...])

A random forest regressor.

ExtraTreesClassifier([n_estimators, ...])

An extra-trees classifier.

ExtraTreesRegressor([n_estimators, ...])

An extra-trees regressor.

DecisionTreeClassifier(*[, criterion, ...])

A decision tree classifier.

DecisionTreeRegressor(*[, criterion, ...])

A decision tree regressor.

ExtraTreeClassifier(*[, criterion, ...])

An extremely randomized tree classifier.

ExtraTreeRegressor(*[, criterion, splitter, ...])

An extremely randomized tree regressor.

Supervised#

Decision-tree models are traditionally implemented with axis-aligned splits, storing the mean outcome (i.e. the label vote) in the leaf nodes. However, more exotic “oblique” splits are possible, which apply some function of multiple feature columns to create a “new feature value” to split on.

This can take the form of a random (sparse) linear combination of feature columns, or it can exploit structure in the data (e.g. an image) to sample feature indices in a manifold-aware fashion. This class of models generalizes the splitting function of the trees, while everything else is consistent with how scikit-learn builds trees.
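For example, a minimal sketch of fitting an oblique forest; the feature_combinations parameter name follows the estimator docstrings, but treat the exact signature as an assumption for your installed version:

    from sklearn.datasets import make_classification

    from sktree import ObliqueRandomForestClassifier

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)

    # Each oblique split draws a sparse random linear combination of
    # features; feature_combinations controls how many features are mixed.
    clf = ObliqueRandomForestClassifier(
        n_estimators=50, feature_combinations=2.0, random_state=0
    )
    clf.fit(X, y)
    print(clf.predict(X[:5]))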

ObliqueRandomForestClassifier([...])

An oblique random forest classifier.

ObliqueRandomForestRegressor([n_estimators, ...])

An oblique random forest regressor.

PatchObliqueRandomForestClassifier([...])

A patch-oblique random forest classifier.

PatchObliqueRandomForestRegressor([...])

A patch-oblique random forest regressor.

HonestForestClassifier([n_estimators, ...])

A forest classifier with honest leaf estimates.

MultiViewRandomForestClassifier([...])

A multi-view axis-aligned random forest classifier.

ObliqueDecisionTreeClassifier(*[, ...])

An oblique decision tree classifier.

ObliqueDecisionTreeRegressor(*[, criterion, ...])

An oblique decision tree regressor.

PatchObliqueDecisionTreeClassifier(*[, ...])

An oblique decision tree classifier that operates over patches of data.

PatchObliqueDecisionTreeRegressor(*[, ...])

An oblique decision tree regressor that operates over patches of data.

HonestTreeClassifier([tree_estimator, ...])

A decision tree classifier with honest predictions.

MultiViewDecisionTreeClassifier(*[, ...])

A multi-view axis-aligned decision tree classifier.

Unsupervised#

Decision-tree models are traditionally used for classification and regression. However, they are also powerful non-parametric embedding and clustering models. Scikit-learn's RandomTreesEmbedding is an example of an unsupervised tree model. We implement other state-of-the-art models that explicitly split on unsupervised criteria, such as variance and BIC.
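For example, a minimal sketch of fitting an unsupervised forest (assuming it is importable from the package root and that fit takes only X):

    from sklearn.datasets import make_blobs

    from sktree import UnsupervisedRandomForest

    X, _ = make_blobs(n_samples=100, centers=3, random_state=0)

    # Splits are chosen by an unsupervised criterion (e.g. variance or BIC)
    # rather than by label impurity, so no y is passed.
    est = UnsupervisedRandomForest(n_estimators=50, random_state=0)
    est.fit(X)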

UnsupervisedRandomForest([n_estimators, ...])

Unsupervised random forest.

UnsupervisedObliqueRandomForest([...])

Unsupervised oblique random forest.

The trees that comprise those forests are also available as standalone classes.

tree.UnsupervisedDecisionTree(*[, ...])

Unsupervised decision tree.

tree.UnsupervisedObliqueDecisionTree(*[, ...])

Unsupervised oblique decision tree.

Outlier Detection#

The isolation forest, implemented in scikit-learn, is an ensemble of extremely randomized axis-aligned decision trees. The extended isolation forest replaces the base tree model with an oblique tree, which allows a more flexible model for detecting outliers.

ExtendedIsolationForest(*[, n_estimators, ...])

Extended Isolation Forest Algorithm.
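A minimal sketch, assuming the estimator follows the same outlier-detection API as scikit-learn's IsolationForest (fit, then predict returning +1 for inliers and -1 for outliers):

    import numpy as np

    from sktree import ExtendedIsolationForest

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))             # tight inlier cluster
    X_outliers = rng.uniform(-6, 6, (10, 2))  # scattered points

    est = ExtendedIsolationForest(n_estimators=100, random_state=0)
    est.fit(X)
    print(est.predict(X_outliers))  # mostly -1, i.e. flagged as outliers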

Distance Metrics#

Trees inherently produce a “distance-like” metric. We provide an API for extracting pairwise distances from the trees that includes a correction turning the “tree distance” into a proper distance metric.

compute_forest_similarity_matrix(forest, X)

Compute the similarity matrix of samples in X using a trained forest.
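A minimal sketch using the signature listed above (the top-level import path for the helper is an assumption):

    from sklearn.datasets import make_classification

    from sktree import RandomForestClassifier, compute_forest_similarity_matrix

    X, y = make_classification(n_samples=50, n_features=10, random_state=0)
    forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

    # (n_samples, n_samples) matrix; entries reflect how often two samples
    # fall into the same leaves across the forest.
    S = compute_forest_similarity_matrix(forest, X)
    print(S.shape)  # (50, 50)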

In addition to providing a distance metric based on leaves, tree models provide a natural way to compute neighbors based on the splits. We provide an API for extracting the nearest neighbors from a tree model; its interface is similar to scikit-learn's NearestNeighbors.

NearestNeighborsMetaEstimator(estimator[, ...])

Meta-estimator for nearest neighbors.
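A hedged sketch, assuming the meta-estimator exposes the kneighbors interface it mirrors and accepts an n_neighbors keyword like scikit-learn's NearestNeighbors:

    from sklearn.datasets import make_classification

    from sktree import NearestNeighborsMetaEstimator, RandomForestClassifier

    X, y = make_classification(n_samples=100, n_features=10, random_state=0)
    forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

    # Wrap the fitted forest and query neighbors induced by its splits.
    nn = NearestNeighborsMetaEstimator(forest, n_neighbors=3)
    nn.fit(X)
    distances, indices = nn.kneighbors(X[:5])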

Statistical Hypothesis Testing#

We provide an API for performing statistical hypothesis testing using decision-tree models.

FeatureImportanceForestRegressor([...])

Forest hypothesis testing with continuous y variable.

FeatureImportanceForestClassifier([...])

Forest hypothesis testing with categorical y variable.

PermutationForestClassifier([estimator, ...])

Hypothesis testing of covariates with a permutation forest classifier.

PermutationForestRegressor([estimator, ...])

Hypothesis testing of covariates with a permutation forest regressor.

build_coleman_forest(est, perm_est, X, y[, ...])

Build a hypothesis testing forest using a two-forest approach.

build_permutation_forest(est, perm_est, X, y)

Build a hypothesis testing forest using a permutation-forest approach.

build_hyppo_oob_forest(est, X, y[, verbose])

Build a hypothesis testing forest using out-of-bag (OOB) samples.

build_hyppo_cv_forest(est, X, y[, cv, ...])

Build a hypothesis testing forest using cross-validation.

PermutationHonestForestClassifier([...])

A forest classifier with a permutation over the dataset.
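A hedged sketch of the two-forest approach; the sktree.stats import path and the shape of the returned result are assumptions, while the positional arguments follow the signature listed above:

    from sklearn.datasets import make_classification

    from sktree import HonestForestClassifier
    from sktree.stats import (
        PermutationHonestForestClassifier,
        build_coleman_forest,
    )

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)

    est = HonestForestClassifier(n_estimators=100, random_state=0)
    perm_est = PermutationHonestForestClassifier(n_estimators=100, random_state=0)

    # Compare the observed forest against a forest fit with permuted
    # covariates to test whether X carries signal about y.
    result = build_coleman_forest(est, perm_est, X, y)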

Datasets#

We provide some convenience functions for simulating datasets beyond those offered in scikit-learn.

make_gaussian_mixture(centers, covariances)

Two-view Gaussian mixture model dataset generator.

make_joint_factor_model(n_views, n_features)

Joint factor model data generator.

make_quadratic_classification(n_samples, ...)

Simulate classification data from a quadratic model.

make_trunk_classification(n_samples[, ...])

Generate trunk binary classification dataset.

make_trunk_mixture_classification(n_samples)

Generate trunk mixture binary classification dataset.

make_marron_wand_classification(n_samples[, ...])

Generate Marron-Wand binary classification dataset.

approximate_clf_mutual_information(means, covs)

Approximate MI for multivariate Gaussian for a classification setting.

approximate_clf_mutual_information_with_monte_carlo(...)

Approximate MI for multivariate Gaussian for a classification setting, via Monte Carlo integration.
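For example, a hedged sketch of one generator; the sktree.datasets import path and the n_dim parameter name are assumptions, and the (X, y) return pair is assumed from the scikit-learn make_* convention:

    from sktree.datasets import make_trunk_classification

    # Trunk-style problem: two Gaussian classes whose mean separation
    # decays across dimensions.
    X, y = make_trunk_classification(n_samples=100, n_dim=10)
    print(X.shape, y.shape)  # (100, 10) (100,)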

Experimental Functionality#

We also include experimental functionality that is a work in progress.

mutual_info_ksg(X, Y[, Z, k, metric, ...])

Compute the generalized (conditional) mutual information KSG estimate.

conditional_resample(conditional_array, *arrays)

Conditionally resample arrays or sparse matrices in a consistent way.
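A hedged sketch of the KSG estimator on strongly dependent variables (the sktree.experimental import path is an assumption; the k keyword follows the signature listed above):

    import numpy as np

    from sktree.experimental import mutual_info_ksg

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 1))
    Y = X + 0.1 * rng.normal(size=(500, 1))  # Y is a noisy copy of X

    # k is the number of nearest neighbors used by the KSG estimate.
    mi = mutual_info_ksg(X, Y, k=5)
    print(mi)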

We also include functions that help simulate and evaluate mutual information (MI) and conditional mutual information (CMI) estimators. Specifically, these functions simulate multivariate Gaussian data and compute the analytical solutions for the entropy, MI, and CMI of Gaussian distributions.

simulate_multivariate_gaussian([mean, cov, ...])

Multivariate Gaussian simulation for testing entropy and MI estimators.

simulate_helix([radius_a, radius_b, ...])

Simulate data from a helix.

simulate_sphere([radius, noise_func, alpha, ...])

Simulate samples generated on a sphere.

mi_gaussian(cov)

Compute mutual information of a multivariate Gaussian.

cmi_gaussian(cov, x_index, y_index, z_index)

Compute the analytical CMI for a multivariate Gaussian distribution.

entropy_gaussian(cov)

Compute entropy of a multivariate Gaussian.
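For example, a hedged sketch of the analytic helpers (the sktree.experimental import path is an assumption; the functions take a covariance matrix, per the signatures above):

    import numpy as np

    from sktree.experimental import entropy_gaussian, mi_gaussian

    # Bivariate Gaussian with correlation rho = 0.5; the analytic MI is
    # -0.5 * log(1 - rho**2), about 0.144 nats.
    cov = np.array([[1.0, 0.5],
                    [0.5, 1.0]])
    print(mi_gaussian(cov))
    print(entropy_gaussian(cov))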