sktree.stats.FeatureImportanceForestRegressor

class sktree.stats.FeatureImportanceForestRegressor(estimator=None, random_state=None, verbose=0, test_size=0.2, sample_dataset_per_tree=False, conditional_perm=False, permute_forest_fraction=None, train_test_split=True)

Forest hypothesis testing with continuous y variable.

Implements the algorithm described in [1].

The dataset is first split into training and testing sets. Two forests are then trained: one on the original dataset and one on the permuted dataset. The dataset is either permuted once, or independently for each tree in the permuted forest. The original test statistic is computed by comparing the metric on both forests (metric_forest - metric_perm_forest).

Then the output predictions are randomly sampled to recompute the test statistic n_repeats times. The p-value is computed as the proportion of times the null test statistic is greater than the original test statistic.

Parameters:
estimator : object, default=None

Type of forest estimator to use. If None, defaults to sklearn.ensemble.RandomForestRegressor.

random_state : int, RandomState instance or None, default=None

Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.

verbose : int, default=0

Controls the verbosity when fitting and predicting.

test_size : float, default=0.2

Proportion of samples per tree to use for the test set.

sample_dataset_per_tree : bool, default=False

Whether to sample the dataset per tree or per forest.

conditional_perm : bool, default=False

Whether to conditionally permute the covariate index. If True, the covariate index is permuted while preserving its joint distribution with the rest of the covariates.

permute_forest_fraction : float, default=None

The fraction of trees to permute the covariate index for. If None, then just one permutation is performed. If sampling a permutation per tree is desirable, then the fraction should be set to 1. / n_estimators.

train_test_split : bool, default=True

Whether to split the dataset before passing to the forest.

Notes

This class trains two forests: one on the original dataset, and one on the permuted dataset. The forest from the original dataset is cached and re-used to compute the test statistic each time the test() method is called. However, the forest from the permuted dataset is re-trained each time test() is called if the covariate_index differs from the previous run.

To fully start from a new dataset, call the reset method, which will then re-train both forests upon calling the test() and statistic() methods.
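The caching rule described in these notes can be modelled with a small toy class. This is only a sketch of the described behaviour, not the library's code: `ToyCachedTester` and its counters are hypothetical names, and training a forest is replaced by incrementing a counter.

```python
class ToyCachedTester:
    """Toy model of the caching rule in the Notes (hypothetical, not sktree code)."""

    def __init__(self):
        self.n_observed_fits = 0  # times the forest on the original data was trained
        self.n_permuted_fits = 0  # times the forest on the permuted data was trained
        self._observed_fitted = False
        self._last_covariate_index = None

    def test(self, covariate_index):
        key = tuple(covariate_index)
        # The forest on the original data is trained once and then re-used.
        if not self._observed_fitted:
            self.n_observed_fits += 1
            self._observed_fitted = True
        # The permuted forest is re-trained only when covariate_index changes.
        if key != self._last_covariate_index:
            self.n_permuted_fits += 1
            self._last_covariate_index = key

    def reset(self):
        # Forget both forests so the next test() call re-trains everything.
        self._observed_fitted = False
        self._last_covariate_index = None


tester = ToyCachedTester()
tester.test([0])     # trains both forests
tester.test([0])     # same covariate_index: nothing is re-trained
tester.test([1, 2])  # new covariate_index: only the permuted forest is re-trained
tester.reset()
tester.test([1, 2])  # after reset(): both forests are trained again
print(tester.n_observed_fits, tester.n_permuted_fits)  # prints: 2 3
```

In practice this means that testing several covariate sets on the same data pays the cost of the original forest only once, provided reset is not called in between.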

References

Attributes:
estimator_ : BaseForest

The estimator used to compute the test statistic.

n_samples_test_ : int

The number of samples used in the final test set.

indices_train_ : ArrayLike of shape (n_samples_train,)

The indices of the samples used in the training set.

indices_test_ : ArrayLike of shape (n_samples_test,)

The indices of the samples used in the testing set.

samples_ : ArrayLike of shape (n_samples_final,)

The indices of the samples used in the final test set that would slice the original (X, y) input.

y_true_final_ : ArrayLike of shape (n_samples_final,)

The true labels of the samples used in the final test.

observe_posteriors_ : ArrayLike of shape (n_estimators, n_samples, n_outputs) or (n_estimators, n_samples, n_classes)

The predicted posterior probabilities of the samples used in the final test. A sample whose values are NaN for all estimators was not used in the test set of any tree.

null_dist_ : ArrayLike of shape (n_repeats,)

The null distribution of the test statistic.
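As an illustration of the NaN convention for observe_posteriors_, the sketch below finds the samples that were predicted by at least one tree and averages their per-tree predictions. It uses plain Python lists with a single output for brevity. `used_sample_indices` and `mean_ignoring_nan` are hypothetical helpers, not part of sktree, and averaging over only the trees that saw a sample is an assumption of this sketch.

```python
import math


def used_sample_indices(posteriors):
    """Indices of samples with a non-NaN prediction from at least one tree.

    posteriors: list of per-tree lists, shape (n_estimators, n_samples).
    """
    n_samples = len(posteriors[0])
    return [
        j for j in range(n_samples)
        if any(not math.isnan(tree[j]) for tree in posteriors)
    ]


def mean_ignoring_nan(posteriors, sample):
    """Average the predictions for one sample over the trees that saw it."""
    values = [tree[sample] for tree in posteriors if not math.isnan(tree[sample])]
    return sum(values) / len(values)


nan = float("nan")
# Three trees, four samples; sample 2 was never in any tree's test set.
posteriors = [
    [1.0, nan, nan, 4.0],
    [nan, 2.0, nan, 6.0],
    [3.0, 4.0, nan, nan],
]
samples = used_sample_indices(posteriors)
averaged = [mean_ignoring_nan(posteriors, j) for j in samples]
print(samples)   # [0, 1, 3]
print(averaged)  # [2.0, 3.0, 5.0]
```

Here the retained indices play the role of samples_, and their count corresponds to n_samples_final.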

Methods

statistic(X, y[, covariate_index, metric, ...])

Compute the test statistic.

test(X, y[, covariate_index, metric, ...])

Perform hypothesis test using the Coleman method.

reset

statistic(X, y, covariate_index=None, metric='mse', return_posteriors=False, check_input=True, **metric_kwargs)

Compute the test statistic.

Parameters:
X : ArrayLike of shape (n_samples, n_features)

The data matrix.

y : ArrayLike of shape (n_samples, n_outputs)

The target matrix.

covariate_index : ArrayLike of shape (n_covariates,), optional

The index array of covariates to shuffle, by default None.

metric : str, optional

The metric to compute, by default “mse”.

return_posteriors : bool, optional

Whether or not to return the posteriors, by default False.

check_input : bool, optional

Whether or not to check the input, by default True.

**metric_kwargs : dict, optional

Additional keyword arguments to pass to the metric function.

Returns:
stat : float

The test statistic.

posterior_final : ArrayLike of shape (n_estimators, n_samples_final, n_outputs) or (n_estimators, n_samples_final), optional

If return_posteriors is True, the posterior probabilities of the samples used in the final test. n_samples_final is equal to n_samples if all samples are encountered in the test set of at least one tree in the posterior computation.

samples : ArrayLike of shape (n_samples_final,), optional

The indices of the samples used in the final test. n_samples_final is equal to n_samples if all samples are encountered in the test set of at least one tree in the posterior computation.
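To illustrate what shuffling the columns named in covariate_index means, here is a minimal sketch of a marginal (unconditional) permutation on a list-of-rows matrix. `permute_columns` is a hypothetical helper, and moving the selected columns together under a single row permutation is an assumption of this sketch rather than something this page specifies.

```python
import random


def permute_columns(X, covariate_index, seed=None):
    """Return a copy of X with the chosen columns shuffled across rows."""
    rng = random.Random(seed)
    row_order = list(range(len(X)))
    rng.shuffle(row_order)
    X_perm = [list(row) for row in X]  # copy so the input is left untouched
    for new_row, source_row in enumerate(row_order):
        for col in covariate_index:
            X_perm[new_row][col] = X[source_row][col]
    return X_perm


X = [[0, 10, 100], [1, 11, 101], [2, 12, 102], [3, 13, 103]]
X_perm = permute_columns(X, covariate_index=[1], seed=0)

# Columns 0 and 2 keep their order; column 1 holds the same values in a
# shuffled order, which breaks its link with y and the other columns.
print([row[0] for row in X_perm], [row[2] for row in X_perm])
```

A forest trained on such a permuted matrix can no longer use the shuffled covariates to predict y, which is what the comparison against the original forest measures.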

test(X, y, covariate_index=None, metric='mi', n_repeats=1000, return_posteriors=True, **metric_kwargs)

Perform hypothesis test using the Coleman method.

X is split into training and testing sets. Optionally, the covariate index columns are shuffled.

On the training dataset, two honest forests are trained and then the posterior is estimated on the testing dataset. One honest forest is trained on the permuted dataset and the other is trained on the original dataset.

Finally, resample the posteriors of the two forests to compute the null distribution of the statistics.

Parameters:
X : ArrayLike of shape (n_samples, n_features)

The data matrix.

y : ArrayLike of shape (n_samples, n_outputs)

The target matrix.

covariate_index : ArrayLike of shape (n_covariates,), optional

The index array of covariates to shuffle. By default (None), all columns are shuffled.

metric : str, optional

The metric to compute, by default “mi”.

n_repeats : int, optional

Number of times to sample the null distribution, by default 1000.

return_posteriors : bool, optional

Whether or not to return the posteriors, by default True.

**metric_kwargs : dict, optional

Additional keyword arguments to pass to the metric function.

Returns:
stat : float

The test statistic. To compute the test statistic, take permute_stat_ and subtract observe_stat_.

pval : float

The p-value of the test statistic.
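The resampling step that produces pval can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: `mse`, `forest_prediction`, and `resampling_test` are hypothetical helpers, the metric is fixed to mean squared error, the statistic uses the permuted-minus-observed sign convention given for stat in this Returns section, and the resampling unit (whole trees exchanged at random between the two forests) is assumed.

```python
import random


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def forest_prediction(tree_preds):
    """Average per-tree predictions; tree_preds has shape (n_trees, n_samples)."""
    n_trees = len(tree_preds)
    return [sum(col) / n_trees for col in zip(*tree_preds)]


def resampling_test(y_true, preds_observed, preds_permuted, n_repeats=1000, seed=None):
    rng = random.Random(seed)
    n_trees = len(preds_observed)

    def statistic(first, second):
        # Assumed sign convention: metric of the second (permuted) forest
        # minus metric of the first (observed) forest.
        return mse(y_true, forest_prediction(second)) - mse(y_true, forest_prediction(first))

    stat = statistic(preds_observed, preds_permuted)
    pooled = list(preds_observed) + list(preds_permuted)
    null_dist = []
    for _ in range(n_repeats):
        rng.shuffle(pooled)  # randomly reassign trees to the two forests
        null_dist.append(statistic(pooled[:n_trees], pooled[n_trees:]))
    # Proportion of null statistics greater than the original statistic.
    pval = sum(null > stat for null in null_dist) / n_repeats
    return stat, pval, null_dist


# Toy per-tree predictions: the observed forest tracks y, the permuted one
# only predicts the mean of y plus noise.
data_rng = random.Random(0)
y_true = [float(i) for i in range(20)]
y_mean = sum(y_true) / len(y_true)
preds_observed = [[y + data_rng.gauss(0, 0.5) for y in y_true] for _ in range(10)]
preds_permuted = [[y_mean + data_rng.gauss(0, 0.5) for _ in y_true] for _ in range(10)]

stat, pval, null_dist = resampling_test(
    y_true, preds_observed, preds_permuted, n_repeats=200, seed=1
)
print(stat, pval)
```

With an error metric such as MSE, a large positive stat under this convention means the permuted forest predicts worse than the original one, so few null statistics exceed it and pval is small. For a metric where higher is better, the direction of the comparison would need to be reversed.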

property train_test_samples_

The subset of drawn samples for each base estimator.

Returns a dynamically generated list of indices identifying the samples used for fitting each member of the ensemble, i.e., the in-bag samples.

Note: the list is re-created at each call to the property in order to reduce the object memory footprint by not storing the sampling data. Thus fetching the property may be slower than expected.