sktree.stats.FeatureImportanceForestRegressor
- class sktree.stats.FeatureImportanceForestRegressor(estimator=None, random_state=None, verbose=0, test_size=0.2, sample_dataset_per_tree=False, conditional_perm=False, permute_forest_fraction=None, train_test_split=True)
Forest hypothesis testing with continuous y variable.
Implements the algorithm described in [1].
The dataset is first split into a training set and a testing set. Two forests are then trained: one on the original dataset and one on the permuted dataset. The dataset is either permuted once or independently for each tree in the permuted forest. The original test statistic is computed by comparing the metric on both forests: (metric_forest - metric_perm_forest).
The output predictions are then randomly sampled to recompute the test statistic n_repeats times. The p-value is computed as the proportion of times the null test statistic is greater than the original test statistic.
- Parameters:
- estimator : object, default=None
Type of forest estimator to use. By default None, which defaults to sklearn.ensemble.RandomForestRegressor.
- random_state : int, RandomState instance or None, default=None
Controls both the randomness of the bootstrapping of the samples used when building trees (if bootstrap=True) and the sampling of the features to consider when looking for the best split at each node (if max_features < n_features). See Glossary for details.
- verbose : int, default=0
Controls the verbosity when fitting and predicting.
- test_size : float, default=0.2
Proportion of samples per tree to use for the test set.
- sample_dataset_per_tree : bool, default=False
Whether to sample the dataset per tree or per forest.
- conditional_perm : bool, default=False
Whether or not to conditionally permute the covariate index. If True, then the covariate index is permuted while preserving the joint distribution with respect to the rest of the covariates.
- permute_forest_fraction : float, default=None
The fraction of trees to permute the covariate index for. If None, then just one permutation is performed. If sampling a permutation per tree is desirable, then the fraction should be set to 1. / n_estimators.
- train_test_split : bool, default=True
Whether to split the dataset before passing to the forest.
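The procedure in the class description (a statistic of the form metric_forest - metric_perm_forest, recomputed n_repeats times on resampled predictions) can be illustrated with a small standard-library sketch. This is not the library's implementation. The helper names (neg_mse, forest_predict, resampled_pvalue) are hypothetical. The sketch also assumes that the null distribution is built by pooling the per-tree predictions of both forests and randomly reassigning them, and that the metric is a score where larger is better (negative MSE here), so that the "greater than" comparison gives a small p-value for an informative feature.

```python
import random


def neg_mse(y_true, y_pred):
    """Negative mean squared error, so that larger values mean a better fit."""
    return -sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def forest_predict(tree_preds):
    """Average per-tree predictions; tree_preds has shape (n_trees, n_samples)."""
    return [sum(column) / len(tree_preds) for column in zip(*tree_preds)]


def resampled_pvalue(y_true, preds_forest, preds_perm_forest, n_repeats=1000, seed=0):
    """Return (observed statistic, null distribution, p-value)."""
    rng = random.Random(seed)

    def statistic(group_a, group_b):
        # metric_forest - metric_perm_forest, as in the class description.
        return neg_mse(y_true, forest_predict(group_a)) - neg_mse(
            y_true, forest_predict(group_b)
        )

    observed = statistic(preds_forest, preds_perm_forest)

    # Assumption: pool the trees of both forests and randomly reassign them
    # to two groups of the original sizes to simulate the null hypothesis.
    pooled = list(preds_forest) + list(preds_perm_forest)
    n_forest = len(preds_forest)
    null_dist = []
    for _ in range(n_repeats):
        rng.shuffle(pooled)
        null_dist.append(statistic(pooled[:n_forest], pooled[n_forest:]))

    # Proportion of null statistics greater than the observed statistic.
    p_value = sum(stat > observed for stat in null_dist) / n_repeats
    return observed, null_dist, p_value


# Toy data: the original forest tracks y, the permuted forest predicts the mean.
y = [0.0, 1.0, 2.0, 3.0, 4.0]
y_mean = sum(y) / len(y)
preds_forest = [list(y) for _ in range(10)]
preds_perm_forest = [[y_mean] * len(y) for _ in range(10)]
observed, null_dist, p_value = resampled_pvalue(
    y, preds_forest, preds_perm_forest, n_repeats=200
)
```

In this toy setting the original forest is strictly better than any mixed group of trees, so no resampled statistic exceeds the observed one and the p-value is small.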
Notes
This class trains two forests: one on the original dataset and one on the permuted dataset. The forest from the original dataset is cached and reused to compute the test statistic each time the test() method is called. However, the forest from the permuted dataset is retrained each time test() is called if the covariate_index differs from the previous run.
To fully start from a new dataset, call the reset method, which will then retrain both forests upon calling the test() and statistic() methods.
References
- Attributes:
- estimator_ : BaseForest
The estimator used to compute the test statistic.
- n_samples_test_ : int
The number of samples used in the final test set.
- indices_train_ : ArrayLike of shape (n_samples_train,)
The indices of the samples used in the training set.
- indices_test_ : ArrayLike of shape (n_samples_test,)
The indices of the samples used in the testing set.
- samples_ : ArrayLike of shape (n_samples_final,)
The indices of the samples used in the final test set that would slice the original (X, y) input.
- y_true_final_ : ArrayLike of shape (n_samples_final,)
The true labels of the samples used in the final test.
- observe_posteriors_ : ArrayLike of shape (n_estimators, n_samples, n_outputs) or (n_estimators, n_samples, n_classes)
The predicted posterior probabilities of the samples used in the final test. A sample whose entries are NaN for all estimators was not used in the test set of any tree.
- null_dist_ : ArrayLike of shape (n_repeats,)
The null distribution of the test statistic.
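The NaN convention described for observe_posteriors_ can be made concrete with a short sketch. It assumes a single output, so the array is simplified to a nested list of shape (n_estimators, n_samples). The helper names are hypothetical and not part of sktree.

```python
import math


def never_tested_samples(posteriors):
    """Indices of samples whose prediction is NaN for every estimator."""
    n_samples = len(posteriors[0])
    return [
        i
        for i in range(n_samples)
        if all(math.isnan(tree[i]) for tree in posteriors)
    ]


def nan_mean_over_estimators(posteriors):
    """Per-sample mean over estimators, ignoring NaN entries.

    A sample that no estimator evaluated keeps NaN as its mean.
    """
    means = []
    for column in zip(*posteriors):
        seen = [value for value in column if not math.isnan(value)]
        means.append(sum(seen) / len(seen) if seen else math.nan)
    return means


nan = math.nan
# Two estimators, four samples; sample 2 was in no tree's test set.
posteriors = [
    [1.0, nan, nan, 4.0],
    [3.0, 2.0, nan, nan],
]
dropped = never_tested_samples(posteriors)
means = nan_mean_over_estimators(posteriors)
```

Samples listed in dropped are the ones that would be excluded from the final test set, which is why n_samples_final can be smaller than n_samples.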
Methods
- statistic(X, y[, covariate_index, metric, ...]): Compute the test statistic.
- test(X, y[, covariate_index, metric, ...]): Perform hypothesis test using Coleman method.
- reset
- statistic(X, y, covariate_index=None, metric='mse', return_posteriors=False, check_input=True, **metric_kwargs)
Compute the test statistic.
- Parameters:
- X : ArrayLike of shape (n_samples, n_features)
The data matrix.
- y : ArrayLike of shape (n_samples, n_outputs)
The target matrix.
- covariate_index : ArrayLike of shape (n_covariates,), optional
The index array of covariates to shuffle, by default None.
- metric : str, optional
The metric to compute, by default “mse”.
- return_posteriors : bool, optional
Whether or not to return the posteriors, by default False.
- check_input : bool, optional
Whether or not to check the input, by default True.
- **metric_kwargs : dict, optional
Additional keyword arguments to pass to the metric function.
- Returns:
- stat : float
The test statistic.
- posterior_final : ArrayLike of shape (n_estimators, n_samples_final, n_outputs) or (n_estimators, n_samples_final), optional
If return_posteriors is True, then the posterior probabilities of the samples used in the final test. n_samples_final is equal to n_samples if all samples are encountered in the test set of at least one tree in the posterior computation.
- samples : ArrayLike of shape (n_samples_final,), optional
The indices of the samples used in the final test. n_samples_final is equal to n_samples if all samples are encountered in the test set of at least one tree in the posterior computation.
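To illustrate what shuffling the columns named by covariate_index means, here is a minimal standard-library sketch. It assumes the selected columns are permuted together across rows while all other columns stay in place. The function name shuffle_columns is hypothetical, and the library's own permutation logic (for example with conditional_perm=True) may differ.

```python
import random


def shuffle_columns(X, covariate_index, seed=0):
    """Return a copy of X with the selected columns permuted across rows."""
    rng = random.Random(seed)
    row_order = list(range(len(X)))
    rng.shuffle(row_order)

    # Copy every row so the input matrix is left untouched.
    X_perm = [list(row) for row in X]
    for new_row, source_row in enumerate(row_order):
        for col in covariate_index:
            X_perm[new_row][col] = X[source_row][col]
    return X_perm


X = [
    [1, 10, 100],
    [2, 20, 200],
    [3, 30, 300],
    [4, 40, 400],
]
# Shuffle only column 1; columns 0 and 2 keep their row alignment.
X_perm = shuffle_columns(X, covariate_index=[1])
```

Permuting a column breaks its association with y while keeping its marginal distribution, which is what lets the permuted forest serve as the baseline for the test statistic.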
- test(X, y, covariate_index=None, metric='mi', n_repeats=1000, return_posteriors=True, **metric_kwargs)
Perform hypothesis test using Coleman method.
X is split into training and testing sets. Optionally, the covariate index columns are shuffled.
On the training dataset, two honest forests are trained and then the posterior is estimated on the testing dataset. One honest forest is trained on the permuted dataset and the other is trained on the original dataset.
Finally, resample the posteriors of the two forests to compute the null distribution of the statistics.
- Parameters:
- X : ArrayLike of shape (n_samples, n_features)
The data matrix.
- y : ArrayLike of shape (n_samples, n_outputs)
The target matrix.
- covariate_index : ArrayLike of shape (n_covariates,), optional
The index array of covariates to shuffle; by default (None), all columns are shuffled.
- metric : str, optional
The metric to compute, by default “mse”.
- n_repeats : int, optional
Number of times to sample the null distribution, by default 1000.
- return_posteriors : bool, optional
Whether or not to return the posteriors, by default True.
- **metric_kwargs : dict, optional
Additional keyword arguments to pass to the metric function.
- Returns:
- property train_test_samples_
The subset of drawn samples for each base estimator.
Returns a dynamically generated list of indices identifying the samples used for fitting each member of the ensemble, i.e., the in-bag samples.
Note: the list is re-created at each call to the property in order to reduce the object memory footprint by not storing the sampling data. Thus fetching the property may be slower than expected.
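The trade-off in the note (recompute on access instead of storing the sampling data) can be sketched with a hypothetical class. TinyEnsemble is not part of sktree. It assumes each member keeps only an integer seed and that the property returns one (train, test) pair of index lists per member.

```python
import random


class TinyEnsemble:
    """Hypothetical ensemble that stores one seed per member, not the indices."""

    def __init__(self, n_estimators, n_samples, test_size=0.2, seed=0):
        rng = random.Random(seed)
        self.n_samples = n_samples
        self.n_test = max(1, round(test_size * n_samples))
        # Only these small integers are kept in memory.
        self._seeds = [rng.randrange(2**32) for _ in range(n_estimators)]

    def _split(self, seed):
        """Regenerate the (train, test) indices of one member from its seed."""
        indices = list(range(self.n_samples))
        random.Random(seed).shuffle(indices)
        return indices[self.n_test:], indices[:self.n_test]

    @property
    def train_test_samples_(self):
        # Rebuilt on every access: lower memory use, more work per call.
        return [self._split(seed) for seed in self._seeds]


ensemble = TinyEnsemble(n_estimators=3, n_samples=10)
first = ensemble.train_test_samples_
second = ensemble.train_test_samples_
```

Because the splits are regenerated from fixed seeds, repeated accesses return equal but distinct lists. Code that reads the property inside a loop may want to fetch it once and reuse the result.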