sktree.datasets.make_trunk_classification

sktree.datasets.make_trunk_classification(n_samples, n_dim=10, n_informative=10, m_factor=-1, rho=0, band_type='ma', return_params=False, mix=0, seed=None)

Generate trunk dataset.

For each dimension in the first distribution, there is a mean of \(1 / d\), where d is the dimensionality. The covariance is the identity matrix. The second distribution has a mean vector that is the negative of the first. As d increases, the two distributions become closer and closer.

See full details in [1].

Instead of the identity covariance matrix, one can use a banded covariance matrix that follows [2], controlled by the rho and band_type parameters.
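The construction described above can be sketched directly in NumPy. This is an illustrative sketch, not the library's implementation: the helper name sample_two_gaussians is hypothetical, the even split between the two classes and the assignment of label 0 to the first distribution are assumptions, and the mean vector is passed in explicitly rather than derived from n_dim.

```python
import numpy as np


def sample_two_gaussians(n_samples, mu, m_factor=-1, seed=None):
    """Hypothetical helper: sample from N(mu, I) and N(m_factor * mu, I)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=np.float64)

    # Assumption: the samples are split as evenly as possible between classes.
    n_first = n_samples // 2
    n_second = n_samples - n_first

    # Identity covariance: add standard-normal noise to each mean vector.
    X = np.vstack(
        [
            mu + rng.standard_normal((n_first, mu.size)),
            m_factor * mu + rng.standard_normal((n_second, mu.size)),
        ]
    )
    # Assumption: the first distribution is labelled 0, the second 1.
    y = np.concatenate(
        [np.zeros(n_first, dtype=np.intp), np.ones(n_second, dtype=np.intp)]
    )
    return X, y


d = 10
mu = np.full(d, 1.0 / d)  # a mean of 1 / d in every dimension, as described
X, y = sample_two_gaussians(200, mu, seed=0)
```

Under this reading, with m_factor=-1 the two mean vectors are 2 / sqrt(d) apart, so the classes overlap more as d grows.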

Parameters:
n_samplesint

Number of samples to generate.

n_dimint, optional

The dimensionality of the dataset and the number of unique labels, by default 10.

n_informativeint, optional

The number of informative dimensions, by default 10. The remaining n_dim - n_informative dimensions are uniform noise.

m_factorint, optional

The multiplicative factor to apply to the mean-vector of the first distribution to obtain the mean-vector of the second distribution. By default -1.

rhofloat, optional

The covariance value of the bands. By default 0, indicating that an identity matrix is used.

band_typestr

The band type to use. For details, see Examples 1 and 2 in [2]. Either ‘ma’ or ‘ar’, by default ‘ma’.

return_paramsbool, optional

Whether or not to return the distribution parameters of the classes' normal distributions, by default False.

mixfloat, optional

Controls the mixing of the Gaussians. Should be a value between 0 and 1, by default 0.

seedint, optional

Random seed, by default None.
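To make the rho and band_type parameters more concrete, here is a minimal sketch of a banded covariance matrix. Reference [2] is not reproduced here, so the sketch assumes the common MA(1) structure (rho on the first off-diagonals, zero elsewhere) and AR(1) structure (rho ** |i - j|). The function name banded_covariance is hypothetical, and the library may build its matrices differently.

```python
import numpy as np


def banded_covariance(n_dim, rho=0.0, band_type="ma"):
    """Hypothetical helper: build an MA(1)- or AR(1)-style covariance matrix."""
    idx = np.arange(n_dim)
    lag = np.abs(idx[:, None] - idx[None, :])  # |i - j| for every entry

    # AR(1)-style decay: rho ** |i - j|; the diagonal is 1 because x ** 0 == 1.
    cov = np.power(float(rho), lag)

    if band_type == "ma":
        # MA(1)-style band: keep only the diagonal and first off-diagonals.
        cov[lag > 1] = 0.0
    elif band_type != "ar":
        raise ValueError(f"band_type must be 'ma' or 'ar', got {band_type!r}")
    return cov


cov_identity = banded_covariance(4)  # rho=0 reduces to the identity
cov_ma = banded_covariance(4, rho=0.5, band_type="ma")
cov_ar = banded_covariance(4, rho=0.5, band_type="ar")
```

With rho=0 both band types collapse to the identity matrix, which matches the default behaviour described for rho.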

Returns:
Xnp.ndarray of shape (n_samples, n_dim), dtype=np.float64

Trunk dataset as a dense array.

ynp.ndarray of shape (n_samples,), dtype=np.intp

Labels of the dataset.

meanslist of ArrayLike of shape (n_dim,), dtype=np.float64

The mean vector for each class starting with class 0. Returned if return_params is True.

covslist of ArrayLike of shape (n_dim, n_dim), dtype=np.float64

The covariance for each class. Returned if return_params is True.
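A short usage sketch based on the signature and return values documented above. The argument values are arbitrary, and the import is guarded so that the snippet does nothing when sktree is not installed.

```python
try:
    from sktree.datasets import make_trunk_classification
except ImportError:
    # sktree is not available in this environment; skip the demonstration.
    make_trunk_classification = None

if make_trunk_classification is not None:
    # Default call: only the data matrix and the labels are returned.
    X, y = make_trunk_classification(
        n_samples=100, n_dim=5, n_informative=5, seed=0
    )
    print(X.shape, y.shape)

    # With return_params=True, the class means and covariances follow X and y.
    X, y, means, covs = make_trunk_classification(
        n_samples=100, n_dim=5, n_informative=5, return_params=True, seed=0
    )
    print(len(means), covs[0].shape)
```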

References