treeple.datasets.make_trunk_classification

treeple.datasets.make_trunk_classification(n_samples, n_dim=4096, n_informative=256, mu_0=0, mu_1=1, rho=0, band_type='ma', return_params=False, scaling_factor=1.0, seed=None)

Generate trunk binary classification dataset.

For the first distribution, the mean of the i-th dimension is \(1 / \sqrt{i}\) for \(i = 1, \dots, d\), where d is the dimensionality. The covariance is the identity matrix. The second distribution has a mean vector that is the negative of the first. As d increases, each added dimension separates the two distributions less than the one before it.

Full details for the trunk simulation can be found in [1].

Instead of the identity covariance matrix, one can implement a banded covariance matrix that follows [2].

Parameters:
n_samples : int

Number of samples to generate. Must be an even number; otherwise the total number of samples generated will be n_samples - 1.

n_dim : int, optional

The dimensionality of the dataset, by default 4096.

n_informative : int, optional

The number of informative dimensions. The remaining n_dim - n_informative dimensions are Gaussian noise. Default is 256.

mu_0 : int, optional

The mean of the first distribution. By default 0. The mean is divided by sqrt(i) in each dimension i.

mu_1 : int, optional

The mean of the second distribution. By default 1. The mean is divided by sqrt(i) in each dimension i.

rho : float, optional

The covariance value of the bands. By default 0, indicating that an identity matrix is used.

band_type : str

The band type to use, either 'ma' or 'ar'. By default 'ma'. For details, see Examples 1 and 2 in [2].

return_params : bool, optional

Whether or not to return the parameters of each class's normal distribution. Default is False.

scaling_factor : float, optional

The scaling factor for the covariance matrix. By default 1.

seed : int, optional

Random seed, by default None.

Returns:
X : np.ndarray of shape (n_samples, n_dim), dtype=np.float64

Trunk dataset as a dense array.

y : np.ndarray of shape (n_samples,), dtype=np.intp

Labels of the dataset.

means : list of ArrayLike of shape (n_dim,), dtype=np.float64

The mean vector for each class starting with class 0. Returned if return_params is True.

covs : list of ArrayLike of shape (n_dim, n_dim), dtype=np.float64

The covariance for each class. Returned if return_params is True.
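As a rough illustration of the inputs and outputs described above, the following dependency-free sketch re-creates the documented behaviour for the identity-covariance case only. It is not the library's implementation: the helper name sketch_trunk, the small default sizes, the use of nested lists instead of arrays, and the row ordering (class 0 first, then class 1) are all assumptions made for the example.

```python
import math
import random


def sketch_trunk(n_samples, n_dim=8, n_informative=4, mu_0=0.0, mu_1=1.0, seed=None):
    """Hypothetical stand-in for the documented behaviour (identity covariance only)."""
    rng = random.Random(seed)
    # Half of the samples go to each class, so an odd n_samples loses one sample.
    n_per_class = n_samples // 2
    X, y = [], []
    for label, mu in ((0, mu_0), (1, mu_1)):
        # Informative dimensions have mean mu / sqrt(i) for i = 1..n_informative.
        means = [mu / math.sqrt(i) for i in range(1, n_informative + 1)]
        # The remaining dimensions are zero-mean Gaussian noise.
        means += [0.0] * (n_dim - n_informative)
        for _ in range(n_per_class):
            # Unit variance, independent dimensions (identity covariance).
            X.append([rng.gauss(m, 1.0) for m in means])
            y.append(label)
    return X, y


X, y = sketch_trunk(n_samples=7, seed=0)
print(len(X), len(X[0]), y)  # 6 8 [0, 0, 0, 1, 1, 1]
```

Requesting 7 samples yields 6 rows, matching the note on n_samples that an odd value produces n_samples - 1 samples.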

Notes

Trunk: The trunk simulation decreases the signal-to-noise ratio as the dimensionality increases. This is implemented by dividing the mean of each distribution by sqrt(i) in dimension i. For instance, if the means of distributions one and two are 1 and -1 respectively, the means in the first dimension are 1 and -1, in the second dimension 1/sqrt(2) and -1/sqrt(2), and so on.
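The decay of the means can be checked numerically. The short sketch below uses a hypothetical helper, trunk_mean_vector, which is not part of the library, and reproduces the 1 and -1 example for four dimensions.

```python
import math


def trunk_mean_vector(mu, n_informative):
    """Hypothetical helper: mean in dimension i is mu / sqrt(i), i = 1..n_informative."""
    return [mu / math.sqrt(i) for i in range(1, n_informative + 1)]


mean_a = trunk_mean_vector(1.0, 4)   # 1, 1/sqrt(2), 1/sqrt(3), 1/2
mean_b = trunk_mean_vector(-1.0, 4)  # the negative of mean_a

# Gap between the two class means in each dimension; it shrinks as i grows.
gaps = [a - b for a, b in zip(mean_a, mean_b)]
print(gaps)
```

The gap falls from 2 in the first dimension to 1 in the fourth, so each later dimension contributes less to separating the classes.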

Trunk Overlap: The trunk overlap simulation generates two classes of data with the same covariance matrix and mean vector of zeros.

Covariance: The covariance among dimensions is controlled by the rho and band_type parameters. band_type selects the type of band to use, while rho sets the covariance value between neighbouring dimensions; with rho = 0 the covariance is the identity matrix.
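A minimal sketch of two banded structures follows. It assumes that 'ma' denotes a first-order moving-average band (rho on the first off-diagonals only) and 'ar' a first-order autoregressive structure (rho ** |i - j|). These are common conventions and are assumptions here, not statements about the library's code; see [2] for the exact forms the function follows. The helper names ma_cov and ar_cov are hypothetical.

```python
def ma_cov(n, rho):
    """Assumed MA(1)-style band: 1 on the diagonal, rho on the first off-diagonals."""
    return [
        [1.0 if i == j else (rho if abs(i - j) == 1 else 0.0) for j in range(n)]
        for i in range(n)
    ]


def ar_cov(n, rho):
    """Assumed AR(1)-style structure: entry (i, j) equals rho ** |i - j|."""
    return [[float(rho) ** abs(i - j) for j in range(n)] for i in range(n)]


for row in ma_cov(3, 0.5):
    print(row)
for row in ar_cov(3, 0.5):
    print(row)
```

Under both assumed forms, rho = 0 reduces to the identity matrix, which is consistent with the description of the rho parameter.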

References

Examples using treeple.datasets.make_trunk_classification

Calculating S@98

Calculating MI

Calculating pAUC

Calculating Hellinger Distance

Calculating p-value (MIGHT)

Calculating S@98 with multiview data

Calculating CMI

Calculating p-value with multiview data (CoMIGHT)