treeple.datasets.make_marron_wand_classification

treeple.datasets.make_marron_wand_classification(n_samples, n_dim=4096, n_informative=256, simulation='gaussian', rho=0, band_type='ma', return_params=False, scaling_factor=1.0, seed=None)

Generate Marron-Wand binary classification dataset.

The simulation is similar to that of treeple.datasets.make_trunk_classification(): the first class is generated from a multivariate Gaussian with a mean vector of zeros. The second class is generated from a mixture of Gaussians with mean vectors specified by the Marron-Wand simulations, but as the dimensionality increases, the second class distribution approaches the first class distribution at a rate of \(1 / \sqrt{d}\).
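To give a feel for the \(1 / \sqrt{d}\) rate mentioned above, the following minimal sketch tabulates that factor for a few dimensionalities. It is plain arithmetic rather than a call into treeple, and the helper name `shrink_factor` is hypothetical.

```python
import math


def shrink_factor(d):
    """Return 1 / sqrt(d), the rate quoted in the description above."""
    return 1.0 / math.sqrt(d)


# Tabulate the factor for a few dimensionalities, including the default n_dim.
for d in (1, 16, 256, 4096):
    print(f"d={d:5d}  1/sqrt(d)={shrink_factor(d):.6f}")
```

At the default n_dim of 4096 the factor is already 1/64, which suggests why the two classes become hard to separate in high dimensions.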

Full details for the Marron-Wand simulations can be found in [1].

Instead of the identity covariance matrix, one can use a banded covariance matrix that follows [2].

Parameters:
n_samples : int

Number of samples to generate. Must be an even number; otherwise the total number of samples generated will be n_samples - 1.

n_dim : int, optional

The dimensionality of the dataset, by default 4096.

n_informative : int, optional

The number of informative dimensions. The remaining n_dim - n_informative dimensions are Gaussian noise. Default is 256.

simulation : str, optional

Which simulation to run. Must be one of the following Marron-Wand simulations: ‘gaussian’, ‘skewed_unimodal’, ‘strongly_skewed’, ‘kurtotic_unimodal’, ‘outlier’, ‘bimodal’, ‘separated_bimodal’, ‘skewed_bimodal’, ‘trimodal’, ‘claw’, ‘double_claw’, ‘asymmetric_claw’, ‘asymmetric_double_claw’, ‘smooth_comb’, ‘discrete_comb’. When calling the Marron-Wand simulations, only the covariance parameters are considered (rho and band_type). Means are taken from [1]. By default ‘gaussian’.

rho : float, optional

The covariance value of the bands. By default 0, indicating that an identity matrix is used.

band_type : str, optional

The band type to use, either ‘ma’ or ‘ar’. For details, see Examples 1 and 2 in [2]. By default ‘ma’.

return_params : bool, optional

Whether to return the distribution parameters of the classes' normal distributions. By default False.

scaling_factor : float, optional

The scaling factor for the covariance matrix. By default 1.

seed : int, optional

Random seed, by default None.
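The n_samples rule in the parameter list is consistent with each of the two classes receiving `n_samples // 2` rows. The sketch below assumes exactly that split; the helper `split_samples` is hypothetical and only illustrates the arithmetic, not the library's internals.

```python
def split_samples(n_samples):
    """Assumed split: each of the two classes gets n_samples // 2 rows.

    Returns (rows per class, total rows actually generated).
    """
    per_class = n_samples // 2
    return per_class, 2 * per_class


print(split_samples(100))  # even request: nothing is dropped
print(split_samples(101))  # odd request: one sample is dropped
```

Under this assumption, passing an even n_samples is the only way to get back exactly the number of rows requested.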

Returns:
X : np.ndarray of shape (n_samples, n_dim), dtype=np.float64

Marron-Wand dataset as a dense array.

y : np.ndarray of shape (n_samples,), dtype=np.intp

Labels of the dataset.

G : np.ndarray of shape (n_samples, n_dim), dtype=np.float64

The mixture of Gaussians for the Marron-Wand simulations. Returned if return_params is True.

w : np.ndarray of shape (n_dim,), dtype=np.float64

The weight vector for the Marron-Wand simulations. Returned if return_params is True.

Notes

Marron-Wand Simulations: The Marron-Wand simulations generate two classes of data with the setup specified in the paper.
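As an illustration of what a Marron-Wand mixture looks like, the sketch below draws from a one-dimensional two-component Gaussian mixture. The weights, means, and standard deviations are those of the bimodal density as commonly stated from the Marron-Wand paper, \(\tfrac{1}{2} N(-1, (2/3)^2) + \tfrac{1}{2} N(1, (2/3)^2)\). The helper `sample_mixture` is hypothetical and is not how treeple generates its data.

```python
import numpy as np

# Bimodal mixture: 1/2 N(-1, (2/3)^2) + 1/2 N(1, (2/3)^2).
WEIGHTS = np.array([0.5, 0.5])
MEANS = np.array([-1.0, 1.0])
STDS = np.array([2.0 / 3.0, 2.0 / 3.0])


def sample_mixture(n, seed=None):
    """Draw n values from the 1D Gaussian mixture defined above."""
    rng = np.random.default_rng(seed)
    # Pick a mixture component for every draw, then sample from that component.
    component = rng.choice(len(WEIGHTS), size=n, p=WEIGHTS)
    return rng.normal(loc=MEANS[component], scale=STDS[component])


x = sample_mixture(20_000, seed=0)
# Theoretical mean is 0 and theoretical variance is 1 + 4/9.
print(x.mean(), x.var())
```

Swapping in a different set of weights, means, and standard deviations gives the other mixtures in the family without changing the sampling logic.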

Covariance: The covariance matrix among different dimensions is controlled by the rho parameter and the band_type parameter. The band_type parameter controls the type of band to use, while the rho parameter controls the specific scaling factor for the covariance matrix while going from one dimension to the next.
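The two band types can be sketched in NumPy as follows. This assumes that an autoregressive ('ar') band has entries `rho ** |i - j|` and that a moving-average ('ma') band keeps only the diagonal and first off-diagonals of that matrix. The helper `banded_cov` is hypothetical, ignores scaling_factor, and may differ from the library's implementation.

```python
import numpy as np


def banded_cov(n_dim, rho, band_type="ma"):
    """Assumed form: rho ** |i - j|, cut off beyond lag 1 for 'ma'."""
    idx = np.arange(n_dim)
    lag = np.abs(idx[:, None] - idx[None, :])  # |i - j| for every pair
    cov = float(rho) ** lag
    if band_type == "ma":
        cov[lag > 1] = 0.0  # keep only the diagonal and first off-diagonals
    elif band_type != "ar":
        raise ValueError("band_type must be 'ma' or 'ar'")
    return cov


print(banded_cov(4, 0.5, "ma"))
print(banded_cov(4, 0.5, "ar"))
```

With rho set to 0, both variants of this sketch reduce to the identity matrix, which matches the documented default.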

For each dimension in the first distribution, there is a mean of \(1 / d\), where d is the dimensionality. The covariance is the identity matrix.

The second distribution has a mean vector that is the negative of the first. As d increases, the two distributions become closer and closer. Full details for the trunk simulation can be found in [3].

References