treeple.datasets.make_marron_wand_classification
- treeple.datasets.make_marron_wand_classification(n_samples, n_dim=4096, n_informative=256, simulation='gaussian', rho=0, band_type='ma', return_params=False, scaling_factor=1.0, seed=None)
Generate Marron-Wand binary classification dataset.
The simulation is similar to that of
treeple.datasets.make_trunk_classification()
where the first class is generated from a multivariate Gaussian with a mean vector of 0’s. The second class is generated from a mixture of Gaussians with mean vectors specified by the Marron-Wand simulations, but as the dimensionality increases, the second class distribution approaches the first class distribution by a factor of \(1 / \sqrt{d}\). Full details for the Marron-Wand simulations can be found in [1].
Instead of the identity covariance matrix, one can implement a banded covariance matrix that follows [2].
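The two band types can be pictured with a small pure-Python sketch. It assumes that ‘ma’ denotes a moving-average structure (a tridiagonal matrix with rho next to the diagonal) and that ‘ar’ denotes an autoregressive structure (entries rho^|i - j|). Both readings are assumptions about the examples in [2], not taken from the treeple source, and the helper names are hypothetical.

```python
def moving_average_cov(n_dim, rho):
    """Hypothetical helper: 1 on the diagonal, rho on the adjacent bands, 0 elsewhere."""
    return [
        [1.0 if i == j else (rho if abs(i - j) == 1 else 0.0) for j in range(n_dim)]
        for i in range(n_dim)
    ]


def autoregressive_cov(n_dim, rho):
    """Hypothetical helper: entry (i, j) decays geometrically as rho ** |i - j|."""
    return [[float(rho) ** abs(i - j) for j in range(n_dim)] for i in range(n_dim)]


# Assumed structures only; with rho = 0 both reduce to the identity matrix.
for row in moving_average_cov(3, 0.5):
    print(row)
for row in autoregressive_cov(3, 0.5):
    print(row)
```

Under these assumptions, the moving-average form correlates only neighbouring dimensions, while the autoregressive form correlates every pair with a strength that shrinks with distance.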
- Parameters:
- n_samples
int
Number of samples to generate. Must be an even number; otherwise the total number of samples generated will be
n_samples - 1
.
- n_dim
int
, optional The dimensionality of the dataset, by default 4096.
- n_informative
int
, optional The number of informative dimensions. The remaining
n_dim - n_informative
dimensions are Gaussian noise. Default is 256.
- simulation
str
, optional Which simulation to run. Must be one of the following Marron-Wand simulations: ‘gaussian’, ‘skewed_unimodal’, ‘strongly_skewed’, ‘kurtotic_unimodal’, ‘outlier’, ‘bimodal’, ‘separated_bimodal’, ‘skewed_bimodal’, ‘trimodal’, ‘claw’, ‘double_claw’, ‘asymmetric_claw’, ‘asymmetric_double_claw’, ‘smooth_comb’, ‘discrete_comb’. When calling the Marron-Wand simulations, only the covariance parameters are considered (
rho
and
band_type
). Means are taken from [1]. By default ‘gaussian’.
- rho
float
, optional The covariance value of the bands. By default 0, indicating that an identity matrix is used.
- band_type
str
, optional The band type to use, either ‘ma’ or ‘ar’. For details, see Examples 1 and 2 in [2]. By default ‘ma’.
- return_params
bool
, optional Whether or not to return the distribution parameters of the classes’ normal distributions. By default False.
- scaling_factor
float
, optional The scaling factor for the covariance matrix. By default 1.
- seed
int
, optional Random seed, by default None.
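The even-number rule for n_samples can be read as an equal split between the two classes. The sketch below assumes each class receives n_samples // 2 samples, which is consistent with the documented total of n_samples - 1 for odd inputs but is not confirmed against the treeple source; the helper name is hypothetical.

```python
def samples_per_class(n_samples):
    """Hypothetical helper: split n_samples evenly across the two classes."""
    per_class = n_samples // 2  # integer division drops the odd sample
    total = 2 * per_class       # number of rows actually generated
    return per_class, total


print(samples_per_class(100))  # (50, 100)
print(samples_per_class(101))  # (50, 100)
```

Passing an even n_samples avoids silently losing a sample.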
- Returns:
- X
np.ndarray
of shape (n_samples, n_dim), dtype=np.float64 Trunk dataset as a dense array.
- y
np.ndarray
of shape (n_samples,), dtype=np.intp Labels of the dataset.
- G
np.ndarray
of shape (n_samples, n_dim), dtype=np.float64 The mixture of Gaussians for the Marron-Wand simulations. Returned if
return_params
is True.
- w
np.ndarray
of shape (n_dim,), dtype=np.float64 The weight vector for the Marron-Wand simulations. Returned if
return_params
is True.
Notes
Marron-Wand Simulations: The Marron-Wand simulations generate two classes of data with the setup specified in [1].
Covariance: The covariance matrix among different dimensions is controlled by the
rho
parameter and the
band_type
parameter. The
band_type
parameter controls the type of band to use, while the
rho
parameter controls the specific scaling factor for the covariance matrix while going from one dimension to the next.
For each dimension in the first distribution, there is a mean of \(1 / d\), where
d
is the dimensionality. The covariance is the identity matrix.
The second distribution has a mean vector that is the negative of the first. As
d
increases, the two distributions become closer and closer. Full details for the trunk simulation can be found in [3].
References