Compare extra oblique forest and oblique random forest predictions on cc18 datasets

A performance comparison between the extra oblique forest and the standard oblique random forest using five datasets from the OpenML CC18 benchmarking suite.

An extra oblique forest uses extra oblique trees as its base estimators, which differ from classic decision trees in how they are built. When looking for the best split to separate the samples of a node into two groups, one random split is drawn for each of the max_features randomly selected features, and the best split among those candidates is chosen. This contrasts with the greedy approach, which searches for the best possible threshold of each candidate feature. For details of the original extra-trees algorithm, see [1].
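In code, the split rule can be sketched as follows. This is a minimal, axis-aligned sketch for clarity, not treeple's implementation (which applies the same idea to random oblique projections of the features); gini and random_split are illustrative helpers, not treeple APIs:

import numpy as np


def gini(y):
    # Gini impurity of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)


def random_split(X, y, max_features, rng):
    # Extra-trees rule: ONE random threshold per candidate feature,
    # then keep the best of those candidates.
    n_samples, n_features = X.shape
    best = None
    for j in rng.choice(n_features, size=max_features, replace=False):
        lo, hi = X[:, j].min(), X[:, j].max()
        if lo == hi:  # constant feature, no valid split
            continue
        t = rng.uniform(lo, hi)  # a single random threshold for this feature
        left = X[:, j] <= t
        # weighted impurity of the two children
        score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / n_samples
        if best is None or score < best[0]:
            best = (score, j, t)
    return best  # (impurity, feature index, threshold)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)
print(random_split(X, y, max_features=3, rng=rng))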

The datasets used in this example, taken from the OpenML CC18 benchmarking suite, are:

Large datasets are subsampled due to computational constraints in running this example. Note that cnae-9 is a high-dimensional dataset with 856 very sparse features, mostly consisting of zeros.

Dataset            # Samples   # Features   Datatype
Phishing Website        2000           30    nominal
WDBC                     455           30    numeric
Lsvt                     100          310    numeric
har                     2000          561    numeric
cnae-9                   864          856    numeric

Note

In the following example, the parameters max_depth and max_features are deliberately set low so that the example can run in our CI test suite. For normal usage, these parameters should be set to values appropriate for the dataset (see the illustrative settings sketched below).
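For instance, outside a CI budget one might start from something like the following; these values are illustrative assumptions, not tuned recommendations:

from treeple import ExtraObliqueRandomForestClassifier

# Hypothetical starting point for real use -- tune per dataset.
clf = ExtraObliqueRandomForestClassifier(
    n_estimators=500,     # more trees stabilize the ensemble's estimates
    max_depth=None,       # grow trees fully instead of capping depth at 3
    max_features="sqrt",  # common default for classification
    n_jobs=-1,
    random_state=123,
)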

Discussion

Extra Oblique Tree demonstrates performance similar to that of the regular Oblique Tree on average, with some increase in variance. See [1] for a detailed discussion of the bias-variance tradeoff of extra-trees versus standard trees.
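One quick way to inspect that spread is the per-model standard deviation of the cross-validation scores in the df assembled by the script below:

# assumes the benchmarking script below has populated `df`
print(df.groupby(["dataset", "model"])["score"].std())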

However, the Extra Oblique Tree runs substantially faster than the Oblique Tree on some datasets, because the random split process omits the computationally expensive search for the best split. The main speedup stems from skipping the sorting of samples when splitting a node. In standard trees, samples are sorted in ascending order to determine the best split, so the complexity is O(n log n). In extra trees, samples are not sorted and the split is determined by randomly drawing a threshold from the feature's range, so the complexity is O(n). This makes the algorithm more suitable for large datasets.
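A toy contrast of the two threshold-selection strategies (illustrative only, not treeple's internals):

import numpy as np


def greedy_candidate_thresholds(x):
    # Greedy search: sort the feature (the O(n log n) step), then every
    # midpoint between consecutive sorted values is a candidate to evaluate.
    xs = np.sort(x)
    return (xs[:-1] + xs[1:]) / 2.0


def random_threshold(x, rng):
    # Extra-trees: one uniform draw from the feature's range -- finding the
    # range and partitioning the samples are both single O(n) passes.
    return rng.uniform(x.min(), x.max())


rng = np.random.default_rng(0)
x = rng.normal(size=10)
print(len(greedy_candidate_thresholds(x)), "greedy candidates vs one random draw:",
      random_threshold(x, rng))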

To see how sample sizes affect the performance of Extra Oblique Trees versus regular Oblique Trees, see Speed of Extra Oblique Random Forest vs Oblique Random Forest on different dataset sizes.

References

[1] P. Geurts, D. Ernst, and L. Wehenkel, "Extremely randomized trees", Machine Learning, 63(1), 3-42, 2006.

Script output (parameters, then mean cross-validation time per model):
             value
max_features  sqrt
n_estimators    50
max_depth        3
random_state   123
n_cv            10
n_repeats        1
n_jobs          -1
                                    time_taken
dataset          dimension   model
PhishingWebsites (2000, 30)  EORF     1.372836
                             ORF      1.369735
cnae-9           (864, 856)  EORF     1.289198
                             ORF      1.296513
har              (2000, 561) EORF     1.452092
                             ORF      2.133561
lsvt             (100, 310)  EORF     1.191948
                             ORF      1.319485
wdbc             (455, 30)   EORF     1.224105
                             ORF      1.230045

from datetime import datetime

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.datasets import fetch_openml
from sklearn.model_selection import RepeatedKFold, cross_validate

from treeple import ExtraObliqueRandomForestClassifier, ObliqueRandomForestClassifier

# Model parameters
max_depth = 3
max_features = "sqrt"
max_sample_size = 2000
random_state = 123
n_estimators = 50

# OpenML data ids of the benchmark datasets
phishing_website = 4534
wdbc = 1510
lsvt = 1484
har = 1478
cnae_9 = 1468

data_ids = [phishing_website, wdbc, lsvt, har, cnae_9]
df = pd.DataFrame()


def load_cc18(data_id):
    # fetch_openml returns a Bunch; keep it distinct from the module-level df
    bunch = fetch_openml(data_id=data_id, as_frame=True, parser="pandas")

    # extract the dataset name
    d_name = bunch.details["name"]

    # Subsample large datasets: 80% of the rows, capped at max_sample_size
    n = min(int(bunch.frame.shape[0] * 0.8), max_sample_size)

    frame = bunch.frame.sample(n, random_state=random_state)
    X, y = frame.iloc[:, :-1], frame.iloc[:, -1]

    return X, y, d_name


def get_scores(X, y, d_name, n_cv=5, n_repeats=1, **kwargs):
    clfs = [ExtraObliqueRandomForestClassifier(**kwargs), ObliqueRandomForestClassifier(**kwargs)]
    dim = X.shape
    tmp = []

    for i, clf in enumerate(clfs):
        t0 = datetime.now()
        cv = RepeatedKFold(n_splits=n_cv, n_repeats=n_repeats, random_state=kwargs["random_state"])
        test_score = cross_validate(estimator=clf, X=X, y=y, cv=cv, scoring="accuracy")
        # time_taken covers the full cross-validation run for this model
        time_taken = datetime.now() - t0
        # convert the time taken to seconds
        time_taken = time_taken.total_seconds()

        # EORF = Extra Oblique Random Forest, ORF = Oblique Random Forest
        tmp.append(
            [
                d_name,
                dim,
                ["EORF", "ORF"][i],
                test_score["test_score"],
                test_score["test_score"].mean(),
                time_taken,
            ]
        )

    df = pd.DataFrame(tmp, columns=["dataset", "dimension", "model", "score", "mean", "time_taken"])
    df = df.explode("score")
    df["score"] = df["score"].astype(float)
    df.reset_index(inplace=True, drop=True)

    return df


params = {
    "max_features": max_features,
    "n_estimators": n_estimators,
    "max_depth": max_depth,
    "random_state": random_state,
    "n_cv": 10,
    "n_repeats": 1,
    "n_jobs": -1,
}

for data_id in data_ids:
    X, y, d_name = load_cc18(data_id=data_id)
    tmp = get_scores(X=X, y=y, d_name=d_name, **params)
    df = pd.concat([df, tmp])

# Show the parameters and the mean cross-validation time per model
print(pd.DataFrame.from_dict(params, orient="index", columns=["value"]))
print(df.groupby(["dataset", "dimension", "model"])[["time_taken"]].mean())

# Draw a comparison plot
d_names = df.dataset.unique()
N = d_names.shape[0]

fig, ax = plt.subplots(1, N)
fig.set_size_inches(6 * N, 6)

for i, name in enumerate(d_names):
    sns.stripplot(
        data=df.query(f'dataset == "{name}"'),
        x="model",
        y="score",
        ax=ax[i],
        dodge=True,
    )
    sns.boxplot(
        data=df.query(f'dataset == "{name}"'),
        x="model",
        y="score",
        ax=ax[i],
        color="white",
    )
    ax[i].set_title(name)
    if i != 0:
        ax[i].set_ylabel("")
    ax[i].set_xlabel("")
# show the figure
plt.show()

Total running time of the script: (0 minutes 32.932 seconds)

Estimated memory usage: 370 MB
