Ensemble Learning with Annealing (QBoost)

Ensemble Learning with Annealing (QBoost)#

QBoost is one of the ensemble learning methods using quantum annealing. Ensemble learning is a method that prepares a large number of weak predictors and combines the predictions of each predictor to obtain a final accurate prediction result.

Overview of QBoost#

QBoost is an algorithm whose goal is to accurately identify the nature of the input signal \(x\). Let us consider the problem of which of the two values \(\pm 1\) should be assigned to the input signal. As an example, imagine a task where \(x\) represents image data and we want to identify whether what we see in the image is a dog or a cat. In ensemble learning, the goal is to achieve better prediction accuracy (boosting) by using multiple predictors. Here, we have many predictors that do not perform well (weak predictors). Note that by poorly performing, it means that they often do not produce correct outputs for their inputs. Let the output of these predictors be \(c_i (x) \in \{ -1, 1\} \ (i=0, 1, \dots, N-1)\). The basic idea is that by summing the outputs of several weak predictors, we can make better predictions. This can be expressed as:

\[ C(x) = \mathrm{sgn} \left( \sum_{i=0}^{N-1} w_i c_i (x) \right) \tag{1} \]

\(w_i \in \{0, 1\}\) indicates whether the \(i\)th predictor is used or not. Let us identify which predictor gives better performance with as few weak predictors as possible. We will use supervised learning to find the optimal \(\{w_i\}\) pairs. We have a large number of \((x^{(d)}, y^{(d)}) \ (d= 0, 1, \dots, D-1)\) of supervised data (\(D \gg 1\)), and we adjust \(\{w_i\}\) to reproduce them as closely as possible. In other words, we aim to minimize the following Hamiltonian for \(\{w_i\}\):

\[ H(w) = \sum_{d=0}^{D-1} \left( \frac{1}{N} \sum_{i=0}^{N-1} w_i c_i (x^{(d)}) - y^{(d)}\right)^2 + \lambda \sum_{i=0}^{N-1} w_i \tag{2} \]

Through this minimization of the Hamiltonian, the difference from the supervised data \(y^{(d)}\) is made as small as possible. If we use the right-hand side of equation (1) as it is, it cannot come down to the Ising model because it is not a quadratic form of \(w_i\) due to the sign function. Therefore, the problem is made to minimize the square of the difference between the \(1/N\) times the sign function argument \(\sum_i w_i c_i\) and the supervised data \(y^{(d)}\). The coefficient of \(1/N\) is to adjust so that the difference from \(y^{(d)}= \pm 1\) is not too large, since the maximum value of \(\sum_i w_i c_i(x)\) is \(N\). The term with \(\lambda (>0)\) applied represents the regularization term to efficiently construct a relatively small number of weak predictors without setting too many \(w_i\) to 1.

Formulation with JijModeling#

Let us formulate the Hamiltonian described above using JijModeling. First, we define the necessary variables.

import jijmodeling as jm

problem = jm.Problem("QBoost")

c = problem.Integer("c", ndim=2)
y = problem.Integer("y", ndim=1)
N = problem.DependentVar("N", c.len_at(0))
D = problem.DependentVar("D", c.len_at(1))
w = problem.BinaryVar("w", shape=(N, ))
lamb = problem.Float("lamb", latex="\lambda")

Here, c is defined as the output of weak predictors (a 2D integer array), y as the training labels (a 1D integer array), N as the number of predictors (automatically determined from the first dimension of c), D as the number of training samples (automatically determined from the second dimension of c), w as a binary variable indicating whether each predictor is used, and lamb as the regularization parameter \(\lambda\).

Hamiltonian#

Let us formulate the Hamiltonian above.

# set objective function 1: minimize the sum of differences
obj1 = jm.sum(D, lambda d: (jm.sum(N, lambda i: w[i]*c[i, d])/N - y[d])**2)
# set objective function 2: minimize regularization term
obj2 = lamb * w[:].sum()

problem += obj1 + obj2

w[:].sum() allows us to implement \(\sum_i w_i\) concisely.

Let us display the mathematical model in the Jupyter Notebook.

problem

\[\begin{split}\begin{array}{rl} \text{Problem}\colon &\text{QBoost}\\\displaystyle \min &\displaystyle \sum _{d=0}^{D-1}{{\left(\left(\sum _{i=0}^{N-1}{{w}_{i}\cdot {c}_{i,d}}\right)\div N-{y}_{d}\right)}^{2}}+\lambda\cdot \left(\sum _{\vec{\imath }}{{{\left({w}_{\left(\colon \right)}\right)}}_{\vec{\imath }}}\right)\\&\\\text{where}&\\&\text{Decision Variables:}\\&\qquad \begin{alignedat}{2}w&\in \mathop{\mathrm{Array}}\left[N;\left\{0, 1\right\}\right]&\quad &1\text{-dim binary variable}\\\end{alignedat}\\&\\&\text{Placeholders:}\\&\qquad \begin{alignedat}{2}c&\in \mathop{\mathrm{Array}}\left[(-)\times (-);\mathbb{Z}\right]&\quad &2\text{-dimensional array of placeholders with elements in }\mathbb{Z}\\\lambda&\in \mathbb{R}&\quad &\text{A scalar placeholder in }\mathbb{R}\\y&\in \mathop{\mathrm{Array}}\left[(-);\mathbb{Z}\right]&\quad &1\text{-dimensional array of placeholders with elements in }\mathbb{Z}\\\end{alignedat}\\&\\&\text{Dependent Variables:}\\&\qquad \begin{alignedat}{2}D&=\mathop{\mathtt{len\_{}at}}\left(c,1\right)&\quad &\in \mathbb{N}\\N&=\mathop{\mathtt{len\_{}at}}\left(c,0\right)&\quad &\in \mathbb{N}\\\end{alignedat}\end{array} \end{split}\]

Creating an Instance#

Let us set up the task to be executed. We use the decision stump (one-layer decision tree) from scikit-learn as the weak predictor. We also use the scikit-learn cancer identification dataset.

import numpy as np

from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier as DTC


def prediction_from_train(N, X_train, y_train, X_test):
    # set the number of ensembles to be taken out for one sample in bootstrap sampling
    sample_train = 40
    # set model
    models = [DTC(splitter="random", max_depth=1) for i in range(N)]
    for model in models:
        # extract randomly
        train_idx = np.random.choice(np.arange(X_train.shape[0]), sample_train)
        # make decision tree with variables
        model.fit(X=X_train[train_idx], y=y_train[train_idx])
    y_pred_list_train = []
    for model in models:
        # execute prediction with model
        y_pred_list_train.append(model.predict(X_train))
    y_pred_list_train = np.asanyarray(y_pred_list_train)
    y_pred_list_test = []
    for model in models:
        # execute with test data
        y_pred_list_test.append(model.predict(X_test))
    y_pred_list_test = np.array(y_pred_list_test)
    return y_pred_list_train, y_pred_list_test

# load data
cancer_data = datasets.load_breast_cancer()
# set the number of train data
num_train = 200
# add noise to feature
noisy_data = np.concatenate((cancer_data.data, np.random.rand(cancer_data.data.shape[0], 30)), axis=1)
# convert (0, 1) label to (-1, 1) and cast to integer
labels = ((cancer_data.target-0.5) * 2).astype(int)
# divide dataset to train and test
X_train = noisy_data[:num_train, :]
X_test = noisy_data[num_train:, :]
y_train = labels[:num_train]
y_test = labels[num_train:]
# set the number of classifer
inst_N = 20
# predict from train data using dicision tree classifier
y_pred_list_train, y_pred_list_test = prediction_from_train(inst_N, X_train, y_train, X_test)
# set lambda (coefficient of 2nd term)
inst_lamb = 10.0
# instance_data contains only the variables defined in the model
instance_data = {'y': y_train.astype(int), 'c': y_pred_list_train.astype(int), 'lamb': inst_lamb}

For demonstration purposes, we use data with the addition of noisy features. The prediction_from_train function is used to create weak predictors and the output \(c_i (x^{(d)})\) from those predictors. Here the number of weak predictors is 20 and the number of supervised data is 200. The value of \(\lambda\) in equation (2) is set to 10.0.

Running Optimization with OpenJij#

Let us solve the optimization problem using OpenJij’s simulated annealing.

from ommx_openjij_adapter import OMMXOpenJijSAAdapter

instance = problem.eval(instance_data)

adapter = OMMXOpenJijSAAdapter(instance)
best_sample = adapter.sample(
    instance, num_reads=100
).best_feasible_unrelaxed

Visualizing the Solution#

Finally, let us visualize the obtained solution.

from sklearn.metrics import accuracy_score

# Get selected classifier indices from the best solution
df = best_sample.decision_variables_df
w_indices = [row["subscripts"][0] for _, row in df.iterrows() if row["name"] == "w" and row["value"] > 0.5]

# Predict using selected classifiers
y_pred_train = np.sign(np.sum(y_pred_list_train[w_indices, :], axis=0))
y_pred_test = np.sign(np.sum(y_pred_list_test[w_indices, :], axis=0))

acc_train = accuracy_score(y_true=y_train, y_pred=y_pred_train)
acc_test = accuracy_score(y_true=y_test, y_pred=y_pred_test)
print(f"Train Accuracy of QBoost is {acc_train}")
print(f"Test Accuracy of QBoost is {acc_test}")

Train Accuracy of QBoost is 0.935
Test Accuracy of QBoost is 0.9051490514905149

In this example, relatively high accuracy was obtained for both the training and test data. At least from these results, we can confirm that QBoost is able to select weak predictors while achieving a certain level of classification performance.