Combining multiple learners

Tue 04 May 2021

In this project I will explore and compare bagging and boosting, two ensemble methods that commonly use tree models as their base learner. To show my understanding of these algorithms I will implement them from the ground up. The dataset consists of emails, some of which are labelled as spam and some are not, and the task is to predict whether a new, unseen email is spam or not.

Imports and loading of dataset

In [4]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import tree
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = [12, 8]
sns.set()
sns.set_palette("Set3")
sns.set_style('ticks')
np.set_printoptions(suppress=True,linewidth=np.inf)
In [5]:
# load dataset
data = np.loadtxt("data/spambase.data", delimiter=",")

X = data[:,:-1]
y = data[:,-1]

# split the data into training, validation and test sets (60/20/20)
X_train_val, X_test, y_train_val, y_test = train_test_split(X, y, test_size=0.20)
X_train, X_val, y_train, y_val = train_test_split(X_train_val, y_train_val, test_size=0.25)

Basic tree model

In [6]:
# create basic tree model
tree_model = tree.DecisionTreeClassifier()
tree_model = tree_model.fit(X_train, y_train)
basic_y_preds = tree_model.predict(X_val)

basic_accuracy = np.sum(basic_y_preds == y_val)/y_val.shape[0]

print(f'Basic model accuracy: {round(basic_accuracy, 3)}')
Basic model accuracy: 0.902

Bagging

Bagging uses bootstrapping to create a number of resampled versions of our dataset. We then fit a tree model to each of these versions and in the end let the committee of fitted trees vote on the final prediction.

In [7]:
def bootstrapping(X, y, n_samples):
    """Draw n_samples bootstrap samples (rows sampled with replacement) from (X, y)."""
    rng = np.random.default_rng(0)
    N = X.shape[0]
    X_y_concat = np.hstack([X, y.reshape(-1, 1)])  # append y as the last column so every row keeps its label

    Z = []
    for _ in range(n_samples):
        Z.append(rng.choice(X_y_concat, N))
    return Z

def bagging(Z, X_val, model):
    """Fit one model per bootstrap sample and combine the predictions by majority vote."""
    B = len(Z)
    N_val = X_val.shape[0]

    y_bag = np.zeros((N_val,1))
    for Z_b in Z:
        tree_model = model.fit(Z_b[:,:-1], Z_b[:,-1])  # last column of Z_b holds the labels
        y_bag += tree_model.predict(X_val).reshape(-1,1)

    k_vector = y_bag/B                      # fraction of trees voting for class 1
    g_bag = np.where(k_vector > 0.5, 1, 0)  # majority vote
    return g_bag.flatten()
In [8]:
# Bagging model accuracy using 200 bootstrap samples
n_samples = 200 
Z = bootstrapping(X_train, y_train, n_samples)

y_preds = bagging(Z, X_val, tree_model)

accuracy = np.sum(y_preds == y_val)/y_val.shape[0]
print(f'Bagging model accuracy: {round(accuracy,3)}')
Bagging model accuracy: 0.938
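
As a quick sanity check of the from-scratch implementation, the same idea can be run through scikit-learn's BaggingClassifier, which bags decision trees by default. This is only a rough comparison sketch; the random state and whatever accuracy it prints are not part of the results above.

from sklearn.ensemble import BaggingClassifier

# bag 200 decision trees on the same training split and score on the validation split
sk_bagging = BaggingClassifier(n_estimators=200, random_state=0)
sk_bagging.fit(X_train, y_train)
print(f'sklearn BaggingClassifier accuracy: {round(sk_bagging.score(X_val, y_val), 3)}')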
In [9]:
accuracy_list = [] # values to use in plot
for n_samples in range(1, 201):
    Z = bootstrapping(X_train, y_train, n_samples)

    y_preds = bagging(Z, X_val, tree_model)

    accuracy = np.sum(y_preds == y_val)/y_val.shape[0]
    accuracy_list.append(round(accuracy,3))
In [10]:
x = np.arange(1,201)
plt.title('Validation accuracy of the basic tree model vs the bagged trees using bootstrap samples')
plt.scatter(x=x, y=accuracy_list, facecolors='none', edgecolor=sns.color_palette('Set3')[3])
plt.axhline(y=basic_accuracy, linestyle=':', linewidth=2.5)
plt.ylabel('Accuracy')
plt.xlabel('Number of bootstrap samples')
plt.legend(['Basic tree model', 'Bagged trees'])
plt.show()
[Figure: validation accuracy of the bagged trees as a function of the number of bootstrap samples, with the basic tree model's accuracy shown as a dotted reference line]

Boosting

The boosting algorithm combines many so-called "weak learners" and, like bagging, aggregates their outcomes (in classification through a committee, here a weighted vote). The idea is to train a sequence of weak models on our dataset, in the case of a tree model a tree with only two leaves (also called a stump), and let each new iteration learn from the mistakes of the previous one by re-weighting the training samples. There are different versions of the boosting algorithm with different pros and cons; I have chosen to implement the AdaBoost.M1 algorithm. The re-weighting step is illustrated with a small toy example below.
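
To make the re-weighting step concrete, here is a tiny hand-computed illustration of a single AdaBoost.M1 round. The five samples and the misclassification pattern are made up purely for illustration.

# toy illustration of one AdaBoost.M1 round (numbers are made up)
w = np.full(5, 1/5)                           # initial sample weights, all equal
misclassified = np.array([0, 1, 0, 0, 1])     # indicator: which samples the stump got wrong
err = np.sum(w * misclassified) / np.sum(w)   # weighted error rate = 0.4
alpha = np.log((1 - err) / err)               # stump weight = log(1.5), roughly 0.405
w_new = w * np.exp(alpha * misclassified)     # misclassified samples are up-weighted by a factor of 1.5
print(err, alpha, w_new)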

In [12]:
class AdaBoost():
    def __init__(self):
        self.stumps = None
        self.stumps_weights = None
        self.sample_weights = None
        self.errors = None
    
    def _check_X_y(self, X, y):
        """ Validate assumptions about format of input data"""
        assert set(y.flatten()) == {-1, 1}, 'Response variable must be ±1'
        return X, y

    def initialize(self, X, y, iters):
        N = X.shape[0]

        self.sample_weights = np.zeros(shape=(iters,N))
        self.stumps = np.zeros(shape=iters, dtype=object)
        self.stumps_weights = np.zeros(shape=iters)
        self.errors = np.zeros(shape=iters)

        # initialize sample weights as 1/N
        self.sample_weights[0] = np.full(shape=N, fill_value=1/N)
        

    def fit(self, X, y, iters):
        X, y = self._check_X_y(X, y)
        y = y.flatten()
        N = X.shape[0]

        self.initialize(X, y, iters)

        for m in range(iters):
            # fit weak learner (a decision stump)
            curr_sample_weights = self.sample_weights[m]
            stump = tree.DecisionTreeClassifier(max_depth=1, max_leaf_nodes=2)
            stump.fit(X,y, sample_weight=curr_sample_weights)
            self.stumps[m] = stump

            # compute errors
            stump_preds = stump.predict(X)
            indicator = (y != stump_preds).astype(int)
            
            err = np.sum(curr_sample_weights * indicator) / np.sum(curr_sample_weights)
            self.errors[m] = err

            # compute stump weights
            stump_weight = np.log((1-err)/err)
            self.stumps_weights[m] = stump_weight

            # update sample weights
            new_sample_weights = curr_sample_weights * np.exp(stump_weight * indicator)
            if m+1 < iters:
                self.sample_weights[m + 1] = new_sample_weights
            
            
        return self
        
    # compute output
    def predict(self, X):
        stumps_preds = np.array([stump.predict(X) for stump in self.stumps])
        output = np.sign(self.stumps_weights @ stumps_preds)
        return output
In [13]:
# convert target values from {0, 1} to {-1, 1}, since the AdaBoost implementation expects ±1 labels
y_train[y_train == 0] = -1
y_val[y_val == 0] = -1 

ada = AdaBoost()
ada.fit(X_train, y_train, 200)
y_preds = ada.predict(X_val)
accuracy = np.sum(y_preds == y_val)/y_val.shape[0]
print(f'Boosting model accuracy: {round(accuracy,3)}')
Boosting model accuracy: 0.941
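
For comparison, scikit-learn's AdaBoostClassifier uses a depth-1 decision stump as its default base estimator. It implements the SAMME family of algorithms rather than plain AdaBoost.M1, so the numbers will not match exactly; this is just a rough sanity-check sketch.

from sklearn.ensemble import AdaBoostClassifier

# 200 boosting rounds on the same training split, scored on the validation split
sk_ada = AdaBoostClassifier(n_estimators=200, random_state=0)
sk_ada.fit(X_train, y_train)
print(f'sklearn AdaBoostClassifier accuracy: {round(sk_ada.score(X_val, y_val), 3)}')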
In [14]:
# collect boosting accuracy for different number of iterations
boosting_accuracy_list = []
for iterations in range(1, 201):
    ada = AdaBoost()
    ada.fit(X_train, y_train, iterations)
    boosting_y_preds = ada.predict(X_val)
    boosting_accuracy = np.sum(boosting_y_preds == y_val)/y_val.shape[0]
    boosting_accuracy_list.append(boosting_accuracy)
In [17]:
x = np.arange(1,201)
plt.title('Validation accuracy of the basic tree model vs the AdaboostM1 model')
plt.scatter(x=x, y=boosting_accuracy_list, facecolors='none', edgecolor=sns.color_palette('Set3')[3])
plt.axhline(y=basic_accuracy, linestyle=':', linewidth=2.5)
plt.ylabel('Accuracy')
plt.xlabel('Number of iterations')
plt.legend(['Basic tree model', 'AdaboostM1 model'])
plt.show()
[Figure: validation accuracy of the AdaboostM1 model as a function of the number of iterations, with the basic tree model's accuracy shown as a dotted reference line]

With 200 iterations the boosting model reaches a slightly higher validation accuracy than the bagging model (0.941 vs 0.938). As the plot shows, it actually starts out worse than the basic tree model but overtakes it at around 20 iterations, and the accuracy then keeps creeping upwards in very small increments as more iterations are added. There could therefore be potential in investigating even more iterations, but at this point there is a trade-off, as the model already takes quite some computational power to run.
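
A natural follow-up would be a final performance estimate on the held-out test set created at the start, which has not been touched so far. A minimal sketch, assuming we refit AdaBoost with 200 iterations and convert the test labels to ±1 in the same way as the training labels:

# final check on the untouched test set (sketch)
y_test_pm = np.where(y_test == 0, -1, 1)   # convert test labels from {0, 1} to {-1, 1}

ada_final = AdaBoost()
ada_final.fit(X_train, y_train, 200)
test_preds = ada_final.predict(X_test)
test_accuracy = np.sum(test_preds == y_test_pm) / y_test_pm.shape[0]
print(f'Boosting test accuracy: {round(test_accuracy, 3)}')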