Table of contents
- Introduction
- What is a Decision Tree?
Introduction
Welcome back, everyone! Apologies for the delay in continuing our ML algorithms series that started on January 1, 2024. But no worries, we are back on track and ready to speed things up! From now on, I’ll be uploading one machine learning algorithm-based blog every week.
So lets get started, In this blog post, we will build a Decision Tree model from scratch, explaining each and every step and later testing the model on Breast Cancer dataset. By the end, you’ll have a solid understanding of Decision Trees and how to implement them in code.
What is a Decision Tree?
A Decision Tree is a type of supervised learning algorithm used for both classification and regression tasks. It works by splitting the data into subsets based on the value of input features, making decisions at each node until reaching a final prediction at the leaf nodes. Lets understand this with the help of a hypothetical scenario.
The following diagram illustrates the flow of decision tree for decision making with labels.
Rain = Yes
No Rain = No
How Splitting Happens
The way a Decision Tree decides how to split the data involves different techniques:
Gini Impurity (for Classification):
- Think of Gini Impurity as a measure of how mixed up the labels are in a group. If you randomly pick an item from a group, Gini Impurity tells you the chance of it being mislabeled. Lower Gini Impurity means the group is more pure, with mostly the same labels.
Information Gain (for Classification):
- Information Gain is like tidying up messy information. It uses entropy, which is a measure of chaos or randomness. By splitting the data based on a feature, we aim to make the subsets more organized and less random. Higher Information Gain means the data becomes more ordered after the split. We'll use this technique for our implementation.
Mean Squared Error (for Regression):
- Imagine you're trying to predict someone's weight. Mean Squared Error measures how far off your predictions are from the actual weights, squared (to make all differences positive). Lower MSE means your predictions are closer to the truth, minimizing the overall error.
Mean Absolute Error (for Regression):
- Mean Absolute Error is similar to MSE, but instead of squaring the differences, we just take their absolute values. This gives us a measure of how much, on average, our predictions differ from the actual values. Lower MAE means our predictions are more accurate, with smaller errors on average.
These techniques help the Decision Tree decide the best way to split the data at each step, ensuring that the final tree is as accurate and efficient as possible.
Key Concepts
Nodes and Leaves: Each decision point in the tree is called a node, and the final output points are called leaves.
Splitting: Dividing a node into two or more sub-nodes based on certain criteria.
Entropy and Information Gain: Metrics used to decide the best split. Entropy measures the randomness / impurity in a dataset, and information gain calculates the reduction in entropy after the dataset is split on an attribute, so more the information gain the better our split is.
Stopping Criteria:
Maximum no of layer of nodes our decision tree can grow.
Minimum no of samples a node must have if samples are less then don’t split the node.
Minimum entropy change for a split to take place.
Step-by-Step Implementation
Step 1: Import Necessary Libraries
We’ll use NumPy for numerical operations, Pandas for data manipulation, and some utilities from Scikit-learn for loading datasets and splitting data.
import numpy as np
from collections import Counter
from sklearn import datasets
from sklearn.model_selection import train_test_split
Step 2: Define the Node Class
The Node
class represents each node in the tree.
class Node:
def __init__(self,feature=None,threshold=None,left=None,right=None,*,value=None):
self.feature = feature
self.threshold = threshold
self.left = left
self.right = right
self.value = value # value of node is Yes/No only incase of Leaf, else None
def is_leaf(self):
return self.value is not None # Returns true if the node is leaf else false
Each node stores information about the feature and threshold for splitting, pointers to left and right child nodes, and the value if it is a leaf node.
Step 3: Define the Decision Tree Class
class Decision_Tree:
def __init__(self, max_depth=10, min_sample_split=2, criteria='entropy', n_features=None):
self.max_depth = max_depth # Stopping Criteria
self.min_sample_split = min_sample_split # Stopping Criteria
self.criteria = criteria # Criteria type, entropy in our case.
self.n_features = n_features # No of features we'll be using for constructing the tree.
self.root = None
Training the Tree
The fit
method trains the tree on the provided dataset.
def fit(self, X_train, y_train):
self.n_features = X_train.shape[1] if not self.n_features else min(X_train.shape[1], self.n_features)
# So that the no of features in the tree do not exceed the actual no of features we have in data.
n_samples = X_train.shape[0] # No of samples.
self.root = self.construct_tree(X_train, y_train, self.n_features, n_samples)
Here
X_train.shape[0]
gives us no of samples/rows.X_train.shape[1]
gives us no of features/columns.
Constructing the Tree
The construct_tree
method recursively builds the tree by splitting the data.
def construct_tree(self, X, y, n_features, n_samples, depth=0):
# No of labels there are in a specific feature,
# if 1 then no need to split.
# For eg. in case of Wind, we can go to 2 labels (strong and weak) so we split.
labels = len(np.unique(y))
# Check the stopping criteria
# if met create a leaf node, based on label with max frequence
if depth >= self.max_depth or labels == 1 or n_samples < self.min_sample_split:
leaf_value = self.most_common_label(y)
return Node(value=leaf_value)
# Find the best split
feat_indexs = np.random.choice(n_features, self.n_features, replace=False)
best_threshold, best_feature = self.best_split(X, y, feat_indexs)
# Create child nodes (Recursively Create Tree)
left_indxs, right_indxs = self.split(X[:, best_feature], best_threshold)
left = self.construct_tree(X[left_indxs, :], y[left_indxs], n_samples, depth + 1)
right = self.construct_tree(X[right_indxs, :], y[right_indxs], n_samples, depth + 1)
return Node(best_feature, best_threshold, left, right)
def most_common_label(self,y):
c = Counter(y)
return c.most_common(1)[0][0]
Here we first check whether the stopping criteria is met.
If the stopping criteria is met, we'll create a leaf node.
If the stopping criteria is not met, we'll find the best split.
Create left and right child based on the best split.
Splitting the Data
The split
method divides the data based on the threshold.
def split(self,X_col,threshold):
# Left Split, val <= threshold
left_idxs = np.argwhere(X_col<=threshold).flatten()
# Right Split, val> threshold
right_idxs = np.argwhere(X_col>threshold).flatten()
# np.argwhere returns the indices in a list of lists, so we flatten the result.
return left_idxs, right_idxs
left_idxs
are the indices of X_col having val less than or equal to threshold.right_idxs
are the indices of X_col having val more than threshold.
Finding the Best Split
The best_split
method iterates over all features and thresholds to find the best split based on information gain.
def best_split(self,X,y,feat_indexs):
best_gain = -1
split_index = None
split_threshold = None
for feat_index in feat_indexs:
X_col = X[:, feat_index] # Values of feature X_col.
thresholds = np.unique(X_col) # Selecting all the unique values of X_col as thresholds.
for thr in thresholds:
# For each threshold in thresholds.
# Find the threshold with maximum Information Gain
gain = self.calculate_gain(X_col,y,thr)
if gain > best_gain:
best_gain = gain
split_index = feat_index
split_threshold = thr
return split_threshold, split_index
Initially our
best_gain
,split_index
,split_threshold
are set to -1, None, None.For a
X_col
we get all its unique vals in a list, and find which amongst them will be the best threshold for the split.
Calculating Entropy and Information Gain
In this blog we use Information gain for splitting the nodes in decision tree. Information Gain is calculated using the formula:
Information Gain = Entropy(Parent) - Weighted Average * Entropy(Children)
Where:
Entropy(Parent)
is the entropy of the parent node.Weighted Average
is the average of entropy of the children, weighted by their number of instances.Entropy(Children)
is the entropy of each child node.
Entropy is calculated using the formula:
Entropy = - ∑ [ p(x) * log(p(x)) ]
Where:
p(x)
is the probability of occurrence of an event, i.e. number of times classx
has occurred divided by total no of samples.p(x) = count(x) / n
def calculate_entropy(self, y):
hist = np.bincount(y) # returns a frequency list of elements, from 0 to max(y).
ps = hist / len(y) # [p(x1), p(x2), p(x3),..., p(xN)]
return -np.sum([p * np.log(p) for p in ps if p > 0]) # Only consider non-zero probabilities
def calculate_gain(self, X_col, y, threshold):
entropy_parent = self.calculate_entropy(y)
# Create children
left_idxs, right_idxs = self.split(X_col, threshold)
if len(left_idxs) == 0 or len(right_idxs) == 0:
return 0
# Calculate the weighted average entropy of the children.
n = len(y)
# No of samples in left and right.
n_l, n_r = len(left_idxs), len(right_idxs)
# Left entropy and right entropy.
e_l, e_r = self.calculate_entropy(y[left_idxs]), self.calculate_entropy(y[right_idxs])
# No of samples in left/total samples times left entropy + No of samples in right/total samples times right entropy.
weighted_avg_entropy_children = (n_l / n) * e_l + (n_r / n) * e_r
# Calculate Information Gain
info_gain = entropy_parent - weighted_avg_entropy_children
return info_gain
Making Predictions
The predict
method traverses the tree to make predictions on new data.
def predict(self, X):
return np.array([self.traverse_tree(x, self.root) for x in X])
def traverse_tree(self, X, node):
if node.is_leaf():
return node.value
if X[node.feature] <= node.threshold: # Recursively travel
return self.traverse_tree(X, node.left) # val <= threshold, Goto Left child.
return self.traverse_tree(X, node.right) # val > threshold, Goto Right child.
We send a data
X
, to predict function.The data travels the tree until it reaches the predicted leaf.
Step 4: Testing the Model
Finally, we test our Decision Tree model on the breast cancer dataset.
def mse(y1, y2):
return np.mean((y2 - y1)**2)
def accuracy(y1,y2):
return np.sum(y1==y2)/len(y1)
if __name__ == "__main__":
model = Decision_Tree()
data = datasets.load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)
error = mse(y_test, preds)
print('Error:', error)
Performance Evaluation on Breast Cancer Dataset
Testing Accuracy - 0.88 or 88%
Mean Squared Error - 0.114
Points to Note: Common Misconceptions
Misconception 1: Decision Trees Are Always Prone to Overfitting
While Decision Trees can be prone to overfitting, especially with deep trees that capture noise in the training data, this is not always the case. Proper pruning techniques and parameter tuning can significantly mitigate overfitting.
Explanation: The misconception arises because Decision Trees are flexible models that can adapt closely to the training data. However, by setting constraints like maximum depth and minimum samples per leaf, or by using ensemble methods like Random Forests, we can control overfitting effectively.
Misconception 2: Decision Trees Are Always Better with More Features
Adding more features to a Decision Tree does not always improve its performance. Irrelevant or redundant features can confuse the model and lead to poorer performance.
Explanation: Including too many features, especially those that do not contribute meaningful information, can lead to a more complex tree with unnecessary splits. Feature selection techniques or regularization methods can help identify the most relevant features for building a robust model.
When to Apply Decision Trees: Key Points to Consider
Non-linearity: Decision Trees do not require the relationship between input and output variables to be linear. They can very well handle non-linear relationships and interactions between features.
Handling of Mixed Data Types: Decision Trees can handle both numerical and categorical data, making them versatile for different types of datasets.
Handling Missing Values: Decision Trees can manage missing values in the data. During the splitting process, they can work without requiring imputation.
Robustness to Outliers: Decision Trees are relatively robust to outliers, as splits are based on feature thresholds that can separate outliers from the majority of the data.
Small to Medium-Sized Data: Decision Trees work well with small to medium-sized datasets. However, for very large datasets, ensemble methods like Random Forests or Gradient Boosting Trees might be more efficient, which we'll discuss soon.