Tree-Based Machine Learning Algorithms


Tree-based algorithms are considered among the best and most widely used supervised learning methods. They produce predictive models with high accuracy, stability, and ease of interpretation. They map non-linear relationships reasonably well, and they are flexible enough to solve almost any classification or regression problem at hand.

This article aims to distinguish between tree-based Machine Learning algorithms according to their complexity. Tree-based methods are an important family of supervised Machine Learning. They perform classification and regression tasks by constructing a tree-like structure that determines the class or value of the target variable according to the features.


Decision Tree algorithms are referred to as CARTs, which stands for Classification and Regression Trees. The term was suggested by Breiman et al. (1984). Classification tree analysis is used when the predicted outcome is the discrete class to which the data belongs. Regression tree analysis, in contrast, is used when the predicted outcome is a continuous variable, for example the price of a house or a patient's length of stay in a hospital.

Regression and classification trees have many similarities, but they also have several differences, for example the procedure used to decide where to split the tree. Classification trees split the data based on the concept of node purity: specifically, we aim to maximize the reduction of impurity at each split. For regression tasks, we typically fit the model to the set of explanatory variables, giving less weight to those features whose presence increases the prediction error at those nodes.
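As a quick illustration of the two task types, here is a minimal sketch using scikit-learn; the toy data and variable names are illustrative assumptions, not taken from the article:

```python
# Minimal sketch (assumes scikit-learn is installed): the same CART
# machinery handles both tasks; only the target type and criterion differ.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1], [2], [3], [10], [11], [12]]  # one toy feature

# Classification tree: the target is a discrete class.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X, ["low", "low", "low", "high", "high", "high"])

# Regression tree: the target is a continuous value.
reg = DecisionTreeRegressor(random_state=0)
reg.fit(X, [1.0, 1.1, 0.9, 10.2, 9.8, 10.0])

print(clf.predict([[2.5]]))   # a discrete class label
print(reg.predict([[11.0]]))  # a continuous value
```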

Pros and Cons of CARTs

  • They are easy to understand, and their output is easy to interpret.
  • CARTs are simple to use, and their flexibility gives them the capacity to describe non-linear relationships between features and labels.
  • CARTs do not need much pre-processing of the features before they are fed into the model as the training set.
  • However, CARTs also have several limitations. For example, a classification tree can only produce orthogonal decision boundaries.
  • CARTs are also very sensitive to small variations in the training set.
  • They also suffer from high variance when they are trained without constraints.
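The high-variance point can be seen in practice by capping tree depth. The snippet below is a sketch assuming scikit-learn, with invented noisy toy data:

```python
# Sketch (assumes scikit-learn): an unconstrained CART grows extra levels
# just to isolate a single noisy label, while max_depth limits that variance.
from sklearn.tree import DecisionTreeClassifier

X = [[i] for i in range(20)]
y = [0] * 10 + [1] * 10
y[5] = 1  # one noisy label

unlimited = DecisionTreeClassifier(random_state=0).fit(X, y)
limited = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

print(unlimited.get_depth())  # deeper, to isolate the noisy point
print(limited.get_depth())    # capped at 2
```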

Ensemble Methods

Ensemble methods build more than one decision tree. Below are some common ensemble methods.

Boosted trees

Boosted trees incrementally build an ensemble by training each new model to emphasize the training examples that previous models mis-modeled. A classic example is AdaBoost. Boosted trees may be used for both regression and classification problems.
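A sketch of AdaBoost with scikit-learn follows; the toy data is an assumption, and by default the base learner is a depth-1 decision "stump":

```python
# Sketch (assumes scikit-learn): each new stump is fit with higher weight
# on the training examples the previous stumps got wrong.
from sklearn.ensemble import AdaBoostClassifier

X = [[i] for i in range(10)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

ada = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)
print(ada.score(X, y))  # training accuracy on this separable toy set
```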

Bootstrap aggregated decision trees

These are also known as bagged decision trees. This early ensemble method builds many decision trees by repeatedly resampling the training data with replacement, and votes the trees for a consensus prediction. A random forest classifier is a particular type of bootstrap aggregating.

For example, suppose we train 5 Decision Tree models, each learning from a sub-sample of the training dataset. Some observations in the training dataset may be learned by several of the models, which would result in 5 similar Decision Tree models.
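The 5-tree example can be sketched with scikit-learn's BaggingClassifier (the data and parameters below are illustrative assumptions; the default base estimator is a decision tree):

```python
# Sketch (assumes scikit-learn): 5 decision trees, each fit on a bootstrap
# resample of the training data; their votes are aggregated at predict time.
from sklearn.ensemble import BaggingClassifier

X = [[i] for i in range(10)]
y = [0] * 5 + [1] * 5

bag = BaggingClassifier(n_estimators=5, random_state=0).fit(X, y)
print(bag.predict([[1], [8]]))  # majority vote of the 5 trees
```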

Random Forest

Random Forest generates a number of models at the same time, independently; each model is independent of the others. We can still improve model accuracy using Boosting. In a rotation forest, every decision tree is trained by first applying principal component analysis (PCA) to a random subset of the input features.

Random Forest is a versatile machine learning method capable of performing both regression and classification tasks. It also handles dimensionality reduction, missing values, outlier values, and other essential steps of data exploration, and it does a fairly good job at them. It is a type of ensemble learning method, in which a group of weak models combines to form a powerful model.
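A minimal Random Forest sketch with scikit-learn; the two-feature toy data is an assumption made for illustration, with only the first feature informative:

```python
# Sketch (assumes scikit-learn): a Random Forest votes many decorrelated
# trees; feature_importances_ hints at which inputs drive the splits.
from sklearn.ensemble import RandomForestClassifier

X = [[0, 1], [1, 1], [2, 0], [8, 0], [9, 1], [10, 0]]  # feature 0 separates
y = [0, 0, 0, 1, 1, 1]

rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
print(rf.predict([[1, 0], [9, 0]]))  # majority vote of 20 trees
print(rf.feature_importances_)       # feature 0 should dominate
```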

Most commonly used algorithms in Decision Trees


Gini Index

The Gini index says that if we select two items from a population at random, they must be of the same class, and the probability of this is 1 if the population is pure. It works with categorical target variables such as Success or Failure. It performs only binary splits. The higher the value of Gini, the higher the homogeneity. CART uses the Gini method to create binary splits.

Calculation of Gini for a split

Calculate the Gini score for each node as the sum of the squares of the probabilities of success and failure (p^2 + q^2). Then calculate the Gini for the split as the weighted Gini score of each node of that split.
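In plain Python, the calculation above can be sketched as follows (the function names are my own):

```python
# Node score: p^2 + q^2; split score: node scores weighted by node size.
def gini_node(successes, failures):
    total = successes + failures
    p, q = successes / total, failures / total
    return p ** 2 + q ** 2

def gini_split(nodes):
    """nodes: one (successes, failures) pair per child node of the split."""
    grand_total = sum(s + f for s, f in nodes)
    return sum((s + f) / grand_total * gini_node(s, f) for s, f in nodes)

print(gini_node(10, 0))              # 1.0: a pure node has maximal score
print(gini_node(5, 5))               # 0.5: a 50/50 node has minimal score
print(gini_split([(8, 2), (2, 8)]))  # 0.68 for this two-node split
```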


Chi-Square

Chi-Square is an algorithm to find the statistical significance of the differences between sub-nodes and the parent node. We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable. It works with categorical target variables such as Success or Failure. It can produce two or more splits.

The higher the value of Chi-Square, the higher the statistical significance of the differences between the sub-node and the parent node.

Calculation of Chi-Square of each node

It is calculated with the formula: Chi-square = ((Actual − Expected)^2 / Expected)^(1/2)

It produces a tree called CHAID (Chi-square Automatic Interaction Detector).

Calculate the Chi-square for each node by computing the deviation for both Success and Failure. Then calculate the Chi-square of the split as the sum of the Chi-square values of Success and Failure for each node of the split.
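The steps above can be sketched in plain Python (function names are my own; expected counts are derived from the parent node's success rate):

```python
import math

# Per-class deviation for one node: sqrt((actual - expected)^2 / expected),
# summed over the Success and Failure classes.
def chi2_node(successes, failures, parent_success_rate):
    total = successes + failures
    exp_s = total * parent_success_rate
    exp_f = total * (1 - parent_success_rate)
    return (math.sqrt((successes - exp_s) ** 2 / exp_s)
            + math.sqrt((failures - exp_f) ** 2 / exp_f))

def chi2_split(nodes, parent_success_rate):
    """nodes: one (successes, failures) pair per child node of the split."""
    return sum(chi2_node(s, f, parent_success_rate) for s, f in nodes)

print(chi2_node(5, 5, 0.5))                 # 0.0: node matches the parent
print(chi2_split([(10, 0), (0, 10)], 0.5))  # large: a perfectly pure split
```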

Reduction in Variance

Reduction in Variance is an algorithm used for continuous target variables, as in regression problems. It uses the standard formula of variance to choose the best split, and the split with the lower variance is selected as the criterion to split the population:

Variance = Σ(X − X̄)^2 / n

Here X̄ is the mean of the values, X is an actual value, and n is the number of values.

Calculation of Variance

Calculate the variance for each node. Then calculate the variance for each split as the weighted average of the node variances.
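The variance criterion can be sketched in plain Python as follows (function names and toy values are my own):

```python
# Node variance: sum((x - mean)^2) / n; split variance: weighted average
# of the child-node variances, weighted by node size.
def variance(values):
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

def split_variance(nodes):
    """nodes: one list of target values per child node of the split."""
    n_total = sum(len(node) for node in nodes)
    return sum(len(node) / n_total * variance(node) for node in nodes)

parent = [1.0, 1.0, 2.0, 2.0]
print(variance(parent))                          # 0.25 before splitting
print(split_variance([[1.0, 1.0], [2.0, 2.0]]))  # 0.0 -> best possible split
```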