
8.1 Decision Trees: Construction and Pruning

3 min read · August 7, 2024

Decision trees are powerful tools for classification and regression tasks. They use a hierarchical structure to make predictions, starting from a root and splitting data based on features. Understanding their components and traversal is key to grasping their functionality.

Splitting criteria like Gini impurity and entropy help choose the best features for node splits. Pruning techniques, such as cost complexity pruning, address overfitting by simplifying trees. The CART algorithm combines these concepts to build efficient, interpretable decision trees.

Decision Tree Structure

Components of a Decision Tree

  • Decision trees consist of a hierarchical structure used for classification or regression tasks
  • Begin with a root node representing the entire dataset or population
  • Recursively split the data at each internal node based on a selected feature and threshold
  • Leaf nodes represent the final decision or prediction for a given instance after traversing the tree

Traversing a Decision Tree

  • Start at the root node and evaluate the corresponding feature for a given instance
  • Follow the appropriate branch based on the feature value until reaching a leaf node
  • Leaf nodes contain the predicted class label (classification) or value (regression) for the instance
  • Path from the root to a leaf represents a series of decisions or rules leading to the final prediction
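To make the structure and traversal concrete, here is a minimal, self-contained sketch in Python. The dict-based node representation and the `predict_one` helper are illustrative assumptions, not the API of any particular library: internal nodes test one feature against a threshold, and leaves carry the final prediction.

```python
# Hypothetical dict-based tree nodes: internal nodes hold a feature index and threshold,
# leaves hold the prediction.
def predict_one(node, x):
    # Descend from the root until a leaf node is reached.
    while "prediction" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]

# Tiny hand-built tree: the root splits on feature 0 at threshold 2.5.
tree = {
    "feature": 0, "threshold": 2.5,
    "left":  {"prediction": "class_A"},
    "right": {"feature": 1, "threshold": 1.0,
              "left":  {"prediction": "class_A"},
              "right": {"prediction": "class_B"}},
}

print(predict_one(tree, [3.0, 0.5]))  # -> class_A  (feature 0 > 2.5, then feature 1 <= 1.0)
print(predict_one(tree, [3.0, 2.0]))  # -> class_B
```

The path each instance takes through the nested dicts is exactly the sequence of decisions described above.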

Splitting Criteria

Measures of Impurity

  • Splitting criteria determine the best feature and threshold to split a node
  • Aim to maximize the homogeneity or purity of the resulting subsets after splitting
  • Common measures of impurity include Gini impurity and entropy
  • Gini impurity measures the probability of misclassification if a random instance is labeled based on the distribution of classes in the subset
  • Entropy quantifies the amount of uncertainty or randomness in the class distribution of a subset
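Both impurity measures can be computed directly from the class proportions in a subset. The sketch below uses NumPy; the `gini` and `entropy` helper names are our own, not a library API.

```python
import numpy as np

def gini(labels):
    # Gini impurity: probability of misclassifying a random instance
    # labeled according to the class distribution of the subset.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: uncertainty (in bits) of the class distribution in the subset.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

labels = ["A", "A", "B", "B", "B", "B"]   # 2/6 class A, 4/6 class B
print(round(gini(labels), 3))     # 0.444
print(round(entropy(labels), 3))  # 0.918
```

Both measures are zero for a pure subset (one class only) and largest when the classes are evenly mixed.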

Information Gain

  • Information gain measures the reduction in impurity achieved by splitting a node based on a specific feature
  • Calculated as the difference between the impurity of the parent node and the weighted average impurity of the child nodes
  • Higher information gain indicates a more informative feature for splitting
  • Goal is to select the feature and threshold that maximize information gain at each node
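A short sketch of information gain under these definitions, using entropy as the impurity measure; the helper names and toy data are illustrative assumptions.

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Impurity of the parent minus the size-weighted impurity of the two children.
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

parent = ["A", "A", "B", "B"]
left, right = ["A", "A"], ["B", "B"]          # a perfectly separating split
print(information_gain(parent, left, right))  # 1.0, the maximum possible gain here
```

A split that leaves the children as mixed as the parent would score a gain of zero, so the tree-growing procedure prefers splits like the one above.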

Pruning Techniques

Addressing Overfitting

  • Decision trees are prone to overfitting, especially when grown to full depth
  • Overfitting occurs when the tree becomes too complex and starts to memorize noise or outliers in the training data
  • Pruning techniques are employed to simplify the tree and improve generalization performance
  • Pruning involves removing or collapsing nodes that do not significantly contribute to the overall predictive accuracy

Cost Complexity Pruning

  • Cost complexity pruning, also known as weakest link pruning, is a commonly used pruning technique
  • Introduces a complexity parameter (alpha) that balances the trade-off between tree size and accuracy
  • Pruning process starts from the bottom of the tree and recursively evaluates the impact of removing each node
  • Nodes are pruned if the increase in misclassification cost is less than the decrease in complexity (determined by alpha)
  • Higher values of alpha result in more aggressive pruning and smaller trees
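As an illustration, scikit-learn exposes cost complexity pruning through the `cost_complexity_pruning_path` method and the `ccp_alpha` parameter. The sketch below assumes scikit-learn is installed; the dataset and the idea of scanning all candidate alphas are for demonstration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Compute the sequence of effective alphas for weakest-link pruning of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Refit one tree per alpha; larger alpha -> more aggressive pruning -> smaller tree.
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```

In practice, the alpha with the best validation (or cross-validation) accuracy is chosen, trading a small loss in training fit for better generalization.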

CART Algorithm

  • CART (Classification and Regression Trees) is a popular algorithm for building decision trees
  • Builds the tree by recursively selecting the best feature and threshold to split nodes based on impurity measures
  • Grows the tree to full depth and then applies cost complexity pruning to obtain the optimal subtree
  • Handles both categorical and numerical features and supports classification and regression tasks
  • Provides a framework for building interpretable and efficient decision trees
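scikit-learn implements an optimized variant of CART, so a brief, illustrative sketch (dataset and hyperparameters are only for demonstration) can show the classification and regression sides of the framework, with `export_text` displaying the learned rules.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_text

X, y = load_iris(return_X_y=True)

# Classification tree: Gini impurity splits, pruned via the complexity parameter.
clf = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01, random_state=0).fit(X, y)
print(export_text(clf, feature_names=load_iris().feature_names))  # the learned if/then rules

# The same framework handles regression: by default each split minimizes squared error.
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
```

Note that while CART in general can split on categorical features directly, scikit-learn's implementation expects numerical feature encodings.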
© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.

