🧠 Machine Learning Engineering Unit 5 – Model Selection and Evaluation

Model selection and evaluation are crucial steps in machine learning. They involve choosing the best model from candidates and assessing performance on unseen data. Techniques like cross-validation, hyperparameter tuning, and various evaluation metrics help ensure models generalize well. Understanding the bias-variance tradeoff is key to balancing model complexity. Overfitting and underfitting are common pitfalls that can be addressed through regularization, early stopping, and proper data handling. Practical tips like starting simple and using pipelines enhance the model development process.

Key Concepts and Terminology

  • Model selection involves choosing the best model from a set of candidate models based on their performance on unseen data
  • Evaluation metrics quantify the performance of a model on a specific task (accuracy, precision, recall, F1-score, ROC AUC)
  • Cross-validation is a technique used to assess the generalization performance of a model by partitioning the data into subsets for training and testing
    • K-fold cross-validation splits the data into K equally sized folds, trains on K-1 folds, and tests on the remaining fold, repeating the process K times (see the sketch after this list)
  • Hyperparameters are settings of a model that are not learned from data but set before training (learning rate, regularization strength, number of hidden layers)
  • Overfitting occurs when a model learns the noise in the training data, leading to poor generalization on unseen data
  • Underfitting happens when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test data
  • Bias refers to the error introduced by approximating a real-world problem with a simplified model
  • Variance measures how much the model's predictions vary for different training sets
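
The K-fold idea above can be made concrete with a short scikit-learn sketch. This is a minimal example assuming a synthetic classification dataset and an arbitrary logistic-regression model; K=5 and the accuracy metric are illustrative choices, not recommendations.

```python
# Minimal K-fold cross-validation sketch (synthetic data, illustrative settings).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
model = LogisticRegression(max_iter=1000)

# Train on K-1 folds and test on the held-out fold, repeated K times.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print(f"Mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```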

Model Selection Techniques

  • Hold-out validation splits the data into training, validation, and test sets, using the validation set to select the best model and the test set for final evaluation
  • K-fold cross-validation provides a more robust estimate of model performance by averaging results across multiple splits of the data
  • Stratified K-fold cross-validation ensures that each fold has a representative distribution of the target variable, especially useful for imbalanced datasets
  • Leave-one-out cross-validation (LOOCV) is a special case of K-fold cross-validation where K equals the number of samples; it gives a nearly unbiased estimate of model performance but can be computationally expensive
  • Repeated K-fold cross-validation performs K-fold cross-validation multiple times with different random splits to obtain a more stable performance estimate
  • Nested cross-validation is used to tune hyperparameters and evaluate model performance simultaneously, with an outer loop for model evaluation and an inner loop for hyperparameter tuning (see the sketch after this list)
  • Time series cross-validation accounts for the temporal structure of the data by using past data for training and future data for testing, ensuring that the model does not learn from future information
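
Nested cross-validation can be sketched as follows, assuming scikit-learn, the built-in breast-cancer dataset, and an SVM whose C parameter is tuned in the inner loop; the grid and fold counts are illustrative assumptions.

```python
# Nested cross-validation sketch: inner loop tunes hyperparameters,
# outer loop estimates generalization performance.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # hyperparameter tuning
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)  # performance estimation

search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=inner_cv)
nested_scores = cross_val_score(search, X, y, cv=outer_cv)
print(f"Nested CV accuracy: {nested_scores.mean():.3f} +/- {nested_scores.std():.3f}")
```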

Cross-Validation Strategies

  • Train-test split is the simplest validation strategy, dividing the data into a training set for model fitting and a test set for performance evaluation
  • Stratified train-test split maintains the same proportion of target variable classes in both the training and test sets
  • K-fold cross-validation provides a more reliable estimate of model performance by averaging results across multiple splits of the data
    • Reduces the variance of the performance estimate compared to a single train-test split
  • Stratified K-fold cross-validation is preferred for classification tasks with imbalanced classes, ensuring each fold has a representative class distribution
  • Repeated K-fold cross-validation helps to further reduce the variance of the performance estimate by repeating the K-fold process multiple times with different random splits
  • Leave-one-out cross-validation is computationally expensive but provides an unbiased estimate of model performance, suitable for small datasets
  • Group K-fold cross-validation is used when data points are grouped (patients, users), keeping all samples from a group in the same fold so that information cannot leak between the training and test data of that group (see the sketch after this list)
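
The stratified and group strategies can be illustrated on a tiny hand-made dataset; the labels and group IDs below are invented purely to show how the folds are formed.

```python
# Stratified vs. grouped splitting (toy data, illustrative only).
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

X = np.random.rand(12, 3)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1])       # imbalanced classes
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6])  # e.g., patient IDs

# StratifiedKFold keeps the class ratio roughly constant in every fold.
for _, test_idx in StratifiedKFold(n_splits=4).split(X, y):
    print("stratified test classes:", y[test_idx])

# GroupKFold keeps all samples from a group in the same fold, preventing leakage.
for _, test_idx in GroupKFold(n_splits=3).split(X, y, groups=groups):
    print("group k-fold test groups:", np.unique(groups[test_idx]))
```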

Evaluation Metrics

  • Accuracy measures the proportion of correct predictions out of all predictions, suitable for balanced datasets
  • Precision quantifies the proportion of true positive predictions among all positive predictions, focusing on the model's ability to avoid false positives
  • Recall (sensitivity) measures the proportion of true positive predictions among all actual positive instances, emphasizing the model's ability to identify positive cases
  • F1-score is the harmonic mean of precision and recall, providing a balanced measure of the model's performance
  • Specificity measures the proportion of true negative predictions among all actual negative instances
  • ROC AUC (Area Under the Receiver Operating Characteristic Curve) evaluates the model's ability to discriminate between classes across various threshold settings
  • Log loss (cross-entropy loss) quantifies the dissimilarity between predicted probabilities and true labels, commonly used as a training objective for classification tasks
  • Mean squared error (MSE) measures the average squared difference between predicted and actual values, suitable for regression tasks (see the sketch after this list for computing these metrics)
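
These metrics can be computed directly with scikit-learn; the labels, class predictions, and probabilities below are made up for illustration.

```python
# Computing common classification and regression metrics on toy labels.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, log_loss,
                             mean_squared_error)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                   # hard class predictions
y_prob = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # predicted P(class = 1)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
print("roc auc  :", roc_auc_score(y_true, y_prob))
print("log loss :", log_loss(y_true, y_prob))

# Regression example for MSE on made-up targets and predictions.
print("mse      :", mean_squared_error([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))
```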

Bias-Variance Tradeoff

  • Bias refers to the error introduced by approximating a real-world problem with a simplified model
    • High bias models (linear regression) make strong assumptions about the data, leading to underfitting
  • Variance measures how much the model's predictions vary for different training sets
    • High variance models (complex neural networks) are sensitive to noise in the training data, leading to overfitting
  • The goal of model selection is to find the right balance between bias and variance to achieve good generalization performance
  • Increasing model complexity typically reduces bias but increases variance, while decreasing complexity has the opposite effect (see the sketch after this list)
  • Regularization techniques (L1/L2 regularization, dropout) can help control the bias-variance tradeoff by constraining the model's complexity
  • Ensemble methods (bagging, boosting) can reduce variance by combining predictions from multiple models trained on different subsets of the data
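
One way to see the tradeoff empirically is to sweep model complexity and compare training error against cross-validated error. The sketch below uses polynomial regression on synthetic sine data; the data-generating function and the degrees swept are arbitrary assumptions.

```python
# Bias-variance illustration: low degree underfits (high bias),
# high degree overfits (high variance).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

for degree in [1, 3, 10]:  # low -> high model complexity
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    train_mse = ((model.fit(X, y).predict(X) - y) ** 2).mean()
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  CV MSE={cv_mse:.3f}")
```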

Overfitting and Underfitting

  • Overfitting occurs when a model learns the noise in the training data, leading to poor generalization on unseen data
    • Characterized by high performance on the training set but low performance on the test set
  • Underfitting happens when a model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test data
  • Regularization techniques help prevent overfitting by adding a penalty term to the loss function, discouraging the model from learning overly complex patterns
  • Early stopping is another approach to mitigate overfitting, where training is stopped when performance on a validation set starts to degrade (see the sketch after this list)
  • Increasing the size and diversity of the training data can help reduce overfitting by exposing the model to a wider range of examples
  • Simplifying the model architecture (reducing layers, neurons) can help address overfitting by limiting the model's capacity to memorize noise
  • Adding more features or increasing model complexity can help alleviate underfitting by enabling the model to capture more complex patterns in the data
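
Early stopping as described above can be sketched with scikit-learn's MLPClassifier, which can hold out a validation fraction and stop when its validation score stops improving; the architecture and patience settings are illustrative assumptions.

```python
# Early stopping sketch: training halts once the validation score plateaus.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = MLPClassifier(hidden_layer_sizes=(64,),
                      early_stopping=True,      # monitor a held-out validation split
                      validation_fraction=0.2,  # 20% of training data held out
                      n_iter_no_change=10,      # patience before stopping
                      max_iter=500,
                      random_state=0)
model.fit(X, y)
print("Stopped after", model.n_iter_, "iterations")
```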

Hyperparameter Tuning

  • Hyperparameters are settings of a model that are not learned from data but set before training (learning rate, regularization strength, number of hidden layers)
  • Hyperparameter tuning aims to find the optimal combination of hyperparameters that maximizes the model's performance on unseen data
  • Grid search exhaustively evaluates all possible combinations of hyperparameters from a predefined set, which can be computationally expensive
  • Random search samples hyperparameter combinations randomly and is often more efficient than grid search when the search space is large (see the sketch after this list)
  • Bayesian optimization uses a probabilistic model to guide the search for optimal hyperparameters, balancing exploration and exploitation
  • Gradient-based and schedule-based methods (learning rate schedules) adapt certain hyperparameters during training based on the model's progress rather than fixing them in advance
  • Evolutionary algorithms (genetic algorithms) can be used to optimize hyperparameters by iteratively evolving a population of candidate solutions
  • Hyperparameter importance can be assessed using techniques like permutation importance or ablation studies to identify the most influential hyperparameters
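
Grid search and random search can be compared in a few lines with scikit-learn; the estimator, grid, and sampling distribution below are stand-in assumptions rather than recommended settings.

```python
# Grid search vs. random search over the regularization strength C.
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(max_iter=5000)

# Grid search: every combination in the predefined grid is evaluated.
grid = GridSearchCV(clf, {"C": [0.01, 0.1, 1, 10]}, cv=5).fit(X, y)
print("grid best:", grid.best_params_, round(grid.best_score_, 3))

# Random search: a fixed budget of samples drawn from a distribution over C.
rand = RandomizedSearchCV(clf, {"C": loguniform(1e-3, 1e2)},
                          n_iter=10, cv=5, random_state=0).fit(X, y)
print("random best:", rand.best_params_, round(rand.best_score_, 3))
```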

Practical Implementation Tips

  • Start with a simple model and gradually increase complexity to establish a performance baseline and avoid overfitting
  • Use stratified sampling when splitting data to ensure representative class distribution in each subset
  • Scale and normalize features to improve convergence and model performance, especially for gradient-based optimization algorithms
  • Handle missing data by either removing samples with missing values, imputing missing values, or using models that can handle missingness directly (tree-based models)
  • Address class imbalance through resampling techniques (oversampling minority class, undersampling majority class) or by using class weights during training
  • Perform feature selection to identify the most informative features and reduce model complexity, using techniques like correlation analysis, mutual information, or L1 regularization
  • Monitor training progress using learning curves to detect overfitting or underfitting early and adjust the model accordingly
  • Use pipelines to encapsulate data preprocessing, feature engineering, and model training steps for easier experimentation and deployment (see the sketch after this list)
  • Document and version control your experiments to keep track of different model configurations, hyperparameters, and results
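
Several of these tips (imputation, scaling, class weights, stratified evaluation, pipelines) can be combined in one short sketch; the synthetic data and the chosen steps are illustrative assumptions, not a fixed recipe.

```python
# A pipeline that imputes missing values, scales features, and fits a
# class-weighted model, evaluated with stratified K-fold cross-validation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, weights=[0.8, 0.2],
                           random_state=1)
X[::17, 0] = np.nan  # inject a few missing values for the imputer to handle

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),             # handle missing data
    ("scale", StandardScaler()),                              # normalize features
    ("model", LogisticRegression(class_weight="balanced")),   # address imbalance
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")
print(f"F1 across folds: {scores.mean():.3f} +/- {scores.std():.3f}")
```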


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
