
1.4 Machine Learning Workflow and Data Preprocessing

4 min read • August 7, 2024

Machine learning workflows involve crucial steps from data preparation to model deployment. Data preprocessing, including cleaning and feature engineering, sets the foundation for accurate models. Understanding these steps is key to grasping the machine learning process.

Model development and evaluation are critical for creating effective predictive systems. Selecting the right algorithm, tuning hyperparameters, and rigorously evaluating performance help ensure models generalize well to new data. These skills are essential for applying machine learning in practice.

Data Preprocessing

Data Collection and Cleaning

  • Data collection involves gathering raw data from various sources (databases, APIs, web scraping) for use in the machine learning workflow
  • Data cleaning is the process of identifying and correcting or removing inaccurate, incomplete, or irrelevant data points from the dataset
    • Includes handling missing values by either removing the corresponding instances or imputing the missing values using techniques like mean, median, or mode imputation
    • Involves identifying and removing outliers, which are data points that significantly deviate from the majority of the data distribution, using statistical methods (Z-score, IQR) or domain knowledge (both cleaning steps are sketched after this list)
  • Data integration combines data from multiple sources into a unified view, resolving inconsistencies and redundancies to ensure data integrity and consistency
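
A minimal pandas sketch of these cleaning steps, assuming a small hypothetical DataFrame with "age" and "income" columns:

```python
import pandas as pd

# Hypothetical raw data: one missing age, one missing income,
# and one extreme income value
df = pd.DataFrame({
    "age": [25, 32, None, 41, 29, 38],
    "income": [48_000, 52_000, 61_000, None, 1_000_000, 57_000],
})

# Impute missing values (mean for age, median for the skewed income)
df["age"] = df["age"].fillna(df["age"].mean())
df["income"] = df["income"].fillna(df["income"].median())

# Remove outliers with the IQR rule: keep rows whose income lies
# inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
```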

Feature Engineering and Normalization

  • Feature engineering is the process of creating new features or transforming existing features to improve the performance of machine learning models
    • Includes feature extraction, which involves deriving new features from existing ones (extracting day, month, and year from a date feature)
    • Encompasses feature selection, which is the process of identifying and selecting the most relevant features for the model, reducing dimensionality and improving computational efficiency (using techniques like correlation analysis, mutual information, or regularization)
  • Data normalization is the process of scaling the features to a consistent range (usually between 0 and 1) to prevent features with larger magnitudes from dominating the learning process
    • Common normalization techniques include min-max scaling, which scales the features to a specific range, and standardization, which scales the features to have zero mean and unit variance (both are sketched after this list)
  • One-hot encoding is a technique used to convert categorical variables into a binary vector representation, enabling machine learning models to process categorical data effectively
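
A minimal sketch of these transformations with scikit-learn and pandas; the feature matrix and the categorical "color" column are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric features with very different magnitudes
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max scaling: each feature mapped to the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: each feature rescaled to zero mean, unit variance
X_standard = StandardScaler().fit_transform(X)

# One-hot encoding: each category becomes a binary indicator column
colors = pd.DataFrame({"color": ["red", "green", "red"]})
encoded = pd.get_dummies(colors, columns=["color"])
```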

Data Splitting

  • Data splitting involves dividing the preprocessed dataset into separate subsets for training, validation, and testing purposes
    • The training set is used to train the machine learning model, allowing it to learn patterns and relationships in the data (typically 60-80% of the data)
    • The validation set is used to tune the model's hyperparameters and assess its performance during the development phase, helping to prevent overfitting (typically 10-20% of the data)
    • The test set is used to evaluate the final model's performance on unseen data, providing an unbiased estimate of its generalization ability (typically 10-20% of the data)
  • Stratified sampling ensures that the class distribution in each subset is representative of the original dataset, which is particularly important for imbalanced datasets (see the sketch below)
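
A minimal sketch of a stratified 60/20/20 split using scikit-learn's train_test_split; the imbalanced dataset here is synthetic and hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset (90% class 0, 10% class 1)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           random_state=0)

# First carve out the test set (20%); stratify=y preserves the
# class ratio in every subset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Then split the remainder into training (60% overall) and
# validation (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=0)
```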

Model Development

Model Selection and Hyperparameter Tuning

  • Model selection involves choosing an appropriate machine learning algorithm based on the problem type (classification, regression, clustering), data characteristics, and performance requirements
    • Considerations include model interpretability, computational complexity, and the ability to handle specific data types (numerical, categorical, text)
    • Popular algorithms include linear regression, logistic regression, decision trees, random forests, support vector machines (SVM), and neural networks
  • Hyperparameter tuning is the process of finding the optimal set of hyperparameters for a selected model to maximize its performance
    • Hyperparameters are settings that control the model's learning process and architecture (learning rate, regularization strength, number of hidden layers in neural networks)
    • Techniques for hyperparameter tuning include grid search, which exhaustively searches through a predefined set of hyperparameter combinations, and random search, which samples hyperparameter values from a specified distribution (grid search is sketched after this list)
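
A minimal sketch of grid search with scikit-learn's GridSearchCV, assuming a random forest classifier and a hypothetical parameter grid (RandomizedSearchCV works the same way, sampling from distributions instead of trying every combination):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Hypothetical grid of hyperparameter values to try
param_grid = {
    "n_estimators": [100, 200],   # number of trees in the forest
    "max_depth": [None, 5, 10],   # limits tree complexity
}

# GridSearchCV exhaustively evaluates every combination, scoring
# each one with 5-fold cross-validation
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```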

Model Evaluation

  • Model evaluation assesses the performance of a trained model using appropriate evaluation metrics based on the problem type
    • For classification tasks, common metrics include accuracy, precision, recall, F1 score, and area under the ROC curve (AUC-ROC)
    • For regression tasks, common metrics include mean squared error (MSE), mean absolute error (MAE), and the coefficient of determination (R²)
  • Cross-validation is a technique used to assess the model's performance and its ability to generalize to unseen data by partitioning the data into multiple subsets and iteratively training and evaluating the model on different combinations of these subsets (k-fold cross-validation; see the sketch after this list)
  • The bias-variance tradeoff is the balance between a model's ability to fit the training data (bias) and its ability to generalize to new, unseen data (variance)
    • High bias models (underfitting) are too simplistic and fail to capture the underlying patterns in the data, while high variance models (overfitting) are too complex and memorize noise in the training data
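
A minimal sketch of 5-fold cross-validation and hold-out classification metrics with scikit-learn; the logistic regression model and synthetic data are hypothetical:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

# k-fold cross-validation: train on k-1 folds, evaluate on the
# held-out fold, and repeat for every fold
scores = cross_val_score(model, X, y, cv=5)
print("CV accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

# Hold-out evaluation with several classification metrics
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
y_pred = model.fit(X_tr, y_tr).predict(X_te)
print(accuracy_score(y_te, y_pred), precision_score(y_te, y_pred),
      recall_score(y_te, y_pred), f1_score(y_te, y_pred))
```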

Model Deployment

Deployment Considerations

  • Model deployment is the process of integrating a trained machine learning model into a production environment to make predictions on new, unseen data
  • Deployment considerations include choosing an appropriate deployment platform (cloud, on-premises), ensuring the model's compatibility with the production environment, and establishing a pipeline for data preprocessing and post-processing (a minimal persistence sketch follows this list)
  • Model monitoring is crucial to ensure the deployed model's performance remains stable over time and to detect concept drift, which occurs when the statistical properties of the target variable change over time
  • Model maintenance involves periodically retraining the model with new data to adapt to changes in the underlying data distribution and to incorporate user feedback
  • Scalability and efficiency are important factors in model deployment, as the model should be able to handle large volumes of data and make predictions in real-time with minimal latency
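
A minimal sketch of one common deployment step: persisting a preprocessing-plus-model pipeline with joblib so the serving environment applies exactly the training-time preprocessing. The file name and pipeline contents are hypothetical:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Bundling preprocessing and the model in one pipeline keeps
# production preprocessing identical to training-time preprocessing
pipeline = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
joblib.dump(pipeline, "model.joblib")

# In the serving environment: load the artifact and predict on new data
loaded = joblib.load("model.joblib")
predictions = loaded.predict(X[:5])
```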
© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.

