⛱️ Cognitive Computing in Business Unit 3 – Machine Learning Essentials

Machine learning is a subset of AI that enables systems to learn and improve from experience without explicit programming. It uses algorithms to build models based on training data, making predictions or decisions in various applications like email filtering and computer vision. Key concepts include datasets, features, labels, and models. The machine learning process involves defining problems, preparing data, choosing models, training, evaluating, and deploying. Popular algorithms range from linear regression to neural networks, with tools like Python and TensorFlow supporting implementation.

What's Machine Learning Anyway?

  • Machine learning (ML) is a subset of artificial intelligence that focuses on building systems that can learn and improve from experience without being explicitly programmed
  • ML algorithms build mathematical models based on sample data, known as training data, to make predictions or decisions without being explicitly programmed to do so
  • As the models are exposed to new data, they adapt and learn from previous computations to produce reliable, repeatable decisions and results
  • ML is closely related to computational statistics, which focuses on making predictions using computers, but not all ML is statistical learning
  • The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning
  • ML algorithms are used in a wide variety of applications (email filtering, computer vision, recommendation engines) where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks
    • Spam filtering is a common example, where ML algorithms learn to flag spam based on words or phrases found in the email text (a minimal code sketch follows this list)
  • ML enables analysis of massive quantities of data, delivering faster, more accurate results to identify profitable opportunities or dangerous risks
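
To make the spam-filtering example concrete, here is a minimal sketch using scikit-learn's CountVectorizer and MultinomialNB. The example emails and labels are invented for illustration; the general pattern (vectorize the text, fit a model, predict on new messages) is the point.

```python
# Minimal spam-filtering sketch with scikit-learn (illustrative data only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training emails and labels (1 = spam, 0 = not spam).
emails = [
    "win a free prize now",
    "limited offer, claim your reward",
    "meeting moved to 3pm tomorrow",
    "please review the attached report",
]
labels = [1, 1, 0, 0]

# Turn raw text into word-count features, then fit a Naive Bayes classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

# Score a new, unseen message.
new_message = vectorizer.transform(["claim your free prize"])
print(model.predict(new_message))  # expected: [1], i.e. flagged as spam
```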

Key Machine Learning Concepts

  • Dataset is a collection of data used to train and test ML models, typically split into training, validation, and test sets
  • Features are the input variables used in making predictions, represented as columns in tabular datasets or as abstract qualities (texture, color, etc.) in unstructured data like images
  • Labels are the output variable or target that the model is trying to predict, based on the features
  • Model is the mathematical representation of a real-world process, trained on historical data to make predictions on new data by capturing the relationship between the features and the label
  • Training is the process of feeding data to the model so it can learn the relationship between the features and the label
    • The goal is to minimize the difference between the model's predictions and the actual labels, known as the loss or cost function
  • Inference is when the trained model is used to make predictions on new, unseen data
  • Hyperparameters are the settings used to control the model training process (learning rate, number of hidden layers in a neural network, etc.)
    • These are set before training and are not learned from data like model parameters
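
A minimal sketch tying these terms together, assuming scikit-learn and a synthetic dataset: X holds the features, y holds the labels, the data is split before training, a hyperparameter is chosen up front, and a loss-style metric is computed at inference time.

```python
# Illustrative sketch: dataset, features, labels, training, inference, hyperparameters.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic dataset: X holds the features, y holds the labels.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Split into training and test sets (a validation set can be carved out the same way).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# alpha is a hyperparameter: set before training, not learned from the data.
model = Ridge(alpha=1.0)

# Training: the model learns the relationship between the features and the label.
model.fit(X_train, y_train)

# Inference: predict on unseen data, then measure the loss (here, mean squared error).
predictions = model.predict(X_test)
print("Test MSE:", mean_squared_error(y_test, predictions))
```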

Types of Machine Learning

  • Supervised learning uses labeled datasets to train algorithms to classify data or predict outcomes accurately (contrast with unsupervised learning in the sketch after this list)
    • Input data is called training data and has a known label or result (historical stock prices, images of dogs labeled "dog", etc.)
    • Model is trained until it can detect the underlying patterns and relationships, enabling it to yield accurate labeling results when presented with never-before-seen data
  • Unsupervised learning is used on data with no historical labels, allowing the algorithm to act on that data without guidance
    • Unsupervised learning can discover hidden patterns or data groupings (customer segments) without the need for human intervention
    • Commonly used for transactional data (identifying segments of customers with similar attributes who can then be treated similarly in marketing campaigns)
  • Semi-supervised learning uses a mix of labeled and unlabeled data, usually a small amount of labeled data with a large amount of unlabeled data
    • Can address real-world problems where labeled data is scarce or expensive, but unlabeled data is abundant
  • Reinforcement learning trains models to make a sequence of decisions by exposing the model to an environment in which it learns from feedback (rewards and penalties)
    • Learns by trial and error, refining its strategy over many attempts to make increasingly accurate decisions
    • Often used in gaming, robotics, and navigation
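
To make the supervised/unsupervised distinction concrete, the sketch below fits a classifier on labeled data and a clustering model on the same features with the labels withheld. It uses scikit-learn's built-in iris dataset purely as an illustration.

```python
# Supervised vs. unsupervised learning on the same features (illustrative).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the model sees both the features and the labels during training.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Supervised accuracy on training data:", clf.score(X, y))

# Unsupervised: the model sees only the features and must find structure on its own.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster assignments for the first five samples:", kmeans.labels_[:5])
```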

The Machine Learning Process

  • Define the problem and gather data from various sources (databases, sensors, APIs) and different formats (tables, images, text)
  • Prepare data by cleaning (handling missing values, removing duplicates), transforming (scaling, encoding categorical variables), and splitting into train, validation, and test sets
    • Exploratory data analysis examines data to understand its main characteristics (mean, standard deviation, correlation, etc.)
    • Feature engineering creates new input features from existing ones to improve model performance
  • Choose a model based on the problem type (classification, regression, clustering), data size and type, and resource constraints
  • Train the model on the training data, tuning hyperparameters to improve performance
    • Validation data provides an unbiased evaluation of model performance while tuning hyperparameters
  • Evaluate model performance on the test set using appropriate metrics (accuracy, precision, recall, F1-score)
  • Deploy model into production environment to start generating predictions on real-world data
  • Monitor and maintain model performance over time, retraining or updating as needed based on new data or changing environment
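
The steps above can be strung together in a few lines of code. The sketch below is one possible walk-through of prepare, split, choose, train, and evaluate, using a built-in scikit-learn dataset as a stand-in for real business data; deployment and monitoring are not shown.

```python
# End-to-end sketch of the ML process on a built-in dataset (illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Gather data, then split into training and test sets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Prepare data (scaling) and choose a model; a pipeline keeps both steps together.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Train on the training data.
model.fit(X_train, y_train)

# Evaluate on the test set with accuracy, precision, recall, and F1-score.
print(classification_report(y_test, model.predict(X_test)))
```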

Popular Algorithms

  • Linear Regression predicts continuous values (house prices) by fitting a linear equation to observed data
  • Logistic Regression predicts binary outcomes (spam or not spam) by fitting a logistic function to observed data
  • Decision Trees predict outcomes by learning simple decision rules inferred from the data features
    • Random Forest is an ensemble of decision trees, making predictions by aggregating the predictions of multiple trees
  • Support Vector Machines find a hyperplane in N-dimensional space (N = number of features) that distinctly classifies the data points
  • Naive Bayes classifiers are a family of probabilistic algorithms based on applying Bayes' theorem with strong independence assumptions between the features
  • K-Means Clustering groups unlabeled data into K clusters based on feature similarity
  • Neural Networks are inspired by biological neural networks, consisting of input, hidden, and output layers of interconnected nodes
    • Deep Learning uses neural networks with many hidden layers to learn hierarchical representations of data
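
As a quick side-by-side of several of these algorithms, the sketch below fits a handful of scikit-learn estimators on the same dataset and prints cross-validated accuracy. The model list and settings are illustrative defaults, not tuned recommendations.

```python
# Compare several common algorithms on one dataset (illustrative defaults).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Support Vector Machine": SVC(),
    "Naive Bayes": GaussianNB(),
    "Neural Network (MLP)": MLPClassifier(max_iter=2000, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: {scores.mean():.3f}")
```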

Tools and Frameworks

  • Python is the most popular programming language for ML due to its extensive libraries (NumPy for numerical computing, Pandas for data manipulation, Matplotlib for visualization)
  • Scikit-learn is a Python library that provides a wide range of supervised and unsupervised learning algorithms
  • TensorFlow is an open-source library for dataflow and differentiable programming, used for ML applications such as neural networks
  • Keras is a high-level neural networks API written in Python, capable of running on top of TensorFlow, Microsoft Cognitive Toolkit, Theano, or PlaidML (see the sketch after this list)
  • PyTorch is an open source ML library based on Torch, used for applications such as computer vision and natural language processing, primarily developed by Facebook's AI Research lab
  • Apache Spark is a fast and general-purpose cluster computing system that provides APIs to work with large datasets, including Spark MLlib for ML
  • Cloud platforms like Amazon Web Services, Google Cloud, and Microsoft Azure offer managed ML services for building, training, and deploying models at scale
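
As one small example of these frameworks in action, the sketch below defines, compiles, and trains a tiny neural network with the Keras API bundled in TensorFlow (tensorflow.keras). The architecture, training settings, and random placeholder data are arbitrary choices made only to show the workflow.

```python
# Minimal Keras (TensorFlow) sketch: define, compile, and train a small network.
import numpy as np
import tensorflow as tf

# Random placeholder data: 100 samples, 8 features, binary labels.
X = np.random.rand(100, 8).astype("float32")
y = np.random.randint(0, 2, size=(100,)).astype("float32")

# A small feed-forward network: input -> hidden layer -> output probability.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# The optimizer and loss are choices made before training, like hyperparameters.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Train for a few epochs; verbose=0 keeps the output quiet.
model.fit(X, y, epochs=5, batch_size=16, verbose=0)
print(model.evaluate(X, y, verbose=0))  # [loss, accuracy] on the training data
```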

Real-World Business Applications

  • Fraud Detection uses ML to identify suspicious patterns and prevent fraudulent transactions in industries like banking and insurance
    • JPMorgan Chase uses ML to detect fraud and money laundering, saving $150 million annually
  • Recommendation Systems suggest relevant products or content to users based on past behavior and similar users' preferences
    • Netflix uses ML to personalize movie and TV recommendations, saving $1 billion annually in customer retention
  • Customer Churn Prediction helps businesses identify customers at high risk of leaving, allowing proactive retention efforts
    • Verizon uses ML to predict customer churn and improve retention by 1-5%
  • Dynamic Pricing optimizes product prices in real-time based on demand, competitor prices, and other market factors
    • Uber uses ML for surge pricing to balance supply and demand by raising prices when demand outstrips supply
  • Predictive Maintenance anticipates equipment failures to allow maintenance to be scheduled before the failure occurs, preventing unexpected downtime
    • General Electric uses ML to predict jet engine failures, reducing flight delays and cancellations
  • Image and Video Analysis extracts insights from visual data for applications like medical diagnosis, defect detection in manufacturing, and autonomous vehicles
    • IBM Watson uses ML to assist doctors in diagnosing diseases like cancer from medical images
  • Natural Language Processing enables computers to understand, interpret, and generate human language for applications like sentiment analysis, chatbots, and machine translation
    • Google Translate uses ML to provide real-time language translation for over 100 languages

Challenges and Limitations

  • Data quality issues like missing values, outliers, and noise can significantly impact model performance
    • Garbage in, garbage out - ML models are only as good as the data they are trained on
  • Model interpretability is a challenge, particularly for complex models like deep neural networks
    • Black box models make it difficult to understand how the model arrives at its predictions, which can be problematic in regulated industries
  • Bias in training data can lead to biased predictions, perpetuating or even amplifying societal biases
    • Amazon had to scrap an ML recruiting tool that showed bias against women due to historical hiring data
  • Overfitting occurs when a model learns the noise in the training data to the extent that it negatively impacts the performance on new data
    • Regularization techniques (L1/L2 regularization, dropout) can help mitigate overfitting, as illustrated in the sketch at the end of this list
  • Deployment and scaling ML models in production environments can be challenging due to computational requirements and the need to retrain models as new data becomes available
    • Techniques like model compression and quantization can help reduce model size and inference latency
  • Adversarial attacks involve malicious actors manipulating input data to fool ML models and cause misclassifications
    • Adversarial training incorporates adversarial examples in the training data to improve model robustness
  • Concept drift occurs when the statistical properties of the target variable change over time, leading to a degradation in model performance
    • Continuous monitoring and retraining of models is necessary to adapt to changing environments
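
To illustrate the overfitting and regularization point above, the sketch below fits the same high-degree polynomial model to a small noisy dataset with and without L2 regularization and compares held-out error. The data, degree, and alpha are arbitrary; exact numbers will vary, but the unregularized fit typically shows a much larger test error.

```python
# Overfitting vs. L2 regularization on a small noisy dataset (illustrative).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

# A small, noisy dataset makes it easy for a flexible model to memorize the noise.
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=40)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Same degree-15 polynomial features, with and without an L2 penalty on the weights.
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
regularized = make_pipeline(PolynomialFeatures(degree=15), Ridge(alpha=1.0))

for name, model in [("No regularization", overfit), ("L2 regularization (Ridge)", regularized)]:
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name}: test MSE = {mse:.3f}")
```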

