7.3 Unsupervised learning: clustering and dimensionality reduction
5 min read • August 14, 2024
Unsupervised learning uncovers hidden patterns in data without predefined outputs. It's useful when labeled data is scarce or expensive. Common tasks include clustering, which groups similar data points, and dimensionality reduction, which simplifies complex datasets.
This section covers two clustering methods: k-means and hierarchical clustering. It also explores principal component analysis (PCA) for dimensionality reduction. These techniques help reveal data structures, visualize high-dimensional data, and preprocess information for further analysis.
Unsupervised Learning Goals and Applications
Discovering Hidden Patterns and Structures
Unsupervised learning aims to discover hidden patterns or structures in unlabeled data without prior knowledge of the desired output
Algorithms do not require labeled data, making them suitable for scenarios where obtaining labeled data is expensive, time-consuming, or infeasible (text data, image collections)
Can serve as a preprocessing step for supervised learning tasks by providing useful insights into the data or reducing the dimensionality of the input space
Helps identify intrinsic properties, similarities, or relationships within the data that may not be apparent or easily observable
Common Unsupervised Learning Tasks
Clustering is a common unsupervised learning task that groups similar data points together based on their intrinsic properties or similarities
Applications of clustering include customer segmentation (market research), anomaly detection (fraud detection), image segmentation (object recognition), and document clustering (topic modeling)
Dimensionality reduction is another unsupervised learning task that aims to reduce the number of features in high-dimensional data while preserving the most important information
Applications of dimensionality reduction include data visualization (scatter plots), noise reduction (signal processing), feature extraction (face recognition), and data compression
K-means Clustering for Data Grouping
K-means Algorithm Overview
K-means clustering is a partitional clustering algorithm that aims to partition n data points into k clusters, where each data point belongs to the cluster with the nearest mean (centroid)
The algorithm iteratively assigns data points to the nearest centroid and updates the centroids based on the mean of the assigned data points until convergence
Initialization: Randomly select k data points as initial centroids
Assignment: Assign each data point to the nearest centroid based on a distance metric (Euclidean distance)
Update: Recalculate the centroids by taking the mean of the data points assigned to each cluster
Repeat the assignment and update steps until the centroids no longer change significantly or a maximum number of iterations is reached
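The steps above can be sketched in NumPy. This is a minimal illustration, not a production implementation; the two-blob dataset, seed, and k=2 are assumptions chosen to make convergence easy to see:

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: randomly select k data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment: each point goes to the nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: recalculate each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when centroids no longer change significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated blobs (illustrative data)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels, centroids = k_means(X, k=2)
```

With well-separated blobs like these, the algorithm recovers the two groups; as noted below, harder datasets may require multiple runs with different initializations.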
Considerations and Limitations
The choice of the number of clusters (k) is a hyperparameter that needs to be determined beforehand, often based on domain knowledge or using techniques like the elbow method or silhouette analysis
K-means clustering is sensitive to the initial placement of centroids and may converge to different local optima depending on the initialization. Multiple runs with different initializations can help find a better solution
The algorithm assumes that clusters are spherical, equally sized, and have similar densities, which may not always hold true in real-world datasets (elongated clusters, varying cluster sizes)
K-means works well for datasets with well-separated, compact clusters but may struggle with overlapping or non-convex clusters
Hierarchical Clustering vs K-means
Hierarchical Clustering Overview
Hierarchical clustering is a clustering approach that builds a hierarchy of clusters by either merging smaller clusters into larger ones (agglomerative) or dividing larger clusters into smaller ones (divisive)
Agglomerative clustering starts with each data point as a separate cluster and iteratively merges the closest clusters until a single cluster is formed
Divisive clustering starts with all data points in a single cluster and recursively splits the clusters into smaller ones until each data point forms its own cluster
The hierarchical structure of the clusters can be visualized using a dendrogram, which shows the merging or splitting of clusters at different levels of similarity
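A bare-bones agglomerative sketch with single linkage makes the merge loop concrete. The five-point dataset and the stopping criterion of three clusters are illustrative assumptions (real implementations use more efficient data structures than this O(n^3) double loop):

```python
import numpy as np

def agglomerative_single_linkage(X, n_clusters):
    # Agglomerative: start with each data point as its own cluster
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest single-linkage
        # distance (closest pair of points across the two clusters)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the two closest clusters
        clusters[a] += clusters.pop(b)
    return clusters

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
print(agglomerative_single_linkage(X, 3))  # → [[0, 1], [2, 3], [4]]
```

Recording the distance at each merge is exactly the information a dendrogram plots.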
Advantages of Hierarchical Clustering
Hierarchical clustering does not require specifying the number of clusters in advance, allowing for more flexibility in exploring different levels of granularity in the data
Can handle non-spherical and non-convex clusters, as it does not make assumptions about the shape or size of the clusters
The algorithm can use various linkage criteria to determine the similarity between clusters, such as single linkage (nearest neighbor), complete linkage (farthest neighbor), or average linkage (average distance between all pairs of points)
Provides a more informative and interpretable representation of the clustering structure compared to flat clustering methods like k-means
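The three linkage criteria differ only in how they summarize the pairwise distances between two clusters. A small worked example, with two illustrative one-dimensional-looking clusters on the x-axis:

```python
import numpy as np

A = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster A (illustrative points)
B = np.array([[4.0, 0.0], [6.0, 0.0]])   # cluster B

# All pairwise distances between points of A and points of B
pairwise = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

single   = pairwise.min()    # single linkage (nearest neighbor): 3.0
complete = pairwise.max()    # complete linkage (farthest neighbor): 6.0
average  = pairwise.mean()   # average linkage (mean over all pairs): 4.5
```

Single linkage tends to produce elongated "chained" clusters, while complete linkage favors compact ones; average linkage is a compromise.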
Limitations of Hierarchical Clustering
Hierarchical clustering has a higher computational complexity compared to k-means, making it less suitable for large datasets (time complexity of O(n^3) for agglomerative clustering)
Does not provide a direct way to assign new data points to existing clusters, requiring the re-computation of the entire hierarchy
The choice of linkage criteria can have a significant impact on the resulting clustering structure and may require domain knowledge or experimentation to determine the most appropriate one
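The second limitation contrasts with k-means, where a new point can be assigned in one step by comparing it to the stored centroids. A minimal sketch, with hypothetical centroids:

```python
import numpy as np

# Centroids from a previously fitted k-means model (illustrative values)
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

new_point = np.array([4.5, 5.2])
# Assign the new point to the cluster with the nearest centroid
label = int(np.linalg.norm(centroids - new_point, axis=1).argmin())
# label == 1: the point joins the cluster centered near (5, 5)
```

Hierarchical clustering offers no analogous shortcut: placing a new point requires rebuilding (or heuristically extending) the hierarchy.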
Dimensionality Reduction with PCA
Principal Component Analysis (PCA) Overview
Principal Component Analysis (PCA) is a linear dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving the most important information
PCA identifies the principal components, which are orthogonal directions in the feature space that capture the maximum variance in the data
The first principal component captures the direction of maximum variance, the second principal component captures the direction of maximum variance orthogonal to the first, and so on
The steps to perform PCA include:
Standardize the data by subtracting the mean and dividing by the standard deviation of each feature
Compute the covariance matrix of the standardized data
Perform eigendecomposition on the covariance matrix to obtain the eigenvectors (principal components) and eigenvalues
Sort the eigenvectors in descending order based on their corresponding eigenvalues
Select the top k eigenvectors to form the projection matrix
Transform the original data into the lower-dimensional space by multiplying it with the projection matrix
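The six steps above map directly onto a few lines of NumPy. The correlated two-feature dataset and the choice of k=1 retained component are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative data: feature 2 is strongly correlated with feature 1
t = rng.normal(size=200)
X = np.column_stack([t, 0.5 * t + 0.1 * rng.normal(size=200)])

# 1. Standardize: subtract the mean, divide by the standard deviation
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data
cov = np.cov(Xs, rowvar=False)

# 3. Eigendecomposition (eigh suits symmetric covariance matrices)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Sort eigenvectors by descending eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 5. Keep the top k eigenvectors as the projection matrix
k = 1
W = eigvecs[:, :k]

# 6. Project the data into the lower-dimensional space
Z = Xs @ W

# Cumulative explained variance ratio of the retained component(s)
explained = eigvals[:k].sum() / eigvals.sum()
```

Because the two features are nearly collinear here, the first component captures almost all of the variance, which is the quantity the explained-variance criterion below inspects.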
Considerations and Limitations
The number of principal components to retain can be determined based on the cumulative explained variance ratio, which measures the proportion of total variance captured by the selected components
PCA assumes that the important structure in the data lies along linear combinations of the features and that the principal components are orthogonal to each other. It may not capture non-linear relationships or preserve the original data structure (manifold structure)
PCA is sensitive to the scale of the features, so it is important to standardize the data before applying PCA to ensure equal contribution of each feature to the variance
Interpreting the principal components can be challenging, as they are linear combinations of the original features and may not have a clear physical or semantic meaning