Discover patterns in unlabeled data using clustering, dimensionality reduction, and anomaly detection
Detects patterns and groups in unlabeled data using clustering algorithms like K-Means and DBSCAN. Use when analyzing datasets without labels to find natural groupings or anomalies.
/plugin marketplace add pluginagentmarketplace/custom-plugin-machine-learning
/plugin install machine-learning-assistant@pluginagentmarketplace-machine-learning
This skill inherits all available tools. When active, it can use any tool Claude has access to.
assets/config.yaml
assets/schema.json
references/GUIDE.md
references/PATTERNS.md
scripts/validate.py
Discover hidden patterns and groupings in unlabeled data.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
# Always scale before clustering
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Cluster
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_scaled)
# Evaluate
score = silhouette_score(X_scaled, labels)
print(f"Silhouette Score: {score:.4f}")
| Algorithm | Best For | Key Params |
|---|---|---|
| K-Means | Spherical clusters | n_clusters |
| DBSCAN | Arbitrary shapes, noise | eps, min_samples |
| Hierarchical | Nested clusters | linkage |
| HDBSCAN | Variable density | min_cluster_size |
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
import hdbscan
algorithms = {
    'kmeans': KMeans(n_clusters=5, random_state=42),
    'dbscan': DBSCAN(eps=0.5, min_samples=5),
    'hierarchical': AgglomerativeClustering(n_clusters=5),
    'hdbscan': hdbscan.HDBSCAN(min_cluster_size=15)
}
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, calinski_harabasz_score

def find_optimal_k(X, max_k=15):
    """Find the optimal number of clusters by scanning k and scoring each fit."""
    metrics = {'inertia': [], 'silhouette': [], 'calinski': []}
    K = range(2, max_k + 1)
    for k in K:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        labels = kmeans.fit_predict(X)
        metrics['inertia'].append(kmeans.inertia_)
        metrics['silhouette'].append(silhouette_score(X, labels))
        metrics['calinski'].append(calinski_harabasz_score(X, labels))
    # Pick the k with the best silhouette score
    optimal_k = K[np.argmax(metrics['silhouette'])]
    return optimal_k, metrics
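A quick way to exercise find_optimal_k on synthetic data (a minimal sketch; the blob parameters are illustrative):
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X_demo, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

best_k, metrics = find_optimal_k(X_demo, max_k=10)
print(f"Best k by silhouette: {best_k}")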
| Method | Preserves | Speed |
|---|---|---|
| PCA | Global variance | Fast |
| t-SNE | Local structure | Slow |
| UMAP | Both local/global | Fast |
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
# PCA for preprocessing
pca = PCA(n_components=0.95) # Keep 95% variance
X_pca = pca.fit_transform(X)
# UMAP for visualization
reducer = umap.UMAP(n_components=2, random_state=42)
X_2d = reducer.fit_transform(X)
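To sanity-check clusters visually, the 2-D UMAP embedding can be colored by the cluster labels from the K-Means example above (a sketch; assumes X_2d and labels are already defined):
import matplotlib.pyplot as plt

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=10, cmap='tab10')
plt.title('UMAP projection colored by cluster')
plt.xlabel('UMAP-1')
plt.ylabel('UMAP-2')
plt.show()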
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
# Isolation Forest
iso_forest = IsolationForest(contamination=0.1, random_state=42)
anomalies = iso_forest.fit_predict(X) # -1 for anomaly
# Local Outlier Factor
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
anomalies_lof = lof.fit_predict(X)  # -1 for anomaly
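A common follow-up is to drop flagged points before clustering; a minimal sketch using the Isolation Forest labels above:
mask = anomalies == 1  # Isolation Forest marks inliers as 1, anomalies as -1
X_clean = X[mask]
print(f"Removed {(~mask).sum()} anomalies; {len(X_clean)} points remain")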
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score
)

def validate_clustering(X, labels):
    """Comprehensive cluster validation."""
    return {
        'silhouette': silhouette_score(X, labels),  # Higher = better
        'calinski_harabasz': calinski_harabasz_score(X, labels),  # Higher = better
        'davies_bouldin': davies_bouldin_score(X, labels),  # Lower = better
        'n_clusters': len(set(labels) - {-1})  # Exclude DBSCAN noise label (-1)
    }
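Example usage with the K-Means labels from the quick-start block (a sketch; the three indices require at least two clusters):
results = validate_clustering(X_scaled, labels)
for metric, value in results.items():
    print(f"{metric}: {value}")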
# TODO: Implement elbow method to find optimal K
# Plot inertia vs K and identify elbow point
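One way to approach this TODO, reusing the inertia values that find_optimal_k already collects (a sketch; the elbow is read off the plot by eye rather than detected programmatically):
import matplotlib.pyplot as plt

best_k, metrics = find_optimal_k(X_scaled, max_k=15)
plt.plot(range(2, 16), metrics['inertia'], marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()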
# TODO: Compare K-Means, DBSCAN, and HDBSCAN
# on the same dataset using silhouette score
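A possible sketch for this comparison (assumes X_scaled is defined; DBSCAN and HDBSCAN may label noise as -1, so the score is only computed when at least two clusters are found):
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
import hdbscan

candidates = {
    'kmeans': KMeans(n_clusters=5, random_state=42, n_init=10),
    'dbscan': DBSCAN(eps=0.5, min_samples=5),
    'hdbscan': hdbscan.HDBSCAN(min_cluster_size=15),
}
for name, model in candidates.items():
    labels_ = model.fit_predict(X_scaled)
    n_found = len(set(labels_) - {-1})
    if n_found >= 2:
        print(f"{name}: {n_found} clusters, silhouette={silhouette_score(X_scaled, labels_):.3f}")
    else:
        print(f"{name}: too few clusters to score")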
import pytest
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score

def test_clustering_finds_groups():
    """Test clustering finds expected number of clusters."""
    X, y_true = make_blobs(n_samples=100, centers=3, random_state=42)
    kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X)
    assert len(set(labels)) == 3

def test_scaling_improves_score():
    """Test that scaling improves clustering quality."""
    X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
    X[:, 0] *= 100  # Make first feature much larger
    # Without scaling
    labels_raw = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)
    score_raw = silhouette_score(X, labels_raw)
    # With scaling
    X_scaled = StandardScaler().fit_transform(X)
    labels_scaled = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_scaled)
    score_scaled = silhouette_score(X_scaled, labels_scaled)
    assert score_scaled > score_raw
| Problem | Cause | Solution |
|---|---|---|
| All in one cluster | Wrong eps/K | Reduce eps or increase K |
| Too many clusters | Parameters too sensitive | Increase eps or min_samples |
| Poor silhouette | Wrong algorithm | Try different clustering method |
| Memory error | Large dataset | Use MiniBatchKMeans (sketch below) |
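For the memory-error row, MiniBatchKMeans processes the data in small batches instead of all at once (a minimal sketch; batch_size is illustrative):
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=42, n_init=10)
labels_mbk = mbk.fit_predict(X_scaled)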
Version: 1.4.0 | Status: Production Ready