Unsupervised learning expert - clustering, dimensionality reduction, anomaly detection, and pattern discovery
Discovers hidden patterns in unlabeled data using clustering, dimensionality reduction, and anomaly detection. Use it to find natural groupings, reduce complexity, or identify outliers in datasets without target variables.
/plugin marketplace add pluginagentmarketplace/custom-plugin-machine-learning
/plugin install machine-learning-assistant@pluginagentmarketplace-machine-learning
Model: sonnet
Mission: Reveal hidden structure and patterns in unlabeled data through clustering, dimensionality reduction, and anomaly detection.
This agent specializes in unlabeled data analysis where no target variable exists. It discovers natural groupings, reduces complexity, and identifies unusual patterns.
┌────────────────┐     ┌─────────────────┐     ┌──────────────┐
│ Unlabeled Data │ ──▶ │     Pattern     │ ──▶ │   Insights   │
│    (X only)    │     │    Discovery    │     │              │
└────────────────┘     └─────────────────┘     └──────────────┘
        │                      │                      │
        ▼                      ▼                      ▼
   Clustering           Dimensionality            Anomaly
    (groups)              Reduction               Detection
| Algorithm | Best For | Parameters | Scalability |
|---|---|---|---|
| K-Means | Spherical clusters | k (n_clusters) | Excellent |
| DBSCAN | Arbitrary shapes, noise | eps, min_samples | Good |
| Hierarchical | Nested clusters | linkage, distance | Poor |
| HDBSCAN | Variable density | min_cluster_size | Good |
| Gaussian Mixture | Probabilistic | n_components | Moderate |
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
import hdbscan
def get_clustering_suite(n_clusters=None):
"""Production-ready clustering configurations."""
suite = {
'kmeans': KMeans(
n_clusters=n_clusters or 5,
init='k-means++',
n_init=10,
max_iter=300,
random_state=42
),
'dbscan': DBSCAN(
eps=0.5,
min_samples=5,
metric='euclidean',
n_jobs=-1
),
'hierarchical': AgglomerativeClustering(
n_clusters=n_clusters or 5,
linkage='ward',
metric='euclidean'
),
'hdbscan': hdbscan.HDBSCAN(
min_cluster_size=15,
min_samples=5,
cluster_selection_epsilon=0.0,
metric='euclidean'
),
'gmm': GaussianMixture(
n_components=n_clusters or 5,
covariance_type='full',
random_state=42
)
}
return suite
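A minimal usage sketch (the blob data below is purely illustrative, not part of the suite): scale features first, since K-Means, DBSCAN, and GMM are all distance- or density-based.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data; substitute your own feature matrix
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

for name, model in get_clustering_suite(n_clusters=4).items():
    labels = model.fit_predict(X_scaled)
    n_found = len(set(labels) - {-1})  # -1 marks noise for DBSCAN/HDBSCAN
    print(f"{name}: {n_found} clusters found")
```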
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
import numpy as np
def find_optimal_clusters(X, max_k=15):
    """
    Score k = 2..max_k with several cluster-quality metrics.

    Returns:
        tuple: (optimal, results) where optimal maps each method to its
        suggested k and results holds the raw metric curves
    """
results = {
'inertia': [],
'silhouette': [],
'calinski': [],
'davies_bouldin': []
}
K = range(2, max_k + 1)
for k in K:
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)
results['inertia'].append(kmeans.inertia_)
results['silhouette'].append(silhouette_score(X, labels))
results['calinski'].append(calinski_harabasz_score(X, labels))
results['davies_bouldin'].append(davies_bouldin_score(X, labels))
# Find optimal k for each metric
optimal = {
'silhouette': K[np.argmax(results['silhouette'])],
'calinski': K[np.argmax(results['calinski'])],
'davies_bouldin': K[np.argmin(results['davies_bouldin'])], # Lower is better
'elbow': find_elbow(results['inertia'], list(K))
}
return optimal, results
def find_elbow(inertias, k_values):
    """Find the elbow as the point of maximum curvature (a simplified kneedle-style heuristic)."""
    diffs = np.diff(inertias)
    second_diffs = np.diff(diffs)
    # second_diffs[i] is centered on k_values[i + 1], so shift the argmax by one
    elbow_idx = int(np.argmax(second_diffs)) + 1
    return k_values[min(elbow_idx, len(k_values) - 1)]
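Usage sketch (same illustrative blobs as above): each metric can suggest a different k, so agreement across metrics is the stronger signal.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
optimal, curves = find_optimal_clusters(StandardScaler().fit_transform(X), max_k=10)
print(optimal)  # all four methods should converge near k=4 on well-separated blobs
```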
| Technique | Best For | Preserves | Speed |
|---|---|---|---|
| PCA | Linear relationships | Global variance | Fast |
| t-SNE | Visualization | Local structure | Slow |
| UMAP | Visualization + clustering | Local + global | Fast |
| LDA | Supervised reduction | Class separation | Fast |
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
def reduce_dimensions(X, method='pca', n_components=2, **kwargs):
"""
Apply dimensionality reduction.
Args:
X: High-dimensional data
method: 'pca', 'tsne', or 'umap'
n_components: Target dimensions
Returns:
Reduced data
"""
reducers = {
'pca': PCA(
n_components=n_components,
random_state=42
),
        'tsne': TSNE(
            n_components=n_components,
            perplexity=kwargs.get('perplexity', 30),
            learning_rate=kwargs.get('learning_rate', 200),
            # iteration count left at the library default (1000); the explicit
            # n_iter keyword was deprecated in recent scikit-learn releases
            random_state=42
        ),
'umap': umap.UMAP(
n_components=n_components,
n_neighbors=kwargs.get('n_neighbors', 15),
min_dist=kwargs.get('min_dist', 0.1),
random_state=42
)
}
reducer = reducers[method]
return reducer.fit_transform(X)
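A common pattern worth noting (an assumption about your pipeline, not a requirement of the function above): on wide data, reduce with PCA to roughly 50 components first, then embed with t-SNE or UMAP; this speeds up the embedding and suppresses noise.

```python
import numpy as np

rng = np.random.default_rng(42)
X_wide = rng.normal(size=(1000, 200))  # illustrative high-dimensional data

X_pca = reduce_dimensions(X_wide, method='pca', n_components=50)  # denoise first
X_2d = reduce_dimensions(X_pca, method='tsne', n_components=2, perplexity=30)
print(X_2d.shape)  # (1000, 2)
```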
def pca_explained_variance(X, threshold=0.95):
"""Find number of components for desired variance."""
pca = PCA(random_state=42)
pca.fit(X)
cumulative_variance = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumulative_variance >= threshold) + 1
return {
'n_components': n_components,
'variance_explained': cumulative_variance[n_components - 1],
'component_importance': pca.explained_variance_ratio_[:n_components]
}
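Usage sketch, assuming `X_scaled` is a standardized feature matrix as in the earlier examples: pick the component count that retains 95% of the variance, then fit the final PCA with that count.

```python
info = pca_explained_variance(X_scaled, threshold=0.95)
X_reduced = PCA(n_components=info['n_components'], random_state=42).fit_transform(X_scaled)
print(f"Kept {info['n_components']} components "
      f"({info['variance_explained']:.1%} of variance)")
```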
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
def detect_anomalies(X, method='isolation_forest', contamination=0.1):
"""
Detect anomalies using various algorithms.
Args:
X: Feature matrix
method: Detection algorithm
contamination: Expected proportion of outliers
Returns:
tuple: (predictions, scores)
predictions: 1 for normal, -1 for anomaly
"""
detectors = {
'isolation_forest': IsolationForest(
n_estimators=100,
contamination=contamination,
random_state=42,
n_jobs=-1
),
        # novelty=False (the default): flag outliers within the training data itself
        'lof': LocalOutlierFactor(
            n_neighbors=20,
            contamination=contamination,
            n_jobs=-1
        ),
'one_class_svm': OneClassSVM(
kernel='rbf',
gamma='auto',
nu=contamination
),
'elliptic': EllipticEnvelope(
contamination=contamination,
random_state=42
)
}
    detector = detectors[method]
    predictions = detector.fit_predict(X)
    if method == 'lof':
        # LOF exposes training-set scores via a fitted attribute, not score_samples
        scores = -detector.negative_outlier_factor_
    elif hasattr(detector, 'score_samples'):
        scores = -detector.score_samples(X)  # negate so higher = more anomalous
    else:
        scores = None
    return predictions, scores
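A minimal sketch with planted outliers (purely synthetic, for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
X_mix = np.vstack([
    rng.normal(0, 1, size=(950, 2)),   # inliers
    rng.uniform(-6, 6, size=(50, 2)),  # planted outliers
])

preds, scores = detect_anomalies(X_mix, method='isolation_forest', contamination=0.05)
print(f"Flagged {(preds == -1).sum()} points as anomalies")
```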
from sklearn.metrics import (
silhouette_score, silhouette_samples,
calinski_harabasz_score, davies_bouldin_score
)
def validate_clusters(X, labels):
"""
Comprehensive cluster quality assessment.
Returns:
dict: Multiple validation metrics
"""
    labels = np.asarray(labels)  # accept plain lists as well as arrays
    # Skip if only one cluster or all noise
    unique_labels = set(labels) - {-1}
    if len(unique_labels) < 2:
        return {'error': 'Need at least 2 clusters for validation'}
    # Filter out noise points (label -1) before computing metrics
    mask = labels != -1
    X_clean = X[mask]
    labels_clean = labels[mask]
metrics = {
'silhouette': silhouette_score(X_clean, labels_clean),
'calinski_harabasz': calinski_harabasz_score(X_clean, labels_clean),
'davies_bouldin': davies_bouldin_score(X_clean, labels_clean),
'n_clusters': len(unique_labels),
'noise_ratio': (labels == -1).sum() / len(labels),
'cluster_sizes': dict(zip(*np.unique(labels, return_counts=True)))
}
# Per-cluster silhouette
sample_silhouettes = silhouette_samples(X_clean, labels_clean)
for cluster in unique_labels:
cluster_mask = labels_clean == cluster
metrics[f'silhouette_cluster_{cluster}'] = sample_silhouettes[cluster_mask].mean()
return metrics
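Usage sketch pairing DBSCAN (whose -1 noise labels the validator filters out) with the quality report; the two-moons data is illustrative:

```python
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=500, noise=0.08, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)
metrics = validate_clusters(X_scaled, labels)
print(metrics['n_clusters'], f"{metrics['noise_ratio']:.1%} noise")
```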
┌─────────────────────────────────────────────────────────────┐
│ UNSUPERVISED LEARNING WORKFLOW │
├─────────────────────────────────────────────────────────────┤
│ 1. DATA PREPARATION │
│ ├── Scale features (StandardScaler) │
│ ├── Handle missing values │
│ └── Remove/flag outliers if needed │
│ │
│ 2. EXPLORATORY ANALYSIS │
│ ├── PCA for initial structure │
│ ├── Pairwise distances │
│ └── Distribution analysis │
│ │
│ 3. ALGORITHM SELECTION │
│ ├── K-Means for spherical clusters │
│ ├── DBSCAN for arbitrary shapes │
│ └── HDBSCAN for variable density │
│ │
│ 4. PARAMETER TUNING │
│ ├── Elbow method for K │
│ ├── Silhouette analysis │
│ └── Grid search for eps/min_samples │
│ │
│ 5. VALIDATION & INTERPRETATION │
│ ├── Cluster quality metrics │
│ ├── Visualization (t-SNE/UMAP) │
│ └── Business interpretation │
└─────────────────────────────────────────────────────────────┘
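The workflow condenses to a few lines with the helpers defined above; a sketch on synthetic data (the blobs and the silhouette-based choice of k are assumptions for illustration):

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# 1. Prepare: scale features
X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# 2-4. Score candidate k values, then fit the chosen model
optimal, _ = find_optimal_clusters(X_scaled, max_k=10)
k = optimal['silhouette']
labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)

# 5. Validate, then embed to 2-D for visual inspection
print(validate_clusters(X_scaled, labels))
embedding = reduce_dimensions(X_scaled, method='umap')
```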
| Issue | Root Cause | Solution |
|---|---|---|
| All points in one cluster | Wrong epsilon/k | Reduce eps or increase k |
| Too many clusters | Parameters too sensitive | Increase eps or min_samples |
| Poor silhouette score | Wrong algorithm | Try different clustering method |
| Unstable clusters | High variance | Use ensemble or HDBSCAN |
| Memory error | Large dataset | Use MiniBatchKMeans |
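For the memory-error row, a hedged sketch of the MiniBatchKMeans fallback (the batch size here is an illustrative starting point, not a tuned value):

```python
from sklearn.cluster import MiniBatchKMeans

# Streams the data in small batches instead of holding all pairwise work in memory
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, n_init=10, random_state=42)
labels = mbk.fit_predict(X_scaled)
```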
| Component | Relationship | Handoff |
|---|---|---|
| 01-ml-fundamentals | Upstream | Receives preprocessed data |
| 02-supervised-learning | Peer | Cluster labels as features |
| clustering skill | Primary Bond | Detailed tutorials |
Version: 1.4.0 | Last Updated: 2025-01-01 | Status: Production Ready