From sciagent-skills
Decomposes multi-omics datasets (scRNA-seq, ATAC-seq, proteomics, methylation) into latent factors using MOFA+ and mofapy2. Captures variation, supports multi-group designs, correlates with metadata, visualizes and enriches factors.
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skillsThis skill uses the workspace's default tool permissions.
MOFA+ (Multi-Omics Factor Analysis v2) is an unsupervised statistical framework that jointly decomposes multiple omics datasets into a small set of latent factors. Each factor captures an independent source of variation (e.g., cell cycle, a disease phenotype, a technical batch) and is associated with feature weights (loadings) that reveal which genes, peaks, or proteins drive it. The Python pac...
Analyzes multi-modal single-cell data with muon/MuData: joint RNA+ATAC (10x Multiome), CITE-seq (RNA+protein). WNN embeddings, per-modality preprocessing, MOFA factors.
Integrates multi-omics datasets (transcriptomics, proteomics, epigenomics, genomics, metabolomics) for systems biology and precision medicine via cross-omics correlation, MOFA+/NMF clustering, pathway integration, and sample matching.
Guides scvi-tools for single-cell omics analysis including scRNA-seq, scATAC-seq, CITE-seq, spatial transcriptomics: probabilistic modeling, batch correction, integration, DE, annotation.
Share bugs, ideas, or general feedback.
MOFA+ (Multi-Omics Factor Analysis v2) is an unsupervised statistical framework that jointly decomposes multiple omics datasets into a small set of latent factors. Each factor captures an independent source of variation (e.g., cell cycle, a disease phenotype, a technical batch) and is associated with feature weights (loadings) that reveal which genes, peaks, or proteins drive it. The Python package mofapy2 produces an HDF5 model file compatible with downstream analysis in both Python and R. MOFA+ extends the original MOFA to support multi-group settings where samples belong to distinct cohorts or conditions.
mofapy2>=0.7, anndata>=0.10, numpy>=1.24, pandas>=2.0, scipy>=1.11, matplotlib>=3.7, seaborn>=0.13, muon (optional, for MuData integration)MOFA2 Bioconductor packagepip install mofapy2 anndata muon matplotlib seaborn
import numpy as np
import pandas as pd
import anndata as ad
from mofapy2.run.entry_point import entry_point
# Simulate two omics views, 200 cells, 500 RNA genes, 300 ATAC peaks
np.random.seed(42)
n_cells, n_rna, n_atac = 200, 500, 300
adata_rna = ad.AnnData(np.abs(np.random.randn(n_cells, n_rna)),
obs=pd.DataFrame(index=[f"cell_{i}" for i in range(n_cells)]))
adata_atac = ad.AnnData(np.abs(np.random.randn(n_cells, n_atac)),
obs=adata_rna.obs.copy())
ent = entry_point()
ent.set_data_options(scale_groups=False, scale_views=True)
ent.set_data_matrix([[adata_rna.X, adata_atac.X]],
likelihoods=["gaussian", "gaussian"],
views_names=["RNA", "ATAC"],
groups_names=["all_cells"],
samples_names=[list(adata_rna.obs_names)])
ent.set_model_options(factors=10)
ent.set_train_options(iter=500, convergence_mode="fast", seed=42)
ent.build()
ent.run()
ent.save("mofa_model.hdf5")
print("Model saved to mofa_model.hdf5")
Each omics layer is represented as an AnnData object. Align cell indices across modalities, log-normalize RNA counts, and binarize or normalize ATAC/methylation data as appropriate.
import numpy as np
import pandas as pd
import anndata as ad
import scipy.sparse as sp
# --- RNA-seq: 200 cells x 2000 highly variable genes ---
np.random.seed(42)
n_cells = 200
cell_ids = [f"cell_{i:03d}" for i in range(n_cells)]
# Simulate log-normalized counts (in practice: load from h5ad after Scanpy preprocessing)
rna_counts = np.abs(np.random.randn(n_cells, 2000) * 2)
adata_rna = ad.AnnData(
X=rna_counts,
obs=pd.DataFrame(
{"condition": ["A"] * 100 + ["B"] * 100,
"patient": [f"P{i % 10}" for i in range(n_cells)]},
index=cell_ids
),
var=pd.DataFrame(index=[f"Gene_{i}" for i in range(2000)])
)
# --- ATAC-seq: same cells x 1000 peaks ---
atac_matrix = (np.random.rand(n_cells, 1000) > 0.8).astype(float)
adata_atac = ad.AnnData(
X=atac_matrix,
obs=adata_rna.obs.copy(),
var=pd.DataFrame(index=[f"Peak_{i}" for i in range(1000)])
)
# Confirm alignment
assert list(adata_rna.obs_names) == list(adata_atac.obs_names), "Cell indices must match"
print(f"RNA: {adata_rna.shape}, ATAC: {adata_atac.shape}")
print(f"Conditions: {adata_rna.obs['condition'].value_counts().to_dict()}")
Instantiate the entry_point and register all data views. Views are provided as a list-of-lists: data[groups][views]. Assign meaningful view and group names for interpretability.
from mofapy2.run.entry_point import entry_point
ent = entry_point()
# Configure data options before setting data
ent.set_data_options(
scale_groups=False, # Do not rescale variance between groups
scale_views=True, # Rescale each view to unit variance (recommended when views differ in scale)
)
# Provide data as list-of-lists: [groups][views]
# Single group → wrap each view in a list
ent.set_data_matrix(
data=[[adata_rna.X, adata_atac.X]], # outer list = groups, inner = views
likelihoods=["gaussian", "bernoulli"], # gaussian for continuous, bernoulli for binary ATAC
views_names=["RNA", "ATAC"],
groups_names=["all_cells"],
samples_names=[list(adata_rna.obs_names)] # one list per group
)
print("Data registered. Views: RNA, ATAC | Groups: all_cells")
Configure the number of factors and training hyperparameters. More factors capture finer variation but increase computation and risk overfitting; convergence_mode="medium" balances speed and accuracy.
# Model options
ent.set_model_options(
factors=15, # Number of latent factors (start with 15; prune inactive ones automatically)
spikeslab_weights=True, # Sparse weight prior (ARD+spike-slab); recommended for feature selection
ard_factors=True, # Automatic relevance determination per factor per view
ard_weights=True, # ARD per feature weight; enables pruning of irrelevant features
)
# Training options
ent.set_train_options(
iter=1000, # Maximum EM iterations
convergence_mode="medium", # "fast" (<1000 iter), "medium" (default), "slow" (>5000 iter)
startELBO=1, # Start computing ELBO from iteration 1
freqELBO=5, # Compute ELBO every 5 iterations
dropR2=0.01, # Drop factors explaining < 1% variance (set to None to disable)
seed=42,
verbose=False
)
print("Model and training options set")
Build the internal data structures, then run variational inference. Training produces a fitted model where each factor's weights and scores are optimized to maximize the evidence lower bound (ELBO).
# Build internal model structure
ent.build()
# Run training (EM algorithm with variational Bayes updates)
ent.run()
# Save trained model to HDF5 — required for downstream analysis
output_path = "mofa_model.hdf5"
ent.save(output_path, overwrite=True)
print(f"Model trained and saved to {output_path}")
Load the HDF5 model and inspect how much variance each factor explains per view. Factors explaining less than ~1-2% total variance are typically noise.
import h5py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def load_mofa_r2(model_path):
"""Extract variance explained (R2) per factor per view from MOFA+ HDF5."""
with h5py.File(model_path, "r") as f:
r2 = f["variance_explained"]["r2_per_factor"]
views = [v.decode() for v in f["views"]["views"][:]]
groups = [g.decode() for g in f["groups"]["groups"][:]]
# r2_per_factor: shape (n_groups, n_views, n_factors)
r2_array = np.stack([r2[g][:] for g in groups], axis=0) # (groups, views, factors)
# Average across groups; shape → (n_views, n_factors)
r2_mean = r2_array.mean(axis=0)
n_factors = r2_mean.shape[1]
df = pd.DataFrame(r2_mean * 100,
index=views,
columns=[f"Factor{i+1}" for i in range(n_factors)])
return df
r2_df = load_mofa_r2("mofa_model.hdf5")
print("Variance explained (%) per factor per view:")
print(r2_df.round(2))
# Heatmap of variance explained
fig, ax = plt.subplots(figsize=(max(8, r2_df.shape[1] * 0.6), 3))
sns.heatmap(r2_df, annot=True, fmt=".1f", cmap="YlOrRd",
linewidths=0.5, ax=ax, vmin=0)
ax.set_title("MOFA+ Variance Explained (%) per Factor per View")
ax.set_xlabel("Factor")
ax.set_ylabel("Omics View")
plt.tight_layout()
plt.savefig("mofa_variance_explained.png", dpi=200)
print("Saved mofa_variance_explained.png")
Factor scores (Z matrix) are the per-sample coordinates in factor space. Correlate scores with continuous or categorical metadata to biologically annotate each factor.
import scipy.stats as stats
def load_mofa_factors(model_path):
"""Load factor scores (Z) from MOFA+ HDF5. Returns DataFrame (samples x factors)."""
with h5py.File(model_path, "r") as f:
groups = [g.decode() for g in f["groups"]["groups"][:]]
factors_list = []
for g in groups:
z = f["expectations"]["Z"][g][:] # shape: (n_factors, n_samples)
samples = [s.decode() for s in f["samples"][g][:]]
n_factors = z.shape[0]
df = pd.DataFrame(z.T, index=samples,
columns=[f"Factor{i+1}" for i in range(n_factors)])
factors_list.append(df)
return pd.concat(factors_list, axis=0)
factors_df = load_mofa_factors("mofa_model.hdf5")
print(f"Factor scores: {factors_df.shape} (cells x factors)")
# Merge with metadata
meta = adata_rna.obs[["condition", "patient"]].copy()
factors_meta = factors_df.join(meta)
# Point-biserial correlation: factor score vs binary metadata
factor_cols = [c for c in factors_meta.columns if c.startswith("Factor")]
print("\nFactor–Condition correlation (eta-squared approximation):")
for fc in factor_cols[:5]:
groups_vals = [factors_meta.loc[factors_meta["condition"] == g, fc].values
for g in factors_meta["condition"].unique()]
stat, pval = stats.f_oneway(*groups_vals)
print(f" {fc}: F={stat:.2f}, p={pval:.3f}")
Plot factor score scatter plots colored by metadata, and heatmaps of the top-weighted features (loadings) per factor to understand what each factor captures.
def load_mofa_weights(model_path, view_name):
"""Load feature weights (W) for a specific view. Returns DataFrame (features x factors)."""
with h5py.File(model_path, "r") as f:
views = [v.decode() for v in f["views"]["views"][:]]
view_idx = views.index(view_name)
# Weights stored per view as (n_factors, n_features)
w = f["expectations"]["W"][view_name][:]
features = [ft.decode() for ft in f["features"][view_name][:]]
n_factors = w.shape[0]
df = pd.DataFrame(w.T, index=features,
columns=[f"Factor{i+1}" for i in range(n_factors)])
return df
# --- Factor scatter plot ---
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, (fx, fy) in zip(axes, [("Factor1", "Factor2"), ("Factor1", "Factor3")]):
for cond, grp in factors_meta.groupby("condition"):
ax.scatter(grp[fx], grp[fy], label=cond, alpha=0.6, s=20)
ax.set_xlabel(fx)
ax.set_ylabel(fy)
ax.legend(title="Condition")
ax.set_title(f"{fx} vs {fy}")
plt.tight_layout()
plt.savefig("mofa_factor_scatter.png", dpi=200)
print("Saved mofa_factor_scatter.png")
# --- Top-loading heatmap for RNA view ---
weights_rna = load_mofa_weights("mofa_model.hdf5", "RNA")
top_n = 20
top_features = []
for fc in [f"Factor{i+1}" for i in range(min(5, weights_rna.shape[1]))]:
top_pos = weights_rna[fc].nlargest(top_n // 2).index.tolist()
top_neg = weights_rna[fc].nsmallest(top_n // 2).index.tolist()
top_features.extend(top_pos + top_neg)
top_features = list(dict.fromkeys(top_features)) # unique, preserve order
fig, ax = plt.subplots(figsize=(8, max(6, len(top_features) * 0.3)))
sns.heatmap(weights_rna.loc[top_features, [f"Factor{i+1}" for i in range(min(5, weights_rna.shape[1]))]],
cmap="RdBu_r", center=0, ax=ax, yticklabels=True)
ax.set_title("Top RNA Feature Weights per Factor")
plt.tight_layout()
plt.savefig("mofa_rna_weights_heatmap.png", dpi=200)
print("Saved mofa_rna_weights_heatmap.png")
Use factor scores as a low-dimensional embedding for clustering, and extract top-weighted genes per factor for pathway enrichment.
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import warnings
# --- Cluster cells using factor scores ---
factor_cols = [c for c in factors_df.columns if c.startswith("Factor")]
X_factors = StandardScaler().fit_transform(factors_df[factor_cols].values)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
with warnings.catch_warnings():
warnings.simplefilter("ignore")
cluster_labels = kmeans.fit_predict(X_factors)
factors_df["mofa_cluster"] = cluster_labels.astype(str)
print(f"K-means clusters (k=4):\n{pd.Series(cluster_labels).value_counts().sort_index()}")
# --- Extract top-weighted genes per factor for enrichment input ---
weights_rna = load_mofa_weights("mofa_model.hdf5", "RNA")
enrichment_inputs = {}
for fc in [f"Factor{i+1}" for i in range(min(5, weights_rna.shape[1]))]:
# Positive weights: top activating genes; negative: top repressing genes
top_pos = weights_rna[fc].nlargest(100).index.tolist()
top_neg = weights_rna[fc].nsmallest(100).index.tolist()
enrichment_inputs[f"{fc}_positive"] = top_pos
enrichment_inputs[f"{fc}_negative"] = top_neg
# Save gene lists for external enrichment (e.g., gseapy Enrichr)
for name, genes in list(enrichment_inputs.items())[:2]:
print(f"\n{name} — top 10: {genes[:10]}")
# Example: run ORA with gseapy (install separately)
# import gseapy
# enr = gseapy.enrichr(gene_list=enrichment_inputs["Factor1_positive"],
# gene_sets="GO_Biological_Process_2023",
# organism="human", outdir=None)
# print(enr.results.head(5)[["Term", "Adjusted P-value", "Genes"]])
factors_df.to_csv("mofa_factor_scores.csv")
print("\nFactor scores with cluster labels saved to mofa_factor_scores.csv")
| Parameter | Default | Range / Options | Effect |
|---|---|---|---|
factors | 15 | 5–50 | Number of latent factors to infer; inactive ones pruned by ARD |
likelihoods | (required) | "gaussian", "bernoulli", "poisson" | Per-view likelihood; gaussian for normalized continuous, bernoulli for binary ATAC, poisson for raw counts |
scale_views | True | True/False | Rescale each view to unit variance; recommended when views differ in scale or unit |
scale_groups | False | True/False | Rescale variance across groups; set True if groups have very different total variances |
spikeslab_weights | True | True/False | Spike-and-slab sparsity prior on weights; enables feature selection via near-zero weights |
ard_factors | True | True/False | Automatic relevance determination per factor per view; prunes factors not used in a view |
iter | 1000 | 200–5000 | Maximum EM iterations; convergence usually reached in 200–800 |
convergence_mode | "medium" | "fast", "medium", "slow" | ELBO convergence tolerance: fast = 1e-4, medium = 1e-6, slow = 1e-8 |
dropR2 | 0.01 | None, 0.001–0.05 | Drop factors explaining less than this fraction of variance; None keeps all |
startELBO | 1 | 1–100 | Iteration to start ELBO monitoring; set higher to skip initial instability |
Each latent factor Z_k (a vector of length n_samples) represents a source of variation. The corresponding weight matrix W_k (a vector of length n_features per view) contains the loading of each feature on that factor. Factors with spike-and-slab priors produce sparse loadings: most weights are shrunk to near zero, leaving a small set of features that meaningfully drive the factor. A positive weight means higher factor score correlates with higher feature expression.
A "group" in MOFA+ is a set of samples that share the same factor weight matrices W but have independent factor score distributions Z. Use multi-group analysis when:
Single-group analysis (all samples in one group) is appropriate when samples are from a single experiment with no major batch structure.
The R2 heatmap (views x factors) is the primary diagnostic output. Factors should show:
Use when comparing two or more patient groups or experimental conditions, where you want to identify condition-specific vs shared factors.
from mofapy2.run.entry_point import entry_point
import numpy as np
import pandas as pd
import anndata as ad
# Simulate two groups: condition A (100 cells) and condition B (100 cells)
np.random.seed(0)
n_per_group, n_genes, n_peaks = 100, 1000, 500
groups = {"condA": {}, "condB": {}}
for g in groups:
groups[g]["rna"] = np.abs(np.random.randn(n_per_group, n_genes))
groups[g]["atac"] = (np.random.rand(n_per_group, n_peaks) > 0.75).astype(float)
sample_ids_A = [f"A_cell_{i}" for i in range(n_per_group)]
sample_ids_B = [f"B_cell_{i}" for i in range(n_per_group)]
ent = entry_point()
ent.set_data_options(scale_groups=False, scale_views=True)
# Multi-group: data[groups][views] — two groups, each with RNA and ATAC
ent.set_data_matrix(
data=[[groups["condA"]["rna"], groups["condA"]["atac"]],
[groups["condB"]["rna"], groups["condB"]["atac"]]],
likelihoods=["gaussian", "bernoulli"],
views_names=["RNA", "ATAC"],
groups_names=["condA", "condB"],
samples_names=[sample_ids_A, sample_ids_B]
)
ent.set_model_options(factors=10, spikeslab_weights=True, ard_factors=True)
ent.set_train_options(iter=500, convergence_mode="fast", seed=0, verbose=False)
ent.build()
ent.run()
ent.save("mofa_multigroup.hdf5", overwrite=True)
# Compare factor scores between groups
factors_all = load_mofa_factors("mofa_multigroup.hdf5")
factors_all["group"] = ["condA"] * n_per_group + ["condB"] * n_per_group
print(f"Multi-group model trained. Factor scores: {factors_all.shape}")
print(factors_all.groupby("group")[["Factor1", "Factor2"]].mean().round(3))
Retrieve the top positive and negative loading genes per factor and print a summary table for biological annotation.
import pandas as pd
def annotate_factors(model_path, view_name="RNA", top_n=20, n_factors=5):
"""
Summarize top-loading features per factor to assist biological annotation.
Returns a DataFrame with factor names, top positive genes, top negative genes,
and the absolute weight range (an activity proxy).
"""
weights = load_mofa_weights(model_path, view_name)
factor_cols = [f"Factor{i+1}" for i in range(min(n_factors, weights.shape[1]))]
rows = []
for fc in factor_cols:
w = weights[fc]
top_pos = w.nlargest(top_n).index.tolist()
top_neg = w.nsmallest(top_n).index.tolist()
rows.append({
"Factor": fc,
"Max_weight": round(w.max(), 4),
"Min_weight": round(w.min(), 4),
"Top_positive": ", ".join(top_pos[:5]),
"Top_negative": ", ".join(top_neg[:5]),
})
summary = pd.DataFrame(rows)
return summary
summary_df = annotate_factors("mofa_model.hdf5", view_name="RNA", top_n=20, n_factors=5)
print("\nFactor annotation summary (RNA view):")
print(summary_df.to_string(index=False))
summary_df.to_csv("mofa_factor_annotation.csv", index=False)
print("\nSaved mofa_factor_annotation.csv")
| Output | Description |
|---|---|
mofa_model.hdf5 | Trained MOFA+ model — factor scores, weights, ELBO trace, variance explained |
mofa_variance_explained.png | Heatmap of R2 (%) per factor per view; primary diagnostic for factor selection |
mofa_factor_scatter.png | Scatter plots of Factor1 vs Factor2/3 colored by metadata (condition, patient) |
mofa_rna_weights_heatmap.png | Heatmap of top RNA feature weights across the first 5 factors |
mofa_factor_scores.csv | Table of per-cell factor scores (cells x factors) with cluster labels |
mofa_factor_annotation.csv | Factor annotation table: top positive/negative genes per factor |
| Per-factor gene lists | Input for gseapy Enrichr or GSEA to identify enriched pathways per factor |
| Problem | Cause | Solution |
|---|---|---|
KeyError in set_data_matrix | Mismatched number of groups, views, or sample list dimensions | Ensure data, likelihoods, views_names, groups_names, samples_names all have consistent lengths |
| All factors show near-zero variance explained | Data not preprocessed or scale mismatch across views | Normalize each view before input; set scale_views=True; verify non-zero variance in input matrices |
| Model trains but factor scores are NaN | Convergence failure due to extreme values or near-singular data | Check for Inf/NaN in input matrices; reduce iter; try convergence_mode="fast" first |
| Too many factors pruned (only 1-2 remain) | dropR2 threshold too aggressive or insufficient variation in data | Set dropR2=0.001 or dropR2=None; increase data diversity or reduce noise |
HDF5 file cannot be read | File truncated due to crash during training | Re-run training; check disk space; use overwrite=True in ent.save() |
| Factor scores identical across all samples | Single-sample group or zero-variance input view | Confirm at least 2 distinct samples per group; check input matrix is not all zeros |
| Very slow training (>1 hr) | Large feature space (>10k features per view) or many factors | Pre-filter to top HVGs (2000-5000) per view; reduce factors to 10-15; enable verbose=False |
| ELBO not converging (oscillates) | Learning rate instability or poorly scaled data | Increase startELBO; standardize each view independently; use convergence_mode="slow" |
| Weights all near zero for one view | Bernoulli likelihood on continuous data or vice versa | Verify likelihoods list matches view data types; use "gaussian" for normalized RNA |
ModuleNotFoundError: mofapy2 | Package not installed | pip install mofapy2 |