Infers gene regulatory networks from bulk or single-cell RNA-seq data using GRNBoost2 (gradient boosting) or GENIE3 (Random Forest). Dask-parallelized; outputs TF-target links for pySCENIC.
```bash
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills
```

This skill uses the workspace's default tool permissions.
Arboreto infers gene regulatory networks (GRNs) from gene expression data using parallelized tree-based regression. For each target gene, it trains a regression model with all other genes (or a specified TF list) as features and emits TF-target-importance triplets. It provides two interchangeable algorithms -- GRNBoost2 (gradient boosting, fast) and GENIE3 (Random Forest, classic) -- sharing identical input/output formats. Computation is Dask-parallelized, scaling from laptop cores to HPC clusters.
Dependencies: arboreto, pandas, numpy, dask, distributed, scikit-learn, scipy. Optional: networkx, matplotlib for visualization.

```bash
pip install arboreto distributed networkx matplotlib
```
Complete GRN inference in a single block. The `if __name__ == '__main__':` guard is required because Dask spawns worker processes via multiprocessing.
```python
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names

if __name__ == '__main__':
    # Load expression matrix (observations x genes)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
    tf_names = load_tf_names('tf_list.txt')  # optional TF filter

    # Infer GRN (uses all local cores by default)
    network = grnboost2(expression_data=expression_matrix,
                        tf_names=tf_names, seed=777)

    # Filter top links and save
    top_network = network[network['importance'] > 1.0]
    top_network.to_csv('grn_output.tsv', sep='\t', index=False, header=False)
    print(f"Inferred {len(network)} links, kept {len(top_network)} above threshold")
    # Example: Inferred 185432 links, kept 12876 above threshold
```
Arboreto accepts a pandas DataFrame (recommended) or NumPy array. Rows are observations (cells or samples), columns are genes. Gene names must be column headers.
```python
import pandas as pd

# From TSV (genes as columns, observations as rows)
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
print(f"Shape: {expression_matrix.shape} "
      f"({expression_matrix.shape[0]} observations x {expression_matrix.shape[1]} genes)")
# Example: Shape: (5000, 18654) (5000 observations x 18654 genes)

# From AnnData (e.g., after scanpy preprocessing)
import anndata as ad

adata = ad.read_h5ad('preprocessed.h5ad')
expression_matrix = pd.DataFrame(
    adata.X.toarray() if hasattr(adata.X, 'toarray') else adata.X,
    columns=adata.var_names.tolist()
)
print(f"Converted AnnData: {expression_matrix.shape}")
# Example: Converted AnnData: (5000, 18654)
```
Providing a TF list restricts regulators to known transcription factors, reducing computation time and improving biological relevance. If omitted, all genes are treated as potential regulators.
```python
from arboreto.utils import load_tf_names

# From file (one TF name per line)
tf_names = load_tf_names('human_tfs.txt')
print(f"Loaded {len(tf_names)} transcription factors")
# Example: Loaded 1639 transcription factors

# Or define directly
tf_names = ['MYC', 'TP53', 'SOX2', 'NANOG', 'POU5F1']

# Verify TFs exist in expression matrix columns
tf_in_data = [tf for tf in tf_names if tf in expression_matrix.columns]
print(f"TFs found in expression data: {len(tf_in_data)}/{len(tf_names)}")
# Example: TFs found in expression data: 1583/1639
```
By default arboreto creates an internal Dask client using all local cores. Create an explicit client for resource control, monitoring, or cluster deployment.
```python
from distributed import LocalCluster, Client

# Custom local client with resource limits
local_cluster = LocalCluster(
    n_workers=8,
    threads_per_worker=1,  # avoid GIL contention in scikit-learn
    memory_limit='4GB'     # per worker
)
client = Client(local_cluster)
print(f"Dashboard: {client.dashboard_link}")
# Example: Dashboard: http://127.0.0.1:8787/status
```
Call `grnboost2()` (recommended) or `genie3()`. Both share the same signature and output format.
```python
from arboreto.algo import grnboost2

if __name__ == '__main__':
    network = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,         # 'all' if no TF list
        client_or_address=client,  # omit to use default local scheduler
        seed=777,                  # for reproducibility
        verbose=True               # print progress
    )
    print(f"Inferred {len(network)} regulatory links")
    print(network.head())
    # Example:
    #      TF  target  importance
    # 0   MYC    CDK4       3.214
    # 1   MYC   CCND1       2.871
    # 2  TP53  CDKN1A       2.654
```
Raw output contains links for every TF-target pair with non-zero importance. Filter to retain high-confidence regulatory interactions.
```python
# Strategy 1: Importance threshold
threshold = 1.0
filtered = network[network['importance'] > threshold]
print(f"Threshold {threshold}: {len(filtered)} links "
      f"({len(filtered)/len(network)*100:.1f}% retained)")
# Example: Threshold 1.0: 12876 links (6.9% retained)

# Strategy 2: Top N links per target gene
top_n = 10
top_per_target = (network.sort_values('importance', ascending=False)
                         .groupby('target')
                         .head(top_n)
                         .reset_index(drop=True))
print(f"Top {top_n} per target: {len(top_per_target)} links")
# Example: Top 10 per target: 186540 links
```
Save as a TSV file with three columns: TF, target, importance.
```python
# Save full network
network.to_csv('full_network.tsv', sep='\t', index=False)
print(f"Saved full network: {len(network)} links")

# Save filtered network (without header for pySCENIC compatibility)
filtered.to_csv('filtered_network.tsv', sep='\t', index=False, header=False)
print(f"Saved filtered network: {len(filtered)} links")

# Clean up Dask client if explicitly created
client.close()
local_cluster.close()
```
Plot the top regulatory interactions as a directed network graph.
```python
import networkx as nx
import matplotlib as mpl
import matplotlib.pyplot as plt

# Build directed graph from top links
top_links = network.nlargest(50, 'importance')
G = nx.from_pandas_edgelist(
    top_links, source='TF', target='target',
    edge_attr='importance', create_using=nx.DiGraph()
)

# Draw network
fig, ax = plt.subplots(figsize=(12, 10))
pos = nx.spring_layout(G, k=2, seed=42)
weights = [G[u][v]['importance'] for u, v in G.edges()]
nx.draw_networkx_nodes(G, pos, node_size=300, node_color='lightblue', ax=ax)
nx.draw_networkx_labels(G, pos, font_size=7, ax=ax)
nx.draw_networkx_edges(G, pos, edge_color=weights, edge_cmap=plt.cm.Reds,
                       width=1.5, arrows=True, ax=ax)

# With arrows=True, draw_networkx_edges returns FancyArrowPatch objects,
# which cannot feed a colorbar directly; build a ScalarMappable instead.
sm = mpl.cm.ScalarMappable(cmap=plt.cm.Reds,
                           norm=mpl.colors.Normalize(vmin=min(weights),
                                                     vmax=max(weights)))
fig.colorbar(sm, ax=ax, label='Importance')
ax.set_title('Top 50 Regulatory Interactions')
plt.tight_layout()
plt.savefig('grn_network.png', dpi=300, bbox_inches='tight')
print("Saved grn_network.png")
```
| Parameter | Default | Range / Options | Effect |
|---|---|---|---|
| `expression_data` | (required) | DataFrame or ndarray | Expression matrix, observations x genes |
| `tf_names` | `'all'` | list of str or `'all'` | Restrict regulators to known TFs; `'all'` uses every gene |
| `gene_names` | `None` | list of str | Required when `expression_data` is a NumPy array |
| `client_or_address` | `'local'` | `Client`, str, or `'local'` | Dask client instance or scheduler address |
| `seed` | `None` | int | Random seed for reproducibility; always set for replicable results |
| `verbose` | `False` | bool | Print progress messages during inference |
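One pitfall from the table worth illustrating is ndarray input, which requires `gene_names`. A minimal sketch follows; the `counts.npy` file and the placeholder gene list are illustrative assumptions, not files shipped with arboreto:

```python
# Sketch: ndarray input requires gene_names matching the column count.
# 'counts.npy' and the GENE_i names are hypothetical placeholders.
import numpy as np
from arboreto.algo import grnboost2

if __name__ == '__main__':
    matrix = np.load('counts.npy')  # observations x genes
    gene_names = [f'GENE_{i}' for i in range(matrix.shape[1])]
    network = grnboost2(expression_data=matrix,
                        gene_names=gene_names,  # required for ndarray input
                        seed=777)
    print(network.head())
```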
Both algorithms follow the same multiple-regression strategy (train one model per target gene, extract feature importances) but differ in the underlying regressor.
| Feature | GRNBoost2 | GENIE3 |
|---|---|---|
| Method | Stochastic gradient boosting with early stopping | Random Forest (or ExtraTrees) |
| Speed | Fast -- optimized for large datasets | Slower -- higher per-model cost |
| Memory | Lower (early stopping limits tree depth) | Higher (full forests per target) |
| Best for | Default choice; 10k+ observations, single-cell scale | Comparison with published GENIE3 results; validation |
| Import | `from arboreto.algo import grnboost2` | `from arboreto.algo import genie3` |
| Output format | TF, target, importance (identical) | TF, target, importance (identical) |
Decision rule: Start with GRNBoost2. Use GENIE3 only when reproducing published GENIE3 analyses or as an independent validation of GRNBoost2 results.
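To make the shared strategy concrete, here is a deliberately naive single-process sketch of the per-target regression loop. It is illustrative only: arboreto's actual implementation adds Dask parallelism, tuned regressor settings, early stopping, and importance post-processing, and `naive_grn` is a hypothetical helper name:

```python
# Conceptual sketch (NOT arboreto's implementation): one regression model
# per target gene, feature importances become TF-target link scores.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor  # GENIE3 would use RandomForestRegressor

def naive_grn(expression_matrix: pd.DataFrame, tf_names: list) -> pd.DataFrame:
    links = []
    for target in expression_matrix.columns:
        # Candidate regulators: the TF list, minus the target itself
        features = [tf for tf in tf_names if tf != target]
        model = GradientBoostingRegressor(n_estimators=100, random_state=0)
        model.fit(expression_matrix[features], expression_matrix[target])
        # Each nonzero feature importance is a candidate regulatory link
        for tf, imp in zip(features, model.feature_importances_):
            if imp > 0:
                links.append((tf, target, imp))
    return pd.DataFrame(links, columns=['TF', 'target', 'importance'])
```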
The network DataFrame has three columns, sorted by descending importance:
| Column | Type | Description |
|---|---|---|
| `TF` | str | Transcription factor (regulator) gene name |
| `target` | str | Target gene name |
| `importance` | float | Regulatory importance score (higher = stronger predicted regulation) |
Importance scores are not p-values. They represent feature importance from the tree-based model. Filter by absolute threshold or top-N per target for downstream analysis.
`if __name__ == '__main__':` requirement: Dask uses Python's multiprocessing module to spawn worker processes. Without the main guard, each spawned process re-executes the script, leading to infinite process spawning. Always wrap arboreto calls in `if __name__ == '__main__':` when running as a script; the guard is not needed inside Jupyter notebooks.
When to use: datasets too large for a single machine (e.g., 50k+ cells, 20k+ genes). Requires a Dask distributed scheduler running on the cluster.
```bash
# On head node: start scheduler
dask-scheduler
# Output: Scheduler at tcp://10.118.224.134:8786

# On each compute node: start workers
dask-worker tcp://10.118.224.134:8786 --nprocs 4 --nthreads 1 --memory-limit 16GB
```
```python
from distributed import Client
from arboreto.algo import grnboost2
import pandas as pd

if __name__ == '__main__':
    client = Client('tcp://10.118.224.134:8786')
    print(f"Connected to cluster: {len(client.scheduler_info()['workers'])} workers")
    # Example: Connected to cluster: 16 workers

    expression_data = pd.read_csv('large_scrna.tsv', sep='\t')
    tf_names = pd.read_csv('tf_list.txt', header=None)[0].tolist()

    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        client_or_address=client,
        seed=42, verbose=True
    )
    network.to_csv('cluster_grn.tsv', sep='\t', index=False)
    print(f"Cluster inference complete: {len(network)} links")
    client.close()
```
When to use: improve robustness by aggregating networks from multiple random seeds. Links appearing consistently across runs are more reliable.
```python
import pandas as pd
from distributed import LocalCluster, Client
from arboreto.algo import grnboost2

if __name__ == '__main__':
    client = Client(LocalCluster(n_workers=8, threads_per_worker=1))
    expression_data = pd.read_csv('expression_data.tsv', sep='\t')
    tf_names = pd.read_csv('tf_list.txt', header=None)[0].tolist()

    seeds = [42, 123, 456, 789, 1001]
    all_networks = []
    for seed in seeds:
        net = grnboost2(expression_data=expression_data,
                        tf_names=tf_names,
                        client_or_address=client, seed=seed)
        net['seed'] = seed
        all_networks.append(net)
        print(f"Seed {seed}: {len(net)} links")

    combined = pd.concat(all_networks, ignore_index=True)

    # Consensus: mean importance across seeds, keep links found in 3+ runs
    consensus = (combined.groupby(['TF', 'target'])
                         .agg(mean_importance=('importance', 'mean'),
                              n_seeds=('seed', 'nunique'))
                         .reset_index())
    consensus = consensus[consensus['n_seeds'] >= 3]
    consensus = consensus.sort_values('mean_importance', ascending=False)
    consensus.to_csv('consensus_network.tsv', sep='\t', index=False)
    print(f"Consensus network: {len(consensus)} links (found in 3+ of {len(seeds)} runs)")
    # Example: Consensus network: 28453 links (found in 3+ of 5 runs)
    client.close()
```
When to use: downstream regulatory analysis after arboreto GRN inference. Arboreto produces the adjacency matrix (Step 1 of SCENIC); pySCENIC performs regulon identification (Step 2) and activity scoring (Step 3).
```python
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names
import pandas as pd

if __name__ == '__main__':
    # Step 1 (arboreto): GRN inference
    expression_matrix = pd.read_csv('scrna_expression.tsv', sep='\t')
    tf_names = load_tf_names('allTFs_hg38.txt')
    adjacencies = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names, seed=777
    )
    adjacencies.to_csv('adjacencies.tsv', sep='\t', index=False, header=False)
    print(f"SCENIC Step 1 complete: {len(adjacencies)} adjacencies")
    # Example: SCENIC Step 1 complete: 234567 adjacencies

    # Steps 2-3 use pySCENIC CLI (requires pyscenic installation):
    #   pyscenic ctx adjacencies.tsv hg38_ranking.feather \
    #       --annotations_fname motifs-v10nr.hgnc-m0.001-o0.0.tbl \
    #       --output regulons.csv
    #
    #   pyscenic aucell scrna_expression.tsv regulons.csv \
    #       --output auc_matrix.csv
    print("Next: run pyscenic ctx (Step 2) and pyscenic aucell (Step 3)")
```
- `full_network.tsv` -- complete TF-target-importance table; columns: TF, target, importance
- `filtered_network.tsv` -- high-confidence subset after threshold or top-N filtering
- `consensus_network.tsv` -- aggregated network from multi-seed runs; columns: TF, target, mean_importance, n_seeds
- `adjacencies.tsv` -- arboreto output formatted for pySCENIC input (no header)
- `grn_network.png` -- directed graph visualization of top regulatory interactions

| Problem | Cause | Solution |
|---|---|---|
| `RuntimeError: freeze_support()` or infinite process spawning | Missing `if __name__ == '__main__':` guard | Wrap all arboreto calls in main guard; not needed in Jupyter |
| `MemoryError` during inference | Expression matrix too large for available RAM | Filter low-variance genes first, reduce to top 5-10k genes, or use distributed cluster |
| Empty or near-empty network | TF names do not match expression matrix column names | Verify TF names overlap with `expression_matrix.columns`; check case sensitivity |
| Very slow inference | Using GENIE3 on large dataset, or too few Dask workers | Switch to GRNBoost2; create explicit Dask client with more workers |
| `ModuleNotFoundError: No module named 'arboreto'` | Package not installed | `pip install arboreto` |
| All importance scores are 0 or identical | Expression matrix has no variance (e.g., all zeros after filtering) | Check data quality; ensure matrix contains non-zero expression values with variance |
| `TypeError` with NumPy array input | Missing `gene_names` parameter | Provide `gene_names=` list matching the column count of the array |
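The `MemoryError` fix above (reduce to the most variable genes) takes only a couple of lines. A minimal sketch, assuming `expression_matrix` is the observations x genes DataFrame loaded earlier; the 10,000-gene cutoff is an assumption to tune per dataset:

```python
# Sketch: keep the top 10,000 most variable genes to cut memory before
# inference. The cutoff is an assumption; tune it for your dataset and RAM.
top_genes = expression_matrix.var(axis=0).nlargest(10000).index
expression_matrix = expression_matrix[top_genes]
print(f"Reduced to {expression_matrix.shape[1]} genes")
```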
This entry is fully self-contained with no references/ directory.

Original reference file dispositions:
- Advanced regressor options (`regressor_type`, `regressor_kwargs`) omitted: advanced tuning rarely needed; users can pass scikit-learn kwargs directly per the arboreto API docs.

Narrative use-case dispositions (from original Common Use Cases):