Infers gene regulatory networks from bulk or single-cell RNA-seq data using GRNBoost2 (gradient boosting) or GENIE3 (Random Forest). Dask-parallelized; outputs TF-target links for pySCENIC.
```bash
npx claudepluginhub jaechang-hits/sciagent-skills --plugin sciagent-skills
```

This skill uses the workspace's default tool permissions.
Arboreto infers gene regulatory networks (GRNs) from gene expression data using parallelized tree-based regression. For each target gene, it trains a regression model with all other genes (or a specified TF list) as features and emits TF-target-importance triplets. It provides two interchangeable algorithms -- GRNBoost2 (gradient boosting, fast) and GENIE3 (Random Forest, classic) -- sharing identical input/output formats. Computation is Dask-parallelized, scaling from laptop cores to HPC clusters.
Dependencies: arboreto, pandas, numpy, dask, distributed, scikit-learn, scipy. Optional: networkx, matplotlib for visualization.

```bash
pip install arboreto distributed networkx matplotlib
```
Complete GRN inference in a single block. The `if __name__ == '__main__':` guard is required because Dask spawns worker processes via multiprocessing.
```python
import pandas as pd
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names

if __name__ == '__main__':
    # Load expression matrix (observations x genes)
    expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
    tf_names = load_tf_names('tf_list.txt')  # optional TF filter

    # Infer GRN (uses all local cores by default)
    network = grnboost2(expression_data=expression_matrix,
                        tf_names=tf_names, seed=777)

    # Filter top links and save
    top_network = network[network['importance'] > 1.0]
    top_network.to_csv('grn_output.tsv', sep='\t', index=False, header=False)
    print(f"Inferred {len(network)} links, kept {len(top_network)} above threshold")
    # Example: Inferred 185432 links, kept 12876 above threshold
```
Arboreto accepts a pandas DataFrame (recommended) or NumPy array. Rows are observations (cells or samples), columns are genes. Gene names must be column headers.
```python
import pandas as pd

# From TSV (genes as columns, observations as rows)
expression_matrix = pd.read_csv('expression_data.tsv', sep='\t')
print(f"Shape: {expression_matrix.shape} "
      f"({expression_matrix.shape[0]} observations x {expression_matrix.shape[1]} genes)")
# Example: Shape: (5000, 18654) (5000 observations x 18654 genes)

# From AnnData (e.g., after scanpy preprocessing)
import anndata as ad

adata = ad.read_h5ad('preprocessed.h5ad')
expression_matrix = pd.DataFrame(
    adata.X.toarray() if hasattr(adata.X, 'toarray') else adata.X,
    columns=adata.var_names.tolist()
)
print(f"Converted AnnData: {expression_matrix.shape}")
# Example: Converted AnnData: (5000, 18654)
```
Providing a TF list restricts regulators to known transcription factors, reducing computation time and improving biological relevance. If omitted, all genes are treated as potential regulators.
```python
from arboreto.utils import load_tf_names

# From file (one TF name per line)
tf_names = load_tf_names('human_tfs.txt')
print(f"Loaded {len(tf_names)} transcription factors")
# Example: Loaded 1639 transcription factors

# Or define directly
tf_names = ['MYC', 'TP53', 'SOX2', 'NANOG', 'POU5F1']

# Verify TFs exist in expression matrix columns
tf_in_data = [tf for tf in tf_names if tf in expression_matrix.columns]
print(f"TFs found in expression data: {len(tf_in_data)}/{len(tf_names)}")
# Example: TFs found in expression data: 1583/1639
```
By default arboreto creates an internal Dask client using all local cores. Create an explicit client for resource control, monitoring, or cluster deployment.
```python
from distributed import LocalCluster, Client

# Custom local client with resource limits
local_cluster = LocalCluster(
    n_workers=8,
    threads_per_worker=1,  # avoid GIL contention in scikit-learn
    memory_limit='4GB'     # per worker
)
client = Client(local_cluster)
print(f"Dashboard: {client.dashboard_link}")
# Example: Dashboard: http://127.0.0.1:8787/status
```
Call `grnboost2()` (recommended) or `genie3()`. Both share the same signature and output format.
```python
from arboreto.algo import grnboost2

if __name__ == '__main__':
    network = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names,         # 'all' if no TF list
        client_or_address=client,  # omit to use default local scheduler
        seed=777,                  # for reproducibility
        verbose=True               # print progress
    )
    print(f"Inferred {len(network)} regulatory links")
    print(network.head())
    # Example:
    #      TF  target  importance
    # 0   MYC    CDK4       3.214
    # 1   MYC   CCND1       2.871
    # 2  TP53  CDKN1A       2.654
```
Raw output contains links for every TF-target pair with non-zero importance. Filter to retain high-confidence regulatory interactions.
```python
# Strategy 1: Importance threshold
threshold = 1.0
filtered = network[network['importance'] > threshold]
print(f"Threshold {threshold}: {len(filtered)} links "
      f"({len(filtered)/len(network)*100:.1f}% retained)")
# Example: Threshold 1.0: 12876 links (6.9% retained)

# Strategy 2: Top N links per target gene
top_n = 10
top_per_target = (network.sort_values('importance', ascending=False)
                         .groupby('target')
                         .head(top_n)
                         .reset_index(drop=True))
print(f"Top {top_n} per target: {len(top_per_target)} links")
# Example: Top 10 per target: 186540 links
```
Save as a TSV file with three columns: TF, target, importance.
```python
# Save full network
network.to_csv('full_network.tsv', sep='\t', index=False)
print(f"Saved full network: {len(network)} links")

# Save filtered network (without header for pySCENIC compatibility)
filtered.to_csv('filtered_network.tsv', sep='\t', index=False, header=False)
print(f"Saved filtered network: {len(filtered)} links")

# Clean up Dask client if explicitly created
client.close()
local_cluster.close()
```
Plot the top regulatory interactions as a directed network graph.
```python
import networkx as nx
import matplotlib as mpl
import matplotlib.pyplot as plt

# Build directed graph from top links
top_links = network.nlargest(50, 'importance')
G = nx.from_pandas_edgelist(
    top_links, source='TF', target='target',
    edge_attr='importance', create_using=nx.DiGraph()
)

# Draw network
fig, ax = plt.subplots(figsize=(12, 10))
pos = nx.spring_layout(G, k=2, seed=42)
weights = [G[u][v]['importance'] for u, v in G.edges()]
nx.draw_networkx_nodes(G, pos, node_size=300, node_color='lightblue', ax=ax)
nx.draw_networkx_labels(G, pos, font_size=7, ax=ax)
nx.draw_networkx_edges(G, pos, edge_color=weights, edge_cmap=plt.cm.Reds,
                       width=1.5, arrows=True, ax=ax)

# With arrows=True, draw_networkx_edges returns FancyArrowPatch objects,
# which cannot feed a colorbar directly; build a ScalarMappable instead.
sm = mpl.cm.ScalarMappable(cmap=plt.cm.Reds,
                           norm=mpl.colors.Normalize(vmin=min(weights),
                                                     vmax=max(weights)))
fig.colorbar(sm, ax=ax, label='Importance')
ax.set_title('Top 50 Regulatory Interactions')
plt.tight_layout()
plt.savefig('grn_network.png', dpi=300, bbox_inches='tight')
print("Saved grn_network.png")
```
| Parameter | Default | Range / Options | Effect |
|---|---|---|---|
| `expression_data` | (required) | DataFrame or ndarray | Expression matrix, observations x genes |
| `tf_names` | `'all'` | list of str or `'all'` | Restrict regulators to known TFs; `'all'` uses every gene |
| `gene_names` | `None` | list of str | Required when `expression_data` is a NumPy array |
| `client_or_address` | `'local'` | `Client`, str, or `'local'` | Dask client instance or scheduler address |
| `seed` | `None` | int | Random seed for reproducibility; always set for replicable results |
| `verbose` | `False` | bool | Print progress messages during inference |
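One pitfall from the table worth illustrating is ndarray input, which requires `gene_names`. A minimal sketch follows; the `counts.npy` file and the placeholder gene list are illustrative assumptions, not files shipped with arboreto:

```python
# Sketch: ndarray input requires gene_names matching the column count.
# 'counts.npy' and the GENE_i names are hypothetical placeholders.
import numpy as np
from arboreto.algo import grnboost2

if __name__ == '__main__':
    matrix = np.load('counts.npy')  # observations x genes
    gene_names = [f'GENE_{i}' for i in range(matrix.shape[1])]
    network = grnboost2(expression_data=matrix,
                        gene_names=gene_names,  # required for ndarray input
                        seed=777)
    print(network.head())
```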
Both algorithms follow the same multiple-regression strategy (train one model per target gene, extract feature importances) but differ in the underlying regressor.
| Feature | GRNBoost2 | GENIE3 |
|---|---|---|
| Method | Stochastic gradient boosting with early stopping | Random Forest (or ExtraTrees) |
| Speed | Fast -- optimized for large datasets | Slower -- higher per-model cost |
| Memory | Lower (early stopping limits tree depth) | Higher (full forests per target) |
| Best for | Default choice; 10k+ observations, single-cell scale | Comparison with published GENIE3 results; validation |
| Import | `from arboreto.algo import grnboost2` | `from arboreto.algo import genie3` |
| Output format | TF, target, importance (identical) | TF, target, importance (identical) |
Decision rule: Start with GRNBoost2. Use GENIE3 only when reproducing published GENIE3 analyses or as an independent validation of GRNBoost2 results.
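To make the shared strategy concrete, here is a deliberately naive single-process sketch of the per-target regression loop. It is illustrative only: arboreto's actual implementation adds Dask parallelism, tuned regressor settings, early stopping, and importance post-processing, and `naive_grn` is a hypothetical helper name:

```python
# Conceptual sketch (NOT arboreto's implementation): one regression model
# per target gene, feature importances become TF-target link scores.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor  # GENIE3 would use RandomForestRegressor

def naive_grn(expression_matrix: pd.DataFrame, tf_names: list) -> pd.DataFrame:
    links = []
    for target in expression_matrix.columns:
        # Candidate regulators: the TF list, minus the target itself
        features = [tf for tf in tf_names if tf != target]
        model = GradientBoostingRegressor(n_estimators=100, random_state=0)
        model.fit(expression_matrix[features], expression_matrix[target])
        # Each nonzero feature importance is a candidate regulatory link
        for tf, imp in zip(features, model.feature_importances_):
            if imp > 0:
                links.append((tf, target, imp))
    return pd.DataFrame(links, columns=['TF', 'target', 'importance'])
```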
The network DataFrame has three columns, sorted by descending importance:
| Column | Type | Description |
|---|---|---|
| `TF` | str | Transcription factor (regulator) gene name |
| `target` | str | Target gene name |
| `importance` | float | Regulatory importance score (higher = stronger predicted regulation) |
Importance scores are not p-values. They represent feature importance from the tree-based model. Filter by absolute threshold or top-N per target for downstream analysis.
`if __name__ == '__main__':` requirement: Dask uses Python's multiprocessing module to spawn worker processes. Without the main guard, each spawned process re-executes the script, leading to infinite process spawning. Always wrap arboreto calls in `if __name__ == '__main__':` when running as a script; the guard is not needed inside Jupyter notebooks.
When to use: datasets too large for a single machine (e.g., 50k+ cells, 20k+ genes). Requires a Dask distributed scheduler running on the cluster.
```bash
# On head node: start scheduler
dask-scheduler
# Output: Scheduler at tcp://10.118.224.134:8786

# On each compute node: start workers
dask-worker tcp://10.118.224.134:8786 --nprocs 4 --nthreads 1 --memory-limit 16GB
```
```python
from distributed import Client
from arboreto.algo import grnboost2
import pandas as pd

if __name__ == '__main__':
    client = Client('tcp://10.118.224.134:8786')
    print(f"Connected to cluster: {len(client.scheduler_info()['workers'])} workers")
    # Example: Connected to cluster: 16 workers

    expression_data = pd.read_csv('large_scrna.tsv', sep='\t')
    tf_names = pd.read_csv('tf_list.txt', header=None)[0].tolist()

    network = grnboost2(
        expression_data=expression_data,
        tf_names=tf_names,
        client_or_address=client,
        seed=42, verbose=True
    )
    network.to_csv('cluster_grn.tsv', sep='\t', index=False)
    print(f"Cluster inference complete: {len(network)} links")
    client.close()
```
When to use: improve robustness by aggregating networks from multiple random seeds. Links appearing consistently across runs are more reliable.
```python
import pandas as pd
from distributed import LocalCluster, Client
from arboreto.algo import grnboost2

if __name__ == '__main__':
    client = Client(LocalCluster(n_workers=8, threads_per_worker=1))
    expression_data = pd.read_csv('expression_data.tsv', sep='\t')
    tf_names = pd.read_csv('tf_list.txt', header=None)[0].tolist()

    seeds = [42, 123, 456, 789, 1001]
    all_networks = []
    for seed in seeds:
        net = grnboost2(expression_data=expression_data,
                        tf_names=tf_names,
                        client_or_address=client, seed=seed)
        net['seed'] = seed
        all_networks.append(net)
        print(f"Seed {seed}: {len(net)} links")

    combined = pd.concat(all_networks, ignore_index=True)

    # Consensus: mean importance across seeds, keep links found in 3+ runs
    consensus = (combined.groupby(['TF', 'target'])
                         .agg(mean_importance=('importance', 'mean'),
                              n_seeds=('seed', 'nunique'))
                         .reset_index())
    consensus = consensus[consensus['n_seeds'] >= 3]
    consensus = consensus.sort_values('mean_importance', ascending=False)
    consensus.to_csv('consensus_network.tsv', sep='\t', index=False)
    print(f"Consensus network: {len(consensus)} links (found in 3+ of {len(seeds)} runs)")
    # Example: Consensus network: 28453 links (found in 3+ of 5 runs)
    client.close()
```
When to use: downstream regulatory analysis after arboreto GRN inference. Arboreto produces the adjacency matrix (Step 1 of SCENIC); pySCENIC performs regulon identification (Step 2) and activity scoring (Step 3).
```python
from arboreto.algo import grnboost2
from arboreto.utils import load_tf_names
import pandas as pd

if __name__ == '__main__':
    # Step 1 (arboreto): GRN inference
    expression_matrix = pd.read_csv('scrna_expression.tsv', sep='\t')
    tf_names = load_tf_names('allTFs_hg38.txt')
    adjacencies = grnboost2(
        expression_data=expression_matrix,
        tf_names=tf_names, seed=777
    )
    adjacencies.to_csv('adjacencies.tsv', sep='\t', index=False, header=False)
    print(f"SCENIC Step 1 complete: {len(adjacencies)} adjacencies")
    # Example: SCENIC Step 1 complete: 234567 adjacencies

    # Steps 2-3 use pySCENIC CLI (requires pyscenic installation):
    #   pyscenic ctx adjacencies.tsv hg38_ranking.feather \
    #       --annotations_fname motifs-v10nr.hgnc-m0.001-o0.0.tbl \
    #       --output regulons.csv
    #
    #   pyscenic aucell scrna_expression.tsv regulons.csv \
    #       --output auc_matrix.csv
    print("Next: run pyscenic ctx (Step 2) and pyscenic aucell (Step 3)")
```
- `full_network.tsv` -- complete TF-target-importance table; columns: TF, target, importance
- `filtered_network.tsv` -- high-confidence subset after threshold or top-N filtering
- `consensus_network.tsv` -- aggregated network from multi-seed runs; columns: TF, target, mean_importance, n_seeds
- `adjacencies.tsv` -- arboreto output formatted for pySCENIC input (no header)
- `grn_network.png` -- directed graph visualization of top regulatory interactions

| Problem | Cause | Solution |
|---|---|---|
| `RuntimeError: freeze_support()` or infinite process spawning | Missing `if __name__ == '__main__':` guard | Wrap all arboreto calls in main guard; not needed in Jupyter |
| `MemoryError` during inference | Expression matrix too large for available RAM | Filter low-variance genes first, reduce to top 5-10k genes, or use distributed cluster |
| Empty or near-empty network | TF names do not match expression matrix column names | Verify TF names overlap with `expression_matrix.columns`; check case sensitivity |
| Very slow inference | Using GENIE3 on large dataset, or too few Dask workers | Switch to GRNBoost2; create explicit Dask client with more workers |
| `ModuleNotFoundError: No module named 'arboreto'` | Package not installed | `pip install arboreto` |
| All importance scores are 0 or identical | Expression matrix has no variance (e.g., all zeros after filtering) | Check data quality; ensure matrix contains non-zero expression values with variance |
| `TypeError` with NumPy array input | Missing `gene_names` parameter | Provide `gene_names=` list matching the column count of the array |
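The `MemoryError` fix above (reduce to the most variable genes) takes only a couple of lines. A minimal sketch, assuming `expression_matrix` is the observations x genes DataFrame loaded earlier; the 10,000-gene cutoff is an assumption to tune per dataset:

```python
# Sketch: keep the top 10,000 most variable genes to cut memory before
# inference. The cutoff is an assumption; tune it for your dataset and RAM.
top_genes = expression_matrix.var(axis=0).nlargest(10000).index
expression_matrix = expression_matrix[top_genes]
print(f"Reduced to {expression_matrix.shape[1]} genes")
```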
This entry is fully self-contained with no references/ directory.

Original reference file dispositions:
- Advanced regressor options (`regressor_type`, `regressor_kwargs`) omitted: advanced tuning rarely needed; users can pass scikit-learn kwargs directly per the arboreto API docs.

Narrative use-case dispositions (from original Common Use Cases):