From anysite-skills
Operates anysite CLI for web data extraction from LinkedIn/Instagram/Twitter, batch API processing, dataset pipelines with scheduling/transforms/exports, SQL queries, PostgreSQL/SQLite loading, and LLM data analysis (summarize/classify/enrich).
npx claudepluginhub anysiteio/agent-skills --plugin anysite-cli
This skill uses the workspace's default tool permissions.
Command-line tool for web data extraction, dataset pipelines, and database operations.
BEFORE planning any data collection task, follow this sequence:
Discover available endpoints
anysite describe --search "<keyword>" # Search by domain (linkedin, company, user, etc.)
Select endpoints needed for the task — identify which endpoints will provide the required data
Inspect each selected endpoint
anysite describe /api/linkedin/company # View input params and output fields
Only then plan — now you know the exact parameters, field names, and data structure to build your config or API calls
This prevents errors from wrong endpoint paths, missing required parameters, or incorrect field names in dependencies.
Use dataset pipelines for multi-step tasks
Use a dataset.yaml config instead of running multiple ad-hoc commands.
Save data in Parquet format by default, unless the user requests another format or CSV/JSON fits better.
Prefer datasets over ad-hoc scripts — one dataset.yaml replaces dozens of shell commands
Before any data collection task:
# 1. Check CLI is available and see latest changes
anysite --version
anysite changelog --last 1 --json # Check what's new in this version
# If not found: source .venv/bin/activate or pip install anysite-cli
# 2. Update schema cache (required for endpoint discovery)
anysite schema update
# 3. Verify API key
anysite config get api_key
# If not set: anysite config set api_key sk-xxxxx
After upgrading, run anysite changelog --since <old_version> --json to discover new features.
ALWAYS discover endpoints before writing API calls or dataset configs:
anysite describe # List all endpoints
anysite describe --search "company" # Search with dependency context
anysite describe /api/linkedin/company # Full details: params, output, connections
Search returns matched endpoints plus upstream providers (who can supply input IDs) and downstream consumers (who can use output IDs). Use this to plan endpoint chains for dataset pipelines.
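An endpoint chain planned this way can be written directly as dependent sources. A minimal sketch (endpoint paths and the urn.value field follow the examples elsewhere in this document; companies.txt is illustrative):

```yaml
sources:
  - id: companies                        # upstream provider: supplies company URNs
    endpoint: /api/linkedin/company
    from_file: companies.txt
    input_key: company
  - id: employees                        # downstream consumer: uses the URNs above
    endpoint: /api/linkedin/company/employees
    dependency:
      from_source: companies
      field: urn.value
    input_key: companies
```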
Input params show type, description, examples, and defaults. Array params show item structure:
Input parameters:
* urn string User URN, only fsd_profile urn type is allowed
example: "urn:li:fsd_profile:ACoAABXy1234"
count integer Number of posts to return
default: 20
companies array[object{type,value}] Company URNs
example: [{"type": "company", "value": "14064608"}]
pip install "anysite-cli[data]" # DuckDB + PyArrow for dataset commands
pip install "anysite-cli[llm]" # LLM analysis (openai/anthropic)
pip install "anysite-cli[postgres]" # PostgreSQL adapter
pip install "anysite-cli[clickhouse]" # ClickHouse adapter
anysite config set api_key sk-xxxxx # Configure API key
anysite schema update # Update schema cache
anysite llm setup # Interactive setup (human)
anysite llm setup --provider openai --api-key sk-xxx --no-test # Non-interactive (agent)
anysite llm setup --provider anthropic --api-key-env ANTHROPIC_API_KEY --no-test
anysite db add pg --type postgres --host localhost --database mydb --user app --password secret
# Or via env var: anysite db add pg ... --password-env PGPASS
anysite db add ch --type clickhouse --host ch.example.com --port 8443 --database analytics --user app --password secret --ssl
anysite auth login # Interactive browser-based OAuth2 (human)
anysite auth login --force --no-browser # Re-authenticate without confirmation (agent)
anysite auth status # Check current auth status
anysite auth status --json # Machine-readable auth status
anysite auth logout # Interactive logout (human)
anysite auth logout --force # Logout without confirmation (agent)
anysite api /api/linkedin/user user=satyanadella
anysite api /api/linkedin/company company=anthropic --format table
anysite api /api/linkedin/search/users keywords="CTO" count=50 --format csv --output ctos.csv
anysite api /api/linkedin/user user=satyanadella --fields "name,headline,urn.value" -q | jq
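The same dot-notation used by --fields also works for downstream jq processing. A minimal sketch with a hypothetical sample record (field names mirror the examples above; the record contents are invented):

```shell
# Hypothetical record mimicking the shape of /api/linkedin/user output
cat > sample.jsonl <<'EOF'
{"name": "Ada Lovelace", "headline": "Engineer", "urn": {"value": "urn:li:fsd_profile:AAA"}}
EOF

# Extract a top-level and a nested field, mirroring --fields "name,urn.value"
jq -r '[.name, .urn.value] | @csv' sample.jsonl
# → "Ada Lovelace","urn:li:fsd_profile:AAA"
```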
Parameters like location, current_companies, industry accept two formats:
# Single name (text search) — resolves to URNs automatically
location="London"
current_companies="Microsoft"
# Multiple URNs (direct) — use JSON array in single quotes
'location=["urn:li:geo:101165590", "urn:li:geo:101282230"]'
'current_companies=["urn:li:company:1035", "urn:li:company:1441"]'
Note: a list of names ["Microsoft", "Google"] is NOT supported; use either one name OR multiple URNs.
anysite api /api/linkedin/user --from-file users.txt --input-key user \
--parallel 5 --rate-limit "10/s" --on-error skip --progress --stats
For complex data collection with dependencies, LLM enrichment, scheduling — use dataset pipelines.
anysite dataset init my-dataset
# Creates my-dataset/dataset.yaml with template config
name: my-dataset # Dataset name (required)
description: Optional description # Human-readable description
sources:
# === TYPE 1: Independent source (single API call) ===
- id: search_results # Unique identifier (required)
endpoint: /api/linkedin/search/users # API endpoint (required for type: api)
input: # Static API parameters
keywords: "software engineer"
count: 50
parallel: 1 # Concurrent requests: 1-10 (default: 1)
rate_limit: "10/s" # Rate limit: "N/s", "N/m", "N/h"
on_error: stop # Error handling: stop | skip (default: stop)
- id: search_extra # Another search (can be combined with union)
endpoint: /api/linkedin/search/users
input: { keywords: "data engineer", count: 50 }
# === TYPE 2: from_file source (batch from file) ===
- id: companies
endpoint: /api/linkedin/company
from_file: companies.txt # Input file: .txt (line per value), .csv, .jsonl
file_field: company_slug # CSV column name (for CSV files only)
input_key: company # API parameter to fill with each value
parallel: 3
# === TYPE 3: Dependent source (values from parent) ===
- id: employees
endpoint: /api/linkedin/company/employees
dependency:
from_source: companies # Parent source ID (required)
field: urn.value # Dot-notation path to extract from parent records
match_by: name # Alternative: fuzzy match instead of exact field
dedupe: true # Remove duplicate values (default: false)
input_key: companies # API parameter for extracted values
input_template: # Transform values before API call
companies:
- type: company
value: "{value}" # {value} = extracted value placeholder
count: 5
refresh: auto # Incremental behavior: auto (default) | always | never
# never = skip if data exists
# === Shorthand for dependent sources ===
# ${source.field} auto-expands to dependency + input_key:
- id: profiles
endpoint: /api/linkedin/user
input:
user: ${companies.urn.value} # Equivalent to dependency + input_key above
# === TYPE 4: Union source (combine multiple sources) ===
- id: all_search_results
type: union # Source type: api (default) | union | llm | sql
sources: [search_results, search_extra] # Parent source IDs to combine (required)
dedupe_by: urn.value # Optional: field path for deduplication (dot-notation)
# NOTE: type: union cannot have endpoint, dependency, from_file, input_key, input
# NOTE: all sources in the list must have the same endpoint (same data structure)
# Records are annotated with _union_source = parent source ID
# === TYPE 5: LLM source (process parent data without API) ===
- id: employees_analyzed
type: llm # Source type: api (default) | union | llm | sql
dependency:
from_source: employees
field: name # Required by schema (not used for LLM sources)
llm: # LLM enrichment steps (required for type: llm)
- type: classify # Step types: classify | enrich | summarize | generate
categories: "developer,recruiter,executive" # Comma-separated (omit for auto-detect)
output_column: role_type # Output column name (default: category)
fields: [headline] # Record fields to include in LLM prompt
- type: enrich
add: # Field specs (required for enrich)
- "sentiment:positive/negative/neutral" # Enum: value1/value2/value3
- "language:string" # Types: string | number | integer | boolean
- "quality_score:1-10" # Range: min-max
fields: [headline, summary]
temperature: 0.0 # LLM temperature: 0.0-1.0 (default: 0.0)
provider: openai # Provider override: openai | anthropic
model: gpt-4o-mini # Model override
- type: summarize
max_length: 50 # Max words (default: 100)
output_column: bio
- type: generate
prompt: "Write pitch for {name}" # Template with {field} placeholders (required)
output_column: pitch
temperature: 0.7 # Higher for creative text
export: # Export destinations (runs after Parquet write)
- type: file
path: ./output/{{source}}-{{date}}.csv # Templates: {{date}}, {{source}}, {{dataset}}
format: csv # Format: json | jsonl | csv
# NOTE: type: llm cannot have endpoint, from_file, input_key, input
# === TYPE 6: SQL source (query a database) ===
# Mode A: Named connection (query external database)
- id: billing_users
type: sql
connection: billing # Named connection from 'anysite db add'
query: "SELECT name, email FROM subscriptions WHERE status = 'inactive'"
# query_file: queries/inactive.sql # Alternative: read SQL from file
filter: ".email != null" # All base fields work: filter, llm, transform, export, db_load
# Mode B: Dataset views (no connection — query Parquet via DuckDB)
# Each collected source becomes a view by its id (hyphens → underscores)
- id: enriched_profiles
type: sql
query: |
SELECT u.*, c.description as company_desc
FROM user_profiles u
LEFT JOIN company_profiles c ON u._input_value = c._input_value
# Use for cross-source JOINs within the pipeline — no external DB needed
# NOTE: type: sql cannot have endpoint, from_file, input_key, input, parallel, rate_limit, on_error
# === OPTIONAL BLOCKS (any source type) ===
# THREE-LEVEL FILTERING:
# Level 1: source.filter — before LLM + Parquet (saves tokens, drops records entirely)
# Level 2: transform.filter — before exports only (Parquet keeps all records)
# Level 3: db_load.filter — before DB loading (Parquet keeps all records)
# All use same syntax: .field op value, booleans (true/false), and/or
- id: profiles
endpoint: /api/linkedin/user
dependency: { from_source: employees, field: internal_id.value }
input_key: user
filter: '.follower_count > 100' # Level 1: early filter (before LLM + Parquet)
transform: # Level 2: export filter (Parquet keeps all)
filter: '.location != ""' # Safe expression
fields: # Field selection with dot-notation aliases
- name
- urn.value AS urn_id
- headline
add_columns: # Static columns to inject
batch: "q1-2026"
export:
- type: file
path: ./output/profiles.csv
format: csv
- type: webhook
url: https://example.com/hook
headers: { X-Token: abc }
db_load: # Database loading config
filter: '.active == true' # Level 3: DB filter (only active to DB)
table: people # Custom table name (default: source ID)
key: urn.value # Unique key for diff-based incremental sync
sync: full # Sync mode: full (default) | append (no DELETE)
fields: # Fields to load (default: all except _input_value)
- name
- urn.value AS urn_id
- headline
exclude: [_input_value, raw_html] # Fields to skip
storage:
format: parquet # Storage format (only parquet supported)
path: ./data/ # Storage directory (relative to dataset.yaml)
schedule:
cron: "0 9 * * *" # Cron expression for scheduling
notifications:
on_complete:
- url: "https://hooks.slack.com/xxx"
headers: { Authorization: "Bearer token" }
on_failure:
- url: "https://alerts.example.com/fail"
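Putting the pieces together, a working config can be much shorter than the full reference above. A minimal sketch using only documented keys (the keyword and filter threshold are illustrative):

```yaml
name: cto-pipeline
description: Search CTOs, then fetch full profiles
sources:
  - id: ctos
    endpoint: /api/linkedin/search/users
    input: { keywords: "CTO", count: 50 }
  - id: profiles
    endpoint: /api/linkedin/user
    input:
      user: ${ctos.urn.value}          # shorthand for dependency + input_key
    filter: '.follower_count > 100'    # Level 1 filter: before LLM + Parquet
storage:
  format: parquet
  path: ./data/
```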
The type: union source combines records from multiple parent sources:
Required: sources list (parent source IDs)
Optional: dedupe_by field path for removing duplicates (supports dot-notation)
Not allowed: endpoint, dependency, from_file, input_key, input
sources:
- id: search_cto
endpoint: /api/linkedin/search/users
input: { keywords: "CTO fintech", count: 50 }
- id: search_vp
endpoint: /api/linkedin/search/users
input: { keywords: "VP Engineering", count: 50 }
# Union combines all search results
- id: all_candidates
type: union
sources: [search_cto, search_vp]
dedupe_by: urn.value # Remove duplicates by URN
# Single dependent source processes all candidates
- id: profiles
endpoint: /api/linkedin/user
dependency:
from_source: all_candidates
field: urn.value
input_key: user
Records from union sources are annotated with _union_source (the parent source ID they came from).
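Because each collected source is queryable as a DuckDB view (see the SQL source type), this annotation can be used directly. A sketch assuming the union example above has been collected and that union sources are materialized as views like API sources:

```yaml
- id: source_breakdown
  type: sql
  query: |
    SELECT _union_source, COUNT(*) AS n
    FROM all_candidates
    GROUP BY _union_source
```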
The type: llm source processes existing parent data without making API calls:
Required: dependency (parent source) and a non-empty llm list
Not allowed: endpoint, from_file, input_key, input
# Collect only the LLM source (reads parent Parquet, applies LLM steps)
anysite dataset collect dataset.yaml --source employees_analyzed
anysite dataset validate dataset.yaml # Validate config (catches errors early)
anysite dataset collect dataset.yaml --dry-run # Preview plan
anysite dataset collect dataset.yaml # Run collection
anysite dataset collect dataset.yaml --load-db pg # Collect + auto-load to DB
anysite dataset collect dataset.yaml --incremental # Skip already-collected inputs
anysite dataset collect dataset.yaml --source employees # Single source + dependencies
anysite dataset collect dataset.yaml --no-llm # Skip LLM enrichment steps
anysite dataset collect dataset.yaml --limit 100 # Pilot run: max 100 inputs per source
For from_file and dependency sources, anysite tracks collected input values in metadata.json. This enables resuming collection without re-fetching existing data.
Workflow:
# First run — collects all inputs
anysite dataset collect dataset.yaml
# Later: add new items to input file, run with --incremental
anysite dataset collect dataset.yaml --incremental
# → Only new items are collected, existing ones skipped
# Force full re-collection
anysite dataset reset-cursor dataset.yaml
anysite dataset collect dataset.yaml
Per-source refresh option:
sources:
- id: profiles
refresh: auto # (default) respects --incremental
- id: posts
refresh: always # always re-collects, ignores --incremental
# use for time-sensitive data (feeds, activity)
- id: companies
refresh: never # skip if data exists (any snapshot)
Reset cursor:
anysite dataset reset-cursor dataset.yaml # all sources
anysite dataset reset-cursor dataset.yaml --source posts # specific source
anysite dataset query dataset.yaml --sql "SELECT * FROM companies LIMIT 10"
anysite dataset query dataset.yaml --source profiles --fields "name, urn.value AS id"
anysite dataset query dataset.yaml --source profiles --exclude "_input_value,_parent_source"
anysite dataset query dataset.yaml --interactive # SQL shell
anysite dataset stats dataset.yaml --source companies
anysite dataset profile dataset.yaml
anysite dataset load-db dataset.yaml -c pg --drop-existing # Full load
anysite dataset load-db dataset.yaml -c pg # Incremental sync (when db_load.key set)
anysite dataset load-db dataset.yaml -c pg --snapshot 2026-01-15
anysite dataset diff dataset.yaml --source profiles --key urn.value
anysite dataset diff dataset.yaml --source profiles --key urn.value --from 2026-01-30 --to 2026-02-01
anysite dataset schedule dataset.yaml --incremental --load-db pg # Generate cron entry
anysite dataset history my-dataset
anysite dataset logs my-dataset --run 42
anysite dataset reset-cursor dataset.yaml # Clear incremental state
# Connection management
anysite db add pg --type postgres --host localhost --database mydb --user app --password secret
anysite db add pg --type postgres --host localhost --database mydb --user app --password-env PGPASS
anysite db add ch --type clickhouse --host ch.example.com --port 8443 --database analytics --user app --password secret --ssl
anysite db add local --type sqlite --path ./data.db
anysite db add replica --type postgres --host replica.example.com --read-only
anysite db list
anysite db test pg
# Data operations
cat data.jsonl | anysite db insert pg --table users --stdin --auto-create
anysite db query pg --sql "SELECT * FROM users" --format table
anysite db query pg --sql "SELECT * FROM users" --format parquet --output users.parquet
anysite db query pg --sql "SELECT * FROM users" --format csv --output "reports/{{date}}/users.csv"
# Pipe API output to database
anysite api /api/linkedin/user user=satyanadella -q --format jsonl \
| anysite db insert pg --table profiles --stdin --auto-create
# Database discovery (schema introspection, sample data, LLM descriptions)
anysite db discover mydb # Discover schema
anysite db discover mydb --with-llm # Add LLM table/column descriptions
anysite db discover mydb --tables users,posts # Filter tables
anysite db discover mydb --exclude-tables _migrations
# View saved catalogs
anysite db catalog # List all catalogs
anysite db catalog mydb # Show full catalog
anysite db catalog mydb --table users # Show specific table
anysite db catalog mydb --json # JSON output for agents
Credentials: --password saves directly in ~/.anysite/connections.yaml, --password-env references an env var. Direct value takes priority. LLM API keys follow the same pattern via anysite llm setup.
Supported databases: SQLite, PostgreSQL, ClickHouse. ClickHouse uses clickhouse-connect driver (HTTP protocol, port 8123 default, 8443 for HTTPS/SSL).
Analyze collected dataset records with an LLM. Requires anysite llm setup first.
Non-interactive setup (for agents):
anysite llm setup --provider openai --api-key <key> --no-test
anysite llm setup --provider anthropic --api-key-env ANTHROPIC_API_KEY --model claude-sonnet-4-5-20250514 --no-test
anysite llm summarize dataset.yaml --source profiles --fields "name,headline" --format table
anysite llm classify dataset.yaml --source posts --categories "positive,negative,neutral"
anysite llm enrich dataset.yaml --source profiles \
--add "seniority:junior/mid/senior" --add "is_technical:boolean"
anysite llm generate dataset.yaml --source profiles \
--prompt "Write intro for {name} who works as {headline}" --temperature 0.7
anysite llm match dataset.yaml --source-a profiles --source-b companies --top-k 3
anysite llm deduplicate dataset.yaml --source profiles --key name --threshold 0.8
anysite llm cache-stats
anysite llm cache-clear
Use anysite describe --search <keyword> for more endpoints.
--format json|jsonl|csv|table|parquet (parquet requires --output)
keywords=a,b,c auto-wraps as ["a","b","c"]
--fields "name,headline,urn.value" (dot-notation for nested)
--on-error stop|skip|retry
Config precedence: ~/.anysite/config.yaml > defaults
For detailed option tables and advanced configuration: