Queries OpenData API datasets via REST for row fetching, filtering, sorting, aggregation, column inspection, and metadata retrieval. Use for data research, analysis, and pipelines on Parquet files.
```bash
npx claudepluginhub tryopendata/skills --plugin opendata
```
Query datasets stored as Parquet files through a REST API backed by DuckDB. The API returns JSON by default, with support for CSV, TSV, and XLSX exports.
Base URL: https://api.tryopendata.ai (production) or http://localhost:8000 (local dev). Default to production.
All endpoints require authentication in production. Pass an API key via the `Authorization: Bearer` header:

```bash
curl -H "Authorization: Bearer od_live_..." \
  "https://api.tryopendata.ai/v1/datasets/fred/cpi?limit=5"
```
Local dev (localhost:8000) does not require auth when running the standalone opendata server (make quickstart). The backend server (make dev-all) requires auth for write endpoints but allows unauthenticated reads.
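For example, a read against the local standalone server needs no header at all (a minimal sketch, assuming the server is up on the default port):

```bash
# Local standalone server (make quickstart): no auth header required
curl "http://localhost:8000/v1/datasets/fred/cpi?limit=5"
```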
The dataset path IS the query endpoint. Just GET the dataset path with query params:
```bash
# Get the 5 most recent CPI values
curl -H "Authorization: Bearer od_live_..." \
  "https://api.tryopendata.ai/v1/datasets/fred/cpi?limit=5&sort=-date"
#                            ^^^^^^^^^^^^^^^^^^^^^
# This path returns data directly.

# Filter, sort, aggregate - all via query params on the same path
curl -H "Authorization: Bearer od_live_..." \
  'https://api.tryopendata.ai/v1/datasets/owid/gdp?filter[year][gte]=2020&sort=-gdp_per_capita&limit=10'
```
Do NOT append /query to GET requests. GET /v1/datasets/fred/cpi/query will fail with a SUBDATASET_NOT_FOUND error because the API interprets query as a subdataset name. The POST /query endpoint is a separate SQL interface (see below).
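For illustration, the failure mode looks like this (the error code and suggestions field are documented behavior; the exact response envelope may vary):

```bash
# WRONG: the API parses "query" as a subdataset name
curl -H "Authorization: Bearer od_live_..." \
  "https://api.tryopendata.ai/v1/datasets/fred/cpi/query"
# -> fails with code SUBDATASET_NOT_FOUND; the response's suggestions
#    field links to the available subdatasets
```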
All data endpoints live under /v1/datasets/.
| Method | Path | Description |
|---|---|---|
| GET | /v1/datasets/{provider}/{dataset} | Query dataset rows (flat) or list subdatasets (hierarchical) |
| GET | /v1/datasets/{provider}/{dataset}/{subdataset} | Query subdataset rows |
| GET | /v1/datasets/{provider}/{dataset}/columns | Column metadata and statistics |
| GET | /v1/datasets/{provider}/{dataset}/columns/{name} | Single column detail with full value list |
| GET | /v1/datasets/{provider}/{dataset}/meta | Dataset metadata (schema, views, capabilities) |
| GET | /v1/datasets/{provider}/{dataset}/views | List available views |
| POST | /v1/datasets/{provider}/{dataset}/query | Execute SQL query (authenticated) |
| GET | /v1/discover | Search datasets with enriched metadata for LLM agents |
| POST | /v1/discover/batch | Batch discover across multiple queries with deduplication |
| POST | /v1/query | Cross-dataset SQL query (join multiple datasets) |
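As a sketch of the cross-dataset endpoint, assuming its request body matches the per-dataset SQL endpoint below ({"sql": ...}) and that datasets are addressable as quoted "provider/dataset" tables; the join columns here are illustrative:

```bash
# Hypothetical join across two datasets via POST /v1/query
curl -X POST -H "Authorization: Bearer od_live_..." \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT g.year, g.gdp_per_capita FROM \"owid/gdp\" g JOIN \"fred/cpi\" c ON g.year = c.year LIMIT 10"}' \
  "https://api.tryopendata.ai/v1/query"
```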
Some datasets contain multiple tables (e.g., multi-sheet Excel workbooks, BLS series groups). For these:
- GET /v1/datasets/{provider}/{dataset} returns data for the default subdataset, or lists available subdatasets
- GET /v1/datasets/{provider}/{dataset}/{subdataset} queries a specific subdataset

If you get a SUBDATASET_NOT_FOUND error, the dataset likely has subdatasets. Check the error response's suggestions field - it includes a link to browse available subdatasets. Any unrecognized path segment after the dataset slug is interpreted as a subdataset name, which is why paths like /query or /search appended to a dataset path produce this error.
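A typical flow, using illustrative names (bls/employment and national are hypothetical):

```bash
# List subdatasets of a hierarchical dataset (names are illustrative)
curl -H "Authorization: Bearer od_live_..." \
  "https://api.tryopendata.ai/v1/datasets/bls/employment"

# Query one subdataset directly
curl -H "Authorization: Bearer od_live_..." \
  "https://api.tryopendata.ai/v1/datasets/bls/employment/national?limit=10"
```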
| Parameter | Example | Description | Reference |
|---|---|---|---|
| filter[col] | filter[year]=2024 | Filter rows by column value | filtering.md |
| filter[col][op] | filter[year][gte]=2020 | Filter with operator | filtering.md |
| sort | sort=-year | Sort by column (prefix - for desc) | pagination-and-sort.md |
| limit | limit=50 | Max rows to return (1-1000, default 100) | pagination-and-sort.md |
| offset | offset=100 | Skip N rows | pagination-and-sort.md |
| cursor | cursor=... | Keyset pagination token | pagination-and-sort.md |
| fields | fields=year,score | Column projection | output-formats.md |
| format | format=csv | Output format (json, csv, tsv, xlsx) | output-formats.md |
| aggregate | aggregate=avg(score) | Aggregate functions | aggregation.md |
| group_by | group_by=year | Group rows by column | aggregation.md |
| view | view=timeseries | Apply a named view | |
| expand | expand=area | Expand joined dimensions inline | |
| include_sources | include_sources=true | Show _source_url, _source_page columns | |
| response_format | response_format=columnar | Response shape: objects (default) or columnar (compact) | output-formats.md |
| debug | debug=true | Include generated SQL and query echo | |
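These parameters compose on a single GET; for example, a filtered, projected, sorted CSV export:

```bash
# Filter + projection + sort + CSV export in one request
curl -H "Authorization: Bearer od_live_..." \
  'https://api.tryopendata.ai/v1/datasets/owid/gdp?filter[year][gte]=2020&fields=year,gdp_per_capita&sort=-gdp_per_capita&limit=50&format=csv'
```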
Use filter[col]=val, not ?col=val. Bare column names as query params are not applied as filters: the API returns a structured warning, but you still get unfiltered data back.
```bash
# Wrong - returns ALL rows, with a warning
curl '.../nces/naep?year=2024'

# Right
curl '.../nces/naep?filter[year]=2024'
```
URL-encode brackets in curl. Some shells interpret [ and ]. Use %5B / %5D or quote the URL.
```bash
curl 'https://api.tryopendata.ai/v1/datasets/nces/naep?filter%5Byear%5D=2024'
```
Check warnings in the response. Unknown parameters produce structured QueryWarning objects with code, message, and param. The X-OpenData-Warnings HTTP header also carries these for piped workflows.
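For example, to surface the documented header in a shell pipeline:

```bash
# -i includes response headers; grep surfaces any warning header
curl -si -H "Authorization: Bearer od_live_..." \
  'https://api.tryopendata.ai/v1/datasets/nces/naep?year=2024' \
  | grep -i '^x-opendata-warnings'
```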
Use ?debug=true to see generated SQL. Returns a debug object with debug.query (echo of your parameters) and debug.sql (the DuckDB SQL that ran). Useful for verifying filters and sorts are applied correctly.
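For example, to pull just the generated SQL with jq:

```bash
# Echoes the DuckDB SQL the API generated for this query
curl -s -H "Authorization: Bearer od_live_..." \
  'https://api.tryopendata.ai/v1/datasets/owid/gdp?filter[year][gte]=2020&sort=-gdp_per_capita&debug=true' \
  | jq '.debug.sql'
```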
aggregate and nest_fields are mutually exclusive. You get a 400 error if you combine them. Aggregation produces flat summary rows; nesting produces grouped hierarchical data.
SQL endpoint may return 5xx errors. The POST /query endpoint can fail with server-side errors. When this happens, fall back to REST aggregation (aggregate + group_by params) for the same analysis. The REST endpoint is more reliable for straightforward aggregations.
Sorting on computed aggregation columns works. When using aggregate + group_by, you can sort on the computed column names (e.g., sort=-count_event_id for aggregate=count(event_id)). Invalid sort fields return a 400 with valid_values showing available options.
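Putting aggregation, grouping, and computed-column sorting together (the avg_gdp_per_capita name assumes the same naming pattern as count_event_id above):

```bash
# Average GDP per capita by year, sorted by the computed average, descending
curl -H "Authorization: Bearer od_live_..." \
  'https://api.tryopendata.ai/v1/datasets/owid/gdp?aggregate=avg(gdp_per_capita)&group_by=year&sort=-avg_gdp_per_capita'
```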
The POST /v1/datasets/{provider}/{dataset}/query endpoint accepts raw SQL and executes it against the dataset. Requires authentication (API key or session). The dataset table is available as data or "provider/dataset". SQL is validated against an allowlist (SELECT only, no DDL/DML/IO) and runs with resource limits (5s timeout, 10k rows, 512MB memory).
Parameterized queries: Use ? placeholders with a params array to avoid string quoting issues:
```json
{
  "sql": "SELECT * FROM data WHERE country IN (?, ?) AND year >= ?",
  "params": ["United States", "Japan", 2020]
}
```
This eliminates the triple-nested escaping problem (SQL quotes inside JSON inside shell). See sql-query.md for details.
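A full request might look like this (owid/gdp stands in; the country and year columns are assumptions about that dataset):

```bash
# Parameterized SQL against a single dataset's /query endpoint
curl -X POST -H "Authorization: Bearer od_live_..." \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT * FROM data WHERE country IN (?, ?) AND year >= ?", "params": ["United States", "Japan", 2020]}' \
  "https://api.tryopendata.ai/v1/datasets/owid/gdp/query"
```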
The GET /v1/discover endpoint returns datasets matching a natural language query, enriched with metadata tailored for LLM agents and programmatic integrations. Results include column schemas (with units, value ranges, display names), available views, canonical questions, methodology summaries, sample rows, and relevance scores. Unlike /v1/search, discover is authenticated and optimized for machine consumption rather than human browsing.
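A sketch of a discover call (the q parameter name is an assumption; check discover.md for the actual contract):

```bash
# Natural-language dataset discovery; "q" is an assumed parameter name
curl -H "Authorization: Bearer od_live_..." \
  "https://api.tryopendata.ai/v1/discover?q=us+inflation+by+year"
```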
Batch discover: POST /v1/discover/batch accepts multiple queries in one call, deduplicates results, and returns per-query dataset references alongside the full metadata. See discover.md for details.
| File | When to load |
|---|---|
| references/filtering.md | Writing filter expressions, checking operator syntax |
| references/aggregation.md | Using group_by, aggregate functions, summary queries |
| references/pagination-and-sort.md | Paginating large results, sorting, cursor-based pagination |
| references/column-introspection.md | Discovering schema, column types, value distributions |
| references/output-formats.md | Exporting CSV/TSV/XLSX, field projection, system columns |
| references/common-patterns.md | Recipes for exploratory analysis and data research |
| references/sql-query.md | Raw SQL query endpoint, allowed functions, security model |
| references/discover.md | Using the discover endpoint, LLM agent integration, dataset discovery |