From trousse
BigQuery data analysis — BEFORE writing any BQ query, load this for methodology and dialect reference. Provides 5-stage workflow (discover → understand → analyse → validate → present) mapped to MCP tools (execute_sql, forecast, analyze_contribution, catalog search). Triggers on 'analyse this data', 'explore the dataset', 'what tables do we have', 'build a dashboard', 'query BigQuery', 'why did this metric change'. (user)
npx claudepluginhub spm1001/batterie-de-savoir --plugin trousseThis skill is limited to using the following tools:
<!-- Sources: Google BQ Data Analytics extension (Apache-2.0), Anthropic knowledge-work-plugins (Apache-2.0) -->
Answers data questions via SQL on connected warehouses: quick metrics, trend investigations, segment comparisons, formal reports. Uses manual input if no connection.
Analyzes BigQuery data for contributions to metric changes, generates insights from queries, and performs time-series forecasting. Useful for explaining data shifts or future projections.
Design and spec an analytical dashboard — define the question each chart answers, write the SQL queries, spec the layout and refresh cadence. Produces a complete dashboard spec ready to implement. Use when asked to "build a dashboard", "analytics dashboard", "BI dashboard", "weekly product health", or "visualize this data".
Share bugs, ideas, or general feedback.
Systematic methodology for exploring, querying, validating, and visualizing BigQuery data using MCP tools.
For Google Sheets under ~5K rows, use /consomme-sheets <url> to analyse directly in-context — no BigQuery access needed. The same profiling methodology applies (shape detection, quality assessment, distributions) but execution happens over the CSV data rather than via SQL. Requires the mise MCP server for Sheet fetching.
Every analysis follows five stages. Use the MCP tools mapped to each stage:
DISCOVER → UNDERSTAND → ANALYZE → VALIDATE → PRESENT
| Stage | Purpose | MCP Tools |
|---|---|---|
| Discover | Find relevant tables and datasets | search_catalog, list_dataset_ids, list_table_ids |
| Understand | Learn schema, shape, and quality | get_dataset_info, get_table_info, execute_sql |
| Analyze | Query, aggregate, model, test significance | execute_sql, forecast, analyze_contribution |
| Validate | Check results before sharing | execute_sql (cross-checks, spot-checks) |
| Present | Visualize and communicate | Results from above stages → charts, dashboards, tables |
Start at Discover unless the user names specific tables. Never skip Understand — always profile before analyzing.
When comparing groups or segments in the Analyze stage, check whether observed differences are statistically meaningful — see references/statistical-analysis.md for confidence intervals, significance tests, and minimum sample sizes. This is especially important for survey data where sample sizes per segment may be small.
search_catalog — Semantic search across tables, views, models, routines, and connections. The prompt parameter is a natural language description, not just keywords — "tables about advertiser campaign performance" works better than "advertiser campaign". Use first when the user describes data conceptually rather than naming specific tables.
list_dataset_ids — List all datasets in the project. Use to orient when entering an unfamiliar project.
list_table_ids — List all tables in a specific dataset. Use after identifying a relevant dataset to see what's available.
get_dataset_info — Get dataset-level metadata (description, labels, location, default expiration). Use to understand dataset purpose and organization.
get_table_info — Get table metadata including schema (column names and types), partitioning configuration, clustering fields, and row count. Use before writing any query against a table.
execute_sql — The WorkhorseExecute SQL statements against BigQuery. Used for profiling, analysis, and validation.
Cost awareness: BigQuery bills per byte scanned. Before running expensive queries on large tables:
dry_run: true to preview query cost without executingSELECT * — select only the columns you needAPPROX_COUNT_DISTINCT() instead of COUNT(DISTINCT) on large tablesResult set management: Results come back as one JSON object per row. To avoid overwhelming the context:
LIMIT for exploratory queries (LIMIT 20–50)GROUP BY to reduce rows, not LIMIT on raw dataLIMIT 10 explicitlyforecast — Time Series ForecastingForecast future values based on historical patterns.
Required parameters:
history_data — a SQL query (not just a table name) returning the time seriestimestamp_col — name of the date/timestamp column in the query resultsdata_col — name of the numeric metric column to forecasthorizon — number of future periods to predict (default: 10)Optional: id_cols — array of column names for parallel series (e.g., forecast per region)
Prerequisites — check before using:
Example: forecast daily revenue for the next 30 days
history_data: "SELECT event_date, SUM(revenue) AS daily_revenue
FROM sales WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 365 DAY)
AND event_date < CURRENT_DATE()
GROUP BY event_date ORDER BY event_date"
timestamp_col: "event_date"
data_col: "daily_revenue"
horizon: 30
See references/profiling-timeseries.md for preparing data for forecasting.
analyze_contribution — What's Driving the Change?Analyze which dimensions contribute most to a change in a key metric between two groups.
When to use: User asks "why did X change?", "what's driving the increase?", or "how do these groups differ?"
Required parameters:
input_data — a SQL query returning both populations, with columns for dimensions, the metric, and a boolean test/control columncontribution_metric — an expression in one of these forms:
SUM(metric_column) for summable metricsSUM(numerator)/SUM(denominator) for ratio metricsSUM(metric)/COUNT(DISTINCT category) for per-category metricsis_test_col — name of the boolean column that splits test vs controlOptional: dimension_id_cols — array of dimension columns to analyze as drivers
Example: why is satisfaction lower in Region A vs Region B?
input_data: "SELECT region, age_group, product_line, satisfaction_score,
region = 'A' AS is_test
FROM survey_results
WHERE region IN ('A', 'B')"
contribution_metric: "SUM(satisfaction_score)"
is_test_col: "is_test"
dimension_id_cols: ["age_group", "product_line"]
The tool returns ranked contributors — dimensions (or combinations) that explain the difference, with magnitude and support metrics.
Survey data note: For binary 0/1 response columns (e.g., brand awareness), use SUM(column) as the metric — this gives the count of positive responses, which the tool can decompose by dimension. Cast dimension columns to STRING for clean output.
gemini auth login. Set BIGQUERY_PROJECT environment variable to your GCP project ID.bq-toolbox) configured with Application Default Credentials or service account credentials. The project is set in the MCP server configuration.roles/bigquery.user for read access. Additionally roles/bigquery.dataEditor for creating tables/views.Before profiling, identify the data shape to pick the right methodology. Look at table count, column count, and naming patterns — then confirm with the user.
| Signal | Likely Shape | Profiling Reference |
|---|---|---|
| Multiple tables, ID columns, FK relationships | Warehouse — normalised star/snowflake schema | references/profiling-warehouse.md |
| Single wide table, many columns, column names are questions or codes | Survey — questionnaire responses from Sheets | references/profiling-survey.md |
| Date/timestamp column drives the analysis, questions about trends or forecasts | Time series — events or periodic metrics | references/profiling-timeseries.md |
Ask the user: "This looks like [survey/warehouse/time series] data — is that right?" Don't assume.
Mixed shapes are common. A warehouse might have a survey responses fact table, or time series data in a star schema. Use the reference that matches the analytical question, not just the table structure.
BigQuery-specific functions (date/time, string, arrays/structs), performance tips, and common analytical patterns (window functions, CTEs, cohort retention, funnels, deduplication) are in references/sql-reference.md.
Read that reference when writing or reviewing any SQL query. Key points to always remember:
ILIKE in BigQuery — use LOWER(col) LIKE '%pattern%'DATE_TRUNC(col, MONTH) not DATE_TRUNC('month', col) — the period is an identifier, not a stringAPPROX_COUNT_DISTINCT() for large-scale cardinality — much cheaper than COUNT(DISTINCT)SELECT * — select only the columns you needRun through these checks before sharing any analysis.
Data quality:
Calculation checks:
Reasonableness:
Presentation:
Join explosion — A many-to-many join silently multiplies rows, inflating counts and sums. Always check row counts after joins. Use COUNT(DISTINCT a.id) instead of COUNT(*) when counting entities through joins.
Survivorship bias — Analyzing only entities that exist today ignores those that were deleted, churned, or failed. Always ask: "who is NOT in this dataset?"
Incomplete period comparison — Comparing a partial month to a full month. Always filter to complete periods, or compare same-number-of-days.
Denominator shifting — The denominator changes between periods (e.g., "eligible" users redefined), making rates incomparable. Use consistent definitions across all compared periods.
Average of averages — Averaging pre-computed averages gives wrong results when group sizes differ. Always aggregate from raw data.
Timezone mismatches — Different sources use different timezones, causing misalignment. Standardize to a single timezone (UTC recommended) before analysis.
Selection bias in segmentation — Defining segments by the outcome being measured creates circular logic ("power users generate more revenue" — they became power users BY generating revenue). Define segments based on pre-treatment characteristics, not outcomes.
Magnitude checks:
| Metric Type | Sanity Check |
|---|---|
| User counts | Match known MAU/DAU figures? |
| Revenue | Right order of magnitude vs. known ARR? |
| Conversion rates | Between 0% and 100%? Match dashboard figures? |
| Growth rates | Is 50%+ MoM realistic, or a data issue? |
| Averages | Reasonable given the distribution? |
| Percentages | Segment percentages sum to ~100%? |
Cross-validation techniques:
Statistical validity:
references/statistical-analysis.md for methodsRed flags that warrant investigation:
Every non-trivial analysis should record:
## Analysis: [Title]
Question: [What's being answered]
Sources: [Tables used, as-of dates]
Definitions: [How key metrics are calculated]
Methodology: [Steps taken]
Assumptions: [What's assumed and why]
Limitations: [Known gaps and their impact]
Key Findings: [Results with supporting evidence]
| What You're Showing | Best Chart | Alternatives |
|---|---|---|
| Trend over time | Line chart | Area chart (cumulative/composition) |
| Comparison across categories | Vertical bar | Horizontal bar (many categories) |
| Ranking | Horizontal bar | Dot plot, slope chart (two periods) |
| Part-to-whole composition | Stacked bar | Treemap (hierarchical) |
| Composition over time | Stacked area | 100% stacked bar (proportion focus) |
| Distribution | Histogram | Box plot (comparing groups) |
| Correlation (2 variables) | Scatter plot | Bubble chart (add 3rd variable as size) |
| Correlation (many variables) | Heatmap | Pair plot |
| Flow / process | Sankey diagram | Funnel chart (sequential stages) |
| Performance vs. target | Bullet chart | Gauge (single KPI only) |
| Multiple KPIs at once | Small multiples | Dashboard with separate charts |
When NOT to use:
| Target | Best For | Agent Can Build? |
|---|---|---|
| Chart.js HTML | Quick analysis, one-off exploration, email attachment | ✅ Yes — fully autonomous |
| Google Sheets + BQ connector | Non-technical users, ad-hoc filtering/pivoting, familiar interface | ⚡ Partially — agent creates the BQ view, user connects it in Sheets |
| Looker Studio | Production dashboards, auto-refreshing, team-wide sharing | ❌ No API — agent prepares the data layer, user builds in GUI |
Start with Chart.js HTML for exploratory analysis and quick answers. Graduate to Sheets or Looker Studio when the analysis needs to be refreshed, shared with a team, or maintained over time.
For self-contained interactive dashboards the agent builds directly, see references/dashboard-patterns.md. Covers Chart.js CDN setup, KPI cards, line/bar/doughnut charts, filters, sortable tables, CSS design system, and performance guidelines by data size.
The most accessible output for non-technical team members. Two patterns:
Connected Sheets (recommended): Sheets connects directly to BQ — data stays in BigQuery, Sheets is just the interface. Users can pivot, chart, and filter BQ data in a familiar environment without writing SQL.
Agent's role — prepare the data layer:
project.dataset.view_name"Scheduled query → table → Sheets extract: For data that needs to be refreshed on a schedule. Agent creates:
For production dashboards that auto-refresh and are shared across a team. The agent cannot build Looker Studio dashboards (no creation API), but can prepare everything needed:
| Anti-Pattern | Problem | Fix |
|---|---|---|
| Skip Understand stage | Query wrong table, wrong grain, wrong types | Always get_table_info + profile before analysing |
SELECT * on large tables | Expensive scan, context overflow | Select only needed columns, aggregate first |
No LIMIT on exploratory queries | Hundreds of JSON rows flood context | Always LIMIT 20-50 for exploration |
| Eyeball significance | "Region A is lower" without testing | Use confidence intervals or z-test from references/statistical-analysis.md |
| Average of averages | Wrong result when group sizes differ | Aggregate from raw data, not pre-computed averages |
| Assume survey codes | Label S3=7 as "Scotland" without checking | Ask for the datamap/codebook first — see references/profiling-survey.md |
| Ignore straight-liners | Biased Likert scores | Detect and quantify before reporting — >20% is a red flag |
| Compare incomplete periods | Current month vs last full month | Exclude current incomplete period |
| Pie charts for many categories | Unreadable, angles hard to compare | Use horizontal bar charts instead |
| Build Looker Studio directly | No creation API exists | Prepare BQ views + dashboard spec for user to build |