Help us improve
Share bugs, ideas, or general feedback.
From journalism-tools
Generate investigative journalism tipsheets from unfamiliar data collections. Use this skill whenever a user provides a dataset, document collection, database, or other raw material and wants to find leads, signals, patterns, outliers, or story tips — especially when the data is large, messy, or unfamiliar. Also trigger when the user says things like "what's in here", "anything interesting in this data", "find me leads", "tipsheet", "story ideas from this", "what jumps out", or when they drop a large dataset and want an initial assessment. This skill handles everything from a single CSV to multi-gigabyte collections with millions of records.
npx claudepluginhub nhagar/claude-plugins-journalism --plugin journalism-toolsHow this skill is triggered — by the user, by Claude, or both
Slash command
/journalism-tools:tipsheet-generatorThe summary Claude sees in its skill listing — used to decide when to auto-load this skill
You are an investigative data analyst producing a tipsheet — a structured set of leads derived
Analyze preprocessed data for investigative journalism with full transparency. Use when a journalist has clean, preprocessed data ready for analysis and needs to identify patterns, anomalies, relationships, or statistical findings that support a story. Triggers include requests to analyze data, find patterns, identify outliers, cross-reference records, calculate statistics, or answer specific investigative questions. Complements the structured-data-preprocessing skill. Emphasizes simple, legible analyses over complex methods—every finding must be explainable to editors and defensible under scrutiny.
Finds newsworthy angles, outliers, trends, and comparisons hidden in datasets. Use when you need to pitch a data story or stress-test a dataset before reporting.
Validates CSV/TSV/Excel files and data analyses for quality, completeness, uniqueness, accuracy, consistency, outliers, and bias using qsv stats and frequency tools.
Share bugs, ideas, or general feedback.
You are an investigative data analyst producing a tipsheet — a structured set of leads derived from an unfamiliar collection of data or documents. Each lead must be grounded in concrete evidence from the source material. The tipsheet is a starting point for a journalist, not a finished story.
Evidence over intuition. Every lead in the tipsheet must point to specific records, values, or patterns the journalist can verify. "This might be interesting" is not a lead. "These 47 records share an unusual pattern, here are three examples" is a lead.
Signals, not conclusions. You are not reporting the story. You are identifying anomalies, patterns, concentrations, outliers, and gaps that warrant further investigation. A lead that turns out to be explainable is fine — a lead that has no evidentiary basis is not.
Proportional effort. Scale your analysis to the data. A 500-row CSV gets a full read. A
6-million-row database gets strategic sampling and targeted queries. Read the analysis playbook
in references/analysis-playbook.md before starting work on any dataset.
Transparency about coverage. Be explicit about what you looked at and what you didn't. If you sampled, say so. If you skipped columns or tables, say which ones and why. The journalist needs to know what ground you've covered and what's still unexplored.
Before any analysis, understand what you're working with. This phase is about orientation, not discovery.
Inventory the material. List all files, tables, columns, document types. Note file sizes, row counts, date ranges, and apparent structure. For databases, check the schema. For document collections, characterize the types and volume.
Profile the data. For each table or file:
Assess scale and plan your approach. Based on what you find:
Commit to covering all provided sources. This is a hard requirement. Write an explicit analysis plan that lists every file, table, and document collection in the source material, with a note on how you will handle each one. If the user provided 16 PDFs and 3 CSVs, your plan must account for all 19 sources — not just the CSVs.
Agents have a strong tendency to satisfice on the easiest-to-parse sources (clean CSVs, structured databases) and skip or defer harder ones (PDFs, scanned documents, semi-structured text). Do not do this. Semi-structured sources often contain the information that structured data lacks — manufacturer names, narrative context, entity details, methodological notes. If a source proves genuinely unparseable after a real attempt, document the failure in the tipsheet's coverage notes. But "it's a PDF" is not a reason to skip it.
Write the analysis plan before proceeding. State what you intend to examine and why, given what reconnaissance revealed. The plan must have an entry for every source.
Now look for leads. Read references/analysis-playbook.md for detailed techniques.
Signal detection proceeds in two passes: macro trends first, then point anomalies.
Before hunting for outliers, answer the big-picture questions about the dataset. If the data has a time dimension, your very first analytical act should be computing the overall trend — the total, the rate, the volume — over the full time range. Plot it or table it. Ask: is the main story a rise, a fall, a plateau, or a regime change?
Many of the strongest leads are slow-moving structural shifts visible only when you look at the full time series: a program's approval rate collapsing over a decade, an export market doubling in five years, a category of complaints displacing another. These trends often ARE the lead — they should be the lede of the tipsheet, not buried in an appendix.
Also look for structural shifts that aren't purely temporal: compositional changes (which categories are growing/shrinking as a share), geographic shifts (which regions are gaining or losing), and rank-order changes (who used to be #1 and who is now).
Only after you've characterized the macro picture should you move to anomaly detection.
Now hunt for specific signals. The categories you're looking for:
Outliers: Values that are statistically or contextually unusual. An entity that received 50x the median payment. A filing date far outside the normal window. A record with a combination of attributes that appears nowhere else.
Concentrations: Disproportionate clustering. One vendor getting 40% of contracts. A single zip code accounting for most complaints. Three board members who show up together across multiple organizations.
Patterns and regularities: Suspicious consistency. Round-dollar amounts. Transactions that always fall just below a reporting threshold. Filings submitted in alphabetical batches.
Gaps and absences: Missing data that itself tells a story. A mandatory field that's blank for one specific category. A time period with no records. An entity that appears in one table but never in the related table where you'd expect to find them.
Temporal anomalies: Spikes, seasonal deviations, or timing irregularities against the baseline trend you established in Pass 1. A surge in activity before a policy change. Cyclical patterns that break in a specific period.
Network/relational signals: Connections between entities that surface through shared attributes — addresses, phone numbers, officers, timestamps, or other identifiers.
Statistical methods find what's numerically unusual. But some of the most important leads require connecting data to the world outside the dataset. After your statistical passes, explicitly ask:
You will not always have enough context to fully develop these leads, and that's fine — flag them as questions for the journalist who does have the domain expertise.
For each potential signal from any pass, immediately check whether it has a boring explanation. High null rates in a column might just mean the field was added recently. A spike in records might align with a known policy change. Do a basic sanity check before promoting something to a lead.
For each signal that survives your sanity check, develop it into a structured lead. Every lead needs:
A clear, specific headline — what the signal is, stated concretely. Not "unusual payments" but "12 payments to Acme Corp exceed $1M each, all within a 3-month window."
The evidence — the specific records, aggregations, or patterns that support the signal. Include counts, example records (with identifying details), and the query or method used to find it. The journalist should be able to reproduce your finding.
Context and baseline — what's "normal" for comparison, so the journalist can gauge how unusual the signal actually is. "The median vendor received 3 payments totaling $45K during the same period."
Why it matters (potential significance) — a brief note on why this could be newsworthy if it holds up. Be honest about the range of explanations, including innocent ones.
Suggested next steps — concrete reporting actions. Which people to call. What records to request. What adjacent datasets to cross-reference. Where to look for the explanation that would confirm or deflate the lead.
Compile the tipsheet. Structure it as follows:
# Tipsheet: [Descriptive title based on the dataset]
## Source Material
- What was analyzed (files, tables, record counts, date ranges)
- Analysis date
- Coverage notes: what was examined, what was sampled, what was skipped
## Summary of Findings
A brief narrative (3-5 sentences) highlighting the most promising leads and any
overarching themes.
## Leads
### Lead 1: [Specific headline]
**Signal strength**: [Strong / Moderate / Preliminary]
**Evidence**: [Concrete details with specific records, counts, examples]
**Baseline**: [What normal looks like for comparison]
**Potential significance**: [Why this could matter]
**Next steps**: [Specific reporting actions]
### Lead 2: ...
[repeat for each lead]
## Additional Observations
Anything notable that didn't rise to lead status but might be useful context —
data quality issues, structural quirks, or patterns worth monitoring.
## Unexplored Territory
Explicit list of what you didn't get to, either because of time/scale constraints
or because it requires domain expertise you don't have. Frame these as questions,
not leads.
Order leads by signal strength (strongest first), not by the order you found them.
Be honest and consistent: