Extracts and normalizes data from CSV, JSON, text files, Notion databases, and Google Drive (via gws CLI) into a unified schema. Activates when the user wants to load data, parse files, import data sources, or prepare raw data for report generation — handles delimiter detection, encoding, type inference, and cross-source merging.
From founder-osnpx claudepluginhub thecloudtips/founder-os --plugin founder-osThis skill uses the workspace's default tool permissions.
Designs and optimizes AI agent action spaces, tool definitions, observation formats, error recovery, and context for higher task completion rates.
Enables AI agents to execute x402 payments with per-task budgets, spending controls, and non-custodial wallets via MCP tools. Use when agents pay for APIs, services, or other agents.
Compares coding agents like Claude Code and Aider on custom YAML-defined codebase tasks using git worktrees, measuring pass rate, cost, time, and consistency.
Extract, parse, and normalize data from heterogeneous sources into a single unified structure that the Analysis Agent can consume without further transformation. Support five source types: CSV files, JSON files, plain text files, Notion databases, and Google Drive documents (via the gws CLI). Treat each source independently -- detect its format, apply format-specific extraction rules, normalize the result into the common output schema, and attach source metadata for traceability. Process multiple sources per report request by extracting from each in sequence and merging all results into a single dataset collection before handing off. Never modify original source files. All extraction is read-only.
Detect the delimiter by sampling the first 20 lines. Check for a .tsv extension first (assume tab). Otherwise count occurrences of comma, tab, semicolon, and pipe across sampled lines. Select the delimiter producing the most consistent column count (>=90% of lines yield the same count). Break ties with this preference order: comma > tab > semicolon > pipe. If no delimiter produces consistent columns, default to comma and flag the file as potentially malformed.
Default to UTF-8. On decode failure: attempt UTF-8 with BOM detection, fall back to Latin-1, then Windows-1252 if Latin-1 produces garbled output (3+ consecutive non-ASCII characters outside known language ranges). Record the detected encoding in source metadata and flag non-UTF-8 as a warning.
Determine whether the first row is a header by checking if it contains exclusively non-numeric strings while subsequent rows contain mixed types. Look for header patterns: underscores, camelCase, or title-cased words. When the first row matches the data pattern of other rows, treat the file as headerless and generate synthetic names (column_1, column_2, ..., column_N). Record header status in source metadata.
Infer each column's type by sampling all values:
Assign the dominant type when >=80% of non-empty values match. Otherwise assign text.
Follow RFC 4180: fields containing delimiters, newlines, or double quotes must be quoted. Doubled double quotes represent a literal quote. Preserve newlines within quoted fields. Strip whitespace from unquoted fields only.
Start at the root. If the root is an array, use it as the data collection. If the root is an object, search for the first key whose value is an array of objects (common keys: data, results, records, items, rows). Accept a user-specified dot-notation or bracket-notation path for deeper nesting. If no data array is found, flatten the entire object into a single-record dataset.
Flatten nested objects using dot-notation keys (address.city). Limit depth to 5 levels; serialize deeper structures as JSON strings. Join primitive arrays into comma-separated strings. Expand object arrays into indexed columns (contacts[0].name), capping at 20 elements with a truncation note.
Infer schema from the first 50 records. Collect all unique keys, classify value types using CSV type-inference rules, and mark keys present in >=95% of records as required. Record the schema in source metadata.
On syntax errors: report the error location, attempt common auto-repairs (trailing comma removal, single-to-double quote conversion, unquoted key quoting). If repair succeeds, proceed and flag in metadata. If repair fails, abort this source with status "extraction_failed" and continue other sources.
Classify the file structure: Tabular (fixed-width alignment consistent across >=80% of lines), Delimited (consistent delimiter patterns -- apply CSV rules), or Freeform (unstructured text).
Scan for patterns: colon-separated (Key: Value), equals-separated (Key = Value), tab-separated, or arrow-separated (Key -> Value). When >=60% of non-empty lines match a consistent pattern, extract as a key-value dataset. Handle multi-line values via continuation-line detection (indented lines after a key-value pair).
Identify embedded tables by horizontal separators (---, ===, |---|), consistent column alignment across 3+ lines, or Markdown table syntax. Extract each table as a separate dataset named by preceding heading text or position.
Detect timestamped logs (extract timestamp, level, message), key-value logs (parse key=value pairs), and Apache/Nginx access logs (extract IP, timestamp, method, path, status, size, referrer, user agent). Apply the matched parser to produce structured records.
Detect sections by Markdown headings, underlined headings, numbered sections, or all-caps header lines. Extract each section as a record: section_title, section_level, section_content, section_order.
Never hardcode database IDs. Search by title via the Notion MCP search endpoint. Prefer exact matches over partial matches. Try variations (with/without prefixes, with/without emoji). When multiple exact matches exist, select the most recently edited. Record discovered database ID and title in source metadata.
| Notion Type | Extraction Rule |
|---|---|
| title | Plain text string |
| rich_text | Plain text, strip formatting, preserve \n |
| number | Numeric value, null for empty |
| select | Option name as string |
| multi_select | Comma-separated option names |
| date | ISO start date; include end date if present |
| checkbox | Boolean |
| url, email, phone_number | String |
| relation | Page IDs as array; optionally resolve to titles |
| formula, rollup | Computed value by result type |
| created_time, last_edited_time | ISO datetime string |
| created_by, last_edited_by | User name |
| files | File URLs as array |
| status | Status name as string |
Fetch 100 records per request. Follow next_cursor while has_more is true. Stop at 10,000 records (safety limit), flag truncation in metadata, and track total API calls.
Translate human-readable conditions from the report spec into Notion filter objects. Support compound AND/OR filters (default AND). Support sort_by and sort_direction. Retrieve all records when no filters are specified.
Check for gws CLI availability before attempting Drive extraction:
which gws || echo "gws CLI not available"
If the gws CLI is unavailable or authentication is not configured, skip Drive extraction entirely and mark the source as unavailable in metadata.
gws drive files export --params '{"fileId":"FILE_ID","mimeType":"text/csv"}' --output /tmp/sheet.csv, then apply CSV rules. Extract each sheet tab as a separate dataset.gws drive files export --params '{"fileId":"FILE_ID","mimeType":"text/plain"}' --output /tmp/doc.txt, then apply text extraction rules. Preserve heading hierarchy.gws drive files export. One record per slide: slide_number, title, body_text.Search by file name from the report spec using the gws CLI:
gws drive files list --params '{"q":"name contains '\''SEARCH_TERM'\''","pageSize":20,"fields":"files(id,name,mimeType,modifiedTime,webViewLink)"}' --format json
Support wildcard patterns. Filter by MIME type or folder path when specified. Extract from all matching files, naming each dataset by file name.
Use gws drive files export for Google-native formats, direct download for uploaded files. Skip files exceeding 50 MB and record the skip. Track download status per file.
Convert to snake_case: lowercase, replace spaces with underscores, remove special characters. Collapse consecutive underscores. Prefix numeric names with col_. Resolve collisions with numeric suffixes (revenue, revenue_2).
true/false.null.Match records across sources using a join key when specified in the report spec. Without a join key, keep datasets separate. When merging, prefix column names with source name to avoid collisions. Record merge statistics in metadata.
Produce a JSON object conforming to this structure for handoff to the Analysis Agent:
{
"extraction_id": "uuid-v4",
"generated_at": "ISO-8601 datetime",
"report_spec": {
"title": "Report title from user request",
"sources_requested": ["source references from the report spec"]
},
"datasets": [
{
"dataset_id": "uuid-v4",
"source": {
"type": "csv | json | text | notion | gdrive",
"name": "Original file name or database title",
"path": "File path, database ID, or Drive file ID",
"encoding": "Detected encoding",
"format_details": {
"delimiter": ",",
"header_detected": true,
"record_count": 150,
"column_count": 12
}
},
"schema": {
"columns": [
{
"name": "snake_case_name",
"original_name": "Original Name",
"type": "numeric | date | boolean | text",
"nullable": true,
"sample_values": ["first 3 non-null values"]
}
]
},
"records": [
{"column_name": "value"}
],
"warnings": ["Issues encountered during extraction"],
"status": "success | partial | failed"
}
],
"metadata": {
"total_datasets": 3,
"total_records": 450,
"extraction_duration_ms": 2340,
"sources_successful": 3,
"sources_failed": 0,
"warnings": ["Pipeline-level warnings"]
}
}
Every field is required. Use null for unavailable scalars and [] for unavailable lists. Never omit a key.
Set dataset status to "partial". Produce an empty records array. Add warning: "Source file is empty -- no data extracted." Continue to the next source.
Attempt the primary detected encoding first. Replace problematic bytes with the Unicode replacement character (U+FFFD) on per-field failures and add a per-record warning. Record encoding issues in source metadata.
Extract the first 10,000 rows when a source exceeds 10,000 rows or 50 MB. Record total row count in metadata. Add truncation warning and suggest applying filters or splitting the source.
Set dataset status to "failed" for binary files, images, video, or proprietary formats. Record the MIME type or extension. Add warning: "Unsupported file format: [format]. Skipping extraction." Move to the next source.
Retry once after a 5-second delay. On second failure, set dataset status to "failed". Record error type, HTTP status code, and error message. Continue processing remaining sources -- never halt the entire extraction for one API failure.
When the same source is referenced twice, extract once and reuse the dataset entry. Add metadata note: "Duplicate source reference detected -- extracted once."
Extract Notion self-referencing relations as page IDs only. Do not recursively resolve. Limit relation resolution to one level of depth. Add a metadata note if circular references are detected.