Interactive agent to help users create proper `unify.yml` configuration files for hybrid ID unification across Snowflake and Databricks platforms.
/plugin marketplace add treasure-data/aps_claude_tools
/plugin install treasure-data-cdp-hybrid-idu-plugins-cdp-hybrid-idu@treasure-data/aps_claude_tools
Collect:
Example Interaction:
Question: What would you like to name this unification project?
Suggestion: Use a descriptive name like 'customer_unification' or 'user_identity_resolution'
User input: customer_360
✓ Project name: customer_360
Collect:
- valid_regexp: Regex pattern for format validation
- invalid_texts: Array of values to exclude
Example Interaction:
Question: What user identifier columns (keys) do you want to use for unification?
Common keys:
- email: Email addresses
- customer_id: Customer identifiers
- phone_number: Phone numbers
- td_client_id: Treasure Data client IDs
- user_id: User identifiers
User input: email, customer_id, phone_number
For each key, I'll help you set up validation rules...
Key: email
Question: Would you like to add a regex validation pattern for email?
Suggestion: Use ".*@.*" for basic email validation, or a stricter pattern
User input: .*@.*
Question: What values should be considered invalid?
Suggestion: Common invalid values: '', 'N/A', 'null', 'unknown'
User input: '', 'N/A', 'null'
✓ Key 'email' configured with regex validation and 3 invalid values
Generate YAML Section:
keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null']
  - name: customer_id
    invalid_texts: ['', 'N/A', 'null']
  - name: phone_number
    invalid_texts: ['', 'N/A', 'null']
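To make the effect of these rules concrete, here is a minimal Python sketch of how a value might be checked against a key's `valid_regexp` and `invalid_texts`. The function name `is_valid_key_value` and the `KEY_RULES` dictionary are hypothetical, and the matching semantics (full match of the regex, exact match against the invalid list) are assumptions for illustration, not a description of how the unification engine evaluates them.

```python
import re

# Key rules mirroring the YAML section above, written as plain Python data.
KEY_RULES = {
    "email": {"valid_regexp": ".*@.*", "invalid_texts": ["", "N/A", "null"]},
    "customer_id": {"invalid_texts": ["", "N/A", "null"]},
    "phone_number": {"invalid_texts": ["", "N/A", "null"]},
}


def is_valid_key_value(key, value):
    """Return True if `value` passes the rules configured for `key`.

    Assumption: a value is rejected when it is missing, is listed in
    invalid_texts, or does not fully match valid_regexp.
    """
    rule = KEY_RULES[key]
    if value is None or value in rule.get("invalid_texts", []):
        return False
    pattern = rule.get("valid_regexp")
    if pattern is not None and re.fullmatch(pattern, value) is None:
        return False
    return True
```

Under these assumptions, `is_valid_key_value("email", "N/A")` is rejected by the invalid list, while `is_valid_key_value("email", "not-an-email")` is rejected by the regex.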
Collect:
Example Interaction:
Question: What source tables contain user identifiers?
User input: customer_profiles, orders, web_events
For each table, I'll help you map columns to keys...
Table: customer_profiles
Question: Which columns in this table map to your keys?
Available keys: email, customer_id, phone_number
User input:
- email_std → email
- customer_id → customer_id
✓ Table 'customer_profiles' mapped with 2 key columns
Table: orders
Question: Which columns in this table map to your keys?
User input:
- email_address → email
- phone → phone_number
✓ Table 'orders' mapped with 2 key columns
Generate YAML Section:
tables:
  - table: customer_profiles
    key_columns:
      - {column: email_std, key: email}
      - {column: customer_id, key: customer_id}
  - table: orders
    key_columns:
      - {column: email_address, key: email}
      - {column: phone, key: phone_number}
  - table: web_events
    key_columns:
      - {column: user_email, key: email}
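The step from the user's `column → key` answers to `key_columns` entries can be sketched in Python as follows. The helper `parse_key_columns` is hypothetical and only illustrates the mapping; the agent's actual parsing is not documented here.

```python
def parse_key_columns(lines, known_keys):
    """Turn 'column → key' lines into key_columns entries.

    Raises ValueError for malformed lines or keys that were never declared.
    """
    entries = []
    for line in lines:
        # Drop the leading list dash and surrounding whitespace.
        text = line.strip().lstrip("-").strip()
        column, sep, key = (part.strip() for part in text.partition("→"))
        if not sep or not column or not key:
            raise ValueError(f"expected 'column → key', got: {line!r}")
        if key not in known_keys:
            raise ValueError(f"unknown key {key!r} for column {column!r}")
        entries.append({"column": column, "key": key})
    return entries
```

For example, passing the two `customer_profiles` answers from the interaction above with the known keys `email`, `customer_id`, and `phone_number` yields the two entries shown under that table.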
Collect:
Example Interaction:
Question: What would you like to name the canonical ID column?
Suggestion: Common names: 'unified_id', 'canonical_id', 'master_id'
User input: unified_id
Question: Which keys should participate in the merge/unification?
Available keys: email, customer_id, phone_number
Suggestion: List keys in priority order (highest priority first)
Example: email, customer_id, phone_number
User input: email, customer_id, phone_number
Question: How many merge iterations would you like?
Suggestion:
- Leave blank to auto-calculate based on complexity
- Typical range: 3-10 iterations
- More keys/tables = more iterations needed
User input: (blank - auto-calculate)
✓ Canonical ID 'unified_id' configured with 3 merge keys
✓ Iterations will be auto-calculated
Generate YAML Section:
canonical_ids:
  - name: unified_id
    merge_by_keys: [email, customer_id, phone_number]
    # merge_iterations: 15 (auto-calculated)
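To illustrate why more keys and tables can require more merge iterations, here is a small Python sketch of a generic label-propagation loop. This is not the unification engine's algorithm; the function `unify` and its row format are assumptions chosen only to show that identities linked through a chain of keys need several passes before they share one label.

```python
def unify(rows, max_iterations=10):
    """Assign one label per connected group of key values.

    rows: list of dicts such as {"email": "a@x.com", "phone_number": "555"}.
    Returns (labels, iterations_used), where labels maps each
    (key, value) pair to the smallest pair in its group.
    """
    # Every (key, value) pair starts out as its own group.
    labels = {}
    for row in rows:
        for pair in row.items():
            labels[pair] = pair

    for iteration in range(1, max_iterations + 1):
        updated = dict(labels)
        for row in rows:
            pairs = list(row.items())
            # Pairs seen in the same row adopt the smallest label among them,
            # based on the labels from the previous iteration.
            smallest = min(labels[p] for p in pairs)
            for p in pairs:
                if smallest < updated[p]:
                    updated[p] = smallest
        if updated == labels:
            # Nothing changed, so the previous iteration already converged.
            return labels, iteration - 1
        labels = updated
    return labels, max_iterations


rows = [
    {"email": "a@x.com", "customer_id": "C1"},
    {"customer_id": "C1", "phone_number": "555"},
    {"phone_number": "555", "email": "b@x.com"},
]
labels, used = unify(rows)
```

In this example the second email is only connected to the first through `customer_id` and `phone_number`, so a single iteration leaves it in its own group and a second one is needed to merge it.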
Collect:
Example Interaction:
Question: Would you like to create master tables with aggregated attributes?
(Master tables combine data from multiple sources into unified customer profiles)
User input: yes
Question: What would you like to name this master table?
Suggestion: Common names: 'customer_master', 'user_profile', 'unified_customer'
User input: customer_master
Question: Which canonical ID should this master table use?
Available: unified_id
User input: unified_id
Question: What attributes would you like to aggregate?
Attribute 1:
Name: best_email
Type: single value or array?
User input: single value
Source columns (priority order):
1. Table: customer_profiles, Column: email_std, Order by: time
2. Table: orders, Column: email_address, Order by: time
✓ Attribute 'best_email' configured with 2 sources
Attribute 2:
Name: top_3_emails
Type: single value or array?
User input: array
Array size: 3
Source columns (priority order):
1. Table: customer_profiles, Column: email_std, Order by: time
2. Table: orders, Column: email_address, Order by: time
✓ Attribute 'top_3_emails' configured as array with 2 sources
Generate YAML Section:
master_tables:
  - name: customer_master
    canonical_id: unified_id
    attributes:
      - name: best_email
        source_columns:
          - {table: customer_profiles, column: email_std, priority: 1, order_by: time}
          - {table: orders, column: email_address, priority: 2, order_by: time}
      - name: top_3_emails
        array_elements: 3
        source_columns:
          - {table: customer_profiles, column: email_std, priority: 1, order_by: time}
          - {table: orders, column: email_address, priority: 2, order_by: time}
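The difference between a single-value attribute and an array attribute can be sketched in Python as below. The function `pick_values` is hypothetical, and its ordering rules are assumptions: the lowest `priority` number wins, ties are broken by the newest `time`, nulls are skipped, and repeated values are dropped. The engine's real semantics for `priority` and `order_by` may differ.

```python
def pick_values(candidates, array_elements=None):
    """Pick attribute values from candidate rows for one canonical ID.

    candidates: list of dicts with "value", "priority", and "time" keys,
    where "time" is a number such as a Unix timestamp.
    Assumed ordering: lowest priority number first, then newest time first.
    """
    ranked = sorted(
        (c for c in candidates if c["value"] is not None),
        key=lambda c: (c["priority"], -c["time"]),
    )
    # Drop repeated values while keeping the ranked order (an assumption).
    unique = list(dict.fromkeys(c["value"] for c in ranked))
    if array_elements is None:
        # Single-value attribute such as best_email.
        return unique[0] if unique else None
    # Array attribute such as top_3_emails.
    return unique[:array_elements]


candidates = [
    {"value": "old@x.com", "priority": 1, "time": 100},
    {"value": "new@x.com", "priority": 1, "time": 200},
    {"value": "order@x.com", "priority": 2, "time": 300},
    {"value": "new@x.com", "priority": 2, "time": 250},
]
best_email = pick_values(candidates)
top_3_emails = pick_values(candidates, array_elements=3)
```

Here `best_email` resolves to the newest value from the priority 1 source, and `top_3_emails` continues down the same ranking until three distinct values are collected.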
Perform:
- Write the unify.yml file
Example Output:
Validating configuration...
✅ YAML structure valid
✅ All key references resolved
✅ All table references valid
✅ Canonical ID properly configured
✅ Master tables correctly defined
Configuration Summary:
• Project: customer_360
• Keys: 3 (email, customer_id, phone_number)
• Tables: 3 (customer_profiles, orders, web_events)
• Canonical ID: unified_id
• Master Tables: 1 (customer_master with 2 attributes)
• Estimated iterations: 5 (auto-calculated)
Writing unify.yml...
✓ Configuration file created successfully!
File location: ./unify.yml
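The reference checks reported in the output above can be approximated with a short Python sketch. The function `validate_config` is hypothetical and operates on a configuration that has already been loaded into a dictionary; it covers only key, canonical ID, and table references, and is not the agent's actual validation logic.

```python
def validate_config(config):
    """Return a list of problems found in a unify.yml-style dictionary."""
    problems = []
    key_names = {k["name"] for k in config.get("keys", [])}

    # Every key_columns entry must reference a declared key.
    table_names = set()
    for table in config.get("tables", []):
        table_names.add(table["table"])
        for kc in table.get("key_columns", []):
            if kc["key"] not in key_names:
                problems.append(f"table {table['table']}: unknown key {kc['key']}")

    # Every merge key of a canonical ID must be a declared key.
    canonical_names = set()
    for cid in config.get("canonical_ids", []):
        canonical_names.add(cid["name"])
        for key in cid.get("merge_by_keys", []):
            if key not in key_names:
                problems.append(f"canonical_id {cid['name']}: unknown key {key}")

    # Master tables must point at a known canonical ID and known tables.
    for master in config.get("master_tables", []):
        if master["canonical_id"] not in canonical_names:
            problems.append(
                f"master_table {master['name']}: unknown canonical_id "
                f"{master['canonical_id']}"
            )
        for attr in master.get("attributes", []):
            for src in attr.get("source_columns", []):
                if src["table"] not in table_names:
                    problems.append(
                        f"attribute {attr['name']}: unknown table {src['table']}"
                    )
    return problems
```

An empty list means all references resolved; each string in a non-empty list names the section and the reference that could not be matched.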
Returns complete unify.yml with:
Performs checks:
name: customer_360
keys:
  - name: email
    valid_regexp: ".*@.*"
    invalid_texts: ['', 'N/A', 'null', 'unknown']
  - name: customer_id
    invalid_texts: ['', 'N/A', 'null']
  - name: phone_number
    invalid_texts: ['', 'N/A', 'null']
tables:
  - table: customer_profiles
    key_columns:
      - {column: email_std, key: email}
      - {column: customer_id, key: customer_id}
  - table: orders
    key_columns:
      - {column: email_address, key: email}
      - {column: phone, key: phone_number}
  - table: web_events
    key_columns:
      - {column: user_email, key: email}
canonical_ids:
  - name: unified_id
    merge_by_keys: [email, customer_id, phone_number]
    merge_iterations: 15
master_tables:
  - name: customer_master
    canonical_id: unified_id
    attributes:
      - name: best_email
        source_columns:
          - {table: customer_profiles, column: email_std, priority: 1, order_by: time}
          - {table: orders, column: email_address, priority: 2, order_by: time}
      - name: primary_phone
        source_columns:
          - {table: orders, column: phone, priority: 1, order_by: time}
      - name: top_3_emails
        array_elements: 3
        source_columns:
          - {table: customer_profiles, column: email_std, priority: 1, order_by: time}
          - {table: orders, column: email_address, priority: 2, order_by: time}