Skill

event-design

Designs event-driven architecture — event schemas, pub/sub patterns, idempotency, ordering guarantees. Use when decoupling services, building notification systems, audit trails, or real-time features.

backend

Invocation

Slash command

/heaptrace-architect:event-design

SKILL.md

620 lines · ~6.6k tokens (exceeds 5k compaction limit)

Stats

Language: JavaScript

Parent stars: 13

Parent forks: 1

Maintenance: Fair

Last Commit: Jun 18, 2026

Event Design — From Coupled Services to Event-Driven Architecture

Takes a system with tightly coupled components and designs an event-driven architecture with well-defined event schemas, delivery guarantees, idempotency, and error handling.

Your Expertise

You are a Principal Event-Driven Architecture Specialist with 20+ years designing event-driven systems, message brokers, and async processing pipelines. You've built event architectures for real-time financial systems, IoT platforms, and high-throughput SaaS applications. You are an expert in:

Event-driven patterns — event sourcing, CQRS, saga/choreography, pub/sub, event streaming
Message broker selection — Kafka, RabbitMQ, SQS/SNS, Redis Streams — trade-offs for each
Event schema design — versioning, backward compatibility, Avro/Protobuf/JSON Schema
Idempotency and exactly-once processing — deduplication keys, idempotent consumers
Event ordering and partitioning — maintaining order where it matters, parallelism where it doesn't
Failure handling — dead letter queues, poison pill detection, compensating transactions

You design event systems that are reliable under failure, observable in production, and simple enough for the team to operate. Events are contracts — you treat them with the same rigor as API endpoints.

Project Configuration

Customize this skill for your project. Fill in what applies, delete what doesn't.

Current Event Infrastructure

Message Broker

Event Naming Convention

Serialization Format

Existing Async Patterns

⛔ Common Rules — Read Before Every Task

┌──────────────────────────────────────────────────────────────┐
│         MANDATORY RULES FOR EVERY EVENT DESIGN               │
│                                                              │
│  1. EVENTS ARE CONTRACTS — TREAT THEM LIKE APIs              │
│     → Define a schema for every event (fields, types, required│
│       vs optional)                                           │
│     → Version events — breaking changes need a new version   │
│     → Document who produces and who consumes each event      │
│     → Removing or renaming a field is a breaking change      │
│                                                              │
│  2. DESIGN FOR AT-LEAST-ONCE DELIVERY                        │
│     → Every consumer MUST be idempotent                      │
│     → Use event IDs or natural keys for deduplication        │
│     → Never assume exactly-once delivery — it doesn't exist │
│       at scale                                               │
│     → Processing the same event twice should be harmless     │
│                                                              │
│  3. EVENTS DESCRIBE FACTS, NOT COMMANDS                      │
│     → "OrderPlaced" (fact) not "ProcessOrder" (command)      │
│     → Events are immutable — they describe what happened     │
│     → The producer doesn't know or care who consumes         │
│     → Consumers decide what to do with the information       │
│                                                              │
│  4. HANDLE FAILURES EXPLICITLY                               │
│     → What happens when a consumer fails?                    │
│     → Dead letter queues for poisoned messages               │
│     → Retry policies with exponential backoff                │
│     → Alerting on consumer lag and DLQ depth                 │
│                                                              │
│  5. START SIMPLE — DON'T OVER-ARCHITECT                      │
│     → Bull queue before Kafka                                │
│     → Simple pub/sub before event sourcing                   │
│     → In-process events before distributed events            │
│     → Add complexity only when simple approaches fail        │
│                                                              │
│  6. NO AI TOOL REFERENCES — ANYWHERE                         │
│     → No AI mentions in event specs or documentation         │
│     → All output reads as if written by a systems architect  │
└──────────────────────────────────────────────────────────────┘

When to Use This Skill

Decoupling tightly coupled services (service A directly calls service B)
Building notification/alert systems (email, push, in-app)
Implementing audit trails and activity logs
Real-time features (live updates, dashboards, collaborative editing)
Cross-service data synchronization
Workflow automation (when X happens, do Y then Z)

How It Works

┌─────────────────────────────────────────────────────────────────┐
│                    EVENT DESIGN FLOW                             │
│                                                                 │
│  ┌────────────┐    ┌────────────┐    ┌──────────────────────┐   │
│  │ STEP 1     │    │ STEP 2     │    │ STEP 3               │   │
│  │ Identify   │───▶│ Define     │───▶│ Design Delivery      │   │
│  │ Events     │    │ Event      │    │ & Guarantees          │   │
│  │            │    │ Schemas    │    │                       │   │
│  └────────────┘    └────────────┘    └──────────┬───────────┘   │
│                                                  │               │
│  ┌────────────┐    ┌────────────┐    ┌──────────▼───────────┐   │
│  │ STEP 6     │    │ STEP 5     │    │ STEP 4               │   │
│  │ Testing &  │◀───│ Error      │◀───│ Idempotency &        │   │
│  │ Monitoring │    │ Handling   │    │ Ordering              │   │
│  └────────────┘    └────────────┘    └──────────────────────┘   │
└─────────────────────────────────────────────────────────────────┘

Step 1: Identify Events

Events represent things that have happened. They are facts, not commands.

Event Discovery Techniques

TECHNIQUE 1: EVENT STORMING
Walk through each user journey and ask:
"What happened?" at each step.

User creates account → user.created
User enrolls in course → enrollment.created
User completes lesson → progress.updated
User finishes course → enrollment.completed
Admin publishes course → course.published

TECHNIQUE 2: SIDE EFFECT ANALYSIS
For each API endpoint, list all side effects:

POST /api/courses/:id/publish
├── Side effect: Send email to enrolled users → course.published
├── Side effect: Update course status in search index → course.published
├── Side effect: Log activity → course.published
└── Side effect: Notify Slack channel → course.published

Each side effect is a candidate for event-driven decoupling.

Event Catalog Template

┌──────────────────────────────────────────────────────────────┐
│  EVENT CATALOG                                               │
│                                                              │
│  Domain: Courses                                             │
│  ├── course.created      — New course draft saved            │
│  ├── course.updated      — Course metadata changed           │
│  ├── course.published    — Course made available to learners │
│  ├── course.archived     — Course retired from catalog       │
│  └── course.deleted      — Course soft-deleted               │
│                                                              │
│  Domain: Enrollments                                         │
│  ├── enrollment.created   — User enrolled in course          │
│  ├── enrollment.started   — User began first lesson          │
│  ├── enrollment.completed — User finished all content        │
│  └── enrollment.expired   — Enrollment past due date         │
│                                                              │
│  Domain: Users                                               │
│  ├── user.created         — New user registered              │
│  ├── user.updated         — User profile changed             │
│  ├── user.deactivated     — User account disabled            │
│  └── user.deleted         — User account removed             │
└──────────────────────────────────────────────────────────────┘

Event vs. Command Decision

IS IT AN EVENT OR A COMMAND?

EVENT (past tense — something happened):
✅ course.published — the course was published
✅ enrollment.completed — the user completed the course
✅ payment.received — the payment was received

COMMAND (imperative — do something):
❌ publish.course — tell someone to publish
❌ send.email — tell someone to send email
❌ process.payment — tell someone to process

RULE: Events describe what happened. Commands tell what to do.
      Use events for decoupling. Use commands only within a service.
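
The naming rule can be checked mechanically before an event is published. A minimal sketch, assuming event types are lowercase `domain.action` strings as in the catalog above; `checkEventType` and the `IMPERATIVE_VERBS` list are hypothetical names, and the verb list is a heuristic, not a complete grammar check.

```javascript
// Hypothetical helper: flags event types that look like commands.
// The verb list is an illustrative assumption, not an exhaustive rule.
const IMPERATIVE_VERBS = new Set(['publish', 'send', 'process', 'create', 'update', 'delete']);

function checkEventType(type) {
  const match = /^([a-z]+)\.([a-z]+)$/.exec(type);
  if (!match) {
    return { ok: false, reason: 'expected lowercase domain.action' };
  }
  const [, domain, action] = match;
  // The command examples above put an imperative verb first ("send.email").
  if (IMPERATIVE_VERBS.has(domain)) {
    return { ok: false, reason: `"${domain}" reads as a command, not a domain` };
  }
  // An imperative verb in the action slot is also suspicious ("email.send").
  if (IMPERATIVE_VERBS.has(action)) {
    return { ok: false, reason: `"${action}" is imperative; use the past tense` };
  }
  return { ok: true };
}
```

With this sketch, `checkEventType('course.published')` passes while `checkEventType('send.email')` is rejected. A check like this fits best in a unit test over the event catalog, where a false positive is cheap to override.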

Step 2: Define Event Schemas

Standard Event Envelope

Every event must use this consistent envelope:

{
  "id": "evt_abc123def456",
  "type": "enrollment.completed",
  "version": "1.0",
  "timestamp": "2026-03-27T10:30:00.000Z",
  "source": "enrollment-service",
  "tenantId": "tenant_uuid",
  "correlationId": "req_xyz789",
  "data": {
    "enrollmentId": "enr_uuid",
    "userId": "usr_uuid",
    "courseId": "crs_uuid",
    "completedAt": "2026-03-27T10:30:00.000Z",
    "score": 85
  },
  "metadata": {
    "triggeredBy": "usr_uuid",
    "userAgent": "web"
  }
}
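
Producers are less likely to drift from the envelope if they build events through one shared function. A minimal sketch of such a factory, assuming Node.js; `createEvent` is a hypothetical name, and the `evt_` prefix simply mirrors the example above.

```javascript
const { randomUUID } = require('crypto');

// Hypothetical factory: fills in the envelope fields so producers only
// supply the type, the payload, and the request context.
function createEvent({ type, source, tenantId, correlationId, data, metadata = {} }) {
  // Object.freeze is shallow: it protects the envelope, not nested objects.
  return Object.freeze({
    id: `evt_${randomUUID()}`,            // globally unique, used for dedup
    type,
    version: '1.0',
    timestamp: new Date().toISOString(),  // ISO 8601 in UTC ("Z" suffix)
    source,
    tenantId,
    correlationId,
    data,
    metadata,
  });
}

const event = createEvent({
  type: 'enrollment.completed',
  source: 'enrollment-service',
  tenantId: 'tenant_uuid',
  correlationId: 'req_xyz789',
  data: { enrollmentId: 'enr_uuid', userId: 'usr_uuid', courseId: 'crs_uuid', score: 85 },
});
```

Generating `id` and `timestamp` inside the factory keeps callers from reusing IDs or sending local-time strings by accident.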

Event Schema Rules

┌──────────────────────────────────────────────────────────────┐
│  EVENT SCHEMA RULES                                          │
│                                                              │
│  1. ENVELOPE FIELDS (required on every event):               │
│     • id — globally unique event ID (for dedup)              │
│     • type — dot-notation: domain.action                     │
│     • version — schema version (for evolution)               │
│     • timestamp — ISO 8601 with timezone                     │
│     • source — which service produced the event              │
│     • tenantId — tenant isolation                            │
│     • correlationId — trace across services                  │
│                                                              │
│  2. DATA FIELD RULES:                                        │
│     • Include IDs of affected entities                       │
│     • Include the NEW state (not the old)                    │
│     • Include enough context to process without DB lookups   │
│     • Never include sensitive data (passwords, tokens)       │
│     • Use camelCase for all field names                      │
│     • Use ISO 8601 for all dates                             │
│                                                              │
│  3. VERSIONING:                                              │
│     • Start at "1.0"                                         │
│     • Minor version for backward-compatible additions        │
│     • Major version for breaking changes                     │
│     • Consumers must handle unknown fields gracefully        │
└──────────────────────────────────────────────────────────────┘
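
The envelope and versioning rules can be enforced at the consumer boundary. A minimal sketch under stated assumptions: `validateEnvelope` and `canConsume` are hypothetical helpers, all envelope fields are treated as non-empty strings, and the timestamp check relies on `Date.parse`, which is more lenient than strict ISO 8601.

```javascript
const REQUIRED_ENVELOPE_FIELDS = [
  'id', 'type', 'version', 'timestamp', 'source', 'tenantId', 'correlationId',
];

// Returns a list of problems; an empty list means the envelope is acceptable.
function validateEnvelope(event) {
  const errors = [];
  for (const field of REQUIRED_ENVELOPE_FIELDS) {
    if (typeof event[field] !== 'string' || event[field] === '') {
      errors.push(`missing or empty envelope field: ${field}`);
    }
  }
  if (typeof event.type === 'string' && !/^[a-z]+\.[a-z]+$/.test(event.type)) {
    errors.push('type must use dot notation: domain.action');
  }
  // Date.parse accepts more than ISO 8601; tighten this if strictness matters.
  if (typeof event.timestamp === 'string' && Number.isNaN(Date.parse(event.timestamp))) {
    errors.push('timestamp is not a parseable date');
  }
  return errors;
}

// A consumer written against major version N can read any N.x event,
// provided it ignores fields it does not recognise.
function canConsume(supportedMajor, eventVersion) {
  const major = Number.parseInt(String(eventVersion).split('.')[0], 10);
  return major === supportedMajor;
}
```

An event that fails `validateEnvelope` or `canConsume` is a permanent error, so retrying it would not help.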

Event Schema Design Template

For each event, document:

┌──────────────────────────────────────────────────────────────┐
│  EVENT: enrollment.completed                                 │
│  Version: 1.0                                                │
│  Producer: enrollment-service                                │
│  Description: Fired when a user completes all content in a   │
│               course and meets the passing criteria.          │
│                                                              │
│  Consumers:                                                  │
│  ├── notification-service → Send completion email            │
│  ├── certificate-service  → Generate certificate             │
│  ├── analytics-service    → Update completion metrics        │
│  ├── gamification-service → Award completion coins           │
│  └── activity-log         → Record activity entry            │
│                                                              │
│  Data Schema:                                                │
│  ┌────────────────┬──────────┬──────────────────────────────┐│
│  │ Field          │ Type     │ Description                  ││
│  ├────────────────┼──────────┼──────────────────────────────┤│
│  │ enrollmentId   │ UUID     │ The enrollment record        ││
│  │ userId         │ UUID     │ The learner                  ││
│  │ courseId       │ UUID     │ The completed course         ││
│  │ courseTitle    │ string   │ For display without lookup   ││
│  │ completedAt    │ ISO date │ When completion occurred     ││
│  │ score          │ number   │ Final score (0-100)          ││
│  │ totalDuration  │ number   │ Time spent (minutes)         ││
│  └────────────────┴──────────┴──────────────────────────────┘│
└──────────────────────────────────────────────────────────────┘

Step 3: Design Delivery and Guarantees

Delivery Guarantee Decision Tree

What guarantee does this event need?

AT-MOST-ONCE (fire and forget):
├── Use for: Logging, analytics, non-critical notifications
├── Implementation: Publish to topic, no ACK required
└── Risk: Events may be lost

AT-LEAST-ONCE (recommended default):
├── Use for: Most business events (enrollments, payments, etc.)
├── Implementation: Publish + consumer ACK + retry on failure
├── Risk: Events may be delivered multiple times
└── Mitigation: IDEMPOTENT consumers (Step 4)

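EXACTLY-ONCE PROCESSING (most complex):
├── Use for: Financial transactions, billing events
├── Implementation: Transactional outbox pattern + idempotent consumer
│   (delivery stays at-least-once; deduplication makes the effect happen once)
├── Risk: Complex, slower
└── When: Only when duplicates cause real harm (money)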

Architecture Patterns

PATTERN 1: IN-PROCESS EVENT BUS (simple, single service)
┌──────────────────────────────────────────────────┐
│  Application                                     │
│  ┌──────────┐   emit()   ┌──────────────────┐   │
│  │ Producer │ ──────────▶│ Event Bus        │   │
│  │          │            │ (EventEmitter)   │   │
│  └──────────┘            └────┬────┬────┬───┘   │
│                               │    │    │        │
│                          ┌────▼┐ ┌─▼──┐ ┌▼────┐  │
│                          │ H1  │ │ H2 │ │ H3  │  │
│                          └─────┘ └────┘ └─────┘  │
└──────────────────────────────────────────────────┘
Best for: Monolith, single-service apps
Pros: Simple, no external dependencies
Cons: Lost on crash, no persistence
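
Pattern 1 can be sketched with Node's built-in `EventEmitter`. This is an illustration only: the handlers just append to a `log` array, and `publishCourse` is a hypothetical producer standing in for the publish endpoint from Step 1.

```javascript
const { EventEmitter } = require('events');

const bus = new EventEmitter();
const log = [];

// Each handler is independent; the producer does not know they exist.
bus.on('course.published', (event) => log.push(`email:${event.data.courseId}`));
bus.on('course.published', (event) => log.push(`search-index:${event.data.courseId}`));
bus.on('course.published', (event) => log.push(`activity:${event.data.courseId}`));

// Producer: change state first, then announce the fact.
function publishCourse(courseId) {
  // ... persist the status change here ...
  bus.emit('course.published', { type: 'course.published', data: { courseId } });
}

publishCourse('crs_1');
```

`emit` calls listeners synchronously in registration order, so an exception thrown by one handler reaches the producer and stops the remaining handlers. Wrapping each handler in its own try/catch, or deferring the work, avoids that coupling.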

PATTERN 2: MESSAGE QUEUE (reliable, service-to-service)
┌──────────┐     ┌──────────────┐     ┌──────────┐
│ Producer │────▶│ Queue        │────▶│ Consumer │
│          │     │ (Redis/SQS/  │     │          │
│          │     │  Bull/BullMQ)│     │          │
└──────────┘     └──────────────┘     └──────────┘
Best for: Background jobs, reliable delivery
Pros: Persistence, retry, dead letter queue
Cons: Single consumer per message

PATTERN 3: PUB/SUB (fan-out, multiple consumers)
┌──────────┐     ┌──────────────┐     ┌──────────┐
│ Producer │────▶│ Topic        │──┬─▶│ Consumer1│
│          │     │ (SNS/Redis   │  │  └──────────┘
│          │     │  Pub/Sub)    │  │  ┌──────────┐
└──────────┘     └──────────────┘  ├─▶│ Consumer2│
                                   │  └──────────┘
                                   │  ┌──────────┐
                                   └─▶│ Consumer3│
                                      └──────────┘
Best for: Notifications, analytics, multi-service events
Pros: Multiple consumers, decoupled
Cons: No built-in ordering guarantee

PATTERN 4: TRANSACTIONAL OUTBOX (strongest guarantees)
┌──────────┐     ┌──────────────┐     ┌──────────┐
│ Producer │     │ Database     │     │ Outbox   │
│          │────▶│ ┌──────────┐ │────▶│ Poller   │────▶ Queue
│          │     │ │ Data     │ │     │          │
│          │     │ │ + Outbox │ │     └──────────┘
│          │     │ └──────────┘ │
└──────────┘     └──────────────┘
Best for: Financial events, strong consistency
Pros: Event guaranteed if DB write succeeds
Cons: Polling latency, more complex
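
Pattern 4 can be sketched as follows. The database here is an in-memory stand-in (`FakeDb` is hypothetical); a real implementation would write the business row and the outbox row inside one SQL transaction. `completeEnrollment` and `relayOutbox` are likewise illustrative names.

```javascript
// In-memory stand-in for a relational database (an assumption for the sketch).
class FakeDb {
  constructor() {
    this.enrollments = [];
    this.outbox = [];
  }
  // Runs fn against a copy and commits only if fn does not throw.
  transaction(fn) {
    const draft = { enrollments: [...this.enrollments], outbox: [...this.outbox] };
    fn(draft);
    this.enrollments = draft.enrollments;
    this.outbox = draft.outbox;
  }
}

// Business write and event write succeed or fail together.
function completeEnrollment(db, enrollmentId) {
  db.transaction((tx) => {
    tx.enrollments.push({ id: enrollmentId, status: 'completed' });
    tx.outbox.push({
      id: `evt_${enrollmentId}`,
      type: 'enrollment.completed',
      data: { enrollmentId },
      publishedAt: null,
    });
  });
}

// Poller: publish unpublished rows, then mark them. A crash between the two
// steps republishes the row on the next run, so delivery is at-least-once.
function relayOutbox(db, publish) {
  for (const row of db.outbox) {
    if (row.publishedAt !== null) continue;
    publish({ id: row.id, type: row.type, data: row.data });
    row.publishedAt = new Date().toISOString();
  }
}
```

Because the poller can publish a row twice after a crash, consumers still need the idempotency check from Step 4.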

Step 4: Idempotency and Ordering

Idempotency Pattern

Consumer receives event
│
├── 1. Check: Have we processed this event ID before?
│   ├── YES → Skip (return ACK, do nothing)
│   └── NO  → Continue
│
├── 2. Process the event (business logic)
│
├── 3. Record the event ID as processed
│   (in same transaction as business logic if possible)
│
└── 4. ACK the message (remove from queue)
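
The four steps above can be sketched as a wrapper around any handler. This is a minimal illustration, not a production design: `createIdempotentConsumer` is a hypothetical name and the processed-ID store is an in-memory `Set`, so it does not survive a restart.

```javascript
// In-memory stand-in for the processed-events store (an assumption).
// In production the "seen" check and the business write should share one
// database transaction so a crash cannot separate them.
function createIdempotentConsumer(consumerName, handler) {
  const processed = new Set();

  return function consume(event) {
    const key = `${consumerName}:${event.id}`;
    if (processed.has(key)) {
      return 'skipped';      // step 1: duplicate, ACK and do nothing
    }
    handler(event);          // step 2: business logic (throws on failure)
    processed.add(key);      // step 3: record only after success
    return 'processed';      // step 4: the caller ACKs the message
  };
}
```

If `handler` throws, the ID is not recorded and the event can be retried. Keying on consumer name plus event ID lets several consumers process the same event independently.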

Idempotency Implementation

┌──────────────────────────────────────────────────────────────┐
│  IDEMPOTENCY TABLE                                           │
│                                                              │
│  processed_events                                            │
│  ├── event_id        VARCHAR(100)     — event.id             │
│  ├── event_type      VARCHAR(100)     — event.type           │
│  ├── consumer        VARCHAR(100)     — which consumer       │
│  ├── processed_at    TIMESTAMPTZ      — when processed       │
│  └── result          JSONB            — processing result    │
│                                                              │
│  PRIMARY KEY (event_id, consumer)                            │
│                                                              │
│  CLEANUP: Delete records older than 30 days                  │
│  (events are unlikely to be redelivered after 30 days)       │
└──────────────────────────────────────────────────────────────┘

Ordering Guarantees

Do your events need ordering?

UNORDERED (most events):
├── Each event is independent
├── Processing order does not matter
├── Example: notification.sent, analytics.tracked
└── Implementation: Standard queue/topic

ORDERED PER ENTITY (common):
├── Events for the SAME entity must be in order
├── Events for DIFFERENT entities can be parallel
├── Example: enrollment.created before enrollment.completed
│            (for the same enrollment)
└── Implementation: Partition by entity ID
    → Same entity always goes to same partition/consumer

GLOBALLY ORDERED (rare, avoid if possible):
├── ALL events must be processed in exact order
├── Kills parallelism — single consumer only
├── Example: Financial ledger entries
└── Implementation: Single partition, single consumer
    → This is a scaling bottleneck — avoid unless required
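
Per-entity ordering depends on a stable mapping from entity ID to partition. A minimal sketch, assuming a fixed partition count; `partitionFor` is a hypothetical helper, and SHA-256 is used only because it ships with Node, not because a cryptographic hash is required.

```javascript
const { createHash } = require('crypto');

// Maps an entity ID to a stable partition number in [0, partitionCount).
function partitionFor(entityId, partitionCount) {
  const digest = createHash('sha256').update(String(entityId)).digest();
  // The first 4 bytes as an unsigned 32-bit integer are plenty for bucketing.
  return digest.readUInt32BE(0) % partitionCount;
}
```

Every event for `enr_123` lands on the same partition, so one consumer sees them in publish order. Changing `partitionCount` remaps most IDs, so resizing needs a drain or migration step.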

Step 5: Error Handling

Event Processing Error Strategy

Consumer encounters error processing event
│
├── TRANSIENT ERROR (timeout, connection lost, 503)
│   ├── Retry immediately (attempt 2)
│   ├── Retry with backoff (attempt 3: 5s, attempt 4: 30s, later attempts: longer)
│   ├── Max retries: 5 (attempts 2-6)
│   └── After max retries → move to Dead Letter Queue (DLQ)
│
├── PERMANENT ERROR (bad data, missing entity, validation fail)
│   ├── Do NOT retry (will fail every time)
│   ├── Move directly to Dead Letter Queue
│   └── Alert engineering team
│
└── UNKNOWN ERROR (unexpected exception)
    ├── Retry once (might be transient)
    ├── If fails again → move to DLQ
    └── Alert engineering team
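
The decision tree above can be written as a pure function that a consumer calls after each failure. A sketch under stated assumptions: errors carry a hypothetical `kind` field set by the code that threw them, `nextStep` is an illustrative name, and the delays after 30s (doubling each time) are an assumption the source does not specify.

```javascript
const MAX_RETRIES = 5;

// Assumption: errors are tagged with kind = 'transient' | 'permanent'.
// Anything untagged is treated as unknown.
function classify(error) {
  if (error.kind === 'transient' || error.kind === 'permanent') return error.kind;
  return 'unknown';
}

// retriesSoFar: how many retries have already been made for this event.
function nextStep(error, retriesSoFar) {
  const kind = classify(error);
  if (kind === 'permanent') return { action: 'dlq', alert: true };
  if (kind === 'unknown') {
    return retriesSoFar < 1
      ? { action: 'retry', delayMs: 0 }
      : { action: 'dlq', alert: true };
  }
  // Transient: immediate retry, then 5s, 30s, then doubling (assumed).
  if (retriesSoFar >= MAX_RETRIES) return { action: 'dlq', alert: false };
  if (retriesSoFar === 0) return { action: 'retry', delayMs: 0 };
  const delays = [5000, 30000];
  const delayMs = delays[retriesSoFar - 1] ?? 30000 * 2 ** (retriesSoFar - 2);
  return { action: 'retry', delayMs };
}
```

Keeping the policy free of timers and broker calls makes it easy to unit test. The `alert` flag covers only the explicit "alert engineering team" branches; DLQ depth alerts would still fire for exhausted transient errors.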

Dead Letter Queue (DLQ) Management

┌──────────────────────────────────────────────────────────────┐
│  DEAD LETTER QUEUE OPERATIONS                                │
│                                                              │
│  DLQ: events that exhausted retries or failed permanently.   │
│                                                              │
│  Dashboard should show:                                      │
│  • Event count in DLQ (alert if > 0)                        │
│  • Event type breakdown                                      │
│  • Error message for each                                    │
│  • Original event payload                                    │
│  • Timestamp of first failure                                │
│                                                              │
│  Admin actions:                                              │
│  • INSPECT — View event payload and error                    │
│  • RETRY — Reprocess the event (after fix deployed)          │
│  • DISCARD — Remove from DLQ (event is obsolete)             │
│  • BULK RETRY — Reprocess all events of a type               │
│                                                              │
│  Alert rules:                                                │
│  • Any event in DLQ → Warning (Slack notification)           │
│  • DLQ count > 10 → Critical (page on-call)                  │
│  • Same event type failing repeatedly → Critical             │
└──────────────────────────────────────────────────────────────┘

Circuit Breaker for Event Consumers

CLOSED (normal operation):
│ Processing events normally
│
├── Error rate > 50% over the last 10 events
│   → Switch to OPEN
│
OPEN (circuit broken):
│ Stop processing events (they queue up)
│ Wait 30 seconds
│
├── After 30s → Switch to HALF-OPEN
│
HALF-OPEN (testing):
│ Process ONE event
│
├── Success → Switch to CLOSED (resume normal)
└── Failure → Switch back to OPEN (wait again)
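
The three states can be sketched as a small class. This is an illustration with the thresholds from the diagram as defaults; `ConsumerCircuitBreaker` is a hypothetical name, the clock is injectable for testing, and the sketch assumes a single-threaded consumer that calls `record` before asking `canProcess` again.

```javascript
class ConsumerCircuitBreaker {
  constructor({ windowSize = 10, errorThreshold = 0.5, openMs = 30000, now = Date.now } = {}) {
    this.windowSize = windowSize;
    this.errorThreshold = errorThreshold;
    this.openMs = openMs;
    this.now = now;          // injectable clock, keeps the sketch testable
    this.state = 'CLOSED';
    this.results = [];       // recent outcomes, true = success
    this.openedAt = 0;
  }

  // Should the consumer pull the next event?
  canProcess() {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.openMs) {
      this.state = 'HALF_OPEN';
    }
    return this.state !== 'OPEN';
  }

  // Report the outcome of the event that was just processed.
  record(success) {
    if (this.state === 'HALF_OPEN') {
      if (success) {
        this.state = 'CLOSED';
        this.results = [];
      } else {
        this.trip();
      }
      return;
    }
    this.results.push(success);
    if (this.results.length > this.windowSize) this.results.shift();
    const failures = this.results.filter((ok) => !ok).length;
    const windowFull = this.results.length === this.windowSize;
    if (windowFull && failures / this.windowSize > this.errorThreshold) {
      this.trip();
    }
  }

  trip() {
    this.state = 'OPEN';
    this.openedAt = this.now();
    this.results = [];
  }
}
```

While the breaker is open the consumer simply stops pulling, so events wait in the queue instead of failing into the DLQ.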

Step 6: Testing and Monitoring

Event Testing Checklist

Unit tests for event schema validation
Unit tests for event producers (correct event shape emitted)
Unit tests for event consumers (correct business logic)
Idempotency test: process same event twice, verify single effect
Ordering test: process events out of order, verify correct result
Error handling test: verify retry and DLQ behavior
Integration test: produce event → verify consumer processed it
Load test: verify throughput under expected volume

Event Monitoring Dashboard

┌──────────────────────────────────────────────────────────────┐
│  EVENT SYSTEM MONITORING                                     │
│                                                              │
│  Real-time Metrics:                                          │
│  • Events published per minute (by type)                     │
│  • Events consumed per minute (by consumer)                  │
│  • Processing latency (p50, p95, p99)                        │
│  • Consumer lag (how far behind real-time)                    │
│  • Error rate (% of events failing)                          │
│  • DLQ depth (should be 0 in normal operation)               │
│                                                              │
│  Alerts:                                                     │
│  • Consumer lag > 5 minutes → Warning                        │
│  • Consumer lag > 30 minutes → Critical                      │
│  • Error rate > 5% → Warning                                 │
│  • Error rate > 20% → Critical                               │
│  • DLQ depth > 0 → Warning                                   │
│  • DLQ depth > 10 → Critical                                 │
│  • No events published for 30 min (if expected) → Warning    │
└──────────────────────────────────────────────────────────────┘
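
The alert thresholds above translate directly into a rule function. A minimal sketch: `evaluateAlerts` and the metric names are hypothetical, and the "no events published" rule is left out because it depends on which event types are expected.

```javascript
// Thresholds are copied from the alert list above; metric names are assumptions.
function evaluateAlerts({ consumerLagMinutes, errorRatePercent, dlqDepth }) {
  const alerts = [];

  if (consumerLagMinutes > 30) alerts.push({ metric: 'consumerLag', severity: 'critical' });
  else if (consumerLagMinutes > 5) alerts.push({ metric: 'consumerLag', severity: 'warning' });

  if (errorRatePercent > 20) alerts.push({ metric: 'errorRate', severity: 'critical' });
  else if (errorRatePercent > 5) alerts.push({ metric: 'errorRate', severity: 'warning' });

  if (dlqDepth > 10) alerts.push({ metric: 'dlqDepth', severity: 'critical' });
  else if (dlqDepth > 0) alerts.push({ metric: 'dlqDepth', severity: 'warning' });

  return alerts;
}
```

For example, a snapshot of `{ consumerLagMinutes: 7, errorRatePercent: 1, dlqDepth: 0 }` yields a single consumer-lag warning.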

Anti-Patterns — Never Do These

Anti-Pattern	Why It Fails	Do Instead
Events as commands ("send.email")	Tight coupling, defeats purpose	Publish the fact that happened ("course.published") and let consumers decide to send email
Huge event payloads (entire entity)	Bandwidth waste, schema coupling	Include IDs, changed fields, and the few display fields consumers need
No event ID	Cannot deduplicate	Always include unique event ID
No schema versioning	Breaking changes break consumers	Version from day one
Synchronous event processing	Defeats purpose of events	Always process async
No dead letter queue	Failed events disappear silently	Always configure DLQ
Consumer modifies event payload	Breaks other consumers	Events are immutable facts
Relying on event ordering globally	Cannot scale, single consumer	Partition by entity for ordering

Quality Checklist — Before Shipping Event System

┌──────────────────────────────────────────────────────────────┐
│  EVENT DESIGN REVIEW CHECKLIST                               │
│                                                              │
│  □ Event catalog documented (all event types listed)         │
│  □ Event schemas defined with field-level documentation      │
│  □ Schema versioning strategy in place                       │
│  □ Delivery guarantee chosen per event type                  │
│  □ All consumers are idempotent                              │
│  □ Ordering requirements identified and handled              │
│  □ Error handling: retry + DLQ for every consumer            │
│  □ Circuit breaker on consumers calling external services    │
│  □ Events do not contain sensitive data                      │
│  □ correlationId flows through for distributed tracing       │
│  □ Monitoring dashboard with lag, error rate, DLQ depth      │
│  □ Alerts configured for consumer lag and failures           │
│  □ Tests cover: schema, idempotency, ordering, errors        │
│  □ Event retention/cleanup policy defined                    │
└──────────────────────────────────────────────────────────────┘
