Use when designing real-time data pipelines, choosing stream processing frameworks (Kafka, Flink, Spark), or implementing event-driven architectures that require low latency. Covers event time vs processing time, windowing, state management, and exactly-once delivery guarantees.
Patterns and technologies for real-time data processing, event streaming, and stream analytics.
| Aspect | Batch | Streaming |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Data | Bounded (finite) | Unbounded (infinite) |
| Processing | Process all at once | Process as it arrives |
| State | Recompute each run | Maintain continuously |
| Complexity | Lower | Higher |
| Cost | Often lower | Often higher |
Use streaming when:
- Real-time responses required (<1 minute)
- Events need immediate action (fraud, alerts)
- Data arrives continuously
- Users expect live updates
- Time-sensitive business decisions
Use batch when:
- Daily/hourly reports sufficient
- Complex transformations needed
- Cost optimization priority
- Historical analysis
- One-time processing
Event Time: When event actually occurred
Processing Time: When event is processed
Example:
┌─────────────────────────────────────────────────────────┐
│ Event: Purchase at 10:00:00 (event time) │
│ Network delay: 5 seconds │
│ Processing: 10:00:05 (processing time) │
└─────────────────────────────────────────────────────────┘
Why it matters:
- Late events need handling
- Ordering not guaranteed
- Watermarks track progress
Watermark = "All events before this time have arrived"
Event stream:
──[10:01]──[10:02]──[10:00]──[10:03]──[Watermark: 10:00]──
Allows system to:
- Know when window is complete
- Handle late events
- Balance latency vs completeness
Tumbling Window (fixed, non-overlapping):
|─────|─────|─────|
0 5 10 15 (seconds)
Sliding Window (fixed, overlapping):
|─────|
|─────|
|─────|
Size: 5s, Slide: 2s
Session Window (activity-based):
|──────| |───────────| |───|
User activity with gaps defines windows
Count Window:
Process every N events
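The window types above can be sketched as assignment logic; a minimal illustration (class and method names are hypothetical — real frameworks like Flink do this internally):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative window-assignment helpers, not a framework API.
class WindowAssigner {

    // Tumbling: each timestamp belongs to exactly one fixed window.
    static long tumblingWindowStart(long timestampMs, long sizeMs) {
        return timestampMs - (timestampMs % sizeMs);
    }

    // Sliding: a timestamp belongs to several overlapping windows.
    static List<Long> slidingWindowStarts(long timestampMs, long sizeMs, long slideMs) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMs - (timestampMs % slideMs);
        for (long start = lastStart; start > timestampMs - sizeMs; start -= slideMs) {
            starts.add(start);
        }
        return starts;
    }
}
```

With size 5s and slide 2s, an event at t=4.5s falls into three windows, matching the overlapping diagram above.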
Stateful operations require state that is maintained across events:
- Aggregations (sum, count, avg)
- Joins between streams
- Pattern detection
- Deduplication
State backends:
- In-memory (fast, limited)
- RocksDB (larger, persistent)
- External (Redis, database)
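A minimal sketch of keyed state for a running count, assuming a plain in-memory map as the state backend (a production system would use RocksDB or an external store for durability):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative keyed state: read state for the key, update, write back.
class RunningCount {
    private final Map<String, Long> state = new HashMap<>(); // in-memory state backend

    // Process one event for the given key; returns the updated count.
    long process(String key) {
        long updated = state.getOrDefault(key, 0L) + 1;
        state.put(key, updated);
        return updated;
    }
}
```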
Kafka Streams
Characteristics:
- Library (not a cluster)
- Exactly-once semantics
- Kafka-native
- Java/Scala
Best for:
- Kafka-centric architectures
- Simpler transformations
- Microservices
Example topology:
source → filter → map → aggregate → sink
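The topology above can be simulated with plain Java streams; this is an analogy, not the Kafka Streams API (there the stages would be `stream()` → `filter()` → `mapValues()` → `groupByKey().count()` → `to()`):

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java analogy for the source → filter → map → aggregate → sink topology.
class Topology {
    record Event(String userId, long amount) {}

    static Map<String, Long> run(List<Event> source) {
        return source.stream()
                .filter(e -> e.amount() > 1000)          // filter: high-value events only
                .map(e -> e.userId().toUpperCase())      // map: extract/transform the key
                .collect(Collectors.groupingBy(          // aggregate: count per user
                        u -> u, Collectors.counting())); // the returned map stands in for the sink
    }
}
```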
Apache Flink
Characteristics:
- Distributed cluster
- True streaming (not micro-batch)
- Advanced state management
- SQL support
Best for:
- Complex event processing
- Large-scale streaming
- Low-latency requirements
Example (Flink DataStream API; assumes env, kafkaSource, and sink are already configured):
DataStream<Event> events = env.addSource(kafkaSource);
events
    .keyBy(e -> e.getUserId())                             // partition stream by user
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))  // 5-minute event-time windows
    .aggregate(new CountAggregator())                      // count events per user per window
    .addSink(sink);
Spark Structured Streaming
Characteristics:
- Micro-batch processing
- Unified batch + streaming API
- Wide ecosystem
- Python, Scala, Java, R
Best for:
- Teams with Spark experience
- Batch + streaming unified
- Machine learning integration
Latency: Seconds (micro-batch)
| Factor | Kafka Streams | Flink | Spark Streaming |
|---|---|---|---|
| Deployment | Library | Cluster | Cluster |
| Latency | Low | Lowest | Medium |
| State | Good | Excellent | Good |
| Exactly-once | Yes | Yes | Yes |
| Complexity | Low | High | Medium |
| Scaling | With Kafka | Independent | Independent |
| SQL | Limited | Yes | Yes |
| ML integration | Limited | Limited | Excellent |
Filter:
Input: All events
Output: Events matching criteria
Example: Only process events where amount > 1000
Transform (Enrichment):
Input: Event type A
Output: Event type B
Example: Enrich order events with customer data
Aggregate:
Input: Multiple events
Output: Single aggregated result
Examples:
- Count events per window
- Sum amounts per user
- Average latency per endpoint
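A windowed aggregation combines window assignment with keyed state; a minimal sketch assuming tumbling windows and an in-memory count (names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative count of events per user per tumbling window.
class WindowedCount {
    private final Map<String, Long> counts = new HashMap<>();
    private final long windowSizeMs;

    WindowedCount(long windowSizeMs) { this.windowSizeMs = windowSizeMs; }

    // Add one event; returns the count for that user in that window so far.
    long add(String userId, long eventTimeMs) {
        long windowStart = eventTimeMs - (eventTimeMs % windowSizeMs);
        String key = userId + "@" + windowStart;   // composite key: user + window start
        return counts.merge(key, 1L, Long::sum);
    }
}
```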
Stream-Stream Join:
┌─────────────┐ ┌─────────────┐
│ Orders │ ──► │ Join │
└─────────────┘ │ (by order_id│
┌─────────────┐ │ in window) │
│ Shipments │ ──► │ │
└─────────────┘ └─────────────┘
Stream-Table Join (Enrichment):
┌─────────────┐ ┌─────────────┐
│ Events │ ──► │ Join │
└─────────────┘ │ (lookup by │
┌─────────────┐ │ customer) │
│ Customer │ ──► │ │
│ Table │ └─────────────┘
└─────────────┘
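The stream-stream join above can be sketched as buffering one side and matching the other within a time window (a simplification — a real join buffers both sides and evicts expired entries):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative windowed join of orders and shipments by order_id.
class StreamJoin {
    record Order(String orderId, long ts) {}
    record Shipment(String orderId, long ts) {}

    private final Map<String, Long> orderBuffer = new HashMap<>(); // orderId -> order timestamp
    private final long windowMs;

    StreamJoin(long windowMs) { this.windowMs = windowMs; }

    void onOrder(Order o) { orderBuffer.put(o.orderId(), o.ts()); }

    // Returns the matched orderId, or null if no order arrived within the join window.
    String onShipment(Shipment s) {
        Long orderTs = orderBuffer.get(s.orderId());
        if (orderTs != null && Math.abs(s.ts() - orderTs) <= windowMs) {
            return s.orderId();
        }
        return null;
    }
}
```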
Problem: Duplicate events from at-least-once delivery
Solution:
1. Track seen IDs in state (with TTL)
2. If seen, drop
3. If new, process and store ID
State: {event_id: timestamp}
TTL: Based on expected duplicate window
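The steps above can be sketched as follows (a minimal in-memory version; production state would live in a fault-tolerant backend):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative dedup: track seen event IDs with a TTL, drop duplicates.
class Deduplicator {
    private final Map<String, Long> seen = new HashMap<>(); // eventId -> first-seen time
    private final long ttlMs;

    Deduplicator(long ttlMs) { this.ttlMs = ttlMs; }

    // Returns true if the event is new and should be processed.
    boolean shouldProcess(String eventId, long nowMs) {
        expire(nowMs);
        if (seen.containsKey(eventId)) return false; // seen: drop
        seen.put(eventId, nowMs);                    // new: record ID and process
        return true;
    }

    // Evict IDs older than the expected duplicate window.
    private void expire(long nowMs) {
        seen.values().removeIf(ts -> nowMs - ts > ttlMs);
    }
}
```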
At-most-once: may lose events, never duplicates
Commit → Process → (if processing fails, event is lost)
Use when: loss acceptable, simplicity preferred
At-least-once: never loses events, may duplicate
Process → Commit → (if failure before commit, event is reprocessed)
Use when: no loss acceptable, handle duplicates downstream
Exactly-once: never loses, never duplicates
Requires:
- Idempotent operations, OR
- Transactional processing
How it works:
1. Read from source transactionally
2. Process and update state
3. Write output and commit together
Flink: Checkpointing + two-phase commit
Kafka Streams: Transactional producer + EOS
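In Kafka Streams, exactly-once is enabled through configuration rather than code; a minimal sketch (application name and broker address are placeholders):

```properties
application.id=payments-processor
bootstrap.servers=broker1:9092
# Enables transactional producers and read-committed consumption (Kafka 2.5+)
processing.guarantee=exactly_once_v2
```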
1. Drop late events
Simple, may lose data
2. Allow late events (allowed lateness)
Process if within lateness threshold
3. Side output late events
Main stream processes on-time
Side stream handles late separately
4. Reprocess historical
Batch job fixes late data impact
Bounded Out-of-Orderness:
watermark = max_event_time - max_lateness
Example:
max_event_time = 10:00:00
max_lateness = 5 seconds
watermark = 09:59:55
Events before 09:59:55 considered complete
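This watermark strategy can be sketched as tracking the maximum event time seen so far (a simplification of what Flink's bounded-out-of-orderness generator does):

```java
// Illustrative bounded-out-of-orderness watermark:
// watermark = max event time seen - allowed lateness.
class WatermarkGenerator {
    private final long maxLatenessMs;
    private long maxEventTimeMs = Long.MIN_VALUE;

    WatermarkGenerator(long maxLatenessMs) { this.maxLatenessMs = maxLatenessMs; }

    // Observe one event; returns the current watermark.
    long onEvent(long eventTimeMs) {
        // Out-of-order events never move the watermark backward.
        maxEventTimeMs = Math.max(maxEventTimeMs, eventTimeMs);
        return maxEventTimeMs - maxLatenessMs;
    }
}
```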
Partition by key for parallel processing:
┌─────────────────────────────────────────────────────┐
│ Kafka Topic (3 partitions) │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐│
│ │ Partition 0 │ │ Partition 1 │ │ Partition 2 ││
│ │ user_a, b │ │ user_c, d │ │ user_e, f ││
│ └─────────────┘ └─────────────┘ └─────────────────┘│
└─────────────────────────────────────────────────────┘
│ │ │
▼ ▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐
│Worker 0 │ │Worker 1 │ │Worker 2 │
└─────────┘ └─────────┘ └─────────┘
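Key-based routing can be sketched as hashing the key modulo the partition count (a simplification — Kafka's default partitioner actually uses murmur2, not `hashCode()`):

```java
// Illustrative key-to-partition routing: same key always lands
// on the same partition, so per-key order is preserved.
class Partitioner {
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```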
When downstream can't keep up:
1. Buffer (risk: OOM)
2. Drop (risk: data loss)
3. Backpressure (slow down source)
Flink: Backpressure propagates automatically
Kafka: Consumer lag indicates backpressure
Throughput:
- Events per second
- Bytes per second
Latency:
- Processing latency
- End-to-end latency
Health:
- Consumer lag
- Checkpoint duration
- Backpressure rate
- Error rate
Lag = Latest offset - Consumer offset
High lag indicates:
- Processing too slow
- Need more parallelism
- Downstream bottleneck
Monitor: Set alerting thresholds
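Lag computation and threshold alerting can be sketched as follows (in practice the offsets come from Kafka's admin or metrics APIs, and the threshold is an assumption to tune):

```java
// Illustrative consumer-lag check: lag = latest offset - committed offset.
class LagMonitor {
    static long lag(long latestOffset, long committedOffset) {
        return latestOffset - committedOffset;
    }

    // Fire an alert when lag exceeds the configured threshold.
    static boolean shouldAlert(long latestOffset, long committedOffset, long threshold) {
        return lag(latestOffset, committedOffset) > threshold;
    }
}
```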
1. Design for exactly-once when needed
2. Handle late events explicitly
3. Use event time, not processing time
4. Monitor consumer lag closely
5. Plan for state recovery
6. Test with realistic data volumes
7. Implement backpressure handling
8. Keep processing idempotent when possible
Related skills:
message-queues - Messaging patterns
data-architecture - Data platform design
etl-elt-patterns - Data pipeline patterns