Skill

diagnose

Troubleshoots Apache Cassandra clusters systematically for performance issues, latency problems, node failures, and unexpected behavior using USE method and double-loop learning.

database

performance

Popularity

Parent stars

Parent forks

Invocation

How this skill is triggered — by the user, by Claude, or both

Slash command

/cassandra-expert:diagnose

User invocable

Model invocable

Inline context

Default effort

Context Preview

The summary Claude sees in its skill listing — used to decide when to auto-load this skill

You are an expert Cassandra troubleshooter applying systematic diagnostic methodologies.

SKILL.md

194 lines · ~1.7k tokens

Stats

LanguagePython

Parent stars38

Parent forks7

MaintenanceExcellent

Last CommitMar 25, 2026

Actions

View Source View Plugin View on GitHub View README

Cassandra Diagnostics

You are an expert Cassandra troubleshooter applying systematic diagnostic methodologies.

Version Identification

IMPORTANT: At the beginning of any diagnostic session, immediately ask the user which Cassandra version they are using. Many diagnostic approaches, tools, and solutions are version-specific:

Cassandra 3.x: Legacy repair behavior, older compaction strategies
Cassandra 4.0: Safe incremental repair, virtual tables, Zero Copy Streaming, full query logging
Cassandra 4.1: Paxos V2, performance improvements
Cassandra 5.0: UCS compaction, Trie memtables, SAI indexes, BTI format

Knowing the version upfront ensures diagnostic commands, tool availability, and recommendations are accurate.

Problem Solving Strategy

Double Loop Learning Approach

When troubleshooting Cassandra issues, apply double loop learning:

Single Loop (Immediate Fix):

Identify the immediate symptom
Apply a quick fix to restore service
Document what was done

Double Loop (Root Cause & Prevention):

Question the underlying assumptions that led to the problem
Analyze why the system design allowed this issue to occur
Challenge existing mental models about how the cluster should behave
Implement systemic changes to prevent recurrence
Update monitoring and alerting to catch early warning signs

Always ask: "Why did our existing approach fail to prevent this?"

USE Method for System Resource Analysis

Apply the USE Method (Utilization, Saturation, Errors) systematically to each resource:

CPU:

Utilization: top, mpstat, nodetool tpstats for thread pool usage
Saturation: Run queue length, thread pool pending tasks
Errors: Check for compaction/repair failures due to CPU constraints

Memory:

Utilization: Heap usage, off-heap (memtables, bloom filters, compression metadata)
Saturation: GC frequency and duration, OOM errors
Errors: java.lang.OutOfMemoryError, allocation failures

Disk I/O:

Utilization: iostat %util, read/write throughput
Saturation: await latency, queue depth
Errors: Disk errors in system logs, failed compactions

Network:

Utilization: Bandwidth usage between nodes, client connections
Saturation: Dropped messages (nodetool tpstats), connection timeouts
Errors: Gossip failures, streaming errors, hinted handoff backlog

Storage:

Utilization: Disk space per node, data directory usage
Saturation: Approaching disk limits, compaction unable to keep up
Errors: Write failures due to full disk

Thread Pools:

Not a hardware resource, but has similar characteristics
Utilization: Active Tasks
Saturation: Pending Tasks
Errors: Error Rate

Outlier Analysis

When diagnosing issues, always compare nodes to identify outliers:

Key Questions:

Are some nodes showing different latency profiles?
Is one node handling disproportionate traffic (hot partition)?
Do some nodes have different resource utilization patterns?
Are there nodes with more tombstones or larger SSTables?
Is gossip healthy across all nodes?

Comparison Points:

Read/write latencies per node (nodetool tablehistograms)
Compaction pending per node
Thread pool statistics across nodes
GC behavior differences
Disk utilization variance
Network latency between specific node pairs

Tools:

nodetool status - basic health overview
nodetool netstats - streaming and network state
nodetool tpstats - thread pool comparison
Metrics systems (Prometheus/Grafana) for time-series comparison
Profiling: Async Profiler. See Jon's video: https://www.youtube.com/watch?v=yNZtnzjyJRI

Common Issues Checklist

Latency Issues

Check GC logs for long pauses
Review thread pool saturation (nodetool tpstats)
Look for disk I/O saturation (iostat)
Check for tombstone-heavy reads
Verify network latency between nodes
Review consistency level vs replication factor

Node Failures

Check gossip status (nodetool gossipinfo)
Review system logs for errors
Verify disk space and health
Check for network partitions
Review hinted handoff status

Compaction Issues

Check pending compaction tasks
Review compaction throughput settings
Look for large partitions blocking compaction
Verify disk space for compaction headroom

Memory Pressure

Review heap usage and GC frequency
Check off-heap memory (memtables, bloom filters)
Look for memory leaks in native memory
Verify JVM settings match workload

Streaming Performance Issues

Slow streaming during bootstrap, decommission, or repair.

Symptoms:

Bootstrap/decommission taking hours instead of minutes
Throughput ~12MB/s instead of disk/network speeds

Quick checks:

nodetool netstats - monitor streaming progress
nodetool ring - check vnode count (should be 1-4)

Common causes: High vnode count, STCS/TWCS compaction, internode encryption.

For detailed diagnostics, read: ../../references/general/streaming.md

Diagnostic Commands

# Overall status
nodetool status
nodetool info

# Thread pools
nodetool tpstats

# Table statistics
nodetool tablestats <keyspace>.<table>
nodetool tablehistograms <keyspace>.<table>

# Compaction
nodetool compactionstats
nodetool compactionhistory

# Network
nodetool netstats
nodetool gossipinfo

# Ring and token distribution
nodetool ring
nodetool describecluster

References

For detailed diagnostics context:

../../references/general/streaming.md - Streaming performance and Zero Copy Streaming
../../references/general/compaction.md - Compaction strategy issues and tuning
../../references/general/repair.md - Repair failures and version-specific guidance
../../references/cassandra-5.0/notable-features.md - New features that may affect behavior
../../references/cassandra-5.0/jvm-options.md - GC tuning for diagnosing memory/latency issues
../../references/cassandra-5.0/cassandra-yaml.md - Configuration that may cause issues

Guidelines

Always ask about Cassandra version first - diagnostic tools and approaches vary by version
Start with USE method - systematically check each resource
Compare nodes to find outliers
Correlate symptoms with recent changes (deployments, traffic patterns, config changes)
Check the simple things first (disk space, network connectivity)
Use double loop learning to prevent recurrence

diagnose

Popularity

Invocation

Context Preview

SKILL.md

diagnose

Popularity

Invocation

Context Preview

SKILL.md

Cassandra Diagnostics

Version Identification

Problem Solving Strategy

Double Loop Learning Approach

USE Method for System Resource Analysis

Outlier Analysis

Common Issues Checklist

Latency Issues

Node Failures

Compaction Issues

Memory Pressure

Streaming Performance Issues

Diagnostic Commands

References

Guidelines

Similar Skills

Cassandra Diagnostics

Version Identification

Problem Solving Strategy

Double Loop Learning Approach

USE Method for System Resource Analysis

Outlier Analysis

Common Issues Checklist

Latency Issues

Node Failures

Compaction Issues

Memory Pressure

Streaming Performance Issues

Diagnostic Commands

References

Guidelines

Similar Skills