From clickhouse-pack
Provides ClickHouse production checklist: schema design, server tuning, S3 backups, monitoring, deployment verification. For go-live reviews and audits.
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin clickhouse-pack
Sets up ClickHouse monitoring with system table queries for query performance/merge health/resource usage, Prometheus metrics, Grafana dashboards, and alerting.
Reviews ClickHouse schemas, queries, configurations against 31 rules for schema design, query optimization, data ingestion, and agent connectivity. Cite rules in recommendations.
Designs and optimizes ClickHouse schemas: ORDER BY keys, compression codecs, MergeTree partitioning, projections, indexes for v24.4+ Cloud and self-hosted.
Comprehensive go-live checklist for ClickHouse covering server tuning, schema design, backup configuration, monitoring, and operational readiness.
Schema design checklist:
- MergeTree family engines (not Memory, Log, or TinyLog)
- ORDER BY columns match primary filter/group patterns
- PARTITION BY is coarse (monthly or weekly, never by ID)
- TTL configured for data retention policy
- LowCardinality(String) used for low-cardinality columns
- CODEC(ZSTD) applied to large String/JSON columns
- FINAL or dedup logic if upserts are needed

<!-- Key production settings in config.xml / users.xml -->
<!-- Memory: set to ~80% of available RAM -->
<max_server_memory_usage_to_ram_ratio>0.8</max_server_memory_usage_to_ram_ratio>
<!-- Query limits -->
<max_concurrent_queries>150</max_concurrent_queries>
<max_memory_usage>10000000000</max_memory_usage> <!-- 10GB per query -->
<max_execution_time>300</max_execution_time> <!-- 5 min timeout -->
<!-- Merge settings -->
<background_pool_size>16</background_pool_size>
<background_schedule_pool_size>16</background_schedule_pool_size>
<!-- Logging -->
<query_log>
<database>system</database>
<table>query_log</table>
<flush_interval_milliseconds>7500</flush_interval_milliseconds>
</query_log>
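As an illustration, the schema-design checklist above might map onto a DDL like the following sketch; the table and column names are hypothetical, not part of the pack:

```sql
-- Hypothetical events table applying the schema checklist
CREATE TABLE analytics.events
(
    event_time  DateTime,
    event_type  LowCardinality(String),  -- low-cardinality dimension
    user_id     UInt64,
    payload     String CODEC(ZSTD(3))    -- compress large String/JSON blobs
)
ENGINE = MergeTree                       -- MergeTree family, not Memory/Log
PARTITION BY toYYYYMM(event_time)        -- coarse monthly partitions, never by ID
ORDER BY (event_type, event_time)        -- matches primary filter/group patterns
TTL event_time + INTERVAL 90 DAY;        -- retention policy
```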
-- ClickHouse native BACKUP to S3
BACKUP TABLE analytics.events
TO S3(
'https://my-bucket.s3.us-east-1.amazonaws.com/backups/events',
'ACCESS_KEY',
'SECRET_KEY'
)
SETTINGS compression_method = 'zstd';
-- Incremental backup (base + delta)
BACKUP TABLE analytics.events
TO S3('s3://my-bucket/backups/events-incremental')
SETTINGS base_backup = S3('s3://my-bucket/backups/events-base');
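A backup is only verified once it restores. A restore against the same bucket might look like this sketch (the `events_restored` target name is hypothetical; credentials elided as above):

```sql
-- Restore into a separate table to verify the backup without touching production
RESTORE TABLE analytics.events AS analytics.events_restored
FROM S3(
    'https://my-bucket.s3.us-east-1.amazonaws.com/backups/events',
    'ACCESS_KEY',
    'SECRET_KEY'
);
```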
ClickHouse Cloud: Backups are automatic. Configure retention and frequency in the Cloud console under Service Settings.
-- Key metrics to monitor
-- 1. Query latency (p50/p95/p99)
-- 2. Active parts per table (alert > 300)
-- 3. Merge queue depth
-- 4. Memory usage
-- 5. Disk space free
-- 6. Replication lag (if clustered)
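Metric 1 can be read directly from system.query_log, assuming the query_log is enabled as in the config above; a sketch:

```sql
-- Query latency percentiles over the last hour
SELECT
    quantile(0.5)(query_duration_ms)  AS p50_ms,
    quantile(0.95)(query_duration_ms) AS p95_ms,
    quantile(0.99)(query_duration_ms) AS p99_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR;
```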
-- Health check query (use for load balancer probes)
SELECT 1;
-- More thorough health check
SELECT
(SELECT count() FROM system.processes) AS running_queries,
(SELECT sum(value) FROM system.metrics WHERE metric = 'MemoryTracking') AS memory_bytes,
(SELECT count() FROM system.merges) AS active_merges;
Connectivity and client checklist:
- Prometheus metrics exposed (/metrics via ClickHouse Exporter or Cloud endpoint)
- Access hardened (see clickhouse-security-basics)
- Connection pool bounded (max_open_connections)
- Clients closed on shutdown (client.close())
- Rate limits applied (see clickhouse-rate-limits)

// Health check endpoint
app.get('/health', async (req, res) => {
try {
const { success } = await client.ping();
res.json({ status: success ? 'healthy' : 'degraded', clickhouse: success });
} catch {
res.status(503).json({ status: 'unhealthy', clickhouse: false });
}
});
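The `client` used above needs to be created somewhere; a minimal sketch, assuming the @clickhouse/client Node package and a hypothetical CLICKHOUSE_URL environment variable:

```javascript
// Hypothetical client setup for the health endpoint above (@clickhouse/client)
import { createClient } from '@clickhouse/client';

const client = createClient({
  url: process.env.CLICKHOUSE_URL ?? 'http://localhost:8123',
  max_open_connections: 10, // bound the pool per the connectivity checklist
  request_timeout: 5_000,   // fail fast so load-balancer probes don't hang
});

// Close the pool on shutdown so in-flight sockets drain cleanly
process.on('SIGTERM', async () => {
  await client.close();
  process.exit(0);
});
```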
For incident response, see clickhouse-incident-runbook; for schema migrations, see clickhouse-migration-deep-dive.

-- Verify table sizes and part health
SELECT database, table, count() AS parts, sum(rows) AS rows,
formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts WHERE active
GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC;
-- Verify no pending mutations
SELECT * FROM system.mutations WHERE NOT is_done;
-- Verify no replication lag
SELECT database, table, queue_size, inserts_in_queue
FROM system.replicas WHERE queue_size > 0;
| Issue | Detection | Action |
|---|---|---|
| Parts > 300 | Monitoring alert | Review insert patterns, wait for merges |
| Disk > 80% | Disk alert | Add storage, drop old partitions |
| Query p95 > 5s | Latency alert | Check system.query_log for slow queries |
| Replication lag | Replica check | Investigate network, merge backlog |
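For the "Query p95 > 5s" row, the system.query_log lookup might look like this sketch (the one-hour window and 120-character truncation are illustrative choices):

```sql
-- Surface the slowest recent queries for triage
SELECT
    query_duration_ms,
    read_rows,
    memory_usage,
    substring(query, 1, 120) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;
```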
For SDK version upgrades, see clickhouse-upgrade-migration.