From harness-claude
Explains PostgreSQL query planner's use of pg_stats, histograms, and most-common-values to estimate selectivity and choose execution plans. Use for diagnosing row misestimations and suboptimal plans after data changes.
npx claudepluginhub intense-visions/harness-engineering --plugin harness-claudeThis skill uses the workspace's default tool permissions.
> How the planner uses table statistics (pg_stats, histograms, most-common-values) to estimate row counts and choose execution plans.
Interprets PostgreSQL EXPLAIN and EXPLAIN ANALYZE outputs to diagnose slow queries, row count misestimations, missing indexes, and I/O bottlenecks. Useful for production debugging and schema validation.
Analyzes slow queries in PostgreSQL, MySQL, MongoDB using EXPLAIN plans, buffer stats, index usage to detect sequential scans, missing indexes, and recommend optimizations.
Optimizes PostgreSQL and MySQL queries using indexing, EXPLAIN ANALYZE, pg_stat_statements, and best practices. Use for slow queries, N+1 problems, missing indexes, sequential scans, pagination, and JOIN issues.
Share bugs, ideas, or general feedback.
How the planner uses table statistics (pg_stats, histograms, most-common-values) to estimate row counts and choose execution plans.
The planner does not look at actual table data at query time. Instead, it uses pre-computed statistics stored in pg_statistic (accessible via the pg_stats view). These statistics are sampled approximations, not exact counts.
Key statistics columns in pg_stats:
n_distinct -- estimated number of distinct values. Positive values are absolute counts; negative values are fractions of total rows (e.g., -1.0 means every row is unique)most_common_vals -- the N most frequent values in the columnmost_common_freqs -- the frequency of each most-common value (fractions of total rows)histogram_bounds -- equal-frequency histogram bucket boundaries for values not in the MCV listnull_frac -- fraction of rows that are NULLcorrelation -- how well the physical row order matches the logical (sorted) order. Values near 1.0 or -1.0 mean high correlationThe ANALYZE command samples the table and updates these statistics:
ANALYZE orders; -- analyze one table
ANALYZE orders (status); -- analyze one column
ANALYZE; -- analyze all tables in the database
Autovacuum runs ANALYZE automatically, but it may lag behind large data changes.
Examining statistics for the status column on an orders table:
SELECT
attname,
n_distinct,
most_common_vals,
most_common_freqs,
null_frac,
correlation
FROM pg_stats
WHERE tablename = 'orders' AND attname = 'status';
attname | n_distinct | most_common_vals | most_common_freqs | null_frac | correlation
---------+------------+-----------------------------+--------------------------+-----------+------------
status | 3 | {completed,active,cancelled} | {0.85,0.12,0.03} | 0 | 0.15
How the planner uses this: For WHERE status = 'active', the estimated rows = total_rows * 0.12. On a 10M-row table, the estimate is 1.2M rows. For WHERE status = 'cancelled', the estimate is 300K rows.
Misestimation scenario: After a data migration that marked 90% of orders as cancelled (previously 3%), the statistics are stale:
EXPLAIN ANALYZE
SELECT * FROM orders WHERE status = 'cancelled';
Index Scan using idx_orders_status on orders
(cost=0.43..12345.67 rows=300000 width=52)
(actual time=0.031..8923.450 rows=9000000 loops=1)
Estimated 300K rows, actual 9M -- a 30x misestimation. The planner chose Index Scan (good for 300K) instead of Seq Scan (better for 9M). Fix:
ANALYZE orders;
After ANALYZE, the planner sees the updated distribution and switches to Seq Scan:
Seq Scan on orders (cost=0.00..223456.00 rows=9010000 width=52)
(actual time=0.012..1234.567 rows=9000000 loops=1)
Filter: (status = 'cancelled')
Execution Time: 1345.678 ms
Increasing statistics granularity for skewed distributions:
ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
ANALYZE orders;
The default default_statistics_target is 100 (100 histogram buckets and 100 MCV entries). For highly skewed columns, increasing to 500-1000 gives the planner a more accurate picture.
Never running ANALYZE after bulk operations. After a large INSERT, DELETE, or UPDATE that changes data distribution, statistics are stale. The planner uses the old distribution, producing bad plans. Always ANALYZE tablename; after bulk changes.
Setting default_statistics_target too low. The default of 100 works for uniform distributions but fails for skewed data. Columns with thousands of distinct values at varying frequencies need higher targets.
Ignoring n_distinct misestimates. For high-cardinality columns (e.g., user_id with 50M distinct values), the sampled n_distinct can be significantly off. Override with: ALTER TABLE orders ALTER COLUMN user_id SET (n_distinct = -1); (tells the planner every value is unique).
Disabling autovacuum. Autovacuum also runs auto-ANALYZE. Disabling it means statistics are never refreshed automatically, guaranteeing plan degradation over time.
default_statistics_target controls the number of histogram buckets and MCV entries. Default: 100. Maximum: 10000. Higher values increase ANALYZE time but improve estimation accuracy:
-- Global setting:
SET default_statistics_target = 200;
-- Per-column override:
ALTER TABLE events ALTER COLUMN event_type SET STATISTICS 500;
Extended statistics for correlated columns (PostgreSQL 10+):
CREATE STATISTICS stat_orders_status_region (dependencies)
ON status, region FROM orders;
ANALYZE orders;
Without extended statistics, the planner assumes columns are independent. If status = 'active' AND region = 'us-east' are correlated (e.g., 80% of us-east orders are active), the default estimation multiplies their individual frequencies, producing a significant underestimate.
Monitoring ANALYZE freshness:
SELECT relname, last_analyze, last_autoanalyze, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'orders';
If last_autoanalyze is days old and n_dead_tup is high, autovacuum is falling behind.
Multivariate statistics evolution:
dependencies)mcv)expressions)CREATE STATISTICS stat_orders_multi (dependencies, mcv)
ON status, region, tenant_id FROM orders;
Selectivity estimation for complex predicates. AND clauses multiply individual selectivities (assuming independence). OR clauses use inclusion-exclusion. NOT inverts selectivity. These assumptions break down for correlated columns -- extended statistics fix this.
The 1/n_distinct fallback. For values not in the MCV list and not in the histogram range, PostgreSQL estimates selectivity as 1/n_distinct. This is a rough guess and can be significantly wrong for new or rare values.
pg_statistic_ext stores extended statistics data. Query it to verify that your extended statistics are being computed:
SELECT stxname, stxkeys, stxkind
FROM pg_statistic_ext
WHERE stxrelid = 'orders'::regclass;
MySQL uses ANALYZE TABLE (not just ANALYZE) to update statistics:
ANALYZE TABLE orders;
MySQL stores statistics in mysql.innodb_index_stats and mysql.innodb_table_stats. Key differences:
innodb_stats_persistent = ON by default since 5.6): statistics survive restartsinnodb_stats_persistent_sample_pages, default 20): controls sample size for ANALYZE. Much smaller than PostgreSQL's default of 30,000 * default_statistics_target rowsANALYZE TABLE t UPDATE HISTOGRAM ON col WITH 100 BUCKETS;MySQL's optimizer uses a simpler statistics model overall. For complex queries with correlated columns, PostgreSQL's extended statistics provide significantly better estimates.
Reporting system with daily batch inserts of 5M rows. After each nightly batch, morning report queries degraded from 2 seconds to 30+ seconds. Investigation showed that autovacuum's ANALYZE had not run since the batch completed -- the statistics still reflected the previous day's distribution. Adding ANALYZE reporting_events; to the batch job's post-load step ensured fresh statistics. Query performance remained consistent at 2 seconds. Selectivity estimation error dropped from 100x (stale stats) to under 2x (fresh stats).
pg_stats for your key columns and comparing EXPLAIN estimated rows to actual rows.ANALYZE runs in post-load steps).