UX Metrics & Measurement — Quantifying User Experience
Measurement Philosophy
Measure what matters, not what is easy. Vanity metrics (page views, downloads, time on site) tell you what happened but not whether the experience was good. Meaningful UX metrics connect user behavior to user satisfaction to business outcomes. Every metric in the system should answer one of three questions: Is it usable? Is it useful? Is it desirable?
The Measurement Hierarchy
- Behavioral metrics (what users do) are more reliable than attitudinal metrics (what users say).
- Task-level metrics (can users accomplish goals?) matter more than session-level metrics (how long did users stay?).
- Leading indicators (early signals of success/failure) are more actionable than lagging indicators (outcomes measured after the fact).
- Benchmarked metrics (compared to industry or historical baseline) are more meaningful than absolute metrics (numbers without context).
HEART Framework (Google)
The HEART framework provides a structured approach to selecting user-centered metrics across five dimensions. Developed at Google, it maps product goals to measurable signals to concrete metrics.
Five Dimensions
Happiness — Subjective user satisfaction and sentiment
- Signals: survey responses, ratings, sentiment in feedback
- Example metrics: satisfaction score (CSAT), NPS, SUS score, ease-of-use rating
- Measurement: post-task surveys, in-app ratings, periodic satisfaction surveys
Engagement — Depth and frequency of user interaction
- Signals: session frequency, feature usage, content consumption
- Example metrics: DAU/MAU ratio, actions per session, feature adoption rate, session depth
- Measurement: product analytics, event tracking
Adoption — New users successfully onboarding
- Signals: signups, onboarding completion, first key action
- Example metrics: signup-to-activation rate, onboarding completion rate, time to first value
- Measurement: funnel analytics, cohort analysis
Retention — Users continuing to derive value over time
- Signals: return visits, subscription renewals, continued usage
- Example metrics: D1/D7/D30 retention, churn rate, renewal rate, feature retention
- Measurement: cohort retention curves, survival analysis
Task Success — Ability to complete intended goals effectively and efficiently
- Signals: task completion, errors, time spent, help requests
- Example metrics: task success rate, time on task, error rate, lostness score
- Measurement: usability testing, analytics funnels, error logging
Goals-Signals-Metrics Process
For each product area, work through three columns:
- Goal: What is the desired user outcome? ("Users can find relevant products quickly")
- Signal: What user behavior indicates progress toward the goal? ("Users complete searches and click results")
- Metric: How do you quantify the signal? ("Search-to-click rate, average time from search to product page view")
This process prevents metric selection from being arbitrary. Every metric traces back to a user goal.
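As a minimal illustration of how a Goals-Signals-Metrics worksheet can be captured as data, the sketch below models one row for a hypothetical search flow; the product area, goal, and metric names are illustrative, not prescribed by the framework.

```python
from dataclasses import dataclass

@dataclass
class GSMRow:
    """One Goals-Signals-Metrics worksheet row for a product area."""
    product_area: str
    goal: str           # desired user outcome
    signal: str         # behavior indicating progress toward the goal
    metrics: list[str]  # how the signal is quantified

# Hypothetical worksheet row for a product search flow.
worksheet = [
    GSMRow(
        product_area="Search",
        goal="Users can find relevant products quickly",
        signal="Users complete searches and click results",
        metrics=["search-to-click rate", "time from search to product page view"],
    ),
]

for row in worksheet:
    print(f"{row.product_area}: {row.goal}")
    print(f"  signal: {row.signal}")
    print(f"  metrics: {', '.join(row.metrics)}")
```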
System Usability Scale (SUS)
The most widely used standardized usability questionnaire. Ten statements rated on a five-point Likert (agreement) scale, producing a score from 0 to 100.
Scoring and Interpretation
- Raw score calculation: for odd-numbered (positively worded) items, subtract 1 from the response; for even-numbered (negatively worded) items, subtract the response from 5; sum the ten adjusted values and multiply by 2.5 (a scoring sketch follows this list).
- Average SUS score across studies: 68 (this is the 50th percentile, not a passing grade).
- Adjective scale mapping: <50 = Awful, 50-67 = Poor to OK, 68 = Passable (the average), 69-80 = Good, 81-89 = Excellent, 90+ = Best Imaginable.
- Grade scale: <60 = F, 60-69 = D, 70-79 = C, 80-89 = B, 90+ = A.
- Percentile benchmarks: 68 = 50th, 74 = 64th, 80 = 85th, 86 = 96th.
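A minimal scoring sketch, assuming responses are recorded 1 to 5 in questionnaire item order (odd items positively worded, even items negatively worded):

```python
def sus_score(responses: list[int]) -> float:
    """Compute a SUS score (0-100) from ten 1-5 Likert responses in item order.

    Odd-numbered items are positively worded: contribution = response - 1.
    Even-numbered items are negatively worded: contribution = 5 - response.
    The summed contributions (0-40) are scaled by 2.5.
    """
    if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
        raise ValueError("SUS expects ten responses, each between 1 and 5")
    total = sum(
        (r - 1) if i % 2 == 0 else (5 - r)  # index 0, 2, 4, ... are items 1, 3, 5, ...
        for i, r in enumerate(responses)
    )
    return total * 2.5

# Example: a respondent leaning positive scores well above the 68 average.
print(sus_score([5, 2, 4, 1, 4, 2, 5, 1, 4, 2]))  # 85.0
```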
When to Use SUS
- After usability testing sessions (post-study).
- Baseline measurement before redesign.
- Comparative evaluation between design alternatives.
- Longitudinal tracking across releases.
- Industry benchmarking.
Limitations
- SUS measures perceived usability, not actual usability. A user may rate a system highly despite making errors.
- The 10-item format is not ideal for quick pulse checks — use UMUX-Lite (2 items) for lightweight measurement.
- SUS does not diagnose specific issues — it is a thermometer, not a diagnostic tool.
UEQ (User Experience Questionnaire)
The UEQ measures six dimensions of user experience: Attractiveness, Perspicuity (clarity), Efficiency, Dependability, Stimulation, and Novelty. It uses 26 semantic differential items (word pairs).
When to Use UEQ Over SUS
- When you need dimensional insight beyond a single usability score.
- When hedonic qualities (stimulation, novelty) matter alongside pragmatic qualities (perspicuity, efficiency, dependability).
- When comparing products across multiple experience dimensions.
- UEQ+ (modular version) allows selecting only relevant scales.
UMUX-Lite
A two-item questionnaire that correlates strongly with SUS (r = 0.83). Useful for: in-app pulse surveys, post-task quick checks, high-frequency measurement where survey fatigue is a concern.
Items: (1) "[System] capabilities meet my requirements." (2) "[System] is easy to use." Both rated on a seven-point Likert scale.
SUPR-Q (Standardized User Experience Percentile Rank Questionnaire)
Web-specific measurement covering four factors: Usability, Trust/Credibility, Appearance, and Loyalty. Produces a percentile rank against a database of 200+ websites. Best for benchmarking web experiences against industry competitors.
Task-Based Metrics
Task Success Rate
The most fundamental UX metric. Percentage of users who successfully complete a defined task.
- Binary success: Did the user complete the task? Yes/No.
- Partial success: Weighted scoring for partially completed tasks (e.g., 0 = fail, 0.5 = completed with significant difficulty, 1 = success).
- Benchmark: Industry average for web tasks is approximately 78%. Below 70% indicates serious usability problems.
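Completion rates from small usability samples need confidence intervals alongside the point estimate. The sketch below computes a binary success rate with an adjusted-Wald interval (one standard approach for small samples, in the spirit of Sauro & Lewis) plus a partial-credit average; the sample data is hypothetical.

```python
import math

def completion_rate_ci(successes: int, trials: int, z: float = 1.96):
    """Binary task success rate with an approximate 95% adjusted-Wald interval."""
    p_adj = (successes + z**2 / 2) / (trials + z**2)
    margin = z * math.sqrt(p_adj * (1 - p_adj) / (trials + z**2))
    return successes / trials, max(0.0, p_adj - margin), min(1.0, p_adj + margin)

def partial_success_rate(scores: list[float]) -> float:
    """Average of weighted scores, e.g. 0 = fail, 0.5 = completed with difficulty, 1 = success."""
    return sum(scores) / len(scores)

rate, low, high = completion_rate_ci(successes=7, trials=10)
print(f"observed {rate:.0%}, 95% CI {low:.0%}-{high:.0%}")  # observed 70%, CI roughly 39%-90%
print(partial_success_rate([1, 1, 0.5, 0, 1, 0.5, 1]))
```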
Time on Task
How long users take to complete a task. Lower is generally better, but context matters — rushed completion may indicate skipped steps.
- Measure from task start to task completion.
- Use geometric mean (not arithmetic mean) because time data is typically right-skewed.
- Compare against baseline or competitor benchmarks.
- Distinguish between first-time and repeat users.
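A small sketch of the geometric mean calculation, with hypothetical task times showing how a single slow outlier inflates the arithmetic mean:

```python
import math

def geometric_mean_time(times_seconds: list[float]) -> float:
    """Geometric mean of task times; less biased than the arithmetic mean
    for right-skewed time-on-task data."""
    return math.exp(sum(math.log(t) for t in times_seconds) / len(times_seconds))

times = [42, 55, 48, 61, 230]  # one slow outlier skews the distribution
print(round(geometric_mean_time(times), 1))  # ~68.9 seconds
print(round(sum(times) / len(times), 1))     # 87.2 seconds, pulled up by the outlier
```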
Error Rate
Frequency and severity of errors during task completion.
- Error frequency: Number of errors per task attempt.
- Error taxonomy: Classify errors as slips (wrong action, right intention), mistakes (wrong intention), or system errors (not user-caused).
- Recovery rate: Percentage of errors the user successfully recovers from.
- Error impact: Which errors cause task abandonment versus temporary friction.
Lostness Score
Measures navigation efficiency. Calculated as: Lostness = sqrt((N/S - 1)^2 + (R/N - 1)^2), where N = number of unique pages visited, S = total number of pages visited (including revisits), and R = minimum number of pages needed to complete the task (a calculation sketch follows this list).
- Score of 0 = optimal path. Scores below roughly 0.4 suggest users are not lost; scores above roughly 0.5 indicate users are seriously lost.
- Useful for evaluating information architecture and navigation design.
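A small sketch of the calculation, using the variable definitions above and hypothetical page counts:

```python
import math

def lostness(unique_pages: int, total_pages: int, minimum_pages: int) -> float:
    """Lostness measure: 0 = optimal path, higher values = more lost.

    unique_pages  (N): number of different pages visited
    total_pages   (S): total pages visited, counting revisits
    minimum_pages (R): fewest pages needed to complete the task
    """
    n, s, r = unique_pages, total_pages, minimum_pages
    return math.sqrt((n / s - 1) ** 2 + (r / n - 1) ** 2)

print(lostness(unique_pages=4, total_pages=4, minimum_pages=4))             # 0.0, optimal path
print(round(lostness(unique_pages=9, total_pages=14, minimum_pages=4), 2))  # ~0.66, user is lost
```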
Behavioral Analytics
Funnel Analysis
Track user progression through multi-step flows. Identify where users drop off and why.
- Define funnels for every critical flow: onboarding, checkout, feature activation, upgrade.
- Measure step-to-step conversion rate, overall completion rate, and time per step.
- Segment funnels by user cohort, acquisition channel, device, and user experience level.
- Investigate drop-off points with session recordings and qualitative research.
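A minimal sketch of step-to-step and overall conversion reporting, assuming you already have per-step reach counts from your analytics tool; the onboarding funnel and numbers below are hypothetical.

```python
def funnel_report(step_counts: dict[str, int]) -> None:
    """Print step-to-step conversion and overall completion for an ordered funnel.

    step_counts maps step name -> number of users who reached that step, in flow order.
    """
    steps = list(step_counts.items())
    first = steps[0][1]
    prev = first
    for name, count in steps:
        step_conv = count / prev if prev else 0.0
        overall = count / first if first else 0.0
        print(f"{name:<22} {count:>6}  step {step_conv:6.1%}  overall {overall:6.1%}")
        prev = count

funnel_report({
    "Signup started": 10_000,
    "Signup completed": 7_400,
    "Onboarding finished": 4_100,
    "First key action": 2_600,
})
```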
Cohort Analysis
Group users by shared characteristic (signup date, acquisition channel, plan tier) and track behavior over time.
- Retention cohorts: What percentage of users who signed up in Week 1 are still active in Week 4, 8, 12?
- Feature cohorts: How does behavior differ between users who adopted Feature X versus those who did not?
- Identify the "aha moment" — the action most correlated with long-term retention.
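A minimal sketch of classic (exact-day) day-N retention for a signup cohort; the user IDs and dates are hypothetical, and a rolling or "on-or-after day N" definition would need a different comparison.

```python
from datetime import date

def day_n_retention(signups: dict[str, date],
                    activity: dict[str, set[date]],
                    day_n: int) -> float:
    """Fraction of a cohort active exactly N days after signup (classic day-N retention)."""
    cohort = list(signups)
    retained = sum(
        1 for user in cohort
        if any((d - signups[user]).days == day_n for d in activity.get(user, set()))
    )
    return retained / len(cohort)

# Hypothetical three-user cohort that signed up on the same day.
signups = {"a": date(2025, 3, 1), "b": date(2025, 3, 1), "c": date(2025, 3, 1)}
activity = {
    "a": {date(2025, 3, 2), date(2025, 3, 8)},
    "b": {date(2025, 3, 2)},
    "c": set(),
}
print(day_n_retention(signups, activity, 1))  # 2/3 of the cohort returned on day 1
print(day_n_retention(signups, activity, 7))  # 1/3 returned on day 7
```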
Retention Curves
Plot the percentage of active users over time since signup.
- Flattening curve: Healthy product — a stable base of users finds lasting value.
- Declining curve: Problem — users are not finding enough value to stay.
- Smiling curve: Recovery — initial drop-off followed by increasing engagement (often from re-engagement campaigns).
Experimentation
A/B Testing
Compare two design variants with random user assignment to determine which performs better on a defined metric.
- Hypothesis template: "If we [change], then [metric] will [increase/decrease] by [amount] because [reasoning]."
- Sample size: Calculate before launching using a power calculator. Specify minimum detectable effect, significance level (typically 0.05), and power (typically 0.80).
- Duration: Run for at least one full business cycle (minimum 1-2 weeks). Do not peek at results and stop early.
- One variable: Test one change at a time for clear causation. Multivariate testing requires exponentially larger samples.
- Guard rails: Monitor guardrail metrics (crash rate, load time, error rate) to ensure the variant does not cause harm even if the primary metric improves.
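A minimal sample-size sketch using the standard normal-approximation formula for comparing two proportions; the baseline conversion rate and minimum detectable effect below are hypothetical, and a dedicated power calculator should be used for production decisions.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_variant(baseline_rate: float,
                            minimum_detectable_effect: float,
                            alpha: float = 0.05,
                            power: float = 0.80) -> int:
    """Approximate users needed per variant for a two-sided test of two proportions.

    n = (z_{alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect  # absolute lift we want to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)

# Detecting a 2-point absolute lift on a 10% baseline conversion rate:
print(sample_size_per_variant(0.10, 0.02))  # ~3,839 users per variant (~7,700 total)
```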
Sequential Testing
Alternative to fixed-horizon A/B testing. Allows checking results continuously without inflating false positive rate. Useful when you need faster decisions or have variable traffic.
Benchmarking
Industry Benchmarks
- SUS: Average across studies is 68. B2B software averages 65-72. Consumer apps average 70-78.
- Task success rate: Web average 78%. E-commerce checkout 65-70%. Mobile forms 60-75%.
- NPS: Average varies by industry. SaaS averages 30-40. B2C averages 40-60.
- Time on task: Highly variable. Compare against your own historical baseline rather than cross-industry.
Historical Tracking
- Track key metrics across every release. Plot trendlines.
- Set regression thresholds: alert when a metric drops more than 5% from baseline.
- Celebrate improvements with the team — measurement should drive positive reinforcement.
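A small sketch of a regression-threshold check against a stored baseline; the metric names and values are hypothetical, and the function assumes every metric passed in is "higher is better."

```python
def check_regressions(baseline: dict[str, float],
                      current: dict[str, float],
                      threshold: float = 0.05) -> list[str]:
    """Flag metrics that dropped more than `threshold` (5% by default) from baseline.

    Invert "lower is better" metrics such as error rate before passing them in.
    """
    alerts = []
    for name, base in baseline.items():
        drop = (base - current.get(name, 0.0)) / base
        if drop > threshold:
            alerts.append(f"{name}: down {drop:.1%} vs baseline")
    return alerts

print(check_regressions(
    baseline={"task_success_rate": 0.82, "sus_score": 74.0},
    current={"task_success_rate": 0.76, "sus_score": 75.5},
))  # flags task_success_rate, down ~7.3%
```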
AI-Specific UX Metrics
- Trust calibration score: How closely does user confidence in AI output match actual AI accuracy?
- Appropriate reliance rate: Percentage of times users correctly accept accurate AI outputs and correctly reject inaccurate ones.
- AI feature adoption: Opt-in rate, prompt frequency, feature retention over time.
- Correction frequency: How often users edit or override AI outputs.
- Time saved: Measured difference in task time with AI versus without.
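As one way to operationalize appropriate reliance, the sketch below scores a hypothetical log of AI-assisted decisions, counting a decision as appropriate when the user accepts a correct output or rejects an incorrect one.

```python
def appropriate_reliance_rate(events: list[tuple[bool, bool]]) -> float:
    """Fraction of AI-assisted decisions where the user relied appropriately.

    Each event is (ai_output_was_correct, user_accepted_output); appropriate
    reliance means accepting correct outputs and rejecting incorrect ones.
    """
    appropriate = sum(
        1 for ai_correct, user_accepted in events if ai_correct == user_accepted
    )
    return appropriate / len(events)

# Hypothetical log: 3 correct outputs accepted, 1 wrong output rejected,
# 1 wrong output accepted -> 4/5 appropriate reliance.
events = [(True, True), (True, True), (True, True), (False, False), (False, True)]
print(appropriate_reliance_rate(events))  # 0.8
```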
Design System Metrics
- Adoption rate: Percentage of UI built with design system components versus custom implementations.
- Component reuse: Average instances per component across products.
- Consistency score: Visual regression pass rate across products.
- Contribution rate: External PRs per month to the design system.
- Developer satisfaction: Quarterly NPS survey of design system consumers.
- Design system ROI: [(Time efficiency gains + Quality gains + Scale gains + Consistency gains) - (Initial cost + Maintenance cost)] / (Initial + Maintenance) x 100%.
- Sparkbox research data: 38% design efficiency gains, 31% development efficiency gains, and 228% higher ROI for organizations with mature DesignOps.
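A minimal calculation sketch for the ROI formula above; the cost and gain figures are hypothetical annual estimates.

```python
def design_system_roi(time_gains: float, quality_gains: float,
                      scale_gains: float, consistency_gains: float,
                      initial_cost: float, maintenance_cost: float) -> float:
    """ROI percentage per the formula above: (total gains - total cost) / total cost x 100."""
    gains = time_gains + quality_gains + scale_gains + consistency_gains
    cost = initial_cost + maintenance_cost
    return (gains - cost) / cost * 100

# Hypothetical annual figures in dollars.
print(round(design_system_roi(
    time_gains=400_000, quality_gains=150_000, scale_gains=100_000,
    consistency_gains=50_000, initial_cost=250_000, maintenance_cost=120_000,
)))  # ~89% ROI
```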
Cross-Referencing
- For heuristic evaluation methodology, reference nng-ux-heuristics.
- For research methodology, reference ux-research-methods.
- For AI-specific metrics detail, reference ux-metrics-measurement/references/ai-ux-metrics-experimentation.
- For design system metrics detail, reference design-systems-architecture/references/governance-scaling.
- For usability testing protocols that produce metrics, reference ux-research-methods/references/research-protocols.
v3.0 Cross-References
The v3.0 upgrade adds reference materials that extend measurement into design system maturity, AI-specific quality metrics, and notification effectiveness.
Design System Maturity Metrics
See design-systems-architecture/references/maturity-model-multi-brand.md for the 5-level design system maturity assessment rubric with quantifiable metrics at each level — including adoption rate thresholds (Level 3 requires 60%+ component coverage), contribution velocity benchmarks, token coverage ratios, cross-platform parity scores, and governance process maturity indicators. This reference provides the structured assessment framework that complements the Design System Metrics section above, enabling teams to measure where they stand and define concrete advancement targets.
AI-Specific Quality and Trust Metrics
See agentic-ai-generative-ux/references/llm-hallucination-design-guardrails.md for expanded AI-specific metrics beyond the AI-Specific UX Metrics section above — including hallucination rate measurement methodology (per-claim factual accuracy scoring), confidence calibration metrics (Expected Calibration Error measuring alignment between stated confidence and actual correctness), trust calibration accuracy (user trust vs. system reliability correlation), verification engagement rate (how often users check AI citations), and AI quality gate pass/fail rates for production deployment monitoring. These metrics are essential for any team shipping LLM-powered features.
Notification Effectiveness Metrics
See performance-states-patterns/references/notification-system-design.md for metrics specific to notification system evaluation — including notification-to-action rate (percentage of notifications that lead to meaningful user action), dismissal rate by notification type, opt-out rate trending, notification fatigue indicators (declining engagement over time), time-to-action measurement, and notification channel effectiveness comparison (push vs. in-app vs. email). These metrics connect directly to the HEART framework's Engagement dimension for notification-driven features.
Key Sources
- Google HEART Framework (Rodden, Hutchinson, Fu)
- Sauro, J. & Lewis, J.: "Quantifying the User Experience"
- MeasuringU: 48 UX Metrics (2025)
- Brooke, J.: SUS — A Quick and Dirty Usability Scale
- Schrepp, M.: User Experience Questionnaire (UEQ)
- Frontiers in Computer Science: UX Measurement Frameworks
- Sparkbox: Design System ROI Research
- Nielsen Norman Group: UX Maturity Model, DesignOps 101