Generates a structured VUI spec with interaction flows, confirmation patterns, error recovery, hybrid screen layouts, audio feedback, and accessibility guidelines for a given product.
From nakshanpx claudepluginhub adityaraj0421/naksha-studio --plugin naksha[product type] [platform: alexa|google|custom] [screen: voice-only|hybrid]/design-voice-uiGenerates a structured VUI spec with interaction flows, confirmation patterns, error recovery, hybrid screen layouts, audio feedback, and accessibility guidelines for a given product.
You are designing a complete voice user interface (VUI). Your output is a single structured spec covering interaction flows, confirmation patterns, screen companion layout (if hybrid), audio feedback, and accessibility.
Arguments: $ARGUMENTS
Parse the following from $ARGUMENTS:
Read ${CLAUDE_PLUGIN_ROOT}/skills/design/references/conversational-designer.md to apply the full Conversational Designer VUI knowledge base to every section below. Focus especially on: VUI Principles (wake word design, confirmation strategies, barge-in, no-input handling, SSML), Multi-Modal Design (when to show vs. speak, progressive disclosure, sync latency), and Earcons and Audio Feedback.
MCP Fallback: If the file cannot be read, proceed using inline knowledge. The inline knowledge for this command covers: wake word requirements (3+ syllables, phonetically distinctive, brand-consistent), confirmation strategies (explicit for high-stakes/irreversible, implicit for low-stakes/reversible), barge-in (enable by default, stop audio within 100ms, resume from current state), no-input handling (4s → "Are you still there?", 4s more → explicit options, third silence → graceful exit), SSML (break/emphasis/prosody/say-as), earcons (ascending 2-tone for success 200ms, descending 2-tone for error 300ms, pulse for waiting, single chime for wake detection 150ms), and multi-modal sync (screen update within 200ms of voice response starting).
Produce all five sections below as a single continuous output. Do not pause or ask for confirmation between sections.
Four key conversation paths for the voice interface.
User: [Wake word] — "Hey [assistant name]"
└─→ SYSTEM: Wake word detected
└─→ [Earcon: single chime 150ms] + [Listening indicator active]
└─→ User: [core voice command]
└─→ SYSTEM: Intent recognized (confidence ≥ 0.85)
└─→ [Processing indicator]
└─→ BOT: [Implicit confirmation + action]
└─→ [Earcon: ascending 2-tone 200ms] + [Screen update if hybrid]
└─→ [END or await follow-up]
User: [ambiguous command — top 2 intents within 15% confidence]
└─→ SYSTEM: Disambiguation triggered
└─→ BOT: "I want to make sure I help with the right thing.
Do you mean [Intent A] or [Intent B]?"
└─→ [2 options only — max 2 choices per disambiguation prompt]
├─→ User: clarifies → [proceed with selected intent]
└─→ User: still ambiguous → FALLBACK_1 (not a second disambiguation)
User: [utterance — intent not recognized]
└─→ FALLBACK_1: "I didn't quite catch that. You can say things like
'[capability 1],' '[capability 2],' or '[capability 3].'"
└─→ User: [still unrecognized]
└─→ FALLBACK_2: "Let me try a different way — I can help with
[category A], [category B], and [category C].
Which of those sounds closest to what you need?"
└─→ User: [still unrecognized or 3rd failure]
└─→ ESCALATION: "I'm having trouble understanding —
let me connect you with someone who can help."
└─→ [Graceful exit or human handoff]
Fallback rule: Never repeat the same fallback phrase twice in a session. Rotate phrasing. FALLBACK_1 should offer structured choices — an open-ended reprompt after a failed open-ended exchange is a dead end.
SYSTEM: [Action attempted — service call fails]
└─→ [Earcon: descending 2-tone 300ms]
└─→ BOT: "Something went wrong on my end — it's not you.
Want me to try again?"
├─→ User: "Yes" → [Retry once]
│ ├─→ [Retry succeeds] → Resume flow
│ └─→ [Retry fails] → "I'm still having trouble.
│ [Alternative: 'Try the app' / 'Call support' / 'I'll remind you later']"
└─→ User: "No" / silence after 8s → Graceful exit offer
No-input handling (silence during expected response):
| Silence duration | Response |
|---|---|
| 4 seconds | "Are you still there? Take your time." |
| 4 more seconds | "You can say 'yes' to continue, 'start over' to begin again, or 'goodbye' to end." |
| Third silence | "I'll end our session now — [platform wake word] anytime." End gracefully. |
Note: silence during slot-filling (user thinking about a date or number) should receive a longer grace period than top-level silence. For explicit yes/no confirmation prompts on high-stakes actions (payment, booking, deletion), the extended 8-second threshold from Section 2 applies — deliberate silence before confirming an irreversible action is expected and should not be penalized with a 4-second cutoff.
Match confirmation weight to action weight. Over-confirmation destroys conversation flow; under-confirmation on irreversible actions destroys trust.
<break time="300ms"/> before the confirmation question; <emphasis> on the key values being confirmedSkip this section entirely if the parsed screen availability is voice-only. Generate this section only when $ARGUMENTS includes "hybrid" or a screen-based platform is indicated.
The screen companion is always secondary to voice — it provides detail and data that voice cannot efficiently deliver. It is never the primary interaction surface.
┌─────────────────────────────────────────┐
│ TOP BAR — Always-on listening indicator │
│ [Pulsing mic icon when active] │
│ [Static mic icon when idle] │
│ [Product name / assistant name] │
├─────────────────────────────────────────┤
│ │
│ MAIN AREA — Last 3 conversational turns │
│ (most recent at bottom) │
│ │
│ [Turn 1: User utterance text] │
│ [Turn 1: Bot response text] │
│ [Turn 2: User utterance text] │
│ [Turn 2: Bot response text] │
│ [Turn 3: User utterance text — latest] │
│ [Turn 3: Bot response — current] │
│ │
├─────────────────────────────────────────┤
│ SUGGESTIONS STRIP — 3 context-aware │
│ quick actions (tap or speak) │
│ [Action 1] [Action 2] [Action 3] │
├─────────────────────────────────────────┤
│ BOTTOM BAR │
│ [Manual input toggle] [Mute button] │
└─────────────────────────────────────────┘
Never require the user to speak a full data entry (long address, long code) when a screen form would be faster.
Non-speech audio cues (earcons) communicate state changes faster than speech. Every earcon must have a visual fallback for deaf/hard-of-hearing users.
| State | Earcon | Duration | Visual Fallback |
|---|---|---|---|
| Wake word detected / listening active | Single ascending chime | 150ms | Mic icon animates to active (pulsing ring) |
| Task completed successfully | Ascending 2-tone | 200ms | Success color flash on screen + check icon |
| Error or failure | Descending 2-tone | 300ms | Warning color + warning icon |
| Waiting / processing | Repeating pulse every 1.5s | Continuous until resolved | Spinning/pulsing indicator on screen |
| Option selected / button tapped | Soft click | 50–100ms | Button state change (pressed style) |
Earcon rules:
Apply these SSML treatments to all synthesized voice responses:
<!-- Pause before key information — give user time to listen -->
<speak>
Your appointment is confirmed. <break time="300ms"/>
We'll see you on Friday at 3pm.
</speak>
<!-- Emphasis on important terms -->
<speak>
Your balance is <emphasis level="moderate">$42.50</emphasis>.
</speak>
<!-- Speaking rate 90–110% of default for clarity -->
<speak>
<prosody rate="95%">
Here are the three options available to you.
</prosody>
</speak>
<!-- Spell out codes, references, and abbreviations -->
<speak>
Your reference number is
<say-as interpret-as="characters">AX7Z</say-as>.
</speak>
<!-- Dates in natural spoken format -->
<speak>
Booked for
<say-as interpret-as="date" format="mdy">7/14/2025</say-as>.
</speak>
<!-- Pause between list items so user can track each one -->
<speak>
You have three messages. <break time="300ms"/>
One from Sarah. <break time="300ms"/>
One from the system. <break time="300ms"/>
And one from your manager.
</speak>
SSML content-type guidelines:
| Content Type | SSML Treatment |
|---|---|
| Confirmation numbers, codes | say-as interpret-as="characters" |
| Dates | say-as interpret-as="date" |
| Currency amounts | say-as interpret-as="currency" (SSML 1.1, Polly, Google TTS); fallback if unsupported: interpret-as="cardinal" + spell out "dollars and X cents" in text |
| Legal or terms copy | prosody rate="slow" |
| Lists (≤3 items) | <break time="300ms"/> between items |
| Emotional success moment | prosody pitch="+5%" for slight warmth |
| Key data user needs to remember | <emphasis level="moderate"> |
Barge-in rule: Enable barge-in by default on all prompts. When the user speaks mid-bot-speech, stop audio within 100ms and process the new utterance. Resume from the current state, not the beginning of the flow.
Voice-first design must not exclude users with disabilities. Every voice capability must have a non-voice equivalent.
After generating this voice UI spec, consider these follow-up commands:
/design-chatbot — design the text-based chat companion that pairs with this voice interface/design-framework — build the screen companion layout (Section 3) as a coded UI component/design-system — generate design tokens for the hybrid screen companion (colors, spacing, typography, animation timing)