Build production AI features with LLM integration, RAG pipelines, prompt engineering, and guardrails
npx claudepluginhub cure-consulting-group/productengineeringskills

This skill uses the workspace's default tool permissions.
Build production AI features: LLM integration, RAG pipelines, voice/vision, and intelligent automation. Ship AI that's reliable, cost-aware, and safe — not a demo.
Before starting, gather project context silently:
- `PORTFOLIO.md`, if it exists in the project root or parent directories, for product/team context
- `cat package.json 2>/dev/null || cat build.gradle.kts 2>/dev/null || cat Podfile 2>/dev/null` to detect the stack
- `git log --oneline -5 2>/dev/null` for recent changes
- `ls src/ app/ lib/ functions/ 2>/dev/null` to understand project structure
- Grep for `openai|anthropic|gemini|Claude|GPT|completion|embedding` to understand current AI integration

You MUST generate actual implementation code using Write, not just describe patterns:
- `src/llm/client.ts` — type-safe wrapper with retry, timeout, and streaming support
- `src/llm/prompts/{feature}.ts` — versioned prompt templates with variable injection
- `src/llm/guardrails.ts` — input validation, output parsing, PII detection, content filtering
- `src/llm/cost-tracker.ts` — token counting and cost-logging middleware
- `tests/llm/{feature}.eval.ts` — golden-dataset tests for prompt quality
- `src/llm/rag/` — embedder, vector store client, retriever, reranker

Before generating, Grep for existing LLM code and Read it to extend rather than duplicate.
| Feature | Architecture |
|---|---|
| Chatbot / conversational | LLM + conversation memory + streaming UI |
| Document processing | Upload → OCR/parse → LLM extract → structured output |
| Smart search | Embeddings + vector DB + semantic search |
| Recommendations | User data → embedding similarity → ranked results |
| Content generation | LLM + prompt template + guardrails + human review |
| Voice interaction | Speech-to-text → LLM → text-to-speech |
| Image/vision analysis | Vision model → structured extraction |
| Workflow automation | Trigger → LLM decision → action → verification |
| RAG (retrieval-augmented) | Query → retrieve context → LLM with context → response |
User Input → Prompt Template → LLM API → Parse Response → UI
Use for: content generation, simple Q&A, classification
User Query
→ Embed query (text-embedding model)
→ Search vector DB (Pinecone, Firestore vector, pgvector)
→ Retrieve top-K relevant chunks
→ Construct prompt: system + context chunks + user query
→ LLM generates answer grounded in retrieved context
→ Response with source citations
Use for: knowledge bases, documentation search, domain-specific Q&A
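The prompt-construction step of the RAG flow above can be sketched as follows. The function name `buildRagPrompt`, the `[n]` citation convention, and the chunk shape are illustrative assumptions, not a fixed API:

```typescript
// Assemble a grounded prompt from retrieved chunks.
// Chunk metadata (source) flows through so the model can cite it.
interface RetrievedChunk {
  text: string;
  source: string; // e.g. filename or URL
  score: number;  // similarity score from the vector search
}

function buildRagPrompt(query: string, chunks: RetrievedChunk[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.source})\n${c.text}`)
    .join("\n\n");
  return [
    "Answer using ONLY the context below. Cite sources as [n].",
    "If the context does not contain the answer, say so.",
    "",
    "Context:",
    context,
    "",
    `Question: ${query}`,
  ].join("\n");
}
```

Numbering the chunks in the prompt is what lets the model emit citations the UI can map back to sources.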
User Request
→ LLM plans steps (tool selection)
→ Execute tool 1 → result
→ Execute tool 2 → result
→ LLM synthesizes final response
Use for: complex workflows, multi-source data, actions with side effects
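A minimal sketch of the plan → execute → synthesize loop above. The `LlmStep` shape, `callLlm` signature, and step limit are assumptions; real providers expose tool calling through their own message formats:

```typescript
// Minimal tool-use loop: each turn, the model either requests a tool
// call or returns a final answer. `callLlm` stands in for a real client.
interface ToolCall { tool: string; args: Record<string, unknown> }
type LlmStep = { type: "tool"; call: ToolCall } | { type: "final"; text: string };

async function runAgent(
  request: string,
  tools: Record<string, (args: Record<string, unknown>) => Promise<string>>,
  callLlm: (history: string[]) => Promise<LlmStep>,
  maxSteps = 5,
): Promise<string> {
  const history = [request];
  for (let i = 0; i < maxSteps; i++) {
    const step = await callLlm(history);
    if (step.type === "final") return step.text;
    const result = await tools[step.call.tool](step.call.args);
    history.push(`${step.call.tool} → ${result}`); // feed the result back
  }
  return "Step limit reached"; // guard against infinite tool loops
}
```

The hard step cap matters for the side-effect case: an unbounded loop is both a cost risk and a safety risk.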
Input guardrails:
- Validate input length (reject > max tokens)
- Sanitize PII before sending to external LLM (if required by policy)
- Rate limit per user (prevent abuse / cost spikes)
- Content moderation on user input (if public-facing)
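The length and rate-limit checks above can be sketched as a single gate. The limits, the in-memory `Map`, and the character-count proxy for tokens are all illustrative assumptions; production code would use a shared store and a real tokenizer:

```typescript
// Input-side checks before any tokens are sent to the provider.
// Limits here are placeholders; tune per model and budget.
const MAX_INPUT_CHARS = 8000;         // rough proxy for a token limit
const WINDOW_MS = 60_000;
const MAX_REQUESTS_PER_WINDOW = 20;

const requestLog = new Map<string, number[]>(); // userId → timestamps

function checkInput(
  userId: string,
  input: string,
  now = Date.now(),
): { ok: boolean; reason?: string } {
  if (input.length === 0 || input.length > MAX_INPUT_CHARS) {
    return { ok: false, reason: "input length out of bounds" };
  }
  // Sliding window: drop timestamps older than the window, then count.
  const recent = (requestLog.get(userId) ?? []).filter((t) => now - t < WINDOW_MS);
  if (recent.length >= MAX_REQUESTS_PER_WINDOW) {
    return { ok: false, reason: "rate limit exceeded" };
  }
  recent.push(now);
  requestLog.set(userId, recent);
  return { ok: true };
}
```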
Output guardrails:
- Parse structured output with schema validation (Zod)
- Reject responses that fail schema validation → fallback
- Content filtering on LLM output (profanity, harmful content)
- Confidence thresholds — low confidence → human review queue
- Never display raw LLM output without parsing
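The parse-with-fallback rule above looks like this in practice. Real code would use the Zod schema the list mentions (`schema.safeParse`); the guard here is hand-rolled only to keep the sketch dependency-free, and the `Summary` shape is a made-up example:

```typescript
// Parse-or-fallback: never hand raw model output to the UI.
interface Summary { title: string; bullets: string[] }

function parseSummary(raw: string): Summary | null {
  try {
    const data = JSON.parse(raw);
    if (
      typeof data === "object" && data !== null &&
      typeof data.title === "string" &&
      Array.isArray(data.bullets) &&
      data.bullets.every((b: unknown) => typeof b === "string")
    ) {
      return { title: data.title, bullets: data.bullets };
    }
  } catch {
    // malformed JSON falls through to the fallback
  }
  return null; // caller routes null to retry / human review
}
```

Returning `null` instead of throwing keeps the fallback decision (retry, backup model, review queue) in one place at the call site.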
Per-request cost formula:
(input_tokens × input_price) + (output_tokens × output_price)
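The formula translates directly to code. Providers typically quote prices per million tokens, so that convention is assumed here; the numbers in the test are placeholders, not current pricing:

```typescript
// Per-request cost from the formula above.
// Prices are expressed per 1M tokens, the usual provider convention.
function requestCostUSD(
  inputTokens: number,
  outputTokens: number,
  inputPricePerM: number,
  outputPricePerM: number,
): number {
  return (inputTokens * inputPricePerM + outputTokens * outputPricePerM) / 1_000_000;
}
```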
Cost controls:
- Set max_tokens on every request (prevents runaway responses)
- Cache identical requests (hash prompt → cache response, TTL 1hr+)
- Use smaller models for simple tasks (classification, extraction)
- Use larger models only for complex reasoning
- Log token usage per feature for cost attribution
- Set monthly budget alerts
// For chat/conversational features — always stream
// Users perceive streaming as faster even when total time is the same
// Server: return ReadableStream
// Client: consume with async iterator, render token-by-token
// Show typing indicator while first token loads
// Handle stream interruption gracefully (partial response display)
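The client-side consumption described above can be sketched with an async iterator. `onToken` stands in for a UI update callback; any `AsyncIterable<string>` of tokens works, whether it comes from a `ReadableStream` or an SDK:

```typescript
// Consume a token stream, rendering incrementally, and keep the
// partial text if the stream is interrupted mid-response.
async function renderStream(
  stream: AsyncIterable<string>,
  onToken: (partial: string) => void,
): Promise<string> {
  let text = "";
  try {
    for await (const token of stream) {
      text += token;
      onToken(text); // partial response shown as it arrives
    }
  } catch {
    // stream interrupted: return what arrived rather than discarding it
  }
  return text;
}
```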
Document Ingestion:
1. Upload document (PDF, DOCX, HTML, TXT)
2. Extract text (pdf-parse, mammoth, cheerio)
3. Chunk text (500-1000 tokens per chunk, 100 token overlap)
4. Generate embeddings (text-embedding-3-small or equivalent)
5. Store in vector DB with metadata (source, page, date)
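Step 3 above can be sketched as a sliding-window chunker. Words stand in for tokens here as a rough approximation; a real pipeline would count with the embedding model's tokenizer:

```typescript
// Sliding-window chunker: each chunk overlaps the previous one so
// sentences split at a boundary still appear whole in some chunk.
function chunkText(text: string, chunkSize = 750, overlap = 100): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= words.length) break; // last window covers the tail
  }
  return chunks;
}
```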
Query Pipeline:
1. Embed user query with same model
2. Vector similarity search (top 5-10 chunks)
3. Re-rank results (optional, improves quality)
4. Construct prompt with retrieved context
5. Generate response with citations
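Step 2 of the query pipeline reduces to cosine similarity plus a sort. An in-memory version is sketched below; a vector DB does the same math at scale with an approximate index:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Exhaustive top-K search: score every document, sort, slice.
function topK(query: number[], docs: { id: string; vec: number[] }[], k = 5) {
  return docs
    .map((d) => ({ id: d.id, score: cosine(query, d.vec) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The re-ranking step in the pipeline would take this top-K list and reorder it with a stronger (and slower) relevance model.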
LLM API errors:
- 429 Rate Limited → exponential backoff (1s, 2s, 4s, max 3 retries)
- 500/503 Server Error → retry once, then fallback
- Timeout (>30s) → cancel, show fallback UI
- Invalid response → log, show "I couldn't process that" message
Fallback hierarchy:
1. Retry with same model
2. Try backup model (e.g., GPT-4 fails → try Gemini)
3. Return cached similar response (if available)
4. Show graceful error with manual alternative
5. Never: crash, hang, or show raw error to user
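The backoff schedule above (1s, 2s, 4s, max 3 retries) can be wrapped around any call. The `sleep` parameter is an assumption added so the delay is testable; on final failure the error propagates so the caller can continue down the fallback hierarchy:

```typescript
// Exponential backoff: 1 initial attempt + up to `retries` retries,
// delaying baseDelayMs * 2^attempt between failures.
async function withRetry<T>(
  fn: () => Promise<T>,
  retries = 3,
  baseDelayMs = 1000,
  sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastErr = err;
      if (attempt < retries) await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastErr; // caller falls through to backup model / cache / error UI
}
```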
Unit tests:
- Prompt template generates correct string for given inputs
- Output parser handles valid JSON, malformed JSON, empty response
- Guardrails block known-bad inputs
- Cost calculation is accurate
Integration tests (use recorded responses):
- Record real LLM responses → replay in tests (VCR pattern)
- Test full pipeline: input → prompt → (recorded) response → parsed output
- Test fallback paths with simulated errors
Evaluation tests:
- Maintain a golden dataset (input → expected output pairs)
- Run weekly eval: measure accuracy, hallucination rate, relevance
- Track eval scores over time (regression detection)
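The eval harness above can be sketched as a loop over the golden dataset. Exact-match scoring is assumed here because it is the simplest baseline; real evals often swap in semantic or LLM-graded scoring, and hallucination/relevance metrics need their own scorers:

```typescript
// Run each golden case through the pipeline and report accuracy,
// plus the failing inputs for inspection.
interface GoldenCase { input: string; expected: string }

async function runEval(
  cases: GoldenCase[],
  pipeline: (input: string) => Promise<string>,
): Promise<{ accuracy: number; failures: string[] }> {
  const failures: string[] = [];
  for (const c of cases) {
    const got = await pipeline(c.input);
    if (got.trim() !== c.expected.trim()) failures.push(c.input);
  }
  return { accuracy: (cases.length - failures.length) / cases.length, failures };
}
```

Logging the accuracy number per run is what makes the weekly regression check possible: a drop after a prompt or model change is the signal to investigate.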
Before shipping any AI feature: