Implement Groq reference architecture with best-practice project layout. Use when designing new Groq integrations, reviewing project structure, or establishing architecture standards for Groq applications. Trigger with phrases like "groq architecture", "groq best practices", "groq project structure", "how to organize groq", "groq layout".
Production architecture for ultra-fast LLM inference with Groq LPU. Covers model routing by latency requirements, streaming pipelines, fallback strategies, and integration patterns for real-time AI applications.
Built on the `groq-sdk` npm package.

```
┌─────────────────────────────────────────────────────┐
│                  Application Layer                  │
│   Chat UI │ API Backend │ Batch Processor │ Agent   │
└──────────┬──────────────┬───────────────┬───────────┘
           │              │               │
           ▼              ▼               ▼
┌─────────────────────────────────────────────────────┐
│                    Model Router                     │
│ ┌───────────────┐  ┌──────────────┐  ┌───────────┐  │
│ │ Speed Tier    │  │ Quality Tier │  │ Long Ctx  │  │
│ │ llama-3.1-8b  │  │ llama-3.3-70b│  │ mixtral   │  │
│ │ (80ms TTFT)   │  │ (200ms TTFT) │  │ (32k ctx) │  │
│ └───────────────┘  └──────────────┘  └───────────┘  │
├─────────────────────────────────────────────────────┤
│                     Middleware                      │
│  Prompt Cache │ Rate Limiter │ Token Counter │ Log  │
├─────────────────────────────────────────────────────┤
│                   Fallback Layer                    │
│     Groq Primary → OpenAI Fallback → Local Model    │
└─────────────────────────────────────────────────────┘
```
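The middleware row above lists a rate limiter. A minimal in-memory token-bucket sketch (illustrative only; not part of `groq-sdk`, and a production deployment would track Groq's RPM/TPM limits per model):

```typescript
// Token-bucket rate limiter: `capacity` requests burst, refilled at
// `refillPerSec` requests per second. Call tryAcquire() before each request.
class RateLimiter {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
  }

  tryAcquire(): boolean {
    const now = Date.now();
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.lastRefill) / 1000) * this.refillPerSec
    );
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```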
```typescript
import Groq from 'groq-sdk';

const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });

type ModelTier = 'speed' | 'quality' | 'long-context';

const MODEL_MAP: Record<ModelTier, string> = {
  speed: 'llama-3.1-8b-instant',
  quality: 'llama-3.3-70b-versatile',
  'long-context': 'mixtral-8x7b-32768', // 32k-token context window
};

// Pick the cheapest model that satisfies latency, context, and quality needs.
function selectModel(options: {
  maxLatencyMs?: number;
  contextLength?: number;
  needsReasoning?: boolean;
}): string {
  if (options.contextLength && options.contextLength > 8192) // beyond 8k tokens
    return MODEL_MAP['long-context'];
  if (options.maxLatencyMs && options.maxLatencyMs < 150)
    return MODEL_MAP.speed;
  if (options.needsReasoning) return MODEL_MAP.quality;
  return MODEL_MAP.speed;
}
```
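`selectModel` takes `contextLength` in tokens. Absent a local tokenizer, a common rough heuristic is about four characters per token (an approximation, not Groq's actual tokenizer):

```typescript
// Rough token estimate for routing decisions: ~4 characters per token.
// This is a heuristic; real token counts come back in the API's usage field.
function estimateTokens(messages: { role: string; content: string }[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}
```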
```typescript
interface CompletionOptions {
  messages: any[];
  tier?: ModelTier;
  stream?: boolean;
  maxTokens?: number;
  temperature?: number;
}

// Minimal metrics sink; swap in your real logger or telemetry pipeline here.
function logMetrics(metrics: { model: string; latency: number; tokens: unknown }) {
  console.log(JSON.stringify(metrics));
}

async function complete(options: CompletionOptions) {
  const model = MODEL_MAP[options.tier || 'speed'];
  const start = performance.now();
  const response = await groq.chat.completions.create({
    model,
    messages: options.messages,
    stream: options.stream || false,
    max_tokens: options.maxTokens || 1024, // default output cap: 1024 tokens
    temperature: options.temperature ?? 0.7,
  });
  const latency = performance.now() - start;
  // usage is only present on non-streamed responses
  logMetrics({ model, latency, tokens: (response as any).usage });
  return response;
}
```
```typescript
async function* streamCompletion(messages: any[], tier: ModelTier = 'quality') {
  const model = MODEL_MAP[tier];
  const stream = await groq.chat.completions.create({
    model,
    messages,
    stream: true,
    max_tokens: 2048, // cap streamed output at 2048 tokens
  });
  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) yield content;
  }
}

// Usage with Express SSE
app.get('/api/chat', async (req, res) => {
  const messages = JSON.parse(String(req.query.messages ?? '[]'));
  res.setHeader('Content-Type', 'text/event-stream');
  for await (const token of streamCompletion(messages, 'quality')) {
    res.write(`data: ${JSON.stringify({ token })}\n\n`);
  }
  res.write('data: [DONE]\n\n');
  res.end();
});
```
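On the client, the SSE frames emitted above can be consumed from a fetch stream. A minimal frame parser, assuming the `data: {"token": ...}` / `data: [DONE]` framing used by the endpoint above:

```typescript
// Parse one SSE chunk into token strings. Frames look like `data: {"token":"hi"}`;
// the literal `data: [DONE]` frame terminates the stream.
function parseSSEChunk(chunk: string): string[] {
  const tokens: string[] = [];
  for (const line of chunk.split('\n')) {
    if (!line.startsWith('data: ')) continue;
    const payload = line.slice('data: '.length);
    if (payload === '[DONE]') break;
    tokens.push(JSON.parse(payload).token);
  }
  return tokens;
}
```

A real client would also buffer partial frames across network chunks, since SSE events can be split mid-line.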
```typescript
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

async function completionWithFallback(messages: any[]) {
  try {
    return await complete({ messages, tier: 'quality' });
  } catch (error: any) {
    if (error.status === 429 || error.status >= 500) { // rate-limited or server error
      console.warn('Groq unavailable, falling back to OpenAI');
      return openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages,
      });
    }
    throw error;
  }
}
```
| Issue | Cause | Solution |
|---|---|---|
| 429 rate limit | RPM/TPM exceeded | Implement queue with backoff |
| Model not available | Temporary outage | Use fallback chain to OpenAI |
| Context overflow | Input too long | Route to mixtral for 32k context |
| High latency | Wrong model tier | Use 8b-instant for latency-sensitive |
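The 429 row above prescribes backoff before giving up. A sketch of an exponential-backoff retry wrapper (a generic helper, not part of `groq-sdk`) that could sit in front of `complete` before the fallback chain engages:

```typescript
// Retry `fn` with exponential backoff plus jitter. Only retries errors that
// look transient (HTTP 429 or 5xx); everything else is rethrown immediately.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 250
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      lastError = error;
      const retryable = error?.status === 429 || error?.status >= 500;
      if (!retryable || attempt === maxRetries) throw error;
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100; // jitter
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```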
```typescript
async function analyzeDocument(doc: string) {
  // Fast extraction with the speed tier
  const summary = await complete({
    messages: [{ role: 'user', content: `Summarize: ${doc}` }],
    tier: 'speed',
  });
  const summaryText = (summary as any).choices[0].message.content;
  // Deep analysis with the quality tier, fed the summary text (not the raw
  // response object)
  const analysis = await complete({
    messages: [{ role: 'user', content: `Analyze in detail: ${summaryText}` }],
    tier: 'quality',
  });
  return {
    summary: summaryText,
    analysis: (analysis as any).choices[0].message.content,
  };
}
```
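The architecture diagram's middleware layer also lists a prompt cache. A minimal in-memory sketch keyed by a hash of model plus messages (illustrative; `cachedComplete` and its TTL-free `Map` store are assumptions here, and a production version would add expiry and an LRU bound):

```typescript
import { createHash } from 'node:crypto';

const promptCache = new Map<string, string>();

// Deterministic cache key: hash of model ID plus the serialized messages.
function cacheKey(model: string, messages: { role: string; content: string }[]): string {
  return createHash('sha256').update(model + JSON.stringify(messages)).digest('hex');
}

// Return a cached completion if the exact same prompt was seen before;
// otherwise invoke `run` (e.g. a wrapper around complete()) and cache it.
async function cachedComplete(
  model: string,
  messages: { role: string; content: string }[],
  run: () => Promise<string>
): Promise<string> {
  const key = cacheKey(model, messages);
  const hit = promptCache.get(key);
  if (hit !== undefined) return hit;
  const result = await run();
  promptCache.set(key, result);
  return result;
}
```

Exact-match caching like this only pays off for repeated prompts (FAQ bots, retried requests); it does nothing for near-duplicate inputs.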