Optimize Groq API performance with caching, batching, and connection pooling. Use when experiencing slow API responses, implementing caching strategies, or optimizing request throughput for Groq integrations. Trigger with phrases like "groq performance", "optimize groq", "groq latency", "groq caching", "groq slow", "groq batch".
From groq-pack. Install: `npx claudepluginhub nickloveinvesting/nick-love-plugins --plugin groq-pack`
Maximize Groq's ultra-low-latency LPU inference. Groq delivers sub-100ms token generation; tuning focuses on streaming efficiency, prompt caching, model selection for speed vs quality, and parallel request orchestration.
Prerequisite: the `groq-sdk` npm package must be installed.
import Groq from 'groq-sdk';
const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });
// Model speed tiers (approximate TTFT):
// llama-3.3-70b-versatile: ~200ms TTFT, best quality
// llama-3.1-8b-instant: ~80ms TTFT, fastest
// mixtral-8x7b-32768: ~150ms TTFT, 32K-token context window
async function fastCompletion(prompt: string) {
return groq.chat.completions.create({
model: 'llama-3.1-8b-instant', // Fastest model
messages: [{ role: 'user', content: prompt }],
temperature: 0, // Deterministic = cacheable
max_tokens: 256, // Cap output tokens for speed
});
}
async function streamCompletion(
messages: any[],
onToken: (token: string) => void
) {
const stream = await groq.chat.completions.create({
model: 'llama-3.3-70b-versatile',
messages,
stream: true,
max_tokens: 1024,
});
let fullResponse = '';
for await (const chunk of stream) {
const token = chunk.choices[0]?.delta?.content || '';
fullResponse += token;
onToken(token);
}
return fullResponse;
}
import { LRUCache } from 'lru-cache';
import { createHash } from 'crypto';
const promptCache = new LRUCache<string, string>({
max: 500, // Up to 500 cached entries
ttl: 1000 * 60 * 10, // 10-minute TTL for deterministic prompts
});
function hashPrompt(messages: any[], model: string): string {
return createHash('sha256')
.update(JSON.stringify({ messages, model }))
.digest('hex');
}
async function cachedCompletion(messages: any[], model: string) {
const key = hashPrompt(messages, model);
const cached = promptCache.get(key);
if (cached) return cached;
const response = await groq.chat.completions.create({
model,
messages,
temperature: 0,
});
const result = response.choices[0].message.content!;
promptCache.set(key, result);
return result;
}
async function parallelCompletions(
prompts: string[],
concurrency = 5
) {
const results: string[] = [];
for (let i = 0; i < prompts.length; i += concurrency) {
const batch = prompts.slice(i, i + concurrency);
const batchResults = await Promise.all(
batch.map(prompt =>
cachedCompletion(
[{ role: 'user', content: prompt }],
'llama-3.1-8b-instant'
)
)
);
results.push(...batchResults);
}
return results;
}
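The batched loop above waits for the slowest request in each batch before starting the next one. A sliding-window pool keeps all concurrency slots busy instead; here is a minimal sketch (`pooledMap` is an illustrative name, not part of groq-sdk):

```typescript
// Sliding-window concurrency pool: a new task starts as soon as any
// in-flight task finishes, instead of waiting for a whole batch.
async function pooledMap<T, R>(
  items: T[],
  worker: (item: T) => Promise<R>,
  concurrency = 5
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; next++ is synchronous, so no index is claimed twice
  async function run(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(concurrency, items.length) }, run)
  );
  return results;
}
```

Results come back in input order, and throughput is governed by the slowest individual request rather than the slowest request per batch.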
| Issue | Cause | Solution |
|---|---|---|
| Rate limit 429 | Over RPM/TPM quota | Use exponential backoff, batch requests |
| High TTFT | Using 70b model | Switch to 8b-instant for latency-sensitive tasks |
| Stream disconnect | Network timeout | Implement reconnection with partial response recovery |
| Token overflow | max_tokens too high | Set conservative limits, truncate prompts |
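For the 429 case, a retry wrapper with exponential backoff and jitter might look like the sketch below. The delay cap, base delay, and retry count are illustrative assumptions, not Groq-documented values:

```typescript
// 500ms, 1s, 2s, 4s, ... capped at 8s (base and cap are assumptions)
function backoffDelay(attempt: number, baseMs = 500): number {
  return Math.min(baseMs * 2 ** attempt, 8000);
}

async function withRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Only retry rate-limit errors; rethrow everything else immediately
      if (attempt >= maxRetries || err?.status !== 429) throw err;
      const jitter = Math.random() * 100; // spread out synchronized retries
      await new Promise((r) => setTimeout(r, backoffDelay(attempt) + jitter));
    }
  }
}
```

Usage: `withRetry(() => fastCompletion(prompt))`.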
async function benchmarkModels(prompt: string) {
const models = ['llama-3.1-8b-instant', 'llama-3.3-70b-versatile'];
for (const model of models) {
const start = performance.now();
await groq.chat.completions.create({
model,
messages: [{ role: 'user', content: prompt }],
max_tokens: 100,
});
console.log(`${model}: ${(performance.now() - start).toFixed(0)}ms`);
}
}
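The benchmark above measures total completion time; to compare models on TTFT specifically, time the first streamed token instead. A sketch, assuming a groq-sdk-style client is in scope; since single measurements are noisy, run several samples and report the median:

```typescript
// Time-to-first-token for one streamed request.
// `client` is assumed to be the groq-sdk client created earlier in this guide.
async function measureTTFT(
  client: any,
  model: string,
  prompt: string
): Promise<number> {
  const start = performance.now();
  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: 'user', content: prompt }],
    stream: true,
    max_tokens: 16, // output length doesn't matter for TTFT
  });
  for await (const chunk of stream as any) {
    // Stop the clock on the first chunk that carries content
    if (chunk.choices[0]?.delta?.content) return performance.now() - start;
  }
  return performance.now() - start;
}

// Median of several samples is more stable than any single run
function median(samples: number[]): number {
  const s = [...samples].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}
```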