perplexity-pack
Optimizes the Perplexity Sonar API with complexity-based model routing, query-hash caching (LRU/Redis), streaming, and batching to reduce latency and boost throughput.
Install:

```bash
npx claudepluginhub jeremylongshore/claude-code-plugins-plus-skills --plugin perplexity-pack
```
Optimize Perplexity Sonar API for latency, throughput, and cost. Key insight: every Perplexity call performs a live web search, so response times are inherently variable. Typical latencies: sonar 1-3s, sonar-pro 3-8s, sonar-deep-research 10-60s.
| Model | Typical Latency | Max Tokens | Best For |
|---|---|---|---|
| sonar | 1-3s | 4096 | Quick answers, simple facts |
| sonar-pro | 3-8s | 8192 | Deep research, many citations |
| sonar-reasoning-pro | 5-15s | 8192 | Multi-step analysis |
| sonar-deep-research | 10-60s | 8192 | Comprehensive reports |
```typescript
// Perplexity's API is OpenAI-compatible; point the OpenAI SDK at Perplexity's baseURL.
import OpenAI from "openai";

const perplexity = new OpenAI({
  apiKey: process.env.PERPLEXITY_API_KEY,
  baseURL: "https://api.perplexity.ai",
});

type QueryComplexity = "simple" | "standard" | "deep";

// Heuristic classification: question patterns and word count are cheap proxies for complexity.
function classifyQuery(query: string): QueryComplexity {
  const words = query.split(/\s+/).length;
  const simplePatterns = [/^what is/i, /^who is/i, /^when did/i, /^define/i, /^how many/i];
  const deepPatterns = [/compare.*vs/i, /analysis of/i, /comprehensive/i, /pros and cons/i, /in-depth/i];
  if (simplePatterns.some((p) => p.test(query)) && words < 15) return "simple";
  if (deepPatterns.some((p) => p.test(query)) || words > 30) return "deep";
  return "standard";
}

// Map complexity to the cheapest model and smallest token budget that fits.
function selectModel(complexity: QueryComplexity): { model: string; maxTokens: number } {
  switch (complexity) {
    case "simple": return { model: "sonar", maxTokens: 256 };
    case "standard": return { model: "sonar", maxTokens: 1024 };
    case "deep": return { model: "sonar-pro", maxTokens: 4096 };
  }
}

// Classify, pick a model, then call the API with the matching token cap.
async function smartSearch(query: string) {
  const complexity = classifyQuery(query);
  const { model, maxTokens } = selectModel(complexity);
  return perplexity.chat.completions.create({
    model,
    messages: [{ role: "user", content: query }],
    max_tokens: maxTokens,
  });
}
```
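For example (illustrative queries), the router sends a short lookup to sonar with a tight token cap and a long comparative prompt to sonar-pro:

```typescript
const quick = await smartSearch("what is a CDN?"); // "simple" → sonar, 256 tokens
const deep = await smartSearch(
  "comprehensive pros and cons of Postgres vs MySQL for analytics"
); // "deep" → sonar-pro, 4096 tokens
```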
```typescript
import { LRUCache } from "lru-cache";
import { createHash } from "crypto";

// TTLs by query type: the fresher the topic, the shorter the cache lifetime.
const CACHE_TTL = {
  news: 30 * 60 * 1000, // 30 min for current events
  research: 4 * 60 * 60 * 1000, // 4 hours for research
  factual: 24 * 60 * 60 * 1000, // 24 hours for stable facts
};

const searchCache = new LRUCache<string, any>({
  max: 1000,
  ttl: CACHE_TTL.research, // default TTL
});

// Normalize before hashing so near-duplicate queries share a cache entry.
function cacheKey(query: string, model: string): string {
  return createHash("sha256")
    .update(`${model}:${query.toLowerCase().trim()}`)
    .digest("hex");
}

// Infer a TTL from query intent: breaking news goes stale fast, definitions do not.
function detectTTL(query: string): number {
  if (/\b(latest|today|breaking|current price|this week)\b/i.test(query))
    return CACHE_TTL.news;
  if (/\b(what is|define|how does|who is)\b/i.test(query))
    return CACHE_TTL.factual;
  return CACHE_TTL.research;
}

// Cache-first lookup; misses hit the API and are stored with an intent-based TTL.
async function cachedSearch(query: string, model = "sonar") {
  const key = cacheKey(query, model);
  const cached = searchCache.get(key);
  if (cached) return { ...cached, cached: true };
  const result = await perplexity.chat.completions.create({
    model,
    messages: [{ role: "user", content: query }],
  });
  searchCache.set(key, result, { ttl: detectTTL(query) });
  return { ...result, cached: false };
}
```
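The LRU cache above is per-process; the summary also mentions Redis, which shares the cache across instances. A minimal sketch of the same pattern on Redis, assuming ioredis (the client setup and `pplx:` key prefix are illustrative; `cacheKey` and `detectTTL` are reused from above):

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

async function redisCachedSearch(query: string, model = "sonar") {
  const key = `pplx:${cacheKey(query, model)}`;
  const hit = await redis.get(key);
  if (hit) return { ...JSON.parse(hit), cached: true };
  const result = await perplexity.chat.completions.create({
    model,
    messages: [{ role: "user", content: query }],
  });
  // "PX" sets the expiry in milliseconds, matching detectTTL's units.
  await redis.set(key, JSON.stringify(result), "PX", detectTTL(query));
  return { ...result, cached: false };
}
```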
```typescript
// Stream tokens as they arrive so users see output while the web search completes.
async function streamSearch(
  query: string,
  onChunk: (text: string) => void,
  onCitations: (urls: string[]) => void
) {
  const stream = await perplexity.chat.completions.create({
    model: "sonar-pro",
    messages: [{ role: "user", content: query }],
    stream: true,
    max_tokens: 4096,
  });
  let fullText = "";
  for await (const chunk of stream) {
    const text = chunk.choices[0]?.delta?.content || "";
    fullText += text;
    onChunk(text);
    // Perplexity attaches a citations array the OpenAI types don't know about, hence the cast.
    if ((chunk as any).citations) {
      onCitations((chunk as any).citations);
    }
  }
  return fullText;
}
```
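Usage sketch (the query and handlers are illustrative):

```typescript
const answer = await streamSearch(
  "What are the latest developments in WebAssembly?",
  (text) => process.stdout.write(text), // render tokens as they arrive
  (urls) => console.log("\nSources:", urls) // surface citations when present
);
```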
```typescript
import PQueue from "p-queue";

// Start at most one request per 1.5s window, with up to 3 in flight at once.
const queue = new PQueue({ concurrency: 3, interval: 1500, intervalCap: 1 });

// Fan out queries through the rate-limited queue; results come back keyed by query.
async function parallelResearch(queries: string[]): Promise<Map<string, any>> {
  const results = new Map<string, any>();
  await Promise.all(
    queries.map((q) =>
      queue.add(async () => {
        const result = await cachedSearch(q, "sonar");
        results.set(q, result);
      })
    )
  );
  return results;
}
```
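For example (illustrative queries; each call goes through `cachedSearch`, so repeats are free):

```typescript
const results = await parallelResearch([
  "What is HTTP/3?",
  "What is QUIC?",
  "How does 0-RTT work?",
]);
console.log(results.get("What is QUIC?")?.choices?.[0]?.message?.content);
```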
```typescript
// Limit tokens to what you actually need: shorter answers return faster and cost less.
async function optimizedSearch(query: string, detail: "brief" | "full" = "brief") {
  return perplexity.chat.completions.create({
    model: "sonar",
    messages: [
      {
        role: "system",
        content: detail === "brief"
          ? "Answer in 2-3 sentences maximum."
          : "Provide a thorough answer with examples.",
      },
      { role: "user", content: query },
    ],
    max_tokens: detail === "brief" ? 150 : 2048,
  });
}
```
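Usage (illustrative):

```typescript
const brief = await optimizedSearch("How does HTTP caching work?"); // ≤150 tokens
const full = await optimizedSearch("How does HTTP caching work?", "full"); // up to 2048 tokens
```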
| Issue | Cause | Solution |
|---|---|---|
| Latency >10s on sonar | Complex query triggering deep search | Add `max_tokens: 512` to limit the response |
| Cache hit rate <20% | Queries too unique | Normalize queries (lowercase, trim) |
| Burst 429 errors | Parallel requests too aggressive | Use `PQueue` with `intervalCap`; see the backoff sketch below |
| Stale cached results | TTL too long for news | Use query-type-aware TTL |
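For 429s that still slip through the queue, wrapping calls in a retry with exponential backoff is a common complement (a sketch: the delays and retry count are illustrative, and the `status` check assumes the OpenAI SDK's error shape):

```typescript
// Retry on 429 with exponential backoff: 1s, 2s, 4s.
async function withBackoff<T>(fn: () => Promise<T>, retries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      if (err?.status !== 429 || attempt >= retries) throw err;
      await new Promise((r) => setTimeout(r, 1000 * 2 ** attempt));
    }
  }
}

// e.g. const result = await withBackoff(() => cachedSearch("what is QUIC?"));
```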
For cost optimization, see perplexity-cost-tuning.